📊 Language Model Scaling Laws
Predict performance from model size and training data
Introduction to Scaling Laws
🎯 What are Scaling Laws?
Scaling laws describe how language model performance improves predictably as model size, training data, and compute increase. Because these relationships follow power laws, a model's loss can be forecast before it is trained.
💡
Key Insight
Loss follows a power law in parameter count: L(N) ∝ N^(−α). Each doubling of model size therefore yields a predictable (if diminishing) reduction in loss.
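The power law above can be sketched numerically. This is a minimal illustration, assuming a loss of the form L(N) = (N_c / N)^α; the constants `n_c` and `alpha` below are illustrative placeholders, not fitted values from any particular paper:

```python
def loss(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted loss for a model with n_params parameters (illustrative constants)."""
    return (n_c / n_params) ** alpha

# Doubling model size shrinks loss by a constant factor of 2**(-alpha),
# regardless of the starting size:
ratio = loss(2e9) / loss(1e9)  # equals 2**(-0.076), roughly 0.949
```

The key property is scale invariance of the improvement: going from 1B to 2B parameters buys the same fractional loss reduction as going from 100B to 200B.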
🔢
Model Parameters
From 117M (GPT-1) to 175B (GPT-3) to 1.76T (GPT-4 estimated)
📚
Training Tokens
Dataset size matters: optimal ratio is ~20 tokens per parameter
⚡
Compute Budget
FLOPs determine what's feasible: balance size and data optimally
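The two budgeting rules above can be combined in a short sketch. It uses the common approximation of ~6 FLOPs per parameter per token for training compute, and the ~20 tokens-per-parameter heuristic from the cards; both are rules of thumb, not exact figures:

```python
def training_flops(n_params, n_tokens):
    # Rule of thumb: training costs ~6 FLOPs per parameter per token.
    return 6 * n_params * n_tokens

def compute_optimal_tokens(n_params, tokens_per_param=20):
    # Heuristic from the card above: ~20 training tokens per parameter.
    return tokens_per_param * n_params

n = 70e9                          # a 70B-parameter model, as in Chinchilla
d = compute_optimal_tokens(n)     # 1.4e12 tokens (1.4T)
c = training_flops(n, d)          # about 5.9e23 FLOPs
```

Given a fixed FLOP budget, this is the trade-off to balance: a smaller model trained on more tokens can match a larger, undertrained one at lower cost.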
📈 Historical Milestones
GPT-2 (2019): 1.5B params
First large-scale demonstration of scaling benefits
OpenAI Scaling Laws (2020): Research paper
Formalized power-law relationships for loss prediction
Chinchilla (2022): 70B params
Showed most models are undertrained: data matters more than previously thought
📊 Predictable Scaling
- Loss decreases as a smooth power law
- Forecast performance before training
- Optimize resource allocation
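The "forecast before training" point can be sketched as a log-log fit: a power law L = c · N^(−α) is linear in log space, so small-scale runs can be extrapolated to larger models. The run data below is hypothetical, for illustration only:

```python
import math

# Hypothetical small-scale runs: (parameter count, final loss). Illustrative numbers.
runs = [(1e7, 5.2), (1e8, 4.4), (1e9, 3.7)]

# Power law L = c * N**(-alpha) becomes linear in log-log space:
#   log L = log c - alpha * log N.  Fit by ordinary least squares.
xs = [math.log(n) for n, _ in runs]
ys = [math.log(l) for _, l in runs]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
alpha, c = -slope, math.exp(my - slope * mx)

# Extrapolate to a model 10x larger than the biggest run.
forecast = c * (1e10) ** (-alpha)
```

This is exactly the kind of extrapolation scaling-law papers rely on: fit on cheap runs, predict the loss of an expensive one before spending the compute.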
🎯 Emergent Abilities
- New capabilities appear at scale
- Few-shot learning emerges around ~13B parameters
- Chain-of-thought reasoning emerges around ~100B parameters