📊 Language Model Scaling Laws

Predict performance from model size and training data


Introduction to Scaling Laws

🎯 What are Scaling Laws?

Scaling laws describe how language model performance improves predictably with increased model size, training data, and compute. These power-law relationships make it possible to forecast a model's performance before training it.

💡
Key Insight

Loss follows power laws: L(N) ∝ N^(-α). Doubling model size yields predictable performance gains.
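This relationship can be sketched in a few lines. The constants below (N_c and α) are illustrative values in the ballpark of published fits, not authoritative numbers:

```python
import math

# Hypothetical power-law fit: L(N) = (N_c / N) ** alpha
# N_c and ALPHA are illustrative constants, close to commonly reported
# values but not taken from any specific fit.
N_C = 8.8e13   # reference parameter count (illustrative)
ALPHA = 0.076  # scaling exponent (illustrative)

def loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

# Doubling model size multiplies the loss by the constant factor
# 2**(-alpha), independent of the starting size -- that constancy is
# what makes the gain "predictable".
ratio = loss(2e9) / loss(1e9)
print(ratio)  # equals 2 ** -ALPHA for any starting size
```

Because the improvement factor per doubling is constant, a fit made on small models extrapolates along a straight line in log-log space.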

🔢
Model Parameters

From 117M (GPT-1) to 175B (GPT-3) to 1.76T (GPT-4 estimated)

📚
Training Tokens

Dataset size matters: optimal ratio is ~20 tokens per parameter

⚡
Compute Budget

FLOPs determine what's feasible: balance size and data optimally
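The two cards above combine into a simple allocation rule. Using the common approximation that training cost is C ≈ 6·N·D FLOPs, the ~20-tokens-per-parameter ratio pins down both N and D for a given budget. This is a sketch of that back-of-the-envelope calculation, not a reproduction of any paper's exact method:

```python
# Given a FLOP budget C and the approximation C ~= 6 * N * D,
# the rule of thumb D ~= 20 * N fixes both N (params) and D (tokens).

def optimal_allocation(flops: float, tokens_per_param: float = 20.0):
    """Return (params N, tokens D) that spend `flops` at the given ratio."""
    # Substitute D = ratio * N into C = 6 * N * D and solve for N:
    #   C = 6 * ratio * N**2  =>  N = sqrt(C / (6 * ratio))
    n = (flops / (6.0 * tokens_per_param)) ** 0.5
    d = tokens_per_param * n
    return n, d

# Sanity check against Chinchilla's reported ~70B params / ~1.4T tokens:
budget = 6 * 70e9 * 1.4e12
n, d = optimal_allocation(budget)
print(f"{n:.3g} params, {d:.3g} tokens")
```

Plugging Chinchilla's own budget back in recovers its 70B/1.4T split, since 1.4T / 70B is exactly 20 tokens per parameter.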

📈 Historical Milestones

GPT-2 (2019): 1.5B params

First large-scale demonstration of scaling benefits

OpenAI Scaling Laws (2020): Research paper

Formalized power-law relationships for loss prediction

Chinchilla (2022): 70B params

Showed most models were undertrained: training data matters more than previously thought

📊 Predictable Scaling

  • Loss decreases as smooth power law
  • Forecast performance before training
  • Optimize resource allocation
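The three bullets above boil down to fitting a line in log-log space: under L = c·N^(−α), log L is linear in log N, so two measurements on small models determine the exponent and let you extrapolate. The sample numbers here are made up for illustration:

```python
import math

def fit_exponent(n1: float, l1: float, n2: float, l2: float) -> float:
    """Recover alpha from two (model size, loss) measurements:
    the slope of log-loss vs log-size, negated."""
    return -(math.log(l2) - math.log(l1)) / (math.log(n2) - math.log(n1))

# Hypothetical measurements from two small training runs:
alpha = fit_exponent(1e8, 3.40, 1e9, 2.85)

# Extrapolate along the fitted line to a 10B-param model we never trained:
c = 3.40 * (1e8 ** alpha)          # solve L = c * N**(-alpha) for c
predicted = c * (1e10 ** -alpha)
print(f"alpha = {alpha:.4f}, predicted loss at 10B = {predicted:.3f}")
```

Each 10x in size multiplies the loss by the same factor (here 2.85/3.40), so the 10B prediction is just two of those steps from the 100M run. This is the "forecast performance before training" workflow in miniature.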

🎯 Emergent Abilities

  • New capabilities appear at scale
  • Few-shot learning emerges around ~13B params
  • Chain-of-thought reasoning around ~100B params