📊 Language Model Scaling Laws

Predict performance from model size and training data


Introduction to Scaling Laws

🎯 What are Scaling Laws?

Scaling laws describe how language model performance improves predictably with increased model size, training data, and compute. These power-law relationships make it possible to forecast a model's performance before training it.

💡
Key Insight

Loss follows power laws: L(N) ∝ N^(-α). Doubling model size yields predictable performance gains.
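This relationship can be sketched in a few lines. The constants below (N_c and α) are illustrative values in the ballpark of published fits, not authoritative numbers:

```python
import math

# Hypothetical power-law fit: L(N) = (N_c / N) ** alpha
# N_c and ALPHA are illustrative constants, close to commonly reported
# values but not taken from any specific fit.
N_C = 8.8e13   # reference parameter count (illustrative)
ALPHA = 0.076  # scaling exponent (illustrative)

def loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

# Doubling model size multiplies the loss by the constant factor
# 2**(-alpha), independent of the starting size -- that constancy is
# what makes the gain "predictable".
ratio = loss(2e9) / loss(1e9)
print(ratio)  # equals 2 ** -ALPHA for any starting size
```

Because the improvement factor per doubling is constant, a fit made on small models extrapolates along a straight line in log-log space.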

🔢
Model Parameters

From 117M (GPT-1) to 175B (GPT-3) to 1.76T (GPT-4 estimated)

📚
Training Tokens

Dataset size matters: optimal ratio is ~20 tokens per parameter

⚡
Compute Budget

FLOPs determine what's feasible: balance size and data optimally
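The two cards above combine into a simple allocation rule. Using the common approximation that training cost is C ≈ 6·N·D FLOPs, the ~20-tokens-per-parameter ratio pins down both N and D for a given budget. This is a sketch of that back-of-the-envelope calculation, not a reproduction of any paper's exact method:

```python
# Given a FLOP budget C and the approximation C ~= 6 * N * D,
# the rule of thumb D ~= 20 * N fixes both N (params) and D (tokens).

def optimal_allocation(flops: float, tokens_per_param: float = 20.0):
    """Return (params N, tokens D) that spend `flops` at the given ratio."""
    # Substitute D = ratio * N into C = 6 * N * D and solve for N:
    #   C = 6 * ratio * N**2  =>  N = sqrt(C / (6 * ratio))
    n = (flops / (6.0 * tokens_per_param)) ** 0.5
    d = tokens_per_param * n
    return n, d

# Sanity check against Chinchilla's reported ~70B params / ~1.4T tokens:
budget = 6 * 70e9 * 1.4e12
n, d = optimal_allocation(budget)
print(f"{n:.3g} params, {d:.3g} tokens")
```

Plugging Chinchilla's own budget back in recovers its 70B/1.4T split, since 1.4T / 70B is exactly 20 tokens per parameter.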

📈 Historical Milestones

GPT-2 (2019): 1.5B params

First large-scale demonstration of scaling benefits

OpenAI Scaling Laws (2020): Research paper

Formalized power-law relationships for loss prediction

Chinchilla (2022): 70B params

Showed most models were undertrained: training data matters more than previously thought

📊 Predictable Scaling

  • Loss decreases as smooth power law
  • Forecast performance before training
  • Optimize resource allocation
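The three bullets above boil down to fitting a line in log-log space: under L = c·N^(−α), log L is linear in log N, so two measurements on small models determine the exponent and let you extrapolate. The sample numbers here are made up for illustration:

```python
import math

def fit_exponent(n1: float, l1: float, n2: float, l2: float) -> float:
    """Recover alpha from two (model size, loss) measurements:
    the slope of log-loss vs log-size, negated."""
    return -(math.log(l2) - math.log(l1)) / (math.log(n2) - math.log(n1))

# Hypothetical measurements from two small training runs:
alpha = fit_exponent(1e8, 3.40, 1e9, 2.85)

# Extrapolate along the fitted line to a 10B-param model we never trained:
c = 3.40 * (1e8 ** alpha)          # solve L = c * N**(-alpha) for c
predicted = c * (1e10 ** -alpha)
print(f"alpha = {alpha:.4f}, predicted loss at 10B = {predicted:.3f}")
```

Each 10x in size multiplies the loss by the same factor (here 2.85/3.40), so the 10B prediction is just two of those steps from the 100M run. This is the "forecast performance before training" workflow in miniature.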

🎯 Emergent Abilities

  • New capabilities appear at scale
  • Few-shot learning emerges around ~13B params
  • Chain-of-thought reasoning around ~100B params