Machine Learning Optimization
Explore gradient descent variants and hyperparameter tuning
What is ML Optimization?
Machine Learning Optimization is the art and science of training models efficiently. It's about finding the best parameters (weights) that minimize your loss function, using clever algorithms and hyperparameter tuning.
💡 The Core Goal
Gradient Descent: The Optimization Workhorse
📐 The Mathematics of Learning
Why Gradients Point Uphill
🔄 Three Flavors of Gradient Descent
🎯 Convergence & Stopping Criteria
⚡ Gradient Computation in Practice
optimizer.step() # Updates all parameters
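A minimal sketch of what that step performs per parameter, using a toy one-weight loss with a hand-derived gradient (the function and numbers here are illustrative). In PyTorch, `loss.backward()` computes the gradients and `optimizer.step()` applies this same update rule to every parameter at once:

```python
# Minimal sketch of an optimization loop, assuming a single weight w and
# the toy loss J(w) = (w - 3)^2 with analytic gradient dJ/dw = 2*(w - 3).

def grad(w):
    """Analytic gradient of J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0    # initial weight
lr = 0.1   # learning rate (eta)

for _ in range(100):
    w = w - lr * grad(w)   # the per-parameter update inside step()

print(round(w, 4))   # → 3.0, the minimizer of J
```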
🎯 Learning Rate Impact
📊 Batch Size Impact
💡 Key Insight
1. Gradient Descent: The Foundation
⛰️ Interactive: Descend the Loss Surface
📐 Formula: New Weight = Old Weight - Learning Rate × Gradient. The gradient points uphill, so we subtract to go downhill!
2. Learning Rate: The Most Important Hyperparameter
🎚️ Interactive: Compare Learning Rates
Convergence Speed
Perfect balance - fast and stable convergence
3. Batch Size: Speed vs Accuracy Trade-off
📦 Interactive: Adjust Batch Size
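The trade-off can be felt in a toy sketch, assuming a 1-D linear model and a synthetic dataset (all names illustrative). With `k = len(data)` the loop below is batch gradient descent, with `k = 1` it is pure SGD, and anything in between is mini-batch:

```python
import random

# Sketch of mini-batch gradient descent for a toy linear model y = w*x
# under mean squared error. The dataset is synthetic; true weight is 2.0.
xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]

def grad(w, batch):
    """Gradient of MSE over the given (x, y) batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

random.seed(0)
w, lr = 0.0, 0.01
data = list(zip(xs, ys))
for _ in range(200):
    batch = random.sample(data, k=4)   # mini-batch of size 4
    w -= lr * grad(w, batch)

print(round(w, 3))   # → 2.0, the true weight
```

Smaller `k` gives noisier gradient estimates but cheaper steps; larger `k` gives smoother but costlier steps.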
Advanced Optimizers: Beyond Vanilla SGD
🚀 Why We Need Better Optimizers
SGD's Limitations
🔄 Momentum: Adding Velocity
Classical momentum:
vt = β vt-1 + ∇J(θt) // accumulate velocity
θt+1 = θt - η vt // update with velocity
Nesterov momentum:
θlookahead = θt - η β vt-1 // peek ahead along the velocity
vt = β vt-1 + ∇J(θlookahead) // compute gradient there
θt+1 = θt - η vt
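Classical momentum can be sketched in a few lines on a toy quadratic loss (values illustrative):

```python
# Sketch of classical momentum on J(theta) = theta^2, following
# vt = beta*vt-1 + grad(theta), then theta -= eta*vt.

def grad(theta):
    return 2.0 * theta           # dJ/dtheta for J = theta^2

theta, v = 5.0, 0.0
eta, beta = 0.05, 0.9

for _ in range(300):
    v = beta * v + grad(theta)   # accumulate velocity
    theta -= eta * v             # move with the velocity

print(abs(theta) < 1e-3)   # → True: converges to the minimum at 0
```

The velocity term lets consistent gradient directions accelerate while oscillating directions partially cancel.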
📊 Adaptive Learning Rates
AdaGrad: θt+1 = θt - (η / √(Gt + ε)) ⊙ gt // Gt accumulates all squared gradients
RMSprop: E[g²]t = β E[g²]t-1 + (1-β) gt² ; θt+1 = θt - (η / √(E[g²]t + ε)) gt
Adam:
mt = β₁ mt-1 + (1-β₁) gt // 1st moment (mean)
vt = β₂ vt-1 + (1-β₂) gt² // 2nd moment (variance)
m̂t = mt / (1-β₁ᵗ) // bias correction
v̂t = vt / (1-β₂ᵗ) // bias correction
θt+1 = θt - η (m̂t / (√v̂t + ε))
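The Adam update can be sketched in plain Python on a toy quadratic, with the common default hyperparameters (the loss and starting point are illustrative):

```python
import math

# Sketch of the Adam update for a single parameter, minimizing the
# toy loss J(theta) = (theta - 1)^2.

def grad(theta):
    return 2.0 * (theta - 1.0)

theta = 5.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = v = 0.0

for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g        # 1st moment (mean)
    v = beta2 * v + (1 - beta2) * g * g    # 2nd moment (variance)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)

print(round(theta, 3))   # approaches the minimizer theta = 1
```

Note how the step size is roughly `eta` regardless of the raw gradient magnitude, because the gradient is normalized by its own running scale.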
⚖️ Optimizer Comparison
| Optimizer | Speed | Final Acc | Tuning | Best For |
|---|---|---|---|---|
| SGD | Slow | Best* | Hard | ConvNets with time |
| SGD+Mom | Medium | Excellent | Medium | Computer Vision |
| RMSprop | Fast | Good | Easy | RNNs, non-stationary |
| Adam | Fast | Very Good | Easiest | Default choice, NLP |
| AdamW | Fast | Best | Easy | Transformers, SOTA |

*SGD's final accuracy is often best only with careful tuning and a learning-rate schedule; with defaults, adaptive methods usually win.
🎯 When to Use What?
⚡ Modern Variants
💡 Key Insight
4. Optimizer Algorithms
🚀 Interactive: Compare Optimizers
Adam
Adaptive Moment Estimation - combines momentum with RMSprop-style adaptive learning rates
5. Momentum: Accelerating Convergence
💨 Interactive: Feel the Momentum
Effect Visualization
6. Loss Landscape Navigation
🗺️ Interactive: Explore 2D Loss Surface
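A 2-D descent can be sketched on an elongated bowl, which illustrates why ill-conditioned surfaces force a small learning rate (the surface and values are illustrative):

```python
# Sketch of gradient descent on the 2-D loss J(x, y) = x^2 + 10*y^2.
# The y-direction is 10x steeper, so the stable learning rate is
# limited by the steepest direction while the flat one crawls.

def grad(x, y):
    return 2.0 * x, 20.0 * y

x, y, lr = 4.0, 2.0, 0.05
for _ in range(200):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy

print(round(x, 4), round(y, 4))   # both coordinates reach the minimum at the origin
```

This curvature mismatch is exactly what momentum and adaptive methods are designed to handle.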
7. Early Stopping: Preventing Overfitting
⏱️ Interactive: Training with Patience
💡 Early Stopping: Stop training when validation loss doesn't improve for 5 consecutive epochs. Prevents overfitting and saves compute!
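The patience rule can be sketched as a function over the validation-loss history (the numbers below are synthetic):

```python
# Sketch of early stopping with patience, assuming val_losses is the
# per-epoch validation loss history.

def early_stop_epoch(val_losses, patience=5):
    """Return the epoch at which training would stop, or None."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # checkpoint the best model here
        elif epoch - best_epoch >= patience:
            return epoch                     # no improvement for `patience` epochs
    return None

# Validation loss improves, then rises as overfitting sets in.
history = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.62, 0.65, 0.7, 0.75]
print(early_stop_epoch(history))   # → 8 (best was epoch 3, patience 5 exhausted)
```

In practice you restore the checkpoint from the best epoch, not the stopping epoch.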
Learning Rate Schedules: Dynamic Adaptation
📉 Why Decay the Learning Rate?
The Training Journey
📊 Common Decay Schedules
🔥 Warmup: Starting Carefully
🔄 Cyclical Learning Rates
🎯 Practical Recommendations
📉 When to Decay?
🎛️ Tuning Priority
💡 Key Insight
8. Learning Rate Schedules
📉 Interactive: Decay Strategies
Constant Schedule
Same LR throughout training
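Beyond a constant schedule, the common decay strategies can be sketched as functions of the epoch (base rate, horizon, and constants below are illustrative):

```python
import math

# Sketch of common learning-rate schedules, assuming base_lr = 0.1
# and a 100-epoch training horizon.
base_lr, total = 0.1, 100

def step_decay(epoch, drop=0.5, every=30):
    """Halve the LR every 30 epochs."""
    return base_lr * (drop ** (epoch // every))

def exponential_decay(epoch, k=0.05):
    """Smooth exponential decay."""
    return base_lr * math.exp(-k * epoch)

def cosine_decay(epoch):
    """Cosine annealing from base_lr down to 0."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total))

def warmup_then_cosine(epoch, warmup=5):
    """Linear warmup for the first few epochs, then cosine decay."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return cosine_decay(epoch)

print(step_decay(60), round(cosine_decay(100), 6))   # → 0.025 0.0
```

Warmup avoids large, unstable updates while the adaptive-optimizer statistics (and batch-norm estimates) are still noisy.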
9. L2 Regularization (Weight Decay)
⚖️ Interactive: Balance Fitting vs Simplicity
🎯 Effect: Balanced - prevents overfitting while allowing flexibility
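The effect of the penalty can be sketched in a plain SGD update: adding (λ/2)·w² to the loss contributes an extra -lr·λ·w term that shrinks weights toward zero (the toy data-fit gradient below is illustrative):

```python
# Sketch of L2 regularization (weight decay) in an SGD update.

def grad_loss(w):
    return 2.0 * (w - 4.0)      # toy data-fit gradient, unregularized minimum at w = 4

def train(lam, lr=0.05, steps=500):
    w = 0.0
    for _ in range(steps):
        w -= lr * (grad_loss(w) + lam * w)   # + lam*w is the weight-decay term
    return w

w_plain = train(lam=0.0)
w_decayed = train(lam=1.0)
print(round(w_plain, 3), round(w_decayed, 3))   # → 4.0 2.667
```

The penalty moves the solution from 4 to 8/(2+λ), trading a little fit for smaller weights.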
10. Hyperparameter Overview
🎛️ Interactive: Essential Hyperparameters
🎯 Key Takeaways
Learning Rate is Critical
The most important hyperparameter. Too high = divergence, too low = slow training. Start with 0.001 for Adam, 0.01 for SGD. Use learning rate schedules for long training.
Adam is Usually Best
Adam combines momentum and adaptive learning rates. It's the default choice for most tasks. Use SGD with momentum for better generalization if you have time to tune.
Batch Size Trade-offs
32-128 is typical. Larger batches = faster training, more memory, less noise. Smaller batches = better generalization, slower, less memory. Balance based on your GPU.
Early Stopping Prevents Overfitting
Monitor validation loss. Stop when it stops improving for N epochs (patience). Saves compute and prevents overfitting. Keep checkpoints of best model.
Regularization for Generalization
L2 (weight decay), dropout, and data augmentation prevent overfitting. L2 λ = 0.0001-0.01 typical. Don't over-regularize or you'll underfit.
Tune Systematically
Start with learning rate, then batch size, then others. Use grid search or random search. Modern: Bayesian optimization, hyperband. Log everything!
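Random search can be sketched in a few lines, assuming a synthetic score function stands in for a real validation run (everything here is illustrative; sampling the learning rate log-uniformly is the standard practice):

```python
import math
import random

# Sketch of random search over learning rate and batch size.
random.seed(42)

def validation_score(lr, batch_size):
    """Stand-in for 'train a model, return validation accuracy'."""
    return -abs(math.log10(lr) + 3) - 0.01 * abs(batch_size - 64)

best, best_cfg = -float("inf"), None
for _ in range(20):
    lr = 10 ** random.uniform(-5, -1)             # log-uniform in [1e-5, 1e-1]
    batch_size = random.choice([16, 32, 64, 128, 256])
    score = validation_score(lr, batch_size)
    if score > best:
        best, best_cfg = score, (lr, batch_size)  # log everything!

print(best_cfg)
```

Random search beats grid search when only a few hyperparameters matter, because it never wastes trials repeating the same value of an important axis.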