📊 Layer Normalization

Stabilize training and improve transformer performance


Why Layer Normalization?

🎯 The Problem

Deep neural networks suffer from internal covariate shift: the distribution of each layer's inputs changes as earlier layers update during training, which slows convergence. Normalization techniques stabilize activations, enabling faster training and deeper architectures.

💡
Key Insight

Layer normalization normalizes across features (per sample), while batch normalization normalizes across the batch dimension. This makes LayerNorm ideal for transformers and RNNs.
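The axis difference is easiest to see numerically. Below is a minimal sketch (using plain NumPy, no framework) that normalizes the same toy batch both ways; the array values are made up for illustration:

```python
import numpy as np

# Toy batch: 4 samples, 3 features (values are arbitrary).
x = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 2.0],
              [3.0, 3.0, 3.0]])

eps = 1e-5  # numerical-stability constant

# Layer normalization: statistics per sample, across features (axis=1).
ln_mean = x.mean(axis=1, keepdims=True)
ln_var = x.var(axis=1, keepdims=True)
x_ln = (x - ln_mean) / np.sqrt(ln_var + eps)

# Batch normalization: statistics per feature, across the batch (axis=0).
bn_mean = x.mean(axis=0, keepdims=True)
bn_var = x.var(axis=0, keepdims=True)
x_bn = (x - bn_mean) / np.sqrt(bn_var + eps)

# After LayerNorm every row has ~zero mean; after BatchNorm every column does.
print(np.allclose(x_ln.mean(axis=1), 0.0, atol=1e-6))  # True
print(np.allclose(x_bn.mean(axis=0), 0.0, atol=1e-6))  # True
```

Because LayerNorm's statistics come from a single sample, they do not depend on batch size or on other sequence positions, which is why it suits transformers and RNNs where batch statistics are awkward to compute.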

⚠️ Without Normalization

  • Vanishing/exploding gradients – activations grow unbounded or shrink to zero
  • Slow convergence – requires careful learning rate tuning and initialization
  • Training instability – loss spikes and divergence common in deep networks
  • Limited depth – difficult to train networks beyond 10-20 layers effectively

✅ With Layer Normalization

  • Stable gradients – normalized activations keep gradients in healthy range
  • Faster training – 2-3x speedup common, higher learning rates possible
  • Reduced sensitivity – less dependent on initialization and hyperparameters
  • Enables depth – transformers with 100+ layers train successfully
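Concretely, LayerNorm computes a per-sample mean and variance, standardizes the features, then applies a learnable scale (gamma) and shift (beta) so the network can still represent un-normalized activations if useful. A minimal sketch (the function name and shapes are illustrative, not a library API):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample across its last (feature) dimension,
    then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d_model))  # (batch, features)
gamma = np.ones(d_model)               # initialized to the identity transform
beta = np.zeros(d_model)

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # ~0 for each sample
print(y.std(axis=-1))   # ~1 for each sample
```

With gamma initialized to ones and beta to zeros, the layer starts as a pure standardization and learns any rescaling during training.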

🏗️ Where LayerNorm Appears

🔄
Transformers

After multi-head attention and feedforward layers in every block

📝
RNNs/LSTMs

Within recurrent cells to stabilize hidden state evolution

🎨
GANs

Generator and discriminator networks for training stability
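For the transformer case above, the placement can be sketched as a post-LN block: LayerNorm applied after each residual connection. The sublayers below are stand-ins (a single random matrix, not real attention or MLP weights), just to show where normalization sits:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization across the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, feed_forward):
    """Post-LN transformer block: LayerNorm follows each
    residual connection, as described in the card above."""
    x = layer_norm(x + attention(x))      # after multi-head attention
    x = layer_norm(x + feed_forward(x))   # after the feedforward layer
    return x

# Hypothetical stand-in sublayers; real blocks use attention/MLP weights.
rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d)) / np.sqrt(d)
attn = lambda x: x @ W           # placeholder for multi-head attention
ffn = lambda x: np.tanh(x @ W)   # placeholder for the feedforward network

tokens = rng.standard_normal((4, d))  # (seq_len, d_model)
out = transformer_block(tokens, attn, ffn)
print(out.shape)  # (4, 16)
```

Note that many recent transformers instead place LayerNorm before each sublayer (pre-LN), which tends to stabilize very deep stacks; the post-LN arrangement shown here matches the original transformer design described in this module.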