📊 Layer Normalization

Stabilize training and improve transformer performance


Why Layer Normalization?

🎯 The Problem

Deep neural networks suffer from internal covariate shift: the distribution of each layer's inputs changes as earlier layers update during training, which slows convergence. Normalization techniques stabilize activations, enabling faster training and deeper architectures.

💡
Key Insight

Layer normalization normalizes across features (per sample), while batch normalization normalizes across the batch dimension. This makes LayerNorm ideal for transformers and RNNs.
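The axis difference is easiest to see numerically. Below is a minimal sketch (using plain NumPy, no framework) that normalizes the same toy batch both ways; the array values are made up for illustration:

```python
import numpy as np

# Toy batch: 4 samples, 3 features (values are arbitrary).
x = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 2.0],
              [3.0, 3.0, 3.0]])

eps = 1e-5  # numerical-stability constant

# Layer normalization: statistics per sample, across features (axis=1).
ln_mean = x.mean(axis=1, keepdims=True)
ln_var = x.var(axis=1, keepdims=True)
x_ln = (x - ln_mean) / np.sqrt(ln_var + eps)

# Batch normalization: statistics per feature, across the batch (axis=0).
bn_mean = x.mean(axis=0, keepdims=True)
bn_var = x.var(axis=0, keepdims=True)
x_bn = (x - bn_mean) / np.sqrt(bn_var + eps)

# After LayerNorm every row has ~zero mean; after BatchNorm every column does.
print(np.allclose(x_ln.mean(axis=1), 0.0, atol=1e-6))  # True
print(np.allclose(x_bn.mean(axis=0), 0.0, atol=1e-6))  # True
```

Because LayerNorm's statistics come from a single sample, they do not depend on batch size or on other sequence positions, which is why it suits transformers and RNNs where batch statistics are awkward to compute.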

⚠️ Without Normalization

  • Vanishing/exploding gradients – activations grow unbounded or shrink to zero
  • Slow convergence – requires careful learning rate tuning and initialization
  • Training instability – loss spikes and divergence common in deep networks
  • Limited depth – difficult to train networks beyond 10-20 layers effectively

✅ With Layer Normalization

  • Stable gradients – normalized activations keep gradients in healthy range
  • Faster training – 2-3x speedup common, higher learning rates possible
  • Reduced sensitivity – less dependent on initialization and hyperparameters
  • Enables depth – transformers with 100+ layers train successfully
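Concretely, LayerNorm computes a per-sample mean and variance, standardizes the features, then applies a learnable scale (gamma) and shift (beta) so the network can still represent un-normalized activations if useful. A minimal sketch (the function name and shapes are illustrative, not a library API):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample across its last (feature) dimension,
    then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d_model))  # (batch, features)
gamma = np.ones(d_model)               # initialized to the identity transform
beta = np.zeros(d_model)

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # ~0 for each sample
print(y.std(axis=-1))   # ~1 for each sample
```

With gamma initialized to ones and beta to zeros, the layer starts as a pure standardization and learns any rescaling during training.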

🏗️ Where LayerNorm Appears

🔄
Transformers

After multi-head attention and feedforward layers in every block

📝
RNNs/LSTMs

Within recurrent cells to stabilize hidden state evolution

🎨
GANs

Generator and discriminator networks for training stability
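For the transformer case above, the placement can be sketched as a post-LN block: LayerNorm applied after each residual connection. The sublayers below are stand-ins (a single random matrix, not real attention or MLP weights), just to show where normalization sits:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization across the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, feed_forward):
    """Post-LN transformer block: LayerNorm follows each
    residual connection, as described in the card above."""
    x = layer_norm(x + attention(x))      # after multi-head attention
    x = layer_norm(x + feed_forward(x))   # after the feedforward layer
    return x

# Hypothetical stand-in sublayers; real blocks use attention/MLP weights.
rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d)) / np.sqrt(d)
attn = lambda x: x @ W           # placeholder for multi-head attention
ffn = lambda x: np.tanh(x @ W)   # placeholder for the feedforward network

tokens = rng.standard_normal((4, d))  # (seq_len, d_model)
out = transformer_block(tokens, attn, ffn)
print(out.shape)  # (4, 16)
```

Note that many recent transformers instead place LayerNorm before each sublayer (pre-LN), which tends to stabilize very deep stacks; the post-LN arrangement shown here matches the original transformer design described in this module.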