📐 Positional Encoding Deep Dive

Understanding how transformers encode sequential information


Why Positional Encoding?

🎯 The Position Problem

Unlike RNNs, transformers process all tokens in parallel. This efficiency comes at a cost: self-attention is permutation-invariant. Without positional information, "dog bites man" and "man bites dog" produce identical sets of token representations.

💡
Key Insight

Positional encoding injects order information into token embeddings, allowing transformers to understand sequence structure while maintaining parallel processing.

🔄 How It Works

Positional encodings are vectors added to token embeddings before the first attention layer. Each position gets a unique vector of the same dimension as the token embedding, so the two can be summed element-wise.

final_embedding = token_embedding + positional_encoding
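In code, this sum is a plain element-wise addition of two same-shaped matrices. A minimal NumPy sketch (shapes and values are illustrative placeholders, not from any real model):

```python
import numpy as np

seq_len, d_model = 4, 8
rng = np.random.default_rng(0)

token_embedding = rng.normal(size=(seq_len, d_model))      # rows from the embedding table
positional_encoding = rng.normal(size=(seq_len, d_model))  # placeholder; real schemes below

final_embedding = token_embedding + positional_encoding    # element-wise sum, same shape
assert final_embedding.shape == (seq_len, d_model)
```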
🌊
Sinusoidal

Original Transformer approach using sine and cosine waves at different frequencies
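The sinusoidal scheme from "Attention Is All You Need" can be written in a few lines of NumPy; this is a minimal sketch with even dimensions holding sines and odd dimensions holding cosines:

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)                    # even dimension indices
    freqs = 1.0 / (10000 ** (dims / d_model))          # one frequency per sin/cos pair
    angles = positions * freqs                         # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(128, 64)
```

Each pair of dimensions oscillates at a different frequency, from one full cycle every 2π positions down to one every 2π·10000 positions, so every value stays in [-1, 1] while the row as a whole is unique per position.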

🎓
Learned

Trainable position embeddings optimized during model training (BERT, GPT)
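Learned position embeddings are just a trainable lookup table with one row per position. A sketch of the idea (names and initialization scale are illustrative; in a real model the table is updated by backpropagation):

```python
import numpy as np

max_len, d_model = 512, 64
rng = np.random.default_rng(0)
# Trainable (max_len, d_model) matrix; here randomly initialized as a stand-in.
position_table = rng.normal(scale=0.02, size=(max_len, d_model))

def lookup_positions(seq_len: int) -> np.ndarray:
    # Positions beyond max_len have no row: learned encodings cannot
    # extrapolate past the training length.
    assert seq_len <= max_len
    return position_table[:seq_len]

pos_emb = lookup_positions(10)   # (10, 64), added to the token embeddings
```

The hard `max_len` cutoff is the key trade-off versus the sinusoidal formula, which is defined for any position.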

🔗
Relative

Modern methods encoding relative distances between tokens (RoPE, ALiBi)
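The rotary approach (RoPE) can be sketched in NumPy: instead of adding a vector, it rotates consecutive (even, odd) feature pairs of each query/key vector by a position-dependent angle. The pairing and frequency choices below follow the RoFormer formulation, but this is an illustrative sketch, not any library's exact implementation:

```python
import numpy as np

def rope(x: np.ndarray) -> np.ndarray:
    """Rotate each (even, odd) feature pair of row m by angle m * theta_i."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, np.newaxis]
    theta = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # per-pair frequency
    angle = pos * theta                                # (seq_len, d/2)
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin          # 2D rotation per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because both queries and keys are rotated, the dot product between a query at position m and a key at position n depends only on the offset m − n, which is exactly the "relative" property these methods aim for.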

⚖️ Design Requirements

Unique per Position

Each position must have a distinct encoding

Bounded Values

Values should stay within a fixed range (e.g., [-1, 1])

Extrapolation

Should generalize to sequence lengths unseen during training

Consistent Distances

Relative distances between positions should be meaningful
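The first two requirements can be checked numerically against the original sinusoidal scheme; a minimal self-contained sketch:

```python
import numpy as np

def sinusoidal(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, np.newaxis]
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

pe = sinusoidal(1000, 64)

# Unique per position: no two rows coincide (up to rounding).
assert len(np.unique(pe.round(10), axis=0)) == 1000
# Bounded values: sines and cosines never leave [-1, 1].
assert np.all(np.abs(pe) <= 1.0)
# Extrapolation: the formula is defined for any position, including
# positions past the training length, simply by evaluating it there.
# Consistent distances: PE(pos + k) is a fixed rotation of PE(pos),
# independent of pos, because each pair rotates at a constant frequency.
```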