Transformer Architecture Explained
Understand attention mechanisms and modern language models
What is Transformer Architecture?
The Transformer revolutionized AI by replacing recurrence with attention mechanisms. It is the architecture behind GPT, BERT, and virtually all modern large language models.
💡 Core Innovation: Attention Is All You Need
Self-Attention: The Core Innovation
🧠 Why Attention Revolutionized NLP
Before transformers, RNNs processed words sequentially, which was slow and prone to forgetting. Self-attention lets each word directly look at every other word in parallel, computing relationships in one step.
❌ RNN/LSTM Problems
✅ Self-Attention Solutions
🔬 Attention Score Computation (Simplified)
📊 Computational Complexity
1. Self-Attention Mechanism
🎯 Interactive: Click Words to See Attention
Self-attention lets each word understand its relationship with every other word in the sentence.
Attention from "sat" to other words:
💡 Key Insight: "sat" pays most attention to "cat" (semantic relationship).
Query-Key-Value: The Retrieval Metaphor
🔍 Database-Inspired Attention Mechanism
Think of attention as a soft database lookup: Query searches, Keys match, Values return. This elegant abstraction powers all transformer attention.
💡 The Analogy: Search Engine
(Linear projection of input embedding)
(Different projection, same dimension as Q)
(Actual content to return if matched)
🧮 The Math: Scaled Dot-Product Attention
• Shape: [seq_len, seq_len]
• Example: "sat" query matches all keys
• Higher score = more relevant
• Without scaling: large dot products → saturated softmax
• With scaling: gradients flow better
• Critical for training stability
• Each row sums to 1.0
• Softmax(x_i) = e^(x_i) / Σ e^(x_j)
• Differentiable = backprop works
• High attention → more influence
• Output shape: [seq_len, d_v]
• Each position = context-aware representation
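The pipeline in the bullets above (scores, scaling, softmax, weighted sum) fits in a few lines of NumPy. A minimal sketch, with illustrative shapes and random arrays standing in for learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # [seq_len, seq_len]
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1.0
    return weights @ V, weights                     # output: [seq_len, d_v]

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 64))    # 6 tokens, d_k = 64
K = rng.normal(size=(6, 64))
V = rng.normal(size=(6, 64))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Without the 1/sqrt(d_k) factor, the variance of the dot products grows with d_k and the softmax saturates, which is exactly the scaling issue noted above.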
🎯 Why Three Separate Projections?
Answer: Learned projections allow the model to transform embeddings into "search-friendly" and "content-friendly" spaces.
Answer: Q and K are optimized for matching (finding relevant positions). V is optimized for content (what to return).
Answer: Q and K must match (for dot product). V can differ, but typically d_q = d_k = d_v = d_model / num_heads.
2. Query, Key, Value Mechanism
🔍 Interactive: How Attention Computes
Step 1: Input Embeddings
Each word is represented as a vector (e.g., 512 dimensions).
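As a concrete sketch of Step 1, an embedding is just a row lookup in a table; the toy vocabulary and random vectors below stand in for learned embeddings:

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
d_model = 512
rng = np.random.default_rng(0)
# In a real model this table is learned during training
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "cat", "sat"]
x = embedding_table[[vocab[t] for t in tokens]]   # shape [3, 512]
```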
Multi-Head Attention: Ensemble of Perspectives
🎭 Why Multiple Attention Heads?
A single attention head might miss nuances. Multiple heads let the model attend to different types of relationships simultaneously: syntax, semantics, position, coreference.
🔬 What Different Heads Learn (Empirical Observations)
• Adjective → Noun modifications
• Preposition → Object dependencies
• Contextual word sense
• Thematic relationships
• Local n-gram patterns
• Relative position awareness
• Discourse coherence
• Sentence-level structure
🧮 The Mathematics: Splitting and Concatenating
h = 8 heads
d_k = d_model / h = 512 / 8 = 64 per head
• Final linear projection W_O mixes information from all heads
• Output dimension = d_model (512), same as input
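The split-and-concatenate bookkeeping above can be sketched in NumPy; the projection matrices here are random stand-ins for learned weights:

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h            # 64 dimensions per head
seq_len = 6
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))

# One full-size projection per role; "heads" are just a reshape of the result
W_q = rng.normal(size=(d_model, d_model)) * 0.02
W_k = rng.normal(size=(d_model, d_model)) * 0.02
W_v = rng.normal(size=(d_model, d_model)) * 0.02
W_o = rng.normal(size=(d_model, d_model)) * 0.02

def split_heads(t):           # [seq, d_model] -> [h, seq, d_k]
    return t.reshape(seq_len, h, d_k).transpose(1, 0, 2)

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # [h, seq, seq]
heads = weights @ V                                          # [h, seq, d_k]
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # re-join heads
out = concat @ W_o                                           # W_O mixes all heads
```

Note that each head attends over the full sequence but only sees a 64-dimensional slice of the representation, which is what lets different heads specialize.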
💡 Empirical Findings from Research
3. Multi-Head Attention
🎭 Interactive: Multiple Attention Heads
Head 1 Focus:
💡 Why Multiple Heads? Each head learns different types of relationships. They're concatenated and projected to form the final output.
Positional Encoding: Injecting Word Order
📍 The Problem: Attention is Position-Agnostic
Self-attention is permutation-invariant: it treats "cat sat mat" and "mat sat cat" identically. Positional encodings add unique patterns for each position so the model can distinguish word order.
🧠 Why Sinusoidal Functions?
🔢 The Formula Explained
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
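A minimal NumPy sketch of this formula (the interleaved sin/cos layout used here is the convention from the original paper; it assumes an even d_model):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # [max_len, 1] positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dims get sine
    pe[:, 1::2] = np.cos(angle)                # odd dims get cosine
    return pe

pe = sinusoidal_pe(100, 512)   # one unique 512-dim pattern per position
```

Each position gets a distinct vector of bounded values (all in [-1, 1]), which is simply added to the token embeddings.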
✅ Sinusoidal Advantages
🔄 Alternatives Used in Practice
4. Positional Encoding
📍 Interactive: Adding Position Information
Since transformers process all positions simultaneously, we add positional encodings to preserve word order.
Sinusoidal Pattern Visualization
Encoder vs Decoder: Two Architectural Paradigms
🏛️ Three Architectural Families
Encoder-Only
Decoder-Only
Encoder-Decoder
🔍 Encoder Architecture Deep Dive
Token "cat" attends to: [The, cat, sat, on, the, mat] → sees everything
✨ Decoder Architecture Deep Dive
Token "cat" attends to: [The, cat] → can't see "sat", "on", "the", "mat"
🌉 Cross-Attention: The Bridge Between Encoder & Decoder
In encoder-decoder models, the decoder has a third attention sublayer called cross-attention that connects encoder outputs to the decoder.
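A minimal sketch of cross-attention, with the learned projections omitted and random arrays standing in for real encoder/decoder states: queries come from the decoder, keys and values from the encoder, so the output length follows the decoder.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(10, 64))    # 10 source tokens (encoder output)
dec_state = rng.normal(size=(4, 64))   # 4 target tokens generated so far

# Q from the decoder; K and V from the encoder
weights = softmax(dec_state @ enc_out.T / np.sqrt(64))  # [4, 10]
out = weights @ enc_out                                 # [4, 64]
```

Each of the 4 decoder positions produces a distribution over all 10 source tokens, which is how, say, a translation model decides which source words to consult for the next output word.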
5. Encoder vs Decoder Architecture
🏗️ Interactive: Full Transformer Structure
📥 Encoder (Understanding)
📤 Decoder (Generation)
6. Layer Normalization
📊 Interactive: Normalize Activations
Before (Raw Values)
After Normalization
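Layer normalization rescales each position's feature vector to zero mean and unit variance, which is what the before/after comparison above shows. A minimal sketch (gamma and beta are learned scale/shift parameters in a real model):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each position's features to mean 0, variance 1."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[2.0, 100.0, -3.0, 0.5]])   # raw activations on wildly different scales
y = layer_norm(x)                          # same shape, tamed scale
```

Unlike batch normalization, the statistics are computed per token across the feature dimension, so the operation behaves identically at training and inference time.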
7. Feed-Forward Network
🧮 Interactive: Position-wise FFN
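The position-wise FFN applies the same two-layer MLP to every position independently, typically expanding d_model by 4x before projecting back. A minimal sketch with random stand-in weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to d_ff, apply ReLU, project back to d_model."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU nonlinearity

d_model, d_ff = 512, 2048       # the standard 4x expansion
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

x = rng.normal(size=(6, d_model))   # 6 positions, processed independently
y = feed_forward(x, W1, b1, W2, b2)
```

Because the same weights are applied to each position with no interaction between positions, the FFN is where per-token transformation happens; mixing across positions is the attention sublayer's job.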
8. Attention Masking
🎭 Interactive: Masking Patterns
Attention Matrix Visualization
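Causal masking can be sketched as a lower-triangular boolean matrix applied before the softmax, producing exactly the triangular attention pattern visualized above:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -1e9)   # blocked positions get huge negative scores
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))            # raw attention scores for 5 tokens
weights = masked_softmax(scores, causal_mask(5))
```

Padding masks work the same way, except the blocked columns are the padding positions rather than future positions.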
9. Transformer Applications
🌍 Interactive: Famous Models
GPT (Decoder-only)
Decoder-only with causal masking
Autoregressive language model. Predicts next token given previous context.
10. Model Size Calculator
📐 Interactive: Estimate Parameters
💡 Reference: GPT-2 (117M), BERT-base (110M), GPT-3 (175B), GPT-4 (~1.7T estimated)
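A rough back-of-the-envelope counter for a decoder-only model, ignoring biases, layer norms, and positional parameters. The hypothetical config below approximates GPT-2 small and lands near its published ~117-124M figure:

```python
def transformer_params(n_layers, d_model, d_ff, vocab_size, tie_embeddings=True):
    """Approximate parameter count for a decoder-only transformer."""
    attn = 4 * d_model * d_model          # W_q, W_k, W_v, W_o projections
    ffn = 2 * d_model * d_ff              # up-projection and down-projection
    per_layer = attn + ffn
    embed = vocab_size * d_model          # token embedding table
    total = n_layers * per_layer + embed
    if not tie_embeddings:
        total += vocab_size * d_model     # separate output head
    return total

# GPT-2-small-like config: 12 layers, d_model=768, d_ff=3072, vocab 50257
n = transformer_params(12, 768, 3072, 50257)   # roughly 124M
```

Note how the embedding table alone contributes ~38M parameters here; at larger scales the per-layer terms dominate because they grow quadratically with d_model.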
🎯 Key Takeaways
Self-Attention is Key
Transformers replace recurrence with attention. Each position attends to all others simultaneously, enabling parallelization and long-range dependencies.
Multi-Head Attention
Multiple attention heads learn different relationships: syntax, semantics, position, long-range. They're concatenated for rich representations.
Positional Encoding
Since attention is permutation-invariant, positional encodings inject word-order information, classically via fixed sinusoidal patterns (learned position embeddings are a common alternative).
Encoder-Decoder Structure
Encoder for understanding (BERT), decoder for generation (GPT), or both for translation (T5). Each serves different purposes.
Masking Mechanisms
Causal masking for autoregressive generation (GPT), padding masking for variable lengths, no masking for bidirectional understanding (BERT).
Scalability
Transformers scale remarkably well, from 110M parameters (BERT) to 175B (GPT-3) and beyond into the trillions; more compute and data generally yield better performance.