🧠 Mixture of Experts (MoE)
Efficient scaling with sparse expert networks
Introduction to Mixture of Experts
🎯 What is MoE?
Mixture of Experts (MoE) is a sparse neural network architecture where multiple specialized sub-networks (experts) process different parts of the input, with a gating mechanism selecting which experts to activate for each input.
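A minimal NumPy sketch of this idea, with toy dimensions and illustrative names (`expert_w`, `gate_w`, `moe_forward` are not from any real library): each expert is a small linear transform standing in for a feed-forward block, the gate scores all experts, and only the top-k are actually evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2

# Expert weights: one linear layer per expert (toy stand-in for an FFN).
expert_w = rng.standard_normal((n_experts, d_model, d_model)) * 0.1
# Gating network: a single linear layer producing one score per expert.
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Route a single token vector x through its top-k experts."""
    logits = x @ gate_w                    # (n_experts,) gate scores
    top = np.argsort(logits)[-top_k:]      # indices of the k highest-scoring experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                   # softmax over the selected experts only
    # Weighted sum of the chosen experts' outputs; unselected experts run no compute.
    return sum(p * (x @ expert_w[i]) for p, i in zip(probs, top))

x = rng.standard_normal(d_model)
y = moe_forward(x)
print(y.shape)  # (8,)
```

Note that the output has the same shape as the input: an MoE layer is a drop-in replacement for a dense feed-forward layer, only the internal routing differs.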
The key idea: keep a massive pool of parameters, but activate only a small fraction for each input — scale capacity without a proportional increase in compute
🌟 Why MoE?
Efficient Scaling
Increase model capacity without proportional compute increase
Specialization
Different experts learn different skills and patterns
Cost Effective
Sparse activation reduces inference costs dramatically
Better Performance
MoE models often outperform dense models trained with a similar compute budget
🔑 Core Concepts
Experts
Independent neural networks (typically feed-forward layers) that specialize in different input patterns
Gating Network
Learned routing function that decides which experts to activate for each input
Sparse Activation
Only top-k experts are activated per input (e.g., 2 out of 128), keeping compute constant
Load Balancing
Auxiliary losses ensure experts are used evenly, preventing collapse
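One common form of that auxiliary loss is the Switch-Transformer-style balancing term: with top-1 routing, take for each expert the fraction of tokens dispatched to it and its mean gate probability, and minimize `n_experts * sum_i(frac_i * mean_prob_i)`. The sketch below (function name and test data are illustrative) shows that uniform routing attains the minimum value of 1.0 while routing collapse is penalized.

```python
import numpy as np

def load_balancing_loss(gate_probs):
    """gate_probs: (n_tokens, n_experts) softmax outputs of the gate."""
    n_experts = gate_probs.shape[1]
    assignments = gate_probs.argmax(axis=1)  # top-1 routing decision per token
    # f_i: fraction of tokens dispatched to expert i
    frac = np.bincount(assignments, minlength=n_experts) / len(assignments)
    # P_i: mean gate probability assigned to expert i
    mean_prob = gate_probs.mean(axis=0)
    return n_experts * float(frac @ mean_prob)

balanced = np.tile(np.eye(4), (2, 1))                 # 8 tokens spread evenly over 4 experts
collapsed = np.zeros((8, 4)); collapsed[:, 0] = 1.0   # every token routed to expert 0
print(load_balancing_loss(balanced))    # 1.0 (minimum: uniform routing)
print(load_balancing_loss(collapsed))   # 4.0 (collapse onto one expert is penalized)
```

Because both factors are differentiable through the gate probabilities, gradient descent on this term nudges the router toward even expert usage.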
📊 MoE vs Dense Models
| Aspect | Dense Model | MoE Model |
|---|---|---|
| Parameters | All used per input | Only top-k experts |
| Compute | Scales with params | Fixed (sparse) |
| Capacity | Limited by compute | Massive scaling |
| Training | Simpler | Complex routing |
🏆 Success Stories
Switch Transformer (Google, 2021)
1.6T params. Simplified MoE routing a single expert per token (top-1); achieved state-of-the-art results on many benchmarks
GLaM (Google, 2021)
1.2T params. MoE language model that used roughly 1/3 of GPT-3's compute while achieving better performance
Mixtral 8x7B (Mistral, 2023)
47B params. Open-source MoE matching GPT-3.5 with about 13B active params per token
GPT-4 (rumored)
Speculated to use an MoE architecture for efficient trillion-scale models
⚡ Efficiency Example
Dense Model
100B parameters
100B FLOPs per token
MoE Model
800B total parameters
100B FLOPs per token (8 experts, top-1)
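The arithmetic behind this example, as a sketch (the figures come from the comparison above; the variable names are illustrative): with 8 experts and top-1 routing, the MoE stores 8x the parameters of the dense model, but each token still flows through only one expert's worth of weights.

```python
dense_params = 100e9                   # dense model: all params active per token
n_experts, top_k = 8, 1

moe_total = dense_params * n_experts   # parameters stored: 800B
moe_active = dense_params * top_k      # parameters used per token: 100B

print(f"{moe_total / 1e9:.0f}B total, {moe_active / 1e9:.0f}B active")
# -> 800B total, 100B active
```

Since per-token FLOPs track active (not total) parameters, the MoE matches the dense model's compute cost while holding 8x the capacity.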