🧠 Mixture of Experts (MoE)

Efficient scaling with sparse expert networks


Introduction to Mixture of Experts

🎯 What is MoE?

Mixture of Experts (MoE) is a sparse neural network architecture where multiple specialized sub-networks (experts) process different parts of the input, with a gating mechanism selecting which experts to activate for each input.

💡 Key Insight

Train a massive model but activate only a small fraction of its parameters per input - scale capacity without a proportional increase in compute
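The idea can be sketched in a few lines of NumPy. This is a toy illustration, not a real implementation: each expert is a single linear map, the gate is a linear scorer, and all names and sizes are made up for the example. A learned gate scores every expert, but only the top-k actually run:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoELayer:
    """Toy MoE layer (illustrative sketch): each expert is one linear
    map; a learned gate picks which top_k experts process the input."""
    def __init__(self, d_model, n_experts, top_k=2):
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.top_k = top_k

    def forward(self, x):
        logits = x @ self.gate                   # one score per expert
        top = np.argsort(logits)[-self.top_k:]   # indices of the top-k experts
        weights = softmax(logits[top])           # renormalize over the chosen k
        # Sparse activation: only the selected experts ever run.
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, top))

layer = MoELayer(d_model=8, n_experts=4, top_k=2)
y = layer.forward(rng.standard_normal(8))
print(y.shape)  # (8,)
```

The unselected experts contribute zero compute for this input, which is the whole point: total parameter count grows with `n_experts`, but per-input FLOPs grow only with `top_k`.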

🌟 Why MoE?

Efficient Scaling

Increase model capacity without proportional compute increase

🎯 Specialization

Different experts learn different skills and patterns

💰 Cost Effective

Sparse activation reduces inference costs dramatically

🚀 Better Performance

MoE models outperform dense models trained with a similar compute budget

🔑 Core Concepts

Experts

Independent neural networks (typically feed-forward layers) that specialize in different input patterns

Gating Network

Learned routing function that decides which experts to activate for each input

Sparse Activation

Only the top-k experts run per input (e.g., 2 out of 128), so per-token compute stays roughly constant no matter how many experts exist

Load Balancing

Auxiliary losses encourage even expert utilization, preventing routing collapse (the gate sending nearly all tokens to a few experts)
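One common formulation is the Switch-Transformer-style auxiliary loss (sketched below with illustrative variable names): multiply the fraction of tokens each expert receives by the mean router probability it gets, sum over experts, and scale so a perfectly uniform router scores exactly 1.0. Any imbalance pushes the value above 1, so minimizing it pushes routing toward uniform:

```python
import numpy as np

def load_balancing_loss(gate_probs, assignments, n_experts):
    """Auxiliary balancing loss in the style of the Switch Transformer
    (a sketch; argument names are illustrative).

    gate_probs:  (n_tokens, n_experts) softmax outputs of the router
    assignments: (n_tokens,) index of the expert each token was sent to
    """
    n_tokens = gate_probs.shape[0]
    f = np.bincount(assignments, minlength=n_experts) / n_tokens  # token fraction per expert
    p = gate_probs.mean(axis=0)                                   # mean router probability
    # Scaled so a perfectly uniform router scores exactly 1.0.
    return n_experts * float(np.dot(f, p))

# Perfectly balanced: every expert gets 1/4 of the tokens and probability mass.
probs = np.full((4, 4), 0.25)
assign = np.array([0, 1, 2, 3])
print(load_balancing_loss(probs, assign, 4))  # 1.0
```

In practice this term is added to the main training loss with a small coefficient, so the router is nudged toward balance without overriding the task objective.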

📊 MoE vs Dense Models

Aspect       Dense Model                 MoE Model
Parameters   All used per input          Only top-k experts active
Compute      Scales with param count     Fixed (sparse activation)
Capacity     Limited by compute          Massive scaling
Training     Simpler                     More complex (routing)

🏆 Success Stories

Switch Transformer (Google, 2021)

1.6T params

Simplified MoE that routes each token to a single expert (top-1), achieving large pre-training speedups over comparable dense T5 models

GLaM (Google, 2021)

1.2T params

MoE language model that matched or exceeded GPT-3 quality while using about one third of GPT-3's training energy and roughly half the inference FLOPs

Mixtral 8x7B (Mistral, 2023)

47B params

Open-weights MoE matching or exceeding GPT-3.5 on most benchmarks, with ~13B active parameters per token

GPT-4 (rumored)

MoE

Speculated to use MoE architecture for efficient trillion-scale models

⚡ Efficiency Example

Dense Model

100B parameters

100B FLOPs per token

MoE Model

800B total parameters

100B FLOPs per token (8 experts of 100B each, top-1 routing)

8x model capacity with same compute cost! 🚀
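A quick back-of-envelope check of this example (a sketch; the figures are taken from the comparison above, and expert sizes are assumed uniform):

```python
# Back-of-envelope check of the figures above (expert sizes assumed uniform).
n_experts, top_k = 8, 1
expert_params = 100e9    # parameters per expert
dense_params = 100e9     # the dense baseline

moe_total = n_experts * expert_params   # total capacity: 800B
moe_active = top_k * expert_params      # activated per token: 100B

print(f"total capacity: {moe_total / 1e9:.0f}B")           # 800B
print(f"active/token:   {moe_active / 1e9:.0f}B")          # 100B, same as dense
print(f"capacity gain:  {moe_total / dense_params:.0f}x")  # 8x
```

Note that FLOPs is not the whole cost story: all 800B parameters must still fit in (or stream through) accelerator memory, which is the main practical price of MoE scaling.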