🎯 Q-Learning Visualizer

Master temporal difference learning through interactive Q-value exploration


What is Q-Learning?

Learning Optimal Actions

Q-Learning is a model-free reinforcement learning algorithm that learns the quality (Q-value) of actions in different states. It discovers optimal policies by iteratively updating action-value estimates based on experience, without requiring a model of the environment.

🎯

Q-Value Function

Q(s,a) represents the expected cumulative discounted reward for taking action a in state s and acting optimally thereafter. The agent learns these values through experience.

🔄

Temporal Difference

Updates Q-values using the difference between predicted and actual rewards, enabling online learning.

🎲

Off-Policy Learning

Learns the optimal policy while following an exploratory behavior policy, decoupling what is learned from how the agent acts.

🧠
Model-Free

No need to know environment dynamics—learns directly from interactions and observed rewards.

The Q-Learning Equation

Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]

Q(s,a): Current Q-value estimate
α: Learning rate (0–1)
r: Immediate reward received
γ: Discount factor (0–1)
max_a' Q(s',a'): Best estimated value available from the next state s'
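The update rule above can be transcribed almost line for line. The sketch below assumes a tabular NumPy Q-table indexed by integer states and actions; the shapes and the `q_update` helper name are illustrative, not part of any particular library.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = r + gamma * np.max(Q[s_next])  # r + γ max_a' Q(s', a')
    td_error = td_target - Q[s, a]             # gap between target and current estimate
    Q[s, a] += alpha * td_error                # nudge the estimate toward the target
    return td_error

# Example: one transition in a toy table with 4 states and 2 actions.
Q = np.zeros((4, 2))
err = q_update(Q, s=0, a=1, r=1.0, s_next=2)   # TD error is 1.0; Q[0, 1] becomes 0.1
```

Note that only one cell of the table changes per transition; α controls how far the estimate moves toward the target on each update.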
💡

Key Insight

Q-Learning bootstraps—it updates estimates using other estimates. This temporal difference approach allows learning before reaching terminal states, making it efficient for episodic and continuing tasks.

✅ Advantages

  • Simple and effective algorithm
  • Provably converges to the optimal policy (given sufficient exploration and a decaying learning rate)
  • Works with discrete state-action spaces

⚠️ Limitations

  • Tabular form doesn't scale to large or continuous state spaces
  • Slow convergence in complex environments
  • Requires careful exploration tuning
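One common remedy for the exploration-tuning issue is to anneal ε over training, starting mostly random and ending mostly greedy. The schedule below is a hedged sketch; the constants are illustrative defaults, not recommended values.

```python
# Exponential epsilon decay with a floor, a typical exploration schedule.
eps_start, eps_end, decay = 1.0, 0.05, 0.995

def epsilon_at(episode):
    """Epsilon for a given episode: decays geometrically, never below eps_end."""
    return max(eps_end, eps_start * decay ** episode)

# epsilon_at(0) == 1.0 (fully random); later episodes approach the 0.05 floor.
```

The decay rate and floor still need tuning per environment, which is exactly the limitation noted above; the schedule only replaces a fixed ε with two or three knobs that are easier to reason about.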