🎭 Actor-Critic Architectures

Combining policy and value learning for powerful reinforcement learning


Introduction to Actor-Critic

🎯 What is Actor-Critic?

Actor-Critic combines the best of policy gradient and value-based methods. The actor learns the policy (what to do), while the critic evaluates actions by estimating value functions. This synergy reduces variance and accelerates learning.

💡
Key Insight

The critic provides a baseline that reduces the variance of policy gradient updates, making learning more stable and sample-efficient than pure policy gradients.
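In symbols, the baseline-subtracted policy gradient takes the standard form below (written generically, not tied to any particular implementation):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[
      \nabla_\theta \log \pi_\theta(a \mid s)\,
      \bigl(Q^{\pi}(s,a) - V^{\pi}(s)\bigr)
    \right]
```

Subtracting the state-dependent baseline V(s) leaves the gradient unbiased, since E[∇ log π(a|s) b(s)] = 0 for any baseline that does not depend on the action, while shrinking the variance of the estimate.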

🎭
Actor Network

Learns the policy π(a|s) mapping states to action probabilities

  • Outputs action distribution
  • Updated via policy gradient
  • Guided by critic's feedback
📊
Critic Network

Estimates value function V(s) or Q(s,a) to judge action quality

  • Evaluates state/action pairs
  • Updated via TD learning
  • Provides advantage estimates
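The two networks can be sketched minimally with linear function approximators (real implementations typically use neural networks; the dimensions and class names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

class Actor:
    """Linear-softmax policy: maps a state vector to action probabilities pi(a|s)."""
    def __init__(self, n_state_dims, n_actions):
        self.W = rng.normal(scale=0.1, size=(n_state_dims, n_actions))

    def probs(self, s):
        logits = s @ self.W
        e = np.exp(logits - logits.max())  # numerically stable softmax
        return e / e.sum()

class Critic:
    """Linear value function: maps a state vector to a scalar estimate V(s)."""
    def __init__(self, n_state_dims):
        self.w = np.zeros(n_state_dims)

    def value(self, s):
        return float(s @ self.w)

# Hypothetical example: 4-dimensional state, 2 actions
actor, critic = Actor(4, 2), Critic(4)
s = np.array([1.0, 0.0, 0.5, -0.5])
p = actor.probs(s)
print(p)  # a valid probability distribution over the 2 actions
```

The actor outputs a distribution the agent samples from, while the critic returns a single scalar judgment of the state; the two are trained with different losses but share the same experience stream.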

🔄 The Actor-Critic Loop

1
Actor Selects Action

Sample action a from policy π(a|s) given current state s

2
Environment Responds

Execute action, observe reward r and next state s'

3
Critic Evaluates

Compute TD error: δ = r + γV(s') - V(s)

4
Update Both Networks

Critic updates V(s) from the TD error; actor improves the policy using δ as an estimate of the advantage A(s,a)
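The four steps above can be sketched end-to-end as a one-step (TD) actor-critic on a hypothetical toy problem: in state 0, action 1 gives reward 1 and ends the episode, action 0 gives reward 0 and stays put. The learning rates and tabular setup are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha_actor, alpha_critic = 0.9, 0.1, 0.2

n_actions = 2
theta = np.zeros(n_actions)  # actor: softmax preferences for state 0
V = np.zeros(2)              # critic: tabular state values

def policy():
    """Softmax policy pi(a|s=0) from the preference vector theta."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

s = 0
for _ in range(3000):
    # 1. Actor selects action: sample a ~ pi(a|s)
    p = policy()
    a = rng.choice(n_actions, p=p)
    # 2. Environment responds (hand-coded toy dynamics)
    r = 1.0 if a == 1 else 0.0
    s_next, done = (1, True) if a == 1 else (0, False)
    # 3. Critic evaluates: TD error delta = r + gamma*V(s') - V(s)
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
    # 4. Update both: critic by TD learning, actor by policy gradient
    #    with delta standing in for the advantage A(s,a)
    V[s] += alpha_critic * delta
    grad_log = -p
    grad_log[a] += 1.0                # gradient of log-softmax
    theta += alpha_actor * delta * grad_log
    s = 0 if done else s_next         # reset after terminal transitions

print(policy())  # preferences shift toward the rewarding action 1
```

Note that the same TD error drives both updates: it moves V(s) toward the bootstrapped target, and its sign tells the actor whether the sampled action was better or worse than expected.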

✅ Advantages

  • Lower variance than REINFORCE
  • More sample-efficient learning
  • Online and incremental updates
  • Works with continuous actions

⚠️ Challenges

  • Two networks to train simultaneously
  • Bootstrapped value estimates introduce bias
  • Sensitive to learning rates and other hyperparameters
  • Training can destabilize if actor and critic get out of step