📊 Policy Gradient Methods

Learn to optimize policies directly through gradient ascent


What are Policy Gradient Methods?

Direct Policy Optimization

Unlike value-based methods (Q-Learning), policy gradient methods directly optimize the policy by computing gradients of expected reward with respect to policy parameters. This enables learning in continuous action spaces and stochastic policies.

🎯

Parameterized Policy

The policy π(a|s,θ) is represented by parameters θ (e.g., neural network weights) that we optimize directly.
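A common concrete choice for a parameterized policy over discrete actions is a softmax over linear scores. A minimal sketch (the weight shapes and feature vector here are illustrative assumptions, not part of the module):

```python
import numpy as np

def softmax_policy(theta, state):
    """Return action probabilities π(a|s,θ) for a linear-softmax policy."""
    logits = theta @ state          # one score per action
    logits -= logits.max()          # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))     # 3 actions, 4 state features (assumed)
state = np.array([1.0, 0.5, -0.2, 0.3])
probs = softmax_policy(theta, state)
action = rng.choice(3, p=probs)     # stochastic action selection
```

Because the output is a proper probability distribution, sampling from it gives exploration for free, which is the point of the next two cards.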

📈

Gradient Ascent

Update parameters in direction that increases expected cumulative reward using gradient ascent.

🎲

Stochastic Policies

Naturally handle exploration through probability distributions over actions.

🔄

Continuous Actions

Work seamlessly with continuous action spaces where value methods struggle.

The Policy Gradient Theorem

∇θJ(θ) = Eπ[∇θ log π(a|s,θ) Qπ(s,a)]
∇θJ(θ): Gradient of expected return
π(a|s,θ): Parameterized policy
Qπ(s,a): Action-value function
Eπ[·]: Expected value under the policy
💡

Key Insight

The policy gradient theorem shows we can compute gradients without knowing environment dynamics. We only need to sample trajectories from the policy and use observed rewards.
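This sample-based view can be sketched directly in code: for a linear-softmax policy, ∇θ log π(a|s,θ) has the closed form (onehot(a) − probs) ⊗ state, and one REINFORCE-style update scales it by an observed return. The shapes, step size, and placeholder return below are illustrative assumptions, not the module's reference implementation:

```python
import numpy as np

def softmax_probs(theta, state):
    logits = theta @ state
    logits -= logits.max()          # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def score(theta, state, action):
    """∇θ log π(a|s,θ) for a linear-softmax policy (same shape as theta)."""
    probs = softmax_probs(theta, state)
    onehot = np.zeros(len(probs))
    onehot[action] = 1.0
    return np.outer(onehot - probs, state)

# One gradient-ascent step from a sampled (state, action, return) triple.
alpha = 0.1                          # step size (assumed)
theta = np.zeros((2, 3))             # 2 actions, 3 features (assumed)
state = np.array([1.0, 0.0, -1.0])
rng = np.random.default_rng(1)
action = rng.choice(2, p=softmax_probs(theta, state))
G = 1.0                              # observed return (placeholder value)
theta += alpha * G * score(theta, state, action)
```

Note that nothing here queries the environment's transition model: only the sampled action and the observed return G enter the update, exactly as the theorem promises.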

✅ Advantages

  • Effective in high-dimensional spaces
  • Natural exploration via stochastic policies
  • Handles continuous action spaces
  • Guaranteed convergence to a local optimum

⚠️ Challenges

  • High variance in gradient estimates
  • Sample inefficient (needs many episodes)
  • Sensitive to hyperparameters
  • Can get stuck in local optima