Policy Gradient Methods
Learn to optimize policies directly through gradient ascent
What are Policy Gradient Methods?
Direct Policy Optimization
Unlike value-based methods such as Q-learning, policy gradient methods optimize the policy directly by computing gradients of the expected return with respect to the policy parameters. This enables learning stochastic policies and operating in continuous action spaces.
Parameterized Policy
Policy π(a|s, θ) is represented by parameters θ (e.g., neural network weights) that we optimize directly.
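One common parameterization for discrete actions is a linear-softmax policy. The sketch below is illustrative (the feature dimensions and function names are assumptions, not part of the original text):

```python
import numpy as np

def softmax_policy(theta, state):
    """Return action probabilities pi(a|s, theta) for a linear-softmax policy.

    theta: (n_actions, n_features) weight matrix; state: feature vector.
    """
    logits = theta @ state                 # one score per action
    logits -= logits.max()                 # shift for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

theta = np.zeros((3, 4))                   # 3 actions, 4 state features
state = np.array([1.0, 0.5, -0.2, 0.3])
probs = softmax_policy(theta, state)
# with all-zero weights, every action is equally likely
```

With a neural network, `theta @ state` would be replaced by a forward pass producing the logits; the softmax step is the same.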
Gradient Ascent
Update the parameters in the direction that increases the expected cumulative reward, using gradient ascent.
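The update itself is a single ascent step, θ ← θ + α ∇θ J(θ). In this sketch the learning rate and gradient values are placeholders; in practice the gradient is estimated from sampled trajectories:

```python
import numpy as np

alpha = 0.01                                    # learning rate (illustrative)
theta = np.zeros(4)                             # policy parameters
grad_J = np.array([0.5, -0.2, 0.1, 0.0])        # estimated gradient of expected return

# Gradient *ascent*: move parameters uphill on J(theta),
# unlike the descent step used for loss minimization.
theta = theta + alpha * grad_J
```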
Stochastic Policies
Naturally handle exploration through probability distributions over actions.
Continuous Actions
Work seamlessly with continuous action spaces where value methods struggle.
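For continuous actions, a standard choice is a Gaussian policy whose mean is a function of the state. A minimal sketch, assuming a linear mean and a fixed standard deviation (both simplifications):

```python
import numpy as np

def gaussian_policy_sample(theta, state, sigma=0.5, rng=None):
    """Sample a continuous action from a Gaussian policy N(mu(s), sigma^2).

    The mean is a linear function of state features: mu = theta . state.
    sigma is held fixed here; it can also be a learned parameter.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    mu = theta @ state
    return rng.normal(mu, sigma)

theta = np.array([0.2, -0.1, 0.4])
state = np.array([1.0, 2.0, 0.5])
action = gaussian_policy_sample(theta, state)   # a real-valued action
```

Value-based methods would need an argmax over an infinite action set here; the policy-based approach just samples from the distribution.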
The Policy Gradient Theorem
Key Insight
The policy gradient theorem shows we can compute gradients without knowing environment dynamics. We only need to sample trajectories from the policy and use observed rewards.
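Concretely, the theorem gives ∇θ J(θ) = E[∇θ log π(a|s, θ) · G_t], where G_t is the return from time t. The REINFORCE-style sketch below estimates this from one sampled trajectory, using the linear-softmax parameterization as an illustrative assumption:

```python
import numpy as np

def softmax_probs(theta, state):
    logits = theta @ state
    logits -= logits.max()
    e = np.exp(logits)
    return e / e.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """Monte-Carlo policy gradient estimate from a single trajectory.

    trajectory: list of (state, action, reward) tuples sampled from the policy.
    Only sampled states, actions, and rewards are used -- no model of the
    environment's dynamics, exactly as the policy gradient theorem allows.
    """
    grad = np.zeros_like(theta)
    # Compute the return-to-go G_t for every time step (backwards pass).
    G, returns = 0.0, []
    for (_, _, r) in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G in zip(trajectory, returns):
        probs = softmax_probs(theta, s)
        # grad of log pi(a|s) for linear-softmax: outer(one_hot(a) - probs, s)
        dlog = -np.outer(probs, s)
        dlog[a] += s
        grad += dlog * G
    return grad

theta = np.zeros((2, 3))                    # 2 actions, 3 state features
traj = [(np.array([1.0, 0.0, 0.0]), 0, 1.0),
        (np.array([0.0, 1.0, 0.0]), 1, 0.5)]
grad = reinforce_gradient(theta, traj)
```

Repeating this over many sampled episodes and averaging gives an unbiased (if noisy) estimate of the true gradient.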
✓ Advantages
- Effective in high-dimensional action spaces
- Natural exploration via stochastic policies
- Handles continuous action spaces
- Guaranteed convergence to a local optimum (under standard step-size conditions)
⚠️ Challenges
- High variance in gradient estimates
- Sample inefficient (needs many episodes)
- Sensitive to hyperparameters
- Can get stuck in local optima
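The first challenge, high variance, is commonly reduced by subtracting a baseline from the returns: weighting ∇ log π by (G_t − b) instead of G_t leaves the gradient unbiased but shrinks its variance. A minimal sketch, using the mean return as the baseline (one simple choice among several):

```python
import numpy as np

def returns_with_baseline(returns):
    """Subtract a constant baseline (the mean return) from Monte-Carlo returns.

    Because the baseline does not depend on the action, the policy gradient
    stays unbiased, but the centered weights typically have lower variance.
    """
    returns = np.asarray(returns, dtype=float)
    baseline = returns.mean()
    return returns - baseline

advantages = returns_with_baseline([10.0, 12.0, 8.0, 10.0])
# centered returns: their mean is now zero
```

A learned state-value function V(s) is the usual baseline in practice, which leads to actor-critic methods.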