Policy Gradient Methods
While Q-learning learns a value function and derives a policy from it, policy gradient methods learn the policy directly. The policy is parameterized (e.g., a neural network) and optimized using gradient ascent on expected reward.
Why Policy Gradients?
| Feature | Value-Based (DQN) | Policy-Based |
|---|---|---|
| Action space | Discrete only | Discrete or continuous |
| Policy type | Deterministic (argmax) | Stochastic (probability distribution) |
| Convergence | Can oscillate | Smoother convergence |
| Exploration | Needs epsilon-greedy | Built-in via stochastic policy |
The REINFORCE Algorithm
REINFORCE (Monte Carlo Policy Gradient) is the simplest policy gradient algorithm.
Key idea: Increase the probability of actions that led to high returns, decrease the probability of actions that led to low returns.
Policy gradient theorem:
∇_θ J(θ) = E[ Σ_t ∇_θ log π(a_t | s_t; θ) · G_t ]
Where:
- π(a_t | s_t; θ) is the probability the policy assigns to action a_t in state s_t
- G_t is the discounted return from timestep t onward
- θ are the policy parameters

The gradient says: adjust parameters in the direction that makes high-return actions more likely.
```python
import numpy as np

class REINFORCEAgent:
    """
    REINFORCE (Monte Carlo Policy Gradient) with a simple linear policy.
    For discrete actions, the policy outputs a softmax distribution.
    """

    def __init__(self, state_dim, action_dim, lr=0.01, gamma=0.99):
        self.lr = lr
        self.gamma = gamma
        self.action_dim = action_dim

        # Linear policy: softmax(state @ W + b)
        self.W = np.random.randn(state_dim, action_dim) * 0.01
        self.b = np.zeros(action_dim)

        # Episode storage
        self.saved_log_probs = []
        self.rewards = []

    def policy(self, state):
        """Compute action probabilities using softmax."""
        logits = state @ self.W + self.b
        # Numerically stable softmax
        logits -= np.max(logits)
        exp_logits = np.exp(logits)
        probs = exp_logits / np.sum(exp_logits)
        return probs

    def select_action(self, state):
        """Sample an action from the policy distribution."""
        probs = self.policy(state)
        action = np.random.choice(self.action_dim, p=probs)
        # Store transition data for the gradient computation
        self.saved_log_probs.append((state, action, np.log(probs[action] + 1e-8)))
        return action

    def compute_returns(self):
        """Compute discounted returns G_t for each timestep."""
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        returns = np.array(returns)
        # Normalize returns (variance reduction)
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self):
        """Update the policy using the REINFORCE gradient."""
        returns = self.compute_returns()

        for (state, action, log_prob), G in zip(self.saved_log_probs, returns):
            # Policy gradient: increase the probability of high-return actions
            probs = self.policy(state)
            # Gradient of log softmax w.r.t. the logits
            one_hot = np.zeros(self.action_dim)
            one_hot[action] = 1
            grad_logits = one_hot - probs  # d(log pi) / d(logits)

            # Update weights: theta += lr * G * grad(log pi)
            self.W += self.lr * G * np.outer(state, grad_logits)
            self.b += self.lr * G * grad_logits

        # Clear episode data
        self.saved_log_probs = []
        self.rewards = []

# Demo
agent = REINFORCEAgent(state_dim=4, action_dim=2)
state = np.random.randn(4)
probs = agent.policy(state)
print(f"Action probabilities: {probs}")
print(f"Selected action: {agent.select_action(state)}")
```

Advantage Estimation
Raw returns G_t have high variance. The advantage function reduces variance by subtracting a baseline:
A(s, a) = Q(s, a) - V(s)
The advantage tells us: "How much better is this action compared to the average?" Positive advantage means the action was better than expected, negative means worse.
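As a sketch of why this helps (using hypothetical arrays of returns and baseline value estimates, not data from a real environment), subtracting a baseline leaves the expected gradient unchanged but can cut its variance substantially:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Monte Carlo returns and a learned baseline V(s).
# The baseline tracks the returns but is imperfect (added noise).
returns = rng.normal(loc=10.0, scale=2.0, size=1000)
values = returns - rng.normal(loc=0.0, scale=0.5, size=1000)

advantages = returns - values  # A(s, a) ≈ G_t - V(s)

print(f"Variance of raw returns: {returns.var():.3f}")
print(f"Variance of advantages:  {advantages.var():.3f}")
```

A good baseline absorbs the part of the return that depends only on the state, so only the action-dependent signal is left to scale the gradient.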
Actor-Critic Methods
Actor-Critic combines policy gradients with value function learning:
- The actor is the policy π(a|s; θ), updated in the direction of the policy gradient.
- The critic is a learned value function V(s), which evaluates the states the actor visits.

The critic's value estimate serves as the baseline; the one-step TD error is then used as the advantage estimate:

A(s_t, a_t) ≈ r_t + γ V(s_{t+1}) - V(s_t)
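A minimal sketch of a single actor-critic update with linear function approximation (the function name, hyperparameters, and demo values here are illustrative, not from any specific library):

```python
import numpy as np

def actor_critic_step(state, action, reward, next_state, done,
                      W_actor, w_critic, lr_actor=0.01, lr_critic=0.1, gamma=0.99):
    """One on-policy actor-critic update with a linear actor and critic."""
    # Critic: linear value estimate V(s) = s @ w
    v = state @ w_critic
    v_next = 0.0 if done else next_state @ w_critic

    # TD error doubles as the advantage estimate
    td_error = reward + gamma * v_next - v

    # Critic update: move V(s) toward the TD target
    w_critic = w_critic + lr_critic * td_error * state

    # Actor: softmax policy over logits state @ W_actor
    logits = state @ W_actor
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    grad_logits = one_hot - probs  # d(log pi) / d(logits)

    # Actor update: policy gradient scaled by the advantage (TD error)
    W_actor = W_actor + lr_actor * td_error * np.outer(state, grad_logits)
    return W_actor, w_critic, td_error

# Example: one update from a single transition
state = np.array([0.1, -0.2, 0.3, 0.0])
next_state = np.array([0.0, 0.1, -0.1, 0.2])
W_actor = np.zeros((4, 2))
w_critic = np.zeros(4)
W_actor, w_critic, delta = actor_critic_step(state, 1, 1.0, next_state, False,
                                             W_actor, w_critic)
print(f"TD error: {delta:.3f}")
```

Unlike REINFORCE, this updates after every transition rather than waiting for the episode to end, because the critic supplies an estimate of future return.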
PPO - Proximal Policy Optimization
PPO constrains each policy update by clipping the probability ratio between the new and old policies, preventing destructively large steps while remaining simple to implement.
```python
import numpy as np

def ppo_clipped_objective(old_probs, new_probs, advantages, epsilon=0.2):
    """
    Compute the PPO clipped surrogate objective.

    Args:
        old_probs: Action probabilities under the old policy (batch,)
        new_probs: Action probabilities under the new policy (batch,)
        advantages: Advantage estimates (batch,)
        epsilon: Clipping parameter (default 0.2)

    Returns:
        Clipped objective value (scalar)
    """
    # Probability ratio
    ratio = new_probs / (old_probs + 1e-8)

    # Two surrogate objectives
    surr1 = ratio * advantages
    surr2 = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages

    # PPO takes the minimum (pessimistic bound)
    objective = np.mean(np.minimum(surr1, surr2))
    return objective

# Example
batch_size = 8
old_probs = np.random.uniform(0.3, 0.7, batch_size)
new_probs = old_probs + np.random.uniform(-0.1, 0.1, batch_size)
new_probs = np.clip(new_probs, 0.01, 0.99)
advantages = np.random.randn(batch_size)

obj = ppo_clipped_objective(old_probs, new_probs, advantages)
print(f"PPO objective: {obj:.4f}")
print(f"Ratios: {new_probs / old_probs}")
```

Comparison of Policy Gradient Methods
| Algorithm | Baseline | On/Off Policy | Key Feature |
|---|---|---|---|
| REINFORCE | None (or simple) | On-policy | Simplest, high variance |
| A2C | Learned V(s) | On-policy | Lower variance via critic |
| A3C | Learned V(s) | On-policy | Parallel actors for speed |
| PPO | Learned V(s) | On-policy | Clipped updates, very stable |
| SAC | Soft Q(s, a) | Off-policy | Entropy bonus, continuous actions |
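The entropy bonus noted for SAC can be illustrated in a few lines: adding the policy's entropy to the objective rewards keeping the action distribution spread out, which sustains exploration. A small sketch (the coefficient `alpha` and the example distributions are illustrative values):

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -np.sum(probs * np.log(probs + 1e-8))

uniform = np.array([0.5, 0.5])    # maximally exploratory policy
peaked = np.array([0.99, 0.01])   # nearly deterministic policy

alpha = 0.2  # entropy coefficient (illustrative)
print(f"Entropy (uniform): {policy_entropy(uniform):.3f}")
print(f"Entropy (peaked):  {policy_entropy(peaked):.3f}")
print(f"Bonus difference:  {alpha * (policy_entropy(uniform) - policy_entropy(peaked)):.3f}")
```

An entropy-regularized objective adds `alpha * entropy` to the reward term, so the uniform policy receives the larger bonus and premature collapse to a deterministic policy is discouraged.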