Policy Gradient Methods

REINFORCE, Actor-Critic, and Proximal Policy Optimization


While Q-learning learns a value function and derives a policy from it, policy gradient methods learn the policy directly. The policy is parameterized (e.g., a neural network) and optimized using gradient ascent on expected reward.

Why Policy Gradients?

| Feature | Value-Based (DQN) | Policy-Based |
| --- | --- | --- |
| Action space | Discrete only | Discrete or continuous |
| Policy type | Deterministic (argmax) | Stochastic (probability distribution) |
| Convergence | Can oscillate | Smoother convergence |
| Exploration | Needs epsilon-greedy | Built-in via stochastic policy |
Policy gradients excel in environments with continuous action spaces (robotics, control) and where you want a stochastic policy (mixed strategies in games).
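
To make the continuous-action point concrete, here is a minimal sketch of a Gaussian policy: the policy outputs the mean of a distribution and actions are sampled from it. All names here (GaussianPolicy, the fixed std) are illustrative, not from this lesson's code.

```python
import numpy as np

class GaussianPolicy:
    """Linear Gaussian policy for continuous actions (illustrative sketch)."""

    def __init__(self, state_dim, action_dim, std=0.5):
        self.W = np.random.randn(state_dim, action_dim) * 0.01  # linear mean
        self.std = std  # fixed standard deviation, for simplicity

    def select_action(self, state):
        mean = state @ self.W
        action = np.random.normal(mean, self.std)  # sample a continuous action
        # Log density of the sampled action under the diagonal Gaussian
        log_prob = (-0.5 * np.sum(((action - mean) / self.std) ** 2)
                    - action.size * np.log(self.std * np.sqrt(2 * np.pi)))
        return action, log_prob

policy = GaussianPolicy(state_dim=4, action_dim=2)
state = np.random.randn(4)
action, log_prob = policy.select_action(state)
print(f"action: {action}, log_prob: {log_prob:.3f}")
```

The same score-function gradient applies: increase the log probability of sampled actions in proportion to their returns; only the form of `log pi` changes.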

The REINFORCE Algorithm

REINFORCE (Monte Carlo Policy Gradient) is the simplest policy gradient algorithm.

Key idea: Increase the probability of actions that led to high returns, decrease the probability of actions that led to low returns.

Policy gradient theorem:

nabla J(theta) = E[sum_t (nabla log pi(a_t|s_t; theta)) * G_t]

Where:

  • pi(a|s; theta) is the policy (probability of action a in state s)
  • G_t is the return (sum of discounted rewards from time t)
  • nabla log pi is the score function (direction to increase action probability)

The gradient says: adjust parameters in the direction that makes high-return actions more likely.

    python
    import numpy as np

    class REINFORCEAgent:
        """
        REINFORCE (Monte Carlo Policy Gradient) with a simple linear policy.
        For discrete actions, the policy outputs a softmax distribution.
        """

        def __init__(self, state_dim, action_dim, lr=0.01, gamma=0.99):
            self.lr = lr
            self.gamma = gamma
            self.action_dim = action_dim

            # Linear policy: softmax(state @ W + b)
            self.W = np.random.randn(state_dim, action_dim) * 0.01
            self.b = np.zeros(action_dim)

            # Episode storage
            self.saved_log_probs = []
            self.rewards = []

        def policy(self, state):
            """Compute action probabilities using softmax."""
            logits = state @ self.W + self.b
            # Numerically stable softmax
            logits -= np.max(logits)
            exp_logits = np.exp(logits)
            return exp_logits / np.sum(exp_logits)

        def select_action(self, state):
            """Sample an action from the policy distribution."""
            probs = self.policy(state)
            action = np.random.choice(self.action_dim, p=probs)
            # Store state/action (and the log prob, for logging) for the update
            self.saved_log_probs.append((state, action, np.log(probs[action] + 1e-8)))
            return action

        def compute_returns(self):
            """Compute discounted returns G_t for each timestep."""
            returns = []
            G = 0.0
            for r in reversed(self.rewards):
                G = r + self.gamma * G
                returns.insert(0, G)
            returns = np.array(returns)
            # Normalize returns (variance reduction)
            if len(returns) > 1:
                returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            return returns

        def update(self):
            """Update the policy using the REINFORCE gradient."""
            returns = self.compute_returns()

            for (state, action, _log_prob), G in zip(self.saved_log_probs, returns):
                # Recompute probabilities (a simplification: strictly, every
                # gradient should use the policy that generated the episode)
                probs = self.policy(state)
                # Gradient of log softmax w.r.t. the logits: one_hot - probs
                one_hot = np.zeros(self.action_dim)
                one_hot[action] = 1
                grad_logits = one_hot - probs

                # Gradient ascent: theta += lr * G * grad(log pi)
                self.W += self.lr * G * np.outer(state, grad_logits)
                self.b += self.lr * G * grad_logits

            # Clear episode data
            self.saved_log_probs = []
            self.rewards = []

    # Demo
    agent = REINFORCEAgent(state_dim=4, action_dim=2)
    state = np.random.randn(4)
    probs = agent.policy(state)
    print(f"Action probabilities: {probs}")
    print(f"Selected action: {agent.select_action(state)}")

Advantage Estimation

Raw returns G_t have high variance. The advantage function reduces variance by subtracting a baseline:

A(s, a) = Q(s, a) - V(s)

The advantage tells us: "How much better is this action compared to the average?" A positive advantage means the action was better than expected; negative means worse.
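
As a concrete sketch, the one-step TD form of the advantage, A_t = r_t + gamma * V(s_{t+1}) - V(s_t), can be computed from a rollout like this (the value estimates below are made up, not learned):

```python
import numpy as np

def td_advantages(rewards, values, next_values, dones, gamma=0.99):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `dones` masks the bootstrap term at episode boundaries."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    return rewards + gamma * next_values * not_done - values

# Toy rollout: the critic thinks every state is worth ~1.0;
# the last transition ends the episode, so there is no bootstrap
rewards     = [1.0, 0.0, 2.0]
values      = [1.0, 1.0, 1.0]
next_values = [1.0, 1.0, 0.0]
dones       = [0,   0,   1]
adv = td_advantages(rewards, values, next_values, dones)
print(adv)  # [0.99, -0.01, 1.0]
```

The first and last steps did better than the critic expected (positive advantage); the middle step did slightly worse.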

Actor-Critic Methods

Actor-Critic combines policy gradients with value function learning:

  • Actor: the policy network pi(a|s; theta) that selects actions
  • Critic: the value network V(s; w) that estimates state values

The critic's value estimate serves as the baseline: the TD error, r + gamma * V(s'; w) - V(s; w), is a low-variance estimate of the advantage that replaces the raw return G_t in the policy gradient.

Common variants:

  • A2C (Advantage Actor-Critic): synchronous updates, typically over a batch of parallel environments
  • A3C (Asynchronous Advantage Actor-Critic): asynchronous parallel actors for faster training
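
Putting actor and critic together, a single TD(0) update step might look like the following sketch (linear actor and critic; all names are illustrative, not a library API):

```python
import numpy as np

def actor_critic_step(state, action, reward, next_state, done,
                      W_actor, w_critic, gamma=0.99,
                      lr_actor=0.01, lr_critic=0.05):
    """One TD(0) actor-critic update for a linear softmax actor
    and a linear critic V(s) = s @ w_critic (updates in place)."""
    # Critic: the TD error serves as the advantage estimate
    v = state @ w_critic
    v_next = 0.0 if done else next_state @ w_critic
    td_error = reward + gamma * v_next - v

    # Actor: policy gradient step, weighted by the TD error
    logits = state @ W_actor
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    one_hot = np.zeros(len(probs))
    one_hot[action] = 1.0
    W_actor += lr_actor * td_error * np.outer(state, one_hot - probs)

    # Critic: move V(s) toward the TD target
    w_critic += lr_critic * td_error * state
    return td_error

state, next_state = np.random.randn(4), np.random.randn(4)
W_actor, w_critic = np.zeros((4, 2)), np.zeros(4)
delta = actor_critic_step(state, 0, 1.0, next_state, False, W_actor, w_critic)
print(f"TD error: {delta:.3f}")
```

With untrained (zero) weights, V(s) = V(s') = 0, so the TD error is just the reward; as the critic improves, the TD error shrinks toward the true advantage.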
PPO: Proximal Policy Optimization

PPO is the most widely used policy gradient algorithm (used to train ChatGPT via RLHF). It prevents large, destructive policy updates by clipping the probability ratio:

L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)

where ratio = pi_new(a|s) / pi_old(a|s). This keeps the new policy close to the old policy, ensuring stable training.
    python
    import numpy as np

    def ppo_clipped_objective(old_probs, new_probs, advantages, epsilon=0.2):
        """
        Compute the PPO clipped surrogate objective.

        Args:
            old_probs: Probabilities of the taken actions under the old policy (batch,)
            new_probs: Probabilities of the taken actions under the new policy (batch,)
            advantages: Advantage estimates (batch,)
            epsilon: Clipping parameter (default 0.2)

        Returns:
            Clipped objective value (scalar)
        """
        # Probability ratio
        ratio = new_probs / (old_probs + 1e-8)

        # Two surrogate objectives
        surr1 = ratio * advantages
        surr2 = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages

        # PPO takes the minimum (pessimistic bound)
        return np.mean(np.minimum(surr1, surr2))

    # Example
    batch_size = 8
    old_probs = np.random.uniform(0.3, 0.7, batch_size)
    new_probs = np.clip(old_probs + np.random.uniform(-0.1, 0.1, batch_size), 0.01, 0.99)
    advantages = np.random.randn(batch_size)

    obj = ppo_clipped_objective(old_probs, new_probs, advantages)
    print(f"PPO objective: {obj:.4f}")
    print(f"Ratios: {new_probs / old_probs}")
Comparison of Policy Gradient Methods

| Algorithm | Baseline | On/Off Policy | Key Feature |
| --- | --- | --- | --- |
| REINFORCE | None (or simple) | On-policy | Simplest, high variance |
| A2C | Learned V(s) | On-policy | Lower variance via critic |
| A3C | Learned V(s) | On-policy | Parallel actors for speed |
| PPO | Learned V(s) | On-policy | Clipped updates, very stable |
| SAC | Learned V(s) | Off-policy | Entropy bonus, continuous actions |

PPO is the default choice for most RL applications today due to its simplicity, stability, and strong performance.