Policy Gradient Methods

REINFORCE, Actor-Critic, and Proximal Policy Optimization


While Q-learning learns a value function and derives a policy from it, policy gradient methods learn the policy directly. The policy is parameterized (e.g., a neural network) and optimized using gradient ascent on expected reward.

Why Policy Gradients?

| Feature | Value-Based (DQN) | Policy-Based |
| --- | --- | --- |
| Action space | Discrete only | Discrete or continuous |
| Policy type | Deterministic (argmax) | Stochastic (probability distribution) |
| Convergence | Can oscillate | Smoother convergence |
| Exploration | Needs epsilon-greedy | Built-in via stochastic policy |
Policy gradients excel in environments with continuous action spaces (robotics, control) and where you want a stochastic policy (mixed strategies in games).
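
To make the continuous-action point concrete, here is a minimal sketch of a Gaussian policy: the policy outputs the mean of a distribution and actions are sampled from it. All names here (GaussianPolicy, the fixed std) are illustrative, not from this lesson's code.

```python
import numpy as np

class GaussianPolicy:
    """Linear Gaussian policy for continuous actions (illustrative sketch)."""

    def __init__(self, state_dim, action_dim, std=0.5):
        self.W = np.random.randn(state_dim, action_dim) * 0.01  # linear mean
        self.std = std  # fixed standard deviation, for simplicity

    def select_action(self, state):
        mean = state @ self.W
        action = np.random.normal(mean, self.std)  # sample a continuous action
        # Log density of the sampled action under the diagonal Gaussian
        log_prob = (-0.5 * np.sum(((action - mean) / self.std) ** 2)
                    - action.size * np.log(self.std * np.sqrt(2 * np.pi)))
        return action, log_prob

policy = GaussianPolicy(state_dim=4, action_dim=2)
state = np.random.randn(4)
action, log_prob = policy.select_action(state)
print(f"action: {action}, log_prob: {log_prob:.3f}")
```

The same score-function gradient applies: increase the log probability of sampled actions in proportion to their returns; only the form of `log pi` changes.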

The REINFORCE Algorithm

REINFORCE (Monte Carlo Policy Gradient) is the simplest policy gradient algorithm.

Key idea: Increase the probability of actions that led to high returns, decrease the probability of actions that led to low returns.

Policy gradient theorem:

nabla J(theta) = E[sum_t (nabla log pi(a_t|s_t; theta)) * G_t]

Where:

  • pi(a|s; theta) is the policy (probability of action a in state s)
  • G_t is the return (sum of discounted rewards from time t)
  • nabla log pi is the score function (direction to increase action probability)

The gradient says: adjust parameters in the direction that makes high-return actions more likely.

    python
    import numpy as np

    class REINFORCEAgent:
        """
        REINFORCE (Monte Carlo Policy Gradient) with a simple linear policy.
        For discrete actions, the policy outputs a softmax distribution.
        """

        def __init__(self, state_dim, action_dim, lr=0.01, gamma=0.99):
            self.lr = lr
            self.gamma = gamma
            self.action_dim = action_dim

            # Linear policy: softmax(state @ W + b)
            self.W = np.random.randn(state_dim, action_dim) * 0.01
            self.b = np.zeros(action_dim)

            # Episode storage
            self.saved_log_probs = []
            self.rewards = []

        def policy(self, state):
            """Compute action probabilities using softmax."""
            logits = state @ self.W + self.b
            # Numerically stable softmax
            logits -= np.max(logits)
            exp_logits = np.exp(logits)
            return exp_logits / np.sum(exp_logits)

        def select_action(self, state):
            """Sample an action from the policy distribution."""
            probs = self.policy(state)
            action = np.random.choice(self.action_dim, p=probs)
            # Store state/action (and the log prob, for logging) for the update
            self.saved_log_probs.append((state, action, np.log(probs[action] + 1e-8)))
            return action

        def compute_returns(self):
            """Compute discounted returns G_t for each timestep."""
            returns = []
            G = 0.0
            for r in reversed(self.rewards):
                G = r + self.gamma * G
                returns.insert(0, G)
            returns = np.array(returns)
            # Normalize returns (variance reduction)
            if len(returns) > 1:
                returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            return returns

        def update(self):
            """Update the policy using the REINFORCE gradient."""
            returns = self.compute_returns()

            for (state, action, _log_prob), G in zip(self.saved_log_probs, returns):
                # Recompute probabilities (a simplification: strictly, every
                # gradient should use the policy that generated the episode)
                probs = self.policy(state)
                # Gradient of log softmax w.r.t. the logits: one_hot - probs
                one_hot = np.zeros(self.action_dim)
                one_hot[action] = 1
                grad_logits = one_hot - probs

                # Gradient ascent: theta += lr * G * grad(log pi)
                self.W += self.lr * G * np.outer(state, grad_logits)
                self.b += self.lr * G * grad_logits

            # Clear episode data
            self.saved_log_probs = []
            self.rewards = []

    # Demo
    agent = REINFORCEAgent(state_dim=4, action_dim=2)
    state = np.random.randn(4)
    probs = agent.policy(state)
    print(f"Action probabilities: {probs}")
    print(f"Selected action: {agent.select_action(state)}")

Advantage Estimation

Raw returns G_t have high variance. The advantage function reduces variance by subtracting a baseline:

A(s, a) = Q(s, a) - V(s)

The advantage tells us: "How much better is this action compared to the average?" A positive advantage means the action was better than expected; negative means worse.
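
As a concrete sketch, the one-step TD form of the advantage, A_t = r_t + gamma * V(s_{t+1}) - V(s_t), can be computed from a rollout like this (the value estimates below are made up, not learned):

```python
import numpy as np

def td_advantages(rewards, values, next_values, dones, gamma=0.99):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `dones` masks the bootstrap term at episode boundaries."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    return rewards + gamma * next_values * not_done - values

# Toy rollout: the critic thinks every state is worth ~1.0;
# the last transition ends the episode, so there is no bootstrap
rewards     = [1.0, 0.0, 2.0]
values      = [1.0, 1.0, 1.0]
next_values = [1.0, 1.0, 0.0]
dones       = [0,   0,   1]
adv = td_advantages(rewards, values, next_values, dones)
print(adv)  # [0.99, -0.01, 1.0]
```

The first and last steps did better than the critic expected (positive advantage); the middle step did slightly worse.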

Actor-Critic Methods

Actor-Critic combines policy gradients with value function learning:

  • Actor: the policy network pi(a|s; theta) that selects actions
  • Critic: the value network V(s; w) that estimates state values

The critic's value estimate serves as the baseline: the TD error, r + gamma * V(s'; w) - V(s; w), is a low-variance estimate of the advantage that replaces the raw return G_t in the policy gradient.

Common variants:

  • A2C (Advantage Actor-Critic): synchronous updates, typically over a batch of parallel environments
  • A3C (Asynchronous Advantage Actor-Critic): asynchronous parallel actors for faster training
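
Putting actor and critic together, a single TD(0) update step might look like the following sketch (linear actor and critic; all names are illustrative, not a library API):

```python
import numpy as np

def actor_critic_step(state, action, reward, next_state, done,
                      W_actor, w_critic, gamma=0.99,
                      lr_actor=0.01, lr_critic=0.05):
    """One TD(0) actor-critic update for a linear softmax actor
    and a linear critic V(s) = s @ w_critic (updates in place)."""
    # Critic: the TD error serves as the advantage estimate
    v = state @ w_critic
    v_next = 0.0 if done else next_state @ w_critic
    td_error = reward + gamma * v_next - v

    # Actor: policy gradient step, weighted by the TD error
    logits = state @ W_actor
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    one_hot = np.zeros(len(probs))
    one_hot[action] = 1.0
    W_actor += lr_actor * td_error * np.outer(state, one_hot - probs)

    # Critic: move V(s) toward the TD target
    w_critic += lr_critic * td_error * state
    return td_error

state, next_state = np.random.randn(4), np.random.randn(4)
W_actor, w_critic = np.zeros((4, 2)), np.zeros(4)
delta = actor_critic_step(state, 0, 1.0, next_state, False, W_actor, w_critic)
print(f"TD error: {delta:.3f}")
```

With untrained (zero) weights, V(s) = V(s') = 0, so the TD error is just the reward; as the critic improves, the TD error shrinks toward the true advantage.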
PPO: Proximal Policy Optimization

PPO is the most widely used policy gradient algorithm (used to train ChatGPT via RLHF). It prevents large, destructive policy updates by clipping the probability ratio:

L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)

where ratio = pi_new(a|s) / pi_old(a|s). This keeps the new policy close to the old policy, ensuring stable training.
    python
    import numpy as np

    def ppo_clipped_objective(old_probs, new_probs, advantages, epsilon=0.2):
        """
        Compute the PPO clipped surrogate objective.

        Args:
            old_probs: Probabilities of the taken actions under the old policy (batch,)
            new_probs: Probabilities of the taken actions under the new policy (batch,)
            advantages: Advantage estimates (batch,)
            epsilon: Clipping parameter (default 0.2)

        Returns:
            Clipped objective value (scalar)
        """
        # Probability ratio
        ratio = new_probs / (old_probs + 1e-8)

        # Two surrogate objectives
        surr1 = ratio * advantages
        surr2 = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages

        # PPO takes the minimum (pessimistic bound)
        return np.mean(np.minimum(surr1, surr2))

    # Example
    batch_size = 8
    old_probs = np.random.uniform(0.3, 0.7, batch_size)
    new_probs = np.clip(old_probs + np.random.uniform(-0.1, 0.1, batch_size), 0.01, 0.99)
    advantages = np.random.randn(batch_size)

    obj = ppo_clipped_objective(old_probs, new_probs, advantages)
    print(f"PPO objective: {obj:.4f}")
    print(f"Ratios: {new_probs / old_probs}")
Comparison of Policy Gradient Methods

| Algorithm | Baseline | On/Off Policy | Key Feature |
| --- | --- | --- | --- |
| REINFORCE | None (or simple) | On-policy | Simplest, high variance |
| A2C | Learned V(s) | On-policy | Lower variance via critic |
| A3C | Learned V(s) | On-policy | Parallel actors for speed |
| PPO | Learned V(s) | On-policy | Clipped updates, very stable |
| SAC | Learned V(s) | Off-policy | Entropy bonus, continuous actions |

PPO is the default choice for most RL applications today due to its simplicity, stability, and strong performance.