
RL Applications & Tools

Gymnasium, Stable Baselines3, RLHF, robotics, and game AI



Now that you understand RL algorithms, let's explore the ecosystem and real-world applications.

OpenAI Gym / Gymnasium

Gymnasium (formerly OpenAI Gym) is the standard API for RL environments. Every environment follows the same interface:

import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()  # Random action
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    if done:
        state, info = env.reset()
    else:
        state = next_state

env.close()

Popular Environments

Environment     | Type            | State           | Actions
----------------|-----------------|-----------------|---------------
CartPole-v1     | Classic Control | 4D continuous   | 2 discrete
LunarLander-v2  | Box2D           | 8D continuous   | 4 discrete
Pendulum-v1     | Classic Control | 3D continuous   | 1D continuous
MountainCar-v0  | Classic Control | 2D continuous   | 3 discrete
Ant-v4          | MuJoCo          | 111D continuous | 8D continuous

Stable Baselines3

Stable Baselines3 (SB3) provides reliable, well-tested implementations of popular RL algorithms. It's the go-to library for applying RL without implementing algorithms from scratch.

Supported Algorithms

  • PPO: General-purpose, works well for most problems
  • A2C: Synchronous advantage actor-critic
  • DQN: For discrete action spaces
  • SAC: Soft Actor-Critic for continuous actions
  • TD3: Twin Delayed DDPG for continuous actions
  • HER: Hindsight Experience Replay for goal-conditioned RL
from stable_baselines3 import PPO
import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1")

# Create and train agent (just two lines!)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

# Evaluate with a deterministic policy
obs, _ = env.reset()
total_reward = 0.0
done = False

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Total reward: {total_reward}")

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

Key SB3 features demonstrated:

  • Easy training: model.learn(total_timesteps=50_000)
  • Built-in logging: verbose=1 shows training progress
  • Deterministic evaluation: model.predict(obs, deterministic=True)
  • Save/load: model.save() and PPO.load()
  • Custom policies: PPO("MlpPolicy", ...) or custom network architectures
  • Callbacks: EvalCallback and CheckpointCallback for monitoring

    RLHF - Reinforcement Learning from Human Feedback

    RLHF is how ChatGPT, Claude, and other LLMs are aligned with human preferences. The process: 1) Pre-train a language model on text. 2) Collect human comparisons of model outputs. 3) Train a reward model on these comparisons. 4) Fine-tune the LLM using PPO to maximize the reward model's score, with a KL penalty to stay close to the original model.

    RLHF for LLMs

    The RLHF pipeline for training language models:

    Step 1: Supervised Fine-Tuning (SFT)

  • Fine-tune the base LLM on high-quality instruction-response pairs

    Step 2: Reward Model Training

  • Generate multiple responses to each prompt
  • Humans rank responses (e.g., response A > response B)
  • Train a reward model to predict these preferences

    Step 3: PPO Fine-Tuning

  • The LLM generates responses (the "policy")
  • The reward model scores them (the "reward")
  • PPO updates the LLM to generate higher-scoring responses
  • A KL divergence penalty keeps outputs close to the SFT model

    This is why PPO knowledge is directly relevant to understanding how modern AI assistants are trained!
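The two quantities at the heart of this pipeline can be sketched in a few lines of plain Python: the pairwise (Bradley-Terry-style) loss commonly used for reward model training, and the KL-penalized reward PPO maximizes. This is a minimal illustration, not any particular library's implementation; the function names and the beta value are made up for the example.

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model scores the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def rlhf_reward(rm_score, logprob_policy, logprob_sft, beta=0.1):
    """Reward used during PPO fine-tuning: reward-model score minus a
    KL-style penalty for drifting away from the SFT model."""
    return rm_score - beta * (logprob_policy - logprob_sft)

# Ranking the preferred response higher gives a small loss, and vice versa:
print(round(reward_model_loss(2.0, -1.0), 4))  # small loss
print(round(reward_model_loss(-1.0, 2.0), 4))  # large loss

# The KL penalty shrinks the reward when the policy drifts from the SFT model:
print(rlhf_reward(1.0, -2.0, -2.0))  # no drift: full reward-model score
print(rlhf_reward(1.0, -1.0, -3.0))  # drifted: penalized
```

Training the reward model drives the first loss down over many human comparisons; PPO then pushes the policy toward responses with a high penalized reward.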

    Robotics

    RL is increasingly used in robotics for tasks that are hard to program explicitly:

  • Locomotion: Quadruped and humanoid robots learning to walk, run, and recover from pushes
  • Manipulation: Robot arms learning to grasp, assemble, and manipulate objects
  • Sim-to-Real: Train in simulation (fast, cheap, safe), then transfer to the real robot

    Key challenge: Real-world training is slow and dangerous. Sim-to-real transfer requires domain randomization to bridge the gap between simulation and reality.
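The core idea of domain randomization is simple enough to sketch: at every episode reset, sample the simulator's physics parameters from ranges that (hopefully) bracket the real robot's values. The parameter names and ranges below are illustrative, not from any specific simulator.

```python
import random

def randomize_dynamics(rng=random):
    """Sample simulator physics parameters for one episode.
    A policy trained across many such sampled 'worlds' tends to be robust
    to the real robot's unknown parameters (illustrative ranges)."""
    return {
        "mass_scale":    rng.uniform(0.8, 1.2),    # +/-20% on link masses
        "friction":      rng.uniform(0.5, 1.5),    # ground friction coefficient
        "motor_latency": rng.uniform(0.00, 0.03),  # seconds of actuation delay
        "sensor_noise":  rng.uniform(0.0, 0.02),   # observation noise std
    }

# Each simulated episode sees a slightly different world:
for episode in range(3):
    params = randomize_dynamics()
    print({k: round(v, 3) for k, v in params.items()})
```

In practice these samples would be applied to the simulator at `env.reset()`, so no two training episodes share exactly the same dynamics.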
    Game AI

    RL has achieved superhuman performance in many games:

  • AlphaGo/AlphaZero: Mastered Go, chess, and shogi from self-play
  • OpenAI Five: Defeated world champions in Dota 2
  • AlphaStar: Grandmaster-level StarCraft II play
  • Atari: DQN first demonstrated game-playing from raw pixels
    Multi-Agent RL

    When multiple agents interact in the same environment:

  • Cooperative: Agents work together (team sports, multi-robot coordination)
  • Competitive: Agents compete (board games, economic simulations)
  • Mixed: Some cooperation, some competition (real-world markets)
    Challenges: Non-stationarity (other agents change as they learn), credit assignment, communication
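Non-stationarity is easy to see in a toy example (not from the lesson): two independent Q-learners playing matching pennies. Each agent treats the other as part of the environment, but that "environment" keeps changing as the other agent learns, so neither faces a fixed learning target.

```python
import random

def payoff(a0, a1):
    """Matching pennies: player 0 wins on a match, player 1 on a mismatch."""
    return (1.0, -1.0) if a0 == a1 else (-1.0, 1.0)

q = [[0.0, 0.0], [0.0, 0.0]]  # q[player][action], one-step game
alpha, eps = 0.1, 0.2
random.seed(0)

for step in range(5000):
    # Epsilon-greedy action for each independent learner
    acts = [random.randrange(2) if random.random() < eps
            else max(range(2), key=lambda a: q[p][a])
            for p in range(2)]
    r = payoff(acts[0], acts[1])
    # Each player updates toward its own immediate reward
    for p in range(2):
        q[p][acts[p]] += alpha * (r[p] - q[p][acts[p]])

# Neither player can lock in a best response; in the mixed equilibrium
# both actions end up with similar values for both players.
print([round(v, 2) for v in q[0]], [round(v, 2) for v in q[1]])
```

Whenever one player leans toward an action, the other's best response flips, which is exactly the non-stationarity that makes multi-agent RL hard.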
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class SimpleTrading(gym.Env):
    """Custom trading environment."""

    metadata = {"render_modes": ["human"]}

    def __init__(self, prices, initial_balance=10000):
        super().__init__()
        self.prices = prices
        self.initial_balance = initial_balance

        # Action: 0=hold, 1=buy, 2=sell
        self.action_space = spaces.Discrete(3)

        # Observation: [balance, shares_held, current_price, price_change]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.balance = self.initial_balance
        self.shares = 0
        self.current_step = 0
        self.prev_portfolio = self.initial_balance
        return self._get_obs(), {}

    def _get_obs(self):
        price = self.prices[self.current_step]
        change = 0 if self.current_step == 0 else (
            price - self.prices[self.current_step - 1]
        ) / self.prices[self.current_step - 1]
        return np.array([
            self.balance, self.shares, price, change
        ], dtype=np.float32)

    def step(self, action):
        price = self.prices[self.current_step]

        if action == 1 and self.balance >= price:  # Buy
            self.shares += 1
            self.balance -= price
        elif action == 2 and self.shares > 0:  # Sell
            self.shares -= 1
            self.balance += price

        self.current_step += 1
        terminated = self.current_step >= len(self.prices) - 1
        truncated = False

        # Reward: per-step change in portfolio value
        portfolio = self.balance + self.shares * self.prices[self.current_step]
        reward = portfolio - self.prev_portfolio
        self.prev_portfolio = portfolio

        return self._get_obs(), reward, terminated, truncated, {}

# Usage with SB3:
# env = SimpleTrading(prices=stock_data)
# model = PPO("MlpPolicy", env, verbose=1)
# model.learn(total_timesteps=100_000)

    Key components of any custom environment: __init__, reset, step, observation_space, and action_space.