
RL Applications & Tools

Gymnasium, Stable Baselines3, RLHF, robotics, and game AI



Now that you understand RL algorithms, let's explore the ecosystem and real-world applications.

OpenAI Gym / Gymnasium

Gymnasium (formerly OpenAI Gym) is the standard API for RL environments. Every environment follows the same interface:

import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()  # Random action
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    if done:
        state, info = env.reset()
    else:
        state = next_state

env.close()

Popular Environments

Environment     | Type            | State           | Actions
----------------|-----------------|-----------------|---------------
CartPole-v1     | Classic Control | 4D continuous   | 2 discrete
LunarLander-v2  | Box2D           | 8D continuous   | 4 discrete
Pendulum-v1     | Classic Control | 3D continuous   | 1D continuous
MountainCar-v0  | Classic Control | 2D continuous   | 3 discrete
Ant-v4          | MuJoCo          | 111D continuous | 8D continuous

Stable Baselines3

Stable Baselines3 (SB3) provides reliable, well-tested implementations of popular RL algorithms. It's the go-to library for applying RL without implementing algorithms from scratch.

Supported Algorithms

  • PPO: General-purpose, works well for most problems
  • A2C: Synchronous advantage actor-critic
  • DQN: For discrete action spaces
  • SAC: Soft Actor-Critic for continuous actions
  • TD3: Twin Delayed DDPG for continuous actions
  • HER: Hindsight Experience Replay for goal-conditioned RL
from stable_baselines3 import PPO
import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1")

# Create and train agent (just two lines!)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

# Evaluate with a deterministic policy
obs, _ = env.reset()
total_reward = 0.0
done = False

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Total reward: {total_reward}")

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

Key SB3 features demonstrated:

  • Easy training: model.learn(total_timesteps=50_000)
  • Built-in logging: verbose=1 shows training progress
  • Deterministic evaluation: model.predict(obs, deterministic=True)
  • Save/load: model.save() and PPO.load()
  • Custom policies: PPO("MlpPolicy", ...) or custom network architectures
  • Callbacks: EvalCallback and CheckpointCallback for monitoring

    RLHF - Reinforcement Learning from Human Feedback

    RLHF is how ChatGPT, Claude, and other LLMs are aligned with human preferences. The process: 1) Pre-train a language model on text. 2) Collect human comparisons of model outputs. 3) Train a reward model on these comparisons. 4) Fine-tune the LLM using PPO to maximize the reward model's score, with a KL penalty to stay close to the original model.

    RLHF for LLMs

    The RLHF pipeline for training language models:

    Step 1: Supervised Fine-Tuning (SFT)

  • Fine-tune the base LLM on high-quality instruction-response pairs

    Step 2: Reward Model Training

  • Generate multiple responses to each prompt
  • Humans rank responses (e.g., response A > response B)
  • Train a reward model to predict these preferences

    Step 3: PPO Fine-Tuning

  • The LLM generates responses (the "policy")
  • The reward model scores them (the "reward")
  • PPO updates the LLM to generate higher-scoring responses
  • A KL divergence penalty keeps outputs close to the SFT model

    This is why PPO knowledge is directly relevant to understanding how modern AI assistants are trained!
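The two quantities at the heart of this pipeline can be sketched in a few lines of plain Python: the pairwise (Bradley-Terry-style) loss commonly used for reward model training, and the KL-penalized reward PPO maximizes. This is a minimal illustration, not any particular library's implementation; the function names and the beta value are made up for the example.

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model scores the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def rlhf_reward(rm_score, logprob_policy, logprob_sft, beta=0.1):
    """Reward used during PPO fine-tuning: reward-model score minus a
    KL-style penalty for drifting away from the SFT model."""
    return rm_score - beta * (logprob_policy - logprob_sft)

# Ranking the preferred response higher gives a small loss, and vice versa:
print(round(reward_model_loss(2.0, -1.0), 4))  # small loss
print(round(reward_model_loss(-1.0, 2.0), 4))  # large loss

# The KL penalty shrinks the reward when the policy drifts from the SFT model:
print(rlhf_reward(1.0, -2.0, -2.0))  # no drift: full reward-model score
print(rlhf_reward(1.0, -1.0, -3.0))  # drifted: penalized
```

Training the reward model drives the first loss down over many human comparisons; PPO then pushes the policy toward responses with a high penalized reward.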

    Robotics

    RL is increasingly used in robotics for tasks that are hard to program explicitly:

  • Locomotion: Quadruped and humanoid robots learning to walk, run, and recover from pushes
  • Manipulation: Robot arms learning to grasp, assemble, and manipulate objects
  • Sim-to-Real: Train in simulation (fast, cheap, safe), then transfer to the real robot

    Key challenge: Real-world training is slow and dangerous. Sim-to-real transfer requires domain randomization to bridge the gap between simulation and reality.
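The core idea of domain randomization is simple enough to sketch: at every episode reset, sample the simulator's physics parameters from ranges that (hopefully) bracket the real robot's values. The parameter names and ranges below are illustrative, not from any specific simulator.

```python
import random

def randomize_dynamics(rng=random):
    """Sample simulator physics parameters for one episode.
    A policy trained across many such sampled 'worlds' tends to be robust
    to the real robot's unknown parameters (illustrative ranges)."""
    return {
        "mass_scale":    rng.uniform(0.8, 1.2),    # +/-20% on link masses
        "friction":      rng.uniform(0.5, 1.5),    # ground friction coefficient
        "motor_latency": rng.uniform(0.00, 0.03),  # seconds of actuation delay
        "sensor_noise":  rng.uniform(0.0, 0.02),   # observation noise std
    }

# Each simulated episode sees a slightly different world:
for episode in range(3):
    params = randomize_dynamics()
    print({k: round(v, 3) for k, v in params.items()})
```

In practice these samples would be applied to the simulator at `env.reset()`, so no two training episodes share exactly the same dynamics.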
    Game AI

    RL has achieved superhuman performance in many games:

  • AlphaGo/AlphaZero: Mastered Go, chess, and shogi from self-play
  • OpenAI Five: Defeated world champions in Dota 2
  • AlphaStar: Grandmaster-level StarCraft II play
  • Atari: DQN first demonstrated game-playing from raw pixels
    Multi-Agent RL

    When multiple agents interact in the same environment:

  • Cooperative: Agents work together (team sports, multi-robot coordination)
  • Competitive: Agents compete (board games, economic simulations)
  • Mixed: Some cooperation, some competition (real-world markets)
    Challenges: Non-stationarity (other agents change as they learn), credit assignment, communication
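Non-stationarity is easy to see in a toy example (not from the lesson): two independent Q-learners playing matching pennies. Each agent treats the other as part of the environment, but that "environment" keeps changing as the other agent learns, so neither faces a fixed learning target.

```python
import random

def payoff(a0, a1):
    """Matching pennies: player 0 wins on a match, player 1 on a mismatch."""
    return (1.0, -1.0) if a0 == a1 else (-1.0, 1.0)

q = [[0.0, 0.0], [0.0, 0.0]]  # q[player][action], one-step game
alpha, eps = 0.1, 0.2
random.seed(0)

for step in range(5000):
    # Epsilon-greedy action for each independent learner
    acts = [random.randrange(2) if random.random() < eps
            else max(range(2), key=lambda a: q[p][a])
            for p in range(2)]
    r = payoff(acts[0], acts[1])
    # Each player updates toward its own immediate reward
    for p in range(2):
        q[p][acts[p]] += alpha * (r[p] - q[p][acts[p]])

# Neither player can lock in a best response; in the mixed equilibrium
# both actions end up with similar values for both players.
print([round(v, 2) for v in q[0]], [round(v, 2) for v in q[1]])
```

Whenever one player leans toward an action, the other's best response flips, which is exactly the non-stationarity that makes multi-agent RL hard.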
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class SimpleTrading(gym.Env):
    """Custom trading environment."""

    metadata = {"render_modes": ["human"]}

    def __init__(self, prices, initial_balance=10000):
        super().__init__()
        self.prices = prices
        self.initial_balance = initial_balance

        # Action: 0=hold, 1=buy, 2=sell
        self.action_space = spaces.Discrete(3)

        # Observation: [balance, shares_held, current_price, price_change]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.balance = self.initial_balance
        self.shares = 0
        self.current_step = 0
        self.prev_portfolio = self.initial_balance
        return self._get_obs(), {}

    def _get_obs(self):
        price = self.prices[self.current_step]
        change = 0 if self.current_step == 0 else (
            price - self.prices[self.current_step - 1]
        ) / self.prices[self.current_step - 1]
        return np.array([
            self.balance, self.shares, price, change
        ], dtype=np.float32)

    def step(self, action):
        price = self.prices[self.current_step]

        if action == 1 and self.balance >= price:  # Buy
            self.shares += 1
            self.balance -= price
        elif action == 2 and self.shares > 0:  # Sell
            self.shares -= 1
            self.balance += price

        self.current_step += 1
        terminated = self.current_step >= len(self.prices) - 1
        truncated = False

        # Reward: per-step change in portfolio value
        portfolio = self.balance + self.shares * self.prices[self.current_step]
        reward = portfolio - self.prev_portfolio
        self.prev_portfolio = portfolio

        return self._get_obs(), reward, terminated, truncated, {}

# Usage with SB3:
# env = SimpleTrading(prices=stock_data)
# model = PPO("MlpPolicy", env, verbose=1)
# model.learn(total_timesteps=100_000)

    Key components of any custom environment: __init__, reset, step, observation_space, and action_space.