Monday, May 04, 2026

Reinforcement Learning: From Zero to Hero

A complete beginner's tutorial on the most exciting field in Artificial Intelligence — covering theory, mathematics, algorithms, code, and the future of the discipline.

1. Introduction: Why Reinforcement Learning Matters Right Now

Imagine teaching a dog a new trick. You do not hand it a textbook on how to sit, roll over, or fetch. Instead, you reward it when it does something right and withhold the reward when it does not. Over time, through thousands of tiny interactions, the dog figures out exactly which behaviors lead to treats and which do not. It learns by doing, by trying, by failing, and by succeeding.

This is, in essence, the philosophy behind Reinforcement Learning (RL), and it is arguably the most natural and powerful form of learning that exists. It is how humans learn to walk, how chess grandmasters develop intuition, and how some of the most astonishing AI systems ever built have achieved superhuman performance.

In 2016, a program called AlphaGo defeated the world champion Go player Lee Sedol in a match watched by over 200 million people. Go is a board game of such staggering complexity that the number of possible board positions exceeds the number of atoms in the observable universe. For decades, experts believed that no computer program could defeat a top human player within their lifetime. AlphaGo did it, and it did so primarily through Reinforcement Learning.

Since then, RL-powered systems have beaten the world's best players at StarCraft II, solved the protein folding problem that stumped biologists for fifty years, trained robots to walk and manipulate objects with dexterity that rivals humans, and — perhaps most visibly — made the large language models you interact with every day far more helpful through a technique called Reinforcement Learning from Human Feedback (RLHF).

The field is not slowing down. It is accelerating. The architect of AlphaGo, David Silver, has left Google DeepMind to found a new company called Ineffable Intelligence, raising over a billion dollars in seed funding with a singular mission: to build a "superlearner" — an AI that discovers all knowledge from its own experience and, in doing so, achieves Artificial General Intelligence. We will return to this extraordinary story in detail.

This tutorial is your complete guide to Reinforcement Learning. We will start from absolute first principles, build up the mathematics carefully and clearly, walk through every major algorithm family with working code, and end by looking at where this discipline is headed.

2. The Big Picture: What Is Reinforcement Learning?

Machine learning, broadly speaking, comes in three flavors. In supervised learning, you provide the algorithm with labeled examples. In unsupervised learning, you provide data without labels and ask the algorithm to find hidden structure. Reinforcement Learning is the third flavor, and it is fundamentally different from both.

In RL, there are no labeled examples and no pre-existing dataset. Instead, there is an agent that lives inside an environment. The agent takes actions, the environment responds by transitioning to a new state and handing the agent a reward signal, and the agent's entire goal is to figure out which sequence of actions leads to the most cumulative reward.

This is a profoundly general framework. The "environment" could be a chess board, a video game, a financial market, a hospital's treatment protocol, a robot's physical surroundings, or even the abstract space of possible responses to a human's question. The "reward" could be winning the game, making a profit, improving a patient's health, or receiving a thumbs-up from a human evaluator.

The key insight that makes RL so powerful is that it does not require a human to specify how to solve a problem — only what success looks like, through the reward signal. The agent then figures out the how entirely on its own, through experience.

This is also what makes RL so challenging. The agent must explore a potentially enormous space of possible behaviors, and the reward signal is often sparse and delayed. A chess program does not know whether its move on turn 10 was good or bad until the game ends 80 moves later. Figuring out which past actions deserve credit for a future reward is called the credit assignment problem, and it is one of the central challenges of the field.

3. The Core Vocabulary: Agents, Environments, States, Actions, Rewards

Every concept in RL builds on a small set of fundamental ideas.

The Agent is the learner and the decision-maker. It observes the world, chooses actions, and receives rewards. In a video game context, the agent is the AI player. In a robotics context, the agent is the robot's control software.

The Environment is everything the agent interacts with. It receives the agent's actions, updates its internal state, and returns an observation and a reward. In a chess game, the environment is the board, the rules, and the opponent.

The State \(s \in \mathcal{S}\) is a description of the current situation. A state must contain all the information the agent needs to make a good decision. When the agent has access to the full state, we call this a fully observable environment. When the agent can only see a partial view, we call it partially observable.

The Action \(a \in \mathcal{A}\) is a choice the agent can make at any given state. Action spaces can be discrete (a finite list of options) or continuous (a real-valued vector, like the exact torque to apply to a robot's joint).

The Reward \(R\) is a scalar signal the environment sends to the agent after each action. Crucially, the agent does not optimize for the immediate reward alone — it optimizes for the sum of all future rewards, which we call the return.

The Policy \(\pi\) is the agent's strategy: a mapping from states to actions. A deterministic policy maps each state to a single action. A stochastic policy maps each state to a probability distribution over actions: \(\pi(a \mid s) = P(A_t = a \mid S_t = s)\).

The Value Function \(V^\pi(s)\) estimates the expected total future reward the agent will accumulate starting from state \(s\), following policy \(\pi\). A high value means the agent can expect a lot of future reward from this state.

The Q-Function \(Q^\pi(s, a)\) tells the agent how good it is to take a specific action \(a\) in a specific state \(s\), then follow policy \(\pi\). It is also called the action-value function.

4. The Mathematics of Reinforcement Learning

Now we arrive at the mathematical heart of the field. We will build up the formalism step by step, and every equation will be explained in plain English.

4.1 The Markov Decision Process (MDP)

The mathematical framework that underlies virtually all of reinforcement learning is the Markov Decision Process (MDP). An MDP is defined by a tuple of five components: \((\mathcal{S},\, \mathcal{A},\, P,\, R,\, \gamma)\).

  • \(\mathcal{S}\) — the state space, the set of all possible states.
  • \(\mathcal{A}\) — the action space, the set of all possible actions.
  • \(P(s' \mid s, a)\) — the transition probability: the probability of reaching state \(s'\) after taking action \(a\) in state \(s\).
  • \(R(s, a, s')\) — the reward function: the expected reward received on that transition.
  • \(\gamma \in [0, 1)\) — the discount factor, controlling how much the agent values future rewards relative to immediate ones.

The Markov property is the key assumption that makes MDPs tractable. It states that the future is conditionally independent of the past given the present:

$$P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots) = P(S_{t+1} \mid S_t, A_t)$$

The agent's interaction with the environment unfolds as a trajectory (also called an episode or rollout):

$$S_0,\; A_0,\; R_1,\; S_1,\; A_1,\; R_2,\; S_2,\; A_2,\; R_3,\; \ldots$$

The return \(G_t\) is the total discounted reward from time step \(t\) onwards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
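
For example, with \(\gamma = 0.9\) and rewards \(R_1 = 1\), \(R_2 = 0\), \(R_3 = 2\) (and zero reward afterwards), the return from the start of the episode is \(G_0 = 1 + 0.9 \cdot 0 + 0.9^2 \cdot 2 = 2.62\).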

Notice that the return has a beautiful recursive structure that is the seed from which the Bellman equation grows:

$$G_t = R_{t+1} + \gamma\, G_{t+1}$$

4.2 The Value Function and the Bellman Equation

The state-value function \(V^\pi(s)\) tells us the expected return starting from state \(s\) and following policy \(\pi\) thereafter:

$$V^\pi(s) = \mathbb{E}_\pi \!\left[ G_t \mid S_t = s \right]$$

By substituting the recursive definition of \(G_t\) and using the linearity of expectation, we derive the Bellman expectation equation for \(V^\pi\):

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a) \left[ r + \gamma\, V^\pi(s') \right]$$

Let us read this equation carefully, because it is one of the most important equations in all of machine learning. The outer sum is over all possible actions, weighted by the probability of taking each action under policy \(\pi\). The inner sum is over all possible next states \(s'\) and rewards \(r\). The term in brackets is the immediate reward \(r\) plus the discounted value of the next state \(\gamma V^\pi(s')\). This is the Bellman equation: a consistency condition that must hold for every state if our value estimates are correct.

4.3 The Q-Function (Action-Value Function)

The action-value function \(Q^\pi(s, a)\) tells us the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\):

$$Q^\pi(s, a) = \mathbb{E}_\pi \!\left[ G_t \mid S_t = s,\; A_t = a \right]$$

The Bellman equation for the Q-function is:

$$Q^\pi(s, a) = \sum_{s',\, r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a') \right]$$

The relationship between \(V\) and \(Q\) is straightforward. The value of a state is the expected Q-value over all actions, weighted by the policy:

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$$

And the Q-value of a state-action pair equals the immediate reward plus the discounted value of the next state:

$$Q^\pi(s, a) = \mathbb{E}\!\left[ R_{t+1} + \gamma\, V^\pi(S_{t+1}) \mid S_t = s,\; A_t = a \right]$$

4.4 The Bellman Optimality Equation

The optimal value function \(V^*(s)\) and optimal Q-function \(Q^*(s,a)\) are defined as:

$$V^*(s) = \max_\pi\, V^\pi(s), \qquad Q^*(s, a) = \max_\pi\, Q^\pi(s, a)$$

The Bellman optimality equation for \(V^*\) replaces the weighted average over actions with a maximization:

$$V^*(s) = \max_{a} \sum_{s',\, r} p(s', r \mid s, a) \left[ r + \gamma\, V^*(s') \right]$$

And for \(Q^*\):

$$Q^*(s, a) = \sum_{s',\, r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} Q^*(s', a') \right]$$

Once we have \(Q^*\), deriving the optimal policy is trivial — in any state \(s\), simply take the action that maximizes \(Q^*\):

$$\pi^*(s) = \arg\max_{a}\; Q^*(s, a)$$

This is the holy grail of reinforcement learning. Q-learning, which we will implement shortly, is an algorithm that directly tries to estimate \(Q^*\) through experience.

4.5 Policy Gradient and the REINFORCE Theorem

Policy gradient methods directly parameterize the policy as a function with parameters \(\theta\) (for example, the weights of a neural network), and optimize those parameters to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \!\left[ R(\tau) \right]$$

where \(\tau\) is a trajectory and \(R(\tau)\) is its total reward. Using the log-derivative trick, the Policy Gradient Theorem gives us:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t \mid S_t) \cdot G_t \right]$$

This remarkable result says: to improve the policy, increase the log-probability of actions that led to high returns, and decrease it for actions that led to low returns — without needing to know the environment's transition probabilities at all.

The advantage function \(A^\pi(s, a)\) is a refinement that reduces the variance of policy gradient estimates:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

The advantage tells us how much better action \(a\) is compared to the average action in state \(s\). Using the advantage instead of the raw return leads to much more stable and efficient learning, and it is the foundation of actor-critic methods.

5. The Exploration vs. Exploitation Dilemma

Exploitation means using what you already know to get the best reward you can right now. Exploration means trying new things to discover whether they might be even better. An agent that only exploits will get stuck in a local optimum. An agent that only explores will never settle on a good strategy.

The most common solution for discrete action spaces is the epsilon-greedy strategy. With probability \(\varepsilon\) the agent takes a random action (exploration); with probability \(1 - \varepsilon\) it takes the action it currently believes is best (exploitation). Over training, \(\varepsilon\) is annealed from a high value (e.g., 1.0) down to a low value (e.g., 0.01).

A more sophisticated approach is Upper Confidence Bound (UCB), which chooses actions based not just on their estimated value but also on how uncertain we are about that estimate:

$$a_t = \arg\max_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln t}{N(s, a)}} \right]$$

where \(N(s,a)\) is the number of times action \(a\) has been selected in state \(s\), and \(c\) is an exploration constant. Actions that have been tried fewer times receive a confidence bonus, encouraging exploration of underexplored options.
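
To make the formula concrete, here is a minimal sketch of UCB action selection for a single state. The function name and array layout are our own; q_values, counts, and c correspond to \(Q(s,\cdot)\), \(N(s,\cdot)\), and the exploration constant in the equation above.

import numpy as np


def ucb_action(q_values: np.ndarray, counts: np.ndarray, t: int, c: float = 2.0) -> int:
    """Upper Confidence Bound action selection for one state.

    q_values[a] -- current value estimate for action a
    counts[a]   -- how many times action a has been tried in this state
    t           -- total number of selections made so far (t >= 1)
    """
    # Try every action at least once before trusting the estimates
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    # Confidence bonus shrinks as an action is tried more often
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))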

For continuous action spaces, exploration is often achieved by adding noise to the actions selected by the policy. In policy gradient methods, exploration is naturally encouraged by maintaining a stochastic policy and explicitly maximizing its entropy \(\mathcal{H}(\pi(\cdot \mid s))\) as a bonus reward.

6. A Taxonomy of Reinforcement Learning Algorithms

The landscape of RL algorithms can be organized along several key dimensions:

  • Model-free vs. Model-based — does the agent build an explicit model of the environment's dynamics?
  • Value-based vs. Policy-based — does the agent learn a value function, a policy directly, or both?
  • On-policy vs. Off-policy — does the agent learn only from data generated by its current policy, or can it reuse old data?

Model-Free RL
├── Value-Based
│   ├── Tabular (small state spaces)
│   │   ├── Dynamic Programming (requires model)
│   │   ├── Monte Carlo Methods
│   │   └── TD Learning → Q-Learning (off-policy), SARSA (on-policy)
│   └── Function Approximation (large/continuous spaces)
│       └── DQN, Double DQN, Dueling DQN, Rainbow
├── Policy-Based
│   └── REINFORCE, TRPO, PPO
└── Actor-Critic
    └── A2C/A3C, DDPG, TD3, SAC

Model-Based RL
└── Dyna-Q, World Models, MuZero, Dreamer

7. Value-Based Methods

7.1 Dynamic Programming

Dynamic Programming (DP) requires complete knowledge of the environment's transition probabilities and reward function, but it is the theoretical foundation from which all other RL algorithms are derived. Value Iteration combines policy evaluation and improvement into a single update applied repeatedly until convergence:

$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s',\, r} p(s', r \mid s, a) \left[ r + \gamma\, V_k(s') \right]$$
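
As a sketch of what this update looks like in code, here is a small value-iteration routine for a tabular MDP. It assumes the dynamics are available as dense arrays P[s, a, s'] and R[s, a]; the array layout and function name are illustrative, not from any particular library.

import numpy as np


def value_iteration(P: np.ndarray, R: np.ndarray, gamma: float = 0.95,
                    theta: float = 1e-8) -> tuple[np.ndarray, np.ndarray]:
    """Value iteration on a tabular MDP.

    P[s, a, s'] -- transition probabilities
    R[s, a]     -- expected immediate reward for taking action a in state s
    Returns the (approximately) optimal value function and a greedy policy.
    """
    num_states, num_actions, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)          # shape: (num_states, num_actions)
        V_new = Q.max(axis=1)            # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)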

7.2 Monte Carlo Methods

Monte Carlo (MC) methods learn from complete episodes of experience. After an episode, for each state \(S_t\) visited, the value estimate is updated using the actual observed return \(G_t\):

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$

where \(\alpha\) is the learning rate. MC methods have low bias (they use actual returns) but high variance (a single episode's return can vary wildly).
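
Here is a minimal sketch of Monte Carlo prediction on FrozenLake, estimating \(V^\pi\) for a fixed policy; the function name and the every-visit variant are our own choices for illustration.

import numpy as np
import gymnasium as gym


def mc_evaluation(policy, num_episodes: int = 5000,
                  alpha: float = 0.05, gamma: float = 0.95) -> np.ndarray:
    """Every-visit Monte Carlo prediction: nudge V(s) toward the observed return G_t.

    `policy` maps a state index to an action index.
    """
    env = gym.make("FrozenLake-v1", is_slippery=False)
    V = np.zeros(env.observation_space.n)
    for _ in range(num_episodes):
        # Roll out one complete episode under the fixed policy
        state, _ = env.reset()
        trajectory, done = [], False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((state, reward))
            state, done = next_state, terminated or truncated
        # Walk backwards through the episode, accumulating discounted returns
        g = 0.0
        for s, r in reversed(trajectory):
            g = r + gamma * g
            V[s] += alpha * (g - V[s])
    env.close()
    return V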

7.3 Temporal Difference Learning

Temporal Difference (TD) learning combines the model-free nature of Monte Carlo with the ability to learn from incomplete episodes. The simplest algorithm, TD(0), updates the value function after every single step using a bootstrapped estimate:

$$V(S_t) \leftarrow V(S_t) + \alpha \underbrace{\left[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \right]}_{\delta_t \;=\; \text{TD error}}$$

The term \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is called the TD error. It measures how surprised the agent was by the outcome of its action. TD learning has lower variance than Monte Carlo but higher bias, and in practice tends to learn faster and more stably.
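
For comparison with the Monte Carlo sketch above, here is a TD(0) version of the same prediction task; again the function name and FrozenLake setup are illustrative choices.

import numpy as np
import gymnasium as gym


def td0_evaluation(policy, num_episodes: int = 5000,
                   alpha: float = 0.1, gamma: float = 0.95) -> np.ndarray:
    """TD(0) prediction: update V(s) after every step from a bootstrapped target.

    `policy` maps a state index to an action index.
    """
    env = gym.make("FrozenLake-v1", is_slippery=False)
    V = np.zeros(env.observation_space.n)
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # TD error: bootstrapped target minus the current estimate
            target = reward + gamma * V[next_state] * (not terminated)
            V[state] += alpha * (target - V[state])
            state = next_state
    env.close()
    return V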

7.4 Q-Learning: The Classic Algorithm

Q-learning is the most famous RL algorithm. It is an off-policy TD algorithm that directly learns \(Q^*\) by applying the Bellman optimality equation as an update rule:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]$$

The key feature is the max operator in the TD target. Instead of using the Q-value of the action actually taken in the next state, it uses the maximum Q-value over all possible next actions — always updating towards the best possible future, regardless of what the agent actually does.

Here is a complete Q-learning implementation for the FrozenLake environment:

import numpy as np
import gymnasium as gym


def run_q_learning(
    num_episodes: int = 10000,
    learning_rate: float = 0.8,
    discount_factor: float = 0.95,
    epsilon_start: float = 1.0,
    epsilon_end: float = 0.01,
    epsilon_decay: float = 0.001,
) -> tuple[np.ndarray, list[float]]:
    """
    Train a Q-learning agent on FrozenLake-v1.

    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    """
    env = gym.make("FrozenLake-v1", is_slippery=False)
    num_states  = env.observation_space.n   # 16 states (4x4 grid)
    num_actions = env.action_space.n        # 4 actions: L, D, R, U

    # Initialize Q-table to zeros — no prior knowledge
    q_table = np.zeros((num_states, num_actions))

    rewards_per_episode = []
    epsilon = epsilon_start

    for episode in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0.0
        done = False

        while not done:
            # ── Epsilon-greedy action selection ──────────────
            if np.random.uniform(0, 1) < epsilon:
                action = env.action_space.sample()   # explore
            else:
                action = np.argmax(q_table[state])   # exploit

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # ── Q-learning update ─────────────────────────────
            td_target = reward + discount_factor * np.max(q_table[next_state])
            td_error  = td_target - q_table[state, action]
            q_table[state, action] += learning_rate * td_error

            state = next_state
            total_reward += reward

        # Decay epsilon: shift from exploration to exploitation
        epsilon = max(epsilon_end, epsilon - epsilon_decay)
        rewards_per_episode.append(total_reward)

    env.close()
    return q_table, rewards_per_episode


if __name__ == "__main__":
    q_table, rewards = run_q_learning()
    success_rate = np.mean(rewards[-1000:])
    print(f"Success rate (last 1000 episodes): {success_rate:.2%}")
    print("\nLearned Q-table (rows=states, cols=actions L/D/R/U):")
    print(np.round(q_table, 3))

7.5 SARSA: The On-Policy Cousin

SARSA (State-Action-Reward-State-Action) is Q-learning's on-policy twin. Instead of using the maximum Q-value of the next state, it uses the Q-value of the action the agent actually took:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

SARSA learns about the policy it is actually following, including its exploratory behavior. In environments with dangerous states (like cliffs), SARSA tends to learn safer paths than Q-learning, because it accounts for the possibility of accidental exploration into danger zones.
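
A SARSA version of the earlier FrozenLake agent differs from Q-learning in only a few lines; this is a sketch with a fixed exploration rate rather than an annealed one.

import numpy as np
import gymnasium as gym


def run_sarsa(num_episodes: int = 10_000, alpha: float = 0.8,
              gamma: float = 0.95, epsilon: float = 0.1) -> np.ndarray:
    """Tabular SARSA: the TD target uses the action the policy actually takes next."""
    env = gym.make("FrozenLake-v1", is_slippery=False)
    q_table = np.zeros((env.observation_space.n, env.action_space.n))

    def epsilon_greedy(state: int) -> int:
        if np.random.uniform() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(q_table[state]))

    for _ in range(num_episodes):
        state, _ = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = epsilon_greedy(next_state)
            # SARSA target: Q(s', a') for the chosen next action, not max_a' Q(s', a')
            td_target = reward + gamma * q_table[next_state, next_action] * (not done)
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            state, action = next_state, next_action
    env.close()
    return q_table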

8. Deep Reinforcement Learning

The algorithms above use a table to store Q-values. This works when the state space is small and discrete, but consider the game of Atari Breakout: the state is a 210×160 pixel image with 128 possible colors per pixel. The number of possible states is astronomically large. Deep Reinforcement Learning solves this by replacing the Q-table with a deep neural network.

8.1 Deep Q-Networks (DQN)

The Deep Q-Network, introduced by DeepMind in 2013/2015, was the first algorithm to successfully combine deep learning with Q-learning at scale, learning to play 49 Atari games from raw pixel input using the same algorithm and hyperparameters for all games.

Naively replacing the Q-table with a neural network leads to catastrophic instability due to two problems: temporal correlation between consecutive experiences, and the fact that training targets change as the network's weights change. DQN solved this with two key innovations:

  • Experience Replay — experiences \((s, a, r, s', \text{done})\) are stored in a large circular buffer. Random mini-batches are sampled during training, breaking temporal correlations and allowing experiences to be reused.
  • Target Network — a separate copy of the Q-network with weights \(\theta^-\) is updated only periodically. The TD target is computed using this stable target network: $$\text{target} = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a';\, \theta^-)$$

Here is a compact DQN implementation with both ingredients, trained on CartPole-v1:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
from collections import deque
import random


class QNetwork(nn.Module):
    """Feedforward network approximating Q(s, a)."""

    def __init__(self, state_size: int, action_size: int, hidden_size: int = 64):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, action_size),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.network(state)


class ReplayBuffer:
    """Circular buffer storing past (s, a, r, s', done) tuples."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, ns, d = zip(*batch)
        return (np.array(s), np.array(a),
                np.array(r, dtype=np.float32),
                np.array(ns), np.array(d, dtype=np.float32))

    def __len__(self): return len(self.buffer)


class DQNAgent:
    """DQN with experience replay and a target network."""

    def __init__(self, state_size, action_size,
                 lr=1e-3, gamma=0.99,
                 eps_start=1.0, eps_end=0.01, eps_decay=0.995,
                 batch_size=64, target_update_freq=100):
        self.action_size = action_size
        self.gamma = gamma
        self.epsilon = eps_start
        self.eps_end = eps_end
        self.eps_decay = eps_decay
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        self.step_count = 0
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.online_net = QNetwork(state_size, action_size).to(self.device)
        self.target_net = QNetwork(state_size, action_size).to(self.device)
        self.target_net.load_state_dict(self.online_net.state_dict())
        self.target_net.eval()

        self.optimizer = optim.Adam(self.online_net.parameters(), lr=lr)
        self.replay = ReplayBuffer()

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, self.action_size - 1)
        s = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            return self.online_net(s).argmax(dim=1).item()

    def update(self):
        if len(self.replay) < self.batch_size:
            return None
        s, a, r, ns, d = self.replay.sample(self.batch_size)
        s  = torch.FloatTensor(s).to(self.device)
        a  = torch.LongTensor(a).to(self.device)
        r  = torch.FloatTensor(r).to(self.device)
        ns = torch.FloatTensor(ns).to(self.device)
        d  = torch.FloatTensor(d).to(self.device)

        # Current Q-values for actions taken
        current_q = self.online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

        # TD target from the frozen target network
        with torch.no_grad():
            max_next_q = self.target_net(ns).max(dim=1)[0]
            td_target  = r + self.gamma * max_next_q * (1 - d)

        loss = nn.functional.smooth_l1_loss(current_q, td_target)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.online_net.parameters(), 10)
        self.optimizer.step()

        self.step_count += 1
        if self.step_count % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.online_net.state_dict())

        self.epsilon = max(self.eps_end, self.epsilon * self.eps_decay)
        return loss.item()


def train_dqn(num_episodes: int = 500):
    env = gym.make("CartPole-v1")
    agent = DQNAgent(env.observation_space.shape[0], env.action_space.n)
    rewards = []

    for ep in range(num_episodes):
        state, _ = env.reset()
        total_r, done = 0, False
        while not done:
            action = agent.select_action(state)
            next_s, r, term, trunc, _ = env.step(action)
            done = term or trunc
            agent.replay.push(state, action, r, next_s, done)
            agent.update()
            state, total_r = next_s, total_r + r
        rewards.append(total_r)
        if (ep + 1) % 50 == 0:
            print(f"Ep {ep+1:4d} | Avg(50): {np.mean(rewards[-50:]):6.1f}"
                  f" | eps: {agent.epsilon:.3f}")

    env.close()
    return rewards

if __name__ == "__main__":
    train_dqn()

8.2 Double DQN

Standard DQN systematically overestimates Q-values because the max operator picks whichever estimate happens to be inflated by noise. Double DQN fixes this by decoupling action selection from action evaluation:

$$\text{target} = R_{t+1} + \gamma\, Q\!\left(S_{t+1},\; \arg\max_{a'} Q(S_{t+1}, a';\, \theta);\; \theta^-\right)$$

The online network \(\theta\) selects the best action; the target network \(\theta^-\) evaluates it. This small change consistently improves performance and stability.
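
In code, the change to the agent above is confined to the target computation. The helper below is a sketch that assumes the online_net / target_net naming from the earlier DQNAgent:

import torch


def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma):
    """Double DQN TD target for a batch of transitions (all inputs are tensors)."""
    with torch.no_grad():
        # 1. The online network SELECTS the greedy next action...
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # 2. ...and the target network EVALUATES it
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1 - dones)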

8.3 Dueling DQN

Dueling DQN changes the network architecture itself. Instead of directly outputting Q-values, the network splits into two streams — a value stream \(V(s)\) and an advantage stream \(A(s,a)\) — then recombines them:

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)$$

Subtracting the mean advantage ensures identifiability (V and A cannot otherwise be uniquely determined from Q). The architecture allows the network to learn which states are valuable independently of the specific actions available, leading to better generalization.
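
Here is a sketch of the dueling architecture as a drop-in replacement for the QNetwork defined earlier; the layer sizes are arbitrary.

import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""

    def __init__(self, state_size: int, action_size: int, hidden: int = 64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_size, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)                 # V(s)
        self.advantage_stream = nn.Linear(hidden, action_size)   # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        features = self.feature(state)
        value = self.value_stream(features)                      # (batch, 1)
        advantage = self.advantage_stream(features)              # (batch, |A|)
        # Subtract the mean advantage so V and A are identifiable
        return value + advantage - advantage.mean(dim=1, keepdim=True)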

8.4 Rainbow DQN

Rainbow DQN combines six independent improvements to DQN into a single agent: Double DQN, Dueling DQN, Prioritized Experience Replay, Multi-step Returns, Distributional RL, and Noisy Networks. The resulting agent significantly outperformed each individual improvement and set a new state of the art on the Atari benchmark.

9. Policy-Based Methods

9.1 REINFORCE

REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It runs a complete episode, computes the return \(G_t\) for each time step, and updates the policy parameters by gradient ascent:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t \mid S_t) \cdot G_t$$

A common variance reduction technique is to subtract a baseline \(b(s)\) from the return. The most common choice is the state value function \(V(s)\), giving us the advantage:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t \mid S_t) \cdot \underbrace{\left(G_t - V(S_t)\right)}_{A^\pi(S_t,\, A_t)}$$
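
A minimal REINFORCE implementation (without a baseline) for CartPole-v1 looks like the sketch below; the network size and learning rate are illustrative choices.

import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim


def train_reinforce(num_episodes: int = 1000, gamma: float = 0.99, lr: float = 1e-2):
    """Vanilla REINFORCE on CartPole-v1 (no baseline)."""
    env = gym.make("CartPole-v1")
    policy = nn.Sequential(
        nn.Linear(env.observation_space.shape[0], 64), nn.ReLU(),
        nn.Linear(64, env.action_space.n),
    )
    opt = optim.Adam(policy.parameters(), lr=lr)

    for ep in range(num_episodes):
        state, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            logits = policy(torch.FloatTensor(state).unsqueeze(0))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, term, trunc, _ = env.step(action.item())
            done = term or trunc
            rewards.append(reward)

        # Discounted return G_t for every time step of the episode
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns, dtype=torch.float32)

        # Ascend sum_t log pi(a_t | s_t) * G_t  (so descend its negative)
        loss = -(torch.cat(log_probs) * returns).sum()
        opt.zero_grad(); loss.backward(); opt.step()

        if (ep + 1) % 100 == 0:
            print(f"Episode {ep + 1}: return = {sum(rewards):.0f}")
    env.close()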

9.2 Trust Region Policy Optimization (TRPO)

A fundamental problem with naive policy gradient methods is that a single bad update can catastrophically destroy the policy. TRPO (Schulman et al., 2015) constrains each policy update to stay within a "trust region" around the current policy, expressed as a KL divergence constraint:

$$\max_\theta \;\hat{\mathbb{E}}_t \!\left[ \frac{\pi_\theta(A_t \mid S_t)}{\pi_{\theta_{\text{old}}}(A_t \mid S_t)} \hat{A}_t \right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t \!\left[ \mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid S_t),\, \pi_\theta(\cdot \mid S_t)\right] \right] \leq \delta$$

9.3 Proximal Policy Optimization (PPO)

PPO (Schulman et al., 2017) achieves the stability benefits of TRPO with a much simpler implementation. Instead of a hard KL constraint, it uses a clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\;\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t \right) \right]$$

where \(r_t(\theta) = \dfrac{\pi_\theta(A_t \mid S_t)}{\pi_{\theta_{\text{old}}}(A_t \mid S_t)}\) is the importance ratio and \(\varepsilon \approx 0.2\). If the advantage is positive, the ratio is clipped at \(1+\varepsilon\) to prevent increasing the action's probability too much. If the advantage is negative, it is clipped at \(1-\varepsilon\). The min operator always takes the more conservative estimate.

PPO is currently one of the most widely used RL algorithms in practice. It is the algorithm used to fine-tune GPT models with human feedback (RLHF), and it is the default choice for many robotics and game-playing applications.

Here is a compact PPO implementation with Generalized Advantage Estimation, trained on CartPole-v1:

import numpy as np
import torch, torch.nn as nn, torch.optim as optim
import gymnasium as gym


class ActorCritic(nn.Module):
    """Combined actor-critic network for PPO."""

    def __init__(self, state_size, action_size, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_size, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden),     nn.Tanh(),
        )
        self.actor  = nn.Linear(hidden, action_size)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, x):
        f = self.shared(x)
        return self.actor(f), self.critic(f).squeeze(-1)

    def get_action_and_value(self, x, action=None):
        logits, value = self.forward(x)
        dist = torch.distributions.Categorical(logits=logits)
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), value


class PPOAgent:
    """Proximal Policy Optimization agent."""

    def __init__(self, state_size, action_size,
                 lr=3e-4, gamma=0.99, lam=0.95,
                 clip_eps=0.2, epochs=4, batch=64,
                 ent_coef=0.01, val_coef=0.5):
        self.gamma, self.lam = gamma, lam
        self.clip_eps = clip_eps
        self.epochs, self.batch = epochs, batch
        self.ent_coef, self.val_coef = ent_coef, val_coef
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.net = ActorCritic(state_size, action_size).to(self.device)
        self.opt = optim.Adam(self.net.parameters(), lr=lr)

    def compute_gae(self, rewards, values, dones, next_val):
        """Generalized Advantage Estimation."""
        adv = np.zeros(len(rewards), dtype=np.float32)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            nv = next_val if t == len(rewards) - 1 else values[t + 1]
            nd = float(dones[t])   # 1 if this step ended the episode, else 0
            delta = rewards[t] + self.gamma * nv * (1 - nd) - values[t]
            gae = delta + self.gamma * self.lam * (1 - nd) * gae
            adv[t] = gae
        return adv, adv + np.array(values, dtype=np.float32)

    def update(self, states, actions, old_lp, advantages, returns):
        s  = torch.FloatTensor(states).to(self.device)
        a  = torch.LongTensor(actions).to(self.device)
        lp = torch.FloatTensor(old_lp).to(self.device)
        adv = torch.FloatTensor(advantages).to(self.device)
        ret = torch.FloatTensor(returns).to(self.device)
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)

        for _ in range(self.epochs):
            idx = np.random.permutation(len(states))
            for start in range(0, len(states), self.batch):
                b = idx[start:start+self.batch]
                _, new_lp, ent, val = self.net.get_action_and_value(s[b], a[b])
                ratio = torch.exp(new_lp - lp[b])
                s1 = ratio * adv[b]
                s2 = torch.clamp(ratio, 1-self.clip_eps, 1+self.clip_eps) * adv[b]
                loss = (-torch.min(s1,s2).mean()
                        + self.val_coef * nn.functional.mse_loss(val, ret[b])
                        - self.ent_coef * ent.mean())
                self.opt.zero_grad(); loss.backward()
                nn.utils.clip_grad_norm_(self.net.parameters(), 0.5)
                self.opt.step()


def train_ppo(steps_per_update=2048, total=200_000):
    env = gym.make("CartPole-v1")
    agent = PPOAgent(env.observation_space.shape[0], env.action_space.n)
    state, _ = env.reset()
    ep_rewards, all_ep = [], []
    t = 0

    while t < total:
        rs, as_, lps, vs, rws, ds = [], [], [], [], [], []
        for _ in range(steps_per_update):
            st = torch.FloatTensor(state).unsqueeze(0).to(agent.device)
            with torch.no_grad():
                act, lp, _, v = agent.net.get_action_and_value(st)
            ns, r, term, trunc, _ = env.step(act.item())
            done = term or trunc
            rs.append(state); as_.append(act.item())
            lps.append(lp.item()); vs.append(v.item())
            rws.append(r); ds.append(done)
            state = ns; ep_rewards.append(r); t += 1
            if done:
                all_ep.append(sum(ep_rewards)); ep_rewards = []
                state, _ = env.reset()

        with torch.no_grad():
            _, _, _, nv = agent.net.get_action_and_value(
                torch.FloatTensor(state).unsqueeze(0).to(agent.device))
        adv, ret = agent.compute_gae(rws, vs, ds, nv.item())
        agent.update(np.array(rs), np.array(as_),
                     np.array(lps), adv, ret)
        if all_ep:
            print(f"t={t:6d} | ep={len(all_ep):4d} "
                  f"| avg(10)={np.mean(all_ep[-10:]):6.1f}")

    env.close()
    return all_ep

if __name__ == "__main__":
    train_ppo()

10. Actor-Critic Methods

10.1 Advantage Actor-Critic (A2C / A3C)

Actor-critic methods maintain two components: an actor that selects actions (the policy) and a critic that evaluates those actions (the value function). The actor is updated using the policy gradient with the advantage as the baseline:

$$\nabla_\theta J(\theta) = \hat{\mathbb{E}}_t \!\left[ \nabla_\theta \log \pi_\theta(A_t \mid S_t)\; \hat{A}_t \right]$$

The critic is updated by minimizing the mean squared error between its value predictions and the actual returns:

$$L_{\text{critic}} = \hat{\mathbb{E}}_t \!\left[ \left( V_\phi(S_t) - G_t \right)^2 \right]$$

A3C (Asynchronous Advantage Actor-Critic) runs multiple agents in parallel on different copies of the environment, collecting diverse experience and updating a shared global network. A2C is the synchronous version, which in practice often performs comparably and is simpler to implement.

10.2 Soft Actor-Critic (SAC)

SAC (Haarnoja et al., 2018) is one of the most powerful and sample-efficient algorithms for continuous action spaces. It operates within the maximum entropy RL framework, augmenting the standard reward objective with an entropy bonus:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}\!\left[ R(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]$$

where \(\mathcal{H}(\pi)\) is the entropy of the policy and \(\alpha\) is the temperature parameter. Maximizing entropy encourages the agent to be as random as possible while still achieving high rewards, naturally promoting exploration. SAC uses twin Q-networks and takes the minimum of their predictions to reduce overestimation bias:

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s'), \quad \tilde{a}' \sim \pi_\theta(\cdot \mid s')$$
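
As a sketch, the soft target above translates to a few lines of PyTorch. The policy.sample interface (returning an action and its log-probability) and the network names are assumptions for illustration, not the API of any specific library.

import torch


def sac_target(q1_target, q2_target, policy, rewards, next_states, dones,
               gamma: float = 0.99, alpha: float = 0.2):
    """Soft Bellman target using the minimum of twin target critics."""
    with torch.no_grad():
        # Assumed interface: sample an action and its log-probability from the policy
        next_actions, next_log_probs = policy.sample(next_states)
        q1 = q1_target(next_states, next_actions)
        q2 = q2_target(next_states, next_actions)
        # Entropy-regularized value of the next state
        next_value = torch.min(q1, q2) - alpha * next_log_probs
        return rewards + gamma * (1 - dones) * next_value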

10.3 TD3 (Twin Delayed Deep Deterministic Policy Gradient)

TD3 (Fujimoto et al., 2018) improves upon DDPG for continuous action spaces with three key tricks:

  1. Twin Q-networks — take the minimum of two critic estimates to reduce overestimation bias.
  2. Delayed policy updates — update the actor only every two critic updates, allowing the critic to stabilize first.
  3. Target policy smoothing — add small random noise to the target policy when computing TD targets, smoothing out sharp peaks in the Q-function landscape.

11. Model-Based Reinforcement Learning

All the algorithms above are model-free: they learn directly from experience without building an explicit model of the environment. Model-based RL takes a different approach: the agent learns or is given a model of the environment's dynamics, and uses that model to plan and generate synthetic experience.

The key advantage is sample efficiency. A model-based agent can generate thousands of simulated experiences from its learned model for every real experience it collects. The key challenge is model bias: if the learned model is wrong, the agent will optimize for the wrong objective.

Dyna-Q (Sutton, 1990) is the simplest model-based algorithm. After each real interaction, the agent updates both the Q-function (using real experience) and the model. It then performs \(k\) additional Q-learning updates using simulated experience generated by the model.
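
Here is a sketch of one Dyna-Q step, assuming a deterministic environment so the model can be a simple dictionary; the function name and signature are our own.

import random
import numpy as np


def dyna_q_step(q_table: np.ndarray, model: dict, s: int, a: int, r: float,
                s_next: int, alpha: float = 0.1, gamma: float = 0.95, k: int = 10):
    """One Dyna-Q step: learn from the real transition, then from k simulated ones."""
    # (1) Direct RL: ordinary Q-learning update from the real experience
    q_table[s, a] += alpha * (r + gamma * np.max(q_table[s_next]) - q_table[s, a])
    # (2) Model learning: remember what the environment did
    model[(s, a)] = (r, s_next)
    # (3) Planning: k extra Q-learning updates on remembered transitions
    for _ in range(k):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_table[ps, pa] += alpha * (pr + gamma * np.max(q_table[ps_next]) - q_table[ps, pa])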

MuZero (DeepMind, 2020) learns a model of the environment purely in terms of what is useful for planning, without trying to reconstruct the full observation. It achieved superhuman performance on Atari, chess, shogi, and Go — all without being given the rules of any of these games.

DreamerV3 (Hafner et al., 2023) learns a world model from high-dimensional observations and trains the agent entirely within the latent space of that model. Remarkably, it uses the same hyperparameters to achieve strong performance on tasks ranging from Atari games to 3D control tasks to Minecraft.

12. Advanced and Specialized Variants

12.1 Multi-Agent Reinforcement Learning (MARL)

In MARL, multiple agents coexist in the same environment and interact with each other. The environment is no longer stationary from any single agent's perspective, because the other agents are also changing their behavior. Agents may need to cooperate, compete, or do both simultaneously. The field draws heavily on game theory, particularly the concept of Nash equilibria — stable outcomes where no agent can improve its reward by unilaterally changing its strategy.

12.2 Hierarchical Reinforcement Learning (HRL)

HRL addresses long-horizon tasks by decomposing them into a hierarchy of sub-tasks. A high-level manager policy sets goals for a low-level worker policy, which executes primitive actions to achieve those goals. This temporal abstraction allows the agent to reason at multiple levels and to reuse learned sub-skills across different tasks.

12.3 Inverse Reinforcement Learning (IRL)

In standard RL, the reward function is given and the agent must learn a policy. In IRL, the situation is reversed: the agent observes expert demonstrations and must infer the reward function that the expert is optimizing. This is useful when it is easier to demonstrate desired behavior than to specify a reward function explicitly.

12.4 Reinforcement Learning from Human Feedback (RLHF)

RLHF has become one of the most practically important RL techniques, primarily because of its role in making large language models more helpful and aligned. The pipeline has three stages: (1) pre-train a base language model on text; (2) train a reward model from human preference comparisons; (3) fine-tune the language model using PPO, with the reward model providing the reward signal.

A more recent variant called RLVR (Reinforcement Learning with Verifiable Rewards) replaces the learned reward model with an automatic verifier for tasks where correctness can be checked objectively — such as mathematics and coding. This approach, used in models like DeepSeek-R1 and OpenAI's o1, has led to dramatic improvements in reasoning capabilities.

12.5 Offline Reinforcement Learning

Offline RL learns a policy from a fixed dataset of previously collected experience, without any further interaction with the environment. The key challenge is distributional shift: the learned policy may want to take actions not well-represented in the dataset, leading to unreliable Q-value estimates. Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) address this by penalizing Q-values for out-of-distribution actions.

12.6 Meta-Reinforcement Learning

Meta-RL, or "learning to learn," trains agents that can quickly adapt to new tasks with very few interactions. The agent is trained on a distribution of related tasks and learns a general strategy for rapid adaptation. MAML (Model-Agnostic Meta-Learning) learns an initialization of the policy parameters such that a small number of gradient steps on a new task leads to good performance.

12.7 Safe Reinforcement Learning

Safe RL incorporates constraints into the optimization problem, ensuring the agent's behavior satisfies safety requirements at all times. Constrained Markov Decision Processes (CMDPs) extend MDPs with additional cost functions \(C(s,a)\) and constraints on the expected cumulative cost \(\mathbb{E}[\sum_t C(S_t, A_t)] \leq d\). Algorithms like Constrained Policy Optimization (CPO) solve the constrained optimization problem while still maximizing reward.

13. Practical Implementation: Building Your First RL Agent

The most important library for RL environments is Gymnasium (formerly OpenAI Gym), which provides a standardized interface. Stable-Baselines3 provides clean, well-tested implementations of many popular algorithms that you can use out of the box:

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env


def train_with_stable_baselines():
    """
    Train a PPO agent on LunarLander-v2 using Stable-Baselines3.
    LunarLander requires landing a spacecraft between two flags using thrusters.
    """
    # Vectorized environments run multiple copies in parallel
    env = make_vec_env("LunarLander-v2", n_envs=4)

    model = PPO(
        policy="MlpPolicy",
        env=env,
        learning_rate=3e-4,
        n_steps=1024,
        batch_size=64,
        n_epochs=4,
        gamma=0.999,
        gae_lambda=0.98,
        clip_range=0.2,
        ent_coef=0.01,
        verbose=1,
    )
    model.learn(total_timesteps=500_000)
    model.save("ppo_lunarlander")

    eval_env = gym.make("LunarLander-v2")
    mean_r, std_r = evaluate_policy(model, eval_env,
                                    n_eval_episodes=20,
                                    deterministic=True)
    print(f"Mean reward: {mean_r:.2f} +/- {std_r:.2f}")
    eval_env.close(); env.close()


if __name__ == "__main__":
    train_with_stable_baselines()

For custom environments, inherit from gym.Env and implement reset(), step(), and define observation_space and action_space. This makes your environment compatible with all Gymnasium-compatible RL algorithms automatically.
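
A toy example of the pattern (the corridor environment itself is made up purely for illustration):

import numpy as np
import gymnasium as gym
from gymnasium import spaces


class CorridorEnv(gym.Env):
    """Toy 1-D corridor: start in cell 0, reach the last cell for a reward of +1."""

    def __init__(self, size: int = 8):
        super().__init__()
        self.size = size
        self.observation_space = spaces.Discrete(size)
        self.action_space = spaces.Discrete(2)      # 0 = step left, 1 = step right

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.position = 0
        return self.position, {}                    # observation, info

    def step(self, action):
        self.position += 1 if action == 1 else -1
        self.position = int(np.clip(self.position, 0, self.size - 1))
        terminated = self.position == self.size - 1
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, False, {}   # obs, reward, terminated, truncated, info

Once instantiated, such an environment can be trained with any of the agents above, including the Stable-Baselines3 algorithms.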

14. Real-World Applications

Game playing — AlphaGo and AlphaZero surpassed the best human players in Go, chess, and shogi. OpenAI Five defeated world champions at Dota 2. AlphaStar achieved Grandmaster level in StarCraft II.

Robotics — RL is enabling a new generation of dexterous, adaptive robots. OpenAI's Dactyl system used RL to train a robot hand to solve a Rubik's cube using only touch and vision. The key challenge is the sim-to-real gap: policies trained in simulation often fail on real hardware due to differences in physics and sensor noise.

Healthcare — RL is being used to optimize treatment protocols for sepsis, diabetes, and cancer. By learning from historical patient data, RL agents can discover treatment strategies that outperform standard clinical guidelines.

Finance — RL is used for algorithmic trading, portfolio optimization, and market making. The non-stationarity of financial markets and the risk of catastrophic losses make this a particularly challenging domain.

Energy management — Google DeepMind used RL to reduce data center cooling energy consumption by 40%. RL is also being applied to smart grid management and power plant optimization.

Natural language processing — RLHF is the dominant technique for aligning large language models with human preferences. Every major AI assistant today — ChatGPT, Claude, Gemini, Llama — has been fine-tuned using some form of RL.

Scientific discovery — AlphaFold 2 solved the protein structure prediction problem that had stumped biologists for fifty years. AlphaTensor discovered new matrix multiplication algorithms faster than those humans had found over decades of research.

15. The Vision of David Silver and the Era of Experience

To understand where reinforcement learning is headed, we need to understand the vision of David Silver, the man who more than anyone else is responsible for AlphaGo, AlphaZero, and the modern era of deep RL.

Silver spent over a decade at Google DeepMind, leading the teams that created some of the most astonishing AI systems ever built. AlphaZero, which followed AlphaGo in 2017, was even more remarkable: it learned to play Go, chess, and shogi at superhuman level from scratch, starting with nothing but the rules of the game and playing against itself. AlphaZero discovered strategies that human players had never conceived of in thousands of years of playing these games.

In January 2026, Silver left DeepMind to found Ineffable Intelligence, based in London. In April 2026, the company raised an extraordinary $1.1 billion in seed funding — one of the largest early-stage financings ever recorded — with investors including Nvidia, Google, Sequoia Capital, Lightspeed Venture Partners, and the UK government's Sovereign AI Fund. The company is valued at $5.1 billion.

The mission is to build what Silver calls a "superlearner": an AI system that discovers all knowledge autonomously through its own experience, from basic motor skills to profound intellectual breakthroughs. This vision is articulated in a paper Silver co-authored with Richard Sutton (the father of modern reinforcement learning) in April 2025, titled "Welcome to the Era of Experience."

The Core Argument: AI has made enormous progress by training on human-generated data — text, images, code. But this approach has a fundamental ceiling. Human-generated data is limited by what humans know and have expressed. An AI that learns only from human data can, at best, match human performance. It cannot surpass it in a fundamental way. The next era of AI will be defined by agents that learn primarily from their own experience, through interaction with the world — the "Era of Experience."

Silver believes that a sufficiently capable superlearner, given access to the right environments and reward signals, could discover knowledge that no human has ever possessed: new mathematical theorems, new physical theories, new drug compounds, new economic systems. AlphaZero already discovered novel chess strategies that surprised grandmasters. AlphaTensor discovered new matrix multiplication algorithms. These are early glimpses of what a more general superlearner might achieve.

Demis Hassabis, the CEO of Google DeepMind and the other key architect of AlphaGo, shares a similar vision. He has predicted that AGI could arrive by 2030. Hassabis was awarded the Nobel Prize in Chemistry in 2024 for AlphaFold's contributions to protein structure prediction — a remarkable recognition of AI's potential to transform science.

Both Silver and Hassabis believe that reinforcement learning is not just one tool among many in the AI toolkit. They believe it is the fundamental mechanism by which intelligence — both biological and artificial — is created and refined. The brain itself can be understood as an RL system, with dopamine acting as a reward signal. The algorithms in this tutorial are, in a deep sense, mathematical formalizations of how learning works in nature.

16. The Future of Reinforcement Learning

The field of reinforcement learning is evolving at a breathtaking pace. Here are the most important trends shaping its future.

Integration with large language models — RLHF has made language models dramatically more useful and aligned. RLVR is making them dramatically more capable at reasoning. The next step is using language models as world models, reward functions, and policy components within RL systems — creating agents that can understand natural language instructions, reason about their actions, and communicate their reasoning to humans.

World models — Rather than learning purely from real experience, agents will learn rich internal models of the world and use those models to plan, imagine, and reason. The convergence of video generation, robotics, and simulation through world models suggests a future where AI agents can learn to navigate and manipulate the physical world with unprecedented capability.

Multi-agent systems — As AI is deployed in complex social and economic environments, the coordination, competition, and communication between multiple AI agents will require new theoretical frameworks that go beyond what single-agent RL can provide.

Safe and interpretable RL — Current RL algorithms are often opaque black boxes that can behave unpredictably in novel situations. Future algorithms will need to provide safety guarantees, explain their decisions, and remain robust to distribution shift.

New scaling laws — In supervised learning, scaling up data, compute, and model size has consistently led to dramatic improvements. Whether the same scaling laws apply to RL is an open question, but early evidence from AlphaZero and MuZero suggests that RL can benefit enormously from scale.

The Era of Experience — Perhaps most profoundly, the vision articulated by Silver and Sutton suggests a future where AI systems are not trained on human data at all, but discover knowledge entirely through their own experience. If this vision is realized, the implications for science, technology, and society would be difficult to overstate. We would have created systems capable of generating new knowledge at a rate and depth that far exceeds what any human or team of humans could achieve.

17. Conclusion

We have traveled a long distance in this tutorial. We started with the simple but profound idea that an agent can learn to behave intelligently by interacting with its environment and receiving rewards. We built up the mathematical framework of Markov Decision Processes, the Bellman equations, and the policy gradient theorem. We implemented Q-learning, Deep Q-Networks, and Proximal Policy Optimization from scratch. We surveyed the full landscape of RL algorithms, from dynamic programming to model-based methods to RLHF. And we looked at the extraordinary vision of the people who believe that reinforcement learning is the key to artificial general intelligence.

The most important thing to take away is not any specific algorithm or equation. It is the underlying philosophy: intelligence, at its core, is the ability to learn from experience. Every algorithm we have discussed is a different way of formalizing and implementing this simple idea.

Reinforcement learning is hard. The credit assignment problem is hard. The exploration-exploitation dilemma is hard. Scaling RL to complex real-world environments is hard. But the progress of the last decade has been astonishing, and the pace of progress is accelerating.

If you are a beginner, the best way to deepen your understanding is to implement these algorithms yourself, experiment with different environments, and read the original papers. The textbook Reinforcement Learning: An Introduction by Sutton and Barto is the definitive reference and is freely available online. The OpenAI Spinning Up documentation is an excellent practical guide.

The field is young, the problems are profound, and the potential impact is enormous. Welcome to reinforcement learning.


📚 References & Further Reading

Sutton, R.S. and Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free at: incompleteideas.net/book/the-book.html

Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. (The DQN paper.)

Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489. (AlphaGo.)

Silver, D. et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362, 1140–1144. (AlphaZero.)

Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

Haarnoja, T. et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290.

Silver, D. and Sutton, R.S. (2025). Welcome to the Era of Experience.

Hafner, D. et al. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104. (DreamerV3.)
