Introduction
Reinforcement Learning (RL) stands as one of the most fascinating branches of machine learning, offering a framework where agents learn to make decisions by interacting with an environment. Unlike supervised learning, where models learn from labeled examples, or unsupervised learning, which finds patterns in unlabeled data, reinforcement learning focuses on how agents should act to maximize cumulative rewards. This approach mirrors how humans and animals naturally learn through trial and error, making it particularly powerful for solving complex sequential decision-making problems.
The applications of reinforcement learning span numerous domains, from robotics and autonomous vehicles to recommendation systems and game playing. The field gained significant public attention when DeepMind's AlphaGo defeated the world champion in Go, a feat previously thought to be decades away. This success exemplified the potential of combining reinforcement learning with deep neural networks, giving rise to what we now call deep reinforcement learning.
As a developer beginning your journey into reinforcement learning, you'll encounter a rich landscape of concepts, algorithms, and implementation details. This article aims to navigate you through the foundations, providing both theoretical understanding and practical code examples to help you build your first reinforcement learning systems.
Core Concepts of Reinforcement Learning
At its heart, reinforcement learning involves an agent learning to make decisions by interacting with an environment. This interaction follows a cycle: the agent takes an action based on its current state, the environment responds by transitioning to a new state and providing a reward signal, and the agent uses this feedback to improve its decision-making strategy.
The environment represents the world in which the agent operates. It could be a physical environment like a robot navigating a room, or a virtual one like a game. The agent observes the environment through a state representation, which captures relevant information about the environment's current configuration. Based on this state, the agent selects an action according to its policy, which is a mapping from states to actions.
After taking an action, the agent receives two pieces of feedback from the environment: the new state resulting from its action, and a reward signal indicating the immediate value of that action. The agent's goal is to learn a policy that maximizes the expected cumulative reward over time, not just the immediate reward. This long-term perspective distinguishes reinforcement learning from simpler approaches like greedy algorithms.
The mathematical framework formalizing this process is called a Markov Decision Process (MDP). An MDP is defined by its state space, action space, transition probabilities between states, reward function, and a discount factor determining how much the agent values future rewards compared to immediate ones. The "Markov" part refers to the assumption that the next state depends only on the current state and action, not on the full history of the interaction.
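To make the definition concrete, here is a minimal sketch of a two-state MDP written as plain Python data structures. The states, actions, probabilities, and rewards below are made up purely for illustration and are not tied to any library:

# A tiny, hypothetical two-state MDP described as plain Python data
states = ["A", "B"]
actions = ["stay", "move"]

# transitions[(state, action)] -> {next_state: probability}
transitions = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 0.9, "A": 0.1},  # moving sometimes fails
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 1.0},
}

# rewards[(state, action)] -> immediate reward
rewards = {
    ("A", "stay"): 0.0,
    ("A", "move"): 1.0,
    ("B", "stay"): 2.0,
    ("B", "move"): 0.0,
}

gamma = 0.9  # discount factor: how much future rewards count relative to immediate ones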
Implementing a Basic Environment and Agent
Let's implement a simple environment and agent to illustrate these concepts. We'll create a grid world where an agent needs to navigate from a starting position to a goal while avoiding obstacles.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import random

class GridWorldEnv:
    def __init__(self, size=5):
        self.size = size
        # Create grid: 0 = empty, 1 = obstacle, 2 = goal
        self.grid = np.zeros((size, size))
        # Set obstacles
        self.grid[1, 1] = 1
        self.grid[2, 3] = 1
        self.grid[3, 1] = 1
        # Set goal
        self.grid[size-1, size-1] = 2
        # Starting position
        self.agent_pos = (0, 0)
        # Possible actions as (row, col) offsets: right, down, left, up
        self.actions = [(0, 1), (1, 0), (0, -1), (-1, 0)]

    def reset(self):
        self.agent_pos = (0, 0)
        return self.agent_pos

    def step(self, action_idx):
        action = self.actions[action_idx]
        # Calculate new position
        new_pos = (self.agent_pos[0] + action[0], self.agent_pos[1] + action[1])
        # Move only if the new position is inside the grid and not an obstacle
        if (0 <= new_pos[0] < self.size and
                0 <= new_pos[1] < self.size and
                self.grid[new_pos] != 1):
            self.agent_pos = new_pos
        # Check if goal reached
        done = bool(self.grid[self.agent_pos] == 2)
        # Define rewards
        if done:
            reward = 10    # Reaching the goal
        else:
            reward = -0.1  # Small penalty for each step (encourages finding the shortest path)
        return self.agent_pos, reward, done

    def render(self):
        grid_copy = self.grid.copy()
        # Mark agent position (don't overwrite the goal cell)
        if grid_copy[self.agent_pos] == 0:
            grid_copy[self.agent_pos] = 3
        # Custom colormap: white=empty, black=obstacle, green=goal, red=agent
        cmap = ListedColormap(['white', 'black', 'green', 'red'])
        plt.figure(figsize=(6, 6))
        plt.imshow(grid_copy, cmap=cmap, vmin=0, vmax=3)  # fix the value range so colors stay consistent
        plt.grid(True)
        plt.xticks(np.arange(self.size))
        plt.yticks(np.arange(self.size))
        plt.title('Grid World')
        plt.show()
This code defines our environment, a 5x5 grid world with obstacles and a goal. The agent can move in four directions: up, right, down, and left. It receives a positive reward for reaching the goal and a small negative reward for each step to encourage finding the shortest path.
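To get a feel for this interface before adding any learning, you can drive the environment by hand. The short check below assumes the GridWorldEnv class above has already been run; action index 1 corresponds to the (1, 0) offset, i.e. moving down one row:

# Quick manual check of the environment's step interface
env = GridWorldEnv()
state = env.reset()
print("Start state:", state)

next_state, reward, done = env.step(1)  # move down from (0, 0) to (1, 0)
print("New state:", next_state, "Reward:", reward, "Done:", done)

env.render()  # visualize where the agent ended up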
Now, let's implement a simple agent that makes random moves to explore this environment:
class RandomAgent:
    def __init__(self, env):
        self.env = env
        self.n_actions = len(env.actions)

    def choose_action(self, state):
        # Simply choose a random action
        return random.randint(0, self.n_actions - 1)

    def train(self, episodes=10):
        for episode in range(episodes):
            state = self.env.reset()
            done = False
            total_reward = 0
            steps = 0
            print(f"Episode {episode+1}")
            self.env.render()
            while not done and steps < 100:  # Limit to prevent infinite loops
                action = self.choose_action(state)
                new_state, reward, done = self.env.step(action)
                total_reward += reward
                state = new_state
                steps += 1
                print(f"Step {steps}, Action: {action}, Reward: {reward}")
                self.env.render()
            print(f"Episode {episode+1} finished in {steps} steps with total reward: {total_reward}")
            print("-" * 40)

# Create environment and agent
env = GridWorldEnv()
agent = RandomAgent(env)
# Train the agent (or rather, watch it explore randomly)
agent.train(episodes=3)
This agent simply selects random actions, which is not an effective learning strategy but serves to demonstrate the environment dynamics. The train method runs episodes where the agent interacts with the environment until it reaches the goal or a maximum number of steps.
Q-Learning: A Fundamental RL Algorithm
Random actions won't get us far in complex environments. Let's implement Q-learning, a fundamental reinforcement learning algorithm that learns a value function for state-action pairs.
Q-learning works by maintaining a table (the Q-table) that estimates the expected future reward for each state-action pair. The agent uses this table to select actions that maximize expected rewards, while also occasionally exploring new actions to improve its estimates.
class QLearningAgent:
    def __init__(self, env, learning_rate=0.1, discount_factor=0.9, exploration_rate=1.0, exploration_decay=0.99):
        self.env = env
        self.learning_rate = learning_rate          # How quickly we update our Q-values
        self.discount_factor = discount_factor      # How much we value future rewards
        self.exploration_rate = exploration_rate    # Probability of taking a random action
        self.exploration_decay = exploration_decay  # How quickly the exploration rate decays
        self.n_actions = len(env.actions)
        # Initialize Q-table with zeros.
        # Since our state is the agent's (row, col) position, we keep one Q-value per cell per action.
        self.q_table = np.zeros((env.size, env.size, self.n_actions))

    def choose_action(self, state):
        # Exploration: choose a random action
        if random.uniform(0, 1) < self.exploration_rate:
            return random.randint(0, self.n_actions - 1)
        # Exploitation: choose the best action based on Q-values
        return np.argmax(self.q_table[state])

    def learn(self, state, action, reward, next_state, done):
        # Get the current Q-value
        current_q = self.q_table[state][action]
        # Get the maximum Q-value for the next state
        max_next_q = np.max(self.q_table[next_state]) if not done else 0
        # Calculate the new Q-value using the Q-learning update rule
        new_q = current_q + self.learning_rate * (reward + self.discount_factor * max_next_q - current_q)
        # Update the Q-table
        self.q_table[state][action] = new_q

    def train(self, episodes=1000, max_steps=100, render_interval=100):
        rewards_per_episode = []
        for episode in range(episodes):
            state = self.env.reset()
            done = False
            total_reward = 0
            steps = 0
            # Render occasionally to see progress
            should_render = episode % render_interval == 0
            if should_render:
                print(f"Episode {episode+1}")
                self.env.render()
            while not done and steps < max_steps:
                action = self.choose_action(state)
                next_state, reward, done = self.env.step(action)
                self.learn(state, action, reward, next_state, done)
                total_reward += reward
                state = next_state
                steps += 1
                if should_render:
                    print(f"Step {steps}, Action: {action}, Reward: {reward}")
                    self.env.render()
            # Decay exploration rate after each episode
            self.exploration_rate *= self.exploration_decay
            rewards_per_episode.append(total_reward)
            if should_render:
                print(f"Episode {episode+1} finished in {steps} steps with total reward: {total_reward}")
                print(f"Exploration rate: {self.exploration_rate:.4f}")
                print("-" * 40)
        # Plot rewards over episodes
        plt.figure(figsize=(10, 6))
        plt.plot(rewards_per_episode)
        plt.xlabel('Episode')
        plt.ylabel('Total Reward')
        plt.title('Rewards per Episode')
        plt.grid(True)
        plt.show()
        return rewards_per_episode

# Create environment and Q-learning agent
env = GridWorldEnv()
agent = QLearningAgent(env)
# Train the agent
rewards = agent.train(episodes=500, render_interval=100)
In this implementation, the Q-learning agent maintains a Q-table with values for each state-action pair. The `choose_action` method balances exploration (trying new actions) with exploitation (choosing the best known action). The `learn` method updates the Q-values using the Q-learning update formula, which incorporates the immediate reward and the estimated future reward based on the next state.
The exploration rate starts high, encouraging the agent to try different actions, and gradually decreases as the agent learns, allowing it to exploit its knowledge more often. This exploration-exploitation tradeoff is crucial in reinforcement learning.
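Once training has finished, a quick way to sanity-check what the agent learned is to read the greedy policy straight out of the Q-table. The helper below is not part of the agent class; it simply reuses the env and agent objects created above and prints an arrow for each free cell (using the action order right, down, left, up):

# Print the greedy action for every cell of the learned Q-table
action_symbols = ['>', 'v', '<', '^']  # matches the order of env.actions

for row in range(env.size):
    line = ''
    for col in range(env.size):
        if env.grid[row, col] == 1:
            line += ' # '  # obstacle
        elif env.grid[row, col] == 2:
            line += ' G '  # goal
        else:
            best_action = np.argmax(agent.q_table[row, col])
            line += f' {action_symbols[best_action]} '
    print(line)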
Deep Q-Networks: Combining RL with Deep Learning
While Q-learning works well for small, discrete state spaces, it becomes impractical for larger or continuous environments. Deep Q-Networks (DQN) address this limitation by using a neural network to approximate the Q-function.
Let's implement a DQN agent for our grid world, though in practice, DQNs are typically used for more complex environments:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from collections import deque
import numpy as np
import random

class DQNAgent:
    def __init__(self, env, memory_size=2000, batch_size=32, learning_rate=0.001,
                 discount_factor=0.95, exploration_rate=1.0, exploration_min=0.01,
                 exploration_decay=0.995):
        self.env = env
        self.memory = deque(maxlen=memory_size)
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        self.exploration_min = exploration_min
        self.exploration_decay = exploration_decay
        self.n_actions = len(env.actions)
        # State representation: x position, y position
        self.state_size = 2
        # Create neural network model
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential([
            Dense(24, input_dim=self.state_size, activation='relu'),
            Dense(24, activation='relu'),
            Dense(self.n_actions, activation='linear')
        ])
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        # Store experience in memory
        self.memory.append((state, action, reward, next_state, done))

    def choose_action(self, state):
        # Convert grid position to feature vector
        state_array = np.array([state[0], state[1]]).reshape(1, -1)
        # Exploration: choose a random action
        if random.uniform(0, 1) < self.exploration_rate:
            return random.randint(0, self.n_actions - 1)
        # Exploitation: choose the best action based on Q-values predicted by the network
        act_values = self.model.predict(state_array, verbose=0)
        return np.argmax(act_values[0])

    def replay(self):
        # Train the model on a batch of experiences from memory
        if len(self.memory) < self.batch_size:
            return
        # Sample a batch from memory
        minibatch = random.sample(self.memory, self.batch_size)
        for state, action, reward, next_state, done in minibatch:
            state_array = np.array([state[0], state[1]]).reshape(1, -1)
            next_state_array = np.array([next_state[0], next_state[1]]).reshape(1, -1)
            # If done, the target is just the reward
            if done:
                target = reward
            else:
                # Target is the reward plus the discounted max Q-value for the next state
                target = reward + self.discount_factor * np.max(
                    self.model.predict(next_state_array, verbose=0)[0])
            # Get current predictions
            target_f = self.model.predict(state_array, verbose=0)
            # Update the Q-value for the action taken
            target_f[0][action] = target
            # Train the model
            self.model.fit(state_array, target_f, epochs=1, verbose=0)
        # Decay exploration rate
        if self.exploration_rate > self.exploration_min:
            self.exploration_rate *= self.exploration_decay

    def train(self, episodes=1000, max_steps=100, render_interval=100):
        rewards_per_episode = []
        for episode in range(episodes):
            state = self.env.reset()
            done = False
            total_reward = 0
            steps = 0
            # Render occasionally to see progress
            should_render = episode % render_interval == 0
            if should_render:
                print(f"Episode {episode+1}")
                self.env.render()
            while not done and steps < max_steps:
                action = self.choose_action(state)
                next_state, reward, done = self.env.step(action)
                self.remember(state, action, reward, next_state, done)
                total_reward += reward
                state = next_state
                steps += 1
                if should_render:
                    print(f"Step {steps}, Action: {action}, Reward: {reward}")
                    self.env.render()
            # Learn from a batch of stored experiences after each episode
            self.replay()
            rewards_per_episode.append(total_reward)
            if should_render:
                print(f"Episode {episode+1} finished in {steps} steps with total reward: {total_reward}")
                print(f"Exploration rate: {self.exploration_rate:.4f}")
                print("-" * 40)
        # Plot rewards over episodes
        plt.figure(figsize=(10, 6))
        plt.plot(rewards_per_episode)
        plt.xlabel('Episode')
        plt.ylabel('Total Reward')
        plt.title('Rewards per Episode (DQN)')
        plt.grid(True)
        plt.show()
        return rewards_per_episode

# Create environment and DQN agent
env = GridWorldEnv()
dqn_agent = DQNAgent(env)
# Train the agent
dqn_rewards = dqn_agent.train(episodes=500, render_interval=100)
The DQN agent uses a neural network to approximate the Q-function, allowing it to handle more complex state spaces. It also introduces experience replay, where past experiences are stored in a memory buffer and randomly sampled for training. This helps break the correlation between consecutive samples and improves the stability of learning.
The architecture consists of a simple neural network with two hidden layers, taking the agent's position as input and outputting Q-values for each possible action. The agent follows the same exploration-exploitation strategy as in Q-learning but updates the Q-values through neural network training.
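After training, it is worth checking how the network behaves with exploration switched off. The rollout below is an extra inspection step rather than part of the DQNAgent class; it reuses the env and dqn_agent objects created above and always follows the highest predicted Q-value:

# Run one greedy episode with the trained DQN (no exploration)
state = env.reset()
done = False
path = [state]

for _ in range(50):  # safety limit on episode length
    state_array = np.array([state[0], state[1]]).reshape(1, -1)
    q_values = dqn_agent.model.predict(state_array, verbose=0)[0]
    action = int(np.argmax(q_values))  # always take the best predicted action
    state, reward, done = env.step(action)
    path.append(state)
    if done:
        break

print("Greedy path:", path)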
Policy Gradient Methods: Learning Policies Directly
Q-learning and DQN are value-based methods that learn a value function and derive a policy from it. In contrast, policy gradient methods learn the policy directly, optimizing it to maximize expected rewards.
Let's implement REINFORCE, a basic policy gradient algorithm:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
import numpy as np

class REINFORCEAgent:
    def __init__(self, env, learning_rate=0.01, discount_factor=0.99):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.n_actions = len(env.actions)
        # State representation: x position, y position
        self.state_size = 2
        # Create policy model
        self.model = self._build_model()
        # Lists to store episode data
        self.states = []
        self.actions = []
        self.rewards = []

    def _build_model(self):
        model = Sequential([
            Dense(24, input_dim=self.state_size, activation='relu'),
            Dense(24, activation='relu'),
            Dense(self.n_actions, activation='softmax')  # Softmax gives a probability distribution over actions
        ])
        model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward):
        # Store episode data
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def choose_action(self, state):
        # Convert grid position to feature vector
        state_array = np.array([state[0], state[1]]).reshape(1, -1)
        # Get action probabilities from the policy network
        action_probs = self.model.predict(state_array, verbose=0)[0]
        # Renormalize to guard against float32 rounding (np.random.choice requires probabilities summing to 1)
        action_probs = action_probs / np.sum(action_probs)
        # Sample an action from the probability distribution
        return np.random.choice(self.n_actions, p=action_probs)

    def discount_rewards(self):
        # Calculate discounted rewards, working backwards from the end of the episode
        discounted_rewards = np.zeros_like(self.rewards, dtype=np.float32)
        running_reward = 0
        for i in reversed(range(len(self.rewards))):
            running_reward = self.rewards[i] + self.discount_factor * running_reward
            discounted_rewards[i] = running_reward
        # Normalize rewards to have zero mean and unit variance
        discounted_rewards -= np.mean(discounted_rewards)
        discounted_rewards /= np.std(discounted_rewards) + 1e-8  # Epsilon avoids division by zero
        return discounted_rewards

    def train_model(self):
        # Get discounted rewards
        discounted_rewards = self.discount_rewards()
        # Prepare training data: one (row, col) feature vector per visited state
        states = np.array(self.states, dtype=np.float32)
        # One-hot encode actions
        actions = np.zeros((len(self.actions), self.n_actions))
        for i, action in enumerate(self.actions):
            actions[i, action] = 1
        # Scale the actions by discounted rewards.
        # This increases the probability of actions that led to high rewards.
        actions = actions * discounted_rewards[:, np.newaxis]
        # Train the model
        self.model.fit(states, actions, epochs=1, verbose=0)
        # Clear episode data
        self.states = []
        self.actions = []
        self.rewards = []

    def train(self, episodes=1000, max_steps=100, render_interval=100):
        rewards_per_episode = []
        for episode in range(episodes):
            state = self.env.reset()
            done = False
            total_reward = 0
            steps = 0
            # Clear episode data
            self.states = []
            self.actions = []
            self.rewards = []
            # Render occasionally to see progress
            should_render = episode % render_interval == 0
            if should_render:
                print(f"Episode {episode+1}")
                self.env.render()
            while not done and steps < max_steps:
                action = self.choose_action(state)
                next_state, reward, done = self.env.step(action)
                self.remember(state, action, reward)
                total_reward += reward
                state = next_state
                steps += 1
                if should_render:
                    print(f"Step {steps}, Action: {action}, Reward: {reward}")
                    self.env.render()
            # Train after each episode
            self.train_model()
            rewards_per_episode.append(total_reward)
            if should_render:
                print(f"Episode {episode+1} finished in {steps} steps with total reward: {total_reward}")
                print("-" * 40)
        # Plot rewards over episodes
        plt.figure(figsize=(10, 6))
        plt.plot(rewards_per_episode)
        plt.xlabel('Episode')
        plt.ylabel('Total Reward')
        plt.title('Rewards per Episode (REINFORCE)')
        plt.grid(True)
        plt.show()
        return rewards_per_episode

# Create environment and REINFORCE agent
env = GridWorldEnv()
pg_agent = REINFORCEAgent(env)
# Train the agent
pg_rewards = pg_agent.train(episodes=500, render_interval=100)
REINFORCE is a Monte Carlo policy gradient method that works by collecting a complete episode of experience before updating the policy. The key insight is that we want to increase the probability of actions that led to high rewards and decrease the probability of actions that led to low rewards.
The policy network outputs a probability distribution over actions, and the agent samples from this distribution to select actions. After an episode, the agent calculates the discounted rewards and uses them to scale the gradients for policy updates. Actions that led to higher discounted rewards will have a stronger influence on the policy update.
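To see what the discounting does, consider a hypothetical three-step episode with rewards [-0.1, -0.1, 10] and a discount factor of 0.99 (the numbers are made up for illustration):

# Working backwards through the episode:
# Step 3 (last): G3 = 10
# Step 2:        G2 = -0.1 + 0.99 * 10   = 9.80
# Step 1:        G1 = -0.1 + 0.99 * 9.80 = 9.602
#
# Before normalization the discounted returns are roughly [9.60, 9.80, 10.0];
# after subtracting the mean and dividing by the standard deviation, the steps
# closest to the goal end up with the largest positive weights in the update.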
Actor-Critic Methods: Combining Value and Policy Learning
Actor-Critic methods combine the advantages of value-based and policy-based approaches. The "actor" learns a policy, while the "critic" learns to evaluate the policy by estimating the value function.
Let's implement a simple Actor-Critic agent:
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
import numpy as np

class ActorCriticAgent:
    def __init__(self, env, actor_lr=0.001, critic_lr=0.005, discount_factor=0.99):
        self.env = env
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.discount_factor = discount_factor
        self.n_actions = len(env.actions)
        # State representation: x position, y position
        self.state_size = 2
        # Create actor (policy) and critic (value) models
        self.actor = self._build_actor()
        self.critic = self._build_critic()

    def _build_actor(self):
        # Actor model outputs action probabilities
        actor = Sequential([
            Dense(24, input_dim=self.state_size, activation='relu'),
            Dense(24, activation='relu'),
            Dense(self.n_actions, activation='softmax')
        ])
        actor.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=self.actor_lr))
        return actor

    def _build_critic(self):
        # Critic model estimates state value
        critic = Sequential([
            Dense(24, input_dim=self.state_size, activation='relu'),
            Dense(24, activation='relu'),
            Dense(1, activation='linear')
        ])
        critic.compile(loss='mse', optimizer=Adam(learning_rate=self.critic_lr))
        return critic

    def choose_action(self, state):
        # Convert grid position to feature vector
        state_array = np.array([state[0], state[1]]).reshape(1, -1)
        # Get action probabilities from the actor network
        action_probs = self.actor.predict(state_array, verbose=0)[0]
        # Renormalize to guard against float32 rounding in the softmax output
        action_probs = action_probs / np.sum(action_probs)
        # Sample an action from the probability distribution
        return np.random.choice(self.n_actions, p=action_probs)

    def learn(self, state, action, reward, next_state, done):
        # Convert states to feature vectors
        state_array = np.array([state[0], state[1]]).reshape(1, -1)
        next_state_array = np.array([next_state[0], next_state[1]]).reshape(1, -1)
        # Predict state values
        state_value = float(self.critic.predict(state_array, verbose=0)[0][0])
        next_state_value = float(self.critic.predict(next_state_array, verbose=0)[0][0]) if not done else 0.0
        # Calculate the TD (temporal difference) error
        td_error = reward + self.discount_factor * next_state_value - state_value
        # Update critic (value function) towards the bootstrapped target
        target = reward + self.discount_factor * next_state_value
        self.critic.fit(state_array, np.array([[target]]), epochs=1, verbose=0)
        # Update actor (policy):
        # weight the chosen action by the TD error so better-than-expected actions become more likely
        target_actor = np.zeros((1, self.n_actions))
        target_actor[0, action] = td_error
        self.actor.fit(state_array, target_actor, epochs=1, verbose=0)

    def train(self, episodes=1000, max_steps=100, render_interval=100):
        rewards_per_episode = []
        for episode in range(episodes):
            state = self.env.reset()
            done = False
            total_reward = 0
            steps = 0
            # Render occasionally to see progress
            should_render = episode % render_interval == 0
            if should_render:
                print(f"Episode {episode+1}")
                self.env.render()
            while not done and steps < max_steps:
                action = self.choose_action(state)
                next_state, reward, done = self.env.step(action)
                self.learn(state, action, reward, next_state, done)
                total_reward += reward
                state = next_state
                steps += 1
                if should_render:
                    print(f"Step {steps}, Action: {action}, Reward: {reward}")
                    self.env.render()
            rewards_per_episode.append(total_reward)
            if should_render:
                print(f"Episode {episode+1} finished in {steps} steps with total reward: {total_reward}")
                print("-" * 40)
        # Plot rewards over episodes
        plt.figure(figsize=(10, 6))
        plt.plot(rewards_per_episode)
        plt.xlabel('Episode')
        plt.ylabel('Total Reward')
        plt.title('Rewards per Episode (Actor-Critic)')
        plt.grid(True)
        plt.show()
        return rewards_per_episode

# Create environment and Actor-Critic agent
env = GridWorldEnv()
ac_agent = ActorCriticAgent(env)
# Train the agent
ac_rewards = ac_agent.train(episodes=500, render_interval=100)
In this Actor-Critic implementation, the actor (policy network) learns to select actions, while the critic (value network) learns to estimate the value of states. The critic's estimate of the temporal difference (TD) error guides the actor's policy updates.
The TD error is the difference between the bootstrapped one-step return (the immediate reward plus the discounted value of the next state) and the critic's current estimate of the current state's value. It measures how much better or worse an action turned out than expected: if the TD error is positive, the action was better than expected, and the actor increases the probability of taking that action in the future.
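As a small numeric illustration (with made-up values): suppose the critic currently values the agent's cell at 2.0, the cell it moves into at 3.0, the step reward is -0.1, and the discount factor is 0.99:

# Hypothetical TD error calculation
reward = -0.1
state_value = 2.0        # critic's estimate of the current state
next_state_value = 3.0   # critic's estimate of the next state
discount_factor = 0.99

td_error = reward + discount_factor * next_state_value - state_value
print(td_error)  # ≈ 0.87 -> the move turned out better than the critic expected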
Actor-Critic methods can be more stable than pure policy gradient methods and more efficient than value-based methods, making them popular in practical applications.
Conclusion
Reinforcement learning offers a powerful framework for solving sequential decision-making problems. In this article, we've explored the foundations of reinforcement learning, from basic concepts to implementation details of several key algorithms.
We started with the core concepts of agents, environments, states, actions, and rewards, which form the basis of all reinforcement learning systems. We then implemented a simple grid world environment and a random agent to illustrate these concepts.
Moving to more sophisticated approaches, we explored Q-learning, a fundamental value-based algorithm that learns a table of state-action values to guide decision-making. We then extended this to Deep Q-Networks, which use neural networks to approximate the Q-function, enabling reinforcement learning in complex environments.
We also examined policy gradient methods like REINFORCE, which learn policies directly by optimizing the expected rewards. Finally, we implemented an Actor-Critic agent that combines value-based and policy-based approaches for more stable and efficient learning.
As you continue your journey in reinforcement learning, you'll encounter more advanced topics like Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO), and Soft Actor-Critic (SAC). You'll also explore applications in various domains, from robotics and autonomous systems to games and recommendation systems.
The field of reinforcement learning is vast and rapidly evolving, with new algorithms and applications emerging regularly. By understanding the foundations covered in this article, you're well-equipped to explore these advanced topics and contribute to this exciting field.