Chapter 1: The Artificial Neuron
An artificial neuron is the fundamental building block of every neural network. Its design takes inspiration from biological neurons, but its mathematical form is far simpler and well suited to computation. At its core, an artificial neuron receives one or more numerical inputs, multiplies each input by a corresponding weight, sums the results, adds a bias term, and then passes the sum through a nonlinear activation function. In this section we will examine the mathematical foundation of that process and see how it maps directly to code.
Consider a neuron with inputs x₁, x₂, …, xₙ and learnable parameters w₁, w₂, …, wₙ (the weights) and b (the bias). The neuron computes the quantity
z = w₁·x₁ + w₂·x₂ + … + wₙ·xₙ + b
This expression can also be written in compact vector-matrix notation as
z = wᵀ·x + b
where w and x are n-dimensional column vectors. The scalar z is often called the pre-activation, because before the neuron can produce its final output it must pass z through an activation function σ, yielding
a = σ(z)
The bias term b allows the activation function to be shifted left or right, enabling the network to fit patterns that do not pass through the origin. The weights wᵢ determine the strength and direction of each input’s influence on the output. During training, the network adjusts these parameters to minimize some loss function.
In practical code, we implement a single neuron in PyTorch as follows:
import torch
# Define example inputs and parameters
x = torch.tensor([0.5, -1.2, 3.3], dtype=torch.float32)
w = torch.tensor([0.1, 0.4, -0.7], dtype=torch.float32)
b = torch.tensor(0.2, dtype=torch.float32)
# Compute the weighted sum z = w·x + b
z = torch.dot(w, x) + b
# Apply an activation function (for now, identity)
a = z
Each line in this snippet plays a clear role. The first line imports PyTorch, a library that makes tensor computations fast and automatically tracks operations for gradient calculation. The next three lines create one-dimensional tensors for the inputs and weights and a zero-dimensional (scalar) tensor for the bias. Specifying dtype=torch.float32 ensures that all computations use 32-bit floating point arithmetic, which balances precision and performance on modern hardware.
The expression torch.dot(w, x) computes the dot product of the tensors w and x, performing the sum of element-wise products (w₁·x₁ + w₂·x₂ + w₃·x₃). Adding b broadcasts the scalar bias across the result, yielding the complete pre-activation value z. At this stage the neuron’s output is simply z, but as soon as we introduce a nonlinear activation function σ, we will replace the line a = z with a = σ(z).
Behind the scenes, PyTorch records the operations on these tensors in a computation graph, provided the participating tensors are created with requires_grad=True. When we later call z.backward(), PyTorch will traverse this graph and compute the partial derivatives ∂z/∂wᵢ and ∂z/∂b automatically, as the sketch below illustrates. This automatic differentiation is what makes it possible to optimize millions of parameters in large networks without manual derivative calculations.
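Note that gradients are only tracked for tensors created with requires_grad=True. A minimal sketch of the same neuron with gradient tracking enabled (the values mirror the snippet above):
import torch
x = torch.tensor([0.5, -1.2, 3.3], dtype=torch.float32)
w = torch.tensor([0.1, 0.4, -0.7], dtype=torch.float32, requires_grad=True)
b = torch.tensor(0.2, dtype=torch.float32, requires_grad=True)
z = torch.dot(w, x) + b   # pre-activation, recorded in the computation graph
z.backward()              # compute ∂z/∂w and ∂z/∂b
print(w.grad)             # equals x, since ∂z/∂wᵢ = xᵢ
print(b.grad)             # equals 1, since ∂z/∂b = 1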
The mathematical simplicity of a single neuron belies its expressive power once many neurons are connected into layers and networks. By adjusting its weights and bias, a neuron can learn to detect linear combinations of its inputs. When we combine dozens or hundreds of such neurons and interleave them with nonlinear activations, the network can approximate arbitrarily complex functions.
In the next chapter we will see how to assemble many artificial neurons into layers, define the data flow between them, and implement the forward pass of a complete network in PyTorch.
Chapter 2: Layers and Feed-Forward Networks
When multiple artificial neurons are grouped together so that they all receive the same set of inputs and produce their outputs in parallel, we call that collection a layer. In a single layer every neuron applies its own weight vector and bias term to the incoming signal, but all neurons share the same input. By arranging layers in sequence we create a network that can transform simple numerical inputs into arbitrarily rich representations.
Mathematically, if we denote the activations of layer ℓ−1 by the column vector aℓ−1 and the weights of layer ℓ by a matrix Wℓ whose rows are each neuron’s weight vector, then the pre-activation vector zℓ of layer ℓ is given by the matrix-vector product
zℓ = Wℓ · aℓ−1 + bℓ
where bℓ is the bias vector for layer ℓ. We then apply an element-wise nonlinear activation function σ to obtain the layer’s output:
aℓ = σ(zℓ)
When we feed a batch of inputs, we simply stack each input as a column (or row, depending on convention) of a matrix X and replace the vector operations with matrix multiplications on the batch, yielding a highly efficient vectorized computation.
In code, PyTorch makes it straightforward to express layers and their connections. The built-in class torch.nn.Linear encapsulates both the weight matrix and bias vector and wires them into the computation graph for automatic differentiation. Below is a minimal example of a small feed-forward network with one hidden layer. Every line is explained in detail.
import torch
import torch.nn as nn
class SimpleMLP(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super(SimpleMLP, self).__init__()
# Define a fully connected layer mapping input_dim to hidden_dim
self.fc1 = nn.Linear(input_dim, hidden_dim)
# Choose a nonlinear activation function for the hidden layer
self.relu = nn.ReLU()
# Define a second fully connected layer mapping hidden_dim to output_dim
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
# Apply the first linear transformation
x = self.fc1(x)
# Apply the nonlinear activation function element-wise
x = self.relu(x)
# Apply the second linear transformation to produce the output
x = self.fc2(x)
return x
The import statements bring in torch for tensor operations and torch.nn as the namespace for neural-network building blocks. The class SimpleMLP inherits from nn.Module, which is PyTorch’s base class for all neural network components. Calling super(SimpleMLP, self).__init__() ensures that the internal machinery of Module is initialized properly.
Inside the constructor, self.fc1 and self.fc2 are instances of nn.Linear. Each Linear layer allocates a weight matrix of shape (output_features, input_features) and a bias vector of length output_features. By storing these layers as attributes of the Module, PyTorch will automatically register their parameters so that when we call model.parameters(), all weight and bias tensors are returned in a single iterable.
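To see this registration in action, we can instantiate the network with illustrative dimensions and list its parameters; the exact sizes below are only an example:
model = SimpleMLP(input_dim=10, hidden_dim=50, output_dim=1)
# named_parameters() yields every registered weight matrix and bias vector
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
# fc1.weight (50, 10), fc1.bias (50,), fc2.weight (1, 50), fc2.bias (1,)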
The choice of ReLU for self.relu reflects its widespread use: the rectified linear unit returns zero for any negative input and returns the input itself for any nonnegative input. This simple nonlinear operation introduces the nonlinearity required for the network to approximate complex functions.
The forward method defines how an input tensor x is transformed as it flows through the network. If x has shape (batch_size, input_dim), then after self.fc1(x) it will have shape (batch_size, hidden_dim), and after applying ReLU it retains the same shape but with negative values set to zero. The final call to self.fc2 produces an output of shape (batch_size, output_dim). By returning x at the end of forward, we make it possible to call the network as if it were a function:
batch_of_inputs = torch.randn(32, 10)
outputs = model(batch_of_inputs)
In this example, batch_of_inputs is a tensor of shape (32, 10), representing thirty-two samples each with ten features. The call model(batch_of_inputs) invokes forward under the hood, and with a model constructed with output_dim=1 (as in the sketch above), outputs will have shape (32, 1), giving one prediction per sample.
Underneath, PyTorch constructs a computation graph that records each operation—matrix multiplications, additions, and nonlinearities—so that when we compute a loss based on outputs and then call loss.backward(), the gradients of every parameter in fc1 and fc2 will be computed automatically. These gradients can then be used by optimizers to update the weight matrices and bias vectors.
By stacking more layers—perhaps alternating Linear and activation layers several times—and by varying the hidden dimensions, one can create deeper and wider networks capable of learning highly complex mappings. In the next chapter we will explore how these batched, vectorized operations constitute the forward pass in full generality, and how activation choices interact with network depth.
Chapter 3: Forward Propagation and Activation Functions
Forward propagation, also known as the forward pass, refers to the process of computing the output of a neural network by successively applying linear transformations and nonlinear activation functions to the input data. For a single layer ℓ, if we denote the activations of the previous layer by a^(ℓ–1), the weight matrix by W^(ℓ), and the bias vector by b^(ℓ), then the layer computes the pre-activation vector z^(ℓ) as
z^(ℓ) = W^(ℓ) · a^(ℓ–1) + b^(ℓ)
After computing z^(ℓ), the layer applies an element-wise activation function σ to produce the output activations
a^(ℓ) = σ(z^(ℓ)).
By composing these operations for every layer from the input to the output, the network transforms an initial input vector into a final output vector that represents a prediction or feature embedding.
To illustrate these computations concretely, consider a batch of input vectors represented as a matrix X with shape (batch_size, input_dim). A fully connected layer in PyTorch can be implemented manually as follows, where W has shape (output_dim, input_dim) and b has shape (output_dim):
import torch
batch_size = 32
input_dim = 10
output_dim = 50
# Simulate a batch of inputs
X = torch.randn(batch_size, input_dim)
# Initialize weight matrix and bias vector
W = torch.randn(output_dim, input_dim)
b = torch.randn(output_dim)
# Compute the pre-activation for the batch: Z = X @ W^T + b
Z = X.matmul(W.t()) + b
# Apply a nonlinearity (for example, ReLU) to obtain activations A
A = torch.relu(Z)
In this snippet the tensor X contains random values for demonstration. The expression W.t() denotes the transpose of W, which has shape (input_dim, output_dim). The batch matrix multiplication X.matmul(W.t()) produces a new tensor Z of shape (batch_size, output_dim), since each of the batch_size rows of X is multiplied by Wᵀ. Adding b to Z broadcasts the bias vector across every row, yielding the complete pre-activation values. Lastly, torch.relu performs an element-wise rectified linear unit on Z to produce A.
Activation functions introduce the nonlinearity that enables neural networks to learn complex patterns. The logistic sigmoid function, defined by
σ(z) = 1 / (1 + exp(−z)),
squashes any real-valued input into the range between zero and one. Its derivative can be expressed in terms of its output as
σ′(z) = σ(z) · (1 − σ(z)).
While sigmoid was historically popular, its gradient becomes very small when |z| is large, which can slow learning considerably in deep networks.
In PyTorch, you can compute a sigmoid activation for a tensor Z with:
A_sigmoid = torch.sigmoid(Z)
where torch.sigmoid applies the element-wise logistic function to each entry of Z, producing a new tensor with values between zero and one.
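The derivative identity σ′(z) = σ(z)·(1 − σ(z)) can be checked directly with autograd; the following sketch uses a fresh scalar rather than the batch tensor Z:
import torch
z = torch.tensor(0.7, requires_grad=True)
s = torch.sigmoid(z)
s.backward()
# Autograd's gradient matches the closed-form expression
print(z.grad.item())
print((s * (1 - s)).item())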
The hyperbolic tangent function, defined by
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),
maps real inputs to the range between negative one and one. Its derivative is given by 1 − tanh(z)^2. Because tanh is zero-centered, it often leads to faster convergence than sigmoid, but it still suffers from vanishing gradients for large positive or negative inputs.
In PyTorch you can compute a tanh activation with:
A_tanh = torch.tanh(Z)
The most widely used activation in modern deep networks is the rectified linear unit, or ReLU, defined as
ReLU(z) = max(0, z).
Its derivative is zero whenever z is negative and one whenever z is positive. This simple piecewise linear form avoids saturation for positive inputs, which significantly mitigates the vanishing gradient problem. However, neurons can “die” during training if their pre-activations become negative and remain so, resulting in permanently zero gradients.
You can apply a ReLU activation in PyTorch either by using torch.relu or by creating an instance of the module:
A_relu = torch.relu(Z)
import torch.nn as nn
relu_layer = nn.ReLU()
A_relu_mod = relu_layer(Z)
To address the “dying ReLU” issue, leaky ReLU introduces a small slope α for negative inputs, defined as
LeakyReLU(z) = max(α·z, z).
Its derivative is α when z is negative and one when z is positive. A typical choice for α is 0.01. In code, you can construct and apply a leaky ReLU layer in PyTorch with:
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
A_leaky = leaky_relu(Z)
When dealing with classification over multiple classes, the softmax function is used to convert a vector of arbitrary real values into a probability distribution. For a vector z with components z_i,
softmax(z)_i = exp(z_i) / Σ_j exp(z_j).
The Jacobian matrix of softmax has entries ∂σ_i/∂z_j = σ_i (δ_{ij} − σ_j). In practice, one typically pairs softmax with the cross-entropy loss, which leads to a numerically stable and simpler combined gradient. In PyTorch, softmax can be applied along a specified dimension. For example, for a batch of logits named logits with shape (batch_size, num_classes), you can write:
import torch.nn.functional as F
probabilities = F.softmax(logits, dim=1)
This computes the exponentials of each row in logits, normalizes by the row sums, and returns a tensor of the same shape containing values between zero and one that sum to one along each row.
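When implementing softmax by hand, one normally subtracts the row-wise maximum before exponentiating to avoid overflow; F.softmax performs an equivalent stabilization internally. A small sketch (the tensor name logits is illustrative):
import torch
import torch.nn.functional as F
logits = torch.randn(4, 5) * 50   # deliberately large values
shifted = logits - logits.max(dim=1, keepdim=True).values
manual = torch.exp(shifted) / torch.exp(shifted).sum(dim=1, keepdim=True)
print(torch.allclose(manual, F.softmax(logits, dim=1)))  # True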
Beyond these classical activations, more recent functions such as Swish, defined as z · sigmoid(z), and GELU, the Gaussian error linear unit, have gained popularity for certain architectures because they provide smoother gradients and improved performance on tasks such as language modeling. Although these functions are available in libraries like PyTorch (for instance via the nn.GELU module), their additional computational cost means that ReLU and its variants remain the default choice for many practitioners.
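As a brief illustration, Swish follows directly from its definition, and GELU is available as a module; the tensor below is an arbitrary example:
import torch
import torch.nn as nn
def swish(z):
    # Swish: z * sigmoid(z); PyTorch also provides this activation as nn.SiLU
    return z * torch.sigmoid(z)
gelu = nn.GELU()
Z_demo = torch.randn(4, 5)
A_swish = swish(Z_demo)
A_gelu = gelu(Z_demo)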
Having covered both the linear transformations that make up each layer’s pre-activations and the nonlinear activation functions that follow, we are now equipped to build deep neural networks that transform input data into rich internal representations. In the next chapter we will turn our attention to measuring how well a network’s predictions match the desired targets by introducing loss functions, and we will then develop the gradient-based optimization algorithms that adjust the network’s parameters.
Chapter 4: Loss Functions and Gradient Descent
To teach a neural network to perform useful work, we must quantify how well its predictions match the target values. A loss function is a scalar measure of error that the training process seeks to minimize by adjusting the network’s parameters. In this chapter we introduce the most common loss functions, derive their gradients, and then develop the basic gradient-descent update rule and its more sophisticated variants.
Mean Squared Error for Regression
When the network’s task is to predict continuous quantities, a natural choice is the mean squared error. Suppose we have a dataset of N examples, where each example i has a target value yᵢ and the network produces a prediction ŷᵢ. The mean squared error loss L is defined by
L(ŷ, y) = (1/N) Σᵢ (ŷᵢ − yᵢ)²
This expression sums the squared differences between prediction and target over all examples and divides by N to yield an average. Squaring the error penalizes large deviations more heavily than small ones, and the average makes the loss independent of dataset size.
To see how this loss drives parameter updates, we compute its derivative with respect to a single prediction ŷⱼ:
∂L/∂ŷⱼ = (2/N) (ŷⱼ − yⱼ)
During back-propagation this gradient propagates through the network, guiding each weight update towards reducing the squared error.
In PyTorch, one writes:
import torch
import torch.nn as nn
# Suppose `model` maps inputs to predictions ŷ of shape (batch_size, 1)
loss_fn = nn.MSELoss() # creates a mean squared error loss module
predictions = model(inputs) # forward pass produces ŷ
loss = loss_fn(predictions, targets)
Here nn.MSELoss() encapsulates the formula above. When we later call loss.backward(), PyTorch computes ∂L/∂parameters automatically by chaining the partial derivatives through the computational graph.
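The gradient formula (2/N)(ŷⱼ − yⱼ) can be verified on a toy example with autograd; the tensors below stand in for a model’s predictions and targets:
import torch
import torch.nn.functional as F
predictions = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
targets = torch.tensor([0.5, 2.5, 2.0])
loss = F.mse_loss(predictions, targets)            # mean reduction, N = 3
loss.backward()
print(predictions.grad)                            # autograd result
print(2.0 / 3 * (predictions.detach() - targets))  # closed-form gradient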
Cross-Entropy for Classification
When the task is classification into one of C classes, the network typically outputs a vector of logits z ∈ ℝᶜ for each example. To convert those logits into a probability distribution p, we apply the softmax function:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
If the true class for an example is c, encoded as a one-hot vector y with y_c = 1 and yⱼ = 0 for j ≠ c, then the cross-entropy loss is
L(z, y) = − Σⱼ yⱼ · log( softmax(z)ⱼ )
Because only one entry of y is nonzero, this simplifies to the negative log-probability assigned to the correct class. When using PyTorch’s nn.CrossEntropyLoss, the implementation fuses the softmax and log steps in a numerically stable way and expects raw logits and integer class indices:
import torch.nn as nn
loss_fn = nn.CrossEntropyLoss() # creates a combined log-softmax + NLL loss
logits = model(inputs) # shape (batch_size, num_classes)
loss = loss_fn(logits, class_indices)
Under the hood, the gradient of L with respect to each logit zₖ is
∂L/∂zₖ = softmax(z)ₖ − yₖ
which is exactly the difference between the predicted probability and the true label for each class.
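This relationship, too, can be confirmed numerically; the sketch below uses a single example with three classes and illustrative logit values:
import torch
import torch.nn.functional as F
logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)
target = torch.tensor([0])                 # index of the true class
loss = F.cross_entropy(logits, target)
loss.backward()
softmax = F.softmax(logits.detach(), dim=1)
one_hot = F.one_hot(target, num_classes=3).float()
print(logits.grad)          # equals softmax - one_hot
print(softmax - one_hot)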
Basic Gradient Descent
Once a loss has been specified, the simplest parameter‐update rule is gradient descent. Denote by θ a single scalar parameter (one entry of a weight matrix or bias vector) and by L(θ) the loss as a function of θ. The gradient descent rule updates θ in the direction of steepest descent:
θ ← θ − η · ∂L/∂θ
Here η > 0 is the learning rate, a hyperparameter controlling the step size. A small η yields slow but stable convergence, while a large η can cause the loss to oscillate or diverge.
In practice one distinguishes three variants: when the gradient is computed over the entire dataset before each update, the method is called batch gradient descent; when only a single example is used for each update, it is called stochastic gradient descent; and when small subsets of examples (mini-batches) are used, it is called mini-batch gradient descent. Mini-batch updates strike a balance between noisy but fast stochastic updates and stable but costly batch updates.
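Before turning to torch.optim, it is instructive to apply the update rule by hand. The sketch below minimizes the one-parameter loss L(θ) = (θ − 3)², whose minimum lies at θ = 3:
import torch
theta = torch.tensor(0.0, requires_grad=True)
eta = 0.1                                # learning rate
for step in range(100):
    loss = (theta - 3.0) ** 2            # simple quadratic loss
    loss.backward()                      # compute dL/dtheta
    with torch.no_grad():
        theta -= eta * theta.grad        # gradient-descent update
    theta.grad.zero_()                   # clear the gradient for the next step
print(theta.item())                      # approximately 3.0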
PyTorch supplies an easy way to perform gradient descent with mini-batches via the torch.optim package. For example, to use vanilla stochastic gradient descent with momentum, one writes:
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
for inputs, targets in data_loader: # data_loader yields mini-batches
optimizer.zero_grad() # clear any accumulated gradients
outputs = model(inputs) # forward pass
loss = loss_fn(outputs, targets) # compute loss
loss.backward() # back-propagate to compute gradients
optimizer.step() # update parameters in place
Calling optimizer.zero_grad() clears the gradient buffers of all parameters so that gradients from previous iterations do not accumulate. The call loss.backward() populates each parameter’s .grad attribute with the computed ∂L/∂θ, and optimizer.step() uses those gradients to update the parameters according to the chosen update rule.
Advanced Optimizers: RMSProp and Adam
While plain momentum can accelerate convergence in valleys of the loss landscape, adaptive methods adjust each parameter’s learning rate individually based on the history of its gradients. RMSProp maintains an exponentially weighted moving average of past squared gradients:
sₜ = γ·sₜ₋₁ + (1−γ)·gₜ²
θ ← θ − (η / sqrt(sₜ + ε)) · gₜ
where gₜ = ∂L/∂θ at time t, γ is typically around 0.9, and ε is a small constant for numerical stability. In PyTorch one constructs it as:
optimizer = optim.RMSprop(model.parameters(),
lr=0.001,
alpha=0.99,
eps=1e-8)
Adam combines momentum and RMSProp by maintaining both a moving average of gradients mₜ and of squared gradients vₜ, and applying bias correction:
mₜ = β₁·mₜ₋₁ + (1−β₁)·gₜ
vₜ = β₂·vₜ₋₁ + (1−β₂)·gₜ²
m̂ₜ = mₜ / (1−β₁ᵗ)
v̂ₜ = vₜ / (1−β₂ᵗ)
θ ← θ − η · ( m̂ₜ / ( sqrt(v̂ₜ) + ε ) )
with default β₁=0.9, β₂=0.999, and ε=1e-8. In code:
optimizer = optim.Adam(model.parameters(),
lr=0.001,
betas=(0.9, 0.999),
eps=1e-8)
Each of these optimizers requires tuning of its hyperparameters—learning rate, decay rates, and epsilon—to achieve the best performance on a given problem.
In the next chapter we will see how to assemble these components into a complete training loop, explore the differences between training with and without explicit mini-batches, and then introduce techniques such as dropout and weight decay to improve generalization.
Chapter 5: Training Loops, Batching, and Regularization
To teach a network to minimize its loss, we must repeatedly present data, compute predictions, measure error, propagate gradients, and update parameters. This cycle of computation forms the training loop. Depending on computational resources and problem dimensions, one may choose to process the entire dataset at once, one sample at a time, or several samples grouped into mini-batches. In this chapter we will develop a complete training loop in PyTorch, compare training with and without explicit mini-batches, and then introduce techniques that improve generalization, including dropout and weight decay.
A basic training loop in PyTorch begins by defining a data loader that yields batches of input–target pairs, instantiating an optimizer and loss function, and then iterating over epochs. Below is a complete example that uses mini-batches. Every part of the code is explained in detail.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Create synthetic data for demonstration: a feature tensor X of shape (1000, 20) and a target tensor y of shape (1000, 1)
X = torch.randn(1000, 20)
y = torch.randn(1000, 1)
dataset = TensorDataset(X, y)
# Create a DataLoader that yields batches of size 32, shuffling each epoch
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
model = SimpleMLP(input_dim=20, hidden_dim=50, output_dim=1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
num_epochs = 20
for epoch in range(num_epochs):
epoch_loss = 0.0
# Iterate over the dataset in mini-batches
for batch_inputs, batch_targets in data_loader:
# Zero out gradients accumulated from the previous step
optimizer.zero_grad()
# Compute model predictions for the current batch
batch_predictions = model(batch_inputs)
# Compute the loss between predictions and true targets
loss = loss_fn(batch_predictions, batch_targets)
# Back-propagate through the network to compute gradients
loss.backward()
# Update the model’s parameters based on the gradients
optimizer.step()
# Accumulate the loss value for reporting
epoch_loss += loss.item() * batch_inputs.size(0)
# Divide by the total number of samples to get average loss
epoch_loss /= len(dataset)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}")
In this code snippet we first wrap our feature and target tensors in a TensorDataset, which pairs each input with its corresponding label. We then create a DataLoader that will yield subsets of the data of size thirty-two in random order every epoch. Instantiating the model, loss function, and optimizer follows the patterns we have seen previously.
The outer loop runs for num_epochs full passes through the data. Inside that loop, we initialize a running sum for the epoch’s loss. Each time the DataLoader yields a batch of inputs and targets, we clear any previous gradient information by calling optimizer.zero_grad(). Computing model(batch_inputs) invokes the forward method of our network, yielding predictions. We compare those predictions to the true targets by calling the loss function, which produces a scalar tensor.
Calling loss.backward() triggers PyTorch’s automatic differentiation to compute gradients of the loss with respect to every learnable parameter in the model. Those gradients are stored in each parameter’s .grad attribute. The call to optimizer.step() then modifies the parameter values in-place according to the selected update rule (stochastic gradient descent with momentum in this case). We multiply loss.item() by the batch size to recover the sum of per-sample losses, accumulate them, and at the end divide by the dataset size to report the average loss for the epoch.
Training without explicit mini-batches is possible by treating the entire dataset as one batch. In that case, one can skip the DataLoader and write:
# Treat all data as a single batch
optimizer.zero_grad()
predictions = model(X)
loss = loss_fn(predictions, y)
loss.backward()
optimizer.step()
Although this full-batch approach yields the true gradient at each step, it can be inefficient when the dataset is large and may require more memory than is available. Conversely, using single-sample (stochastic) updates by setting batch_size=1 in the DataLoader can introduce high variance in the gradient estimate, leading to noisy convergence that nonetheless can escape shallow local minima more easily. Mini-batches strike a pragmatic balance by reducing variance while staying within memory constraints.
Even when following this training procedure, large neural networks can overfit the training data, memorizing noise instead of learning patterns that generalize to new examples. To mitigate overfitting, one may apply regularization techniques.
Weight decay, which for plain stochastic gradient descent is mathematically equivalent to L2 regularization, adds a penalty proportional to the squared norm of the weights to the loss. In practice, one requests weight decay through the optimizer rather than modifying the loss by hand. For example, to apply a coefficient of 1e-4 to the weights but not the biases, you write:
optimizer = optim.SGD(
[
{ 'params': model.fc1.weight, 'weight_decay': 1e-4 },
{ 'params': model.fc2.weight, 'weight_decay': 1e-4 },
{ 'params': model.fc1.bias, 'weight_decay': 0 },
{ 'params': model.fc2.bias, 'weight_decay': 0 }
],
lr=0.01,
momentum=0.9
)
By specifying weight_decay on each parameter group, the optimizer adds weight_decay * θ to the gradient of each weight, effectively performing the update rule
θ ← θ − η ( ∂L/∂θ + λ · θ )
for each weight θ, where λ is the weight-decay coefficient.
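If applying the same decay to every parameter, biases included, is acceptable (a common simplification), the parameter groups can be replaced by a single argument:
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)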
Another powerful regularization technique is dropout. During training, dropout randomly sets a fraction p of each layer’s activations to zero on every forward pass, preventing co-adaptation of neurons. PyTorch uses inverted dropout: the surviving activations are scaled by 1/(1−p) during training, so at test time dropout is simply disabled and the activations pass through unchanged. One inserts dropout layers into the model definition; for example, to add dropout after the first hidden layer:
import torch.nn as nn
class MLPWithDropout(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim, p=0.5):
super().__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.dropout = nn.Dropout(p=p)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
x = self.fc1(x)
x = self.relu(x)
# Randomly zero a fraction p of elements during training
x = self.dropout(x)
x = self.fc2(x)
return x
When the model is in training mode—ensured by calling model.train()—each forward pass samples a new random binary mask that zeroes out p fraction of elements in the hidden representation. Calling model.eval() before evaluating on validation data disables dropout and uses the full set of activations.
Beyond weight decay and dropout, early stopping can guard against overfitting by monitoring performance on a hold-out validation set and halting training when the validation loss stops improving. One typically saves the model’s parameters whenever the validation loss decreases and terminates training if no improvement occurs for a specified number of epochs.
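A minimal early-stopping sketch is shown below; train_one_epoch, evaluate, and val_loader are hypothetical helpers standing in for a training step and a validation pass, and the patience of five epochs is only an example:
best_val_loss = float('inf')
patience, epochs_without_improvement = 5, 0
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, data_loader, loss_fn, optimizer)
    val_loss = evaluate(model, val_loader, loss_fn)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), 'best_model.pt')   # keep the best parameters
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early after epoch {epoch+1}")
            break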
With these regularization strategies in place, one can train deeper and wider networks while maintaining robust generalization. In the next chapter we will build on this foundation to explore convolutional neural networks, recurrent networks, and hybrid architectures that combine multiple layer types.
Chapter 6: Convolutional Neural Networks
Convolutional neural networks, often abbreviated CNNs, are designed to process data that have a grid-like topology such as images. Instead of fully connecting every input to every output, a convolutional layer links each output to a localized region of the input. This localized connection exploits the spatial structure of the data to reduce the number of parameters and to capture patterns that remain meaningful when shifted across the input.
At the heart of a convolutional layer lies the discrete two-dimensional convolution operation. If we denote the input image by I and a learnable filter or kernel by K, then the convolution at spatial location (i, j) is defined by the double sum:
(I * K)[i, j] = Σ_{m=0}^{k_h−1} Σ_{n=0}^{k_w−1} K[m, n] · I[i + m, j + n]
Here k_h and k_w are the height and width of the kernel. The result is a feature map that highlights wherever the pattern encoded by the kernel appears in the image.
In PyTorch a 2D convolutional layer is provided by the class torch.nn.Conv2d. This module maintains a set of filters with shape (out_channels, in_channels, k_h, k_w) and applies them to a batched input tensor of shape (batch_size, in_channels, height, width). The convolution also uses a stride parameter to skip positions and a padding parameter to include a border of zeros around the input. The output height and width are determined by
H_out = floor((H_in + 2·padding − dilation·(k_h − 1) − 1) / stride + 1)
W_out = floor((W_in + 2·padding − dilation·(k_w − 1) − 1) / stride + 1)
Below is an example that constructs and applies a convolutional layer to a batch of RGB images that are 32×32 pixels each.
import torch
import torch.nn as nn
# Create a batch of eight RGB images of size 32×32
batch_size, in_channels, H, W = 8, 3, 32, 32
images = torch.randn(batch_size, in_channels, H, W)
# Define a convolutional layer with 16 output channels, a 3×3 kernel, stride one, and padding one
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
# Apply the convolution to the images
features = conv(images)
# features has shape (8, 16, 32, 32)
When the stride is one and padding equals (kernel_size−1)/2 for odd kernel sizes, the output feature map has the same spatial dimensions as the input. Preserving dimensions in this way is common in early stages of image-processing networks. By contrast, a stride greater than one or the use of pooling layers will downsample the spatial dimensions. A two-by-two max-pooling layer with stride two, for example, halves both the height and width of its input.
Pooling introduces a form of local translational invariance and reduces computational cost in deeper layers. A max-pooling operation over a p×p window replaces each window by its maximum value, whereas average pooling replaces the window by its mean value. In code one creates a pooling layer by instantiating torch.nn.MaxPool2d or torch.nn.AvgPool2d.
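A short sketch of the downsampling effect, applied to the features tensor produced by the convolution example above:
pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(features)      # features has shape (8, 16, 32, 32)
print(pooled.shape)          # torch.Size([8, 16, 16, 16])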
Convolutional layers can be stacked sequentially to build deep feature hierarchies. Early layers learn to detect simple patterns such as edges and textures, and later layers combine those patterns into more abstract representations such as shapes and objects. After several convolutional and pooling stages, the resulting feature maps are often flattened into a vector and passed through fully connected layers to perform classification or regression.
A minimal convolutional network in PyTorch might look like this. Every line of code captures one aspect of the architecture.
import torch
import torch.nn as nn
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super(SimpleCNN, self).__init__()
# First convolutional block: 3→16 channels, kernel 3×3, padding to preserve size
self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
# Normalize each feature map across the batch to zero mean, unit variance
self.bn1 = nn.BatchNorm2d(num_features=16)
# Introduce nonlinearity
self.relu = nn.ReLU()
# Second convolutional block: 16→32 channels
self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
self.bn2 = nn.BatchNorm2d(num_features=32)
# Downsampling by a factor of two
self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Fully connected layer mapping flattened features to class scores
self.fc = nn.Linear(in_features=32 * 16 * 16, out_features=num_classes)
def forward(self, x):
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.conv2(x)
x = self.bn2(x)
x = self.relu(x)
x = self.pool(x)
x = x.view(x.size(0), -1)
scores = self.fc(x)
return scores
model = SimpleCNN(num_classes=10)
input_tensor = torch.randn(8, 3, 32, 32)
output_scores = model(input_tensor)
# output_scores has shape (8, 10)
The first convolutional block transforms the three-channel input into sixteen feature maps of the same spatial size. Batch normalization then stabilizes the distribution of activations, which often accelerates training. A ReLU activation then supplies the required nonlinearity. The second block repeats this pattern and, after max pooling, reduces each feature map’s height and width from 32 to 16. Finally, all feature maps are reshaped into a two-dimensional tensor of shape (batch_size, 32×16×16) and passed through a linear layer to produce one score per class.
Convolutional neural networks form the basis of state-of-the-art models in computer vision and beyond.
Chapter 7: Recurrent Neural Networks
Recurrent neural networks are designed to process sequential data by maintaining a hidden state that evolves over time. Unlike feed-forward networks, which assume each input is independent of all others, recurrent networks allow information to persist across timesteps. At each step t, a recurrent cell receives both the new input vector xₜ and the previous hidden state hₜ₋₁, computes a new hidden state hₜ according to a learned transformation, and (optionally) produces an output yₜ.
In its simplest form, a vanilla RNN cell computes a pre-activation vector zₜ as the sum of an input transformation and a hidden-state transformation plus a bias:
zₜ = Wₓ · xₜ + Wₕ · hₜ₋₁ + b
The new hidden state is then obtained by applying a nonlinear activation σ element-wise:
hₜ = σ(zₜ)
If an output yₜ is required at each timestep, one may add a read-out layer:
yₜ = V · hₜ + c
where V and c are an output weight matrix and bias vector respectively.
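Before using the built-in module, it may help to see one timestep of this recurrence written out directly; all names and sizes below are illustrative:
import torch
input_size, hidden_size = 20, 50
x_t = torch.randn(input_size)            # input at time t
h_prev = torch.zeros(hidden_size)        # previous hidden state
W_x = torch.randn(hidden_size, input_size)
W_h = torch.randn(hidden_size, hidden_size)
b = torch.zeros(hidden_size)
z_t = W_x @ x_t + W_h @ h_prev + b       # pre-activation
h_t = torch.tanh(z_t)                    # new hidden state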
In PyTorch, the class torch.nn.RNN encapsulates this behavior and handles stacking multiple layers and batches seamlessly. The following example shows how to create a single-layer RNN cell, feed it a batch of sequences, and extract the final hidden state.
import torch
import torch.nn as nn
# Suppose we have sequences of length 100, each element is a 20-dimensional vector,
# and we process them in batches of size 16.
seq_len, batch_size, input_size = 100, 16, 20
hidden_size = 50
# Create a random batch of input sequences: shape (seq_len, batch_size, input_size)
inputs = torch.randn(seq_len, batch_size, input_size)
# Instantiate a one-layer RNN with tanh activation (the default)
rnn = nn.RNN(input_size=input_size,
hidden_size=hidden_size,
num_layers=1,
nonlinearity='tanh',
batch_first=False)
# Initialize the hidden state: shape (num_layers, batch_size, hidden_size)
h0 = torch.zeros(1, batch_size, hidden_size)
# Forward propagate through the RNN
outputs, hn = rnn(inputs, h0)
# `outputs` has shape (seq_len, batch_size, hidden_size)
# `hn` is the hidden state at the final timestep, shape (1, batch_size, hidden_size)
Every line of this snippet plays a clear role. Creating inputs simulates a batch of time-series data. The RNN module allocates two parameter matrices: one of shape (hidden_size, input_size) for Wₓ and one of shape (hidden_size, hidden_size) for Wₕ, plus two bias vectors of length hidden_size (one for the input transformation and one for the hidden transformation, which together play the role of b above). When we call the module on inputs and the initial state h0, it iterates over the 100 timesteps, computing the recurrence relation at each step. The output tensor collects all intermediate hidden states, while hn returns only the final one.
Although vanilla RNNs are conceptually simple, they struggle to learn long-range dependencies because gradients flowing backward through many timesteps tend to vanish or explode. To mitigate this, gated recurrent units such as LSTM and GRU introduce internal gates that control how much of the input and previous state should influence the new state.
The Long Short-Term Memory (LSTM) cell maintains both a hidden state hₜ and a cell state cₜ. It uses three gates—forget gate fₜ, input gate iₜ, and output gate oₜ—computed as sigmoid activations, and a candidate cell update ĉₜ computed with a tanh activation. Concretely:
fₜ = σ( W_f · xₜ + U_f · hₜ₋₁ + b_f )
iₜ = σ( W_i · xₜ + U_i · hₜ₋₁ + b_i )
oₜ = σ( W_o · xₜ + U_o · hₜ₋₁ + b_o )
ĉₜ = tanh( W_c · xₜ + U_c · hₜ₋₁ + b_c )
The cell state then updates by combining the previous cell state and the candidate, weighted by the forget and input gates:
cₜ = fₜ * cₜ₋₁ + iₜ * ĉₜ
Finally the new hidden state is produced by applying the output gate to the cell state’s nonlinearity:
hₜ = oₜ * tanh(cₜ)
PyTorch’s torch.nn.LSTM encapsulates all these computations under the hood. Here is an example of its usage for a batch of sequences:
import torch
import torch.nn as nn
# Sequence parameters as before
seq_len, batch_size, input_size = 100, 16, 20
hidden_size, num_layers = 50, 2
# Random input batch
inputs = torch.randn(seq_len, batch_size, input_size)
# Instantiate a two-layer LSTM
lstm = nn.LSTM(input_size=input_size,
hidden_size=hidden_size,
num_layers=num_layers,
batch_first=False)
# Initialize hidden and cell states: each of shape (num_layers, batch_size, hidden_size)
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)
# Forward pass through the LSTM
outputs, (hn, cn) = lstm(inputs, (h0, c0))
# `outputs` has shape (seq_len, batch_size, hidden_size)
# `hn` and `cn` each have shape (num_layers, batch_size, hidden_size)
The Gated Recurrent Unit (GRU) simplifies the LSTM by combining the forget and input gates into a single update gate zₜ, and by merging the cell and hidden states. Its equations are:
zₜ = σ( W_z · xₜ + U_z · hₜ₋₁ + b_z )
rₜ = σ( W_r · xₜ + U_r · hₜ₋₁ + b_r )
ħₜ = tanh( W · xₜ + U · ( rₜ * hₜ₋₁ ) + b )
hₜ = (1 − zₜ) * hₜ₋₁ + zₜ * ħₜ
In PyTorch, torch.nn.GRU provides this functionality with the same interface as nn.LSTM, except that its second return value is a single hidden-state tensor rather than a (hidden, cell) pair.
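A brief usage sketch, reusing the sequence dimensions from the LSTM example above:
gru = nn.GRU(input_size=input_size,
             hidden_size=hidden_size,
             num_layers=num_layers,
             batch_first=False)
h0 = torch.zeros(num_layers, batch_size, hidden_size)
outputs, hn = gru(inputs, h0)
# `outputs`: (seq_len, batch_size, hidden_size); `hn`: (num_layers, batch_size, hidden_size)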
When working with variable-length sequences, one often uses torch.nn.utils.rnn.pack_padded_sequence and pad_packed_sequence to efficiently batch and process sequences without wasting computation on padding tokens.
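The sketch below illustrates that workflow on a small zero-padded batch; the tensor contents and lengths are invented for demonstration:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
padded_batch = torch.randn(3, 5, 20)      # 3 sequences padded to length 5
lengths = torch.tensor([5, 3, 2])         # true lengths of the sequences
gru = nn.GRU(input_size=20, hidden_size=50, batch_first=True)
packed = pack_padded_sequence(padded_batch, lengths,
                              batch_first=True, enforce_sorted=False)
packed_out, hn = gru(packed)              # padded positions are skipped
outputs, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
# outputs: (3, 5, 50); out_lengths: tensor([5, 3, 2])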
Recurrent networks excel at tasks such as language modeling, time-series forecasting, and sequence-to-sequence translation, but in many applications they have been surpassed by attention-based models. Before turning to those, one may combine convolutional and recurrent layers to process data with both spatial and temporal structure. For example, in video classification a CNN can extract frame-level features that are then fed into an LSTM to capture motion dynamics.
Chapter 8: Hybrid Architectures
Data in the real world often have multiple dimensions of structure. Video, for example, exhibits both spatial patterns within each frame and temporal patterns across frames. Hybrid architectures combine the strengths of convolutional, recurrent, and attention-based layers to model such complex data. In this chapter we illustrate two common patterns: convolution followed by recurrence, and convolution combined with self-attention.
A classic hybrid model for video classification first applies a stack of convolutional layers to each frame independently, producing a per-frame feature vector that captures spatial structure. These per-frame vectors are then fed into a recurrent network that models temporal dependencies. Concretely, suppose we have a batch of videos represented as a tensor of shape (batch_size, seq_len, channels, height, width). We can first reshape and process all frames through a 2D CNN, then restore the temporal axis and apply an LSTM:
import torch
import torch.nn as nn
class CNNLSTM(nn.Module):
def __init__(self, num_classes, cnn_out_dim=128, hidden_dim=64):
super(CNNLSTM, self).__init__()
# A simple CNN that maps (C,H,W) to cnn_out_dim
self.cnn = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2), # halves H,W
nn.Conv2d(32, 64, 3, padding=1),
nn.ReLU(),
nn.AdaptiveAvgPool2d((1,1)) # global feature
)
# Linear layer to flatten CNN output
self.fc = nn.Linear(64, cnn_out_dim)
# LSTM to model temporal sequence of frame features
self.lstm = nn.LSTM(input_size=cnn_out_dim,
hidden_size=hidden_dim,
num_layers=1,
batch_first=True)
# Final classifier mapping LSTM output to class logits
self.classifier = nn.Linear(hidden_dim, num_classes)
def forward(self, videos):
# videos: (batch, seq_len, C, H, W)
b, t, c, h, w = videos.shape
# Merge batch and time for CNN processing
frames = videos.view(b * t, c, h, w)
features = self.cnn(frames) # (b*t, 64, 1, 1)
features = features.view(b * t, 64)
features = self.fc(features) # (b*t, cnn_out_dim)
# Restore batch and time dimensions
features = features.view(b, t, -1) # (b, t, cnn_out_dim)
# LSTM expects (batch, seq_len, feature)
lstm_out, _ = self.lstm(features) # (b, t, hidden_dim)
# Use the final LSTM hidden state for classification
final = lstm_out[:, -1, :] # (b, hidden_dim)
logits = self.classifier(final) # (b, num_classes)
return logits
# Example usage
model = CNNLSTM(num_classes=10)
dummy_videos = torch.randn(4, 16, 3, 64, 64) # batch of 4, 16 frames each
outputs = model(dummy_videos) # (4, 10)
In this code the adaptive average pooling reduces each feature map to a single number, turning the spatial map into a vector of length 64. The linear layer projects that vector into a 128-dimensional feature for each frame. By reshaping the tensor back to (batch, seq_len, feature_dim), we feed the frame features into an LSTM which captures how patterns evolve over time. Finally we take the LSTM’s last output and pass it through a linear classifier to produce class scores.
An alternative hybrid approach replaces the recurrent network with an attention mechanism. After extracting feature maps with a CNN, one can treat each spatial location (or each frame) as a “token” and apply self-attention to model long-range dependencies. The core of self-attention is the scaled dot-product formula, which for a set of query, key, and value vectors Q, K, and V computes:
Attention(Q, K, V) = softmax( (Q·Kᵀ) / √d_k ) · V
Here d_k is the dimension of the key vectors, and the softmax produces attention weights that sum to one for each query. In PyTorch one can implement a single multi-head attention layer as follows:
import torch.nn.functional as F
class CNNAttention(nn.Module):
def __init__(self, cnn_out_dim=128, num_heads=4):
super().__init__()
# CNN feature extractor (similar to before)
self.cnn = nn.Sequential(
nn.Conv2d(3, 64, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(64, cnn_out_dim, 3, padding=1),
nn.ReLU()
)
# Multi-head attention layer
self.attn = nn.MultiheadAttention(embed_dim=cnn_out_dim,
num_heads=num_heads,
batch_first=True)
# Classifier
self.classifier = nn.Linear(cnn_out_dim, 10)
def forward(self, images):
# images: (batch, C, H, W)
features = self.cnn(images) # (b, cnn_out_dim, H', W')
b, d, h, w = features.shape
# Flatten spatial dimensions into sequence
seq = features.view(b, d, h*w).permute(0, 2, 1)
# Self-attention over spatial tokens
attn_out, _ = self.attn(seq, seq, seq) # (b, h*w, d)
# Pool across tokens by averaging
pooled = attn_out.mean(dim=1) # (b, d)
logits = self.classifier(pooled) # (b, 10)
return logits
# Example usage
model = CNNAttention()
dummy_images = torch.randn(4, 3, 32, 32)
outputs = model(dummy_images) # (4,10)
This hybrid design uses the convolutional layers to produce a spatial grid of feature vectors. By flattening the grid into a sequence of length H′·W′, we treat each location as a token. The nn.MultiheadAttention module computes multiple sets of query, key, and value projections, then concatenates and projects them back to the original dimension, enabling the model to capture diverse relationships among spatial regions. Averaging across tokens yields a global representation that the classifier can use.
Hybrid architectures allow one to exploit locality, sequence, and global context in a single model. They power applications ranging from video understanding to image captioning and beyond.
Chapter 9: The Future of Neural Networks
Neural network research advances at a rapid pace, and several trends promise to shape the next generation of models. Attention-only architectures, pioneered by the Transformer, have already displaced recurrence in domains such as natural language processing and are now being applied to vision. The core innovation of the Transformer is to stack layers of multi-head self-attention and feed-forward networks, eliminating convolutions and recurrence entirely. In PyTorch one can instantiate a Transformer encoder layer with:
import torch.nn as nn
encoder_layer = nn.TransformerEncoderLayer(d_model=512,
nhead=8,
dim_feedforward=2048)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
# Input shape: (seq_len, batch, d_model)
src = torch.randn(100, 32, 512)
output = transformer_encoder(src) # (100, 32, 512)
Self-supervised learning is another major trend. By pretraining models on massive unlabeled datasets using tasks such as masked token prediction or contrastive learning, one can learn representations that transfer effectively to downstream tasks with limited labeled data. Examples include BERT in language and SimCLR in vision.
Graph neural networks generalize convolution to arbitrary graph structures by aggregating information from a node’s neighbors. Their layer update takes the form
hᵢ′ = σ( W·hᵢ + Σ_{j∈N(i)} U·hⱼ + b )
allowing applications in chemistry, social networks, and combinatorial optimization.
Automated neural architecture search uses reinforcement learning or evolutionary algorithms to discover optimal network topologies. Techniques such as NASNet and EfficientNet have yielded models that outperform human-designed architectures under given compute constraints.
Continual learning and meta-learning aim to equip networks with the ability to learn new tasks without forgetting previous ones, or to adapt quickly to novel tasks with few examples.
Finally, interpretability and reliability remain critical. Methods for explaining network decisions—such as saliency maps, SHAP values, and concept activations—help build trust in AI systems, especially in safety-critical domains.
As hardware continues to evolve, specialized accelerators for sparse computation, low-precision arithmetic, and neuromorphic designs will further expand the scope of neural networks. Quantum neural networks, although still in their infancy, suggest yet another frontier.
Throughout this journey, the core principles—defining neurons, stacking layers, choosing activations, measuring loss, and optimizing parameters—remain the foundation. The landscape of models built upon those principles only grows richer and more varied.
Chapter 10: Graph Neural Networks
Graph neural networks extend the idea of neural computation from regular grids to irregular graph structures, enabling deep learning on data whose relationships are best expressed as nodes connected by edges. A graph G consists of a set of nodes V and a set of edges E between those nodes. Each node i carries a feature vector xᵢ, and the pattern of edges encodes how information should flow between nodes.
At the heart of many graph neural networks lies a message-passing paradigm. During each layer of the network, every node gathers (“aggregates”) information from its neighbors, transforms that aggregated message, and then updates its own feature representation. By stacking multiple layers, nodes can incorporate information from progressively larger neighborhoods.
One of the simplest and most widely used forms of graph convolution is the Graph Convolutional Network (GCN). Suppose we have N nodes, each with a d-dimensional feature vector, collected in a matrix X ∈ ℝᴺˣᵈ. Let A ∈ ℝᴺˣᴺ be the adjacency matrix of the graph, where Aᵢⱼ = 1 if there is an edge from node i to node j and zero otherwise. To include each node’s own features, we add the identity matrix I to A, producing à = A + I. We then compute the degree matrix D̃ where D̃ᵢᵢ = Σⱼ Ãᵢⱼ. A single GCN layer transforms X into new features H ∈ ℝᴺˣᵈ′ by the rule:
H = σ( D̃⁻½ · Ã · D̃⁻½ · X · W )
Here W ∈ ℝᵈˣᵈ′ is a learnable weight matrix, and σ is an element-wise nonlinearity such as ReLU. The symmetric normalization D̃⁻½ Ã D̃⁻½ ensures that messages from high-degree nodes do not overwhelm those from low-degree nodes.
Below is a minimal PyTorch implementation of a single GCN layer. Every step is explained in detail.
import torch
import torch.nn as nn
class GCNLayer(nn.Module):
def __init__(self, in_features, out_features):
super(GCNLayer, self).__init__()
# Weight matrix W of shape (in_features, out_features)
self.weight = nn.Parameter(torch.randn(in_features, out_features))
def forward(self, X, adjacency):
# Add self-loops by adding the identity matrix to adjacency
A_tilde = adjacency + torch.eye(adjacency.size(0), device=adjacency.device)
# Compute the degree matrix of A_tilde
degrees = A_tilde.sum(dim=1)
# Compute D_tilde^(-1/2)
D_inv_sqrt = torch.diag(degrees.pow(-0.5))
# Symmetric normalization: D^(-1/2) * A_tilde * D^(-1/2)
A_normalized = D_inv_sqrt @ A_tilde @ D_inv_sqrt
# Linear transformation: X * W
support = X @ self.weight
# Propagate messages: A_normalized * support
out = A_normalized @ support
# Apply nonlinearity
return torch.relu(out)
In this code the adjacency matrix is a dense tensor of shape (N, N). We first add self-loops by summing with the identity. We then compute the degree of each node by summing the rows of Ã. Taking the inverse square root of these degrees and forming a diagonal matrix yields D̃⁻½. Multiplying D̃⁻½ on both sides of à yields the normalized adjacency. The node features X are multiplied by the weight matrix W to transform them into a new feature space, and finally the normalized adjacency matrix mixes these transformed features according to the graph structure. A ReLU activation injects nonlinearity.
By stacking multiple such layers, for example
class SimpleGCN(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super(SimpleGCN, self).__init__()
self.gcn1 = GCNLayer(input_dim, hidden_dim)
self.gcn2 = GCNLayer(hidden_dim, output_dim)
def forward(self, X, adjacency):
h1 = self.gcn1(X, adjacency)
# h1 serves as input to the next layer
h2 = self.gcn2(h1, adjacency)
return h2
we enable each node to gather information from nodes that are up to two hops away. For a classification task where each node i has a label yᵢ in {1,…,C}, we can pair the final outputs H ∈ ℝᴺˣᶜ with a cross-entropy loss, just as in ordinary classification, and train by gradient descent.
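A compact training sketch for such a node-classification task is given below; X, adjacency, labels, train_mask, and num_classes are assumed to be prepared elsewhere, and the hyperparameters are only illustrative:
model = SimpleGCN(input_dim=X.size(1), hidden_dim=16, output_dim=num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(200):
    optimizer.zero_grad()
    logits = model(X, adjacency)                          # shape (N, num_classes)
    loss = loss_fn(logits[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()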
Beyond GCNs, attention-based graph networks compute edge-specific weights that tell a node how much to attend to each neighbor. The Graph Attention Network (GAT) introduces learnable attention coefficients αᵢⱼ defined by:
eᵢⱼ = LeakyReLU( aᵀ · [ W·xᵢ ∥ W·xⱼ ] )
αᵢⱼ = softmax_j( eᵢⱼ )
where ∥ denotes concatenation, a ∈ ℝ²ᵈ′ is a learnable vector, and softmax_j normalizes over all neighbors of i. The node update becomes:
hᵢ′ = σ( Σⱼ αᵢⱼ · W·xⱼ ).
Implementing a GAT layer from scratch follows the same pattern of message passing but requires computing eᵢⱼ for every edge and then normalizing. For large graphs one uses sparse representations or libraries such as PyTorch Geometric to handle efficiency.
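For intuition, a minimal single-head GAT layer over a dense adjacency matrix might look like the following sketch; it is written for clarity rather than efficiency and masks non-edges with −∞ before the softmax:
import torch
import torch.nn as nn
import torch.nn.functional as F
class DenseGATLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)
        # Attention vector a of length 2 * out_features
        self.a = nn.Parameter(torch.randn(2 * out_features))
        self.leaky_relu = nn.LeakyReLU(0.2)
    def forward(self, X, adjacency):
        N = X.size(0)
        Wx = self.W(X)                        # (N, out_features)
        d = Wx.size(1)
        # e_ij = LeakyReLU(a^T [Wx_i || Wx_j]) splits into two dot products
        src = Wx @ self.a[:d]                 # contribution of node i, shape (N,)
        dst = Wx @ self.a[d:]                 # contribution of node j, shape (N,)
        e = self.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0))   # (N, N)
        # Attend only over actual neighbors (plus self-loops)
        mask = adjacency + torch.eye(N, device=X.device)
        e = e.masked_fill(mask == 0, float('-inf'))
        alpha = F.softmax(e, dim=1)           # each row sums to one over neighbors
        return torch.relu(alpha @ Wx)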
Graph neural networks open the door to applications in chemistry, social network analysis, recommendation systems, and combinatorial optimization. They provide a principled way to learn representations on structured data where each entity’s context is defined by its relationships.
Chapter 11: Implementing a Transformer from Scratch
The transformer is a neural architecture built entirely on attention mechanisms, dispensing with recurrence and convolution to process sequences in parallel. Its key innovation is the scaled dot-product attention sub-layer, which computes relationships among all positions in the input simultaneously. A transformer encoder stacks multiple layers of multi-head self-attention and position-wise feed-forward networks, each wrapped in residual connections and layer normalization. The decoder adds masked self-attention and encoder-decoder attention to enable autoregressive generation.
We begin by formalizing the scaled dot-product attention. Given query, key, and value matrices Q, K, and V with shapes
(batch_size, num_heads, seq_len, d_k),
we compute raw scores by taking the dot product of Q with the transpose of K. We then divide those scores by √d_k so that large dot products do not push the softmax into saturated regions with vanishing gradients, apply softmax to obtain attention weights, and multiply by V to produce the attended output:
Attention(Q, K, V) = softmax( (Q · Kᵀ) / √d_k ) · V
In PyTorch this can be implemented as follows:
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Compute scaled dot-product attention.
Q, K, V have shape (batch_size, num_heads, seq_len, d_k).
Mask, if provided, is added to scores to prevent attention on certain positions.
"""
d_k = Q.size(-1)
# Compute raw attention scores.
scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
# Apply mask (e.g., to prevent attending to future tokens in decoder).
if mask is not None:
scores = scores + mask
# Normalize to obtain attention weights.
attn_weights = F.softmax(scores, dim=-1)
# Compute the weighted sum of the values.
output = torch.matmul(attn_weights, V)
return output, attn_weights
In this function we extract the dimension d_k from Q, compute the dot products, scale them, and optionally add a mask before the softmax. The mask contains large negative values (−∞) at disallowed positions so that after softmax those positions receive zero weight.
Multi-head attention extends this idea by allowing the model to attend jointly to information from multiple representation subspaces. We first project the input tensor X of shape (batch_size, seq_len, d_model) into queries, keys, and values using learned linear layers. We then split each of these projections into num_heads separate heads along the feature dimension, apply the scaled dot-product attention in parallel on each head, concatenate the results, and project back to the original d_model:
import torch
import torch.nn as nn
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super(MultiHeadAttention, self).__init__()
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
self.num_heads = num_heads
self.d_k = d_model // num_heads
# Linear projections for queries, keys, values, and the final output.
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
def forward(self, X, mask=None):
batch_size, seq_len, _ = X.size()
# Project inputs to Q, K, V.
Q = self.W_q(X)
K = self.W_k(X)
V = self.W_v(X)
# Reshape and transpose to separate heads.
Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
# Apply scaled dot-product attention.
attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)
# Concatenate heads and project back to d_model.
concat = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
output = self.W_o(concat)
return output
Because the transformer has no built-in notion of order, positional encodings are added to the token embeddings to provide the model with information about the position of each element in the sequence. The original transformer uses sinusoidal encodings defined by:
P[pos, 2i ] = sin( pos / (10000^(2i/d_model)) )
P[pos, 2i+1 ] = cos( pos / (10000^(2i/d_model)) )
for pos in [0, L−1] and i in [0, d_model/2−1]. We implement this as:
import math

def get_sinusoidal_positional_encoding(L, d_model):
    # Create a tensor of shape (L, d_model).
    P = torch.zeros(L, d_model)
    position = torch.arange(0, L).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    # Apply sin to even indices.
    P[:, 0::2] = torch.sin(position * div_term)
    # Apply cos to odd indices.
    P[:, 1::2] = torch.cos(position * div_term)
    return P
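A brief, illustrative check of the resulting shape, and of how the encoding would typically be added to a batch of (here hypothetical) token embeddings:
# Build encodings for 50 positions with model dimension 512 (arbitrary values).
P = get_sinusoidal_positional_encoding(L=50, d_model=512)
print(P.shape)                         # torch.Size([50, 512])
embeddings = torch.randn(2, 50, 512)   # hypothetical token embeddings
x = embeddings + P.unsqueeze(0)        # broadcast across the batch dimension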
Each encoder layer comprises a self-attention sub-layer followed by a position-wise feed-forward network. Both sub-layers are enclosed in residual connections and followed by layer normalization and dropout. The feed-forward network has the form:
FFN(x) = ReLU(x·W₁ + b₁)·W₂ + b₂
and is applied independently to each position. We assemble an encoder layer in PyTorch:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection and normalization.
        attn_out = self.self_attn(x, mask)
        x = x + self.dropout1(attn_out)
        x = self.norm1(x)
        # Feed-forward with residual connection and normalization.
        ffn_out = self.ffn(x)
        x = x + self.dropout2(ffn_out)
        x = self.norm2(x)
        return x
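A minimal sketch of running this layer on a random input; the hyperparameters match the original transformer base configuration, and the tensor sizes are illustrative:
# Shape check for a single encoder layer.
layer = TransformerEncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.randn(2, 10, 512)   # (batch_size, seq_len, d_model)
out = layer(x)                # no mask: every position attends to every other
print(out.shape)              # torch.Size([2, 10, 512])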
To build the full encoder, we stack N such layers and apply positional encodings at the input:
class TransformerEncoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, dropout):
        super(TransformerEncoder, self).__init__()
        self.pos_encoder = get_sinusoidal_positional_encoding
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, src_mask=None):
        # src: (batch_size, seq_len, d_model)
        seq_len = src.size(1)
        # Add positional encoding.
        pos_enc = self.pos_encoder(seq_len, src.size(2)).to(src.device)
        x = src + pos_enc.unsqueeze(0)
        # Pass through each encoder layer.
        for layer in self.layers:
            x = layer(x, src_mask)
        return self.norm(x)
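As with the individual layer, a quick illustrative forward pass confirms that the encoder preserves the (batch_size, seq_len, d_model) shape; note that this encoder expects already-embedded inputs:
# Small encoder with two layers (sizes chosen only for illustration).
encoder = TransformerEncoder(num_layers=2, d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
src = torch.randn(2, 10, 512)   # already-embedded input sequence
out = encoder(src)
print(out.shape)                # torch.Size([2, 10, 512])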
Implementing a decoder follows the same pattern but includes a masked self-attention sub-layer to prevent attending to future positions and an encoder-decoder attention sub-layer that attends over the encoder’s output. A final linear and softmax layer maps the decoder output to probabilities over the target vocabulary.
By coding every component—from scaled dot-product attention through multi-head attention, positional encoding, feed-forward networks, and encoder layers—you gain insight into how information flows through the transformer. With this foundation, it becomes straightforward to adapt or extend the model for tasks such as machine translation, text summarization, or even image generation.
Chapter 12: Full Encoder–Decoder Transformer
To perform sequence-to-sequence tasks such as machine translation or summarization, we must pair the encoder with a decoder that generates one token at a time while attending to the encoder's output. A full transformer thus comprises token embeddings, positional encodings, a stack of encoder layers, a stack of decoder layers, and a final linear projection onto the target vocabulary.
Below is a step-by-step implementation in PyTorch, with every line explained.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    Q, K, V have shape (batch_size, num_heads, seq_len, d_k).
    Mask, if provided, contains -inf at disallowed positions.
    """
    d_k = Q.size(-1)
    # Compute raw attention scores by matrix-multiplying queries with keys.
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # If a mask is supplied, add it (positions with -inf get zero weight after softmax).
    if mask is not None:
        scores = scores + mask
    # Normalize scores into probabilities.
    attn_weights = F.softmax(scores, dim=-1)
    # Multiply probabilities by values to get the attended outputs.
    output = torch.matmul(attn_weights, V)
    return output, attn_weights
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # Ensure d_model divides evenly into the number of heads.
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear projections for queries, keys, values.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        # Final linear projection after concatenating all heads.
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project input tensors into Q, K, V.
        Q = self.W_q(query)
        K = self.W_k(key)
        V = self.W_v(value)
        # Reshape into (batch, heads, seq_len, d_k) and transpose.
        Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Apply scaled dot-product attention per head.
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads: transpose back and merge the head dimension.
        concat = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        # Final linear projection.
        output = self.W_o(concat)
        return output
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        # Dropout applied after adding the positional encodings.
        self.dropout = nn.Dropout(dropout)
        # Create sinusoidal positional encodings once.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add batch dimension and register as buffer so it moves with the model.
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # x has shape (batch_size, seq_len, d_model).
        # Add the positional encodings up to the input length, then apply dropout.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(TransformerEncoderLayer, self).__init__()
        # Self-attention sub-layer.
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        # Layer normalization modules.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout for regularization.
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Apply self-attention, then add & norm.
        attn_out = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout1(attn_out))
        # Apply feed-forward network, then add & norm.
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_out))
        return x
class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(TransformerDecoderLayer, self).__init__()
        # Masked self-attention for the target sequence.
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Encoder-decoder attention to attend over the source.
        self.src_attn = MultiHeadAttention(d_model, num_heads)
        # Feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        # Layer norms and dropouts for each sub-layer.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, memory, src_mask=None, tgt_mask=None):
        # Masked self-attention on the decoder input.
        self_attn_out = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout1(self_attn_out))
        # Encoder-decoder attention over encoder outputs.
        src_attn_out = self.src_attn(x, memory, memory, src_mask)
        x = self.norm2(x + self.dropout2(src_attn_out))
        # Feed-forward and add & norm.
        ffn_out = self.ffn(x)
        x = self.norm3(x + self.dropout3(ffn_out))
        return x
class Transformer(nn.Module):
    def __init__(self,
                 src_vocab_size,
                 tgt_vocab_size,
                 d_model=512,
                 num_heads=8,
                 d_ff=2048,
                 num_encoder_layers=6,
                 num_decoder_layers=6,
                 dropout=0.1):
        super(Transformer, self).__init__()
        # Token embeddings for source and target, plus a shared positional encoding.
        self.src_tok_embed = nn.Embedding(src_vocab_size, d_model)
        self.tgt_tok_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, dropout=dropout)
        # Stacked encoder and decoder layers.
        self.encoder_layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_encoder_layers)
        ])
        self.decoder_layers = nn.ModuleList([
            TransformerDecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_decoder_layers)
        ])
        # Final linear projection to vocabulary size.
        self.generator = nn.Linear(d_model, tgt_vocab_size)
        self.d_model = d_model

    def encode(self, src, src_mask=None):
        # Embed, scale by sqrt(d_model), then add positional encoding.
        x = self.pos_encoding(self.src_tok_embed(src) * math.sqrt(self.d_model))
        # Pass through each encoder layer.
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x

    def decode(self, tgt, memory, src_mask=None, tgt_mask=None):
        # Embed the target, scale, and add positional encoding.
        x = self.pos_encoding(self.tgt_tok_embed(tgt) * math.sqrt(self.d_model))
        # Pass through each decoder layer.
        for layer in self.decoder_layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # Compute encoder output.
        memory = self.encode(src, src_mask)
        # Compute decoder output attending to encoder memory.
        output = self.decode(tgt, memory, src_mask, tgt_mask)
        # Project to vocabulary logits.
        return self.generator(output)
def generate_square_subsequent_mask(sz):
    """
    Create a mask for causal attention so that position i can only
    attend to positions ≤ i. Mask entries are 0 where allowed and
    -inf where disallowed.
    """
    mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)
    return mask
Usage example with dummy data:
# Define vocabulary sizes and sequence lengths.
src_vocab_size, tgt_vocab_size = 10000, 10000
batch_size, src_len, tgt_len = 2, 20, 22
# Instantiate the transformer.
model = Transformer(src_vocab_size, tgt_vocab_size)
# Example source and target token indices.
src = torch.randint(0, src_vocab_size, (batch_size, src_len))
tgt = torch.randint(0, tgt_vocab_size, (batch_size, tgt_len))
# No padding mask for this example.
src_mask = None
# Causal mask for the decoder.
tgt_mask = generate_square_subsequent_mask(tgt_len)
# Forward pass yields logits of shape (batch_size, tgt_len, tgt_vocab_size).
logits = model(src, tgt, src_mask, tgt_mask)
In this implementation every encoder layer applies multi‐head self‐attention and a position‐wise feed‐forward network, each with residual connections and layer normalization. Every decoder layer adds a masked self‐attention step to prevent peeking at future tokens and an additional encoder–decoder attention that allows the decoder to focus on relevant parts of the source sequence. Positional encodings built with sinusoids inject order information into the model, and the final linear layer projects decoder outputs to raw token scores.
With this foundation, you can train the model on paired text data by defining an appropriate loss (for example, cross‐entropy between the predicted logits and true token indices) and using any of the optimizers discussed earlier.
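As a minimal sketch of what such a training step might look like, we can reuse the dummy src and tgt batches from the usage example above, shift the target by one position for teacher forcing, and apply cross-entropy to the logits; the teacher-forcing shift and the Adam settings here are illustrative choices, not the only option:
# One illustrative training step on the dummy batch.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Teacher forcing: feed all target tokens except the last, predict the next token.
tgt_in = tgt[:, :-1]
tgt_out = tgt[:, 1:]
step_mask = generate_square_subsequent_mask(tgt_in.size(1))
logits = model(src, tgt_in, None, step_mask)   # (batch, tgt_len - 1, tgt_vocab_size)
loss = criterion(logits.reshape(-1, tgt_vocab_size), tgt_out.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()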