INTRODUCTION
Welcome to this comprehensive tutorial on building neural networks from scratch in Python. This guide will take you on a journey from the absolute basics to advanced concepts in deep learning. We will not rely on high-level libraries like TensorFlow or PyTorch for the core implementation. Instead, we will build everything ourselves using only NumPy for numerical operations. This approach will give you a deep understanding of what happens under the hood when you train a neural network.
By the end of this tutorial, you will understand how neurons work, how networks learn through backpropagation, how to implement various optimization algorithms, and how to add practical features like batch processing and early stopping. Each concept will be explained thoroughly before we implement it in code.
WHAT IS A NEURAL NETWORK?
A neural network is a computational model inspired by the way biological neurons work in the human brain. At its core, a neural network consists of layers of interconnected nodes called neurons. Each neuron receives inputs, processes them, and produces an output that gets passed to the next layer.
The simplest neural network has three types of layers. The input layer receives the raw data. Hidden layers perform computations and extract features from the data. The output layer produces the final prediction or classification.
The power of neural networks comes from their ability to learn complex patterns in data. They do this by adjusting internal parameters called weights and biases during a training process. This training process uses examples of input data paired with correct outputs to gradually improve the network's predictions.
THE MATHEMATICS BEHIND A SINGLE NEURON
Before we build a full network, let us understand how a single neuron works. A neuron takes multiple inputs, multiplies each input by a weight, adds all these weighted inputs together, adds a bias term, and then applies an activation function to produce an output.
Mathematically, for a neuron with inputs x1, x2, x3 and corresponding weights w1, w2, w3, the weighted sum z is calculated as:
z = w1 * x1 + w2 * x2 + w3 * x3 + b
where b is the bias term. The bias allows the neuron to shift its activation function left or right, which helps the network fit the data better.
After computing z, we apply an activation function to introduce non-linearity. Without activation functions, no matter how many layers we stack, the network would only be able to learn linear relationships. Common activation functions include sigmoid, tanh, and ReLU.
The sigmoid function squashes any input value to a range between 0 and 1:
sigmoid(z) = 1 / (1 + exp(-z))
The ReLU (Rectified Linear Unit) function is simpler and often works better in practice:
ReLU(z) = max(0, z)
This means if z is positive, ReLU returns z. If z is negative, ReLU returns 0.
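To make this concrete, here is a small, self-contained sketch of a single neuron worked through by hand. The inputs, weights, and bias below are made up purely for illustration:

# A single neuron with made-up numbers, using only the standard library.
import math

x1, x2, x3 = 1.0, 2.0, 3.0      # inputs
w1, w2, w3 = 0.5, -0.25, 0.1    # weights
b = 0.2                         # bias

z = w1 * x1 + w2 * x2 + w3 * x3 + b
print(z)                        # 0.5 - 0.5 + 0.3 + 0.2 = 0.5
print(1 / (1 + math.exp(-z)))   # sigmoid(0.5) is roughly 0.622
print(max(0.0, z))              # ReLU(0.5) = 0.5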
SETTING UP OUR PYTHON ENVIRONMENT
Before we start coding, we need to import the necessary libraries. We will use NumPy for all our numerical computations and matplotlib for visualizing our results.
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Optional
We set a random seed to ensure our results are reproducible. This means every time we run our code, we will get the same random initialization of weights.
np.random.seed(42)
BUILDING OUR FIRST SIMPLE NEURAL NETWORK
Let us start by creating a very simple neural network with one hidden layer. This network will have an input layer, one hidden layer with a few neurons, and an output layer. We will build it step by step, explaining each component.
First, we need to implement the activation functions we discussed earlier. We will implement both the forward pass (computing the activation) and the backward pass (computing the derivative, which we need for backpropagation).
def sigmoid(z):
"""
Compute the sigmoid activation function.
The sigmoid function maps any real number to a value between 0 and 1.
It is useful for binary classification problems.
Parameters:
z : numpy array of any shape
Returns:
activation : numpy array of same shape as z
"""
return 1 / (1 + np.exp(-z))
def sigmoid_derivative(z):
"""
Compute the derivative of the sigmoid function.
This is used during backpropagation to compute gradients.
The derivative of sigmoid(z) is sigmoid(z) * (1 - sigmoid(z)).
Parameters:
z : numpy array of any shape
Returns:
derivative : numpy array of same shape as z
"""
sig = sigmoid(z)
return sig * (1 - sig)
Now let us implement the ReLU activation function and its derivative.
def relu(z):
"""
Compute the ReLU (Rectified Linear Unit) activation function.
ReLU returns the input if it is positive, otherwise returns 0.
It is computationally efficient and works well in practice.
Parameters:
z : numpy array of any shape
Returns:
activation : numpy array of same shape as z
"""
return np.maximum(0, z)
def relu_derivative(z):
"""
Compute the derivative of the ReLU function.
The derivative is 1 where z > 0, and 0 elsewhere.
Parameters:
z : numpy array of any shape
Returns:
derivative : numpy array of same shape as z
"""
return (z > 0).astype(float)
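As a quick sanity check, we can evaluate these functions on a small array of arbitrary values and confirm they behave as described:

z_test = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z_test))                  # [0.  0.  0.  0.5 2. ]
print(relu_derivative(z_test))       # [0. 0. 0. 1. 1.]
print(np.round(sigmoid(z_test), 3))  # [0.119 0.378 0.5   0.622 0.881]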
INITIALIZING NETWORK PARAMETERS
When we create a neural network, we need to initialize the weights and biases. The way we initialize these parameters can significantly affect how well and how quickly the network learns.
A common approach is to initialize weights randomly with small values. If weights are too large, the activations can explode. If they are too small or all zero, the network may not learn effectively. We will use a technique called He initialization for ReLU networks, which scales the random weights based on the number of inputs.
def initialize_parameters(layer_dimensions):
"""
Initialize the weights and biases for all layers in the network.
We use He initialization for weights, which works well with ReLU activations.
Biases are initialized to zeros.
Parameters:
layer_dimensions : list of integers representing the number of units in each layer
For example, [784, 128, 64, 10] means:
- Input layer: 784 features
- First hidden layer: 128 neurons
- Second hidden layer: 64 neurons
- Output layer: 10 neurons
Returns:
parameters : dictionary containing weights (W) and biases (b) for each layer
"""
parameters = {}
num_layers = len(layer_dimensions)
for layer in range(1, num_layers):
# He initialization: multiply random values by sqrt(2 / n_previous_layer)
# This helps prevent vanishing or exploding gradients
parameters[f'W{layer}'] = np.random.randn(
layer_dimensions[layer],
layer_dimensions[layer - 1]
) * np.sqrt(2.0 / layer_dimensions[layer - 1])
# Initialize biases to zeros
parameters[f'b{layer}'] = np.zeros((layer_dimensions[layer], 1))
return parameters
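A quick way to verify the initialization is to print the shapes it produces. The tiny architecture below is chosen only for illustration:

params_demo = initialize_parameters([2, 4, 1])
for name, value in params_demo.items():
    print(name, value.shape)
# W1 (4, 2), b1 (4, 1), W2 (1, 4), b2 (1, 1)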
FORWARD PROPAGATION: MAKING PREDICTIONS
Forward propagation is the process of passing input data through the network to get predictions. At each layer, we compute the weighted sum of inputs plus bias, then apply an activation function.
Let us implement forward propagation for a network with one hidden layer using ReLU activation and an output layer using sigmoid activation.
def forward_propagation_simple(X, parameters):
"""
Perform forward propagation through a simple 2-layer network.
The network architecture is:
Input -> Hidden Layer (ReLU) -> Output Layer (Sigmoid)
Parameters:
X : numpy array of shape (n_features, n_examples)
Input data where each column is one training example
parameters : dictionary containing W1, b1, W2, b2
Returns:
A2 : numpy array of shape (n_output, n_examples)
Final output (predictions)
cache : dictionary containing intermediate values needed for backpropagation
"""
# Retrieve parameters
W1 = parameters['W1']
b1 = parameters['b1']
W2 = parameters['W2']
b2 = parameters['b2']
# Forward propagation for hidden layer
# Z1 is the weighted sum before activation
Z1 = np.dot(W1, X) + b1
# A1 is the activation output
A1 = relu(Z1)
# Forward propagation for output layer
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)
# Store values for backpropagation
cache = {
'Z1': Z1,
'A1': A1,
'Z2': Z2,
'A2': A2
}
return A2, cache
COMPUTING THE COST FUNCTION
The cost function (also called loss function) measures how wrong our network's predictions are. During training, we want to minimize this cost. For binary classification, we typically use binary cross-entropy loss.
The binary cross-entropy cost for a single example is:
cost = - (y * log(prediction) + (1 - y) * log(1 - prediction))
where y is the true label (0 or 1) and prediction is our network's output.
For multiple examples, we average the cost across all examples.
def compute_cost(A2, Y):
"""
Compute the binary cross-entropy cost.
This measures how different our predictions are from the true labels.
Lower cost means better predictions.
Parameters:
A2 : numpy array of shape (1, n_examples)
Network predictions (probabilities between 0 and 1)
Y : numpy array of shape (1, n_examples)
True labels (0 or 1)
Returns:
cost : float
Average cost across all examples
"""
m = Y.shape[1] # Number of examples
# Compute cross-entropy cost
# We add a small epsilon to avoid log(0)
epsilon = 1e-8
cost = -np.sum(Y * np.log(A2 + epsilon) + (1 - Y) * np.log(1 - A2 + epsilon)) / m
return cost
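As a small illustration with made-up predictions and labels, the cost is low for confident correct predictions and much higher for confident wrong ones:

A2_demo = np.array([[0.9, 0.2, 0.7]])   # hypothetical network outputs
Y_demo = np.array([[1, 0, 1]])          # true labels
print(compute_cost(A2_demo, Y_demo))    # roughly 0.23
A2_bad = np.array([[0.1, 0.8, 0.3]])    # confidently wrong outputs
print(compute_cost(A2_bad, Y_demo))     # roughly 1.71, much higher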
BACKWARD PROPAGATION: LEARNING FROM MISTAKES
Backward propagation is the heart of how neural networks learn. It computes the gradient of the cost function with respect to each parameter (weight and bias). These gradients tell us how to adjust the parameters to reduce the cost.
The process works backwards from the output layer to the input layer, using the chain rule from calculus. For each layer, we compute how much each parameter contributed to the error.
def backward_propagation_simple(X, Y, parameters, cache):
"""
Perform backward propagation to compute gradients.
This calculates how much each weight and bias should change
to reduce the cost function.
Parameters:
X : numpy array of shape (n_features, n_examples)
Input data
Y : numpy array of shape (1, n_examples)
True labels
parameters : dictionary containing W1, b1, W2, b2
cache : dictionary containing Z1, A1, Z2, A2 from forward propagation
Returns:
gradients : dictionary containing dW1, db1, dW2, db2
"""
m = X.shape[1] # Number of examples
# Retrieve cached values
Z1 = cache['Z1']
A1 = cache['A1']
A2 = cache['A2']
# Retrieve parameters
W2 = parameters['W2']
# Backward propagation for output layer
# dZ2 is the gradient of cost with respect to Z2
dZ2 = A2 - Y
# Gradients for W2 and b2
dW2 = np.dot(dZ2, A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
# Backward propagation for hidden layer
# We propagate the gradient back through W2
dA1 = np.dot(W2.T, dZ2)
# Then multiply by the derivative of ReLU
dZ1 = dA1 * relu_derivative(Z1)
# Gradients for W1 and b1
dW1 = np.dot(dZ1, X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
gradients = {
'dW1': dW1,
'db1': db1,
'dW2': dW2,
'db2': db2
}
return gradients
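One way to build confidence in these formulas is a numerical gradient check. The sketch below is not part of the training pipeline and uses made-up data: it perturbs a single weight, re-computes the cost, and compares the finite-difference slope with the analytic gradient from backpropagation. The two values should agree to several decimal places.

# Numerical gradient check on one weight (illustrative only).
X_check = np.random.randn(2, 5)
Y_check = (np.random.rand(1, 5) > 0.5).astype(float)
params_check = initialize_parameters([2, 3, 1])

A2_check, cache_check = forward_propagation_simple(X_check, params_check)
grads_check = backward_propagation_simple(X_check, Y_check, params_check, cache_check)

eps = 1e-6
params_plus = {key: value.copy() for key, value in params_check.items()}
params_plus['W2'][0, 0] += eps
cost_plus = compute_cost(forward_propagation_simple(X_check, params_plus)[0], Y_check)

params_minus = {key: value.copy() for key, value in params_check.items()}
params_minus['W2'][0, 0] -= eps
cost_minus = compute_cost(forward_propagation_simple(X_check, params_minus)[0], Y_check)

numerical_grad = (cost_plus - cost_minus) / (2 * eps)
print(numerical_grad, grads_check['dW2'][0, 0])   # should match closely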
UPDATING PARAMETERS WITH GRADIENT DESCENT
Once we have computed the gradients, we need to update our parameters. The simplest optimization algorithm is gradient descent. We move each parameter in the opposite direction of its gradient, scaled by a learning rate.
The update rule is:
new_weight = old_weight - learning_rate * gradient
The learning rate controls how big our steps are. If it is too large, we might overshoot the minimum. If it is too small, training will be very slow.
def update_parameters(parameters, gradients, learning_rate):
"""
Update parameters using gradient descent.
Each parameter is adjusted in the direction that reduces the cost.
Parameters:
parameters : dictionary containing current W1, b1, W2, b2
gradients : dictionary containing dW1, db1, dW2, db2
learning_rate : float
Controls the step size of parameter updates
Returns:
parameters : dictionary containing updated W1, b1, W2, b2
"""
# Update weights and biases for each layer
parameters['W1'] = parameters['W1'] - learning_rate * gradients['dW1']
parameters['b1'] = parameters['b1'] - learning_rate * gradients['db1']
parameters['W2'] = parameters['W2'] - learning_rate * gradients['dW2']
parameters['b2'] = parameters['b2'] - learning_rate * gradients['db2']
return parameters
PUTTING IT ALL TOGETHER: TRAINING THE NETWORK
Now we can combine all the pieces into a complete training loop. We will repeatedly perform forward propagation, compute the cost, perform backward propagation, and update the parameters.
def train_simple_network(X, Y, layer_dimensions, learning_rate=0.01, num_iterations=1000, print_cost=True):
"""
Train a simple 2-layer neural network.
This function performs the complete training process:
1. Initialize parameters
2. For each iteration:
- Forward propagation
- Compute cost
- Backward propagation
- Update parameters
Parameters:
X : numpy array of shape (n_features, n_examples)
Training data
Y : numpy array of shape (1, n_examples)
Training labels
layer_dimensions : list of layer sizes [n_input, n_hidden, n_output]
learning_rate : float
Learning rate for gradient descent
num_iterations : int
Number of training iterations
print_cost : bool
Whether to print cost every 100 iterations
Returns:
parameters : dictionary containing trained weights and biases
costs : list of costs computed during training
"""
costs = []
# Initialize parameters
parameters = initialize_parameters(layer_dimensions)
# Training loop
for iteration in range(num_iterations):
# Forward propagation
A2, cache = forward_propagation_simple(X, parameters)
# Compute cost
cost = compute_cost(A2, Y)
costs.append(cost)
# Backward propagation
gradients = backward_propagation_simple(X, Y, parameters, cache)
# Update parameters
parameters = update_parameters(parameters, gradients, learning_rate)
# Print cost every 100 iterations
if print_cost and iteration % 100 == 0:
print(f"Cost after iteration {iteration}: {cost:.6f}")
return parameters, costs
TESTING OUR SIMPLE NETWORK
Let us create a simple dataset and test our neural network. We will generate synthetic data for a binary classification problem.
def generate_simple_dataset(n_samples=1000):
"""
Generate a simple synthetic dataset for binary classification.
This creates two classes of points that are linearly separable
with some noise added.
Parameters:
n_samples : int
Number of samples to generate
Returns:
X : numpy array of shape (2, n_samples)
Features
Y : numpy array of shape (1, n_samples)
Labels (0 or 1)
"""
# Generate random points
np.random.seed(42)
# Class 0: points clustered around (-2, -2)
X_class0 = np.random.randn(2, n_samples // 2) + np.array([[-2], [-2]])
Y_class0 = np.zeros((1, n_samples // 2))
# Class 1: points clustered around (2, 2)
X_class1 = np.random.randn(2, n_samples // 2) + np.array([[2], [2]])
Y_class1 = np.ones((1, n_samples // 2))
# Combine both classes
X = np.concatenate([X_class0, X_class1], axis=1)
Y = np.concatenate([Y_class0, Y_class1], axis=1)
# Shuffle the data
permutation = np.random.permutation(n_samples)
X = X[:, permutation]
Y = Y[:, permutation]
return X, Y
Now let us train our network on this dataset.
# Generate dataset
X_train, Y_train = generate_simple_dataset(n_samples=1000)
# Define network architecture
# Input layer: 2 features
# Hidden layer: 4 neurons
# Output layer: 1 neuron (binary classification)
layer_dims = [2, 4, 1]
# Train the network
print("Training simple neural network...")
parameters, costs = train_simple_network(
X_train,
Y_train,
layer_dims,
learning_rate=0.5,
num_iterations=1000,
print_cost=True
)
print("\nTraining complete!")
MAKING PREDICTIONS
After training, we need a function to make predictions on new data.
def predict(X, parameters):
"""
Make predictions using the trained network.
Parameters:
X : numpy array of shape (n_features, n_examples)
Input data
parameters : dictionary containing trained weights and biases
Returns:
predictions : numpy array of shape (1, n_examples)
Predicted class (0 or 1) for each example
"""
# Forward propagation
A2, _ = forward_propagation_simple(X, parameters)
# Convert probabilities to binary predictions
# If probability > 0.5, predict class 1, otherwise class 0
predictions = (A2 > 0.5).astype(int)
return predictions
def compute_accuracy(predictions, Y):
"""
Compute the accuracy of predictions.
Parameters:
predictions : numpy array of shape (1, n_examples)
Predicted labels
Y : numpy array of shape (1, n_examples)
True labels
Returns:
accuracy : float
Percentage of correct predictions
"""
accuracy = np.mean(predictions == Y) * 100
return accuracy
Let us test our trained network.
# Make predictions on training data
predictions = predict(X_train, parameters)
accuracy = compute_accuracy(predictions, Y_train)
print(f"Training accuracy: {accuracy:.2f}%")
UNDERSTANDING WHAT THE NETWORK LEARNED
To visualize what our network learned, we can plot the decision boundary. This shows how the network divides the input space into regions for each class.
def plot_decision_boundary(X, Y, parameters):
"""
Plot the decision boundary learned by the network.
This creates a visualization showing how the network
classifies different regions of the input space.
Parameters:
X : numpy array of shape (2, n_examples)
Input data (must be 2D for visualization)
Y : numpy array of shape (1, n_examples)
True labels
parameters : dictionary containing trained weights and biases
"""
# Set up the grid
x_min, x_max = X[0, :].min() - 1, X[0, :].max() + 1
y_min, y_max = X[1, :].min() - 1, X[1, :].max() + 1
h = 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Make predictions for every point in the grid
grid_points = np.c_[xx.ravel(), yy.ravel()].T
Z, _ = forward_propagation_simple(grid_points, parameters)
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, levels=[0, 0.5, 1], alpha=0.3, colors=['blue', 'red'])
# Plot the training points
plt.scatter(X[0, Y[0] == 0], X[1, Y[0] == 0], c='blue', marker='o', label='Class 0', edgecolors='k')
plt.scatter(X[0, Y[0] == 1], X[1, Y[0] == 1], c='red', marker='s', label='Class 1', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Decision Boundary of Neural Network')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
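Assuming the training run above has completed, the plot can be produced with a single call:

plot_decision_boundary(X_train, Y_train, parameters)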
BUILDING A DEEPER NETWORK
Now that we understand the basics, let us build a more flexible network that can have any number of layers. This is where deep learning gets its name - from using networks with many layers.
A deeper network can learn more complex patterns because each layer can build on the features learned by previous layers. The first layers might learn simple features like edges, while deeper layers combine these into more complex patterns.
class DeepNeuralNetwork:
"""
A flexible deep neural network with arbitrary depth.
This class encapsulates all the functionality needed to create,
train, and use a deep neural network with multiple hidden layers.
"""
def __init__(self, layer_dimensions, activation='relu'):
"""
Initialize the deep neural network.
Parameters:
layer_dimensions : list of integers
Number of units in each layer
Example: [784, 128, 64, 10] creates a network with
784 input features, two hidden layers (128 and 64 units),
and 10 output units
activation : str
Activation function to use in hidden layers ('relu' or 'sigmoid')
"""
self.layer_dimensions = layer_dimensions
self.num_layers = len(layer_dimensions)
self.activation = activation
self.parameters = self._initialize_parameters()
def _initialize_parameters(self):
"""
Initialize weights and biases for all layers.
Returns:
parameters : dictionary containing all weights and biases
"""
parameters = {}
for layer in range(1, self.num_layers):
# He initialization for ReLU, Xavier for sigmoid
if self.activation == 'relu':
scale = np.sqrt(2.0 / self.layer_dimensions[layer - 1])
else:
scale = np.sqrt(1.0 / self.layer_dimensions[layer - 1])
parameters[f'W{layer}'] = np.random.randn(
self.layer_dimensions[layer],
self.layer_dimensions[layer - 1]
) * scale
parameters[f'b{layer}'] = np.zeros((self.layer_dimensions[layer], 1))
return parameters
def _activation_forward(self, Z, activation_type):
"""
Apply activation function.
Parameters:
Z : numpy array
Pre-activation values
activation_type : str
Type of activation ('relu', 'sigmoid')
Returns:
A : numpy array
Post-activation values
"""
if activation_type == 'relu':
return relu(Z)
elif activation_type == 'sigmoid':
return sigmoid(Z)
else:
raise ValueError(f"Unknown activation: {activation_type}")
def _activation_backward(self, dA, Z, activation_type):
"""
Compute gradient of activation function.
Parameters:
dA : numpy array
Gradient of cost with respect to activation
Z : numpy array
Pre-activation values
activation_type : str
Type of activation
Returns:
dZ : numpy array
Gradient of cost with respect to pre-activation
"""
if activation_type == 'relu':
return dA * relu_derivative(Z)
elif activation_type == 'sigmoid':
return dA * sigmoid_derivative(Z)
else:
raise ValueError(f"Unknown activation: {activation_type}")
def forward_propagation(self, X):
"""
Perform forward propagation through all layers.
Parameters:
X : numpy array of shape (n_features, n_examples)
Input data
Returns:
AL : numpy array
Final layer activation (predictions)
caches : list of dictionaries
Cached values needed for backpropagation
"""
caches = []
A = X
# Forward through all layers except the last
for layer in range(1, self.num_layers - 1):
A_prev = A
W = self.parameters[f'W{layer}']
b = self.parameters[f'b{layer}']
Z = np.dot(W, A_prev) + b
A = self._activation_forward(Z, self.activation)
cache = {
'A_prev': A_prev,
'Z': Z,
'W': W,
'b': b
}
caches.append(cache)
# Forward through the output layer (always sigmoid for binary classification)
W = self.parameters[f'W{self.num_layers - 1}']
b = self.parameters[f'b{self.num_layers - 1}']
Z = np.dot(W, A) + b
AL = sigmoid(Z)
cache = {
'A_prev': A,
'Z': Z,
'W': W,
'b': b
}
caches.append(cache)
return AL, caches
def compute_cost(self, AL, Y):
"""
Compute the binary cross-entropy cost.
Parameters:
AL : numpy array
Network predictions
Y : numpy array
True labels
Returns:
cost : float
"""
m = Y.shape[1]
epsilon = 1e-8
cost = -np.sum(Y * np.log(AL + epsilon) + (1 - Y) * np.log(1 - AL + epsilon)) / m
return cost
def backward_propagation(self, AL, Y, caches):
"""
Perform backward propagation through all layers.
Parameters:
AL : numpy array
Final layer activation
Y : numpy array
True labels
caches : list of dictionaries
Cached values from forward propagation
Returns:
gradients : dictionary containing all gradients
"""
gradients = {}
m = Y.shape[1]
L = self.num_layers - 1
# Initialize backpropagation from output layer
dAL = -(np.divide(Y, AL + 1e-8) - np.divide(1 - Y, 1 - AL + 1e-8))
# Output layer gradients (sigmoid activation)
current_cache = caches[L - 1]
dZ = dAL * sigmoid_derivative(current_cache['Z'])
gradients[f'dW{L}'] = np.dot(dZ, current_cache['A_prev'].T) / m
gradients[f'db{L}'] = np.sum(dZ, axis=1, keepdims=True) / m
dA_prev = np.dot(current_cache['W'].T, dZ)
# Backpropagate through hidden layers
for layer in reversed(range(L - 1)):
current_cache = caches[layer]
dZ = self._activation_backward(dA_prev, current_cache['Z'], self.activation)
gradients[f'dW{layer + 1}'] = np.dot(dZ, current_cache['A_prev'].T) / m
gradients[f'db{layer + 1}'] = np.sum(dZ, axis=1, keepdims=True) / m
dA_prev = np.dot(current_cache['W'].T, dZ)
return gradients
def update_parameters(self, gradients, learning_rate):
"""
Update parameters using gradient descent.
Parameters:
gradients : dictionary containing all gradients
learning_rate : float
"""
for layer in range(1, self.num_layers):
self.parameters[f'W{layer}'] -= learning_rate * gradients[f'dW{layer}']
self.parameters[f'b{layer}'] -= learning_rate * gradients[f'db{layer}']
def train(self, X, Y, learning_rate=0.01, num_iterations=1000, print_cost=True):
"""
Train the neural network.
Parameters:
X : numpy array
Training data
Y : numpy array
Training labels
learning_rate : float
num_iterations : int
print_cost : bool
Returns:
costs : list of costs during training
"""
costs = []
for iteration in range(num_iterations):
# Forward propagation
AL, caches = self.forward_propagation(X)
# Compute cost
cost = self.compute_cost(AL, Y)
costs.append(cost)
# Backward propagation
gradients = self.backward_propagation(AL, Y, caches)
# Update parameters
self.update_parameters(gradients, learning_rate)
# Print cost
if print_cost and iteration % 100 == 0:
print(f"Cost after iteration {iteration}: {cost:.6f}")
return costs
def predict(self, X):
"""
Make predictions on new data.
Parameters:
X : numpy array
Input data
Returns:
predictions : numpy array
Binary predictions
"""
AL, _ = self.forward_propagation(X)
predictions = (AL > 0.5).astype(int)
return predictions
Let us test our deep neural network on a more complex dataset.
def generate_complex_dataset(n_samples=1000):
"""
Generate a more complex dataset with non-linear decision boundary.
This creates a dataset where classes are arranged in concentric circles,
which requires a non-linear classifier.
Parameters:
n_samples : int
Returns:
X : numpy array of shape (2, n_samples)
Y : numpy array of shape (1, n_samples)
"""
np.random.seed(42)
# Generate points
radius = np.random.rand(n_samples)
angle = 2 * np.pi * np.random.rand(n_samples)
# Class 0: inner circle
mask_class0 = radius < 0.5
X_class0 = np.vstack([
radius[mask_class0] * np.cos(angle[mask_class0]),
radius[mask_class0] * np.sin(angle[mask_class0])
])
Y_class0 = np.zeros((1, X_class0.shape[1]))
# Class 1: outer ring
mask_class1 = radius >= 0.5
X_class1 = np.vstack([
radius[mask_class1] * np.cos(angle[mask_class1]),
radius[mask_class1] * np.sin(angle[mask_class1])
])
Y_class1 = np.ones((1, X_class1.shape[1]))
# Combine
X = np.concatenate([X_class0, X_class1], axis=1)
Y = np.concatenate([Y_class0, Y_class1], axis=1)
# Add some noise
X += np.random.randn(*X.shape) * 0.1
return X, Y
# Generate complex dataset
X_complex, Y_complex = generate_complex_dataset(n_samples=1000)
# Create and train a deeper network
print("\nTraining deep neural network on complex dataset...")
deep_net = DeepNeuralNetwork(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_deep = deep_net.train(X_complex, Y_complex, learning_rate=0.5, num_iterations=2000, print_cost=True)
# Evaluate
predictions_deep = deep_net.predict(X_complex)
accuracy_deep = compute_accuracy(predictions_deep, Y_complex)
print(f"\nDeep network accuracy: {accuracy_deep:.2f}%")
INTRODUCING MINI-BATCH GRADIENT DESCENT
So far, we have been using batch gradient descent, where we compute gradients using all training examples at once. This works well for small datasets but becomes impractical for large datasets because it requires a lot of memory and computation per iteration.
Mini-batch gradient descent is a compromise. We split the training data into small batches and update parameters after processing each batch. This allows us to make more frequent updates and can lead to faster convergence.
The benefits of mini-batch gradient descent include faster training, better memory efficiency, and the ability to leverage vectorization while still making frequent updates.
def create_mini_batches(X, Y, batch_size):
"""
Split the dataset into mini-batches.
Parameters:
X : numpy array of shape (n_features, n_examples)
Training data
Y : numpy array of shape (n_output, n_examples)
Training labels
batch_size : int
Size of each mini-batch
Returns:
mini_batches : list of tuples (mini_batch_X, mini_batch_Y)
"""
m = X.shape[1]
mini_batches = []
# Shuffle the data
permutation = np.random.permutation(m)
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation]
# Partition into mini-batches
num_complete_batches = m // batch_size
for k in range(num_complete_batches):
mini_batch_X = shuffled_X[:, k * batch_size:(k + 1) * batch_size]
mini_batch_Y = shuffled_Y[:, k * batch_size:(k + 1) * batch_size]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
# Handle the remaining examples (if any)
if m % batch_size != 0:
mini_batch_X = shuffled_X[:, num_complete_batches * batch_size:]
mini_batch_Y = shuffled_Y[:, num_complete_batches * batch_size:]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
return mini_batches
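As a quick illustration using the simple dataset generated earlier, we can inspect how the data is partitioned. With 1000 examples and a batch size of 64 we expect 15 full batches plus one partial batch of 40:

demo_batches = create_mini_batches(X_train, Y_train, batch_size=64)
print(len(demo_batches))             # 16
print(demo_batches[0][0].shape)      # (2, 64)  features of the first mini-batch
print(demo_batches[-1][0].shape)     # (2, 40)  the final, partial mini-batch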
Now let us modify our DeepNeuralNetwork class to support mini-batch training.
class DeepNeuralNetworkWithMiniBatch(DeepNeuralNetwork):
"""
Deep neural network with mini-batch gradient descent support.
This extends the basic deep neural network to support training
with mini-batches instead of using the entire dataset at once.
"""
def train_with_mini_batches(self, X, Y, learning_rate=0.01, num_epochs=100,
batch_size=64, print_cost=True):
"""
Train the network using mini-batch gradient descent.
An epoch is one complete pass through the training data.
In each epoch, we process multiple mini-batches.
Parameters:
X : numpy array
Training data
Y : numpy array
Training labels
learning_rate : float
num_epochs : int
Number of complete passes through the data
batch_size : int
Size of each mini-batch
print_cost : bool
Returns:
costs : list of average costs per epoch
"""
costs = []
m = X.shape[1]
for epoch in range(num_epochs):
epoch_cost = 0
# Create mini-batches for this epoch
mini_batches = create_mini_batches(X, Y, batch_size)
num_batches = len(mini_batches)
for mini_batch in mini_batches:
mini_batch_X, mini_batch_Y = mini_batch
# Forward propagation
AL, caches = self.forward_propagation(mini_batch_X)
# Compute cost
batch_cost = self.compute_cost(AL, mini_batch_Y)
epoch_cost += batch_cost
# Backward propagation
gradients = self.backward_propagation(AL, mini_batch_Y, caches)
# Update parameters
self.update_parameters(gradients, learning_rate)
# Average cost for this epoch
avg_cost = epoch_cost / num_batches
costs.append(avg_cost)
if print_cost and epoch % 10 == 0:
print(f"Cost after epoch {epoch}: {avg_cost:.6f}")
return costs
Let us test mini-batch training.
# Create network with mini-batch support
print("\nTraining with mini-batch gradient descent...")
mini_batch_net = DeepNeuralNetworkWithMiniBatch(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_mini_batch = mini_batch_net.train_with_mini_batches(
X_complex,
Y_complex,
learning_rate=0.5,
num_epochs=200,
batch_size=32,
print_cost=True
)
# Evaluate
predictions_mini = mini_batch_net.predict(X_complex)
accuracy_mini = compute_accuracy(predictions_mini, Y_complex)
print(f"\nMini-batch network accuracy: {accuracy_mini:.2f}%")
IMPLEMENTING MOMENTUM OPTIMIZATION
Gradient descent can be slow, especially when the cost surface is much steeper in some directions than in others: the updates then oscillate across the steep directions while creeping along the shallow ones. Momentum is an optimization technique that helps accelerate gradient descent by accumulating a velocity vector in directions of persistent gradient.
Think of momentum like a ball rolling down a hill. The ball builds up speed (momentum) as it rolls, allowing it to move faster through flat regions and smooth out oscillations in steep regions.
The momentum update rule is:
velocity = beta * velocity + (1 - beta) * gradient
parameter = parameter - learning_rate * velocity
The beta parameter (typically 0.9) controls how much of the previous velocity to retain.
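Before looking at the full class, here is a tiny standalone illustration in plain Python, with made-up gradient values, of how the velocity damps an oscillating gradient:

beta = 0.9
velocity = 0.0
for gradient in [1.0, -0.8, 1.2, -0.9, 1.1]:   # gradients that flip sign every step
    velocity = beta * velocity + (1 - beta) * gradient
    print(round(velocity, 3))
# The velocity stays small and changes slowly, smoothing out the oscillation.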
class DeepNeuralNetworkWithMomentum(DeepNeuralNetworkWithMiniBatch):
"""
Deep neural network with momentum optimization.
Momentum helps accelerate training by accumulating gradients
in consistent directions.
"""
def __init__(self, layer_dimensions, activation='relu'):
"""
Initialize network with momentum support.
"""
super().__init__(layer_dimensions, activation)
self.velocities = self._initialize_velocities()
def _initialize_velocities(self):
"""
Initialize velocity vectors for momentum.
Returns:
velocities : dictionary containing velocity for each parameter
"""
velocities = {}
for layer in range(1, self.num_layers):
velocities[f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
velocities[f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
return velocities
def update_parameters_with_momentum(self, gradients, learning_rate, beta=0.9):
"""
Update parameters using momentum.
Parameters:
gradients : dictionary containing gradients
learning_rate : float
beta : float
Momentum coefficient (typically 0.9)
"""
for layer in range(1, self.num_layers):
# Update velocities
self.velocities[f'dW{layer}'] = (
beta * self.velocities[f'dW{layer}'] +
(1 - beta) * gradients[f'dW{layer}']
)
self.velocities[f'db{layer}'] = (
beta * self.velocities[f'db{layer}'] +
(1 - beta) * gradients[f'db{layer}']
)
# Update parameters using velocities
self.parameters[f'W{layer}'] -= learning_rate * self.velocities[f'dW{layer}']
self.parameters[f'b{layer}'] -= learning_rate * self.velocities[f'db{layer}']
def train_with_momentum(self, X, Y, learning_rate=0.01, num_epochs=100,
batch_size=64, beta=0.9, print_cost=True):
"""
Train network using mini-batch gradient descent with momentum.
Parameters:
X : numpy array
Y : numpy array
learning_rate : float
num_epochs : int
batch_size : int
beta : float
Momentum coefficient
print_cost : bool
Returns:
costs : list of costs
"""
costs = []
for epoch in range(num_epochs):
epoch_cost = 0
mini_batches = create_mini_batches(X, Y, batch_size)
num_batches = len(mini_batches)
for mini_batch in mini_batches:
mini_batch_X, mini_batch_Y = mini_batch
# Forward propagation
AL, caches = self.forward_propagation(mini_batch_X)
# Compute cost
batch_cost = self.compute_cost(AL, mini_batch_Y)
epoch_cost += batch_cost
# Backward propagation
gradients = self.backward_propagation(AL, mini_batch_Y, caches)
# Update with momentum
self.update_parameters_with_momentum(gradients, learning_rate, beta)
avg_cost = epoch_cost / num_batches
costs.append(avg_cost)
if print_cost and epoch % 10 == 0:
print(f"Cost after epoch {epoch}: {avg_cost:.6f}")
return costs
IMPLEMENTING ADAM OPTIMIZATION
Adam (Adaptive Moment Estimation) is one of the most popular optimization algorithms in deep learning. It combines ideas from momentum and another technique called RMSprop. Adam adapts the learning rate for each parameter individually, which often leads to faster convergence.
Adam maintains two moving averages for each parameter. The first moment estimate is similar to momentum, tracking the average of gradients. The second moment estimate tracks the average of squared gradients, which helps adapt the learning rate.
The Adam update rules are:
first_moment = beta1 * first_moment + (1 - beta1) * gradient
second_moment = beta2 * second_moment + (1 - beta2) * gradient_squared
Then we apply bias correction and update:
first_moment_corrected = first_moment / (1 - beta1^t)
second_moment_corrected = second_moment / (1 - beta2^t)
parameter = parameter - learning_rate * first_moment_corrected / (sqrt(second_moment_corrected) + epsilon)
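The bias correction matters most in the first few steps, when the moving averages are still close to their zero initialization. A tiny scalar example with made-up values shows the effect at t = 1:

beta1, beta2, gradient = 0.9, 0.999, 0.5
first_moment = (1 - beta1) * gradient        # 0.05, heavily biased toward zero
second_moment = (1 - beta2) * gradient**2    # 0.00025
print(first_moment / (1 - beta1**1))         # 0.5, the correction restores the scale
print(second_moment / (1 - beta2**1))        # 0.25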
class DeepNeuralNetworkWithAdam(DeepNeuralNetworkWithMiniBatch):
"""
Deep neural network with Adam optimization.
Adam is an advanced optimizer that adapts learning rates
for each parameter individually, often leading to faster
and more stable training.
"""
def __init__(self, layer_dimensions, activation='relu'):
"""
Initialize network with Adam optimizer support.
"""
super().__init__(layer_dimensions, activation)
self.adam_params = self._initialize_adam()
def _initialize_adam(self):
"""
Initialize Adam optimizer parameters.
Returns:
adam_params : dictionary containing first and second moment estimates
"""
adam_params = {
'v': {}, # First moment (momentum)
's': {}, # Second moment (RMSprop)
't': 0 # Time step
}
for layer in range(1, self.num_layers):
adam_params['v'][f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
adam_params['v'][f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
adam_params['s'][f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
adam_params['s'][f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
return adam_params
def update_parameters_with_adam(self, gradients, learning_rate,
beta1=0.9, beta2=0.999, epsilon=1e-8):
"""
Update parameters using Adam optimization.
Parameters:
gradients : dictionary containing gradients
learning_rate : float
beta1 : float
Exponential decay rate for first moment (typically 0.9)
beta2 : float
Exponential decay rate for second moment (typically 0.999)
epsilon : float
Small constant to prevent division by zero
"""
# Increment time step
self.adam_params['t'] += 1
t = self.adam_params['t']
for layer in range(1, self.num_layers):
# Update first moment (momentum)
self.adam_params['v'][f'dW{layer}'] = (
beta1 * self.adam_params['v'][f'dW{layer}'] +
(1 - beta1) * gradients[f'dW{layer}']
)
self.adam_params['v'][f'db{layer}'] = (
beta1 * self.adam_params['v'][f'db{layer}'] +
(1 - beta1) * gradients[f'db{layer}']
)
# Update second moment (RMSprop)
self.adam_params['s'][f'dW{layer}'] = (
beta2 * self.adam_params['s'][f'dW{layer}'] +
(1 - beta2) * np.square(gradients[f'dW{layer}'])
)
self.adam_params['s'][f'db{layer}'] = (
beta2 * self.adam_params['s'][f'db{layer}'] +
(1 - beta2) * np.square(gradients[f'db{layer}'])
)
# Bias correction for first moment
v_corrected_W = self.adam_params['v'][f'dW{layer}'] / (1 - beta1**t)
v_corrected_b = self.adam_params['v'][f'db{layer}'] / (1 - beta1**t)
# Bias correction for second moment
s_corrected_W = self.adam_params['s'][f'dW{layer}'] / (1 - beta2**t)
s_corrected_b = self.adam_params['s'][f'db{layer}'] / (1 - beta2**t)
# Update parameters
self.parameters[f'W{layer}'] -= (
learning_rate * v_corrected_W / (np.sqrt(s_corrected_W) + epsilon)
)
self.parameters[f'b{layer}'] -= (
learning_rate * v_corrected_b / (np.sqrt(s_corrected_b) + epsilon)
)
def train_with_adam(self, X, Y, learning_rate=0.001, num_epochs=100,
batch_size=64, beta1=0.9, beta2=0.999, print_cost=True):
"""
Train network using Adam optimization.
Parameters:
X : numpy array
Y : numpy array
learning_rate : float
Note: Adam typically works well with smaller learning rates
num_epochs : int
batch_size : int
beta1 : float
beta2 : float
print_cost : bool
Returns:
costs : list of costs
"""
costs = []
for epoch in range(num_epochs):
epoch_cost = 0
mini_batches = create_mini_batches(X, Y, batch_size)
num_batches = len(mini_batches)
for mini_batch in mini_batches:
mini_batch_X, mini_batch_Y = mini_batch
# Forward propagation
AL, caches = self.forward_propagation(mini_batch_X)
# Compute cost
batch_cost = self.compute_cost(AL, mini_batch_Y)
epoch_cost += batch_cost
# Backward propagation
gradients = self.backward_propagation(AL, mini_batch_Y, caches)
# Update with Adam
self.update_parameters_with_adam(gradients, learning_rate, beta1, beta2)
avg_cost = epoch_cost / num_batches
costs.append(avg_cost)
if print_cost and epoch % 10 == 0:
print(f"Cost after epoch {epoch}: {avg_cost:.6f}")
return costs
Let us compare the performance of different optimizers.
print("\nComparing different optimizers...")
# Standard gradient descent
print("\n1. Standard Gradient Descent:")
gd_net = DeepNeuralNetworkWithMiniBatch(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_gd = gd_net.train_with_mini_batches(
X_complex, Y_complex,
learning_rate=0.5,
num_epochs=100,
batch_size=32,
print_cost=False
)
# Momentum
print("\n2. Gradient Descent with Momentum:")
momentum_net = DeepNeuralNetworkWithMomentum(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_momentum = momentum_net.train_with_momentum(
X_complex, Y_complex,
learning_rate=0.5,
num_epochs=100,
batch_size=32,
beta=0.9,
print_cost=False
)
# Adam
print("\n3. Adam Optimizer:")
adam_net = DeepNeuralNetworkWithAdam(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_adam = adam_net.train_with_adam(
X_complex, Y_complex,
learning_rate=0.01,
num_epochs=100,
batch_size=32,
print_cost=False
)
# Compare final accuracies
pred_gd = gd_net.predict(X_complex)
pred_momentum = momentum_net.predict(X_complex)
pred_adam = adam_net.predict(X_complex)
print(f"\nFinal Accuracies:")
print(f"Standard GD: {compute_accuracy(pred_gd, Y_complex):.2f}%")
print(f"Momentum: {compute_accuracy(pred_momentum, Y_complex):.2f}%")
print(f"Adam: {compute_accuracy(pred_adam, Y_complex):.2f}%")
IMPLEMENTING EARLY STOPPING
Early stopping is a regularization technique that prevents overfitting. The idea is simple: we monitor the performance on a validation set during training, and stop training when the validation performance stops improving.
Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, and performs poorly on new data. Early stopping helps by stopping training before the model has a chance to overfit.
To implement early stopping, we need to split our data into training and validation sets. We train on the training set and evaluate on the validation set after each epoch. If the validation cost does not improve for a certain number of epochs (called patience), we stop training.
def split_train_validation(X, Y, validation_split=0.2):
"""
Split data into training and validation sets.
Parameters:
X : numpy array
Features
Y : numpy array
Labels
validation_split : float
Fraction of data to use for validation
Returns:
X_train, Y_train, X_val, Y_val : numpy arrays
"""
m = X.shape[1]
# Shuffle data
permutation = np.random.permutation(m)
X_shuffled = X[:, permutation]
Y_shuffled = Y[:, permutation]
# Split
split_index = int(m * (1 - validation_split))
X_train = X_shuffled[:, :split_index]
Y_train = Y_shuffled[:, :split_index]
X_val = X_shuffled[:, split_index:]
Y_val = Y_shuffled[:, split_index:]
return X_train, Y_train, X_val, Y_val
class DeepNeuralNetworkWithEarlyStopping(DeepNeuralNetworkWithAdam):
"""
Deep neural network with early stopping support.
Early stopping monitors validation performance and stops
training when it stops improving, preventing overfitting.
"""
def train_with_early_stopping(self, X_train, Y_train, X_val, Y_val,
learning_rate=0.001, num_epochs=1000,
batch_size=64, patience=10, print_cost=True):
"""
Train network with early stopping.
Parameters:
X_train, Y_train : numpy arrays
Training data
X_val, Y_val : numpy arrays
Validation data
learning_rate : float
num_epochs : int
Maximum number of epochs
batch_size : int
patience : int
Number of epochs to wait for improvement before stopping
print_cost : bool
Returns:
train_costs : list of training costs
val_costs : list of validation costs
best_epoch : int
Epoch where best validation performance was achieved
"""
train_costs = []
val_costs = []
best_val_cost = float('inf')
best_parameters = None
epochs_without_improvement = 0
best_epoch = 0
for epoch in range(num_epochs):
# Training phase
epoch_train_cost = 0
mini_batches = create_mini_batches(X_train, Y_train, batch_size)
num_batches = len(mini_batches)
for mini_batch in mini_batches:
mini_batch_X, mini_batch_Y = mini_batch
AL, caches = self.forward_propagation(mini_batch_X)
batch_cost = self.compute_cost(AL, mini_batch_Y)
epoch_train_cost += batch_cost
gradients = self.backward_propagation(AL, mini_batch_Y, caches)
self.update_parameters_with_adam(gradients, learning_rate)
avg_train_cost = epoch_train_cost / num_batches
train_costs.append(avg_train_cost)
# Validation phase
AL_val, _ = self.forward_propagation(X_val)
val_cost = self.compute_cost(AL_val, Y_val)
val_costs.append(val_cost)
# Check for improvement
if val_cost < best_val_cost:
best_val_cost = val_cost
best_parameters = {key: value.copy() for key, value in self.parameters.items()}
epochs_without_improvement = 0
best_epoch = epoch
else:
epochs_without_improvement += 1
if print_cost and epoch % 10 == 0:
print(f"Epoch {epoch}: Train cost = {avg_train_cost:.6f}, Val cost = {val_cost:.6f}")
# Early stopping check
if epochs_without_improvement >= patience:
print(f"\nEarly stopping triggered at epoch {epoch}")
print(f"Best validation cost: {best_val_cost:.6f} at epoch {best_epoch}")
# Restore best parameters
self.parameters = best_parameters
break
return train_costs, val_costs, best_epoch
Let us test early stopping.
print("\nTesting early stopping...")
# Split data
X_train, Y_train, X_val, Y_val = split_train_validation(X_complex, Y_complex, validation_split=0.2)
# Create network with early stopping
early_stop_net = DeepNeuralNetworkWithEarlyStopping(
layer_dimensions=[2, 16, 8, 1],
activation='relu'
)
# Train with early stopping
train_costs, val_costs, best_epoch = early_stop_net.train_with_early_stopping(
X_train, Y_train, X_val, Y_val,
learning_rate=0.01,
num_epochs=500,
batch_size=32,
patience=20,
print_cost=True
)
# Evaluate on validation set
pred_val = early_stop_net.predict(X_val)
val_accuracy = compute_accuracy(pred_val, Y_val)
print(f"\nValidation accuracy: {val_accuracy:.2f}%")
IMPLEMENTING REGULARIZATION
Regularization is another technique to prevent overfitting. It works by adding a penalty term to the cost function that discourages large weights. This encourages the network to learn simpler patterns that generalize better.
The most common form is L2 regularization (also called weight decay). The regularized cost becomes:
regularized_cost = original_cost + (lambda / (2 * m)) * sum_of_squared_weights
where lambda controls the strength of the regularization and m is the number of training examples.
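As a small numeric illustration with arbitrary toy weights, the penalty grows with the squared magnitude of every weight in the network:

W1_toy = np.array([[1.0, -2.0], [0.5, 0.0]])
W2_toy = np.array([[3.0, -1.0]])
lambd, m = 0.1, 200
l2_penalty = (lambd / (2 * m)) * (np.sum(np.square(W1_toy)) + np.sum(np.square(W2_toy)))
print(l2_penalty)   # (0.1 / 400) * 15.25 = 0.0038125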
class DeepNeuralNetworkWithRegularization(DeepNeuralNetworkWithEarlyStopping):
"""
Deep neural network with L2 regularization.
Regularization adds a penalty for large weights, which helps
prevent overfitting and improves generalization.
"""
def compute_cost_with_regularization(self, AL, Y, lambd):
"""
Compute cost with L2 regularization.
Parameters:
AL : numpy array
Network predictions
Y : numpy array
True labels
lambd : float
Regularization parameter
Returns:
cost : float
"""
m = Y.shape[1]
# Standard cross-entropy cost
epsilon = 1e-8
cross_entropy_cost = -np.sum(
Y * np.log(AL + epsilon) + (1 - Y) * np.log(1 - AL + epsilon)
) / m
# L2 regularization cost
l2_cost = 0
for layer in range(1, self.num_layers):
W = self.parameters[f'W{layer}']
l2_cost += np.sum(np.square(W))
l2_cost = (lambd / (2 * m)) * l2_cost
# Total cost
cost = cross_entropy_cost + l2_cost
return cost
def backward_propagation_with_regularization(self, AL, Y, caches, lambd):
"""
Backward propagation with L2 regularization.
The gradients for weights include an additional term from regularization.
Parameters:
AL : numpy array
Y : numpy array
caches : list
lambd : float
Returns:
gradients : dictionary
"""
# Standard backpropagation
gradients = self.backward_propagation(AL, Y, caches)
# Add regularization term to weight gradients
m = Y.shape[1]
for layer in range(1, self.num_layers):
W = self.parameters[f'W{layer}']
gradients[f'dW{layer}'] += (lambd / m) * W
return gradients
def train_with_regularization(self, X_train, Y_train, X_val, Y_val,
learning_rate=0.001, num_epochs=500,
batch_size=64, lambd=0.01, patience=20,
print_cost=True):
"""
Train network with L2 regularization and early stopping.
Parameters:
X_train, Y_train : numpy arrays
X_val, Y_val : numpy arrays
learning_rate : float
num_epochs : int
batch_size : int
lambd : float
Regularization parameter
patience : int
print_cost : bool
Returns:
train_costs, val_costs, best_epoch
"""
train_costs = []
val_costs = []
best_val_cost = float('inf')
best_parameters = None
epochs_without_improvement = 0
best_epoch = 0
for epoch in range(num_epochs):
epoch_train_cost = 0
mini_batches = create_mini_batches(X_train, Y_train, batch_size)
num_batches = len(mini_batches)
for mini_batch in mini_batches:
mini_batch_X, mini_batch_Y = mini_batch
AL, caches = self.forward_propagation(mini_batch_X)
batch_cost = self.compute_cost_with_regularization(AL, mini_batch_Y, lambd)
epoch_train_cost += batch_cost
gradients = self.backward_propagation_with_regularization(
AL, mini_batch_Y, caches, lambd
)
self.update_parameters_with_adam(gradients, learning_rate)
avg_train_cost = epoch_train_cost / num_batches
train_costs.append(avg_train_cost)
# Validation
AL_val, _ = self.forward_propagation(X_val)
val_cost = self.compute_cost_with_regularization(AL_val, Y_val, lambd)
val_costs.append(val_cost)
# Early stopping logic
if val_cost < best_val_cost:
best_val_cost = val_cost
best_parameters = {key: value.copy() for key, value in self.parameters.items()}
epochs_without_improvement = 0
best_epoch = epoch
else:
epochs_without_improvement += 1
if print_cost and epoch % 10 == 0:
print(f"Epoch {epoch}: Train = {avg_train_cost:.6f}, Val = {val_cost:.6f}")
if epochs_without_improvement >= patience:
print(f"\nEarly stopping at epoch {epoch}")
print(f"Best validation cost: {best_val_cost:.6f} at epoch {best_epoch}")
self.parameters = best_parameters
break
return train_costs, val_costs, best_epoch
IMPLEMENTING DROPOUT REGULARIZATION
Dropout is another powerful regularization technique. During training, we randomly set a fraction of neurons to zero in each forward pass. This prevents neurons from co-adapting too much and forces the network to learn more robust features.
During testing, we use all neurons. Because our implementation uses inverted dropout, dividing the surviving activations by the keep probability during training, the expected activation magnitudes already match at test time and no extra scaling is needed.
def dropout_forward(A, keep_prob):
"""
Apply dropout to activations.
Parameters:
A : numpy array
Activations from a layer
keep_prob : float
Probability of keeping each neuron (between 0 and 1)
Returns:
A_dropout : numpy array
Activations after dropout
mask : numpy array
Binary mask indicating which neurons were kept
"""
# Create random mask
mask = np.random.rand(*A.shape) < keep_prob
# Apply mask and scale
A_dropout = A * mask / keep_prob
return A_dropout, mask
def dropout_backward(dA, mask, keep_prob):
"""
Backpropagate through dropout.
Parameters:
dA : numpy array
Gradient of cost with respect to activations
mask : numpy array
Mask from forward pass
keep_prob : float
Returns:
dA_dropout : numpy array
"""
dA_dropout = dA * mask / keep_prob
return dA_dropout
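A quick illustration of these helpers, using an array of ones so the effect is easy to see: roughly keep_prob of the units survive, and the survivors are scaled up by 1 / keep_prob so the expected activation is unchanged.

A_demo = np.ones((4, 6))
A_dropped, mask_demo = dropout_forward(A_demo, keep_prob=0.8)
print(mask_demo.mean())    # close to 0.8: the fraction of units kept
print(A_dropped.max())     # 1.25 = 1 / 0.8: surviving units are scaled up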
PUTTING IT ALL TOGETHER: COMPLETE TRAINING PIPELINE
Now let us create a complete neural network class that incorporates all the features we have discussed: mini-batch training, Adam optimization, early stopping, L2 regularization, and dropout.
class CompleteNeuralNetwork:
"""
A complete neural network implementation with all advanced features.
Features:
- Flexible architecture with arbitrary depth
- Multiple activation functions
- Mini-batch gradient descent
- Adam optimization
- L2 regularization
- Dropout regularization
- Early stopping
"""
def __init__(self, layer_dimensions, activation='relu', dropout_rate=0.0):
"""
Initialize the complete neural network.
Parameters:
layer_dimensions : list
activation : str
dropout_rate : float
Fraction of neurons to drop (0 means no dropout)
"""
self.layer_dimensions = layer_dimensions
self.num_layers = len(layer_dimensions)
self.activation = activation
self.dropout_rate = dropout_rate
self.keep_prob = 1.0 - dropout_rate
self.parameters = self._initialize_parameters()
self.adam_params = self._initialize_adam()
def _initialize_parameters(self):
"""Initialize weights and biases."""
parameters = {}
for layer in range(1, self.num_layers):
if self.activation == 'relu':
scale = np.sqrt(2.0 / self.layer_dimensions[layer - 1])
else:
scale = np.sqrt(1.0 / self.layer_dimensions[layer - 1])
parameters[f'W{layer}'] = np.random.randn(
self.layer_dimensions[layer],
self.layer_dimensions[layer - 1]
) * scale
parameters[f'b{layer}'] = np.zeros((self.layer_dimensions[layer], 1))
return parameters
def _initialize_adam(self):
"""Initialize Adam optimizer parameters."""
adam_params = {'v': {}, 's': {}, 't': 0}
for layer in range(1, self.num_layers):
adam_params['v'][f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
adam_params['v'][f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
adam_params['s'][f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
adam_params['s'][f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
return adam_params
def forward_propagation(self, X, training=True):
"""
Forward propagation with optional dropout.
Parameters:
X : numpy array
training : bool
Whether we are in training mode (affects dropout)
Returns:
AL : numpy array
caches : list
dropout_masks : list (only if training and dropout_rate > 0)
"""
caches = []
dropout_masks = []
A = X
# Hidden layers
for layer in range(1, self.num_layers - 1):
A_prev = A
W = self.parameters[f'W{layer}']
b = self.parameters[f'b{layer}']
Z = np.dot(W, A_prev) + b
if self.activation == 'relu':
A = relu(Z)
else:
A = sigmoid(Z)
# Apply dropout during training
if training and self.dropout_rate > 0:
A, mask = dropout_forward(A, self.keep_prob)
dropout_masks.append(mask)
cache = {'A_prev': A_prev, 'Z': Z, 'W': W, 'b': b}
caches.append(cache)
# Output layer (no dropout)
W = self.parameters[f'W{self.num_layers - 1}']
b = self.parameters[f'b{self.num_layers - 1}']
Z = np.dot(W, A) + b
AL = sigmoid(Z)
cache = {'A_prev': A, 'Z': Z, 'W': W, 'b': b}
caches.append(cache)
if training and self.dropout_rate > 0:
return AL, caches, dropout_masks
else:
return AL, caches
def compute_cost(self, AL, Y, lambd=0.0):
"""
Compute cost with optional L2 regularization.
Parameters:
AL : numpy array
Y : numpy array
lambd : float
Returns:
cost : float
"""
m = Y.shape[1]
epsilon = 1e-8
# Cross-entropy cost
cross_entropy = -np.sum(
Y * np.log(AL + epsilon) + (1 - Y) * np.log(1 - AL + epsilon)
) / m
# L2 regularization
l2_cost = 0
if lambd > 0:
for layer in range(1, self.num_layers):
W = self.parameters[f'W{layer}']
l2_cost += np.sum(np.square(W))
l2_cost = (lambd / (2 * m)) * l2_cost
cost = cross_entropy + l2_cost
return cost
def backward_propagation(self, AL, Y, caches, dropout_masks=None, lambd=0.0):
"""
Backward propagation with optional dropout and regularization.
Parameters:
AL : numpy array
Y : numpy array
caches : list
dropout_masks : list or None
lambd : float
Returns:
gradients : dictionary
"""
gradients = {}
m = Y.shape[1]
L = self.num_layers - 1
# Output layer
dAL = -(np.divide(Y, AL + 1e-8) - np.divide(1 - Y, 1 - AL + 1e-8))
current_cache = caches[L - 1]
dZ = dAL * sigmoid_derivative(current_cache['Z'])
gradients[f'dW{L}'] = np.dot(dZ, current_cache['A_prev'].T) / m
gradients[f'db{L}'] = np.sum(dZ, axis=1, keepdims=True) / m
# Add regularization
if lambd > 0:
gradients[f'dW{L}'] += (lambd / m) * current_cache['W']
dA_prev = np.dot(current_cache['W'].T, dZ)
# Hidden layers
for layer in reversed(range(L - 1)):
current_cache = caches[layer]
# Apply dropout mask if available
if dropout_masks is not None and len(dropout_masks) > layer:
dA_prev = dropout_backward(dA_prev, dropout_masks[layer], self.keep_prob)
# Compute gradients
if self.activation == 'relu':
dZ = dA_prev * relu_derivative(current_cache['Z'])
else:
dZ = dA_prev * sigmoid_derivative(current_cache['Z'])
gradients[f'dW{layer + 1}'] = np.dot(dZ, current_cache['A_prev'].T) / m
gradients[f'db{layer + 1}'] = np.sum(dZ, axis=1, keepdims=True) / m
# Add regularization
if lambd > 0:
gradients[f'dW{layer + 1}'] += (lambd / m) * current_cache['W']
dA_prev = np.dot(current_cache['W'].T, dZ)
return gradients
def update_parameters_adam(self, gradients, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8):
"""Update parameters using Adam optimizer."""
self.adam_params['t'] += 1
t = self.adam_params['t']
for layer in range(1, self.num_layers):
# Update moments
self.adam_params['v'][f'dW{layer}'] = (
beta1 * self.adam_params['v'][f'dW{layer}'] +
(1 - beta1) * gradients[f'dW{layer}']
)
self.adam_params['v'][f'db{layer}'] = (
beta1 * self.adam_params['v'][f'db{layer}'] +
(1 - beta1) * gradients[f'db{layer}']
)
self.adam_params['s'][f'dW{layer}'] = (
beta2 * self.adam_params['s'][f'dW{layer}'] +
(1 - beta2) * np.square(gradients[f'dW{layer}'])
)
self.adam_params['s'][f'db{layer}'] = (
beta2 * self.adam_params['s'][f'db{layer}'] +
(1 - beta2) * np.square(gradients[f'db{layer}'])
)
# Bias correction
v_corrected_W = self.adam_params['v'][f'dW{layer}'] / (1 - beta1**t)
v_corrected_b = self.adam_params['v'][f'db{layer}'] / (1 - beta1**t)
s_corrected_W = self.adam_params['s'][f'dW{layer}'] / (1 - beta2**t)
s_corrected_b = self.adam_params['s'][f'db{layer}'] / (1 - beta2**t)
# Update parameters
self.parameters[f'W{layer}'] -= (
learning_rate * v_corrected_W / (np.sqrt(s_corrected_W) + epsilon)
)
self.parameters[f'b{layer}'] -= (
learning_rate * v_corrected_b / (np.sqrt(s_corrected_b) + epsilon)
)
def train(self, X_train, Y_train, X_val, Y_val, learning_rate=0.001,
num_epochs=500, batch_size=64, lambd=0.0, patience=20, print_cost=True):
"""
Complete training pipeline with all features.
Parameters:
X_train, Y_train : numpy arrays
X_val, Y_val : numpy arrays
learning_rate : float
num_epochs : int
batch_size : int
lambd : float
patience : int
print_cost : bool
Returns:
history : dictionary containing training history
"""
train_costs = []
val_costs = []
train_accuracies = []
val_accuracies = []
best_val_cost = float('inf')
best_parameters = None
epochs_without_improvement = 0
best_epoch = 0
for epoch in range(num_epochs):
epoch_train_cost = 0
mini_batches = create_mini_batches(X_train, Y_train, batch_size)
num_batches = len(mini_batches)
for mini_batch in mini_batches:
mini_batch_X, mini_batch_Y = mini_batch
# Forward propagation with dropout
if self.dropout_rate > 0:
AL, caches, dropout_masks = self.forward_propagation(mini_batch_X, training=True)
else:
AL, caches = self.forward_propagation(mini_batch_X, training=True)
dropout_masks = None
# Compute cost
batch_cost = self.compute_cost(AL, mini_batch_Y, lambd)
epoch_train_cost += batch_cost
# Backward propagation
gradients = self.backward_propagation(AL, mini_batch_Y, caches, dropout_masks, lambd)
# Update parameters
self.update_parameters_adam(gradients, learning_rate)
# Average training cost
avg_train_cost = epoch_train_cost / num_batches
train_costs.append(avg_train_cost)
# Training accuracy
train_pred = self.predict(X_train)
train_acc = compute_accuracy(train_pred, Y_train)
train_accuracies.append(train_acc)
# Validation cost and accuracy
AL_val, _ = self.forward_propagation(X_val, training=False)
val_cost = self.compute_cost(AL_val, Y_val, lambd)
val_costs.append(val_cost)
val_pred = self.predict(X_val)
val_acc = compute_accuracy(val_pred, Y_val)
val_accuracies.append(val_acc)
# Early stopping check
if val_cost < best_val_cost:
best_val_cost = val_cost
best_parameters = {key: value.copy() for key, value in self.parameters.items()}
epochs_without_improvement = 0
best_epoch = epoch
else:
epochs_without_improvement += 1
if print_cost and epoch % 10 == 0:
print(f"Epoch {epoch}: Train Cost = {avg_train_cost:.6f}, Val Cost = {val_cost:.6f}, "
f"Train Acc = {train_acc:.2f}%, Val Acc = {val_acc:.2f}%")
# Early stopping
if epochs_without_improvement >= patience:
print(f"\nEarly stopping triggered at epoch {epoch}")
print(f"Best validation cost: {best_val_cost:.6f} at epoch {best_epoch}")
self.parameters = best_parameters
break
# Restore the best parameters even if early stopping never triggered
if best_parameters is not None:
    self.parameters = best_parameters
history = {
'train_costs': train_costs,
'val_costs': val_costs,
'train_accuracies': train_accuracies,
'val_accuracies': val_accuracies,
'best_epoch': best_epoch
}
return history
def predict(self, X):
"""Make predictions."""
AL, _ = self.forward_propagation(X, training=False)
predictions = (AL > 0.5).astype(int)
return predictions
FINAL EXAMPLE: TRAINING A COMPLETE NETWORK
Let us now use our complete neural network implementation on a real example, demonstrating all the features we have built.
print("\n" + "="*70)
print("FINAL DEMONSTRATION: COMPLETE NEURAL NETWORK")
print("="*70)
# Generate a larger, more complex dataset
X_final, Y_final = generate_complex_dataset(n_samples=2000)
# Split into train and validation
X_train_final, Y_train_final, X_val_final, Y_val_final = split_train_validation(
X_final, Y_final, validation_split=0.2
)
print(f"\nDataset sizes:")
print(f"Training: {X_train_final.shape[1]} examples")
print(f"Validation: {X_val_final.shape[1]} examples")
# Create complete network with all features
print("\nCreating neural network with:")
print("- Architecture: [2, 32, 16, 8, 1]")
print("- Activation: ReLU")
print("- Dropout: 20%")
print("- L2 Regularization: lambda = 0.01")
print("- Optimizer: Adam")
print("- Early Stopping: patience = 30")
complete_net = CompleteNeuralNetwork(
layer_dimensions=[2, 32, 16, 8, 1],
activation='relu',
dropout_rate=0.2
)
# Train the network
print("\nTraining network...")
history = complete_net.train(
X_train_final, Y_train_final,
X_val_final, Y_val_final,
learning_rate=0.001,
num_epochs=500,
batch_size=32,
lambd=0.01,
patience=30,
print_cost=True
)
# Final evaluation
print("\n" + "="*70)
print("FINAL RESULTS")
print("="*70)
train_pred_final = complete_net.predict(X_train_final)
val_pred_final = complete_net.predict(X_val_final)
train_acc_final = compute_accuracy(train_pred_final, Y_train_final)
val_acc_final = compute_accuracy(val_pred_final, Y_val_final)
print(f"\nFinal Training Accuracy: {train_acc_final:.2f}%")
print(f"Final Validation Accuracy: {val_acc_final:.2f}%")
print(f"Best Epoch: {history['best_epoch']}")
print(f"Total Epochs Trained: {len(history['train_costs'])}")
VISUALIZING TRAINING PROGRESS
It is important to visualize how our network learns over time. Let us create a function to plot the training history.
def plot_training_history(history):
"""
Plot training and validation costs and accuracies.
Parameters:
history : dictionary containing training history
"""
epochs = range(len(history['train_costs']))
# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Plot costs
ax1.plot(epochs, history['train_costs'], label='Training Cost', linewidth=2)
ax1.plot(epochs, history['val_costs'], label='Validation Cost', linewidth=2)
ax1.axvline(x=history['best_epoch'], color='red', linestyle='--',
label=f'Best Epoch ({history["best_epoch"]})')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Cost')
ax1.set_title('Training and Validation Cost')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Plot accuracies
ax2.plot(epochs, history['train_accuracies'], label='Training Accuracy', linewidth=2)
ax2.plot(epochs, history['val_accuracies'], label='Validation Accuracy', linewidth=2)
ax2.axvline(x=history['best_epoch'], color='red', linestyle='--',
label=f'Best Epoch ({history["best_epoch"]})')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.set_title('Training and Validation Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Visualize the training history (uncomment the call below to display the plots)
# plot_training_history(history)
UNDERSTANDING HYPERPARAMETERS
Hyperparameters are settings that we choose before training begins. They are not learned from the data but significantly affect how well the network learns. Let us discuss the key hyperparameters and how to choose them.
The learning rate is perhaps the most important hyperparameter. If it is too high, training becomes unstable and may diverge; if it is too low, training is very slow. A good starting point for the Adam optimizer is 0.001. You can try values such as 0.0001, 0.001, and 0.01 and see which works best.
Batch size affects both training speed and generalization. Smaller batches (like 32 or 64) provide more frequent updates and can help escape local minima, but training is noisier. Larger batches (like 128 or 256) provide more stable gradients but require more memory. Common choices are 32, 64, 128, or 256.
The number of hidden layers and neurons per layer determines the network's capacity. More layers and neurons allow the network to learn more complex patterns, but also increase the risk of overfitting. Start with a moderate architecture and increase complexity if the network underfits.
Regularization strength (lambda) controls how much we penalize large weights. Higher values prevent overfitting more strongly but may cause underfitting. Typical values range from 0.0001 to 0.1. Start with 0.01 and adjust based on the gap between training and validation performance.
Dropout rate determines what fraction of neurons to randomly drop during training. Common values are 0.2 to 0.5. Higher dropout provides stronger regularization but may slow down training.
Early stopping patience determines how many epochs to wait for improvement before stopping. This depends on your dataset size and complexity. For small datasets, 10 to 20 epochs might be enough. For larger datasets, you might use 30 to 50.
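To make this concrete, here is a minimal grid-search sketch over the learning rate and the L2 strength. It assumes the CompleteNeuralNetwork class, compute_accuracy, and the train/validation splits from the final example above are already defined; in practice you would also vary the architecture, dropout rate, and batch size.
# A small grid search over learning rate and L2 strength (sketch).
best_config = None
best_val_accuracy = 0.0
for lr in [0.0001, 0.001, 0.01]:
    for reg_lambda in [0.0, 0.01, 0.1]:
        candidate = CompleteNeuralNetwork(
            layer_dimensions=[2, 16, 8, 1],
            activation='relu',
            dropout_rate=0.2
        )
        candidate.train(
            X_train_final, Y_train_final,
            X_val_final, Y_val_final,
            learning_rate=lr,
            num_epochs=100,
            batch_size=64,
            lambd=reg_lambda,
            patience=15,
            print_cost=False
        )
        val_acc = compute_accuracy(candidate.predict(X_val_final), Y_val_final)
        print(f"lr={lr}, lambda={reg_lambda}: validation accuracy = {val_acc:.2f}%")
        if val_acc > best_val_accuracy:
            best_val_accuracy = val_acc
            best_config = (lr, reg_lambda)
print(f"Best setting: learning rate = {best_config[0]}, lambda = {best_config[1]}")
Each candidate is trained from scratch and judged only on validation accuracy, so the comparison is fair; the winning configuration can then be retrained for more epochs.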
TIPS FOR DEBUGGING NEURAL NETWORKS
Neural networks can be tricky to debug because there are many things that can go wrong. Here are some tips to help you identify and fix problems.
If your training cost is not decreasing, first check that your learning rate is not too small. Try increasing it by a factor of 10. Also verify that your backward propagation is correctly implemented by using gradient checking.
If the cost decreases initially but then plateaus at a high value, your network might be stuck in a local minimum or the learning rate might be too low. Try increasing the learning rate or using a different initialization.
If you see the cost exploding to very large values or becoming NaN (not a number), your learning rate is probably too high. Reduce it by a factor of 10. Also check for numerical instability in your activation functions.
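One simple guard while debugging is to fail fast as soon as the cost blows up. The helper below is a hypothetical addition, not part of the class above; you could call it right after compute_cost inside the training loop.
def assert_finite_cost(cost, epoch):
    """Raise immediately if the cost has become NaN or infinite."""
    if np.isnan(cost) or np.isinf(cost):
        raise ValueError(f"Cost became {cost} at epoch {epoch}; try lowering the learning rate.")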
If training accuracy is high but validation accuracy is much lower, your network is overfitting. Add regularization (L2 or dropout), reduce the network size, or get more training data.
If both training and validation accuracy are low, your network is underfitting. Try increasing the network size, training for more epochs, or reducing regularization.
If training is very slow, consider using a larger batch size, a faster optimizer like Adam, or reducing the network size.
GRADIENT CHECKING FOR DEBUGGING
Gradient checking is a technique to verify that your backward propagation is correctly implemented. The idea is to numerically approximate the gradients and compare them with the gradients computed by backpropagation.
The numerical gradient for a parameter theta is approximately:
gradient ≈ (cost(theta + epsilon) - cost(theta - epsilon)) / (2 * epsilon)
where epsilon is a small value like 1e-7.
def gradient_check(network, X, Y, epsilon=1e-7, threshold=1e-7):
"""
Perform gradient checking to verify backpropagation implementation.
This compares analytical gradients from backpropagation with
numerical gradients computed using finite differences.
Parameters:
network : CompleteNeuralNetwork instance
X : numpy array
Small sample of input data
Y : numpy array
Corresponding labels
epsilon : float
Small value for numerical gradient computation
threshold : float
Maximum acceptable difference
Returns:
difference : float
Relative difference between gradients
"""
# Get analytical gradients
AL, caches = network.forward_propagation(X, training=False)
gradients = network.backward_propagation(AL, Y, caches, dropout_masks=None, lambd=0.0)
# Flatten all parameters and gradients into vectors
params_values = []
grad_values = []
for layer in range(1, network.num_layers):
params_values.extend(network.parameters[f'W{layer}'].flatten())
params_values.extend(network.parameters[f'b{layer}'].flatten())
grad_values.extend(gradients[f'dW{layer}'].flatten())
grad_values.extend(gradients[f'db{layer}'].flatten())
params_values = np.array(params_values)
grad_values = np.array(grad_values)
# Compute numerical gradients
num_gradients = np.zeros_like(params_values)
for i in range(len(params_values)):
# Compute cost with theta + epsilon
params_plus = params_values.copy()
params_plus[i] += epsilon
network_copy_plus = _set_parameters_from_vector(network, params_plus)
AL_plus, _ = network_copy_plus.forward_propagation(X, training=False)
cost_plus = network_copy_plus.compute_cost(AL_plus, Y, lambd=0.0)
# Compute cost with theta - epsilon
params_minus = params_values.copy()
params_minus[i] -= epsilon
network_copy_minus = _set_parameters_from_vector(network, params_minus)
AL_minus, _ = network_copy_minus.forward_propagation(X, training=False)
cost_minus = network_copy_minus.compute_cost(AL_minus, Y, lambd=0.0)
# Numerical gradient
num_gradients[i] = (cost_plus - cost_minus) / (2 * epsilon)
# Compute relative difference
numerator = np.linalg.norm(grad_values - num_gradients)
denominator = np.linalg.norm(grad_values) + np.linalg.norm(num_gradients)
difference = numerator / denominator
if difference < threshold:
print(f"Gradient check passed! Difference: {difference:.10f}")
else:
print(f"WARNING: Gradient check failed! Difference: {difference:.10f}")
print(f"This suggests an error in the backpropagation implementation.")
return difference
def _set_parameters_from_vector(network, params_vector):
"""Helper function to set network parameters from a vector."""
import copy
network_copy = copy.deepcopy(network)
idx = 0
for layer in range(1, network.num_layers):
W_shape = network.parameters[f'W{layer}'].shape
W_size = W_shape[0] * W_shape[1]
network_copy.parameters[f'W{layer}'] = params_vector[idx:idx + W_size].reshape(W_shape)
idx += W_size
b_shape = network.parameters[f'b{layer}'].shape
b_size = b_shape[0] * b_shape[1]
network_copy.parameters[f'b{layer}'] = params_vector[idx:idx + b_size].reshape(b_shape)
idx += b_size
return network_copy
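To see gradient checking in action, here is a sketch of one possible call. It reuses a handful of examples from the dataset generated in the final demonstration and deliberately builds a tiny sigmoid network with no dropout, because the numerical loop is slow and ReLU kinks or random dropout masks can produce spurious differences.
# Sketch: gradient-check a tiny sigmoid network on five examples.
X_check = X_final[:, :5]
Y_check = Y_final[:, :5]
tiny_net = CompleteNeuralNetwork(
    layer_dimensions=[2, 4, 1],
    activation='sigmoid',
    dropout_rate=0.0
)
difference = gradient_check(tiny_net, X_check, Y_check)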
PRACTICAL RECOMMENDATIONS
Based on everything we have learned, here are some practical recommendations for building and training neural networks.
Start simple. Begin with a small network and simple settings. Make sure it can overfit a small subset of your data. If it cannot overfit, there is likely a bug in your implementation.
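For instance, a quick sanity check along these lines might look like the following sketch, which reuses names from the final demonstration. A correct implementation should reach close to 100 percent accuracy on such a tiny subset.
# Sanity check: a bug-free network should easily memorize a tiny subset.
X_tiny = X_train_final[:, :50]
Y_tiny = Y_train_final[:, :50]
sanity_net = CompleteNeuralNetwork(layer_dimensions=[2, 16, 1], activation='relu', dropout_rate=0.0)
sanity_net.train(X_tiny, Y_tiny, X_tiny, Y_tiny, learning_rate=0.01,
                 num_epochs=300, batch_size=50, lambd=0.0, patience=300, print_cost=False)
tiny_acc = compute_accuracy(sanity_net.predict(X_tiny), Y_tiny)
print(f"Accuracy on the tiny subset: {tiny_acc:.2f}% (expect close to 100%)")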
Use Adam optimizer. For most problems, Adam works well out of the box with a learning rate of 0.001. It is a good default choice.
Normalize your input data. Scale your features to have zero mean and unit variance. This helps the network train faster and more stably.
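As a concrete sketch, assuming feature matrices shaped (number of features, number of examples) as used throughout this tutorial, standardization looks like this. The key detail is that the mean and standard deviation are computed on the training set only and then reused for the validation data.
# Standardize features to zero mean and unit variance using training statistics.
feature_mean = X_train_final.mean(axis=1, keepdims=True)
feature_std = X_train_final.std(axis=1, keepdims=True) + 1e-8  # guard against division by zero
X_train_scaled = (X_train_final - feature_mean) / feature_std
X_val_scaled = (X_val_final - feature_mean) / feature_std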
Use ReLU activation for hidden layers. ReLU is simple, fast, and works well in practice. Use sigmoid or softmax for the output layer depending on your task.
Start without regularization. First get your network to work without regularization. Then add L2 regularization or dropout if you see overfitting.
Monitor both training and validation metrics. Always keep track of both training and validation performance. A large gap indicates overfitting.
Use early stopping. It is a simple and effective way to prevent overfitting without having to tune regularization hyperparameters.
Experiment with architecture. Try different numbers of layers and neurons. Deeper networks can learn more complex patterns but are harder to train.
Be patient. Training neural networks can take time. Do not give up too quickly if results are not perfect immediately.
CONCLUSION
Congratulations! You have now built a complete deep neural network from scratch. We started with the basics of a single neuron and gradually added complexity: multiple layers, different activation functions, mini-batch training, advanced optimizers such as momentum and Adam, regularization techniques, and early stopping.
You now understand not just how to use neural networks, but how they actually work under the hood. This knowledge will help you debug problems, choose appropriate architectures, and understand what is happening when you use high-level libraries like TensorFlow or PyTorch.
The key concepts we covered include forward propagation for making predictions, backward propagation for computing gradients, gradient descent and its variants for optimization, regularization for preventing overfitting, and various practical techniques for training neural networks effectively.
Remember that building neural networks is as much art as science. There is no one-size-fits-all solution. You will need to experiment with different architectures, hyperparameters, and techniques to find what works best for your specific problem.
The complete implementation we built provides a solid foundation. You can extend it further by adding more activation functions, implementing different cost functions for multi-class classification, adding batch normalization, or implementing convolutional layers for image data.
Keep learning, keep experimenting, and most importantly, keep building. The best way to truly understand neural networks is to implement them yourself and see how they behave with different settings and datasets.
Happy learning!