Friday, May 15, 2026

NEURAL NETWORKS FROM SCRATCH: A COMPLETE GUIDE FOR BEGINNERS

INTRODUCTION

Welcome to this comprehensive tutorial on building neural networks from scratch in Python. This guide will take you on a journey from the absolute basics to advanced concepts in deep learning. We will not rely on high-level libraries like TensorFlow or PyTorch for the core implementation. Instead, we will build everything ourselves using only NumPy for numerical operations. This approach will give you a deep understanding of what happens under the hood when you train a neural network.

By the end of this tutorial, you will understand how neurons work, how networks learn through backpropagation, how to implement various optimization algorithms, and how to add practical features like batch processing and early stopping. Each concept will be explained thoroughly before we implement it in code.

WHAT IS A NEURAL NETWORK?

A neural network is a computational model inspired by the way biological neurons work in the human brain. At its core, a neural network consists of layers of interconnected nodes called neurons. Each neuron receives inputs, processes them, and produces an output that gets passed to the next layer.

The simplest neural network has three types of layers. The input layer receives the raw data. Hidden layers perform computations and extract features from the data. The output layer produces the final prediction or classification.

The power of neural networks comes from their ability to learn complex patterns in data. They do this by adjusting internal parameters called weights and biases during a training process. This training process uses examples of input data paired with correct outputs to gradually improve the network's predictions.

THE MATHEMATICS BEHIND A SINGLE NEURON

Before we build a full network, let us understand how a single neuron works. A neuron takes multiple inputs, multiplies each input by a weight, adds all these weighted inputs together, adds a bias term, and then applies an activation function to produce an output.

Mathematically, for a neuron with inputs x1, x2, x3 and corresponding weights w1, w2, w3, the weighted sum z is calculated as:

z = w1 * x1 + w2 * x2 + w3 * x3 + b

where b is the bias term. The bias allows the neuron to shift its activation function left or right, which helps the network fit the data better.

After computing z, we apply an activation function to introduce non-linearity. Without activation functions, no matter how many layers we stack, the network would only be able to learn linear relationships. Common activation functions include sigmoid, tanh, and ReLU.
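To see why non-linearity matters, note that composing two purely linear layers collapses into a single linear map. A minimal sketch with arbitrary example matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first "layer" weights (no activation)
W2 = rng.standard_normal((2, 4))   # second "layer" weights (no activation)
x = rng.standard_normal((3, 1))    # one input example

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer whose weight matrix is W2 @ W1
one_layer = (W2 @ W1) @ x

assert np.allclose(two_layers, one_layer)
```

No matter how many linear layers we stack, the result is always equivalent to one linear layer, which is why we insert an activation function between them.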

The sigmoid function squashes any input value to a range between 0 and 1:

sigmoid(z) = 1 / (1 + exp(-z))

The ReLU (Rectified Linear Unit) function is simpler and often works better in practice:

ReLU(z) = max(0, z)

This means ReLU returns z when z is positive, and 0 otherwise.
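As a quick sanity check, we can work through one neuron by hand with made-up numbers for the inputs, weights, and bias:

```python
import numpy as np

# Arbitrary example values for a 3-input neuron
x = np.array([1.0, 2.0, 3.0])      # inputs x1, x2, x3
w = np.array([0.5, -0.25, 0.1])    # weights w1, w2, w3
b = 0.2                            # bias

z = np.dot(w, x) + b               # 0.5 - 0.5 + 0.3 + 0.2 = 0.5
relu_out = max(0.0, z)             # z is positive, so ReLU passes it through: 0.5
sigmoid_out = 1 / (1 + np.exp(-z)) # squashed into (0, 1): about 0.622
```

The weighted sum z matches the formula above, and the two activations show how the same z is transformed differently by ReLU and sigmoid.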

SETTING UP OUR PYTHON ENVIRONMENT

Before we start coding, we need to import the necessary libraries. We will use NumPy for all our numerical computations and matplotlib for visualizing our results.

import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Optional

We set a random seed to ensure our results are reproducible. This means every time we run our code, we will get the same random initialization of weights.

np.random.seed(42)

BUILDING OUR FIRST SIMPLE NEURAL NETWORK

Let us start by creating a very simple neural network with one hidden layer. This network will have an input layer, one hidden layer with a few neurons, and an output layer. We will build it step by step, explaining each component.

First, we need to implement the activation functions we discussed earlier. We will implement both the forward pass (computing the activation) and the backward pass (computing the derivative, which we need for backpropagation).

def sigmoid(z):
    """
    Compute the sigmoid activation function.
    
    The sigmoid function maps any real number to a value between 0 and 1.
    It is useful for binary classification problems.
    
    Parameters:
    z : numpy array of any shape
    
    Returns:
    activation : numpy array of same shape as z
    """
    return 1 / (1 + np.exp(-z))


def sigmoid_derivative(z):
    """
    Compute the derivative of the sigmoid function.
    
    This is used during backpropagation to compute gradients.
    The derivative of sigmoid(z) is sigmoid(z) * (1 - sigmoid(z)).
    
    Parameters:
    z : numpy array of any shape
    
    Returns:
    derivative : numpy array of same shape as z
    """
    sig = sigmoid(z)
    return sig * (1 - sig)

Now let us implement the ReLU activation function and its derivative.

def relu(z):
    """
    Compute the ReLU (Rectified Linear Unit) activation function.
    
    ReLU returns the input if it is positive, otherwise returns 0.
    It is computationally efficient and works well in practice.
    
    Parameters:
    z : numpy array of any shape
    
    Returns:
    activation : numpy array of same shape as z
    """
    return np.maximum(0, z)


def relu_derivative(z):
    """
    Compute the derivative of the ReLU function.
    
    The derivative is 1 where z > 0, and 0 elsewhere.
    
    Parameters:
    z : numpy array of any shape
    
    Returns:
    derivative : numpy array of same shape as z
    """
    return (z > 0).astype(float)

INITIALIZING NETWORK PARAMETERS

When we create a neural network, we need to initialize the weights and biases. The way we initialize these parameters can significantly affect how well and how quickly the network learns.

A common approach is to initialize weights randomly with small values. If weights are too large, the activations can explode. If they are too small or all zero, the network may not learn effectively. We will use a technique called He initialization for ReLU networks, which scales the random weights based on the number of inputs.


def initialize_parameters(layer_dimensions):
    """
    Initialize the weights and biases for all layers in the network.
    
    We use He initialization for weights, which works well with ReLU activations.
    Biases are initialized to zeros.
    
    Parameters:
    layer_dimensions : list of integers representing the number of units in each layer
                      For example, [784, 128, 64, 10] means:
                      - Input layer: 784 features
                      - First hidden layer: 128 neurons
                      - Second hidden layer: 64 neurons
                      - Output layer: 10 neurons
    
    Returns:
    parameters : dictionary containing weights (W) and biases (b) for each layer
    """
    parameters = {}
    num_layers = len(layer_dimensions)
    
    for layer in range(1, num_layers):
        # He initialization: multiply random values by sqrt(2 / n_previous_layer)
        # This helps prevent vanishing or exploding gradients
        parameters[f'W{layer}'] = np.random.randn(
            layer_dimensions[layer], 
            layer_dimensions[layer - 1]
        ) * np.sqrt(2.0 / layer_dimensions[layer - 1])
        
        # Initialize biases to zeros
        parameters[f'b{layer}'] = np.zeros((layer_dimensions[layer], 1))
    
    return parameters

FORWARD PROPAGATION: MAKING PREDICTIONS

Forward propagation is the process of passing input data through the network to get predictions. At each layer, we compute the weighted sum of inputs plus bias, then apply an activation function.

Let us implement forward propagation for a network with one hidden layer using ReLU activation and an output layer using sigmoid activation.

def forward_propagation_simple(X, parameters):
    """
    Perform forward propagation through a simple 2-layer network.
    
    The network architecture is:
    Input -> Hidden Layer (ReLU) -> Output Layer (Sigmoid)
    
    Parameters:
    X : numpy array of shape (n_features, n_examples)
        Input data where each column is one training example
    parameters : dictionary containing W1, b1, W2, b2
    
    Returns:
    A2 : numpy array of shape (n_output, n_examples)
         Final output (predictions)
    cache : dictionary containing intermediate values needed for backpropagation
    """
    # Retrieve parameters
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    # Forward propagation for hidden layer
    # Z1 is the weighted sum before activation
    Z1 = np.dot(W1, X) + b1
    # A1 is the activation output
    A1 = relu(Z1)
    
    # Forward propagation for output layer
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    
    # Store values for backpropagation
    cache = {
        'Z1': Z1,
        'A1': A1,
        'Z2': Z2,
        'A2': A2
    }
    
    return A2, cache

COMPUTING THE COST FUNCTION

The cost function (also called loss function) measures how wrong our network's predictions are. During training, we want to minimize this cost. For binary classification, we typically use binary cross-entropy loss.

The binary cross-entropy cost for a single example is:

cost = -(y * log(prediction) + (1 - y) * log(1 - prediction))

where y is the true label (0 or 1) and prediction is our network's output.

For multiple examples, we average the cost across all examples.

def compute_cost(A2, Y):
    """
    Compute the binary cross-entropy cost.
    
    This measures how different our predictions are from the true labels.
    Lower cost means better predictions.
    
    Parameters:
    A2 : numpy array of shape (1, n_examples)
         Network predictions (probabilities between 0 and 1)
    Y : numpy array of shape (1, n_examples)
        True labels (0 or 1)
    
    Returns:
    cost : float
           Average cost across all examples
    """
    m = Y.shape[1]  # Number of examples
    
    # Compute cross-entropy cost
    # We add a small epsilon to avoid log(0)
    epsilon = 1e-8
    cost = -np.sum(Y * np.log(A2 + epsilon) + (1 - Y) * np.log(1 - A2 + epsilon)) / m
    
    return cost

BACKWARD PROPAGATION: LEARNING FROM MISTAKES

Backward propagation is the heart of how neural networks learn. It computes the gradient of the cost function with respect to each parameter (weight and bias). These gradients tell us how to adjust the parameters to reduce the cost.

The process works backwards from the output layer to the input layer, using the chain rule from calculus. For each layer, we compute how much each parameter contributed to the error.

def backward_propagation_simple(X, Y, parameters, cache):
    """
    Perform backward propagation to compute gradients.
    
    This calculates how much each weight and bias should change
    to reduce the cost function.
    
    Parameters:
    X : numpy array of shape (n_features, n_examples)
        Input data
    Y : numpy array of shape (1, n_examples)
        True labels
    parameters : dictionary containing W1, b1, W2, b2
    cache : dictionary containing Z1, A1, Z2, A2 from forward propagation
    
    Returns:
    gradients : dictionary containing dW1, db1, dW2, db2
    """
    m = X.shape[1]  # Number of examples
    
    # Retrieve cached values
    Z1 = cache['Z1']
    A1 = cache['A1']
    A2 = cache['A2']
    
    # Retrieve parameters
    W2 = parameters['W2']
    
    # Backward propagation for output layer
    # dZ2 is the gradient of cost with respect to Z2
    dZ2 = A2 - Y
    
    # Gradients for W2 and b2
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    
    # Backward propagation for hidden layer
    # We propagate the gradient back through W2
    dA1 = np.dot(W2.T, dZ2)
    # Then multiply by the derivative of ReLU
    dZ1 = dA1 * relu_derivative(Z1)
    
    # Gradients for W1 and b1
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    
    gradients = {
        'dW1': dW1,
        'db1': db1,
        'dW2': dW2,
        'db2': db2
    }
    
    return gradients

UPDATING PARAMETERS WITH GRADIENT DESCENT

Once we have computed the gradients, we need to update our parameters. The simplest optimization algorithm is gradient descent. We move each parameter in the opposite direction of its gradient, scaled by a learning rate.

The update rule is:

new_weight = old_weight - learning_rate * gradient

The learning rate controls how big our steps are. If it is too large, we might overshoot the minimum. If it is too small, training will be very slow.
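To build intuition for this trade-off, consider the toy function f(w) = w^2, whose gradient is 2w and whose minimum is at w = 0. A small sketch of gradient descent at three learning rates:

```python
def descend(learning_rate, steps=50):
    """Run gradient descent on f(w) = w**2 (gradient 2*w) from w = 1.0."""
    w = 1.0
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

small = descend(0.01)  # shrinks by factor 0.98 per step: still far from 0
good = descend(0.3)    # shrinks by factor 0.4 per step: essentially at 0
big = descend(1.1)     # flips sign and grows by factor 1.2 per step: diverges
```

With a tiny learning rate the iterate barely moves, with a moderate one it converges quickly, and with a rate that is too large each step overshoots the minimum by more than it corrects, so |w| grows without bound.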

def update_parameters(parameters, gradients, learning_rate):
    """
    Update parameters using gradient descent.
    
    Each parameter is adjusted in the direction that reduces the cost.
    
    Parameters:
    parameters : dictionary containing current W1, b1, W2, b2
    gradients : dictionary containing dW1, db1, dW2, db2
    learning_rate : float
                    Controls the step size of parameter updates
    
    Returns:
    parameters : dictionary containing updated W1, b1, W2, b2
    """
    # Update weights and biases for each layer
    parameters['W1'] = parameters['W1'] - learning_rate * gradients['dW1']
    parameters['b1'] = parameters['b1'] - learning_rate * gradients['db1']
    parameters['W2'] = parameters['W2'] - learning_rate * gradients['dW2']
    parameters['b2'] = parameters['b2'] - learning_rate * gradients['db2']
    
    return parameters

PUTTING IT ALL TOGETHER: TRAINING THE NETWORK

Now we can combine all the pieces into a complete training loop. We will repeatedly perform forward propagation, compute the cost, perform backward propagation, and update the parameters.

def train_simple_network(X, Y, layer_dimensions, learning_rate=0.01, num_iterations=1000, print_cost=True):
    """
    Train a simple 2-layer neural network.
    
    This function performs the complete training process:
    1. Initialize parameters
    2. For each iteration:
       - Forward propagation
       - Compute cost
       - Backward propagation
       - Update parameters
    
    Parameters:
    X : numpy array of shape (n_features, n_examples)
        Training data
    Y : numpy array of shape (1, n_examples)
        Training labels
    layer_dimensions : list of layer sizes [n_input, n_hidden, n_output]
    learning_rate : float
                    Learning rate for gradient descent
    num_iterations : int
                     Number of training iterations
    print_cost : bool
                 Whether to print cost every 100 iterations
    
    Returns:
    parameters : dictionary containing trained weights and biases
    costs : list of costs computed during training
    """
    costs = []
    
    # Initialize parameters
    parameters = initialize_parameters(layer_dimensions)
    
    # Training loop
    for iteration in range(num_iterations):
        # Forward propagation
        A2, cache = forward_propagation_simple(X, parameters)
        
        # Compute cost
        cost = compute_cost(A2, Y)
        costs.append(cost)
        
        # Backward propagation
        gradients = backward_propagation_simple(X, Y, parameters, cache)
        
        # Update parameters
        parameters = update_parameters(parameters, gradients, learning_rate)
        
        # Print cost every 100 iterations
        if print_cost and iteration % 100 == 0:
            print(f"Cost after iteration {iteration}: {cost:.6f}")
    
    return parameters, costs

TESTING OUR SIMPLE NETWORK

Let us create a simple dataset and test our neural network. We will generate synthetic data for a binary classification problem.

def generate_simple_dataset(n_samples=1000):
    """
    Generate a simple synthetic dataset for binary classification.
    
    This creates two classes of points that are linearly separable
    with some noise added.
    
    Parameters:
    n_samples : int
                Number of samples to generate
    
    Returns:
    X : numpy array of shape (2, n_samples)
        Features
    Y : numpy array of shape (1, n_samples)
        Labels (0 or 1)
    """
    # Generate random points
    np.random.seed(42)
    
    # Class 0: points clustered around (-2, -2)
    X_class0 = np.random.randn(2, n_samples // 2) + np.array([[-2], [-2]])
    Y_class0 = np.zeros((1, n_samples // 2))
    
    # Class 1: points clustered around (2, 2)
    X_class1 = np.random.randn(2, n_samples // 2) + np.array([[2], [2]])
    Y_class1 = np.ones((1, n_samples // 2))
    
    # Combine both classes
    X = np.concatenate([X_class0, X_class1], axis=1)
    Y = np.concatenate([Y_class0, Y_class1], axis=1)
    
    # Shuffle the data
    permutation = np.random.permutation(n_samples)
    X = X[:, permutation]
    Y = Y[:, permutation]
    
    return X, Y

Now let us train our network on this dataset.

# Generate dataset
X_train, Y_train = generate_simple_dataset(n_samples=1000)

# Define network architecture
# Input layer: 2 features
# Hidden layer: 4 neurons
# Output layer: 1 neuron (binary classification)
layer_dims = [2, 4, 1]

# Train the network
print("Training simple neural network...")
parameters, costs = train_simple_network(
    X_train, 
    Y_train, 
    layer_dims, 
    learning_rate=0.5, 
    num_iterations=1000, 
    print_cost=True
)

print("\nTraining complete!")

MAKING PREDICTIONS

After training, we need a function to make predictions on new data.

def predict(X, parameters):
    """
    Make predictions using the trained network.
    
    Parameters:
    X : numpy array of shape (n_features, n_examples)
        Input data
    parameters : dictionary containing trained weights and biases
    
    Returns:
    predictions : numpy array of shape (1, n_examples)
                  Predicted class (0 or 1) for each example
    """
    # Forward propagation
    A2, _ = forward_propagation_simple(X, parameters)
    
    # Convert probabilities to binary predictions
    # If probability > 0.5, predict class 1, otherwise class 0
    predictions = (A2 > 0.5).astype(int)
    
    return predictions


def compute_accuracy(predictions, Y):
    """
    Compute the accuracy of predictions.
    
    Parameters:
    predictions : numpy array of shape (1, n_examples)
                  Predicted labels
    Y : numpy array of shape (1, n_examples)
        True labels
    
    Returns:
    accuracy : float
               Percentage of correct predictions
    """
    accuracy = np.mean(predictions == Y) * 100
    return accuracy

Let us test our trained network.

# Make predictions on training data
predictions = predict(X_train, parameters)
accuracy = compute_accuracy(predictions, Y_train)
print(f"Training accuracy: {accuracy:.2f}%")

UNDERSTANDING WHAT THE NETWORK LEARNED

To visualize what our network learned, we can plot the decision boundary. This shows how the network divides the input space into regions for each class.

def plot_decision_boundary(X, Y, parameters):
    """
    Plot the decision boundary learned by the network.
    
    This creates a visualization showing how the network
    classifies different regions of the input space.
    
    Parameters:
    X : numpy array of shape (2, n_examples)
        Input data (must be 2D for visualization)
    Y : numpy array of shape (1, n_examples)
        True labels
    parameters : dictionary containing trained weights and biases
    """
    # Set up the grid
    x_min, x_max = X[0, :].min() - 1, X[0, :].max() + 1
    y_min, y_max = X[1, :].min() - 1, X[1, :].max() + 1
    h = 0.1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Make predictions for every point in the grid
    grid_points = np.c_[xx.ravel(), yy.ravel()].T
    Z, _ = forward_propagation_simple(grid_points, parameters)
    Z = Z.reshape(xx.shape)
    
    # Plot the decision boundary
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, levels=[0, 0.5, 1], alpha=0.3, colors=['blue', 'red'])
    
    # Plot the training points
    plt.scatter(X[0, Y[0] == 0], X[1, Y[0] == 0], c='blue', marker='o', label='Class 0', edgecolors='k')
    plt.scatter(X[0, Y[0] == 1], X[1, Y[0] == 1], c='red', marker='s', label='Class 1', edgecolors='k')
    
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Decision Boundary of Neural Network')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

BUILDING A DEEPER NETWORK

Now that we understand the basics, let us build a more flexible network that can have any number of layers. This is where deep learning gets its name: from using networks with many layers.

A deeper network can learn more complex patterns because each layer can build on the features learned by previous layers. The first layers might learn simple features like edges, while deeper layers combine these into more complex patterns.

class DeepNeuralNetwork:
    """
    A flexible deep neural network with arbitrary depth.
    
    This class encapsulates all the functionality needed to create,
    train, and use a deep neural network with multiple hidden layers.
    """
    
    def __init__(self, layer_dimensions, activation='relu'):
        """
        Initialize the deep neural network.
        
        Parameters:
        layer_dimensions : list of integers
                          Number of units in each layer
                          Example: [784, 128, 64, 10] creates a network with
                          784 input features, two hidden layers (128 and 64 units),
                          and 10 output units
        activation : str
                    Activation function to use in hidden layers ('relu' or 'sigmoid')
        """
        self.layer_dimensions = layer_dimensions
        self.num_layers = len(layer_dimensions)
        self.activation = activation
        self.parameters = self._initialize_parameters()
        
    def _initialize_parameters(self):
        """
        Initialize weights and biases for all layers.
        
        Returns:
        parameters : dictionary containing all weights and biases
        """
        parameters = {}
        
        for layer in range(1, self.num_layers):
            # He initialization for ReLU, Xavier for sigmoid
            if self.activation == 'relu':
                scale = np.sqrt(2.0 / self.layer_dimensions[layer - 1])
            else:
                scale = np.sqrt(1.0 / self.layer_dimensions[layer - 1])
            
            parameters[f'W{layer}'] = np.random.randn(
                self.layer_dimensions[layer],
                self.layer_dimensions[layer - 1]
            ) * scale
            
            parameters[f'b{layer}'] = np.zeros((self.layer_dimensions[layer], 1))
        
        return parameters
    
    def _activation_forward(self, Z, activation_type):
        """
        Apply activation function.
        
        Parameters:
        Z : numpy array
            Pre-activation values
        activation_type : str
                         Type of activation ('relu', 'sigmoid')
        
        Returns:
        A : numpy array
            Post-activation values
        """
        if activation_type == 'relu':
            return relu(Z)
        elif activation_type == 'sigmoid':
            return sigmoid(Z)
        else:
            raise ValueError(f"Unknown activation: {activation_type}")
    
    def _activation_backward(self, dA, Z, activation_type):
        """
        Compute gradient of activation function.
        
        Parameters:
        dA : numpy array
             Gradient of cost with respect to activation
        Z : numpy array
            Pre-activation values
        activation_type : str
                         Type of activation
        
        Returns:
        dZ : numpy array
             Gradient of cost with respect to pre-activation
        """
        if activation_type == 'relu':
            return dA * relu_derivative(Z)
        elif activation_type == 'sigmoid':
            return dA * sigmoid_derivative(Z)
        else:
            raise ValueError(f"Unknown activation: {activation_type}")
    
    def forward_propagation(self, X):
        """
        Perform forward propagation through all layers.
        
        Parameters:
        X : numpy array of shape (n_features, n_examples)
            Input data
        
        Returns:
        AL : numpy array
             Final layer activation (predictions)
        caches : list of dictionaries
                Cached values needed for backpropagation
        """
        caches = []
        A = X
        
        # Forward through all layers except the last
        for layer in range(1, self.num_layers - 1):
            A_prev = A
            W = self.parameters[f'W{layer}']
            b = self.parameters[f'b{layer}']
            
            Z = np.dot(W, A_prev) + b
            A = self._activation_forward(Z, self.activation)
            
            cache = {
                'A_prev': A_prev,
                'Z': Z,
                'W': W,
                'b': b
            }
            caches.append(cache)
        
        # Forward through the output layer (always sigmoid for binary classification)
        W = self.parameters[f'W{self.num_layers - 1}']
        b = self.parameters[f'b{self.num_layers - 1}']
        Z = np.dot(W, A) + b
        AL = sigmoid(Z)
        
        cache = {
            'A_prev': A,
            'Z': Z,
            'W': W,
            'b': b
        }
        caches.append(cache)
        
        return AL, caches
    
    def compute_cost(self, AL, Y):
        """
        Compute the binary cross-entropy cost.
        
        Parameters:
        AL : numpy array
             Network predictions
        Y : numpy array
            True labels
        
        Returns:
        cost : float
        """
        m = Y.shape[1]
        epsilon = 1e-8
        cost = -np.sum(Y * np.log(AL + epsilon) + (1 - Y) * np.log(1 - AL + epsilon)) / m
        return cost
    
    def backward_propagation(self, AL, Y, caches):
        """
        Perform backward propagation through all layers.
        
        Parameters:
        AL : numpy array
             Final layer activation
        Y : numpy array
            True labels
        caches : list of dictionaries
                Cached values from forward propagation
        
        Returns:
        gradients : dictionary containing all gradients
        """
        gradients = {}
        m = Y.shape[1]
        L = self.num_layers - 1
        
        # Initialize backpropagation from output layer
        dAL = -(np.divide(Y, AL + 1e-8) - np.divide(1 - Y, 1 - AL + 1e-8))
        
        # Output layer gradients (sigmoid activation)
        current_cache = caches[L - 1]
        dZ = dAL * sigmoid_derivative(current_cache['Z'])
        gradients[f'dW{L}'] = np.dot(dZ, current_cache['A_prev'].T) / m
        gradients[f'db{L}'] = np.sum(dZ, axis=1, keepdims=True) / m
        dA_prev = np.dot(current_cache['W'].T, dZ)
        
        # Backpropagate through hidden layers
        for layer in reversed(range(L - 1)):
            current_cache = caches[layer]
            dZ = self._activation_backward(dA_prev, current_cache['Z'], self.activation)
            gradients[f'dW{layer + 1}'] = np.dot(dZ, current_cache['A_prev'].T) / m
            gradients[f'db{layer + 1}'] = np.sum(dZ, axis=1, keepdims=True) / m
            dA_prev = np.dot(current_cache['W'].T, dZ)
        
        return gradients
    
    def update_parameters(self, gradients, learning_rate):
        """
        Update parameters using gradient descent.
        
        Parameters:
        gradients : dictionary containing all gradients
        learning_rate : float
        """
        for layer in range(1, self.num_layers):
            self.parameters[f'W{layer}'] -= learning_rate * gradients[f'dW{layer}']
            self.parameters[f'b{layer}'] -= learning_rate * gradients[f'db{layer}']
    
    def train(self, X, Y, learning_rate=0.01, num_iterations=1000, print_cost=True):
        """
        Train the neural network.
        
        Parameters:
        X : numpy array
            Training data
        Y : numpy array
            Training labels
        learning_rate : float
        num_iterations : int
        print_cost : bool
        
        Returns:
        costs : list of costs during training
        """
        costs = []
        
        for iteration in range(num_iterations):
            # Forward propagation
            AL, caches = self.forward_propagation(X)
            
            # Compute cost
            cost = self.compute_cost(AL, Y)
            costs.append(cost)
            
            # Backward propagation
            gradients = self.backward_propagation(AL, Y, caches)
            
            # Update parameters
            self.update_parameters(gradients, learning_rate)
            
            # Print cost
            if print_cost and iteration % 100 == 0:
                print(f"Cost after iteration {iteration}: {cost:.6f}")
        
        return costs
    
    def predict(self, X):
        """
        Make predictions on new data.
        
        Parameters:
        X : numpy array
            Input data
        
        Returns:
        predictions : numpy array
                     Binary predictions
        """
        AL, _ = self.forward_propagation(X)
        predictions = (AL > 0.5).astype(int)
        return predictions

Let us test our deep neural network on a more complex dataset.

def generate_complex_dataset(n_samples=1000):
    """
    Generate a more complex dataset with non-linear decision boundary.
    
    This creates a dataset where classes are arranged in concentric circles,
    which requires a non-linear classifier.
    
    Parameters:
    n_samples : int
    
    Returns:
    X : numpy array of shape (2, n_samples)
    Y : numpy array of shape (1, n_samples)
    """
    np.random.seed(42)
    
    # Generate points
    radius = np.random.rand(n_samples)
    angle = 2 * np.pi * np.random.rand(n_samples)
    
    # Class 0: inner circle
    mask_class0 = radius < 0.5
    X_class0 = np.vstack([
        radius[mask_class0] * np.cos(angle[mask_class0]),
        radius[mask_class0] * np.sin(angle[mask_class0])
    ])
    Y_class0 = np.zeros((1, X_class0.shape[1]))
    
    # Class 1: outer ring
    mask_class1 = radius >= 0.5
    X_class1 = np.vstack([
        radius[mask_class1] * np.cos(angle[mask_class1]),
        radius[mask_class1] * np.sin(angle[mask_class1])
    ])
    Y_class1 = np.ones((1, X_class1.shape[1]))
    
    # Combine
    X = np.concatenate([X_class0, X_class1], axis=1)
    Y = np.concatenate([Y_class0, Y_class1], axis=1)
    
    # Add some noise
    X += np.random.randn(*X.shape) * 0.1
    
    return X, Y


# Generate complex dataset
X_complex, Y_complex = generate_complex_dataset(n_samples=1000)

# Create and train a deeper network
print("\nTraining deep neural network on complex dataset...")
deep_net = DeepNeuralNetwork(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_deep = deep_net.train(X_complex, Y_complex, learning_rate=0.5, num_iterations=2000, print_cost=True)

# Evaluate
predictions_deep = deep_net.predict(X_complex)
accuracy_deep = compute_accuracy(predictions_deep, Y_complex)
print(f"\nDeep network accuracy: {accuracy_deep:.2f}%")

INTRODUCING MINI-BATCH GRADIENT DESCENT

So far, we have been using batch gradient descent, where we compute gradients using all training examples at once. This works well for small datasets but becomes impractical for large datasets because it requires a lot of memory and computation per iteration.

Mini-batch gradient descent is a compromise. We split the training data into small batches and update parameters after processing each batch. This allows us to make more frequent updates and can lead to faster convergence.

The benefits of mini-batch gradient descent include faster training, better memory efficiency, and the ability to leverage vectorization while still making frequent updates.

def create_mini_batches(X, Y, batch_size):
    """
    Split the dataset into mini-batches.
    
    Parameters:
    X : numpy array of shape (n_features, n_examples)
        Training data
    Y : numpy array of shape (n_output, n_examples)
        Training labels
    batch_size : int
                Size of each mini-batch
    
    Returns:
    mini_batches : list of tuples (mini_batch_X, mini_batch_Y)
    """
    m = X.shape[1]
    mini_batches = []
    
    # Shuffle the data
    permutation = np.random.permutation(m)
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation]
    
    # Partition into mini-batches
    num_complete_batches = m // batch_size
    
    for k in range(num_complete_batches):
        mini_batch_X = shuffled_X[:, k * batch_size:(k + 1) * batch_size]
        mini_batch_Y = shuffled_Y[:, k * batch_size:(k + 1) * batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handle the remaining examples (if any)
    if m % batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_batches * batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_batches * batch_size:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches
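Before plugging this into the network, the batching arithmetic is easy to sanity-check in isolation. The sketch below mirrors the shuffle-then-slice scheme of create_mini_batches using illustrative sizes (100 examples, batch size 32):

```python
import numpy as np

np.random.seed(0)
m, batch_size = 100, 32   # illustrative sizes

# Shuffle indices, then slice into batches -- the same scheme used above
permutation = np.random.permutation(m)
num_batches = (m + batch_size - 1) // batch_size
batches = [permutation[k * batch_size:(k + 1) * batch_size]
           for k in range(num_batches)]

print(len(batches))       # 4: three full batches of 32 plus a remainder of 4
print(len(batches[-1]))   # 4
```

Note that the last batch is smaller than the rest whenever the batch size does not divide the number of examples evenly, which is why create_mini_batches handles the remainder separately.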

Now let us extend our DeepNeuralNetwork class with a subclass that supports mini-batch training.

class DeepNeuralNetworkWithMiniBatch(DeepNeuralNetwork):
    """
    Deep neural network with mini-batch gradient descent support.
    
    This extends the basic deep neural network to support training
    with mini-batches instead of using the entire dataset at once.
    """
    
    def train_with_mini_batches(self, X, Y, learning_rate=0.01, num_epochs=100, 
                                batch_size=64, print_cost=True):
        """
        Train the network using mini-batch gradient descent.
        
        An epoch is one complete pass through the training data.
        In each epoch, we process multiple mini-batches.
        
        Parameters:
        X : numpy array
            Training data
        Y : numpy array
            Training labels
        learning_rate : float
        num_epochs : int
                    Number of complete passes through the data
        batch_size : int
                    Size of each mini-batch
        print_cost : bool
        
        Returns:
        costs : list of average costs per epoch
        """
        costs = []
        m = X.shape[1]
        
        for epoch in range(num_epochs):
            epoch_cost = 0
            
            # Create mini-batches for this epoch
            mini_batches = create_mini_batches(X, Y, batch_size)
            num_batches = len(mini_batches)
            
            for mini_batch in mini_batches:
                mini_batch_X, mini_batch_Y = mini_batch
                
                # Forward propagation
                AL, caches = self.forward_propagation(mini_batch_X)
                
                # Compute cost
                batch_cost = self.compute_cost(AL, mini_batch_Y)
                epoch_cost += batch_cost
                
                # Backward propagation
                gradients = self.backward_propagation(AL, mini_batch_Y, caches)
                
                # Update parameters
                self.update_parameters(gradients, learning_rate)
            
            # Average cost for this epoch
            avg_cost = epoch_cost / num_batches
            costs.append(avg_cost)
            
            if print_cost and epoch % 10 == 0:
                print(f"Cost after epoch {epoch}: {avg_cost:.6f}")
        
        return costs

Let us test mini-batch training.

# Create network with mini-batch support
print("\nTraining with mini-batch gradient descent...")
mini_batch_net = DeepNeuralNetworkWithMiniBatch(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_mini_batch = mini_batch_net.train_with_mini_batches(
    X_complex, 
    Y_complex, 
    learning_rate=0.5, 
    num_epochs=200,
    batch_size=32,
    print_cost=True
)

# Evaluate
predictions_mini = mini_batch_net.predict(X_complex)
accuracy_mini = compute_accuracy(predictions_mini, Y_complex)
print(f"\nMini-batch network accuracy: {accuracy_mini:.2f}%")

IMPLEMENTING MOMENTUM OPTIMIZATION

Gradient descent can be slow, especially when the cost function has regions with different curvatures. Momentum is an optimization technique that helps accelerate gradient descent by accumulating a velocity vector in directions of persistent gradient.

Think of momentum like a ball rolling down a hill. The ball builds up speed (momentum) as it rolls, allowing it to move faster through flat regions and smooth out oscillations in steep regions.

The momentum update rule is:

velocity = beta * velocity + (1 - beta) * gradient
parameter = parameter - learning_rate * velocity

The beta parameter (typically 0.9) controls how much of the previous velocity to retain.
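Before wiring momentum into the network class, the update rule can be watched in isolation. A minimal one-dimensional sketch on f(x) = x**2 (whose gradient is 2x), with illustrative values for beta, the learning rate, and the starting point:

```python
# One-dimensional sketch of the momentum update on f(x) = x**2 (gradient 2x).
# beta, learning rate, and starting point are all illustrative choices.
beta, learning_rate = 0.9, 0.1
x, velocity = 5.0, 0.0

for _ in range(200):
    gradient = 2 * x
    velocity = beta * velocity + (1 - beta) * gradient
    x -= learning_rate * velocity

print(abs(x))   # x has been driven very close to the minimum at 0
```

Because the velocity is an exponential moving average of gradients, successive steps in a consistent direction reinforce each other, while oscillating gradient components partially cancel.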

class DeepNeuralNetworkWithMomentum(DeepNeuralNetworkWithMiniBatch):
    """
    Deep neural network with momentum optimization.
    
    Momentum helps accelerate training by accumulating gradients
    in consistent directions.
    """
    
    def __init__(self, layer_dimensions, activation='relu'):
        """
        Initialize network with momentum support.
        """
        super().__init__(layer_dimensions, activation)
        self.velocities = self._initialize_velocities()
    
    def _initialize_velocities(self):
        """
        Initialize velocity vectors for momentum.
        
        Returns:
        velocities : dictionary containing velocity for each parameter
        """
        velocities = {}
        
        for layer in range(1, self.num_layers):
            velocities[f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
            velocities[f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
        
        return velocities
    
    def update_parameters_with_momentum(self, gradients, learning_rate, beta=0.9):
        """
        Update parameters using momentum.
        
        Parameters:
        gradients : dictionary containing gradients
        learning_rate : float
        beta : float
              Momentum coefficient (typically 0.9)
        """
        for layer in range(1, self.num_layers):
            # Update velocities
            self.velocities[f'dW{layer}'] = (
                beta * self.velocities[f'dW{layer}'] + 
                (1 - beta) * gradients[f'dW{layer}']
            )
            self.velocities[f'db{layer}'] = (
                beta * self.velocities[f'db{layer}'] + 
                (1 - beta) * gradients[f'db{layer}']
            )
            
            # Update parameters using velocities
            self.parameters[f'W{layer}'] -= learning_rate * self.velocities[f'dW{layer}']
            self.parameters[f'b{layer}'] -= learning_rate * self.velocities[f'db{layer}']
    
    def train_with_momentum(self, X, Y, learning_rate=0.01, num_epochs=100,
                           batch_size=64, beta=0.9, print_cost=True):
        """
        Train network using mini-batch gradient descent with momentum.
        
        Parameters:
        X : numpy array
        Y : numpy array
        learning_rate : float
        num_epochs : int
        batch_size : int
        beta : float
              Momentum coefficient
        print_cost : bool
        
        Returns:
        costs : list of costs
        """
        costs = []
        
        for epoch in range(num_epochs):
            epoch_cost = 0
            mini_batches = create_mini_batches(X, Y, batch_size)
            num_batches = len(mini_batches)
            
            for mini_batch in mini_batches:
                mini_batch_X, mini_batch_Y = mini_batch
                
                # Forward propagation
                AL, caches = self.forward_propagation(mini_batch_X)
                
                # Compute cost
                batch_cost = self.compute_cost(AL, mini_batch_Y)
                epoch_cost += batch_cost
                
                # Backward propagation
                gradients = self.backward_propagation(AL, mini_batch_Y, caches)
                
                # Update with momentum
                self.update_parameters_with_momentum(gradients, learning_rate, beta)
            
            avg_cost = epoch_cost / num_batches
            costs.append(avg_cost)
            
            if print_cost and epoch % 10 == 0:
                print(f"Cost after epoch {epoch}: {avg_cost:.6f}")
        
        return costs

IMPLEMENTING ADAM OPTIMIZATION

Adam (Adaptive Moment Estimation) is one of the most popular optimization algorithms in deep learning. It combines ideas from momentum and another technique called RMSprop. Adam adapts the learning rate for each parameter individually, which often leads to faster convergence.

Adam maintains two moving averages for each parameter. The first moment estimate is similar to momentum, tracking the average of gradients. The second moment estimate tracks the average of squared gradients, which helps adapt the learning rate.

The Adam update rules are:

first_moment = beta1 * first_moment + (1 - beta1) * gradient
second_moment = beta2 * second_moment + (1 - beta2) * gradient_squared

Then we apply bias correction and update:

first_moment_corrected = first_moment / (1 - beta1^t)
second_moment_corrected = second_moment / (1 - beta2^t)
parameter = parameter - learning_rate * first_moment_corrected / (sqrt(second_moment_corrected) + epsilon)
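The effect of bias correction is easiest to see with a single constant gradient. A minimal single-parameter sketch with illustrative values:

```python
import math

# Single-parameter sketch of Adam's bias correction (all values illustrative).
beta1, beta2, epsilon, learning_rate = 0.9, 0.999, 1e-8, 0.1
grad = 1.0                                   # pretend we keep seeing the same gradient
v = beta1 * 0.0 + (1 - beta1) * grad         # first moment after one step: 0.1
s = beta2 * 0.0 + (1 - beta2) * grad ** 2    # second moment after one step: 0.001

t = 1
v_hat = v / (1 - beta1 ** t)   # 0.1 / 0.1 = 1.0
s_hat = s / (1 - beta2 ** t)   # 0.001 / 0.001 = 1.0

step = learning_rate * v_hat / (math.sqrt(s_hat) + epsilon)
print(step)   # approximately 0.1: the corrected step matches the learning rate
```

Without the correction, the raw moments start near zero (they were initialized to zero), and the first updates would be far smaller than intended.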

class DeepNeuralNetworkWithAdam(DeepNeuralNetworkWithMiniBatch):
    """
    Deep neural network with Adam optimization.
    
    Adam is an advanced optimizer that adapts learning rates
    for each parameter individually, often leading to faster
    and more stable training.
    """
    
    def __init__(self, layer_dimensions, activation='relu'):
        """
        Initialize network with Adam optimizer support.
        """
        super().__init__(layer_dimensions, activation)
        self.adam_params = self._initialize_adam()
    
    def _initialize_adam(self):
        """
        Initialize Adam optimizer parameters.
        
        Returns:
        adam_params : dictionary containing first and second moment estimates
        """
        adam_params = {
            'v': {},  # First moment (momentum)
            's': {},  # Second moment (RMSprop)
            't': 0    # Time step
        }
        
        for layer in range(1, self.num_layers):
            adam_params['v'][f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
            adam_params['v'][f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
            adam_params['s'][f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
            adam_params['s'][f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
        
        return adam_params
    
    def update_parameters_with_adam(self, gradients, learning_rate, 
                                   beta1=0.9, beta2=0.999, epsilon=1e-8):
        """
        Update parameters using Adam optimization.
        
        Parameters:
        gradients : dictionary containing gradients
        learning_rate : float
        beta1 : float
               Exponential decay rate for first moment (typically 0.9)
        beta2 : float
               Exponential decay rate for second moment (typically 0.999)
        epsilon : float
                 Small constant to prevent division by zero
        """
        # Increment time step
        self.adam_params['t'] += 1
        t = self.adam_params['t']
        
        for layer in range(1, self.num_layers):
            # Update first moment (momentum)
            self.adam_params['v'][f'dW{layer}'] = (
                beta1 * self.adam_params['v'][f'dW{layer}'] +
                (1 - beta1) * gradients[f'dW{layer}']
            )
            self.adam_params['v'][f'db{layer}'] = (
                beta1 * self.adam_params['v'][f'db{layer}'] +
                (1 - beta1) * gradients[f'db{layer}']
            )
            
            # Update second moment (RMSprop)
            self.adam_params['s'][f'dW{layer}'] = (
                beta2 * self.adam_params['s'][f'dW{layer}'] +
                (1 - beta2) * np.square(gradients[f'dW{layer}'])
            )
            self.adam_params['s'][f'db{layer}'] = (
                beta2 * self.adam_params['s'][f'db{layer}'] +
                (1 - beta2) * np.square(gradients[f'db{layer}'])
            )
            
            # Bias correction for first moment
            v_corrected_W = self.adam_params['v'][f'dW{layer}'] / (1 - beta1**t)
            v_corrected_b = self.adam_params['v'][f'db{layer}'] / (1 - beta1**t)
            
            # Bias correction for second moment
            s_corrected_W = self.adam_params['s'][f'dW{layer}'] / (1 - beta2**t)
            s_corrected_b = self.adam_params['s'][f'db{layer}'] / (1 - beta2**t)
            
            # Update parameters
            self.parameters[f'W{layer}'] -= (
                learning_rate * v_corrected_W / (np.sqrt(s_corrected_W) + epsilon)
            )
            self.parameters[f'b{layer}'] -= (
                learning_rate * v_corrected_b / (np.sqrt(s_corrected_b) + epsilon)
            )
    
    def train_with_adam(self, X, Y, learning_rate=0.001, num_epochs=100,
                       batch_size=64, beta1=0.9, beta2=0.999, print_cost=True):
        """
        Train network using Adam optimization.
        
        Parameters:
        X : numpy array
        Y : numpy array
        learning_rate : float
                       Note: Adam typically works well with smaller learning rates
        num_epochs : int
        batch_size : int
        beta1 : float
        beta2 : float
        print_cost : bool
        
        Returns:
        costs : list of costs
        """
        costs = []
        
        for epoch in range(num_epochs):
            epoch_cost = 0
            mini_batches = create_mini_batches(X, Y, batch_size)
            num_batches = len(mini_batches)
            
            for mini_batch in mini_batches:
                mini_batch_X, mini_batch_Y = mini_batch
                
                # Forward propagation
                AL, caches = self.forward_propagation(mini_batch_X)
                
                # Compute cost
                batch_cost = self.compute_cost(AL, mini_batch_Y)
                epoch_cost += batch_cost
                
                # Backward propagation
                gradients = self.backward_propagation(AL, mini_batch_Y, caches)
                
                # Update with Adam
                self.update_parameters_with_adam(gradients, learning_rate, beta1, beta2)
            
            avg_cost = epoch_cost / num_batches
            costs.append(avg_cost)
            
            if print_cost and epoch % 10 == 0:
                print(f"Cost after epoch {epoch}: {avg_cost:.6f}")
        
        return costs

Let us compare the performance of different optimizers.

print("\nComparing different optimizers...")

# Standard gradient descent
print("\n1. Standard Gradient Descent:")
gd_net = DeepNeuralNetworkWithMiniBatch(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_gd = gd_net.train_with_mini_batches(
    X_complex, Y_complex, 
    learning_rate=0.5, 
    num_epochs=100, 
    batch_size=32,
    print_cost=False
)

# Momentum
print("\n2. Gradient Descent with Momentum:")
momentum_net = DeepNeuralNetworkWithMomentum(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_momentum = momentum_net.train_with_momentum(
    X_complex, Y_complex,
    learning_rate=0.5,
    num_epochs=100,
    batch_size=32,
    beta=0.9,
    print_cost=False
)

# Adam
print("\n3. Adam Optimizer:")
adam_net = DeepNeuralNetworkWithAdam(layer_dimensions=[2, 16, 8, 1], activation='relu')
costs_adam = adam_net.train_with_adam(
    X_complex, Y_complex,
    learning_rate=0.01,
    num_epochs=100,
    batch_size=32,
    print_cost=False
)

# Compare final accuracies
pred_gd = gd_net.predict(X_complex)
pred_momentum = momentum_net.predict(X_complex)
pred_adam = adam_net.predict(X_complex)

print(f"\nFinal Accuracies:")
print(f"Standard GD: {compute_accuracy(pred_gd, Y_complex):.2f}%")
print(f"Momentum: {compute_accuracy(pred_momentum, Y_complex):.2f}%")
print(f"Adam: {compute_accuracy(pred_adam, Y_complex):.2f}%")

IMPLEMENTING EARLY STOPPING

Early stopping is a regularization technique that prevents overfitting. The idea is simple: we monitor the performance on a validation set during training, and stop training when the validation performance stops improving.

Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, and performs poorly on new data. Early stopping helps by stopping training before the model has a chance to overfit.

To implement early stopping, we need to split our data into training and validation sets. We train on the training set and evaluate on the validation set after each epoch. If the validation cost does not improve for a certain number of epochs (called patience), we stop training.
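The patience bookkeeping described above can be sketched on its own, using a made-up sequence of validation costs:

```python
# Illustrative validation costs: improvement for three epochs, then a plateau
val_history = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
patience = 3

best_cost = float('inf')
epochs_without_improvement = 0
stopped_at = None

for epoch, cost in enumerate(val_history):
    if cost < best_cost:
        best_cost = cost
        epochs_without_improvement = 0   # reset the counter on any improvement
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        stopped_at = epoch
        break

print(best_cost, stopped_at)   # 0.6 5: we stop three epochs after the best cost
```

The full implementation below additionally snapshots the parameters at the best epoch so they can be restored when training stops.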

def split_train_validation(X, Y, validation_split=0.2):
    """
    Split data into training and validation sets.
    
    Parameters:
    X : numpy array
        Features
    Y : numpy array
        Labels
    validation_split : float
                      Fraction of data to use for validation
    
    Returns:
    X_train, Y_train, X_val, Y_val : numpy arrays
    """
    m = X.shape[1]
    
    # Shuffle data
    permutation = np.random.permutation(m)
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]
    
    # Split
    split_index = int(m * (1 - validation_split))
    X_train = X_shuffled[:, :split_index]
    Y_train = Y_shuffled[:, :split_index]
    X_val = X_shuffled[:, split_index:]
    Y_val = Y_shuffled[:, split_index:]
    
    return X_train, Y_train, X_val, Y_val
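As a quick check of the split sizes, the same shuffle-then-slice scheme can be run on made-up data with an 80/20 split (shapes here are illustrative):

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(2, 1000)                  # illustrative features
Y = np.random.randint(0, 2, size=(1, 1000))   # illustrative labels

# Same shuffle-then-slice scheme as split_train_validation above
permutation = np.random.permutation(X.shape[1])
split_index = int(X.shape[1] * 0.8)
X_train, X_val = X[:, permutation[:split_index]], X[:, permutation[split_index:]]
Y_train, Y_val = Y[:, permutation[:split_index]], Y[:, permutation[split_index:]]

print(X_train.shape, X_val.shape)   # (2, 800) (2, 200)
```

Shuffling before splitting matters: if the data were ordered by class, a plain slice would put all of one class in the validation set.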


class DeepNeuralNetworkWithEarlyStopping(DeepNeuralNetworkWithAdam):
    """
    Deep neural network with early stopping support.
    
    Early stopping monitors validation performance and stops
    training when it stops improving, preventing overfitting.
    """
    
    def train_with_early_stopping(self, X_train, Y_train, X_val, Y_val,
                                  learning_rate=0.001, num_epochs=1000,
                                  batch_size=64, patience=10, print_cost=True):
        """
        Train network with early stopping.
        
        Parameters:
        X_train, Y_train : numpy arrays
                          Training data
        X_val, Y_val : numpy arrays
                      Validation data
        learning_rate : float
        num_epochs : int
                    Maximum number of epochs
        batch_size : int
        patience : int
                  Number of epochs to wait for improvement before stopping
        print_cost : bool
        
        Returns:
        train_costs : list of training costs
        val_costs : list of validation costs
        best_epoch : int
                    Epoch where best validation performance was achieved
        """
        train_costs = []
        val_costs = []
        best_val_cost = float('inf')
        best_parameters = None
        epochs_without_improvement = 0
        best_epoch = 0
        
        for epoch in range(num_epochs):
            # Training phase
            epoch_train_cost = 0
            mini_batches = create_mini_batches(X_train, Y_train, batch_size)
            num_batches = len(mini_batches)
            
            for mini_batch in mini_batches:
                mini_batch_X, mini_batch_Y = mini_batch
                
                AL, caches = self.forward_propagation(mini_batch_X)
                batch_cost = self.compute_cost(AL, mini_batch_Y)
                epoch_train_cost += batch_cost
                
                gradients = self.backward_propagation(AL, mini_batch_Y, caches)
                self.update_parameters_with_adam(gradients, learning_rate)
            
            avg_train_cost = epoch_train_cost / num_batches
            train_costs.append(avg_train_cost)
            
            # Validation phase
            AL_val, _ = self.forward_propagation(X_val)
            val_cost = self.compute_cost(AL_val, Y_val)
            val_costs.append(val_cost)
            
            # Check for improvement
            if val_cost < best_val_cost:
                best_val_cost = val_cost
                best_parameters = {key: value.copy() for key, value in self.parameters.items()}
                epochs_without_improvement = 0
                best_epoch = epoch
            else:
                epochs_without_improvement += 1
            
            if print_cost and epoch % 10 == 0:
                print(f"Epoch {epoch}: Train cost = {avg_train_cost:.6f}, Val cost = {val_cost:.6f}")
            
            # Early stopping check
            if epochs_without_improvement >= patience:
                print(f"\nEarly stopping triggered at epoch {epoch}")
                print(f"Best validation cost: {best_val_cost:.6f} at epoch {best_epoch}")
                break
        
        # Restore the best parameters seen during training. This also covers
        # the case where the loop ran to completion without triggering early
        # stopping, so we never return parameters worse than the best epoch.
        if best_parameters is not None:
            self.parameters = best_parameters
        
        return train_costs, val_costs, best_epoch

Let us test early stopping.

print("\nTesting early stopping...")

# Split data
X_train, Y_train, X_val, Y_val = split_train_validation(X_complex, Y_complex, validation_split=0.2)

# Create network with early stopping
early_stop_net = DeepNeuralNetworkWithEarlyStopping(
    layer_dimensions=[2, 16, 8, 1], 
    activation='relu'
)

# Train with early stopping
train_costs, val_costs, best_epoch = early_stop_net.train_with_early_stopping(
    X_train, Y_train, X_val, Y_val,
    learning_rate=0.01,
    num_epochs=500,
    batch_size=32,
    patience=20,
    print_cost=True
)

# Evaluate on validation set
pred_val = early_stop_net.predict(X_val)
val_accuracy = compute_accuracy(pred_val, Y_val)
print(f"\nValidation accuracy: {val_accuracy:.2f}%")

IMPLEMENTING REGULARIZATION

Regularization is another technique to prevent overfitting. It works by adding a penalty term to the cost function that discourages large weights. This encourages the network to learn simpler patterns that generalize better.

The most common form is L2 regularization (also called weight decay). The regularized cost becomes:

regularized_cost = original_cost + (lambda / (2 * m)) * sum_of_squared_weights

where lambda is the regularization parameter that controls the strength of regularization.
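As a quick numeric check, the penalty term can be computed by hand for a couple of small all-ones weight matrices (the weights, lambda, and m below are illustrative):

```python
import numpy as np

# Illustrative weights: two layers of all-ones matrices
weights = [np.ones((4, 2)), np.ones((1, 4))]
lambd, m = 0.1, 100

sum_of_squares = sum(np.sum(np.square(W)) for W in weights)   # 8 + 4 = 12
penalty = (lambd / (2 * m)) * sum_of_squares

print(penalty)   # 0.1 / 200 * 12 = 0.006
```

The penalty grows with the squared magnitude of every weight, so gradient descent on the regularized cost continually nudges weights toward zero; biases are conventionally left unregularized.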

class DeepNeuralNetworkWithRegularization(DeepNeuralNetworkWithEarlyStopping):
    """
    Deep neural network with L2 regularization.
    
    Regularization adds a penalty for large weights, which helps
    prevent overfitting and improves generalization.
    """
    
    def compute_cost_with_regularization(self, AL, Y, lambd):
        """
        Compute cost with L2 regularization.
        
        Parameters:
        AL : numpy array
             Network predictions
        Y : numpy array
            True labels
        lambd : float
               Regularization parameter
        
        Returns:
        cost : float
        """
        m = Y.shape[1]
        
        # Standard cross-entropy cost
        epsilon = 1e-8
        cross_entropy_cost = -np.sum(
            Y * np.log(AL + epsilon) + (1 - Y) * np.log(1 - AL + epsilon)
        ) / m
        
        # L2 regularization cost
        l2_cost = 0
        for layer in range(1, self.num_layers):
            W = self.parameters[f'W{layer}']
            l2_cost += np.sum(np.square(W))
        
        l2_cost = (lambd / (2 * m)) * l2_cost
        
        # Total cost
        cost = cross_entropy_cost + l2_cost
        
        return cost
    
    def backward_propagation_with_regularization(self, AL, Y, caches, lambd):
        """
        Backward propagation with L2 regularization.
        
        The gradients for weights include an additional term from regularization.
        
        Parameters:
        AL : numpy array
        Y : numpy array
        caches : list
        lambd : float
        
        Returns:
        gradients : dictionary
        """
        # Standard backpropagation
        gradients = self.backward_propagation(AL, Y, caches)
        
        # Add regularization term to weight gradients
        m = Y.shape[1]
        for layer in range(1, self.num_layers):
            W = self.parameters[f'W{layer}']
            gradients[f'dW{layer}'] += (lambd / m) * W
        
        return gradients
    
    def train_with_regularization(self, X_train, Y_train, X_val, Y_val,
                                 learning_rate=0.001, num_epochs=500,
                                 batch_size=64, lambd=0.01, patience=20,
                                 print_cost=True):
        """
        Train network with L2 regularization and early stopping.
        
        Parameters:
        X_train, Y_train : numpy arrays
        X_val, Y_val : numpy arrays
        learning_rate : float
        num_epochs : int
        batch_size : int
        lambd : float
               Regularization parameter
        patience : int
        print_cost : bool
        
        Returns:
        train_costs, val_costs, best_epoch
        """
        train_costs = []
        val_costs = []
        best_val_cost = float('inf')
        best_parameters = None
        epochs_without_improvement = 0
        best_epoch = 0
        
        for epoch in range(num_epochs):
            epoch_train_cost = 0
            mini_batches = create_mini_batches(X_train, Y_train, batch_size)
            num_batches = len(mini_batches)
            
            for mini_batch in mini_batches:
                mini_batch_X, mini_batch_Y = mini_batch
                
                AL, caches = self.forward_propagation(mini_batch_X)
                batch_cost = self.compute_cost_with_regularization(AL, mini_batch_Y, lambd)
                epoch_train_cost += batch_cost
                
                gradients = self.backward_propagation_with_regularization(
                    AL, mini_batch_Y, caches, lambd
                )
                self.update_parameters_with_adam(gradients, learning_rate)
            
            avg_train_cost = epoch_train_cost / num_batches
            train_costs.append(avg_train_cost)
            
            # Validation
            AL_val, _ = self.forward_propagation(X_val)
            val_cost = self.compute_cost_with_regularization(AL_val, Y_val, lambd)
            val_costs.append(val_cost)
            
            # Early stopping logic
            if val_cost < best_val_cost:
                best_val_cost = val_cost
                best_parameters = {key: value.copy() for key, value in self.parameters.items()}
                epochs_without_improvement = 0
                best_epoch = epoch
            else:
                epochs_without_improvement += 1
            
            if print_cost and epoch % 10 == 0:
                print(f"Epoch {epoch}: Train = {avg_train_cost:.6f}, Val = {val_cost:.6f}")
            
            if epochs_without_improvement >= patience:
                print(f"\nEarly stopping at epoch {epoch}")
                print(f"Best validation cost: {best_val_cost:.6f} at epoch {best_epoch}")
                break
        
        # Restore the best parameters seen during training (also covers the
        # case where early stopping never fired and the loop ran to the end)
        if best_parameters is not None:
            self.parameters = best_parameters
        
        return train_costs, val_costs, best_epoch

IMPLEMENTING DROPOUT REGULARIZATION

Dropout is another powerful regularization technique. During training, we randomly set a fraction of neurons to zero in each forward pass. This prevents neurons from co-adapting too much and forces the network to learn more robust features.

During testing, we use all neurons. In the classic formulation of dropout, their outputs are scaled by the keep probability at test time to compensate for the fact that more neurons are active than during training. The implementation below uses the more common "inverted dropout" variant: we divide the surviving activations by the keep probability during training instead, so that no scaling is needed at test time.

def dropout_forward(A, keep_prob):
    """
    Apply dropout to activations.
    
    Parameters:
    A : numpy array
        Activations from a layer
    keep_prob : float
               Probability of keeping each neuron (between 0 and 1)
    
    Returns:
    A_dropout : numpy array
               Activations after dropout
    mask : numpy array
          Binary mask indicating which neurons were kept
    """
    # Create random mask
    mask = np.random.rand(*A.shape) < keep_prob
    
    # Apply mask and scale
    A_dropout = A * mask / keep_prob
    
    return A_dropout, mask


def dropout_backward(dA, mask, keep_prob):
    """
    Backpropagate through dropout.
    
    Parameters:
    dA : numpy array
         Gradient of cost with respect to activations
    mask : numpy array
          Mask from forward pass
    keep_prob : float
    
    Returns:
    dA_dropout : numpy array
    """
    dA_dropout = dA * mask / keep_prob
    return dA_dropout
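A quick sanity check that inverted dropout (dividing by keep_prob during training, as in dropout_forward above) preserves the expected activation scale; the activation array and keep probability are illustrative:

```python
import numpy as np

np.random.seed(0)
A = np.ones((100, 1000))   # illustrative activations, all equal to 1
keep_prob = 0.8

mask = np.random.rand(*A.shape) < keep_prob   # same mask construction as dropout_forward
A_dropout = A * mask / keep_prob

# About 80% of units survive, each scaled up by 1/0.8, so the mean stays near 1
print(A_dropout.mean())
```

This is exactly why no compensation is needed at test time: the expected value of each activation is unchanged by the combination of masking and rescaling.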

PUTTING IT ALL TOGETHER: COMPLETE TRAINING PIPELINE

Now let us create a complete neural network class that incorporates all the features we have discussed: mini-batch training, Adam optimization, early stopping, L2 regularization, and dropout.

class CompleteNeuralNetwork:
    """
    A complete neural network implementation with all advanced features.
    
    Features:
    - Flexible architecture with arbitrary depth
    - Multiple activation functions
    - Mini-batch gradient descent
    - Adam optimization
    - L2 regularization
    - Dropout regularization
    - Early stopping
    """
    
    def __init__(self, layer_dimensions, activation='relu', dropout_rate=0.0):
        """
        Initialize the complete neural network.
        
        Parameters:
        layer_dimensions : list
        activation : str
        dropout_rate : float
                      Fraction of neurons to drop (0 means no dropout)
        """
        self.layer_dimensions = layer_dimensions
        self.num_layers = len(layer_dimensions)
        self.activation = activation
        self.dropout_rate = dropout_rate
        self.keep_prob = 1.0 - dropout_rate
        
        self.parameters = self._initialize_parameters()
        self.adam_params = self._initialize_adam()
    
    def _initialize_parameters(self):
        """Initialize weights and biases."""
        parameters = {}
        
        for layer in range(1, self.num_layers):
            if self.activation == 'relu':
                scale = np.sqrt(2.0 / self.layer_dimensions[layer - 1])
            else:
                scale = np.sqrt(1.0 / self.layer_dimensions[layer - 1])
            
            parameters[f'W{layer}'] = np.random.randn(
                self.layer_dimensions[layer],
                self.layer_dimensions[layer - 1]
            ) * scale
            
            parameters[f'b{layer}'] = np.zeros((self.layer_dimensions[layer], 1))
        
        return parameters
    
    def _initialize_adam(self):
        """Initialize Adam optimizer parameters."""
        adam_params = {'v': {}, 's': {}, 't': 0}
        
        for layer in range(1, self.num_layers):
            adam_params['v'][f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
            adam_params['v'][f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
            adam_params['s'][f'dW{layer}'] = np.zeros_like(self.parameters[f'W{layer}'])
            adam_params['s'][f'db{layer}'] = np.zeros_like(self.parameters[f'b{layer}'])
        
        return adam_params
    
    def forward_propagation(self, X, training=True):
        """
        Forward propagation with optional dropout.
        
        Parameters:
        X : numpy array
        training : bool
                  Whether we are in training mode (affects dropout)
        
        Returns:
        AL : numpy array
        caches : list
        dropout_masks : list (only if training and dropout_rate > 0)
        """
        caches = []
        dropout_masks = []
        A = X
        
        # Hidden layers
        for layer in range(1, self.num_layers - 1):
            A_prev = A
            W = self.parameters[f'W{layer}']
            b = self.parameters[f'b{layer}']
            
            Z = np.dot(W, A_prev) + b
            
            if self.activation == 'relu':
                A = relu(Z)
            else:
                A = sigmoid(Z)
            
            # Apply dropout during training
            if training and self.dropout_rate > 0:
                A, mask = dropout_forward(A, self.keep_prob)
                dropout_masks.append(mask)
            
            cache = {'A_prev': A_prev, 'Z': Z, 'W': W, 'b': b}
            caches.append(cache)
        
        # Output layer (no dropout)
        W = self.parameters[f'W{self.num_layers - 1}']
        b = self.parameters[f'b{self.num_layers - 1}']
        Z = np.dot(W, A) + b
        AL = sigmoid(Z)
        
        cache = {'A_prev': A, 'Z': Z, 'W': W, 'b': b}
        caches.append(cache)
        
        if training and self.dropout_rate > 0:
            return AL, caches, dropout_masks
        else:
            return AL, caches
    
    def compute_cost(self, AL, Y, lambd=0.0):
        """
        Compute cost with optional L2 regularization.
        
        Parameters:
        AL : numpy array
        Y : numpy array
        lambd : float
        
        Returns:
        cost : float
        """
        m = Y.shape[1]
        epsilon = 1e-8
        
        # Cross-entropy cost
        cross_entropy = -np.sum(
            Y * np.log(AL + epsilon) + (1 - Y) * np.log(1 - AL + epsilon)
        ) / m
        
        # L2 regularization
        l2_cost = 0
        if lambd > 0:
            for layer in range(1, self.num_layers):
                W = self.parameters[f'W{layer}']
                l2_cost += np.sum(np.square(W))
            l2_cost = (lambd / (2 * m)) * l2_cost
        
        cost = cross_entropy + l2_cost
        return cost
    
    def backward_propagation(self, AL, Y, caches, dropout_masks=None, lambd=0.0):
        """
        Backward propagation with optional dropout and regularization.
        
        Parameters:
        AL : numpy array
        Y : numpy array
        caches : list
        dropout_masks : list or None
        lambd : float
        
        Returns:
        gradients : dictionary
        """
        gradients = {}
        m = Y.shape[1]
        L = self.num_layers - 1
        
        # Output layer
        dAL = -(np.divide(Y, AL + 1e-8) - np.divide(1 - Y, 1 - AL + 1e-8))
        current_cache = caches[L - 1]
        dZ = dAL * sigmoid_derivative(current_cache['Z'])
        
        gradients[f'dW{L}'] = np.dot(dZ, current_cache['A_prev'].T) / m
        gradients[f'db{L}'] = np.sum(dZ, axis=1, keepdims=True) / m
        
        # Add regularization
        if lambd > 0:
            gradients[f'dW{L}'] += (lambd / m) * current_cache['W']
        
        dA_prev = np.dot(current_cache['W'].T, dZ)
        
        # Hidden layers
        for layer in reversed(range(L - 1)):
            current_cache = caches[layer]
            
            # Apply dropout mask if available
            if dropout_masks is not None and len(dropout_masks) > layer:
                dA_prev = dropout_backward(dA_prev, dropout_masks[layer], self.keep_prob)
            
            # Compute gradients
            if self.activation == 'relu':
                dZ = dA_prev * relu_derivative(current_cache['Z'])
            else:
                dZ = dA_prev * sigmoid_derivative(current_cache['Z'])
            
            gradients[f'dW{layer + 1}'] = np.dot(dZ, current_cache['A_prev'].T) / m
            gradients[f'db{layer + 1}'] = np.sum(dZ, axis=1, keepdims=True) / m
            
            # Add regularization
            if lambd > 0:
                gradients[f'dW{layer + 1}'] += (lambd / m) * current_cache['W']
            
            dA_prev = np.dot(current_cache['W'].T, dZ)
        
        return gradients
    
    def update_parameters_adam(self, gradients, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8):
        """Update parameters using Adam optimizer."""
        self.adam_params['t'] += 1
        t = self.adam_params['t']
        
        for layer in range(1, self.num_layers):
            # Update moments
            self.adam_params['v'][f'dW{layer}'] = (
                beta1 * self.adam_params['v'][f'dW{layer}'] +
                (1 - beta1) * gradients[f'dW{layer}']
            )
            self.adam_params['v'][f'db{layer}'] = (
                beta1 * self.adam_params['v'][f'db{layer}'] +
                (1 - beta1) * gradients[f'db{layer}']
            )
            
            self.adam_params['s'][f'dW{layer}'] = (
                beta2 * self.adam_params['s'][f'dW{layer}'] +
                (1 - beta2) * np.square(gradients[f'dW{layer}'])
            )
            self.adam_params['s'][f'db{layer}'] = (
                beta2 * self.adam_params['s'][f'db{layer}'] +
                (1 - beta2) * np.square(gradients[f'db{layer}'])
            )
            
            # Bias correction
            v_corrected_W = self.adam_params['v'][f'dW{layer}'] / (1 - beta1**t)
            v_corrected_b = self.adam_params['v'][f'db{layer}'] / (1 - beta1**t)
            s_corrected_W = self.adam_params['s'][f'dW{layer}'] / (1 - beta2**t)
            s_corrected_b = self.adam_params['s'][f'db{layer}'] / (1 - beta2**t)
            
            # Update parameters
            self.parameters[f'W{layer}'] -= (
                learning_rate * v_corrected_W / (np.sqrt(s_corrected_W) + epsilon)
            )
            self.parameters[f'b{layer}'] -= (
                learning_rate * v_corrected_b / (np.sqrt(s_corrected_b) + epsilon)
            )
    
    def train(self, X_train, Y_train, X_val, Y_val, learning_rate=0.001,
             num_epochs=500, batch_size=64, lambd=0.0, patience=20, print_cost=True):
        """
        Complete training pipeline with all features.
        
        Parameters:
        X_train, Y_train : numpy arrays
        X_val, Y_val : numpy arrays
        learning_rate : float
        num_epochs : int
        batch_size : int
        lambd : float
        patience : int
        print_cost : bool
        
        Returns:
        history : dictionary containing training history
        """
        train_costs = []
        val_costs = []
        train_accuracies = []
        val_accuracies = []
        best_val_cost = float('inf')
        best_parameters = None
        epochs_without_improvement = 0
        best_epoch = 0
        
        for epoch in range(num_epochs):
            epoch_train_cost = 0
            mini_batches = create_mini_batches(X_train, Y_train, batch_size)
            num_batches = len(mini_batches)
            
            for mini_batch in mini_batches:
                mini_batch_X, mini_batch_Y = mini_batch
                
                # Forward propagation with dropout
                if self.dropout_rate > 0:
                    AL, caches, dropout_masks = self.forward_propagation(mini_batch_X, training=True)
                else:
                    AL, caches = self.forward_propagation(mini_batch_X, training=True)
                    dropout_masks = None
                
                # Compute cost
                batch_cost = self.compute_cost(AL, mini_batch_Y, lambd)
                epoch_train_cost += batch_cost
                
                # Backward propagation
                gradients = self.backward_propagation(AL, mini_batch_Y, caches, dropout_masks, lambd)
                
                # Update parameters
                self.update_parameters_adam(gradients, learning_rate)
            
            # Average training cost
            avg_train_cost = epoch_train_cost / num_batches
            train_costs.append(avg_train_cost)
            
            # Training accuracy
            train_pred = self.predict(X_train)
            train_acc = compute_accuracy(train_pred, Y_train)
            train_accuracies.append(train_acc)
            
            # Validation cost and accuracy
            AL_val, _ = self.forward_propagation(X_val, training=False)
            val_cost = self.compute_cost(AL_val, Y_val, lambd)
            val_costs.append(val_cost)
            
            val_pred = self.predict(X_val)
            val_acc = compute_accuracy(val_pred, Y_val)
            val_accuracies.append(val_acc)
            
            # Early stopping check
            if val_cost < best_val_cost:
                best_val_cost = val_cost
                best_parameters = {key: value.copy() for key, value in self.parameters.items()}
                epochs_without_improvement = 0
                best_epoch = epoch
            else:
                epochs_without_improvement += 1
            
            if print_cost and epoch % 10 == 0:
                print(f"Epoch {epoch}: Train Cost = {avg_train_cost:.6f}, Val Cost = {val_cost:.6f}, "
                      f"Train Acc = {train_acc:.2f}%, Val Acc = {val_acc:.2f}%")
            
            # Early stopping
            if epochs_without_improvement >= patience:
                print(f"\nEarly stopping triggered at epoch {epoch}")
                print(f"Best validation cost: {best_val_cost:.6f} at epoch {best_epoch}")
                self.parameters = best_parameters
                break
        
        # Restore the best parameters even if early stopping never triggered
        if best_parameters is not None:
            self.parameters = best_parameters
        
        history = {
            'train_costs': train_costs,
            'val_costs': val_costs,
            'train_accuracies': train_accuracies,
            'val_accuracies': val_accuracies,
            'best_epoch': best_epoch
        }
        
        return history
    
    def predict(self, X):
        """Make predictions."""
        AL, _ = self.forward_propagation(X, training=False)
        predictions = (AL > 0.5).astype(int)
        return predictions

FINAL EXAMPLE: TRAINING A COMPLETE NETWORK

Let us now use our complete neural network implementation on a real example, demonstrating all the features we have built.

print("\n" + "="*70)
print("FINAL DEMONSTRATION: COMPLETE NEURAL NETWORK")
print("="*70)

# Generate a larger, more complex dataset
X_final, Y_final = generate_complex_dataset(n_samples=2000)

# Split into train and validation
X_train_final, Y_train_final, X_val_final, Y_val_final = split_train_validation(
    X_final, Y_final, validation_split=0.2
)

print(f"\nDataset sizes:")
print(f"Training: {X_train_final.shape[1]} examples")
print(f"Validation: {X_val_final.shape[1]} examples")

# Create complete network with all features
print("\nCreating neural network with:")
print("- Architecture: [2, 32, 16, 8, 1]")
print("- Activation: ReLU")
print("- Dropout: 20%")
print("- L2 Regularization: lambda = 0.01")
print("- Optimizer: Adam")
print("- Early Stopping: patience = 30")

complete_net = CompleteNeuralNetwork(
    layer_dimensions=[2, 32, 16, 8, 1],
    activation='relu',
    dropout_rate=0.2
)

# Train the network
print("\nTraining network...")
history = complete_net.train(
    X_train_final, Y_train_final,
    X_val_final, Y_val_final,
    learning_rate=0.001,
    num_epochs=500,
    batch_size=32,
    lambd=0.01,
    patience=30,
    print_cost=True
)

# Final evaluation
print("\n" + "="*70)
print("FINAL RESULTS")
print("="*70)

train_pred_final = complete_net.predict(X_train_final)
val_pred_final = complete_net.predict(X_val_final)

train_acc_final = compute_accuracy(train_pred_final, Y_train_final)
val_acc_final = compute_accuracy(val_pred_final, Y_val_final)

print(f"\nFinal Training Accuracy: {train_acc_final:.2f}%")
print(f"Final Validation Accuracy: {val_acc_final:.2f}%")
print(f"Best Epoch: {history['best_epoch']}")
print(f"Total Epochs Trained: {len(history['train_costs'])}")

VISUALIZING TRAINING PROGRESS

It is important to visualize how our network learns over time. Let us create functions to plot the training history.

def plot_training_history(history):
    """
    Plot training and validation costs and accuracies.
    
    Parameters:
    history : dictionary containing training history
    """
    epochs = range(len(history['train_costs']))
    
    # Create figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot costs
    ax1.plot(epochs, history['train_costs'], label='Training Cost', linewidth=2)
    ax1.plot(epochs, history['val_costs'], label='Validation Cost', linewidth=2)
    ax1.axvline(x=history['best_epoch'], color='red', linestyle='--', 
                label=f'Best Epoch ({history["best_epoch"]})')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Cost')
    ax1.set_title('Training and Validation Cost')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot accuracies
    ax2.plot(epochs, history['train_accuracies'], label='Training Accuracy', linewidth=2)
    ax2.plot(epochs, history['val_accuracies'], label='Validation Accuracy', linewidth=2)
    ax2.axvline(x=history['best_epoch'], color='red', linestyle='--',
                label=f'Best Epoch ({history["best_epoch"]})')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy (%)')
    ax2.set_title('Training and Validation Accuracy')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()


# Visualize the training history
print("\nGenerating training history plots...")
plot_training_history(history)

UNDERSTANDING HYPERPARAMETERS

Hyperparameters are settings that we choose before training begins. They are not learned from the data but significantly affect how well the network learns. Let us discuss the key hyperparameters and how to choose them.

Learning rate is perhaps the most important hyperparameter. If it is too high, training will be unstable and may diverge. If it is too low, training will be very slow. A good starting point for the Adam optimizer is 0.001. Try values like 0.0001, 0.001, and 0.01, typically spaced on a logarithmic scale, and see which works best.

Batch size affects both training speed and generalization. Smaller batches (like 32 or 64) provide more frequent updates and can help escape local minima, but training is noisier. Larger batches (like 128 or 256) provide more stable gradients but require more memory. Common choices are 32, 64, 128, or 256.

The number of hidden layers and neurons per layer determines the network's capacity. More layers and neurons allow the network to learn more complex patterns, but also increase the risk of overfitting. Start with a moderate architecture and increase complexity if the network underfits.

Regularization strength (lambda) controls how much we penalize large weights. Higher values prevent overfitting more strongly but may cause underfitting. Typical values range from 0.0001 to 0.1. Start with 0.01 and adjust based on the gap between training and validation performance.

Dropout rate determines what fraction of neurons to randomly drop during training. Common values are 0.2 to 0.5. Higher dropout provides stronger regularization but may slow down training.

Early stopping patience determines how many epochs to wait for improvement before stopping. This depends on your dataset size and complexity. For small datasets, 10 to 20 epochs might be enough. For larger datasets, you might use 30 to 50.
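Because these hyperparameters interact, searching over several of them at once is often more effective than tuning one at a time. As a small standalone sketch (the specific ranges and the `sample_hyperparameters` helper are illustrative choices, not part of the network implementation above), random search samples the continuous values on a logarithmic scale and the discrete settings from small sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    """Draw one random hyperparameter configuration.

    Continuous values (learning rate, lambda) are sampled on a log
    scale; batch size and dropout rate come from small discrete sets.
    """
    return {
        'learning_rate': float(10 ** rng.uniform(-4, -2)),
        'lambd': float(10 ** rng.uniform(-4, -1)),
        'batch_size': int(rng.choice([32, 64, 128, 256])),
        'dropout_rate': float(rng.choice([0.0, 0.2, 0.3, 0.5])),
    }

# In practice, you would train one model per configuration and keep
# the one with the lowest validation cost.
configs = [sample_hyperparameters() for _ in range(5)]
for config in configs:
    print(config)
```

Sampling the learning rate and lambda on a log scale matters: the difference between 0.001 and 0.002 is far more significant than the difference between 0.01 and 0.011, so uniform sampling in the exponent explores the useful range evenly.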

TIPS FOR DEBUGGING NEURAL NETWORKS

Neural networks can be tricky to debug because there are many things that can go wrong. Here are some tips to help you identify and fix problems.

If your training cost is not decreasing, first check that your learning rate is not too small. Try increasing it by a factor of 10. Also verify that your backward propagation is correctly implemented by using gradient checking.

If the cost decreases initially but then plateaus at a high value, your network might be stuck on a plateau or saddle point of the loss surface, or the learning rate might be too low. Try increasing the learning rate or using a different weight initialization.

If you see the cost exploding to very large values or becoming NaN (not a number), your learning rate is probably too high. Reduce it by a factor of 10. Also check for numerical instability in your activation functions.
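One common source of such instability is overflow inside np.exp when the sigmoid receives large inputs. The sketch below (a standalone alternative, not the sigmoid helper defined earlier in this guide) evaluates the two branches separately so np.exp is never called on a large positive argument:

```python
import numpy as np

def stable_sigmoid(Z):
    """Sigmoid that never calls np.exp on a large positive argument."""
    Z = np.asarray(Z, dtype=float)
    out = np.empty_like(Z)
    pos = Z >= 0
    # For Z >= 0, exp(-Z) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-Z[pos]))
    # For Z < 0, rewrite as exp(Z) / (1 + exp(Z)); exp(Z) < 1 here.
    expz = np.exp(Z[~pos])
    out[~pos] = expz / (1.0 + expz)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

For extreme inputs like -1000 or 1000 this returns exactly 0 and 1 without triggering overflow warnings, whereas the naive formula 1 / (1 + np.exp(-Z)) overflows for large negative Z.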

If training accuracy is high but validation accuracy is much lower, your network is overfitting. Add regularization (L2 or dropout), reduce the network size, or get more training data.

If both training and validation accuracy are low, your network is underfitting. Try increasing the network size, training for more epochs, or reducing regularization.

If training is very slow, consider using a larger batch size, a faster optimizer like Adam, or reducing the network size.

GRADIENT CHECKING FOR DEBUGGING

Gradient checking is a technique to verify that your backward propagation is correctly implemented. The idea is to numerically approximate the gradients and compare them with the gradients computed by backpropagation. Run it on a small sample with dropout disabled, since dropout's random masks would make the comparison meaningless, and note that ReLU's kink at zero can occasionally cause small, harmless discrepancies.

The numerical gradient for a parameter theta is approximately:

gradient ≈ (cost(theta + epsilon) - cost(theta - epsilon)) / (2 * epsilon)

where epsilon is a small value like 1e-7.
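Before applying this to a full network, the central-difference formula is easy to sanity-check on a scalar function with a known derivative (a toy example, unrelated to the network code):

```python
import numpy as np

def f(theta):
    return theta ** 3          # known derivative: 3 * theta**2

theta, epsilon = 2.0, 1e-7

# Central difference: (f(theta + eps) - f(theta - eps)) / (2 * eps)
numerical = (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)
analytical = 3 * theta ** 2

print(abs(numerical - analytical))  # tiny: the two gradients agree
```

The central difference has error on the order of epsilon squared, which is why it is preferred over the one-sided formula (f(theta + epsilon) - f(theta)) / epsilon, whose error is only on the order of epsilon.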

def gradient_check(network, X, Y, epsilon=1e-7, threshold=1e-7):
    """
    Perform gradient checking to verify backpropagation implementation.
    
    This compares analytical gradients from backpropagation with
    numerical gradients computed using finite differences.
    
    Parameters:
    network : CompleteNeuralNetwork instance
    X : numpy array
        Small sample of input data
    Y : numpy array
        Corresponding labels
    epsilon : float
             Small value for numerical gradient computation
    threshold : float
               Maximum acceptable difference
    
    Returns:
    difference : float
                Relative difference between gradients
    """
    # Get analytical gradients
    AL, caches = network.forward_propagation(X, training=False)
    gradients = network.backward_propagation(AL, Y, caches, dropout_masks=None, lambd=0.0)
    
    # Flatten all parameters and gradients into vectors
    params_values = []
    grad_values = []
    
    for layer in range(1, network.num_layers):
        params_values.extend(network.parameters[f'W{layer}'].flatten())
        params_values.extend(network.parameters[f'b{layer}'].flatten())
        grad_values.extend(gradients[f'dW{layer}'].flatten())
        grad_values.extend(gradients[f'db{layer}'].flatten())
    
    params_values = np.array(params_values)
    grad_values = np.array(grad_values)
    
    # Compute numerical gradients
    num_gradients = np.zeros_like(params_values)
    
    for i in range(len(params_values)):
        # Compute cost with theta + epsilon
        params_plus = params_values.copy()
        params_plus[i] += epsilon
        network_copy_plus = _set_parameters_from_vector(network, params_plus)
        AL_plus, _ = network_copy_plus.forward_propagation(X, training=False)
        cost_plus = network_copy_plus.compute_cost(AL_plus, Y, lambd=0.0)
        
        # Compute cost with theta - epsilon
        params_minus = params_values.copy()
        params_minus[i] -= epsilon
        network_copy_minus = _set_parameters_from_vector(network, params_minus)
        AL_minus, _ = network_copy_minus.forward_propagation(X, training=False)
        cost_minus = network_copy_minus.compute_cost(AL_minus, Y, lambd=0.0)
        
        # Numerical gradient
        num_gradients[i] = (cost_plus - cost_minus) / (2 * epsilon)
    
    # Compute relative difference
    numerator = np.linalg.norm(grad_values - num_gradients)
    denominator = np.linalg.norm(grad_values) + np.linalg.norm(num_gradients)
    difference = numerator / denominator
    
    if difference < threshold:
        print(f"Gradient check passed! Difference: {difference:.10f}")
    else:
        print(f"WARNING: Gradient check failed! Difference: {difference:.10f}")
        print("This suggests an error in the backpropagation implementation.")
    
    return difference


def _set_parameters_from_vector(network, params_vector):
    """Helper function to set network parameters from a vector."""
    import copy
    network_copy = copy.deepcopy(network)
    
    idx = 0
    for layer in range(1, network.num_layers):
        W_shape = network.parameters[f'W{layer}'].shape
        W_size = W_shape[0] * W_shape[1]
        network_copy.parameters[f'W{layer}'] = params_vector[idx:idx + W_size].reshape(W_shape)
        idx += W_size
        
        b_shape = network.parameters[f'b{layer}'].shape
        b_size = b_shape[0] * b_shape[1]
        network_copy.parameters[f'b{layer}'] = params_vector[idx:idx + b_size].reshape(b_shape)
        idx += b_size
    
    return network_copy

PRACTICAL RECOMMENDATIONS

Based on everything we have learned, here are some practical recommendations for building and training neural networks.

Start simple. Begin with a small network and simple settings. Make sure it can overfit a small subset of your data. If it cannot overfit, there is likely a bug in your implementation.
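To make this concrete, here is a standalone version of that sanity check, independent of the CompleteNeuralNetwork class: plain logistic regression, trained with the same sigmoid-plus-cross-entropy gradient used throughout this guide, should memorize four linearly separable points. If a loop like this cannot drive the cost near zero, the gradient computation or update step has a bug.

```python
import numpy as np

# Four linearly separable points; the label equals the first feature.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # shape (features, examples)
Y = np.array([[0.0, 0.0, 1.0, 1.0]])

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 2)) * 0.1
b = np.zeros((1, 1))
lr = 1.0

for _ in range(2000):
    A = 1.0 / (1.0 + np.exp(-(W @ X + b)))   # sigmoid activation
    dZ = A - Y                               # sigmoid + cross-entropy gradient
    W -= lr * (dZ @ X.T) / X.shape[1]
    b -= lr * dZ.mean(axis=1, keepdims=True)

A = 1.0 / (1.0 + np.exp(-(W @ X + b)))
cost = -np.mean(Y * np.log(A + 1e-8) + (1 - Y) * np.log(1 - A + 1e-8))
print(cost < 0.05)  # the model should memorize this tiny dataset
```

The same idea scales up: pick 20 to 50 examples from your real dataset, turn off regularization and dropout, and confirm your full network can reach near-zero training cost on them.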

Use Adam optimizer. For most problems, Adam works well out of the box with a learning rate of 0.001. It is a good default choice.

Normalize your input data. Scale your features to have zero mean and unit variance. This helps the network train faster and more stably.
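A standardization helper might look like the following sketch (the `standardize` function is illustrative, following this guide's (features, examples) layout). The key point is that the mean and standard deviation are computed on the training set only, then reused for validation and test data:

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Scale each feature (row) to zero mean and unit variance."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / (std + eps), mean, std

rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 500))
X_val = rng.normal(loc=5.0, scale=3.0, size=(2, 100))

X_train_norm, mean, std = standardize(X_train)
# Reuse the TRAINING statistics for validation data; computing fresh
# statistics on the validation set would leak information.
X_val_norm = (X_val - mean) / (std + 1e-8)

print(X_train_norm.mean(axis=1))  # each entry close to 0
print(X_train_norm.std(axis=1))   # each entry close to 1
```

Applying training-set statistics everywhere ensures the network sees validation inputs on the same scale it was trained on.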

Use ReLU activation for hidden layers. ReLU is simple, fast, and works well in practice. Use sigmoid or softmax for the output layer depending on your task.

Start without regularization. First get your network to work without regularization. Then add L2 regularization or dropout if you see overfitting.

Monitor both training and validation metrics. Always keep track of both training and validation performance. A large gap indicates overfitting.

Use early stopping. It is a simple and effective way to prevent overfitting without having to tune regularization hyperparameters.

Experiment with architecture. Try different numbers of layers and neurons. Deeper networks can learn more complex patterns but are harder to train.

Be patient. Training neural networks can take time. Do not give up too quickly if results are not perfect immediately.

CONCLUSION

Congratulations! You have now built a complete deep learning neural network from scratch. We started with the basics of a single neuron and gradually added complexity: multiple layers, different activation functions, mini-batch training, advanced optimizers like momentum and Adam, regularization techniques, and early stopping.

You now understand not just how to use neural networks, but how they actually work under the hood. This knowledge will help you debug problems, choose appropriate architectures, and understand what is happening when you use high-level libraries like TensorFlow or PyTorch.

The key concepts we covered include forward propagation for making predictions, backward propagation for computing gradients, gradient descent and its variants for optimization, regularization for preventing overfitting, and various practical techniques for training neural networks effectively.

Remember that building neural networks is as much art as science. There is no one-size-fits-all solution. You will need to experiment with different architectures, hyperparameters, and techniques to find what works best for your specific problem.

The complete implementation we built provides a solid foundation. You can extend it further by adding more activation functions, implementing different cost functions for multi-class classification, adding batch normalization, or implementing convolutional layers for image data.

Keep learning, keep experimenting, and most importantly, keep building. The best way to truly understand neural networks is to implement them yourself and see how they behave with different settings and datasets.

Happy learning!