Hitchhiker's Guide to AI, Software Architecture, and Everything Else: HARNESSING APPLE SILICON: A GUIDE TO GPU AND NPU PROGRAMMING FOR ARTIFICIAL INTELLIGENCE APPLICATIONS

Introduction: Understanding the Apple Silicon Revolution

Apple Silicon represents a fundamental shift in computing architecture. Unlike traditional computers that separate the CPU, GPU, and memory into distinct components connected by buses, Apple Silicon uses a unified memory architecture. This means that the CPU, GPU, and Neural Processing Unit all share the same pool of high-bandwidth memory. This architectural decision eliminates the need to copy data between different memory spaces, which traditionally has been one of the biggest bottlenecks in AI and machine learning workloads.

When you run AI applications on Apple Silicon, you have three main computational resources at your disposal. The CPU handles general-purpose computing tasks and orchestrates the overall workflow. The GPU excels at parallel operations involving large matrices and tensors, which are the fundamental data structures in neural networks. The Neural Engine, which is Apple's term for their NPU, is specifically designed for neural network operations and can perform trillions of operations per second while consuming minimal power.

Understanding how to leverage these resources effectively requires knowledge of several frameworks and APIs that Apple provides. The most important ones are Metal, which gives you low-level access to the GPU; Core ML, which provides high-level neural network inference capabilities; and Metal Performance Shaders, which offers optimized implementations of common computational kernels.

Part One: The Foundation - Metal and the GPU

Metal is Apple's graphics and compute API. While it was originally designed for graphics rendering, it has evolved into a powerful general-purpose GPU computing framework. To understand Metal, you need to grasp a few fundamental concepts.

First, everything in Metal revolves around the concept of a device. The device represents your GPU hardware. You obtain a reference to the default device, which on Apple Silicon is the integrated GPU that shares memory with the CPU. Second, you work with command queues and command buffers. A command queue is a serial stream of command buffers, and each command buffer contains encoded commands that the GPU will execute. Third, you write compute kernels in Metal Shading Language, which is based on C++ but includes special features for GPU programming.

Let's start with a simple example that demonstrates the basic structure of a Metal compute program. We'll create a program that adds two arrays of numbers together, which is a fundamental operation in neural networks.

import Metal

import Foundation

// Step 1: Get access to the GPU device

// This returns the default Metal device, which on Apple Silicon

// is the integrated GPU that shares unified memory

guard let device = MTLCreateSystemDefaultDevice() else {

fatalError("Metal is not supported on this device")

}

// Step 2: Create a command queue

// The command queue manages the execution of command buffers

// Think of it as a pipeline where you submit work to the GPU

guard let commandQueue = device.makeCommandQueue() else {

fatalError("Failed to create command queue")

}

// Step 3: Define the Metal shader code

// This is the code that will actually run on the GPU

// Each thread processes one element of the arrays

let shaderCode = """

#include <metal_stdlib>

using namespace metal;

// The kernel function that runs on the GPU

// Each thread executes this function for one element

kernel void add_arrays(device const float* inA [[buffer(0)]],

device const float* inB [[buffer(1)]],

device float* result [[buffer(2)]],

uint index [[thread_position_in_grid]])

{

// Simply add the corresponding elements

result[index] = inA[index] + inB[index];

}

"""

This code establishes the foundation for GPU computing. The device object is your gateway to the GPU. The command queue is how you submit work. The shader code defines what computation will happen on the GPU. Notice that the kernel function uses special Metal syntax. The "device" keyword indicates that these pointers reference memory in the device address space. The double square brackets contain attributes that tell Metal how to bind data to the function parameters.

Now let's continue with the code that compiles this shader and prepares it for execution:

// Step 4: Compile the shader code into a library

// This is similar to compiling C code into a binary

let library: MTLLibrary

do {

library = try device.makeLibrary(source: shaderCode, options: nil)

} catch {

fatalError("Failed to compile shader: \(error)")

}

// Step 5: Get the specific function we want to execute

// A library can contain multiple kernel functions

guard let addFunction = library.makeFunction(name: "add_arrays") else {

fatalError("Failed to find add_arrays function")

}

// Step 6: Create a compute pipeline state

// This is an optimized, ready-to-execute version of our kernel

let pipelineState: MTLComputePipelineState

do {

pipelineState = try device.makeComputePipelineState(function: addFunction)

} catch {

fatalError("Failed to create pipeline state: \(error)")

}

The compilation process transforms your Metal Shading Language code into GPU machine code. The pipeline state object is highly optimized for your specific GPU and represents a ready-to-execute kernel. Creating pipeline states can be expensive, so you typically do this once during initialization and reuse the pipeline state many times.

Now we need to create the actual data and allocate GPU buffers to hold it:

// Step 7: Prepare the input data

// We'll create two arrays of 1 million floats

let arrayLength = 1_000_000

var inputA = [Float](repeating: 0, count: arrayLength)

var inputB = [Float](repeating: 0, count: arrayLength)

// Fill with some test data

for i in 0..<arrayLength {

inputA[i] = Float(i)

inputB[i] = Float(i * 2)

}

// Step 8: Create Metal buffers

// These buffers live in the unified memory and can be accessed

// by both the CPU and GPU without copying

let bufferSize = arrayLength * MemoryLayout<Float>.stride

guard let bufferA = device.makeBuffer(bytes: inputA,

length: bufferSize,

options: .storageModeShared) else {

fatalError("Failed to create buffer A")

}

guard let bufferB = device.makeBuffer(bytes: inputB,

length: bufferSize,

options: .storageModeShared) else {

fatalError("Failed to create buffer B")

}

guard let bufferResult = device.makeBuffer(length: bufferSize,

options: .storageModeShared) else {

fatalError("Failed to create result buffer")

}

The storageModeShared option is crucial for Apple Silicon. It tells Metal to allocate the buffer in unified memory that both the CPU and GPU can access directly. On discrete GPUs, you would typically use different storage modes and explicitly copy data between CPU and GPU memory. On Apple Silicon, the unified memory architecture lets you avoid those transfers. Note that makeBuffer(bytes:length:options:) still copies the Swift array into the new buffer once at creation time; what you avoid is any further transfer before the GPU reads the inputs or the CPU reads the results.

Now comes the exciting part where we actually execute our kernel on the GPU:

// Step 9: Create a command buffer

// This represents a batch of work to submit to the GPU

guard let commandBuffer = commandQueue.makeCommandBuffer() else {

fatalError("Failed to create command buffer")

}

// Step 10: Create a compute command encoder

// This is what we use to encode GPU commands into the buffer

guard let computeEncoder = commandBuffer.makeComputeCommandEncoder() else {

fatalError("Failed to create compute encoder")

}

// Step 11: Set up the compute command

// Tell the encoder which pipeline to use

computeEncoder.setComputePipelineState(pipelineState)

// Bind our buffers to the kernel parameters

// The indices (0, 1, 2) correspond to the buffer indices in the shader

computeEncoder.setBuffer(bufferA, offset: 0, index: 0)

computeEncoder.setBuffer(bufferB, offset: 0, index: 1)

computeEncoder.setBuffer(bufferResult, offset: 0, index: 2)

// Step 12: Determine the thread execution configuration

// We need to tell Metal how many threads to launch

let threadsPerGrid = MTLSize(width: arrayLength, height: 1, depth: 1)

// The threadgroup size determines how threads are organized

// Metal has a maximum threadgroup size, so we query it

let maxThreadsPerThreadgroup = pipelineState.maxTotalThreadsPerThreadgroup

let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup,

height: 1,

depth: 1)

// Dispatch the compute kernel

computeEncoder.dispatchThreads(threadsPerGrid,

threadsPerThreadgroup: threadsPerThreadgroup)

// Step 13: Finish encoding and commit the work

computeEncoder.endEncoding()

// Commit the command buffer to the queue for execution

commandBuffer.commit()

// Wait for the GPU to finish

commandBuffer.waitUntilCompleted()

// Step 14: Read back the results

// Because we used shared storage mode, we can directly access the buffer

let resultPointer = bufferResult.contents().bindMemory(to: Float.self,

capacity: arrayLength)

let resultArray = Array(UnsafeBufferPointer(start: resultPointer,

count: arrayLength))

// Verify a few results

print("First result: \(resultArray[0]) (expected 0)")

print("Second result: \(resultArray[1]) (expected 3)")

print("Third result: \(resultArray[2]) (expected 6)")

This example demonstrates the complete workflow for GPU computing on Apple Silicon. You create a device and command queue, compile your shader code, create buffers in unified memory, encode commands that bind your data and dispatch threads, and then commit the work to the GPU. The unified memory architecture means that once the GPU finishes, you can immediately access the results without any explicit copy operations.

The thread execution model is important to understand. When you dispatch threads, you specify how many threads to launch in total and how they should be organized into threadgroups. A threadgroup is a collection of threads that execute together and can share fast threadgroup memory. The GPU scheduler assigns threadgroups to execution units on the GPU. Each thread gets a unique index that it can use to determine which piece of data to process.
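To make the indexing arithmetic concrete, here is a small CPU-side sketch in plain Swift. It is not Metal code: the function names threadgroupCount and coveredIndices are made up for illustration, and the threadgroup size of 1024 is an assumed value (the real limit comes from maxTotalThreadsPerThreadgroup and varies by device and kernel). It assumes a 1D grid rounded up to whole threadgroups, in which case the kernel needs a bounds check so the extra threads do nothing.

```swift
/// Number of threadgroups needed to cover `gridSize` threads (ceiling division).
func threadgroupCount(gridSize: Int, threadsPerThreadgroup: Int) -> Int {
    precondition(gridSize >= 0 && threadsPerThreadgroup > 0)
    return (gridSize + threadsPerThreadgroup - 1) / threadsPerThreadgroup
}

/// Global indices that pass the bounds check, derived the way a GPU thread
/// derives its position: threadgroup position * threadgroup size + position
/// inside the threadgroup.
func coveredIndices(gridSize: Int, threadsPerThreadgroup: Int) -> [Int] {
    var indices: [Int] = []
    indices.reserveCapacity(gridSize)
    let groups = threadgroupCount(gridSize: gridSize,
                                  threadsPerThreadgroup: threadsPerThreadgroup)
    for group in 0..<groups {
        for local in 0..<threadsPerThreadgroup {
            let global = group * threadsPerThreadgroup + local
            // Threads past the end of the data exit early in a real kernel
            if global < gridSize {
                indices.append(global)
            }
        }
    }
    return indices
}

// Assumed threadgroup size of 1024 for the 1,000,000-element example
let exampleGroups = threadgroupCount(gridSize: 1_000_000, threadsPerThreadgroup: 1024)
print("Threadgroups needed: \(exampleGroups)")
```

With these assumed numbers the last threadgroup is only partly used, which is why a rounded-up dispatch needs the bounds check. The dispatchThreads call used above avoids the rounding by letting Metal shrink the final threadgroup.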

Part Two: Matrix Operations for Neural Networks

Neural networks are built on matrix operations. The most fundamental operation is matrix multiplication, which is used in every layer of a neural network. Let's build a matrix multiplication kernel that demonstrates more advanced Metal programming techniques.

Matrix multiplication takes two input matrices and produces an output matrix. If matrix A has dimensions M by K and matrix B has dimensions K by N, the result matrix C has dimensions M by N. Each element of C is computed by taking the dot product of a row from A with a column from B.

Here's a Metal kernel that performs matrix multiplication:

let matrixMultiplyShader = """

#include <metal_stdlib>

using namespace metal;

// Kernel for matrix multiplication: C = A * B

// A is M x K, B is K x N, C is M x N

kernel void matrix_multiply(device const float* A [[buffer(0)]],

device const float* B [[buffer(1)]],

device float* C [[buffer(2)]],

constant uint& M [[buffer(3)]],

constant uint& N [[buffer(4)]],

constant uint& K [[buffer(5)]],

uint2 gid [[thread_position_in_grid]])

{

// Each thread computes one element of the output matrix

uint row = gid.y; // Row index in output matrix

uint col = gid.x; // Column index in output matrix

// Bounds check to handle cases where grid size doesn't match exactly

if (row >= M || col >= N) {

return;

}

// Compute the dot product of row from A with column from B

float sum = 0.0;

for (uint k = 0; k < K; k++) {

// A is stored in row-major order: A[row][k] = A[row * K + k]

// B is stored in row-major order: B[k][col] = B[k * N + col]

sum += A[row * K + k] * B[k * N + col];

}

// Store the result in C

// C is stored in row-major order: C[row][col] = C[row * N + col]

C[row * N + col] = sum;

}

"""

This kernel is more sophisticated than our simple addition example. It uses a two-dimensional thread grid where each thread computes one element of the output matrix. The thread position is given by a uint2 value that contains both the x and y coordinates. The kernel also takes the matrix dimensions as parameters, which makes it flexible for different sized matrices.

Now let's write the Swift code to use this kernel:

// Create a helper class to manage matrix multiplication

class MetalMatrixMultiplier {

let device: MTLDevice

let commandQueue: MTLCommandQueue

let pipelineState: MTLComputePipelineState

init() {

// Initialize Metal device and command queue

guard let device = MTLCreateSystemDefaultDevice() else {

fatalError("Metal is not supported")

}

self.device = device

guard let commandQueue = device.makeCommandQueue() else {

fatalError("Failed to create command queue")

}

self.commandQueue = commandQueue

// Compile the shader and create pipeline state

let library: MTLLibrary

do {

library = try device.makeLibrary(source: matrixMultiplyShader,

options: nil)

} catch {

fatalError("Failed to compile shader: \(error)")

}

guard let function = library.makeFunction(name: "matrix_multiply") else {

fatalError("Failed to find matrix_multiply function")

}

do {

self.pipelineState = try device.makeComputePipelineState(

function: function)

} catch {

fatalError("Failed to create pipeline state: \(error)")

}

}

func multiply(matrixA: [Float], rowsA: Int, colsA: Int,

matrixB: [Float], rowsB: Int, colsB: Int) -> [Float]? {

// Verify dimensions are compatible

guard colsA == rowsB else {

print("Incompatible matrix dimensions")

return nil

}

let M = rowsA

let K = colsA

let N = colsB

// Create buffers for the matrices

let sizeA = M * K * MemoryLayout<Float>.stride

let sizeB = K * N * MemoryLayout<Float>.stride

let sizeC = M * N * MemoryLayout<Float>.stride

guard let bufferA = device.makeBuffer(bytes: matrixA,

length: sizeA,

options: .storageModeShared),

let bufferB = device.makeBuffer(bytes: matrixB,

length: sizeB,

options: .storageModeShared),

let bufferC = device.makeBuffer(length: sizeC,

options: .storageModeShared) else {

print("Failed to create buffers")

return nil

}

// Create buffers for the dimension parameters

var mValue = UInt32(M)

var nValue = UInt32(N)

var kValue = UInt32(K)

guard let bufferM = device.makeBuffer(bytes: &mValue,

length: MemoryLayout<UInt32>.stride,

options: .storageModeShared),

let bufferN = device.makeBuffer(bytes: &nValue,

length: MemoryLayout<UInt32>.stride,

options: .storageModeShared),

let bufferK = device.makeBuffer(bytes: &kValue,

length: MemoryLayout<UInt32>.stride,

options: .storageModeShared) else {

print("Failed to create dimension buffers")

return nil

}

// Create command buffer and encoder

guard let commandBuffer = commandQueue.makeCommandBuffer(),

let encoder = commandBuffer.makeComputeCommandEncoder() else {

print("Failed to create command buffer or encoder")

return nil

}

// Set up the compute command

encoder.setComputePipelineState(pipelineState)

encoder.setBuffer(bufferA, offset: 0, index: 0)

encoder.setBuffer(bufferB, offset: 0, index: 1)

encoder.setBuffer(bufferC, offset: 0, index: 2)

encoder.setBuffer(bufferM, offset: 0, index: 3)

encoder.setBuffer(bufferN, offset: 0, index: 4)

encoder.setBuffer(bufferK, offset: 0, index: 5)

// Configure thread execution

// We need M x N threads total, organized in a 2D grid

let threadsPerGrid = MTLSize(width: N, height: M, depth: 1)

// Choose a reasonable threadgroup size

// 16x16 is a common choice for 2D operations

let threadgroupWidth = min(16, N)

let threadgroupHeight = min(16, M)

let threadsPerThreadgroup = MTLSize(width: threadgroupWidth,

height: threadgroupHeight,

depth: 1)

encoder.dispatchThreads(threadsPerGrid,

threadsPerThreadgroup: threadsPerThreadgroup)

encoder.endEncoding()

// Execute and wait

commandBuffer.commit()

commandBuffer.waitUntilCompleted()

// Read results

let resultPointer = bufferC.contents().bindMemory(to: Float.self,

capacity: M * N)

return Array(UnsafeBufferPointer(start: resultPointer, count: M * N))

}

}

This class encapsulates the matrix multiplication functionality in a reusable way. The initialization sets up the Metal device and compiles the shader once. The multiply method can then be called multiple times with different matrices without recompiling the shader.

Let's test this with a simple example:

// Create the multiplier

let multiplier = MetalMatrixMultiplier()

// Create two small test matrices

// Matrix A: 3x2

let matrixA: [Float] = [

1, 2,

3, 4,

5, 6

]

// Matrix B: 2x4

let matrixB: [Float] = [

1, 2, 3, 4,

5, 6, 7, 8

]

// Multiply them

if let result = multiplier.multiply(matrixA: matrixA, rowsA: 3, colsA: 2,

matrixB: matrixB, rowsB: 2, colsB: 4) {

print("Result matrix (3x4):")

for row in 0..<3 {

for col in 0..<4 {

let value = result[row * 4 + col]

print(String(format: "%.1f ", value), terminator: "")

}

print()

}

}

This will compute the matrix product and print the result. The expected output is a 3 by 4 matrix where each element is the dot product of the corresponding row from A and column from B.
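A CPU reference implementation is a handy way to check GPU output. The sketch below is plain Swift and independent of Metal; referenceMatmul is an illustrative name, and it assumes the same row-major layout as the kernel.

```swift
/// CPU reference: C = A * B with all matrices stored in row-major order.
func referenceMatmul(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n, "dimension mismatch")
    var c = [Float](repeating: 0, count: m * n)
    for row in 0..<m {
        for col in 0..<n {
            // Dot product of row `row` of A with column `col` of B
            var sum: Float = 0
            for i in 0..<k {
                sum += a[row * k + i] * b[i * n + col]
            }
            c[row * n + col] = sum
        }
    }
    return c
}

// Same test matrices as above: A is 3x2, B is 2x4
let referenceA: [Float] = [1, 2, 3, 4, 5, 6]
let referenceB: [Float] = [1, 2, 3, 4, 5, 6, 7, 8]
let referenceC = referenceMatmul(referenceA, referenceB, m: 3, k: 2, n: 4)
print(referenceC)
```

For these inputs the rows of the product are 11 14 17 20, then 23 30 37 44, then 35 46 57 68, so the GPU result can be compared against those values.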

Part Three: Optimizing Matrix Multiplication with Threadgroup Memory

The matrix multiplication kernel we just wrote is correct but not optimally efficient. Each thread reads from global memory many times, and multiple threads read the same values. We can optimize this by using threadgroup memory, which is a fast shared memory that threads within a threadgroup can use to communicate.

The optimization strategy is called tiling. Instead of having each thread independently compute one output element, we have each threadgroup cooperatively compute a tile of the output matrix. The threads in the threadgroup work together to load tiles of the input matrices into threadgroup memory, then each thread computes its output element using the cached data.

Here's an optimized kernel using threadgroup memory:

let optimizedMatrixMultiplyShader = """

#include <metal_stdlib>

using namespace metal;

// Optimized matrix multiplication using threadgroup memory

// This version uses tiling to reduce global memory accesses

kernel void matrix_multiply_tiled(

device const float* A [[buffer(0)]],

device const float* B [[buffer(1)]],

device float* C [[buffer(2)]],

constant uint& M [[buffer(3)]],

constant uint& N [[buffer(4)]],

constant uint& K [[buffer(5)]],

uint2 gid [[thread_position_in_grid]],

uint2 tid [[thread_position_in_threadgroup]],

uint2 tgSize [[threads_per_threadgroup]])

{

// Define the tile size

// This should match the threadgroup size

const uint TILE_SIZE = 16;

// Allocate threadgroup memory for tiles

// These are shared among all threads in the threadgroup

threadgroup float tileA[TILE_SIZE][TILE_SIZE];

threadgroup float tileB[TILE_SIZE][TILE_SIZE];

uint row = gid.y;

uint col = gid.x;

// Accumulator for the dot product

float sum = 0.0;

// Number of tiles we need to process

uint numTiles = (K + TILE_SIZE - 1) / TILE_SIZE;

// Loop over tiles

for (uint t = 0; t < numTiles; t++) {

// Each thread loads one element into tileA

uint aRow = row;

uint aCol = t * TILE_SIZE + tid.x;

if (aRow < M && aCol < K) {

tileA[tid.y][tid.x] = A[aRow * K + aCol];

} else {

tileA[tid.y][tid.x] = 0.0;

}

// Each thread loads one element into tileB

uint bRow = t * TILE_SIZE + tid.y;

uint bCol = col;

if (bRow < K && bCol < N) {

tileB[tid.y][tid.x] = B[bRow * N + bCol];

} else {

tileB[tid.y][tid.x] = 0.0;

}

// Wait for all threads in the threadgroup to finish loading

threadgroup_barrier(mem_flags::mem_threadgroup);

// Compute partial dot product using the cached tiles

for (uint k = 0; k < TILE_SIZE; k++) {

sum += tileA[tid.y][k] * tileB[k][tid.x];

}

// Wait for all threads to finish computing before loading next tile

threadgroup_barrier(mem_flags::mem_threadgroup);

}

// Write the final result

if (row < M && col < N) {

C[row * N + col] = sum;

}

}

"""

This optimized kernel is more complex but significantly faster for large matrices. The key insight is that threads within a threadgroup cooperate to load tiles of data into fast threadgroup memory. Each thread loads one element of each tile, then all threads synchronize using a threadgroup barrier. After the barrier, all threads can read from the cached tiles without accessing slow global memory.

The threadgroup barrier is crucial. It ensures that all threads have finished loading data before any thread starts reading from the threadgroup memory. Without this synchronization, some threads might try to read data that hasn't been loaded yet, leading to incorrect results.
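To see the tiling logic without any GPU details, here is a CPU walk-through in plain Swift. It is a sketch, not the kernel itself: tiledMatmul is an illustrative name, the loops over ty and tx stand in for the threads of one threadgroup, and the comments mark where the barriers would fall. It assumes every threadgroup is a full tile, which is also what the kernel above appears to rely on, so on the GPU you would likely dispatch whole 16 by 16 threadgroups (rounding the grid up) rather than letting edge threadgroups shrink.

```swift
/// CPU walk-through of tiled matrix multiplication with zero padding at the edges.
func tiledMatmul(_ a: [Float], _ b: [Float],
                 m: Int, k: Int, n: Int, tileSize: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n && tileSize > 0)
    var c = [Float](repeating: 0, count: m * n)
    let numTiles = (k + tileSize - 1) / tileSize
    let groupsY = (m + tileSize - 1) / tileSize
    let groupsX = (n + tileSize - 1) / tileSize
    var tileA = [Float](repeating: 0, count: tileSize * tileSize)
    var tileB = [Float](repeating: 0, count: tileSize * tileSize)

    for gy in 0..<groupsY {
        for gx in 0..<groupsX {
            // One accumulator per simulated thread in this threadgroup
            var sums = [Float](repeating: 0, count: tileSize * tileSize)
            for t in 0..<numTiles {
                // Load phase: each thread (ty, tx) loads one element per tile
                for ty in 0..<tileSize {
                    for tx in 0..<tileSize {
                        let row = gy * tileSize + ty
                        let col = gx * tileSize + tx
                        let aCol = t * tileSize + tx
                        let bRow = t * tileSize + ty
                        tileA[ty * tileSize + tx] =
                            (row < m && aCol < k) ? a[row * k + aCol] : 0
                        tileB[ty * tileSize + tx] =
                            (bRow < k && col < n) ? b[bRow * n + col] : 0
                    }
                }
                // First barrier would go here: all loads finish before any reads
                for ty in 0..<tileSize {
                    for tx in 0..<tileSize {
                        for i in 0..<tileSize {
                            sums[ty * tileSize + tx] +=
                                tileA[ty * tileSize + i] * tileB[i * tileSize + tx]
                        }
                    }
                }
                // Second barrier would go here: all reads finish before the next load
            }
            // Write phase: only threads inside the output matrix store a result
            for ty in 0..<tileSize {
                for tx in 0..<tileSize {
                    let row = gy * tileSize + ty
                    let col = gx * tileSize + tx
                    if row < m && col < n {
                        c[row * n + col] = sums[ty * tileSize + tx]
                    }
                }
            }
        }
    }
    return c
}

let tiledA: [Float] = [1, 2, 3, 4, 5, 6]        // 3x2
let tiledB: [Float] = [1, 2, 3, 4, 5, 6, 7, 8]  // 2x4
print(tiledMatmul(tiledA, tiledB, m: 3, k: 2, n: 4, tileSize: 2))
```

Because out-of-range tile entries are filled with zeros, the tile size does not have to divide the matrix dimensions evenly.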

Let's understand the memory access pattern more deeply. In the naive kernel, if we have a 1000 by 1000 matrix multiplication, each thread reads 1000 elements from matrix A and 1000 elements from matrix B. That's 2000 global memory reads per thread, and with 1 million threads total, we have 2 billion global memory accesses. Many of these accesses read the same data because multiple threads need the same elements.

In the tiled kernel, threads cooperate to load data into threadgroup memory. With 16 by 16 tiles, each thread loads just one element of A and one element of B per tile, and the 256 threads in the threadgroup then share all 512 cached values. Every value fetched from global memory is reused 16 times, so for the 1000 by 1000 example each thread issues about 126 global reads (2 per tile across 63 tiles) instead of 2000, a reduction of roughly 16 times.
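The arithmetic behind this comparison can be checked with a few lines of Swift. This is a back-of-the-envelope sketch: the function names are illustrative, the tiled figure is an upper bound that counts one A load and one B load per thread per tile across a grid rounded up to whole tiles, and it ignores hardware caches entirely.

```swift
/// Global-memory reads issued by the naive kernel:
/// each of the M * N threads reads K elements of A and K elements of B.
func naiveGlobalReads(m: Int, n: Int, k: Int) -> Int {
    return m * n * 2 * k
}

/// Upper bound on global-memory reads for the tiled kernel:
/// each thread in the padded grid loads at most two elements per tile.
func tiledGlobalReads(m: Int, n: Int, k: Int, tileSize: Int) -> Int {
    precondition(tileSize > 0)
    let numTiles = (k + tileSize - 1) / tileSize
    let paddedM = (m + tileSize - 1) / tileSize * tileSize
    let paddedN = (n + tileSize - 1) / tileSize * tileSize
    return paddedM * paddedN * 2 * numTiles
}

let naiveReads = naiveGlobalReads(m: 1000, n: 1000, k: 1000)
let tiledReads = tiledGlobalReads(m: 1000, n: 1000, k: 1000, tileSize: 16)
print("naive: \(naiveReads), tiled: \(tiledReads)")
print("reduction: \(Double(naiveReads) / Double(tiledReads))x")
```

The reduction comes out slightly below 16 because 1000 is not a multiple of 16, so the grid and the tile count are both rounded up.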

Part Four: Metal Performance Shaders for Common Operations

Writing custom Metal kernels gives you maximum control, but Apple also provides Metal Performance Shaders (MPS), a library of highly optimized implementations of common operations. For production AI applications, you should use MPS whenever it covers the operation you need, because Apple's engineers have spent enormous effort optimizing these kernels for Apple Silicon.

Metal Performance Shaders includes operations for neural networks, image processing, linear algebra, and more. Let's look at how to use MPS for matrix multiplication, which is much simpler than writing our own kernel:

import MetalPerformanceShaders

class MPSMatrixMultiplier {

let device: MTLDevice

let commandQueue: MTLCommandQueue

init() {

guard let device = MTLCreateSystemDefaultDevice() else {

fatalError("Metal is not supported")

}

self.device = device

guard let commandQueue = device.makeCommandQueue() else {

fatalError("Failed to create command queue")

}

self.commandQueue = commandQueue

}

func multiply(matrixA: [Float], rowsA: Int, colsA: Int,

matrixB: [Float], rowsB: Int, colsB: Int) -> [Float]? {

guard colsA == rowsB else {

print("Incompatible dimensions")

return nil

}

let M = rowsA

let K = colsA

let N = colsB

// Create Metal buffers

let sizeA = M * K * MemoryLayout<Float>.stride

let sizeB = K * N * MemoryLayout<Float>.stride

let sizeC = M * N * MemoryLayout<Float>.stride

guard let bufferA = device.makeBuffer(bytes: matrixA,

length: sizeA,

options: .storageModeShared),

let bufferB = device.makeBuffer(bytes: matrixB,

length: sizeB,

options: .storageModeShared),

let bufferC = device.makeBuffer(length: sizeC,

options: .storageModeShared) else {

return nil

}

// Create MPS matrix descriptors

// These describe the layout and dimensions of the matrices

let descA = MPSMatrixDescriptor(rows: M, columns: K,

rowBytes: K * MemoryLayout<Float>.stride,

dataType: .float32)

let descB = MPSMatrixDescriptor(rows: K, columns: N,

rowBytes: N * MemoryLayout<Float>.stride,

dataType: .float32)

let descC = MPSMatrixDescriptor(rows: M, columns: N,

rowBytes: N * MemoryLayout<Float>.stride,

dataType: .float32)

// Create MPS matrices wrapping our buffers

let mpsA = MPSMatrix(buffer: bufferA, descriptor: descA)

let mpsB = MPSMatrix(buffer: bufferB, descriptor: descB)

let mpsC = MPSMatrix(buffer: bufferC, descriptor: descC)

// Create the matrix multiplication kernel

// alpha and beta are scaling factors: C = alpha * A * B + beta * C

// We use alpha = 1 and beta = 0 for simple multiplication

let matmul = MPSMatrixMultiplication(device: device,

transposeLeft: false,

transposeRight: false,

resultRows: M,

resultColumns: N,

interiorColumns: K,

alpha: 1.0,

beta: 0.0)

// Encode and execute

guard let commandBuffer = commandQueue.makeCommandBuffer() else {

return nil

}

matmul.encode(commandBuffer: commandBuffer,

leftMatrix: mpsA,

rightMatrix: mpsB,

resultMatrix: mpsC)

commandBuffer.commit()

commandBuffer.waitUntilCompleted()

// Read results

let resultPointer = bufferC.contents().bindMemory(to: Float.self,

capacity: M * N)

return Array(UnsafeBufferPointer(start: resultPointer, count: M * N))

}

}

This code is much simpler than our custom kernel, and the MPS implementation is highly optimized. It uses advanced techniques like tiling, vectorization, and careful memory access patterns that are tuned specifically for Apple Silicon. For matrix multiplication and other common operations, you should always prefer MPS over custom kernels unless you have very specific requirements.
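The alpha and beta parameters follow the general matrix multiply convention, C = alpha * A * B + beta * C. A plain Swift reference makes the semantics explicit; referenceGemm is an illustrative helper written for this guide, not part of MPS, and it assumes row-major storage.

```swift
/// CPU reference for C = alpha * (A * B) + beta * C, row-major storage.
func referenceGemm(alpha: Float, a: [Float], b: [Float],
                   beta: Float, c: inout [Float],
                   m: Int, k: Int, n: Int) {
    precondition(a.count == m * k && b.count == k * n && c.count == m * n)
    for row in 0..<m {
        for col in 0..<n {
            var dot: Float = 0
            for i in 0..<k {
                dot += a[row * k + i] * b[i * n + col]
            }
            // With beta == 0 the old contents of C are ignored entirely
            let previous = beta == 0 ? 0 : beta * c[row * n + col]
            c[row * n + col] = alpha * dot + previous
        }
    }
}

// A is 2x2, B is the 2x2 identity, C starts as all ones
var gemmC: [Float] = [1, 1, 1, 1]
referenceGemm(alpha: 2, a: [1, 2, 3, 4], b: [1, 0, 0, 1],
              beta: 1, c: &gemmC, m: 2, k: 2, n: 2)
print(gemmC)
```

Here each output element is twice the matching element of A plus the original 1, giving 3, 5, 7, 9. Setting alpha to 1 and beta to 0, as in the MPS code above, reduces this to a plain product.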

Metal Performance Shaders also includes a complete neural network graph API called MPS Graph. This allows you to build complex neural network architectures using high-level operations. Here's an example of creating a simple neural network layer:

import MetalPerformanceShaders

import MetalPerformanceShadersGraph

class SimpleNeuralLayer {

let device: MTLDevice

let graph: MPSGraph

let weightsTensor: MPSGraphTensor

let biasTensor: MPSGraphTensor

init(inputSize: Int, outputSize: Int) {

guard let device = MTLCreateSystemDefaultDevice() else {

fatalError("Metal not supported")

}

self.device = device

// Create an MPS Graph

self.graph = MPSGraph()

// Create placeholder tensors for weights and bias

// These will be filled with actual data at execution time

let weightsShape = [outputSize as NSNumber, inputSize as NSNumber]

self.weightsTensor = graph.placeholder(shape: weightsShape,

dataType: .float32,

name: "weights")

let biasShape = [outputSize as NSNumber]

self.biasTensor = graph.placeholder(shape: biasShape,

dataType: .float32,

name: "bias")

}

func forward(input: MPSGraphTensor) -> MPSGraphTensor {

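// Perform matrix multiplication: output = input * weights^T

// weights has shape [outputSize, inputSize], so transpose it first

let weightsT = graph.transposeTensor(weightsTensor,

dimension: 0,

withDimension: 1,

name: "weights_transposed")

let matmul = graph.matrixMultiplication(primary: input,

secondary: weightsT,

name: "matmul")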

// Add bias

let withBias = graph.addition(matmul, biasTensor, name: "add_bias")

// Apply ReLU activation

let output = graph.reLU(with: withBias, name: "relu")

return output

}

}

MPS Graph provides a high-level way to build neural networks. You define the computational graph symbolically using placeholder tensors and operations. The graph is then compiled and optimized for execution on the GPU. This approach is similar to frameworks like TensorFlow or PyTorch, but it's specifically designed for Apple Silicon and can leverage both the GPU and Neural Engine.
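It helps to pin down what the layer computes before trusting the graph output. The following plain Swift sketch is a CPU reference for the same formula, output = relu(input * weights^T + bias). The name denseForward and the explicit batch dimension are assumptions made for illustration; all arrays are row-major.

```swift
/// CPU reference for one dense layer with ReLU.
/// input:   [batch, inputSize]
/// weights: [outputSize, inputSize]
/// bias:    [outputSize]
/// returns: [batch, outputSize]
func denseForward(input: [Float], weights: [Float], bias: [Float],
                  batch: Int, inputSize: Int, outputSize: Int) -> [Float] {
    precondition(input.count == batch * inputSize)
    precondition(weights.count == outputSize * inputSize)
    precondition(bias.count == outputSize)
    var output = [Float](repeating: 0, count: batch * outputSize)
    for b in 0..<batch {
        for o in 0..<outputSize {
            // Row o of weights dotted with sample b is (input * weights^T)[b][o]
            var sum = bias[o]
            for i in 0..<inputSize {
                sum += input[b * inputSize + i] * weights[o * inputSize + i]
            }
            // ReLU clamps negative values to zero
            output[b * outputSize + o] = max(0, sum)
        }
    }
    return output
}

let layerOutput = denseForward(input: [1, 2],
                               weights: [1, 1, -1, -1],
                               bias: [0.5, 0.5],
                               batch: 1, inputSize: 2, outputSize: 2)
print(layerOutput)
```

The first output unit sums to 3.5 and passes through ReLU unchanged, while the second sums to -2.5 and is clamped to 0.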

Part Five: Leveraging the Neural Engine with Core ML

The Neural Engine is Apple's dedicated neural network accelerator. It's designed specifically for neural network inference and can perform operations much more efficiently than the GPU for certain workloads. The Neural Engine is particularly good at convolutional operations, which are the foundation of computer vision models.

You access the Neural Engine through Core ML, Apple's machine learning framework. Core ML provides a high-level API for running trained neural networks. You typically train your model using a framework like PyTorch or TensorFlow, then convert it to Core ML format, and finally deploy it on Apple devices where it can run on the Neural Engine.

Let's walk through the process of using Core ML. First, you need a trained model. For this example, let's assume you have a simple image classification model. Here's how you would use it:

import CoreML

import Vision

import CoreImage

class ImageClassifier {

let model: VNCoreMLModel

init(modelURL: URL) throws {

// Load the Core ML model

// MLModel(contentsOf:) expects a compiled model (.mlmodelc); compile a raw

// .mlmodel first, for example with MLModel.compileModel(at:)

let mlModel = try MLModel(contentsOf: modelURL)

// Wrap it in a Vision model for easy image processing

self.model = try VNCoreMLModel(for: mlModel)

}

func classify(image: CIImage, completion: @escaping ([String: Float]) -> Void) {

// Create a Vision request

let request = VNCoreMLRequest(model: model) { request, error in

guard let results = request.results as? [VNClassificationObservation] else {

completion([:])

return

}

// Convert results to a dictionary

var classifications: [String: Float] = [:]

for result in results {

classifications[result.identifier] = result.confidence

}

completion(classifications)

}

// Allow Vision to use compute units other than the CPU

// Core ML then chooses among the CPU, GPU, and Neural Engine itself;

// this flag cannot force the Neural Engine and is deprecated in recent SDKs

request.usesCPUOnly = false

// Create a request handler and perform the request

let handler = VNImageRequestHandler(ciImage: image, options: [:])

do {

try handler.perform([request])

} catch {

print("Failed to perform classification: \(error)")

completion([:])

}

}

}

Core ML automatically decides whether to run your model on the CPU, GPU, or Neural Engine based on the model architecture and the available hardware. The Neural Engine is particularly efficient for models that use convolutions, which includes most computer vision models. For models that are primarily matrix multiplications, like transformers, the GPU might be faster.

You can also have more control over where your model runs by specifying compute units:

class FlexibleImageClassifier {

let cpuModel: MLModel

let gpuModel: MLModel

let neuralEngineModel: MLModel

init(modelURL: URL) throws {

// Create configurations for different compute units

let cpuConfig = MLModelConfiguration()

cpuConfig.computeUnits = .cpuOnly

let gpuConfig = MLModelConfiguration()

gpuConfig.computeUnits = .cpuAndGPU

let neConfig = MLModelConfiguration()

neConfig.computeUnits = .all // Allows Neural Engine

// Load the model with different configurations

self.cpuModel = try MLModel(contentsOf: modelURL,

configuration: cpuConfig)

self.gpuModel = try MLModel(contentsOf: modelURL,

configuration: gpuConfig)

self.neuralEngineModel = try MLModel(contentsOf: modelURL,

configuration: neConfig)

}

func classifyOnNeuralEngine(image: CVPixelBuffer) throws -> MLFeatureProvider {

// Prepare input; the feature name "image" must match the model's input name

let input = try MLDictionaryFeatureProvider(dictionary: ["image": MLFeatureValue(pixelBuffer: image)])

// Run inference

let output = try neuralEngineModel.prediction(from: input)

return output

}

}

The Neural Engine has some constraints. It works best with models that use 16-bit floating point or 8-bit integer quantization. It has limitations on the maximum size of certain operations. Core ML will automatically fall back to the GPU or CPU for operations that the Neural Engine cannot handle, so your model might run on a combination of compute units.
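To give a feel for what 8-bit integer quantization means numerically, here is a generic sketch of symmetric per-tensor quantization in plain Swift. It illustrates the arithmetic only; it is not necessarily the scheme that Core ML or its conversion tools apply, and the function names are made up for this example.

```swift
/// Symmetric per-tensor quantization: value is approximated by scale * Float(q),
/// with q restricted to the range -127...127.
func quantizeInt8(_ values: [Float]) -> (quantized: [Int8], scale: Float) {
    let maxAbs = values.map { abs($0) }.max() ?? 0
    // Avoid dividing by zero when every value is zero
    let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
    let quantized = values.map { value -> Int8 in
        let scaled = (value / scale).rounded()
        return Int8(max(-127, min(127, scaled)))
    }
    return (quantized, scale)
}

/// Map quantized values back to approximate floating-point values.
func dequantizeInt8(_ quantized: [Int8], scale: Float) -> [Float] {
    return quantized.map { Float($0) * scale }
}

let originalWeights: [Float] = [0.5, -0.25, 0.1]
let packed = quantizeInt8(originalWeights)
let restored = dequantizeInt8(packed.quantized, scale: packed.scale)
print(packed.quantized, packed.scale, restored)
```

Each restored value differs from the original by at most about half of scale, which is the precision you trade for a representation four times smaller than 32-bit floats.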

Part Six: Building a Complete AI Pipeline

Now let's put everything together and build a complete AI pipeline that leverages all the compute resources on Apple Silicon. We'll create a system that processes images through multiple stages, using the appropriate compute unit for each stage.

Our pipeline will perform the following steps. First, we'll preprocess the image on the GPU using Metal compute shaders. Second, we'll run a neural network for object detection on the Neural Engine using Core ML. Third, we'll perform post-processing on the GPU to draw bounding boxes. This demonstrates how to orchestrate different compute units for a complex workflow.

Here's the implementation:

import Metal

import MetalKit

import CoreML

import Vision

import CoreImage

class AIProcessingPipeline {

let device: MTLDevice

let commandQueue: MTLCommandQueue

let preprocessPipeline: MTLComputePipelineState

let postprocessPipeline: MTLComputePipelineState

let detectionModel: VNCoreMLModel

init(modelURL: URL) throws {

// Initialize Metal

guard let device = MTLCreateSystemDefaultDevice() else {

throw NSError(domain: "Metal not supported", code: -1)

}

self.device = device

guard let commandQueue = device.makeCommandQueue() else {

throw NSError(domain: "Failed to create command queue", code: -1)

}

self.commandQueue = commandQueue

// Load and compile preprocessing shader

let preprocessShader = """

#include <metal_stdlib>

using namespace metal;

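// Scale pixel values from [0, 255] to [0, 1]. All four RGBA channels are kept.

// Assumes a float texture holding raw 0-255 values; normalized formats

// such as rgba8Unorm already read back in [0, 1] and need no division.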

kernel void preprocess_image(

texture2d<float, access::read> inputTexture [[texture(0)]],

texture2d<float, access::write> outputTexture [[texture(1)]],

uint2 gid [[thread_position_in_grid]])

{

// Check bounds

if (gid.x >= outputTexture.get_width() ||

gid.y >= outputTexture.get_height()) {

return;

}

// Read pixel

float4 pixel = inputTexture.read(gid);

// Normalize to [0, 1] range

pixel = pixel / 255.0;

// Write normalized pixel

outputTexture.write(pixel, gid);

}

"""

let library = try device.makeLibrary(source: preprocessShader,

options: nil)

guard let preprocessFunc = library.makeFunction(name: "preprocess_image") else {

throw NSError(domain: "Missing preprocess_image function", code: -1)

}

self.preprocessPipeline = try device.makeComputePipelineState(

function: preprocessFunc)

// Load and compile postprocessing shader

let postprocessShader = """

#include <metal_stdlib>

using namespace metal;

// Draw a bounding box on the image

kernel void draw_bounding_box(

texture2d<float, access::read_write> texture [[texture(0)]],

constant float4& box [[buffer(0)]], // x, y, width, height

constant float4& color [[buffer(1)]], // r, g, b, a

uint2 gid [[thread_position_in_grid]])

{

if (gid.x >= texture.get_width() ||

gid.y >= texture.get_height()) {

return;

}

float x = float(gid.x);

float y = float(gid.y);

// Check if pixel is on the box border

bool onBorder = false;

float thickness = 2.0;

if (x >= box.x && x <= box.x + box.z &&

y >= box.y && y <= box.y + box.w) {

if (x <= box.x + thickness ||

x >= box.x + box.z - thickness ||

y <= box.y + thickness ||

y >= box.y + box.w - thickness) {

onBorder = true;

}

}

if (onBorder) {

texture.write(color, gid);

}

}

"""

let postLibrary = try device.makeLibrary(source: postprocessShader,

options: nil)

guard let postprocessFunc = postLibrary.makeFunction(name: "draw_bounding_box") else {

throw NSError(domain: "Missing draw_bounding_box function", code: -1)

}

self.postprocessPipeline = try device.makeComputePipelineState(

function: postprocessFunc)

// Load Core ML model

let mlModel = try MLModel(contentsOf: modelURL)

self.detectionModel = try VNCoreMLModel(for: mlModel)

}

func process(inputTexture: MTLTexture,

completion: @escaping (MTLTexture?) -> Void) {

// Step 1: Preprocess on GPU

guard let preprocessedTexture = createTexture(

width: inputTexture.width,

height: inputTexture.height) else {

completion(nil)

return

}

guard let commandBuffer = commandQueue.makeCommandBuffer(),

let encoder = commandBuffer.makeComputeCommandEncoder() else {

completion(nil)

return

}

encoder.setComputePipelineState(preprocessPipeline)

encoder.setTexture(inputTexture, index: 0)

encoder.setTexture(preprocessedTexture, index: 1)

let threadgroupSize = MTLSize(width: 16, height: 16, depth: 1)

let threadgroups = MTLSize(

width: (inputTexture.width + 15) / 16,

height: (inputTexture.height + 15) / 16,

depth: 1)

encoder.dispatchThreadgroups(threadgroups,

threadsPerThreadgroup: threadgroupSize)

encoder.endEncoding()

commandBuffer.commit()

commandBuffer.waitUntilCompleted()

// Step 2: Run detection with Core ML (it uses the Neural Engine when the model supports it)

guard let ciImage = CIImage(mtlTexture: preprocessedTexture,

options: nil) else {

completion(nil)

return

}

let request = VNCoreMLRequest(model: detectionModel) { request, error in

guard let results = request.results as? [VNRecognizedObjectObservation] else {

completion(nil)

return

}

// Step 3: Postprocess on GPU

self.drawBoundingBoxes(on: preprocessedTexture,

detections: results,

completion: completion)

}

let handler = VNImageRequestHandler(ciImage: ciImage,

options: [:])

do {

try handler.perform([request])

} catch {

print("Detection failed: \(error)")

completion(nil)

}

}

private func drawBoundingBoxes(on texture: MTLTexture,

detections: [VNRecognizedObjectObservation],

completion: @escaping (MTLTexture?) -> Void) {

guard let commandBuffer = commandQueue.makeCommandBuffer() else {

completion(nil)

return

}

// Draw each detection

for detection in detections {

guard let encoder = commandBuffer.makeComputeCommandEncoder() else {

continue

}

encoder.setComputePipelineState(postprocessPipeline)

encoder.setTexture(texture, index: 0)

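// Convert normalized coordinates to pixel coordinates.

// Vision reports boxes with a lower-left origin, so verify the vertical

// orientation against the texture and flip y if boxes appear mirrored.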

let box = detection.boundingBox

var boxData = SIMD4<Float>(

Float(box.origin.x) * Float(texture.width),

Float(box.origin.y) * Float(texture.height),

Float(box.width) * Float(texture.width),

Float(box.height) * Float(texture.height)

)

var color = SIMD4<Float>(1.0, 0.0, 0.0, 1.0) // Red

encoder.setBytes(&boxData,

length: MemoryLayout<SIMD4<Float>>.stride,

index: 0)

encoder.setBytes(&color,

length: MemoryLayout<SIMD4<Float>>.stride,

index: 1)

let threadgroupSize = MTLSize(width: 16, height: 16, depth: 1)

let threadgroups = MTLSize(

width: (texture.width + 15) / 16,

height: (texture.height + 15) / 16,

depth: 1)

encoder.dispatchThreadgroups(threadgroups,

threadsPerThreadgroup: threadgroupSize)

encoder.endEncoding()

}

commandBuffer.commit()

commandBuffer.waitUntilCompleted()

completion(texture)

}

private func createTexture(width: Int, height: Int) -> MTLTexture? {

let descriptor = MTLTextureDescriptor.texture2DDescriptor(

pixelFormat: .rgba32Float,

width: width,

height: height,

mipmapped: false)

descriptor.usage = [.shaderRead, .shaderWrite]

descriptor.storageMode = .shared

return device.makeTexture(descriptor: descriptor)

}

}

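This pipeline demonstrates the power of Apple Silicon's unified architecture. The preprocessing happens on the GPU, the neural network runs on whichever compute units Core ML selects (the Neural Engine when the model supports it), and the postprocessing returns to the GPU. Because everything shares unified memory, the stages can hand data to each other without copying it between separate CPU and GPU memory pools, although Vision may still resize or convert the image to match the model's input format. For simplicity the example blocks on waitUntilCompleted after each GPU stage; a production pipeline would use completion handlers instead.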
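
One detail of the post-processing step deserves a closer look: Vision reports bounding boxes in normalized coordinates with the origin at the lower-left corner of the image, while Metal textures are addressed from the top-left. The sketch below shows the conversion in plain Swift, assuming the image Vision analyzed has the same orientation as the texture being drawn on. NormalizedRect and PixelRect are hypothetical stand-ins so the example runs without Vision or Metal. An image created with CIImage(mtlTexture:) may already be vertically flipped relative to the texture, so check the result on real output before adopting the flip.

```swift
// Stand-in for the CGRect in VNRecognizedObjectObservation.boundingBox:
// normalized to [0, 1], origin at the lower-left corner of the image.
struct NormalizedRect {
    var x: Float, y: Float, width: Float, height: Float
}

// Pixel-space box with a top-left origin, matching Metal texture coordinates.
struct PixelRect: Equatable {
    var x: Float, y: Float, width: Float, height: Float
}

func pixelRect(for box: NormalizedRect,
               textureWidth: Int,
               textureHeight: Int) -> PixelRect {
    let w = Float(textureWidth)
    let h = Float(textureHeight)
    return PixelRect(
        x: box.x * w,
        // Flip vertically: the box's top edge sits at 1 - (y + height).
        y: (1 - box.y - box.height) * h,
        width: box.width * w,
        height: box.height * h)
}

let detection = NormalizedRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let rect = pixelRect(for: detection, textureWidth: 640, textureHeight: 480)
print(rect)  // x: 160, y: 120, width: 320, height: 120
```

The resulting values can be packed into the SIMD4<Float> that the draw_bounding_box kernel expects.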

Part Seven: Performance Optimization Strategies

To get the best performance from Apple Silicon, you need to understand several optimization strategies. These techniques apply whether you're using Metal directly, Metal Performance Shaders, or Core ML.

First, minimize data transfers. Even though Apple Silicon has unified memory, accessing memory still has a cost. Structure your algorithms to maximize data reuse. Keep data in GPU buffers or textures as long as possible rather than reading it back to CPU memory and then sending it back to the GPU.

Second, use appropriate data types. The Neural Engine works best with 16-bit floats or 8-bit integers. The GPU handles 32-bit floats efficiently, but 16-bit floats halve the memory traffic per value. For many AI applications, 16-bit precision is sufficient, and kernels limited by memory bandwidth can run close to twice as fast.

Here's an example of using 16-bit floats in Metal:

let halfPrecisionShader = """

#include <metal_stdlib>

using namespace metal;

// Matrix multiplication using half precision

// half is a 16-bit floating point type

kernel void matmul_half(device const half* A [[buffer(0)]],

device const half* B [[buffer(1)]],

device half* C [[buffer(2)]],

constant uint& M [[buffer(3)]],

constant uint& N [[buffer(4)]],

constant uint& K [[buffer(5)]],

uint2 gid [[thread_position_in_grid]])

{

uint row = gid.y;

uint col = gid.x;

if (row >= M || col >= N) {

return;

}

// Use float for accumulation to maintain precision

float sum = 0.0;

for (uint k = 0; k < K; k++) {

// Explicit conversion from half to float before multiplying

sum += float(A[row * K + k]) * float(B[k * N + col]);

}

// Convert back to half for storage

C[row * N + col] = half(sum);

}

"""
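
The kernel above stores half values but accumulates in float. The following CPU-side sketch illustrates why, using Swift's Float16 type (available on Apple Silicon, not on Intel Macs). It is a toy demonstration of the rounding behavior, not a benchmark.

```swift
// Sum 4096 ones, once with a Float16 accumulator and once with a Float accumulator.
let values = [Float16](repeating: 1.0, count: 4096)

var halfSum: Float16 = 0
var floatSum: Float = 0
for value in values {
    halfSum += value          // stalls once the gap between representable values exceeds 1
    floatSum += Float(value)  // widen each element before accumulating
}

// Narrow once, at the end, as the kernel does when it writes to C.
let stored = Float16(floatSum)

print(halfSum)  // 2048.0
print(stored)   // 4096.0
```

Float16 has an 11-bit significand, so above 2048 it can only represent even integers, and adding 1 rounds back to the same value. Accumulating in 32 bits avoids that stall while keeping the storage and bandwidth benefits of half precision.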

Third, batch your operations. Instead of processing one image at a time, process multiple images together. This increases parallelism and improves GPU utilization. Most neural networks can process batches of inputs with minimal overhead.
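
As a rough illustration of batching, the helper below splits a list of inputs into fixed-size groups. The function name batches(of:size:) is hypothetical, and the integers stand in for decoded images. With Core ML, each group could then be wrapped in an MLArrayBatchProvider and passed to predictions(fromBatch:) rather than calling prediction(from:) once per image.

```swift
// Hypothetical helper: split inputs into fixed-size batches; the last batch may be shorter.
func batches<T>(of items: [T], size: Int) -> [[T]] {
    precondition(size > 0, "batch size must be positive")
    return stride(from: 0, to: items.count, by: size).map { start in
        Array(items[start..<min(start + size, items.count)])
    }
}

let imageIDs = Array(0..<10)  // placeholders for decoded images
let grouped = batches(of: imageIDs, size: 4)
print(grouped)  // [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The best batch size depends on the model and the memory available, so it is worth measuring a few values rather than assuming that larger is always faster.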

Fourth, use asynchronous execution. Don't wait for GPU operations to complete if you don't need the results immediately. You can encode multiple command buffers and submit them all at once, allowing the GPU to work on them in parallel with CPU operations.

Here's an example of asynchronous execution:

class AsyncProcessor {

let device: MTLDevice

let commandQueue: MTLCommandQueue

let pipelineState: MTLComputePipelineState

// Initialization code omitted for brevity

func processAsync(data: [Float], completion: @escaping ([Float]) -> Void) {

// Create buffer

let buffer = device.makeBuffer(bytes: data,

length: data.count * MemoryLayout<Float>.stride,

options: .storageModeShared)!

// Create command buffer

let commandBuffer = commandQueue.makeCommandBuffer()!

// Encode work

let encoder = commandBuffer.makeComputeCommandEncoder()!

encoder.setComputePipelineState(pipelineState)

encoder.setBuffer(buffer, offset: 0, index: 0)

// ... encode more commands ...

encoder.endEncoding()

// Add completion handler

commandBuffer.addCompletedHandler { _ in

let pointer = buffer.contents().bindMemory(to: Float.self,

capacity: data.count)

let result = Array(UnsafeBufferPointer(start: pointer,

count: data.count))

completion(result)

}

// Commit without waiting

commandBuffer.commit()

// CPU can continue doing other work here

// The completion handler will be called when GPU finishes

}

}

Fifth, profile your code. Xcode includes excellent profiling tools for Metal and Core ML. The Metal System Trace instrument shows you exactly how your GPU is being utilized, where bottlenecks are, and how much time is spent in each kernel. The Core ML instrument shows you which compute unit is being used for each layer of your neural network.

To profile your Metal code, you would run your app in Xcode, open Instruments, and select the Metal System Trace template. This will show you a timeline of all GPU activity, including when command buffers are submitted, when they execute, and how long each kernel takes.

Part Eight: Advanced Topics and Best Practices

As you develop more sophisticated AI applications on Apple Silicon, you'll encounter advanced scenarios that require deeper understanding.

One important topic is memory management. While unified memory simplifies many things, you still need to be careful about memory usage. Large neural networks can consume gigabytes of memory. Use memory-mapped files for large datasets that don't fit in RAM. Core ML supports memory-mapped model weights, which means the model data stays on disk and is paged in as needed.
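
Memory mapping a dataset can be sketched with Foundation alone. The example below writes a small stand-in file of Float values, maps it with the .alwaysMapped reading option, and reads one element without loading the whole file into an array. The file name and the sample(at:) helper are hypothetical, and a real dataset would of course be far larger.

```swift
import Foundation

// Write a small stand-in "dataset" of Float values to a temporary file.
let samples: [Float] = (0..<1024).map { Float($0) * 0.5 }
let url = FileManager.default.temporaryDirectory
    .appendingPathComponent("samples-\(UUID().uuidString).bin")
let bytes = samples.withUnsafeBufferPointer { Data(buffer: $0) }
try bytes.write(to: url)

// Map the file instead of reading it: pages are loaded from disk on first access.
let mapped = try Data(contentsOf: url, options: .alwaysMapped)
let count = mapped.count / MemoryLayout<Float>.stride

// Read a single element directly from the mapped bytes.
func sample(at index: Int) -> Float {
    mapped.withUnsafeBytes { raw in
        raw.load(fromByteOffset: index * MemoryLayout<Float>.stride, as: Float.self)
    }
}

print(count, sample(at: 10))  // 1024 5.0

// Clean up the temporary file.
try FileManager.default.removeItem(at: url)
```

Because the operating system pages data in on demand and can evict it under memory pressure, the resident footprint stays close to the portion of the file that is actually touched.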

Another advanced topic is custom Core ML layers. If your neural network includes operations that Core ML doesn't support natively, you can implement custom layers using Metal. This allows you to extend Core ML while still benefiting from its optimization and Neural Engine support for the standard layers.

Here's a skeleton for a custom Core ML layer:

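import CoreML

import Metal

// Custom layer that implements a specialized operation.

// Core ML looks the class up by its Objective-C name, so it is exposed

// with @objc and derives from NSObject.

@available(macOS 10.15, *)

@objc(CustomMetalLayer)

class CustomMetalLayer: NSObject, MLCustomLayer {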

let device: MTLDevice

let pipelineState: MTLComputePipelineState

required init(parameters: [String: Any]) throws {

// Initialize Metal resources

guard let device = MTLCreateSystemDefaultDevice() else {

throw NSError(domain: "Metal not supported", code: -1)

}

self.device = device

// Compile custom kernel

let shaderCode = """

#include <metal_stdlib>

using namespace metal;

kernel void custom_operation(device const float* input [[buffer(0)]],

device float* output [[buffer(1)]],

uint index [[thread_position_in_grid]])

{

// Implement your custom operation here

output[index] = input[index] * 2.0; // Example: simple scaling

}

"""

let library = try device.makeLibrary(source: shaderCode, options: nil)

let function = library.makeFunction(name: "custom_operation")!

self.pipelineState = try device.makeComputePipelineState(function: function)

}

func setWeightData(_ weights: [Data]) throws {

// Load any weights needed by this layer

}

func outputShapes(forInputShapes inputShapes: [[NSNumber]]) throws -> [[NSNumber]] {

// Return the output shape given input shapes

// For this example, output shape matches input shape

return inputShapes

}

func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {

// This is where you execute your Metal kernel

guard let input = inputs.first,

let output = outputs.first else {

throw NSError(domain: "Invalid inputs/outputs", code: -1)

}

// Get pointers to the data (assumes both arrays hold 32-bit floats)

let inputPointer = input.dataPointer.assumingMemoryBound(to: Float.self)

let outputPointer = output.dataPointer.assumingMemoryBound(to: Float.self)

// Create Metal buffers

let count = input.count

let size = count * MemoryLayout<Float>.stride

let inputBuffer = device.makeBuffer(bytes: inputPointer,

length: size,

options: .storageModeShared)!

let outputBuffer = device.makeBuffer(bytes: outputPointer,

length: size,

options: .storageModeShared)!

// Execute kernel

let commandQueue = device.makeCommandQueue()!

let commandBuffer = commandQueue.makeCommandBuffer()!

let encoder = commandBuffer.makeComputeCommandEncoder()!

encoder.setComputePipelineState(pipelineState)

encoder.setBuffer(inputBuffer, offset: 0, index: 0)

encoder.setBuffer(outputBuffer, offset: 0, index: 1)

let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)

// Clamp the threadgroup width so it never exceeds the grid size

let maxThreads = max(1, min(pipelineState.maxTotalThreadsPerThreadgroup, count))

let threadsPerThreadgroup = MTLSize(width: maxThreads, height: 1, depth: 1)

encoder.dispatchThreads(threadsPerGrid,

threadsPerThreadgroup: threadsPerThreadgroup)

encoder.endEncoding()

commandBuffer.commit()

commandBuffer.waitUntilCompleted()

// Copy results back

memcpy(outputPointer,

outputBuffer.contents(),

size)

}

}

When building production AI applications, consider these best practices. First, warm up your models. The first inference is often slower because of compilation and initialization overhead. Run a dummy inference during app startup to warm up the model. Second, reuse resources. Don't create new Metal devices, command queues, or pipeline states for every operation. Create them once and reuse them. Third, handle errors gracefully. GPU operations can fail for various reasons, so always check return values and handle errors appropriately.

Fourth, test on real devices. The simulator runs on the host machine's hardware and does not reproduce the target device's GPU or Neural Engine, so performance characteristics are completely different. Always test your AI code on actual Apple Silicon hardware. Fifth, consider power consumption. The Neural Engine is much more power-efficient than the GPU for neural network operations. For battery-powered devices, using the Neural Engine can significantly extend battery life.
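
The warm-up advice in the first practice can be combined with simple timing. The sketch below is a hypothetical measure helper in plain Swift that runs a closure a few times untimed and then records the duration of each later run. The integer sum is only a stand-in for a model prediction.

```swift
import Dispatch

// Hypothetical helper: run `work` untimed a few times, then time each later run.
func measure(warmupRuns: Int, timedRuns: Int, _ work: () -> Void) -> [Double] {
    for _ in 0..<warmupRuns { work() }  // absorb one-time setup cost
    return (0..<timedRuns).map { _ -> Double in
        let start = DispatchTime.now().uptimeNanoseconds
        work()
        let end = DispatchTime.now().uptimeNanoseconds
        return Double(end - start) / 1_000_000  // milliseconds
    }
}

// Stand-in for a model prediction; replace with a real inference call.
var checksum = 0
let timings = measure(warmupRuns: 1, timedRuns: 5) {
    checksum = (0..<100_000).reduce(0, &+)
}
let median = timings.sorted()[timings.count / 2]
print("median: \(median) ms over \(timings.count) runs")
```

Reporting the median rather than the mean keeps an occasional slow run from distorting the result. For GPU and Neural Engine work, Instruments remains the more precise tool.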

Conclusion: The Future of AI on Apple Silicon

Apple Silicon represents a new paradigm for AI computing. The unified memory architecture, the powerful GPU, and the specialized Neural Engine work together to provide exceptional performance for AI workloads. By understanding how to leverage these resources through Metal, Metal Performance Shaders, and Core ML, you can build AI applications that are fast, efficient, and power-conscious.

The key takeaways are these. Use Metal for custom compute operations that need maximum performance and control. Use Metal Performance Shaders for common operations like matrix multiplication and convolutions, as they are highly optimized. Use Core ML for neural network inference, as it can schedule work on the Neural Engine automatically and is usually the fastest route for standard models. Structure your applications to minimize data movement and maximize parallelism. Profile your code to identify bottlenecks and optimize the critical paths.

As Apple continues to evolve its silicon and software frameworks, the capabilities will only grow. Future versions of Apple Silicon will have more GPU cores, more powerful Neural Engines, and new features that enable even more sophisticated AI applications. By mastering the fundamentals covered in this tutorial, you'll be well-positioned to take advantage of these advances and build the next generation of AI applications on Apple platforms.

Tuesday, June 09, 2026
