Tuesday, July 29, 2025

Running AI/Generative AI in the Browser with WebAssembly - Part 1


INTRODUCTION – WHY AI IN THE BROWSER?


The fusion of artificial intelligence and the browser is no longer a distant dream or an academic toy. Thanks to the increasing computational power of client devices and the evolution of technologies like WebAssembly (commonly abbreviated WASM), it is now feasible to run entire neural networks or even lightweight Generative AI models directly in the browser. This shift has profound implications: no backend latency, no data leaving the device, stronger privacy by default, and responsive local inference for tasks like image generation, text prediction, and interactive UX enhancements.


But why would anyone want to do this? The motivation stems from a combination of privacy needs, deployment simplicity, and interactivity. If the entire model runs locally in the user’s browser, then there is no need to maintain costly inference servers. Additionally, end users benefit from responsiveness, offline support, and data that never leaves their device.



WEBASSEMBLY – THE MISSING LINK BETWEEN NATIVE AI CODE AND THE BROWSER


WebAssembly is a low-level, binary instruction format designed as a safe and portable compilation target for high-performance applications on the web. In plain terms, it allows code written in languages like C, C++, or Rust to run safely and efficiently inside the browser, alongside traditional JavaScript.


Unlike JavaScript, which is dynamically typed and interpreted or JIT-compiled at runtime, WebAssembly is statically typed, compiled ahead-of-time, and optimized for deterministic performance. This makes it ideally suited for AI workloads, especially matrix operations, convolutions, and vectorized computations, which are central to neural network inference.



CONSTRAINTS OF JAVASCRIPT FOR AI AND THE RISE OF WEBASSEMBLY


While JavaScript has seen the rise of libraries like TensorFlow.js, it suffers from performance bottlenecks due to dynamic typing, garbage collection, and limited threading capabilities. Moreover, reusing native libraries for high-performance numerical computation, such as BLAS, LAPACK, or hand-tuned SIMD kernels, is not possible using JavaScript alone.


WebAssembly changes this landscape entirely. By compiling C++-based neural network runtimes or custom AI engines to WASM, one can achieve near-native performance in the browser. It also opens the door to using sophisticated memory management, fine-tuned compute kernels, and reusable cross-platform AI components directly in web apps.



A TECHNICAL OVERVIEW OF WEBASSEMBLY MODULES


A WebAssembly module is a binary file (.wasm) that can be loaded and instantiated by JavaScript running in a browser. It exposes a set of functions and memory areas that can be called and accessed using JavaScript or another WASM host.


Here is a minimal example of compiling a C function to WebAssembly and calling it from JavaScript:


First, we define a simple function in C:


#include <stdint.h>

// A trivial exported function: add two 32-bit integers.
int32_t add(int32_t a, int32_t b) {
    return a + b;
}



This C code is compiled to a standalone .wasm file with Emscripten; since there is no main() function and the module will later be loaded with the plain WebAssembly JavaScript API (without Emscripten’s glue script), the --no-entry flag is passed:


emcc add.c -Os --no-entry -s EXPORTED_FUNCTIONS="['_add']" -o add.wasm


Then, in browser JavaScript (for example inside an ES module, where top-level await is available), you can load and call it like this:


// Fetch and compile the standalone module; no import object is needed
// because this module imports nothing from the host.
const wasmModule = await WebAssembly.instantiateStreaming(fetch("add.wasm"));

// Call the exported C function directly from JavaScript.
const result = wasmModule.instance.exports.add(10, 20);
console.log("The result is:", result);


This kind of interaction can be scaled to entire AI inference pipelines.
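
To move beyond scalar arguments, a real pipeline has to pass whole tensors through the module’s linear memory. The sketch below assumes a hypothetical module math.wasm that exports its memory together with two functions, alloc(bytes) and sum(ptr, length); those names are illustrative, not part of any particular library:

// Minimal sketch: copy a Float32Array into WASM linear memory and let the module read it.
// math.wasm, alloc() and sum() are hypothetical exports used only for illustration.
const { instance } = await WebAssembly.instantiateStreaming(fetch("math.wasm"));
const { memory, alloc, sum } = instance.exports;

const values = new Float32Array([0.5, 1.5, 2.0, 4.0]);

// Reserve space inside the module's memory and copy the data into it.
const ptr = alloc(values.length * Float32Array.BYTES_PER_ELEMENT);
new Float32Array(memory.buffer, ptr, values.length).set(values);

// The WASM side reads the same bytes and returns its result to JavaScript.
console.log("Sum computed in WASM:", sum(ptr, values.length));

The same pattern scales up: model weights and activations live inside linear memory, and JavaScript only shuttles input and output buffers across the boundary.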




HOW TO COMPILE AI/ML MODELS TO WEBASSEMBLY


To integrate AI models into a browser using WebAssembly, one typically needs to compile an inference engine or runtime into WASM. Several well-established toolchains are available:

1. TensorFlow Lite for WebAssembly – TensorFlow Lite supports compiling its inference engine into WASM using Emscripten. The model is first converted to the .tflite format and then loaded via a WASM runtime.

2. ONNX Runtime Web – ONNX models can be executed using ONNX Runtime compiled to WebAssembly. This allows running PyTorch or TensorFlow exported models directly in the browser with WASM acceleration.

3. Custom Rust or C++ Inference Code – For custom or minimal models (e.g. GPT-2 Tiny or RNNs), one can hand-code the inference logic and compile it to WASM for maximum control and minimal overhead.


Let’s walk through one concrete example.


USING TENSORFLOW.JS AND ONNX RUNTIME WEB FOR NEURAL NETWORKS


TensorFlow.js ships several backends, including a plain JavaScript (CPU) backend, GPU backends, and a WebAssembly backend. To enable the WASM backend, the developer imports the backend package and selects it explicitly:


import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-wasm';

async function runModel() {
    // Select the WebAssembly backend and wait until it is initialized.
    await tf.setBackend('wasm');
    await tf.ready();

    // Load a converted Layers model and run a single prediction.
    const model = await tf.loadLayersModel('model.json');
    const input = tf.tensor([[1, 2, 3, 4]]);
    const prediction = model.predict(input);

    prediction.print();
}

runModel();


In this case, model.json refers to the TensorFlow.js Layers model format, usually produced by converting a trained Keras or TensorFlow model with the tensorflowjs_converter tool. The performance of the WASM backend in this context is surprisingly good and makes it viable to deploy small to medium-sized models without relying on WebGL or WebGPU.
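
One practical detail when enabling the WASM backend: the backend’s own .wasm binaries must be reachable from the page. The snippet below is a small sketch using the setWasmPaths helper exported by @tensorflow/tfjs-backend-wasm; the path shown is just an example location on your own server or a CDN:

import * as tf from '@tensorflow/tfjs';
import { setWasmPaths } from '@tensorflow/tfjs-backend-wasm';

// Tell the backend where its .wasm binaries are hosted (example path).
setWasmPaths('/assets/tfjs-wasm/');

await tf.setBackend('wasm');
await tf.ready();
console.log('Active backend:', tf.getBackend()); // should print "wasm"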


BRINGING GENERATIVE AI TO THE BROWSER – STRATEGIES AND CHALLENGES


Integrating full-blown Generative AI models like GPT-2, Stable Diffusion, or LLaMA into the browser is not trivial. These models are often hundreds of megabytes in size and rely heavily on vectorized linear algebra. However, strategies to make this feasible include:

  • Quantizing the model weights to 8-bit or even 4-bit integer formats (sketched in the example after this list).
  • Using distillation or pruning to reduce model size.
  • Running inference in a streaming fashion with progressive generation.
  • Employing WASM SIMD instructions for optimized matrix multiplications.
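
To make the first point concrete, here is a minimal sketch of symmetric 8-bit quantization of a weight vector in plain JavaScript. It only illustrates the arithmetic (scale, round, clamp, dequantize); real toolchains apply the same idea per tensor or per channel:

// Symmetric 8-bit quantization: q = round(w / scale), with scale = max|w| / 127.
function quantize8(weights) {
    const scale = Math.max(...weights.map(Math.abs)) / 127;
    const q = Int8Array.from(weights, w => Math.max(-127, Math.min(127, Math.round(w / scale))));
    return { q, scale };
}

// Dequantization recovers an approximation of the original weights.
function dequantize8(q, scale) {
    return Float32Array.from(q, v => v * scale);
}

const { q, scale } = quantize8([0.12, -0.98, 0.45, 0.03]);
console.log(q, scale, dequantize8(q, scale));

Storing int8 weights cuts the download to roughly a quarter of the float32 size, which is often what makes browser delivery possible at all.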


Projects such as llama.cpp have made enormous progress in making smaller language models practical to compile to WebAssembly. Small models, such as gpt2-small, can additionally be pruned and quantized to run in modern browsers, given sufficient memory.


A DETAILED WALKTHROUGH – LOADING AND RUNNING A TINY GENERATIVE MODEL


Let us take a distilled and quantized GPT-2 variant, export it to ONNX format, and run it in the browser using ONNX Runtime Web.


First, the model is exported and quantized on the server:


python export_gpt2_to_onnx.py --quantize --output=gpt2_quantized.onnx


In the browser, the loading code looks as follows:


import * as ort from 'onnxruntime-web';

async function runGeneration(promptTokens) {
    // Create an inference session (onnxruntime-web uses its WASM backend by default).
    const session = await ort.InferenceSession.create('gpt2_quantized.onnx');

    // int64 tensors require BigInt values, so the prompt token ids are converted first.
    const input = new ort.Tensor('int64',
        BigInt64Array.from(promptTokens, BigInt),
        [1, promptTokens.length]);

    // Input and output names must match those defined in the exported ONNX graph.
    const feeds = { input_ids: input };
    const results = await session.run(feeds);
    const outputTokens = results['output'].data;

    // decodeTokens() is an application-level helper that maps token ids back to text.
    const decodedText = decodeTokens(outputTokens);
    console.log("Generated Text:", decodedText);
}
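
Calling the function then only requires a tokenized prompt. The token ids below are placeholders; in practice they come from the GPT-2 tokenizer that matches the exported model:

// Placeholder token ids standing in for an encoded prompt.
runGeneration([464, 2068, 7586]).catch(err => console.error("Inference failed:", err));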



PERFORMANCE CONSIDERATIONS AND WASM + SIMD + THREADS


WebAssembly now supports SIMD (Single Instruction, Multiple Data) extensions, which significantly improve performance for operations on vectors and matrices. Modern JavaScript engines, including V8 and SpiderMonkey, already support the 128-bit WASM SIMD proposal, so code compiled with SIMD enabled can process, for example, four float32 values per instruction instead of one.


To enable SIMD when compiling with Emscripten (again as a library-style module without a main() function):


emcc model.c -msimd128 --no-entry -s "EXPORTED_FUNCTIONS=['_infer']" -o model.wasm



Threading is also possible using SharedArrayBuffer and WASM threads, though browsers only expose SharedArrayBuffer on cross-origin isolated pages, which requires serving the Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp HTTP headers. With multi-threaded WASM, parallel inference and data pre-processing become possible, further boosting interactivity.
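
Here is a small sketch of how a page can check whether multi-threaded WASM is actually usable before choosing a thread count; crossOriginIsolated reflects whether the headers mentioned above were served correctly:

// Threads need SharedArrayBuffer, which browsers only expose on cross-origin isolated pages.
const threadsAvailable =
    typeof SharedArrayBuffer !== 'undefined' && self.crossOriginIsolated === true;

// Fall back to a single thread when isolation (or the API) is missing.
const numThreads = threadsAvailable
    ? Math.min(4, navigator.hardwareConcurrency || 1)
    : 1;

console.log(`WASM threads: ${threadsAvailable ? 'enabled' : 'disabled'}, using ${numThreads} thread(s)`);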



FUTURE DIRECTIONS – ON-DEVICE LLMS AND BEYOND


We are only scratching the surface of what’s possible. The next leap includes:

  • Running 7B-parameter LLMs by offloading computation to the GPU via WebGPU.
  • Adapting model weights dynamically via retrieval-augmented generation, with embeddings stored in IndexedDB.
  • Hosting personal assistants that never leave your browser, fine-tuned on your local data.


Projects like WebLLM from MLC and Hugging Face’s Transformers.js are already pointing the way forward.
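
As a small taste of that direction, the sketch below uses Transformers.js’s pipeline API to generate text entirely client-side. The package name and model id follow the library’s published examples, but treat the exact options as something to verify against the current release:

import { pipeline } from '@xenova/transformers';

// Downloads and caches a small GPT-2 model in the browser, then generates text locally.
const generator = await pipeline('text-generation', 'Xenova/gpt2');
const output = await generator('The browser is becoming', { max_new_tokens: 20 });

console.log(output[0].generated_text);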



FINAL THOUGHTS


Running AI and Generative AI in the browser is no longer science fiction. With WebAssembly, SIMD, ONNX, and careful engineering, developers can build responsive, secure, and private AI experiences that run entirely client-side. While challenges around model size, memory use, and compute power remain, the trend is unmistakable: AI is moving to the edge, and the browser is now a serious AI runtime.
