Sunday, March 08, 2026

TWO HEADS ARE BETTER THAN ONE: A COMPLETE GUIDE TO SETTING UP, CONNECTING, AND RUNNING LARGE LANGUAGE MODELS ON TWO NVIDIA DGX SPARK WORKSTATIONS


     (c) Nvidia


Chapter 0 - BEFORE WE BEGIN: WHAT IS THIS ALL ABOUT?


Imagine having a personal AI supercomputer sitting on your desk. Not a cloud

instance you rent by the hour, not a shared cluster you have to queue for, but

a machine that belongs to you, runs entirely offline if you want, and can

execute large language models that would make most laptops weep. Now imagine

having two of them, connected at high speed, working in concert. That is

exactly what this tutorial is about.


The NVIDIA DGX Spark is not a gaming PC with a fancy GPU strapped to it. It

is a purpose-built AI workstation that packs a genuinely remarkable amount of

compute into a compact desktop chassis. When you connect two of them with

NVIDIA's ConnectX-7 networking adapter, you create a small but serious

two-node AI cluster capable of running models that exceed what either machine

could handle alone.


This tutorial assumes you are reasonably comfortable with Linux command-line

basics - you know what a terminal is, you can type commands, and you are not

afraid of a configuration file. Beyond that, no deep expertise is required.

Every concept will be explained from the ground up, including the "why" behind

each decision, not just the "how." By the time you finish reading, you will

understand what you are doing and why it works, not just which commands to

type.


We will cover five different inference engines - Ollama, LM Studio, vLLM,

SGLang, and TensorRT-LLM - because different tools excel at different tasks,

and a well-equipped AI practitioner knows when to reach for which tool. We

will also write actual Python code that communicates with these engines, both

locally and across the network between your two machines.


Let us begin.


Chapter 1 - MEET THE MACHINE: THE NVIDIA DGX SPARK DEEP DIVE



1.1  The Big Picture


The DGX Spark is built around a single chip that represents one of the most

significant architectural leaps in AI hardware in recent years: the NVIDIA GB10

Grace Blackwell Superchip. To understand why this chip matters, we need to

briefly discuss what has historically been the biggest bottleneck in running

large language models on a single machine.


Traditionally, a computer has a CPU (the general-purpose processor that runs

your operating system and applications) and a GPU (the massively parallel

processor that handles AI computations). These two components sit on separate

chips, connected by a PCIe bus. PCIe is fast by everyday standards, but it is

glacially slow compared to the internal buses within each chip. When a large

language model runs, data must constantly shuttle back and forth between CPU

memory (RAM) and GPU memory (VRAM). This shuttle service is expensive in both

time and energy.


The GB10 solves this problem by eliminating the separation entirely. The Grace

CPU and the Blackwell GPU are connected via NVLink-C2C (Chip-to-Chip), a

proprietary interconnect that delivers 900 gigabytes per second of bidirectional

bandwidth. For context, a typical PCIe 5.0 x16 connection delivers roughly

64 GB/s. The NVLink-C2C connection is therefore about 14 times faster. This

is not a minor improvement; it is a qualitative change in what becomes possible.


1.2  The Memory Architecture: Why 128GB Unified Memory Is a Game Changer


Because the CPU and GPU are so tightly coupled, NVIDIA designed the DGX Spark

with a single unified memory pool of 128 gigabytes of LPDDR5X memory. This

memory is simultaneously accessible by both the CPU cores and the Blackwell GPU

without any copying. When you load a 70-billion-parameter language model, it

lives in this shared pool and the GPU can access every byte of it at full speed.


To appreciate why this matters, consider a conventional workstation with a

high-end GPU that has 24GB of VRAM. If you want to run a model that requires

48GB, you simply cannot do it on that GPU alone - the model does not fit. You

would need to either quantize the model aggressively (reducing its quality) or

use multiple GPUs. On the DGX Spark, a 48GB model fits comfortably in the

128GB unified pool, and the GPU can access it as if it were native VRAM.


The practical consequence is that the DGX Spark can run models in the 70B
parameter range at 8-bit precision (at 16 bits, a 70B model needs roughly
140 GB for the weights alone, slightly more than the pool holds), or models
approaching 200B parameters in 4-bit quantized form. This is extraordinary
for a single desktop machine.
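A quick back-of-the-envelope check makes these limits concrete. A common rule
of thumb is parameters times bytes per parameter, plus headroom for the KV
cache and runtime buffers; the sketch below uses an assumed 20% headroom
factor (real overhead varies with context length) to show why precision
matters so much on a 128 GB machine:

```shell
# Rough model-memory estimate: params (in billions) x bytes per parameter,
# plus ~20% headroom for KV cache and runtime overhead (assumed factor).
estimate_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b * 1.2 }'
}

estimate_gb 70 2     # 70B at FP16 (2 bytes/param)   -> 168: does not fit
estimate_gb 70 1     # 70B at FP8  (1 byte/param)    -> 84:  fits
estimate_gb 200 0.5  # 200B at FP4 (0.5 bytes/param) -> 120: fits, barely
```

The same arithmetic is worth running before downloading any model: if the
estimate is near or above 128 GB, plan on a lower-precision quantization.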


1.3  The Compute Specifications


Let us look at the full hardware specification of the DGX Spark:


  GPU:           NVIDIA Blackwell GPU (part of GB10 Superchip)

  AI Performance: Up to 1 PFLOPS (petaFLOP per second) at FP4 precision

  CPU:           20-core NVIDIA Grace CPU (10 Arm Cortex-X925 + 10 Cortex-A725)

  Memory:        128 GB LPDDR5X unified memory (shared CPU + GPU)

  Storage:       4 TB NVMe SSD

  Networking:    NVIDIA ConnectX-7 (100 Gigabit Ethernet)

  USB:           Multiple USB 3.2 and USB-C ports

  Display:       DisplayPort output

  Power:         170W TDP (Thermal Design Power)

  OS:            Ubuntu 24.04 LTS (pre-installed)


One teraFLOP is one trillion floating-point operations per second. One

petaFLOP is one thousand teraFLOPs, or one quadrillion operations per second.

At FP4 precision (4-bit floating point, used for inference), the DGX Spark

delivers this performance in a machine that consumes only 170 watts - less

than many gaming GPUs alone.


The 20-core Grace CPU combines ten Arm Cortex-X925 performance cores with
ten Cortex-A725 efficiency cores. Like the Arm-based chips that power cloud
data centers at companies such as Amazon (Graviton) and Ampere, it is a
server-grade design optimized for the kind of memory-intensive workloads
that AI inference demands.


1.4  The Software Foundation


The DGX Spark ships with Ubuntu 24.04 LTS pre-installed. This is important

because Ubuntu 24.04 is a Long-Term Support release, meaning it will receive

security updates and support until 2029. NVIDIA has configured the system with

all necessary GPU drivers, CUDA libraries, and NVIDIA Container Toolkit

pre-installed. You do not need to hunt for drivers or fight with kernel

modules. The machine is ready to run AI workloads out of the box.


The CUDA version on the DGX Spark is 12.x, which is required by all modern

inference frameworks. The system also includes cuDNN (CUDA Deep Neural Network

library), NCCL (NVIDIA Collective Communications Library, essential for

multi-node communication), and the NVIDIA Container Runtime, which allows

Docker containers to access the GPU directly.
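You can see this stack for yourself with a few version checks after first
boot. This is a sketch; the exact package names, and whether the full nvcc
toolkit (rather than just the runtime) is present, vary between DGX OS
images:

```shell
# Driver version as the GPU sees it
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit version, if the toolkit itself is installed
nvcc --version | grep release

# NCCL and cuDNN library packages (package names are assumptions)
dpkg -l | grep -i -E 'nccl|cudnn'

# Confirm Docker can reach the GPU via the NVIDIA container runtime
docker info 2>/dev/null | grep -i runtime
```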


Chapter 2 - THE NERVOUS SYSTEM: CONNECTX-7 AND HIGH-SPEED NETWORKING



2.1  What Is ConnectX-7?


The NVIDIA ConnectX-7 is a network adapter, but calling it "just a network

adapter" is like calling a Formula 1 car "just a car." ConnectX-7 is a

smart network interface card (SmartNIC) that supports both InfiniBand and

Ethernet protocols at speeds up to 400 Gb/s in its highest configurations.

In the DGX Spark, it operates at 100 Gigabit Ethernet (100GbE).


What makes ConnectX-7 special for AI workloads is its support for RDMA -

Remote Direct Memory Access. RDMA allows one machine to read from or write to

the memory of another machine directly, without involving the CPU of the

remote machine. In traditional networking, when machine A sends data to

machine B, machine B's CPU must be interrupted, the data must be copied from

the network buffer into application memory, and then the application can use

it. With RDMA, the network card moves the data directly from machine A's
memory into machine B's memory, without interrupting machine B's CPU at all.


For distributed AI inference, this is enormously valuable. When two DGX Spark

units are running a model together and need to exchange intermediate results

(called activations) between layers, RDMA allows this exchange to happen at

near-memory speeds rather than ordinary network-stack speeds. Per-message
latency drops from tens of microseconds to a few microseconds, and the CPU
is free to do other work.


2.2  RoCE: RDMA Over Converged Ethernet


InfiniBand is the traditional protocol for RDMA in high-performance computing,

but it requires specialized InfiniBand switches and cables. The DGX Spark uses

a technology called RoCE (RDMA over Converged Ethernet, pronounced "rocky"),

which brings RDMA capabilities to standard Ethernet infrastructure. This means

you can connect two DGX Spark units with a standard 100GbE cable and still

get RDMA performance.


RoCE version 2 (RoCEv2) is the relevant standard here. It encapsulates RDMA

packets inside standard UDP/IP packets, which means they can be routed across

standard Ethernet networks. For a direct connection between two machines, this

is straightforward to configure.


2.3  The Cable You Need


To connect two DGX Spark units directly, you need one of the following:


A DAC (Direct Attach Copper) cable is the simplest option for short distances

up to about 5 meters. It is a passive cable with QSFP28 connectors on each

end that plugs directly into the ConnectX-7 port. DAC cables are inexpensive

and reliable for desk-to-desk connections.


An active optical cable (AOC) or a combination of QSFP28 transceivers and

fiber optic cable is appropriate for longer distances, up to hundreds of

meters. This is more expensive but necessary if your two machines are in

different rooms or on different floors.


For most users setting up two DGX Spark units in the same office or lab, a

100GbE DAC cable of 1-3 meters is the right choice. Make sure it is rated for

QSFP28 (100G), not the older QSFP+ (40G) standard.


Chapter 3 - PHYSICAL SETUP: CABLES, POWER, AND FIRST BOOT


3.1  Unboxing and Placement


When your DGX Spark units arrive, give them time to reach room temperature

before powering them on, especially if they were shipped in cold weather.

Condensation inside electronics is not your friend. An hour at room temperature

is sufficient.


Place the units on a stable, flat surface with adequate airflow. The DGX Spark

has intake vents on the sides and exhaust at the rear. Leave at least 10 cm

(4 inches) of clearance on all sides. Do not stack them directly on top of

each other without a spacer, as the bottom unit's exhaust will feed hot air

into the top unit's intake. Side-by-side placement is ideal.


We will call the two machines "Node A" and "Node B" throughout this tutorial.

You can label them with a piece of tape if that helps you keep track.


3.2  Power Connections


Each DGX Spark uses a standard IEC C13 power connector (the same type used by

most desktop computers and monitors). Connect each unit to a power outlet or,

preferably, to a UPS (Uninterruptible Power Supply). A UPS protects against

sudden power loss, which can corrupt filesystems and interrupt long-running

AI jobs. For two machines drawing up to 170W each, a 1000VA UPS is more than

sufficient.


3.3  The ConnectX-7 Network Connection


Locate the ConnectX-7 port on the rear of each DGX Spark. It is a QSFP28

port, which looks like a slightly larger version of a standard SFP+ port.

Connect one end of your 100GbE DAC cable to Node A and the other end to Node B.

The cable is keyed and will only insert in the correct orientation. You should

feel a positive click when it is fully seated.


In addition to the direct ConnectX-7 connection between the two nodes, you

will also want to connect each machine to your regular office or home network

via the standard 1GbE or 10GbE Ethernet port. This management network is used

for internet access, software updates, and SSH access from your laptop. The

ConnectX-7 link is dedicated to high-speed AI traffic between the two nodes.


3.4  Display, Keyboard, and Mouse for Initial Setup


For the very first boot, you need a monitor, keyboard, and mouse connected to

at least one of the machines (Node A is a good choice to start with). The DGX

Spark has a DisplayPort output, so you need either a DisplayPort monitor or a

DisplayPort-to-HDMI adapter. Connect a USB keyboard and mouse to the USB ports.


After the initial setup is complete, you can switch to headless operation and

disconnect the peripherals. We will cover both modes in detail in Chapters 4 and 5.


3.5  First Power-On


Press the power button on Node A. The system will run through a POST (Power-On

Self-Test) and then boot into Ubuntu 24.04. The first boot may take slightly

longer than subsequent boots as the system initializes hardware and expands

the filesystem to fill the 4TB NVMe SSD.


You will be greeted by the Ubuntu initial setup wizard, which walks you through

language selection, keyboard layout, timezone, and user account creation. Create

a user account with a strong password. For the username, something simple and

memorable works well - we will use "aiuser" in this tutorial, but you can

choose anything you like.


After completing the setup wizard, you will land on the Ubuntu desktop. Take a

moment to appreciate what you are looking at: a full desktop Linux environment

running on hardware that can execute a trillion AI operations per second.


Chapter 4 - NON-HEADLESS SETUP: WORKING WITH A MONITOR AND KEYBOARD


4.1  Why You Might Want a Non-Headless Setup


A non-headless setup means you are working directly at the machine with a

monitor, keyboard, and mouse attached. This is the most intuitive way to get

started, especially if you are new to Linux or to AI workstations. It gives

you a full graphical desktop environment where you can open a terminal, a web

browser, and graphical applications like LM Studio all in the same workspace.


The trade-off is that you need to be physically present at the machine to use

it. For many research and development workflows, this is perfectly acceptable.

You sit at your desk, you work on your DGX Spark, and you go home when you

are done. Simple and effective.


4.2  Updating the System


The very first thing you should do after the initial setup wizard completes is

update all installed software. NVIDIA ships the DGX Spark with a known-good

software configuration, but security patches and bug fixes accumulate quickly.

Open a terminal (press Ctrl+Alt+T or find the Terminal application in the

application menu) and run the following commands.


The first command refreshes the list of available packages from all configured

software repositories, so Ubuntu knows what updates are available:


    sudo apt update


The second command downloads and installs all available updates. The -y flag

answers "yes" automatically to any confirmation prompts:


    sudo apt upgrade -y


This process may take several minutes depending on how many updates are

available. After it completes, reboot the system to ensure all updates,

especially kernel updates, take effect:


    sudo reboot


4.3  Verifying the GPU Is Recognized


After rebooting, open a terminal and run NVIDIA's System Management Interface

tool to confirm the GPU is properly recognized and the drivers are working:


    nvidia-smi


You should see output similar to this (the exact numbers will reflect the

GB10 Blackwell GPU):


    +-----------------------------------------------------------------------------+

    | NVIDIA-SMI 570.xx.xx    Driver Version: 570.xx.xx    CUDA Version: 12.x    |

    |-------------------------------+----------------------+----------------------+

    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |

    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |

    |===============================+======================+======================|

    |   0  NVIDIA GB10 ...     On   | 00000000:01:00.0 Off |                  N/A |

    | N/A   45C    P0    25W / 170W |   2048MiB / 131072MiB|      0%      Default |

    +-----------------------------------------------------------------------------+


The key things to verify are that the GPU name is shown (GB10 or similar),

that memory shows approximately 131072 MiB (128 GB), and that the driver

version and CUDA version are displayed correctly. If you see "No devices were

found" or similar errors, something is wrong with the driver installation,

which is unusual on a DGX Spark but can happen after a kernel update.
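If nvidia-smi does report an error after a kernel update, the usual cause is
that the NVIDIA kernel module was not rebuilt for the new kernel. The
recovery sketch below assumes the driver is managed by DKMS and that the
metapackage has a name like nvidia-driver-570; verify the actual name on
your system before reinstalling:

```shell
# Is there an NVIDIA module built for the kernel you are running?
dkms status
uname -r

# Find the installed driver metapackage name first
apt list --installed 2>/dev/null | grep nvidia-driver

# Reinstalling it normally triggers a DKMS rebuild for the new kernel
sudo apt install --reinstall nvidia-driver-570
sudo reboot
```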


4.4  Installing Essential Tools


Before diving into inference engines, install a set of tools that will make

your life easier throughout this tutorial. The following command installs

several utilities in one go:


    sudo apt install -y \

        git \

        curl \

        wget \

        htop \

        nvtop \

        net-tools \

        iperf3 \

        python3-pip \

        python3-venv \

        build-essential \

        openssh-server


Let us understand what each of these tools does and why we are installing it.


The git tool is the industry-standard version control system. You will use it

to clone repositories for inference frameworks and to manage your own code.


The curl and wget tools are command-line utilities for downloading files from

the internet. Many installation scripts use curl, and wget is useful for

downloading large files like model weights.


The htop tool is an interactive process viewer that shows CPU usage, memory

usage, and running processes in a colorful, easy-to-read format. It is far

more useful than the basic top command.


The nvtop tool is the GPU equivalent of htop. It shows real-time GPU

utilization, memory usage, and temperature. You will use this constantly to

monitor your inference workloads.


The net-tools package provides classic networking commands like ifconfig and

netstat, which are useful for diagnosing network issues.


The iperf3 tool is a network performance testing utility. You will use it to

verify that the 100GbE connection between your two nodes is working at full

speed.


The python3-pip and python3-venv tools are the Python package manager and
virtual environment manager, respectively. Several of the inference engines
we will install (vLLM, SGLang, and TensorRT-LLM) ship as Python packages,
and the client code we will write later is Python.


The build-essential package installs the GCC compiler, make, and other tools

needed to compile software from source code. Some inference frameworks require

compilation steps.


The openssh-server package installs the SSH server daemon, which allows you

to connect to this machine remotely from another computer. Even in a

non-headless setup, having SSH available is valuable for scripting and remote

management.


4.5  Configuring SSH for Remote Access


Even if you are using a non-headless setup with a monitor attached, enabling

SSH is a good practice. It allows you to control the machine from your laptop,

copy files to and from it, and run commands without having to physically sit

at the machine.


After installing openssh-server, start the SSH service and configure it to

start automatically on boot:


    sudo systemctl enable ssh

    sudo systemctl start ssh


Now find the IP address of the machine on your regular network (not the

ConnectX-7 link, which we will configure later):


    ip addr show


Look for an entry that shows your regular Ethernet interface (typically named

something like eth0, eno1, or enp3s0) with an IP address in your local network

range (typically 192.168.x.x or 10.x.x.x). Note this IP address - you will

use it to SSH into the machine from your laptop.


From your laptop, you can now connect with:


    ssh aiuser@192.168.1.100


Replace 192.168.1.100 with the actual IP address of your Node A. You will be

prompted for the password you set during initial setup.


Chapter 5 - HEADLESS SETUP: SSH, REMOTE ACCESS, AND AUTOMATION


5.1  What Does "Headless" Mean and Why Would You Want It?


A headless setup means the machine runs without a monitor, keyboard, or mouse

attached. You interact with it entirely over the network via SSH. This is the

standard way to operate servers and AI workstations in professional

environments for several good reasons.


First, it saves money. Monitors, keyboards, and mice cost money, and if you

have two DGX Spark units, you do not need two sets of peripherals. One laptop

can manage both machines over SSH.


Second, it is more efficient. Once you are comfortable with the command line,

SSH is faster than working at a physical terminal. You can have multiple SSH

sessions open simultaneously, copy and paste between them, and script complex

operations.


Third, it enables automation. When your machines are managed entirely over the

network, you can write scripts that configure them, start inference servers,

monitor their health, and restart services automatically. This is essential

for production AI deployments.


5.2  Completing the Initial Setup Without a Monitor (Node B)


For Node B, you have two options for the initial setup. The first option is to

temporarily connect a monitor and keyboard, complete the Ubuntu setup wizard,

enable SSH, and then disconnect the peripherals. This is the simplest approach.


The second option is to use a technique called "blind configuration." If you

know the machine's IP address (which you can find from your router's DHCP

client list), you can SSH into it immediately after first boot, because Ubuntu

24.04 enables SSH by default in some configurations. However, this is not

guaranteed, so the first option is more reliable.


We will assume you have completed the initial setup wizard on both machines

with a monitor attached, enabled SSH on both, and noted their IP addresses.

From this point forward, all configuration will be done over SSH.


5.3  Setting Up SSH Key Authentication


Typing a password every time you SSH into a machine becomes tedious quickly.

SSH key authentication is more secure and more convenient. It works by

generating a pair of cryptographic keys: a private key that stays on your

laptop and a public key that you copy to the remote machine. When you connect,

the machines perform a cryptographic handshake that proves your identity

without requiring a password.


On your laptop (not on the DGX Spark), generate an SSH key pair if you do not

already have one:


    ssh-keygen -t ed25519 -C "dgx-spark-access"


The -t ed25519 flag specifies the Ed25519 algorithm, which is modern, fast,

and secure. The -C flag adds a comment to help you identify the key later.

When prompted for a file location, press Enter to accept the default

(~/.ssh/id_ed25519). When prompted for a passphrase, you can either set one

(more secure) or press Enter for no passphrase (more convenient).


Now copy the public key to both DGX Spark nodes. The ssh-copy-id command

handles this automatically:


    ssh-copy-id aiuser@192.168.1.100   # Node A

    ssh-copy-id aiuser@192.168.1.101   # Node B


After this, you can SSH into either machine without a password:


    ssh aiuser@192.168.1.100


5.4  Setting Up SSH Config for Convenience


Instead of typing IP addresses every time, create an SSH config file on your

laptop that gives friendly names to your machines. Open or create the file

~/.ssh/config on your laptop and add the following:


    Host node-a

        HostName 192.168.1.100

        User aiuser

        IdentityFile ~/.ssh/id_ed25519

        ServerAliveInterval 60

        ServerAliveCountMax 3


    Host node-b

        HostName 192.168.1.101

        User aiuser

        IdentityFile ~/.ssh/id_ed25519

        ServerAliveInterval 60

        ServerAliveCountMax 3


The ServerAliveInterval and ServerAliveCountMax settings tell your SSH client

to send keepalive packets every 60 seconds and to give up after 3 missed

responses. This prevents SSH sessions from dropping when you are running long

jobs and not typing anything.


Now you can connect with simply:


    ssh node-a

    ssh node-b


5.5  Configuring Passwordless SSH Between the Two Nodes


For distributed inference frameworks like vLLM and TensorRT-LLM, the two

nodes need to be able to SSH into each other without passwords. This is

because the head node (Node A) will launch processes on the worker node

(Node B) automatically.


On Node A, generate an SSH key pair:


    ssh-keygen -t ed25519 -C "node-a-to-node-b" -f ~/.ssh/id_ed25519_cluster


Then copy Node A's public key to Node B. First, display the public key:


    cat ~/.ssh/id_ed25519_cluster.pub


Copy the output, then SSH into Node B and add it to the authorized_keys file:


    # On Node B:

    echo "PASTE_THE_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys

    chmod 600 ~/.ssh/authorized_keys


Now do the reverse: on Node B, generate a key and copy it to Node A's

authorized_keys. After this, both nodes can SSH into each other without

passwords, which is required for MPI-based distributed inference.
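A quick way to confirm the mutual trust works is an SSH call in BatchMode,
which fails immediately instead of falling back to a password prompt. The
IPs are the management addresses used throughout this tutorial; the -i flag
names the cluster key generated above (on Node B, substitute whatever key
filename you chose):

```shell
# From Node A: should print Node B's hostname with no password prompt
ssh -o BatchMode=yes -i ~/.ssh/id_ed25519_cluster aiuser@192.168.1.101 hostname

# From Node B: the reverse check (key filename as generated on Node B)
ssh -o BatchMode=yes aiuser@192.168.1.100 hostname
```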


5.6  Disabling the Graphical Desktop on Headless Nodes


Running a full graphical desktop environment on a headless machine wastes

memory and CPU cycles. Ubuntu 24.04 uses the GNOME desktop by default, which

can consume 1-2 GB of RAM even when idle. For a headless AI workstation, you

want all available resources dedicated to inference.


To switch to a text-only boot target (which still allows you to start a

graphical session manually if needed), run:


    sudo systemctl set-default multi-user.target


This tells systemd (the Linux init system) to boot into a multi-user text

mode by default instead of the graphical desktop. The change takes effect on

the next reboot. To revert to graphical mode if needed:


    sudo systemctl set-default graphical.target


After setting multi-user mode, reboot:


    sudo reboot


When the machine comes back up, it will present a text login prompt instead

of a graphical desktop. SSH into it from your laptop as usual - the SSH

server starts in both modes.


Chapter 6 - NETWORKING THE TWO NODES: IP ADDRESSES, ROCE, AND JUMBO FRAMES


6.1  Understanding the Two Network Interfaces


Each DGX Spark has at least two network interfaces that we care about:


The management interface is the standard Ethernet port (1GbE or 10GbE) that

connects to your regular office or home network. This is used for internet

access, SSH from your laptop, and downloading models. We will call the IP

addresses on this interface the "management IPs" - for example, 192.168.1.100

for Node A and 192.168.1.101 for Node B.


The high-speed interface is the ConnectX-7 port that connects the two DGX

Spark units directly to each other via the DAC cable. This is used exclusively

for high-speed AI traffic between the nodes. We will assign IP addresses in a

separate subnet to this interface - for example, 10.0.0.1 for Node A and

10.0.0.2 for Node B.


Keeping these two networks separate is important. It ensures that AI traffic

does not compete with management traffic, and it makes routing simpler because

each network has a clear purpose.
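A small convenience, and purely an optional addition to the tutorial's
addressing plan, is to record both sets of addresses in /etc/hosts on each
machine so later commands and configs can use names instead of raw IPs (the
names below are illustrative):

```
# /etc/hosts additions on both nodes
192.168.1.100   node-a         # management network
192.168.1.101   node-b
10.0.0.1        node-a-fast    # ConnectX-7 high-speed link
10.0.0.2        node-b-fast
```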


6.2  Identifying the ConnectX-7 Interface Name


Linux assigns names to network interfaces automatically. The ConnectX-7

interface will have a name like enp1s0f0np0 or similar, depending on which

PCIe slot it occupies. To find the correct interface name, run:


    ip link show


You will see a list of all network interfaces. Note that ip link does not
display link speed; to confirm which interface is the ConnectX-7, check its
speed with ethtool - it should report 100000Mb/s when the DAC cable is
connected:


    ethtool <interface_name> | grep Speed


Alternatively, the mlxlink tool

(part of the Mellanox/NVIDIA networking tools) provides detailed ConnectX-7

status:


    sudo mlxlink -d /dev/mst/mt4129_pciconf0 --show_module


The interface name for the ConnectX-7 on Node A might be enp1s0f0np0. We

will use this name in examples below, but substitute the actual name you find

on your system.


6.3  Configuring Static IP Addresses on the ConnectX-7 Interface


Ubuntu 24.04 uses Netplan for network configuration. Netplan is a declarative

network configuration system that reads YAML files and generates configuration

for the underlying network daemon (NetworkManager or systemd-networkd).


On Node A, create a new Netplan configuration file for the ConnectX-7

interface. The file must be in /etc/netplan/ and have a .yaml extension. We

will call it 10-connectx7.yaml (the number prefix determines the order in

which files are processed):


    sudo nano /etc/netplan/10-connectx7.yaml


Enter the following configuration. Be very careful with indentation - YAML

is whitespace-sensitive, and incorrect indentation will cause errors:


    network:

      version: 2

      ethernets:

        enp1s0f0np0:

          dhcp4: false

          addresses:

            - 10.0.0.1/24

          mtu: 9000


The dhcp4: false line tells Netplan not to request an IP address from a DHCP

server on this interface - we are assigning a static address manually. The

addresses section assigns the IP address 10.0.0.1 with a /24 subnet mask

(which means addresses 10.0.0.1 through 10.0.0.254 are on the same network).

The mtu: 9000 line sets the Maximum Transmission Unit to 9000 bytes, a
configuration known as "jumbo frames."


On Node B, create the same file with the IP address changed to 10.0.0.2:


    network:

      version: 2

      ethernets:

        enp1s0f0np0:

          dhcp4: false

          addresses:

            - 10.0.0.2/24

          mtu: 9000


Apply the configuration on both nodes:


    sudo netplan apply


6.4  Why Jumbo Frames Matter


The standard Ethernet MTU (Maximum Transmission Unit) is 1500 bytes. This

means each network packet can carry at most 1500 bytes of data. When

transferring large amounts of data (like the activation tensors exchanged

between nodes during distributed inference), using 1500-byte packets means

a lot of overhead: each packet has headers, checksums, and other metadata

that do not carry useful data. With 1500-byte packets, this overhead is

relatively large compared to the payload.


Jumbo frames increase the MTU to 9000 bytes, which means each packet carries

six times more data for the same amount of overhead. For high-bandwidth,

low-latency applications like distributed AI inference, this can improve

throughput by 10-20% and reduce CPU overhead significantly.


The key requirement for jumbo frames is that both endpoints must be configured

with the same MTU. Since we are connecting the two DGX Spark units directly

(without a switch in between), we only need to configure the MTU on the two

machines themselves. If you were using a switch, the switch ports would also

need to be configured for jumbo frames.
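You can verify end-to-end that jumbo frames actually pass by sending a ping
that is forbidden to fragment. The largest ICMP payload at MTU 9000 is 9000
minus the 20-byte IPv4 header minus the 8-byte ICMP header; the sketch below
computes that size and uses it against Node B's address as configured above:

```shell
# Largest non-fragmented ICMP payload at MTU 9000:
# 9000 - 20 (IPv4 header) - 8 (ICMP header) = 8972 bytes
payload=$((9000 - 20 - 8))
echo "$payload"    # prints 8972

# -M do sets "don't fragment"; if any end is not passing jumbo frames,
# the ping fails with "message too long" instead of silently fragmenting.
ping -c 3 -M do -s "$payload" 10.0.0.2 || \
    echo "jumbo-size ping failed: check MTU on both ends"
```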


6.5  Verifying the Connection


After applying the Netplan configuration, verify that the two nodes can

communicate over the ConnectX-7 link. From Node A, ping Node B:


    ping -c 4 10.0.0.2


You should see responses with very low latency, typically under 0.1 milliseconds

for a direct connection. If the ping fails, check that the cable is properly

seated, that both nodes have the correct IP addresses, and that the interface

is up (ip link show should show "UP" for the ConnectX-7 interface).


Now test the actual bandwidth using iperf3. On Node B, start the iperf3 server:


    iperf3 -s


On Node A, run the iperf3 client pointing at Node B's ConnectX-7 IP:


    iperf3 -c 10.0.0.2 -t 30 -P 4


The -t 30 flag runs the test for 30 seconds, and -P 4 uses 4 parallel streams.

You should see throughput close to 100 Gbits/sec. If you see significantly

less (say, under 80 Gbits/sec), check that jumbo frames are configured

correctly on both ends and that the cable is rated for 100G.


6.6  Configuring RoCE for RDMA


To enable RDMA over the ConnectX-7 interface, we need to configure RoCEv2.

First, install the RDMA user-space libraries:


    sudo apt install -y rdma-core ibverbs-utils


Verify that the RDMA device is recognized:


    ibv_devices


You should see the ConnectX-7 listed as an RDMA device. Now verify the device

attributes:


    ibv_devinfo


This shows the RDMA capabilities of the device, including supported transport

types and maximum message sizes.


To configure RoCEv2 (as opposed to RoCEv1), we need to set the GID (Global

Identifier) index. RoCEv2 uses UDP/IP encapsulation, which is routable and

works with standard Ethernet infrastructure. The configuration is done through

the sysfs filesystem:


    # Show which RoCE version (v1 or v2) is configured for a given GID index

    # (substitute your actual device name, port, and GID index)

    cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/1


For NCCL (which vLLM and other frameworks use for multi-node communication),

set the following environment variables to tell NCCL to use RoCEv2:


    export NCCL_IB_GID_INDEX=3

    export NCCL_IB_DISABLE=0

    export NCCL_NET_GDR_LEVEL=5

    export NCCL_SOCKET_IFNAME=enp1s0f0np0


Add these to your ~/.bashrc file on both nodes so they are set automatically

in every new shell session:


    echo 'export NCCL_IB_GID_INDEX=3' >> ~/.bashrc

    echo 'export NCCL_IB_DISABLE=0' >> ~/.bashrc

    echo 'export NCCL_NET_GDR_LEVEL=5' >> ~/.bashrc

    echo 'export NCCL_SOCKET_IFNAME=enp1s0f0np0' >> ~/.bashrc

    source ~/.bashrc


The NCCL_NET_GDR_LEVEL=5 setting enables GPU Direct RDMA, which allows data

to be transferred directly between the GPU memory on one node and the GPU

memory on another node, bypassing the CPU and system memory entirely. This

is the highest level of RDMA optimization and provides the best performance

for distributed inference.
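
Because a single forgotten variable can make NCCL silently fall back to plain TCP sockets, it is worth verifying the environment programmatically before launching a distributed job. The following sketch mirrors the exports above; the interface name is specific to this setup, so substitute your own.

```python
import os

# The NCCL settings from this section, expressed as a dict so a launcher
# script can verify or apply them before starting distributed inference.
# The interface name (enp1s0f0np0) is specific to this setup.
REQUIRED_NCCL_ENV = {
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_DISABLE": "0",
    "NCCL_NET_GDR_LEVEL": "5",
    "NCCL_SOCKET_IFNAME": "enp1s0f0np0",
}


def check_nccl_env() -> list[str]:
    """Return a list of NCCL variables that are missing or mis-set."""
    problems = []
    for name, expected in REQUIRED_NCCL_ENV.items():
        actual = os.environ.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected!r}, got {actual!r}")
    return problems


def apply_nccl_env() -> None:
    """Set the required variables in this process's environment so they
    are inherited by any subprocess (e.g. a vLLM or Ray launch)."""
    os.environ.update(REQUIRED_NCCL_ENV)


if __name__ == "__main__":
    apply_nccl_env()
    print("NCCL environment configured:", check_nccl_env() == [])
```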


Chapter 7 - INFERENCE ENGINE 1: OLLAMA - THE FRIENDLY GIANT


7.1  What Is Ollama and Why Should You Care?


Ollama is an open-source tool that makes running large language models locally

as simple as running a Docker container. It handles model downloading, format

conversion, quantization, and serving through a clean REST API - all with a

single command. If you have ever wanted to run a model like Llama 3, Mistral,

or Qwen on your own hardware without wrestling with Python dependencies and

model format conversions, Ollama is the answer.


Ollama works by wrapping llama.cpp, a highly optimized C++ inference library,

with a user-friendly interface. It maintains a library of pre-quantized models

that you can download with a single command, and it automatically detects and

uses your GPU for acceleration.


For the DGX Spark, Ollama is an excellent starting point. It requires minimal

configuration, works out of the box with the NVIDIA GPU, and provides a REST

API that is easy to call from Python, JavaScript, or any other language. The

trade-off is that Ollama does not support multi-node inference natively - each

DGX Spark runs its own independent Ollama instance. However, you can use both

instances together in clever ways, which we will explore.


7.2  Installing Ollama on Both Nodes


The installation is refreshingly simple. On each node (Node A and Node B),

run the official installation script:


    curl -fsSL https://ollama.com/install.sh | sh


This script detects your operating system and architecture (ARM64 in the case

of the DGX Spark's Grace CPU), downloads the appropriate binary, installs it

to /usr/local/bin/ollama, and creates a systemd service that starts Ollama

automatically on boot.


After installation, verify that Ollama is running:


    systemctl status ollama


You should see "active (running)" in the output. If it is not running, start

it manually:


    sudo systemctl start ollama


7.3  Configuring Ollama to Listen on the Network


By default, Ollama only listens on localhost (127.0.0.1), which means it can

only be accessed from the same machine. To allow Node B to send requests to

Node A's Ollama instance (and vice versa), we need to configure Ollama to

listen on all network interfaces.


Edit the Ollama systemd service file to add the necessary environment variable:


    sudo systemctl edit ollama


This opens a text editor with an override file. Add the following content:


    [Service]

    Environment="OLLAMA_HOST=0.0.0.0"


Save and close the file. Then reload the systemd configuration and restart

Ollama:


    sudo systemctl daemon-reload

    sudo systemctl restart ollama


Now Ollama listens on all interfaces, including the ConnectX-7 interface at

10.0.0.1 (on Node A) and 10.0.0.2 (on Node B). This means you can send

inference requests from Node A to Node B's Ollama instance and vice versa.


7.4  Pulling and Running Your First Model


Let us download and run a model. We will start with Llama 3.2 3B, a capable

but compact model that downloads quickly and runs fast:


    ollama pull llama3.2:3b


Ollama downloads the model in GGUF format (a quantized format optimized for

llama.cpp). When Ollama runs as the systemd service created by the installer,

models are stored under /usr/share/ollama/.ollama/models; if you run the

server manually as your own user, they go to ~/.ollama/models instead. The

download size is a few gigabytes. Once downloaded, run the model interactively:


    ollama run llama3.2:3b


You will see a prompt where you can type messages and receive responses. This

is the simplest possible way to interact with an LLM on your DGX Spark. Type

/bye to exit the interactive session.


For a more serious model that takes advantage of the DGX Spark's 128GB memory,

try Llama 3.1 70B:


    ollama pull llama3.1:70b


This model is approximately 40GB in its quantized form and can run on a single

DGX Spark with memory to spare. Token generation for a model this size is

bound mainly by memory bandwidth rather than raw compute, but with the entire

model resident in the DGX Spark's unified memory, generation remains

comfortably interactive.


7.5  Using the Ollama REST API


The real power of Ollama comes from its REST API, which allows you to integrate

LLM inference into your own applications. The API is available at

http://localhost:11434 by default.


The following Python script demonstrates how to send a request to the Ollama

API and stream the response. Streaming means you receive tokens as they are

generated rather than waiting for the entire response to complete, which makes

the interaction feel much more responsive:


    import requests

    import json



    def query_ollama(

        prompt: str,

        model: str = "llama3.2:3b",

        host: str = "localhost",

        port: int = 11434,

        stream: bool = True

    ) -> str:

        """

        Send a prompt to an Ollama instance and return the generated text.


        Args:

            prompt:  The text prompt to send to the model.

            model:   The Ollama model name to use for inference.

            host:    The hostname or IP address of the Ollama server.

            port:    The port number on which Ollama is listening.

            stream:  If True, stream the response token by token.


        Returns:

            The complete generated text as a string.

        """

        url = f"http://{host}:{port}/api/generate"


        # Build the request payload according to the Ollama API specification.

        # The 'stream' field controls whether the server sends back partial

        # responses as they are generated or waits for the full completion.

        payload = {

            "model": model,

            "prompt": prompt,

            "stream": stream

        }


        full_response = ""


        # Use a streaming HTTP request so we can process each chunk as it arrives.

        # This is important for user-facing applications where responsiveness matters.

        with requests.post(url, json=payload, stream=stream) as response:

            response.raise_for_status()


            for line in response.iter_lines():

                if line:

                    # Each line from the Ollama streaming API is a JSON object

                    # containing a 'response' field with the next token(s).

                    chunk = json.loads(line)

                    token = chunk.get("response", "")

                    full_response += token


                    # Print each token immediately so the user sees output

                    # appearing in real time, just like ChatGPT's interface.

                    print(token, end="", flush=True)


                    # The 'done' field signals that generation is complete.

                    if chunk.get("done", False):

                        print()  # Add a newline after the response is complete.

                        break


        return full_response



    if __name__ == "__main__":

        # Query the local Ollama instance on Node A.

        print("=== Querying local Ollama (Node A) ===")

        response_a = query_ollama(

            prompt="Explain quantum entanglement in simple terms.",

            model="llama3.2:3b",

            host="localhost"

        )


        # Query the remote Ollama instance on Node B via the ConnectX-7 link.

        # Notice that we use the high-speed 10.0.0.2 address, not the

        # management network address. This routes the traffic over the

        # 100GbE direct connection for minimum latency.

        print("\n=== Querying remote Ollama (Node B) ===")

        response_b = query_ollama(

            prompt="What are the applications of quantum computing?",

            model="llama3.2:3b",

            host="10.0.0.2"

        )


This script is straightforward but illustrates a powerful concept: with two

DGX Spark units running Ollama, you can distribute inference requests across

both machines. Node A handles some requests while Node B handles others,

effectively doubling your throughput for workloads that involve many concurrent

users or many independent queries.


7.6  A Load Balancer for Two Ollama Instances


To automatically distribute requests between the two Ollama instances, you

can write a simple round-robin load balancer. This is useful when you want to

serve many users and want to spread the load evenly:


    import requests

    import itertools

    from typing import Iterator



    class OllamaLoadBalancer:

        """

        A simple round-robin load balancer for multiple Ollama instances.


        This class cycles through a list of Ollama server addresses and

        sends each request to the next server in the rotation. This ensures

        that no single server is overwhelmed while others sit idle.

        """


        def __init__(self, servers: list[dict]) -> None:

            """

            Initialize the load balancer with a list of server configurations.


            Args:

                servers: A list of dicts, each with 'host' and 'port' keys.

                         Example: [{"host": "localhost", "port": 11434},

                                   {"host": "10.0.0.2",  "port": 11434}]

            """

            self.servers = servers

            # itertools.cycle creates an infinite iterator that cycles through

            # the list: server0, server1, server0, server1, ...

            self._server_cycle: Iterator[dict] = itertools.cycle(servers)


        def _get_next_server(self) -> dict:

            """Return the next server in the rotation."""

            return next(self._server_cycle)


        def generate(

            self,

            prompt: str,

            model: str = "llama3.2:3b"

        ) -> str:

            """

            Send a generation request to the next available server.


            Args:

                prompt: The text prompt for the model.

                model:  The model name to use.


            Returns:

                The generated text response.

            """

            server = self._get_next_server()

            url = f"http://{server['host']}:{server['port']}/api/generate"


            print(f"Routing request to {server['host']}:{server['port']}")


            payload = {

                "model": model,

                "prompt": prompt,

                "stream": False  # Non-streaming for simplicity in this example.

            }


            response = requests.post(url, json=payload)

            response.raise_for_status()


            return response.json().get("response", "")



    if __name__ == "__main__":

        # Configure the load balancer with both DGX Spark nodes.

        # Node A is accessed via localhost (we are running this script on Node A).

        # Node B is accessed via the high-speed ConnectX-7 interface.

        balancer = OllamaLoadBalancer(

            servers=[

                {"host": "localhost", "port": 11434},

                {"host": "10.0.0.2",  "port": 11434}

            ]

        )


        # Simulate five incoming requests. They will alternate between

        # Node A and Node B automatically.

        prompts = [

            "What is machine learning?",

            "Explain neural networks.",

            "What is backpropagation?",

            "Describe transformer architecture.",

            "What is attention mechanism?"

        ]


        for i, prompt in enumerate(prompts):

            print(f"\n--- Request {i + 1} ---")

            print(f"Prompt: {prompt}")

            response = balancer.generate(prompt=prompt, model="llama3.2:3b")

            print(f"Response: {response[:200]}...")  # Print first 200 characters.


Chapter 8 - INFERENCE ENGINE 2: LM STUDIO - THE GUI POWERHOUSE


8.1  What Is LM Studio?


LM Studio is a desktop application that provides a polished graphical user

interface for downloading, managing, and running large language models locally.

If you have ever wished for a ChatGPT-like interface that runs entirely on your

own hardware, LM Studio is exactly that. It includes a model browser that lets

you search and download models from Hugging Face, a chat interface for

interactive conversations, and a local server that exposes an OpenAI-compatible

API.


LM Studio is particularly valuable for non-headless setups where you have a

monitor connected. It is the most beginner-friendly of the five inference

engines we cover, requiring no command-line interaction for basic use. However,

it also provides enough advanced features to satisfy experienced practitioners.


8.2  Installing LM Studio on the DGX Spark


LM Studio supports Linux ARM64, which is the architecture of the DGX Spark's

Grace CPU. Download the ARM64 AppImage from the LM Studio website:


    wget https://releases.lmstudio.ai/linux/arm64/latest/LM_Studio-latest-arm64.AppImage


Make the downloaded file executable:


    chmod +x LM_Studio-latest-arm64.AppImage


For a non-headless setup, simply double-click the AppImage in the file manager,

or run it from the terminal:


    ./LM_Studio-latest-arm64.AppImage


LM Studio will launch with a graphical interface. On first launch, it may ask

you to accept a license agreement and choose a directory for storing models.

The default location (~/.cache/lm-studio/models) is fine, but given the DGX

Spark's 4TB NVMe SSD, you have plenty of space to store many large models.


For a headless setup, LM Studio can be run in server mode without a graphical

interface. This is done using the lms command-line tool that LM Studio installs:


    # First, bootstrap the lms CLI tool onto your PATH (run this after

    # launching LM Studio at least once so the binary exists):

    ~/.lmstudio/bin/lms bootstrap


    # Start the LM Studio server in headless mode:

    ~/.lmstudio/bin/lms server start --port 1234


8.3  Using LM Studio's GUI


In the graphical interface, the left sidebar has several icons. The first icon

(a magnifying glass) opens the model search interface, where you can browse

and download models from Hugging Face. Search for "llama" or "mistral" to find

popular models. LM Studio shows the model size and quantization level, helping

you choose a model that fits in your 128GB memory.


The second icon (a chat bubble) opens the chat interface. After loading a model

(click the model name in the top bar to load it), you can type messages and

receive responses in a familiar chat format. This is excellent for interactive

exploration and testing.


The third icon (a server icon) opens the local server settings. Enable the

server and it will listen on port 1234 by default, exposing an OpenAI-compatible

API. This is the same API format used by OpenAI's ChatGPT, which means any

code written for OpenAI's API works with LM Studio with minimal changes.


8.4  Using LM Studio's OpenAI-Compatible API


Once the LM Studio server is running (either in GUI mode or headless mode),

you can interact with it using the OpenAI Python library. This is one of the

most important aspects of LM Studio: because it speaks the OpenAI API protocol,

you can swap between LM Studio and actual OpenAI models with a single line

change in your code.


Install the OpenAI Python library if you have not already:


    pip install openai


The following script demonstrates how to use LM Studio's API, including how

to switch seamlessly between local and remote models:


    from openai import OpenAI



    def create_lmstudio_client(

        host: str = "localhost",

        port: int = 1234

    ) -> OpenAI:

        """

        Create an OpenAI client configured to talk to an LM Studio server.


        LM Studio exposes an OpenAI-compatible API, so we use the standard

        OpenAI Python library but point it at our local LM Studio instance.

        The api_key parameter is required by the library but is not actually

        validated by LM Studio - any non-empty string works.


        Args:

            host: The hostname or IP address of the LM Studio server.

            port: The port number on which LM Studio is listening.


        Returns:

            A configured OpenAI client instance.

        """

        return OpenAI(

            base_url=f"http://{host}:{port}/v1",

            api_key="lm-studio"  # LM Studio ignores this, but it must be set.

        )



    def chat_with_model(

        client: OpenAI,

        system_prompt: str,

        user_message: str,

        model: str = "local-model",

        temperature: float = 0.7,

        max_tokens: int = 1024

    ) -> str:

        """

        Send a chat message to an LM Studio model and return the response.


        This function uses the chat completions API, which is the standard

        way to interact with instruction-tuned models. It supports a system

        prompt (which sets the model's persona and behavior) and a user

        message (the actual question or instruction).


        Args:

            client:        The OpenAI client configured for LM Studio.

            system_prompt: Instructions that define the model's behavior.

            user_message:  The user's question or instruction.

            model:         The model identifier (LM Studio uses the loaded

                           model regardless of this value).

            temperature:   Controls randomness. 0.0 is deterministic,

                           1.0 is highly random. 0.7 is a good default.

            max_tokens:    Maximum number of tokens to generate.


        Returns:

            The model's response as a string.

        """

        completion = client.chat.completions.create(

            model=model,

            messages=[

                {

                    "role": "system",

                    "content": system_prompt

                },

                {

                    "role": "user",

                    "content": user_message

                }

            ],

            temperature=temperature,

            max_tokens=max_tokens

        )


        # The response is nested inside the completion object.

        # We extract just the text content of the first (and usually only)

        # choice that the model generated.

        return completion.choices[0].message.content



    if __name__ == "__main__":

        # Connect to LM Studio running on Node A (local).

        local_client = create_lmstudio_client(host="localhost", port=1234)


        # Connect to LM Studio running on Node B (remote, via ConnectX-7).

        remote_client = create_lmstudio_client(host="10.0.0.2", port=1234)


        system_prompt = (

            "You are a helpful AI assistant specializing in explaining "

            "complex technical concepts in simple, accessible language."

        )


        question = "How does a transformer neural network process text?"


        print("=== Response from Node A (local LM Studio) ===")

        response_local = chat_with_model(

            client=local_client,

            system_prompt=system_prompt,

            user_message=question

        )

        print(response_local)


        print("\n=== Response from Node B (remote LM Studio) ===")

        response_remote = chat_with_model(

            client=remote_client,

            system_prompt=system_prompt,

            user_message=question

        )

        print(response_remote)


Chapter 9 - INFERENCE ENGINE 3: VLLM - THE THROUGHPUT CHAMPION


9.1  What Is vLLM and Why Is It Different?


vLLM (Virtual Large Language Model) is an open-source inference engine

developed by researchers at UC Berkeley. It was created to solve a specific

problem: how do you serve large language models to many users simultaneously

with high throughput and low latency?


The key innovation in vLLM is a technique called PagedAttention. To understand

why PagedAttention matters, we need to briefly understand how LLM inference

works. When a model generates text, it maintains a "key-value cache" (KV cache)

for each token it has processed. This cache stores intermediate computations

that allow the model to attend to previous tokens efficiently. The KV cache

grows with each generated token and can consume enormous amounts of GPU memory.


The problem with naive KV cache management is memory fragmentation. If you are

serving 10 users simultaneously, each with a different conversation length,

the KV caches for those conversations are different sizes. Allocating fixed

blocks of memory for each conversation wastes space when conversations are

short and fails when conversations grow longer than expected.


PagedAttention borrows an idea from operating system virtual memory management:

it divides the KV cache into fixed-size "pages" and allocates them dynamically

as needed, similar to how an OS manages physical memory pages for multiple

processes. This eliminates fragmentation and allows vLLM to serve 2-4x more

concurrent users than naive implementations with the same hardware.
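
To make the paging idea concrete, here is a toy allocator in Python. It is a conceptual sketch only - vLLM's real implementation manages GPU tensors, block sharing, and copy-on-write - but it shows how fixed-size pages let sequences of any length draw from one pool without fragmenting it.

```python
class PagedKVPool:
    """Toy KV-cache pool: fixed-size pages allocated on demand."""

    def __init__(self, total_pages: int, tokens_per_page: int = 16) -> None:
        self.tokens_per_page = tokens_per_page
        self.free_pages = list(range(total_pages))   # indices of free pages
        self.page_tables: dict[str, list[int]] = {}  # seq id -> its pages
        self.token_counts: dict[str, int] = {}       # seq id -> tokens so far

    def append_token(self, seq_id: str) -> None:
        """Account for one generated token; grab a new page only when the
        sequence crosses a page boundary."""
        count = self.token_counts.get(seq_id, 0)
        if count % self.tokens_per_page == 0:
            if not self.free_pages:
                raise MemoryError("KV cache pool exhausted")
            self.page_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.token_counts[seq_id] = count + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the pool immediately."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)
```

With 16 tokens per page, a 20-token conversation occupies exactly two pages; a naive allocator that reserves the maximum sequence length up front would have claimed far more memory for the same conversation.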


9.2  Installing vLLM


Create a Python virtual environment for vLLM to keep its dependencies isolated

from other tools:


    python3 -m venv ~/vllm-env

    source ~/vllm-env/bin/activate


Install vLLM. For the DGX Spark's ARM-based Grace CPU with CUDA 12.x, the

standard pip installation should work:


    pip install vllm


If the pip installation fails (which can happen on ARM architectures where

pre-built wheels are not available), you may need to build from source:


    git clone https://github.com/vllm-project/vllm.git

    cd vllm

    pip install -e .


The build from source takes 15-30 minutes as it compiles CUDA kernels. This

is normal and expected. The resulting installation is fully optimized for your

specific GPU architecture.


9.3  Running vLLM as a Single-Node Server


The simplest way to use vLLM is as a single-node OpenAI-compatible API server.

Start the server with a model from Hugging Face (vLLM downloads models

automatically from the Hugging Face Hub):


    python -m vllm.entrypoints.openai.api_server \

        --model meta-llama/Llama-3.1-70B-Instruct \

        --host 0.0.0.0 \

        --port 8000 \

        --dtype bfloat16 \

        --max-model-len 8192


Let us understand each argument. The --model flag specifies the Hugging Face

model identifier. The --host 0.0.0.0 flag makes the server listen on all

network interfaces, not just localhost. The --port 8000 flag sets the port

number. The --dtype bfloat16 flag tells vLLM to use 16-bit brain floating

point precision, which is the best precision for the Blackwell GPU. The

--max-model-len 8192 flag limits the maximum sequence length (input + output

tokens combined) to 8192 tokens, which controls memory usage.


For a model that requires authentication (like Llama 3.1), you need a Hugging

Face account and access token. Set the token as an environment variable:


    export HF_TOKEN="your_huggingface_token_here"


9.4  Setting Up Two-Node Distributed Inference with vLLM


This is where things get genuinely exciting. With two DGX Spark units, you

can run a single model that is distributed across both machines. This allows

you to run models that are too large for a single 128GB memory pool, or to

run models faster by parallelizing the computation.


vLLM supports two forms of parallelism for multi-node inference. Tensor

parallelism splits individual matrix operations across multiple GPUs, with

each GPU computing a portion of each operation simultaneously. Pipeline

parallelism splits the model's layers across GPUs, with each GPU processing

a different set of layers in sequence. For two nodes, tensor parallelism

typically gives better performance.
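
A miniature illustration of tensor parallelism: split a weight matrix across two "devices" along the output dimension, let each compute its partial result, and concatenate. Real engines shard attention heads and MLP blocks this way and exchange the partials over NCCL; this pure-Python sketch only demonstrates that the sharded result equals the unsharded one.

```python
def matvec(matrix: list[list[float]], vec: list[float]) -> list[float]:
    """Reference matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def tensor_parallel_matvec(
    matrix: list[list[float]],
    vec: list[float],
    num_shards: int = 2,
) -> list[float]:
    """Split the weight rows (the output dimension) across shards, compute
    each partial result, and concatenate - the gather step that a real
    engine performs over the interconnect."""
    shard_size = len(matrix) // num_shards
    output: list[float] = []
    for d in range(num_shards):  # in a real engine these run concurrently
        shard = matrix[d * shard_size:(d + 1) * shard_size]
        output.extend(matvec(shard, vec))
    return output
```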


vLLM uses Ray for multi-node coordination. Ray is a distributed computing

framework that handles process management, communication, and fault tolerance.

Install Ray on both nodes:


    pip install ray


On Node A (the head node), start the Ray cluster:


    ray start --head \

        --node-ip-address=10.0.0.1 \

        --port=6379 \

        --dashboard-host=0.0.0.0


The --node-ip-address flag tells Ray to use the ConnectX-7 interface IP for

cluster communication. This routes all Ray traffic over the high-speed direct

connection rather than the management network.


On Node B (the worker node), join the Ray cluster:


    ray start \

        --address=10.0.0.1:6379 \

        --node-ip-address=10.0.0.2


Now, on Node A, start vLLM with tensor parallelism across both nodes:


    python -m vllm.entrypoints.openai.api_server \

        --model meta-llama/Llama-3.1-70B-Instruct \

        --host 0.0.0.0 \

        --port 8000 \

        --tensor-parallel-size 2 \

        --dtype bfloat16 \

        --max-model-len 8192


The --tensor-parallel-size 2 flag tells vLLM to split the model across 2 GPUs

(one on each node). vLLM uses the Ray cluster to coordinate with Node B

automatically. The model weights are split such that each node holds half the

model, and during inference, both nodes compute their portion simultaneously

and exchange results via the ConnectX-7 link.


9.5  Querying the vLLM Server


Once the vLLM server is running, you can query it using the OpenAI Python

library, since vLLM exposes an OpenAI-compatible API. The following script

demonstrates both streaming and non-streaming requests and shows how the

two-node setup looks from the client's perspective:


    import asyncio

    from typing import AsyncIterator

    from openai import AsyncOpenAI



    class VLLMClient:

        """

        A client for interacting with a vLLM OpenAI-compatible API server.


        This client supports both synchronous and asynchronous operation,

        and demonstrates how to use streaming responses for real-time output.

        The vLLM server is accessed via the Node A management IP since it

        acts as the head node and exposes the unified API endpoint.

        """


        def __init__(

            self,

            host: str = "localhost",

            port: int = 8000,

            model: str = "meta-llama/Llama-3.1-70B-Instruct"

        ) -> None:

            """

            Initialize the vLLM client.


            Args:

                host:  The hostname or IP of the vLLM server (Node A).

                port:  The port number of the vLLM API server.

                model: The model name as registered in the vLLM server.

            """

            self.model = model

            self.client = AsyncOpenAI(

                base_url=f"http://{host}:{port}/v1",

                api_key="not-needed"  # vLLM does not require authentication

                                      # by default, but the field must be set.

            )


        async def stream_completion(

            self,

            prompt: str,

            max_tokens: int = 512,

            temperature: float = 0.7

        ) -> AsyncIterator[str]:

            """

            Stream a completion from the vLLM server token by token.


            This is an async generator that yields each token as it is

            generated. Using async streaming allows your application to

            remain responsive while waiting for the model to generate text,

            which is especially important for long responses.


            Args:

                prompt:     The text prompt to complete.

                max_tokens: Maximum number of tokens to generate.

                temperature: Sampling temperature (0.0 = deterministic).


            Yields:

                Individual tokens as strings.

            """

            stream = await self.client.completions.create(

                model=self.model,

                prompt=prompt,

                max_tokens=max_tokens,

                temperature=temperature,

                stream=True

            )


            async for chunk in stream:

                # Each chunk contains a list of choices. We take the first

                # choice and extract the text delta (the new tokens).

                if chunk.choices and chunk.choices[0].text:

                    yield chunk.choices[0].text


        async def chat_completion(

            self,

            messages: list[dict],

            max_tokens: int = 512,

            temperature: float = 0.7

        ) -> str:

            """

            Send a chat completion request and return the full response.


            Args:

                messages:   A list of message dicts with 'role' and 'content'.

                max_tokens: Maximum tokens to generate.

                temperature: Sampling temperature.


            Returns:

                The model's response as a string.

            """

            response = await self.client.chat.completions.create(

                model=self.model,

                messages=messages,

                max_tokens=max_tokens,

                temperature=temperature,

                stream=False

            )

            return response.choices[0].message.content



    async def main() -> None:

        """Demonstrate vLLM client usage with both streaming and non-streaming."""


        # Connect to the vLLM server running on Node A.

        # Even though the model is distributed across both nodes,

        # all requests go to Node A's API endpoint. vLLM handles

        # the distribution internally via the Ray cluster.

        client = VLLMClient(

            host="localhost",  # or use the management IP: "192.168.1.100"

            port=8000,

            model="meta-llama/Llama-3.1-70B-Instruct"

        )


        # Demonstrate streaming completion.

        print("=== Streaming Completion ===")

        prompt = "The key advantages of distributed AI inference are:"

        print(f"Prompt: {prompt}")

        print("Response: ", end="")


        async for token in client.stream_completion(

            prompt=prompt,

            max_tokens=256,

            temperature=0.3

        ):

            print(token, end="", flush=True)

        print()


        # Demonstrate chat completion with a system prompt.

        print("\n=== Chat Completion ===")

        messages = [

            {

                "role": "system",

                "content": "You are an expert in distributed computing and AI systems."

            },

            {

                "role": "user",

                "content": "What is tensor parallelism and how does it work?"

            }

        ]


        response = await client.chat_completion(

            messages=messages,

            max_tokens=512,

            temperature=0.5

        )

        print(f"Response: {response}")



    if __name__ == "__main__":

        asyncio.run(main())


Chapter 10 - INFERENCE ENGINE 4: SGLANG - THE STRUCTURED GENERATION WIZARD


10.1  What Is SGLang?


SGLang (Structured Generation Language) is an inference framework developed

at UC Berkeley that takes a different approach to LLM serving. While vLLM

focuses on maximizing throughput through efficient memory management, SGLang

focuses on making it easy and efficient to build complex LLM programs that

involve structured outputs, multi-step reasoning, and sophisticated prompting

patterns.


The key innovation in SGLang is RadixAttention, which is an extension of the

KV cache concept. In standard inference, each request has its own KV cache

that is discarded when the request completes. RadixAttention organizes KV

caches in a radix tree (a compressed prefix tree), allowing them to be shared

between requests that share common prefixes. This is enormously valuable for

applications where many requests share a common system prompt or context, such

as a customer service bot where every conversation starts with the same

instructions.


For example, if you have a 2000-token system prompt that is the same for every

request, standard inference must recompute the KV cache for those 2000 tokens

for every single request. With RadixAttention, the KV cache for those 2000

tokens is computed once and reused for all subsequent requests that share that

prefix. For such workloads this can cut prefill latency - the time to first

token - by 50-80%.
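To build intuition for how prefix reuse saves work, here is a toy sketch of a prefix cache in plain Python. This is not SGLang's actual implementation (RadixAttention stores KV tensors in a radix tree on the GPU); it only illustrates the accounting: a request pays compute only for the tokens that extend the longest previously cached prefix.

```python
def longest_cached_prefix(cache: dict, tokens: list) -> int:
    """Return the length of the longest cached prefix of `tokens`."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cache:
            return n
    return 0

def run_request(cache: dict, tokens: list) -> int:
    """'Compute' KV entries for tokens not covered by the cache.

    Returns how many tokens actually had to be computed.
    """
    reused = longest_cached_prefix(cache, tokens)
    computed = len(tokens) - reused
    # Cache every prefix of this request for future reuse.
    for n in range(1, len(tokens) + 1):
        cache[tuple(tokens[:n])] = True
    return computed

cache: dict = {}
system_prompt = list(range(2000))  # stand-in for a 2000-token system prompt
first = run_request(cache, system_prompt + [9001, 9002])  # cold: all tokens
second = run_request(cache, system_prompt + [9003])       # warm: prefix reused
print(first, second)  # 2002 1
```

The first request computes all 2002 tokens; the second reuses the cached 2000-token system prompt and computes only the single new token.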


10.2  Installing SGLang


Create a virtual environment for SGLang:


    python3 -m venv ~/sglang-env

    source ~/sglang-env/bin/activate


Install SGLang with all optional dependencies:


    pip install "sglang[all]"


If you encounter issues with the ARM64 architecture, install from source:


    git clone https://github.com/sgl-project/sglang.git

    cd sglang

    pip install -e ".[all]"


10.3  Starting the SGLang Server


SGLang provides a launch_server script that starts an OpenAI-compatible API

server. On Node A, start the server:


    python -m sglang.launch_server \

        --model-path meta-llama/Llama-3.1-8B-Instruct \

        --host 0.0.0.0 \

        --port 30000 \

        --dtype bfloat16 \

        --mem-fraction-static 0.85


The --mem-fraction-static 0.85 flag tells SGLang to reserve 85% of GPU memory

for its static allocation - the model weights plus the KV cache pool. The

remaining 15% is left for dynamic allocations during inference. Lowering this

value trades maximum batch size for stability if you run into out-of-memory

errors.
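To pick a sensible value, it helps to estimate how large the KV cache actually is. The sketch below uses the published Llama 3.1 8B architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with bfloat16 cache entries; the 100 GiB pool and 16 GiB weight figure are illustrative assumptions, not DGX Spark measurements.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: K and V each store
    kv_heads * head_dim values per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3.1 8B with bfloat16 KV entries: 128 KiB per token.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)  # 131072

# With --mem-fraction-static 0.85 of a hypothetical 100 GiB pool, the
# static allocation must first hold the ~16 GiB of bf16 weights; the
# remainder is available for KV cache.
static_pool = 0.85 * 100 * 1024**3
kv_budget = static_pool - 16 * 1024**3
print(f"{int(kv_budget // per_token):,} tokens of KV cache fit")
```

At 128 KiB per token, a few hundred thousand tokens of cache fit in that budget, which is what makes large batch sizes and long shared prefixes practical.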


For two-node distributed inference with SGLang, the setup uses torch.distributed

with NCCL as the communication backend. On Node A:


    python -m sglang.launch_server \

        --model-path meta-llama/Llama-3.1-70B-Instruct \

        --host 0.0.0.0 \

        --port 30000 \

        --tp-size 2 \

        --nnodes 2 \

        --node-rank 0 \

        --dist-init-addr 10.0.0.1:29500 \

        --dtype bfloat16


On Node B (run this command simultaneously with the Node A command):


    python -m sglang.launch_server \

        --model-path meta-llama/Llama-3.1-70B-Instruct \

        --host 0.0.0.0 \

        --port 30000 \

        --tp-size 2 \

        --nnodes 2 \

        --node-rank 1 \

        --dist-init-addr 10.0.0.1:29500 \

        --dtype bfloat16


The --tp-size 2 flag sets tensor parallelism to 2 (one GPU per node). The

--nnodes 2 flag specifies the total number of nodes. The --node-rank flag

identifies each node (0 for head, 1 for worker). The --dist-init-addr flag

specifies the address of the head node for distributed initialization, using

the ConnectX-7 IP for low-latency communication.
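Once both launch commands are running, the head node serves an OpenAI-compatible HTTP API, so you can smoke-test the two-node cluster before writing any SGLang-specific code. A minimal sketch using only the standard library; the host, port, and model name assume the setup above:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str,
                       max_tokens: int = 64, temperature: float = 0.0) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(host: str, port: int, payload: dict) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

payload = build_chat_payload(
    "meta-llama/Llama-3.1-70B-Instruct",
    "In one sentence, what is tensor parallelism?",
)
# Uncomment once the two-node server is up:
# print(chat("localhost", 30000, payload))
```

If this returns a coherent sentence, the distributed backend is healthy and you can move on to the native SGLang API below.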


10.4  Using SGLang's Structured Generation Features


SGLang's most powerful feature is its ability to generate structured outputs

reliably. This is useful when you need the model to produce JSON, follow a

specific format, or make a series of decisions in a structured way. The

following example demonstrates how to use SGLang's Python API to generate

structured JSON output:


    import sglang as sgl

    from sglang import function, system, user, assistant, gen

    import json



    # SGLang uses a decorator-based programming model where you define

    # generation programs as Python functions decorated with @sgl.function.

    # This allows SGLang to optimize the execution of complex multi-step

    # generation tasks.

    @sgl.function

    def analyze_text(s, text: str) -> None:

        """

        Analyze a piece of text and extract structured information.


        This SGLang program instructs the model to analyze input text and

        produce a structured JSON response containing sentiment, key topics,

        and a summary. The use of SGLang's constrained generation ensures

        the output is valid JSON.


        Args:

            s:    The SGLang state object (injected automatically).

            text: The text to analyze.

        """

        # Set the system prompt that defines the model's behavior.

        s += system(

            "You are a text analysis assistant. Always respond with valid JSON."

        )


        # Provide the user's request with the text to analyze.

        s += user(

            f"Analyze the following text and provide a JSON response with "

            f"fields: 'sentiment' (positive/negative/neutral), "

            f"'key_topics' (list of strings), and 'summary' (one sentence).\n\n"

            f"Text: {text}"

        )


        # The gen() call tells SGLang to generate text here.

        # The max_tokens parameter limits the response length.

        # SGLang can also enforce JSON schema constraints if configured.

        s += assistant(gen("analysis", max_tokens=256))



    @sgl.function

    def multi_step_reasoning(s, question: str) -> None:

        """

        Perform multi-step chain-of-thought reasoning.


        This program demonstrates SGLang's ability to structure complex

        reasoning tasks. It first generates a step-by-step reasoning chain,

        then uses that reasoning to produce a final answer. This two-step

        approach often produces more accurate results than asking for the

        answer directly.


        Args:

            s:        The SGLang state object.

            question: The question to reason about.

        """

        s += system(

            "You are a careful reasoner who thinks step by step before answering."

        )


        s += user(f"Question: {question}\n\nFirst, think through this step by step:")


        # Generate the reasoning chain and store it in the 'reasoning' variable.

        s += assistant(gen("reasoning", max_tokens=512))


        # Now ask for the final answer, which can reference the reasoning above.

        s += user("Based on your reasoning above, what is your final answer?")


        # Generate the final answer and store it in the 'answer' variable.

        s += assistant(gen("answer", max_tokens=128))



    def run_sglang_examples() -> None:

        """

        Run the SGLang example programs against the local server.


        This function initializes the SGLang runtime to connect to the

        server we started earlier, then runs both example programs.

        """

        # Initialize the SGLang runtime to connect to the local server.

        # If you want to use the distributed two-node setup, point this

        # at Node A's management IP or localhost if running on Node A.

        sgl.set_default_backend(

            sgl.RuntimeEndpoint("http://localhost:30000")

        )


        # Run the text analysis program.

        print("=== Structured Text Analysis ===")

        sample_text = (

            "The new NVIDIA DGX Spark is a remarkable piece of engineering. "

            "It delivers petaFLOP-scale AI performance in a desktop form factor, "

            "making enterprise-grade AI accessible to individual researchers."

        )


        result = analyze_text(text=sample_text)


        # Access the generated content by variable name.

        raw_analysis = result["analysis"]

        print(f"Raw output: {raw_analysis}")


        # Attempt to parse the JSON output.

        try:

            parsed = json.loads(raw_analysis)

            print(f"Sentiment:   {parsed.get('sentiment', 'N/A')}")

            print(f"Key topics:  {parsed.get('key_topics', [])}")

            print(f"Summary:     {parsed.get('summary', 'N/A')}")

        except json.JSONDecodeError:

            print("Note: Output was not valid JSON. Adjust the prompt for stricter formatting.")


        # Run the multi-step reasoning program.

        print("\n=== Multi-Step Reasoning ===")

        question = (

            "If two DGX Spark units each have 128GB of unified memory and are "

            "connected via a 100GbE link, what is the theoretical maximum model "

            "size they could run together, and what are the practical limitations?"

        )


        result = multi_step_reasoning(question=question)

        print(f"Reasoning:\n{result['reasoning']}")

        print(f"\nFinal Answer:\n{result['answer']}")



    if __name__ == "__main__":

        run_sglang_examples()


Chapter 11 - INFERENCE ENGINE 5: TENSORRT-LLM - MAXIMUM PERFORMANCE MODE


11.1  What Is TensorRT-LLM?


TensorRT-LLM is NVIDIA's own high-performance inference library, and it

represents the pinnacle of optimization for NVIDIA hardware. While Ollama,

LM Studio, vLLM, and SGLang are general-purpose frameworks that work across

different hardware, TensorRT-LLM is specifically engineered to extract every

last drop of performance from NVIDIA GPUs.


The way TensorRT-LLM achieves this is through model compilation. Instead of

running a model in its original format (PyTorch weights), TensorRT-LLM compiles

the model into a TensorRT engine - a highly optimized binary that is tailored

to the specific GPU architecture it will run on. This compilation process

applies a battery of optimizations: kernel fusion (combining multiple operations

into a single GPU kernel to reduce memory bandwidth), precision reduction

(converting weights to FP8 or INT4 format), layer optimization (replacing

generic PyTorch operations with hand-written CUDA kernels), and graph

optimization (reordering and eliminating redundant operations).


The result is typically 2-5x faster inference compared to unoptimized PyTorch,

with the exact speedup depending on the model architecture and the specific

GPU. For the DGX Spark's Blackwell GPU, which has dedicated hardware for FP4

and FP8 operations, the speedup can be even more dramatic.
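The precision reduction alone is easy to quantify. The sketch below compares the approximate weight footprint of an 8-billion-parameter model at the precisions TensorRT-LLM can target; the parameter count is rounded and KV cache and activations are ignored, so treat the numbers as rough guides.

```python
def weight_footprint_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# An ~8B-parameter model (e.g., Llama 3.1 8B) at three precisions:
for name, bits in [("bf16", 16), ("fp8", 8), ("int4", 4)]:
    print(f"{name}: {weight_footprint_gib(8.0, bits):5.1f} GiB")
# bf16 ≈ 14.9 GiB, fp8 ≈ 7.5 GiB, int4 ≈ 3.7 GiB
```

Halving the bits halves the bytes that must stream from memory for every token, which is why FP8 and INT4 engines are so much faster on bandwidth-bound decode workloads.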


The trade-off is complexity. TensorRT-LLM requires a compilation step that

can take 30 minutes to several hours for large models, and the compiled engine

is specific to the GPU architecture it was compiled for. You cannot compile

an engine on a Blackwell GPU and run it on an Ampere GPU.


11.2  Installing TensorRT-LLM


TensorRT-LLM is best installed inside a Docker container, as it has complex

dependencies that are pre-configured in NVIDIA's official container images.

The DGX Spark has Docker and the NVIDIA Container Runtime pre-installed.


Pull the official TensorRT-LLM container:


    docker pull nvcr.io/nvidia/tensorrt-llm:latest


Alternatively, install via pip in a virtual environment:


    python3 -m venv ~/trtllm-env

    source ~/trtllm-env/bin/activate

    pip install tensorrt-llm


If the pip installation fails on ARM64, use the Docker approach, which is

more reliable:


    docker run --gpus all \

        --rm \

        -it \

        -v /home/aiuser/models:/models \

        nvcr.io/nvidia/tensorrt-llm:latest \

        bash


The -v flag mounts your local models directory inside the container, so models

you download are accessible both inside and outside Docker.


11.3  Building a TensorRT-LLM Engine


Building a TensorRT-LLM engine is a two-step process. First, you convert the

model weights from Hugging Face format to TensorRT-LLM's internal format.

Second, you compile the converted weights into an optimized TensorRT engine.


Step 1: Convert the model weights. This example uses Llama 3.1 8B:


    # Inside the TensorRT-LLM Docker container or virtual environment:


    # Download the model from Hugging Face first.

    # You need the huggingface_hub library for this.

    pip install huggingface_hub


    python3 -c "

    from huggingface_hub import snapshot_download

    snapshot_download(

        repo_id='meta-llama/Llama-3.1-8B-Instruct',

        local_dir='/models/llama-3.1-8b-hf',

        token='your_hf_token_here'

    )

    "


    # Convert the Hugging Face weights to TensorRT-LLM checkpoint format.

    # The convert_checkpoint.py script ships with the TensorRT-LLM

    # repository under examples/llama/.

    python3 examples/llama/convert_checkpoint.py \

        --model_dir /models/llama-3.1-8b-hf \

        --output_dir /models/llama-3.1-8b-trtllm-ckpt \

        --dtype bfloat16 \

        --tp_size 1


Step 2: Compile the TensorRT engine:


    trtllm-build \

        --checkpoint_dir /models/llama-3.1-8b-trtllm-ckpt \

        --output_dir /models/llama-3.1-8b-trtllm-engine \

        --gemm_plugin bfloat16 \

        --max_batch_size 8 \

        --max_input_len 2048 \

        --max_seq_len 4096


The --gemm_plugin flag enables TensorRT's optimized GEMM (General Matrix

Multiply) kernels, which perform the core operation in transformer inference.

The --max_batch_size flag sets the maximum number of requests that can be

processed simultaneously. The --max_input_len flag caps the prompt length,

and --max_seq_len caps the total context (prompt plus generated tokens).


11.4  Two-Node TensorRT-LLM with MPI


For two-node distributed inference with TensorRT-LLM, we use MPI (Message

Passing Interface), the standard parallel computing communication library.

Install MPI on both nodes:


    sudo apt install -y openmpi-bin openmpi-common libopenmpi-dev


For a two-node setup, rebuild the TensorRT engine with tensor parallelism:


    # On Node A: convert with tp_size 2

    python3 /path/to/convert_checkpoint.py \

        --model_dir /models/llama-3.1-70b-hf \

        --output_dir /models/llama-3.1-70b-trtllm-ckpt \

        --dtype bfloat16 \

        --tp_size 2


    # Build the engine for 2-GPU tensor parallelism

    trtllm-build \

        --checkpoint_dir /models/llama-3.1-70b-trtllm-ckpt \

        --output_dir /models/llama-3.1-70b-trtllm-engine \

        --gemm_plugin bfloat16 \

        --max_batch_size 4 \

        --max_input_len 2048 \

        --max_seq_len 4096


Copy the engine to Node B (it must be identical on both nodes):


    rsync -avz \

        /models/llama-3.1-70b-trtllm-engine/ \

        aiuser@10.0.0.2:/models/llama-3.1-70b-trtllm-engine/


Now launch the TensorRT-LLM server across both nodes using mpirun:


    mpirun \

        -n 2 \

        --host 10.0.0.1,10.0.0.2 \

        --mca btl_tcp_if_include enp1s0f0np0 \

        python3 -m tensorrt_llm.serve \

            --engine-dir /models/llama-3.1-70b-trtllm-engine \

            --host 0.0.0.0 \

            --port 8080


The -n 2 flag launches 2 MPI processes (one per node). The --host flag

specifies the two nodes using their ConnectX-7 IP addresses. The

--mca btl_tcp_if_include flag tells MPI to use the ConnectX-7 interface for

communication, routing all inter-node traffic over the high-speed direct link.


11.5  Querying the TensorRT-LLM Server


The TensorRT-LLM server exposes an OpenAI-compatible API, so the same client

code works as for vLLM and SGLang. The following example adds performance

measurement to help you appreciate the speed difference:


    import time

    import requests

    import json

    from dataclasses import dataclass



    @dataclass

    class InferenceResult:

        """

        Container for inference results including performance metrics.


        This dataclass bundles the generated text with timing information,

        making it easy to compare performance across different inference

        engines and configurations.

        """

        response_text: str

        prompt_tokens: int

        completion_tokens: int

        total_time_seconds: float

        tokens_per_second: float



    def query_trtllm_server(

        prompt: str,

        host: str = "localhost",

        port: int = 8080,

        max_tokens: int = 256,

        temperature: float = 0.7

    ) -> InferenceResult:

        """

        Query the TensorRT-LLM server and measure performance.


        This function sends a completion request to the TensorRT-LLM server

        and measures the time taken to generate the response. The tokens

        per second metric is the key performance indicator for LLM inference:

        higher is better, and TensorRT-LLM typically achieves the highest

        values of any inference framework on NVIDIA hardware.


        Args:

            prompt:      The text prompt to complete.

            host:        The TensorRT-LLM server hostname or IP.

            port:        The server port number.

            max_tokens:  Maximum tokens to generate.

            temperature: Sampling temperature.


        Returns:

            An InferenceResult with the response and performance metrics.

        """

        url = f"http://{host}:{port}/v1/completions"


        payload = {

            "model": "tensorrt-llm",  # TensorRT-LLM uses this as a placeholder.

            "prompt": prompt,

            "max_tokens": max_tokens,

            "temperature": temperature,

            "stream": False

        }


        # Record the start time before sending the request.

        start_time = time.perf_counter()


        response = requests.post(url, json=payload)

        response.raise_for_status()


        # Record the end time after receiving the complete response.

        end_time = time.perf_counter()


        data = response.json()

        total_time = end_time - start_time


        # Extract usage statistics from the response.

        usage = data.get("usage", {})

        prompt_tokens = usage.get("prompt_tokens", 0)

        completion_tokens = usage.get("completion_tokens", 0)


        # Calculate tokens per second. This is the primary performance metric.

        # Divide by total time to get the overall throughput including

        # network overhead and server processing time.

        tokens_per_second = completion_tokens / total_time if total_time > 0 else 0


        response_text = data["choices"][0]["text"] if data.get("choices") else ""


        return InferenceResult(

            response_text=response_text,

            prompt_tokens=prompt_tokens,

            completion_tokens=completion_tokens,

            total_time_seconds=total_time,

            tokens_per_second=tokens_per_second

        )



    if __name__ == "__main__":

        test_prompt = (

            "Explain the difference between tensor parallelism and "

            "pipeline parallelism in distributed deep learning:"

        )


        print("Querying TensorRT-LLM server (two-node distributed)...")

        result = query_trtllm_server(

            prompt=test_prompt,

            host="localhost",

            port=8080,

            max_tokens=256

        )


        print(f"\nResponse:\n{result.response_text}")

        print(f"\nPerformance Metrics:")

        print(f"  Prompt tokens:     {result.prompt_tokens}")

        print(f"  Completion tokens: {result.completion_tokens}")

        print(f"  Total time:        {result.total_time_seconds:.2f} seconds")

        print(f"  Throughput:        {result.tokens_per_second:.1f} tokens/second")


Chapter 12 - WRITING CODE THAT TALKS TO LOCAL AND REMOTE LLMS


12.1  The Unified Client: One Interface, Five Engines


One of the most powerful patterns when working with multiple inference engines

is to write a unified client that abstracts away the differences between them.

All five engines we have covered (Ollama, LM Studio, vLLM, SGLang, and

TensorRT-LLM) expose either the Ollama API or the OpenAI API. This means we

can write a single client class that works with all of them by simply changing

the endpoint URL.


The following code implements a comprehensive unified LLM client that supports

both local models (via Ollama) and remote models (via any OpenAI-compatible

endpoint). It includes retry logic, error handling, and performance monitoring:


    import time

    import json

    import logging

    import requests

    from enum import Enum

    from dataclasses import dataclass, field

    from typing import Optional, Iterator

    from openai import OpenAI



    # Configure logging so we can see what the client is doing.

    # In production, you would configure this to write to a file.

    logging.basicConfig(

        level=logging.INFO,

        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"

    )

    logger = logging.getLogger("llm_client")



    class BackendType(Enum):

        """

        Enumeration of supported LLM backend types.


        Each backend type corresponds to a different inference engine.

        The client uses this to determine which API protocol to use

        and how to format requests and parse responses.

        """

        OLLAMA = "ollama"

        OPENAI_COMPATIBLE = "openai_compatible"  # vLLM, LM Studio, SGLang, TensorRT-LLM



    @dataclass

    class LLMConfig:

        """

        Configuration for connecting to an LLM backend.


        This dataclass holds all the information needed to connect to

        and communicate with an LLM inference server. Separating

        configuration from logic makes it easy to switch between

        different servers and models.

        """

        host: str

        port: int

        model: str

        backend_type: BackendType

        api_key: str = "not-required"

        timeout_seconds: int = 120

        max_retries: int = 3

        retry_delay_seconds: float = 1.0


        @property

        def base_url(self) -> str:

            """Construct the base URL for the API endpoint."""

            if self.backend_type == BackendType.OLLAMA:

                return f"http://{self.host}:{self.port}"

            else:

                return f"http://{self.host}:{self.port}/v1"



    @dataclass

    class ChatMessage:

        """A single message in a conversation."""

        role: str    # "system", "user", or "assistant"

        content: str


        def to_dict(self) -> dict:

            """Convert to the dict format expected by API endpoints."""

            return {"role": self.role, "content": self.content}



    @dataclass

    class GenerationResult:

        """

        The result of a text generation request.


        This dataclass captures both the generated content and metadata

        about the generation, including timing information that helps

        you understand and optimize your inference pipeline.

        """

        content: str

        model: str

        backend: str

        prompt_tokens: int = 0

        completion_tokens: int = 0

        generation_time_seconds: float = 0.0


        @property

        def tokens_per_second(self) -> float:

            """Calculate the generation throughput in tokens per second."""

            if self.generation_time_seconds > 0 and self.completion_tokens > 0:

                return self.completion_tokens / self.generation_time_seconds

            return 0.0


        def __str__(self) -> str:

            return (

                f"GenerationResult(\n"

                f"  backend={self.backend},\n"

                f"  model={self.model},\n"

                f"  tokens={self.completion_tokens},\n"

                f"  speed={self.tokens_per_second:.1f} tok/s\n"

                f")"

            )



    class UnifiedLLMClient:

        """

        A unified client for interacting with multiple LLM inference backends.


        This client provides a consistent interface for sending requests to

        any of the five inference engines covered in this tutorial. It handles

        the differences in API protocols, request formats, and response

        structures transparently.


        The client supports both the Ollama native API and the OpenAI-compatible

        API, automatically selecting the correct protocol based on the backend

        type specified in the configuration.


        Usage example:

            # Configure for local Ollama

            config = LLMConfig(

                host="localhost",

                port=11434,

                model="llama3.2:3b",

                backend_type=BackendType.OLLAMA

            )

            client = UnifiedLLMClient(config)

            result = client.chat([ChatMessage("user", "Hello!")])

        """


        def __init__(self, config: LLMConfig) -> None:

            """

            Initialize the client with the given configuration.


            Args:

                config: The LLMConfig specifying which server to connect to.

            """

            self.config = config

            self._openai_client: Optional[OpenAI] = None


            # Only create the OpenAI client for OpenAI-compatible backends.

            if config.backend_type == BackendType.OPENAI_COMPATIBLE:

                self._openai_client = OpenAI(

                    base_url=config.base_url,

                    api_key=config.api_key,

                    timeout=config.timeout_seconds

                )


            logger.info(

                f"Initialized LLM client: {config.backend_type.value} "

                f"at {config.host}:{config.port} "

                f"using model '{config.model}'"

            )


        def _chat_via_ollama(

            self,

            messages: list[ChatMessage],

            temperature: float,

            max_tokens: int

        ) -> GenerationResult:

            """

            Send a chat request using the Ollama native API.


            The Ollama API uses a different request format than the OpenAI API.

            It accepts messages in a similar format but uses different field

            names and response structure.


            Args:

                messages:    The conversation history.

                temperature: Sampling temperature.

                max_tokens:  Maximum tokens to generate.


            Returns:

                A GenerationResult with the response and metadata.

            """

            url = f"{self.config.base_url}/api/chat"


            payload = {

                "model": self.config.model,

                "messages": [msg.to_dict() for msg in messages],

                "stream": False,

                "options": {

                    "temperature": temperature,

                    "num_predict": max_tokens

                }

            }


            start_time = time.perf_counter()


            response = requests.post(

                url,

                json=payload,

                timeout=self.config.timeout_seconds

            )

            response.raise_for_status()


            elapsed = time.perf_counter() - start_time

            data = response.json()


            # Extract the response content from the Ollama API response format.

            content = data.get("message", {}).get("content", "")


            # Ollama provides token counts in the response metadata.

            prompt_tokens = data.get("prompt_eval_count", 0)

            completion_tokens = data.get("eval_count", 0)


            return GenerationResult(

                content=content,

                model=self.config.model,

                backend="ollama",

                prompt_tokens=prompt_tokens,

                completion_tokens=completion_tokens,

                generation_time_seconds=elapsed

            )


        def _chat_via_openai(

            self,

            messages: list[ChatMessage],

            temperature: float,

            max_tokens: int

        ) -> GenerationResult:

            """

            Send a chat request using the OpenAI-compatible API.


            This method works with vLLM, LM Studio, SGLang, and TensorRT-LLM,

            all of which implement the OpenAI chat completions API.


            Args:

                messages:    The conversation history.

                temperature: Sampling temperature.

                max_tokens:  Maximum tokens to generate.


            Returns:

                A GenerationResult with the response and metadata.

            """

            assert self._openai_client is not None, (

                "OpenAI client not initialized. "

                "Check that backend_type is OPENAI_COMPATIBLE."

            )


            start_time = time.perf_counter()


            completion = self._openai_client.chat.completions.create(

                model=self.config.model,

                messages=[msg.to_dict() for msg in messages],

                temperature=temperature,

                max_tokens=max_tokens

            )


            elapsed = time.perf_counter() - start_time


            content = completion.choices[0].message.content or ""

            usage = completion.usage


            return GenerationResult(

                content=content,

                model=self.config.model,

                backend=self.config.backend_type.value,

                prompt_tokens=usage.prompt_tokens if usage else 0,

                completion_tokens=usage.completion_tokens if usage else 0,

                generation_time_seconds=elapsed

            )


        def chat(

            self,

            messages: list[ChatMessage],

            temperature: float = 0.7,

            max_tokens: int = 512

        ) -> GenerationResult:

            """

            Send a chat request with automatic retry on failure.


            This is the primary public method for sending requests. It

            automatically selects the correct API protocol based on the

            backend type and retries failed requests up to max_retries times.


            Args:

                messages:    The conversation history as a list of ChatMessages.

                temperature: Sampling temperature (0.0 = deterministic).

                max_tokens:  Maximum number of tokens to generate.


            Returns:

                A GenerationResult with the response and performance metrics.


            Raises:

                RuntimeError: If all retry attempts fail.

            """

            last_error: Optional[Exception] = None


            for attempt in range(self.config.max_retries):

                try:

                    if self.config.backend_type == BackendType.OLLAMA:

                        result = self._chat_via_ollama(

                            messages, temperature, max_tokens

                        )

                    else:

                        result = self._chat_via_openai(

                            messages, temperature, max_tokens

                        )


                    logger.info(

                        f"Request completed: {result.completion_tokens} tokens "

                        f"at {result.tokens_per_second:.1f} tok/s"

                    )

                    return result


                except Exception as error:

                    last_error = error

                    logger.warning(

                        f"Request attempt {attempt + 1} failed: {error}. "

                        f"Retrying in {self.config.retry_delay_seconds}s..."

                    )

                    time.sleep(self.config.retry_delay_seconds)


            raise RuntimeError(

                f"All {self.config.max_retries} attempts failed. "

                f"Last error: {last_error}"

            )


        def stream_chat(

            self,

            messages: list[ChatMessage],

            temperature: float = 0.7,

            max_tokens: int = 512

        ) -> Iterator[str]:

            """

            Stream a chat response token by token.


            This method yields tokens as they are generated, which is

            useful for building responsive user interfaces. It uses

            Ollama's native streaming API for Ollama backends and the

            OpenAI streaming protocol for all other backends.


            Args:

                messages:    The conversation history.

                temperature: Sampling temperature.

                max_tokens:  Maximum tokens to generate.


            Yields:

                Individual tokens as strings.

            """

            if self.config.backend_type == BackendType.OLLAMA:

                # Use the Ollama streaming API.

                url = f"{self.config.base_url}/api/chat"

                payload = {

                    "model": self.config.model,

                    "messages": [msg.to_dict() for msg in messages],

                    "stream": True,

                    "options": {"temperature": temperature, "num_predict": max_tokens}

                }

                with requests.post(url, json=payload, stream=True) as response:

                    response.raise_for_status()

                    for line in response.iter_lines():

                        if line:

                            chunk = json.loads(line)

                            token = chunk.get("message", {}).get("content", "")

                            if token:

                                yield token

                            if chunk.get("done", False):

                                break

            else:

                # Use the OpenAI streaming API.

                assert self._openai_client is not None

                stream = self._openai_client.chat.completions.create(

                    model=self.config.model,

                    messages=[msg.to_dict() for msg in messages],

                    temperature=temperature,

                    max_tokens=max_tokens,

                    stream=True

                )

                for chunk in stream:

                    if chunk.choices and chunk.choices[0].delta.content:

                        yield chunk.choices[0].delta.content



    def demonstrate_all_backends() -> None:

        """

        Demonstrate the unified client with all five inference backends.


        This function creates a client for each backend and sends the same

        question to all of them, then compares the responses and performance.

        It assumes all servers are running on the appropriate ports as

        configured throughout this tutorial.

        """


        # Define configurations for all five backends.

        # Adjust host addresses and ports to match your actual setup.

        configs = [

            LLMConfig(

                host="localhost",

                port=11434,

                model="llama3.2:3b",

                backend_type=BackendType.OLLAMA

            ),

            LLMConfig(

                host="localhost",

                port=1234,

                model="local-model",

                backend_type=BackendType.OPENAI_COMPATIBLE

            ),

            LLMConfig(

                host="localhost",

                port=8000,

                model="meta-llama/Llama-3.1-70B-Instruct",

                backend_type=BackendType.OPENAI_COMPATIBLE

            ),

            LLMConfig(

                host="localhost",

                port=30000,

                model="meta-llama/Llama-3.1-8B-Instruct",

                backend_type=BackendType.OPENAI_COMPATIBLE

            ),

            LLMConfig(

                host="localhost",

                port=8080,

                model="tensorrt-llm",

                backend_type=BackendType.OPENAI_COMPATIBLE

            )

        ]


        backend_names = ["Ollama", "LM Studio", "vLLM", "SGLang", "TensorRT-LLM"]


        # The same question is sent to all backends for a fair comparison.

        question = (

            "In one paragraph, explain why unified memory architecture is "

            "important for running large language models."

        )


        messages = [

            ChatMessage(

                role="system",

                content="You are a concise technical expert. Answer in one paragraph."

            ),

            ChatMessage(role="user", content=question)

        ]


        print("=" * 70)

        print("COMPARING ALL FIVE INFERENCE BACKENDS")

        print("=" * 70)

        print(f"Question: {question}\n")


        for config, name in zip(configs, backend_names):

            print(f"\n--- {name} ---")

            try:

                client = UnifiedLLMClient(config)

                result = client.chat(messages=messages, temperature=0.3, max_tokens=256)

                print(f"Response: {result.content}")

                print(f"Speed: {result.tokens_per_second:.1f} tokens/second")

            except Exception as error:

                print(f"Error connecting to {name}: {error}")

                print("(Make sure the server is running on the expected port)")



    if __name__ == "__main__":

        demonstrate_all_backends()


Chapter 13 - MONITORING, TROUBLESHOOTING, AND KEEPING THINGS RUNNING


13.1  Real-Time GPU Monitoring


Understanding what your GPUs are doing is essential for diagnosing performance

issues and ensuring your inference workloads are running efficiently. The

primary tool for this is nvidia-smi, which you can run in watch mode to get

a continuously updating display:


    watch -n 1 nvidia-smi


This refreshes the output every second. You will see GPU utilization

(ideally close to 100% during inference), memory usage (which grows as models

are loaded), temperature (should stay below 85°C for sustained workloads),

and power consumption.


The nvtop tool provides a more visual, htop-like interface:


    nvtop


For monitoring both nodes simultaneously from a single terminal, you can use

SSH to run nvidia-smi on Node B and display the output locally:


    # In one terminal pane: monitor Node A

    watch -n 1 nvidia-smi


    # In another terminal pane: monitor Node B via SSH

    ssh aiuser@10.0.0.2 "watch -n 1 nvidia-smi"


13.2  Monitoring Network Performance


During distributed inference, the ConnectX-7 link is the critical path. If

the network is not performing well, your distributed inference will be slow

regardless of how fast the GPUs are. Monitor network throughput with:


    # Watch network interface statistics in real time.

    # Replace enp1s0f0np0 with your actual ConnectX-7 interface name.

    watch -n 1 "cat /proc/net/dev | grep enp1s0f0np0"


For a more detailed view, use the sar tool (part of the sysstat package):


    sudo apt install -y sysstat

    sar -n DEV 1 100


This shows network statistics for all interfaces, updated every second, for

100 iterations. Look for the enp1s0f0np0 interface and check that the

rxkB/s and txkB/s values are consistent with your expected workload.
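The counters behind these tools live in /proc/net/dev, and you can read them directly when you want throughput numbers inside your own scripts. The sketch below samples the counters twice and converts the delta to gigabits per second; the interface name is the example used throughout this tutorial, so substitute your own.

```python
import time


def read_interface_bytes(interface: str, path: str = "/proc/net/dev") -> tuple[int, int]:
    """Return cumulative (rx_bytes, tx_bytes) for an interface from /proc/net/dev."""
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue  # Skip the two header lines.
            name, _, stats = line.partition(":")
            if name.strip() == interface:
                fields = stats.split()
                # Per the /proc/net/dev layout: field 0 is RX bytes, field 8 is TX bytes.
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {interface!r} not found")


def measure_throughput_gbps(interface: str, interval_s: float = 1.0) -> tuple[float, float]:
    """Sample the counters twice and return (rx_gbps, tx_gbps) over the interval."""
    rx1, tx1 = read_interface_bytes(interface)
    time.sleep(interval_s)
    rx2, tx2 = read_interface_bytes(interface)
    scale = 8 / interval_s / 1e9  # bytes -> gigabits per second
    return (rx2 - rx1) * scale, (tx2 - tx1) * scale
```

During distributed inference, measure_throughput_gbps("enp1s0f0np0") on either node tells you how much of the 100G link you are actually using.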


13.3  Setting Up Systemd Services for Inference Engines


For production use, you want your inference servers to start automatically

when the machine boots and to restart automatically if they crash. Systemd

services handle this perfectly. Here is an example service file for vLLM:


    sudo nano /etc/systemd/system/vllm-server.service


Enter the following content, adjusting paths and parameters for your setup:


    [Unit]

    Description=vLLM OpenAI-Compatible Inference Server

    After=network.target

    Wants=network.target


    [Service]

    Type=simple

    User=aiuser

    WorkingDirectory=/home/aiuser

    Environment="PATH=/home/aiuser/vllm-env/bin:/usr/local/bin:/usr/bin:/bin"

    Environment="HF_TOKEN=your_huggingface_token_here"

    ExecStart=/home/aiuser/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \

        --model meta-llama/Llama-3.1-8B-Instruct \

        --host 0.0.0.0 \

        --port 8000 \

        --dtype bfloat16 \

        --max-model-len 8192

    Restart=always

    RestartSec=10

    StandardOutput=journal

    StandardError=journal


    [Install]

    WantedBy=multi-user.target


Enable and start the service:


    sudo systemctl daemon-reload

    sudo systemctl enable vllm-server

    sudo systemctl start vllm-server


View the service logs:


    journalctl -u vllm-server -f


The -f flag follows the log in real time, similar to tail -f. This is

invaluable for debugging startup issues.
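You can also check unit state programmatically, which pairs nicely with the health check script in the next section. The sketch below uses "systemctl show"; the unit name vllm-server matches the service file above, and the parsing helper is our own, not part of any library. NRestarts is particularly useful because Restart=always can silently mask a crash loop.

```python
import subprocess


def parse_properties(output: str) -> dict[str, str]:
    """Parse the 'Key=Value' lines emitted by 'systemctl show'."""
    props: dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def unit_state(unit: str) -> dict[str, str]:
    """Query a systemd unit's state, e.g. {'ActiveState': 'active', ...}."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState,NRestarts"],
        capture_output=True, text=True, timeout=10,
    )
    return parse_properties(result.stdout)
```

A healthy server reports ActiveState "active" and SubState "running"; a steadily climbing NRestarts means the process keeps dying and being revived, and journalctl will tell you why.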


13.4  A Health Check Script


The following script checks the health of all inference engines and reports

which ones are running and responding correctly. Run this after a reboot or

when you suspect something is not working:


    import requests

    import subprocess

    from dataclasses import dataclass



    @dataclass

    class ServiceStatus:

        """Status information for a single inference service."""

        name: str

        host: str

        port: int

        is_running: bool

        response_time_ms: float

        error_message: str = ""



    def check_service_health(

        name: str,

        host: str,

        port: int,

        health_endpoint: str = "/health",

        timeout: float = 5.0

    ) -> ServiceStatus:

        """

        Check whether an inference service is running and responding.


        This function attempts to connect to the service's health endpoint

        and measures the response time. A successful response (HTTP 200)

        indicates the service is healthy. Any error indicates a problem

        that needs investigation.


        Args:

            name:            Human-readable name of the service.

            host:            The hostname or IP of the service.

            port:            The port number.

            health_endpoint: The URL path for the health check endpoint.

            timeout:         Maximum time to wait for a response in seconds.


        Returns:

            A ServiceStatus object with the health check results.

        """

        import time


        url = f"http://{host}:{port}{health_endpoint}"

        start = time.perf_counter()


        try:

            response = requests.get(url, timeout=timeout)

            elapsed_ms = (time.perf_counter() - start) * 1000


            return ServiceStatus(

                name=name,

                host=host,

                port=port,

                is_running=response.status_code == 200,

                response_time_ms=elapsed_ms

            )


        except requests.exceptions.ConnectionError:

            return ServiceStatus(

                name=name,

                host=host,

                port=port,

                is_running=False,

                response_time_ms=0.0,

                error_message="Connection refused - service may not be running"

            )

        except requests.exceptions.Timeout:

            return ServiceStatus(

                name=name,

                host=host,

                port=port,

                is_running=False,

                response_time_ms=timeout * 1000,

                error_message="Timeout - service is running but not responding"

            )



    def check_gpu_health() -> dict:

        """

        Check GPU health using nvidia-smi.


        Returns a dict with GPU temperature, utilization, and memory usage.

        This helps identify whether the GPU is overheating or running out

        of memory, which are common causes of inference failures.

        """

        try:

            result = subprocess.run(

                [

                    "nvidia-smi",

                    "--query-gpu=temperature.gpu,utilization.gpu,memory.used,memory.total",

                    "--format=csv,noheader,nounits"

                ],

                capture_output=True,

                text=True,

                timeout=10

            )


            if result.returncode == 0:

                values = result.stdout.strip().split(", ")

                return {

                    "temperature_c": int(values[0]),

                    "utilization_pct": int(values[1]),

                    "memory_used_mb": int(values[2]),

                    "memory_total_mb": int(values[3])

                }

            # A nonzero exit code means nvidia-smi itself reported a problem.

            return {"error": result.stderr.strip() or "nvidia-smi exited with an error"}

        except Exception as error:

            return {"error": str(error)}



    if __name__ == "__main__":

        print("=" * 60)

        print("DGX SPARK INFERENCE ENGINE HEALTH CHECK")

        print("=" * 60)


        # Define all services to check.

        services_to_check = [

            ("Ollama (Node A)",      "localhost", 11434, "/api/tags"),

            ("Ollama (Node B)",      "10.0.0.2",  11434, "/api/tags"),

            ("LM Studio (Node A)",   "localhost", 1234,  "/v1/models"),

            ("vLLM (Node A)",        "localhost", 8000,  "/health"),

            ("SGLang (Node A)",      "localhost", 30000, "/health"),

            ("TensorRT-LLM (Node A)","localhost", 8080,  "/health"),

        ]


        print("\nService Status:")

        print("-" * 60)


        for name, host, port, endpoint in services_to_check:

            status = check_service_health(name, host, port, endpoint)

            status_str = "RUNNING" if status.is_running else "DOWN"

            if status.is_running:

                print(

                    f"  [{status_str:7}] {name:<30} "

                    f"{status.response_time_ms:.1f}ms"

                )

            else:

                print(

                    f"  [{status_str:7}] {name:<30} "

                    f"{status.error_message}"

                )


        print("\nGPU Health (Node A):")

        print("-" * 60)

        gpu_info = check_gpu_health()

        if "error" not in gpu_info:

            print(f"  Temperature:  {gpu_info['temperature_c']}°C")

            print(f"  Utilization:  {gpu_info['utilization_pct']}%")

            memory_used_gb = gpu_info['memory_used_mb'] / 1024

            memory_total_gb = gpu_info['memory_total_mb'] / 1024

            print(f"  Memory:       {memory_used_gb:.1f}GB / {memory_total_gb:.1f}GB")

        else:

            print(f"  Error: {gpu_info['error']}")


13.5  Common Problems and Solutions


The following describes the most common issues you will encounter and how to

resolve them.


If nvidia-smi shows "No devices were found" after a kernel update, the GPU

driver module may not have been recompiled for the new kernel. Run

"sudo apt install --reinstall nvidia-driver-570" (or whatever the current

driver version is) to reinstall the driver, which triggers recompilation.


If an inference server fails to start with "CUDA out of memory," another

process is using GPU memory. Run "nvidia-smi" to identify the process, then

kill it with "sudo kill -9 <PID>". Also check that you are not trying to load

a model that exceeds the available memory.


If the ConnectX-7 link shows as "down" in "ip link show," check that the DAC

cable is fully seated in both ports. Try unplugging and replugging the cable.

If the problem persists, verify the cable is rated for 100G (QSFP28) and not

40G (QSFP+).


If NCCL reports "Connection refused" during multi-node startup, the firewall

may be blocking the NCCL communication ports. Disable the firewall temporarily

for testing: "sudo ufw disable". If this fixes the problem, add rules to allow

traffic on the NCCL ports (typically 29500 and above) between the two nodes.


If vLLM's Ray cluster fails to form, ensure that the Ray head node is fully

started before running "ray start" on the worker node. The head node logs

should show "Ray runtime started" before you proceed with the worker.


Chapter 14 - CLOSING THOUGHTS AND NEXT STEPS


14.1  What You Have Accomplished


If you have followed this tutorial to this point, you have done something

genuinely impressive. You have set up two of the most powerful personal AI

workstations available, connected them with a high-speed 100GbE direct link,

configured RDMA for low-latency GPU-to-GPU communication, and installed five

different inference engines that cover the full spectrum from beginner-friendly

to maximum-performance. You have also written Python code that can talk to all

of these engines, both locally and across the network.


This is not a trivial achievement. Many organizations spend months and

significant engineering resources to build AI inference infrastructure at this

level. You now have it running on two machines on your desk.


14.2  Choosing the Right Tool for the Job


Now that you have all five engines available, here is a practical guide to

choosing the right one for different situations.


Ollama is the right choice when you want to quickly experiment with a new

model, when you need the simplest possible setup, or when you are building a

prototype that you want to get running in minutes rather than hours. Its model

library is excellent, and the REST API is simple and well-documented.


LM Studio is the right choice when you want a graphical interface for

interactive model exploration, when you are demonstrating AI capabilities to

non-technical stakeholders, or when you want to quickly compare different

models side by side without writing code.


vLLM is the right choice when you need to serve many concurrent users with

high throughput, when you are building a production API service, or when you

need the flexibility of multi-node tensor parallelism for very large models.

Its PagedAttention makes it the most memory-efficient option for high-concurrency

workloads.


SGLang is the right choice when your application involves structured output

generation (JSON, XML, specific formats), complex multi-step reasoning chains,

or workloads where many requests share a common prefix (like a shared system

prompt). Its RadixAttention makes it uniquely efficient for these patterns.


TensorRT-LLM is the right choice when raw inference speed is the paramount

concern and you are willing to invest time in the compilation process. If you

are running the same model continuously in production and need the absolute

maximum tokens per second, TensorRT-LLM will outperform all other options on

NVIDIA hardware.


14.3  Next Steps and Further Exploration


The setup described in this tutorial is a solid foundation, but there is much

more to explore. Fine-tuning models on your own data using frameworks like

Hugging Face PEFT and LoRA is a natural next step that allows you to customize

models for your specific domain. The DGX Spark's unified memory architecture

makes fine-tuning of 7B-13B parameter models feasible on a single node.


Exploring quantization techniques - specifically GPTQ, AWQ, and GGUF - will

help you fit larger models into the available memory and run them faster.

Quantization reduces the precision of model weights (from 16-bit to 8-bit or

4-bit), trading a small amount of quality for significant reductions in memory

usage and inference time.
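The arithmetic behind that trade-off is worth internalizing. Here is a back-of-the-envelope sketch; note that it counts weights only and ignores activations and KV cache, which add a meaningful overhead on top.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Memory needed just for the weights, in decimal gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


# A 70B-parameter model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70, bits):.0f} GB")
# -> 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```

At 16-bit a 70B model needs roughly 140 GB for weights alone, which is exactly why it has to span both Sparks; at 4-bit it drops to about 35 GB and fits comfortably on a single node, leaving headroom for the KV cache.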


Building a proper model serving pipeline with load balancing, request queuing,

and monitoring using tools like Prometheus and Grafana will prepare your setup

for production use. The health check script in Chapter 13 is a starting point,

but a full observability stack gives you much deeper insight into system

behavior.


Experimenting with multimodal models - models that can process both text and

images - is another exciting direction. The DGX Spark's memory capacity makes

it well-suited for models like LLaVA, Qwen-VL, and similar vision-language

models.


Finally, connecting your two-node DGX Spark cluster to a larger network of

machines, or integrating it with cloud resources for burst capacity, opens up

possibilities for truly large-scale AI workloads. The skills and concepts you

have learned in this tutorial - RDMA networking, distributed inference,

multi-node coordination - are directly applicable to clusters of any size.


The machines are ready. The software is installed. The network is configured.

What you build next is entirely up to you.


APPENDIX: QUICK REFERENCE CARD


NETWORK ADDRESSES

  Node A management:   192.168.1.100

  Node B management:   192.168.1.101

  Node A ConnectX-7:   10.0.0.1

  Node B ConnectX-7:   10.0.0.2


INFERENCE ENGINE PORTS

  Ollama:              11434  (both nodes)

  LM Studio:           1234   (both nodes)

  vLLM:                8000   (Node A, head node)

  SGLang:              30000  (Node A, head node)

  TensorRT-LLM:        8080   (Node A, head node)


ESSENTIAL COMMANDS

  Check GPU status:    nvidia-smi

  Monitor GPU live:    watch -n 1 nvidia-smi

  Monitor GPU visual:  nvtop

  Check network:       ip link show

  Test bandwidth:      iperf3 -c 10.0.0.2 -t 30 -P 4

  Pull Ollama model:   ollama pull llama3.2:3b

  Run Ollama model:    ollama run llama3.2:3b

  Check service:       systemctl status <service-name>

  View service logs:   journalctl -u <service-name> -f


ENVIRONMENT VARIABLES FOR NCCL

  NCCL_IB_GID_INDEX=3

  NCCL_IB_DISABLE=0

  NCCL_NET_GDR_LEVEL=5

  NCCL_SOCKET_IFNAME=enp1s0f0np0



