(c) Nvidia
Chapter 0 - BEFORE WE BEGIN: WHAT IS THIS ALL ABOUT?
Imagine having a personal AI supercomputer sitting on your desk. Not a cloud
instance you rent by the hour, not a shared cluster you have to queue for, but
a machine that belongs to you, runs entirely offline if you want, and can
execute large language models that would make most laptops weep. Now imagine
having two of them, connected at high speed, working in concert. That is
exactly what this tutorial is about.
The NVIDIA DGX Spark is not a gaming PC with a fancy GPU strapped to it. It
is a purpose-built AI workstation that packs a genuinely remarkable amount of
compute into a compact desktop chassis. When you connect two of them with
NVIDIA's ConnectX-7 networking adapter, you create a small but serious
two-node AI cluster capable of running models that exceed what either machine
could handle alone.
This tutorial assumes you are reasonably comfortable with Linux command-line
basics - you know what a terminal is, you can type commands, and you are not
afraid of a configuration file. Beyond that, no deep expertise is required.
Every concept will be explained from the ground up, including the "why" behind
each decision, not just the "how." By the time you finish reading, you will
understand what you are doing and why it works, not just which commands to
type.
We will cover five different inference engines - Ollama, LM Studio, vLLM,
SGLang, and TensorRT-LLM - because different tools excel at different tasks,
and a well-equipped AI practitioner knows when to reach for which tool. We
will also write actual Python code that communicates with these engines, both
locally and across the network between your two machines.
Let us begin.
Chapter 1 - MEET THE MACHINE: THE NVIDIA DGX SPARK DEEP DIVE
1.1 The Big Picture
The DGX Spark is built around a single chip that represents one of the most
significant architectural leaps in AI hardware in recent years: the NVIDIA GB10
Grace Blackwell Superchip. To understand why this chip matters, we need to
briefly discuss what has historically been the biggest bottleneck in running
large language models on a single machine.
Traditionally, a computer has a CPU (the general-purpose processor that runs
your operating system and applications) and a GPU (the massively parallel
processor that handles AI computations). These two components sit on separate
chips, connected by a PCIe bus. PCIe is fast by everyday standards, but it is
glacially slow compared to the internal buses within each chip. When a large
language model runs, data must constantly shuttle back and forth between CPU
memory (RAM) and GPU memory (VRAM). This shuttle service is expensive in both
time and energy.
The GB10 solves this problem by eliminating the separation entirely. The Grace
CPU and the Blackwell GPU are connected via NVLink-C2C (Chip-to-Chip), a
proprietary interconnect that delivers 900 gigabytes per second of bidirectional
bandwidth. For context, a typical PCIe 5.0 x16 connection delivers roughly
64 GB/s in each direction, or about 128 GB/s bidirectional. The NVLink-C2C
connection is therefore roughly seven times faster. This is not a minor
improvement; it is a qualitative change in what becomes possible.
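To get an intuition for what these numbers mean in practice, here is a back-of-envelope sketch in Python. It treats the 900 GB/s figure as roughly 450 GB/s in each direction and compares it with PCIe 5.0 x16 at about 64 GB/s per direction; the 40 GB payload is an arbitrary example.

```python
def transfer_time_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Seconds needed to move size_gb of data over a link of the given bandwidth."""
    return size_gb / bandwidth_gb_per_s

# Per-direction bandwidths (assumption: the 900 GB/s NVLink-C2C figure is
# bidirectional, so roughly 450 GB/s each way; PCIe 5.0 x16 is ~64 GB/s each way).
PCIE5_X16 = 64.0
NVLINK_C2C = 450.0

size = 40.0  # GB of model weights to move, as an example
print(f"PCIe 5.0 x16: {transfer_time_seconds(size, PCIE5_X16):.3f} s")
print(f"NVLink-C2C:   {transfer_time_seconds(size, NVLINK_C2C):.3f} s")
print(f"Speedup:      {NVLINK_C2C / PCIE5_X16:.1f}x")
```

The absolute numbers matter less than the ratio: every CPU-GPU transfer that a conventional machine pays for over PCIe is several times cheaper here.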
1.2 The Memory Architecture: Why 128GB Unified Memory Is a Game Changer
Because the CPU and GPU are so tightly coupled, NVIDIA designed the DGX Spark
with a single unified memory pool of 128 gigabytes of LPDDR5X memory. This
memory is simultaneously accessible by both the CPU cores and the Blackwell GPU
without any copying. When you load a 70-billion-parameter language model, it
lives in this shared pool and the GPU can access every byte of it at full speed.
To appreciate why this matters, consider a conventional workstation with a
high-end GPU that has 24GB of VRAM. If you want to run a model that requires
48GB, you simply cannot do it on that GPU alone - the model does not fit. You
would need to either quantize the model aggressively (reducing its quality) or
use multiple GPUs. On the DGX Spark, a 48GB model fits comfortably in the
128GB unified pool, and the GPU can access it as if it were native VRAM.
The practical consequence is that the DGX Spark can run models in the 70B
parameter range in 8-bit precision (at FP16, a 70B model needs roughly 140 GB
for its weights alone, slightly more than the pool holds), or models
approaching 200B parameters in 4-bit quantized form. This is extraordinary
for a single desktop machine.
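You can sanity-check what fits with a few lines of Python. This rule of thumb counts model weights only - the KV cache and runtime overhead also need memory, so real-world ceilings are somewhat lower:

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of a model's weights alone, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

UNIFIED_POOL_GB = 128

for params, bits in [(60, 16), (70, 8), (200, 4), (200, 8)]:
    size = weights_gb(params, bits)
    verdict = "fits" if size < UNIFIED_POOL_GB else "does not fit"
    print(f"{params}B at {bits}-bit: {size:.0f} GB -> {verdict}")
```

The same arithmetic explains why quantization matters so much: halving the bits per weight doubles the size of model you can hold in the same pool.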
1.3 The Compute Specifications
Let us look at the full hardware specification of the DGX Spark:
GPU: NVIDIA Blackwell GPU (part of GB10 Superchip)
AI Performance: Up to 1 PFLOPS (petaFLOP per second) at FP4 precision
CPU: 20-core NVIDIA Grace CPU (10 Arm Cortex-X925 + 10 Cortex-A725 cores)
Memory: 128 GB LPDDR5X unified memory (shared CPU + GPU)
Storage: 4 TB NVMe SSD
Networking: NVIDIA ConnectX-7 (100 Gigabit Ethernet)
USB: Multiple USB 3.2 and USB-C ports
Display: DisplayPort output
Power: 170W TDP (Thermal Design Power)
OS: Ubuntu 24.04 LTS (pre-installed)
One teraFLOPS is one trillion floating-point operations per second. One
petaFLOPS is one thousand teraFLOPS, or one quadrillion operations per second.
At FP4 precision (4-bit floating point, commonly used for inference), the DGX
Spark delivers this performance while drawing only 170 watts - less than many
gaming GPUs draw on their own.
The 20-core Grace CPU pairs ten Arm Cortex-X925 performance cores with ten
Cortex-A725 efficiency cores, all modern Armv9 designs, and it is tuned for
the memory-intensive workloads that AI inference demands.
1.4 The Software Foundation
The DGX Spark ships with Ubuntu 24.04 LTS pre-installed. This is important
because Ubuntu 24.04 is a Long-Term Support release, meaning it will receive
security updates and support until 2029. NVIDIA has configured the system with
all necessary GPU drivers, CUDA libraries, and NVIDIA Container Toolkit
pre-installed. You do not need to hunt for drivers or fight with kernel
modules. The machine is ready to run AI workloads out of the box.
The CUDA toolkit on the DGX Spark is version 12.x, which modern inference
frameworks target. The system also includes cuDNN (CUDA Deep Neural Network
library), NCCL (NVIDIA Collective Communications Library, essential for
multi-node communication), and the NVIDIA Container Runtime, which allows
Docker containers to access the GPU directly.
Chapter 2 - THE NERVOUS SYSTEM: CONNECTX-7 AND HIGH-SPEED NETWORKING
2.1 What Is ConnectX-7?
The NVIDIA ConnectX-7 is a network adapter, but calling it "just a network
adapter" is like calling a Formula 1 car "just a car." ConnectX-7 is a
smart network interface card (SmartNIC) that supports both InfiniBand and
Ethernet protocols at speeds up to 400 Gb/s in its highest configurations.
In the DGX Spark, it operates at 100 Gigabit Ethernet (100GbE).
What makes ConnectX-7 special for AI workloads is its support for RDMA -
Remote Direct Memory Access. RDMA allows one machine to read from or write to
the memory of another machine directly, without involving the CPU of the
remote machine. In traditional networking, when machine A sends data to
machine B, machine B's CPU must be interrupted, the data must be copied from
the network buffer into application memory, and then the application can use
it. With RDMA, the data goes directly from machine A's memory to machine B's
memory, bypassing both CPUs entirely.
For distributed AI inference, this is enormously valuable. When two DGX Spark
units are running a model together and need to exchange intermediate results
(called activations) between layers, RDMA allows this exchange to happen at
near-memory speeds rather than through the kernel network stack. The latency
drops from tens of microseconds to a few microseconds, and the CPU is free to
do other work.
2.2 RoCE: RDMA Over Converged Ethernet
InfiniBand is the traditional protocol for RDMA in high-performance computing,
but it requires specialized InfiniBand switches and cables. The DGX Spark uses
a technology called RoCE (RDMA over Converged Ethernet, pronounced "rocky"),
which brings RDMA capabilities to standard Ethernet infrastructure. This means
you can connect two DGX Spark units with a standard 100GbE cable and still
get RDMA performance.
RoCE version 2 (RoCEv2) is the relevant standard here. It encapsulates RDMA
packets inside standard UDP/IP packets, which means they can be routed across
standard Ethernet networks. For a direct connection between two machines, this
is straightforward to configure.
2.3 The Cable You Need
To connect two DGX Spark units directly, you need one of the following:
A DAC (Direct Attach Copper) cable is the simplest option for short distances
up to about 5 meters. It is a passive cable with QSFP28 connectors on each
end that plugs directly into the ConnectX-7 port. DAC cables are inexpensive
and reliable for desk-to-desk connections.
An active optical cable (AOC), or a pair of QSFP28 optical transceivers with
fiber optic cable between them, is appropriate for longer distances, up to
hundreds of meters. This is more expensive but necessary if your two machines
are in different rooms or on different floors.
For most users setting up two DGX Spark units in the same office or lab, a
100GbE DAC cable of 1-3 meters is the right choice. Make sure it is rated for
QSFP28 (100G), not the older QSFP+ (40G) standard.
Chapter 3 - PHYSICAL SETUP: CABLES, POWER, AND FIRST BOOT
3.1 Unboxing and Placement
When your DGX Spark units arrive, give them time to reach room temperature
before powering them on, especially if they were shipped in cold weather.
Condensation inside electronics is not your friend. An hour at room temperature
is sufficient.
Place the units on a stable, flat surface with adequate airflow. The DGX Spark
has intake vents on the sides and exhaust at the rear. Leave at least 10 cm
(4 inches) of clearance on all sides. Do not stack them directly on top of
each other without a spacer, as the bottom unit's exhaust will feed hot air
into the top unit's intake. Side-by-side placement is ideal.
We will call the two machines "Node A" and "Node B" throughout this tutorial.
You can label them with a piece of tape if that helps you keep track.
3.2 Power Connections
Each DGX Spark uses a standard IEC C13 power connector (the same type used by
most desktop computers and monitors). Connect each unit to a power outlet or,
preferably, to a UPS (Uninterruptible Power Supply). A UPS protects against
sudden power loss, which can corrupt filesystems and interrupt long-running
AI jobs. For two machines drawing up to 170W each, a 1000VA UPS is more than
sufficient.
3.3 The ConnectX-7 Network Connection
Locate the ConnectX-7 port on the rear of each DGX Spark. It is a QSFP28
port, which looks like a slightly larger version of a standard SFP+ port.
Connect one end of your 100GbE DAC cable to Node A and the other end to Node B.
The cable is keyed and will only insert in the correct orientation. You should
feel a positive click when it is fully seated.
In addition to the direct ConnectX-7 connection between the two nodes, you
will also want to connect each machine to your regular office or home network
via the standard 1GbE or 10GbE Ethernet port. This management network is used
for internet access, software updates, and SSH access from your laptop. The
ConnectX-7 link is dedicated to high-speed AI traffic between the two nodes.
3.4 Display, Keyboard, and Mouse for Initial Setup
For the very first boot, you need a monitor, keyboard, and mouse connected to
at least one of the machines (Node A is a good choice to start with). The DGX
Spark has a DisplayPort output, so you need either a DisplayPort monitor or a
DisplayPort-to-HDMI adapter. Connect a USB keyboard and mouse to the USB ports.
After the initial setup is complete, you can switch to headless operation and
disconnect the peripherals. We will cover both modes in detail in Chapters 4 and 5.
3.5 First Power-On
Press the power button on Node A. The system will run through a POST (Power-On
Self-Test) and then boot into Ubuntu 24.04. The first boot may take slightly
longer than subsequent boots as the system initializes hardware and expands
the filesystem to fill the 4TB NVMe SSD.
You will be greeted by the Ubuntu initial setup wizard, which walks you through
language selection, keyboard layout, timezone, and user account creation. Create
a user account with a strong password. For the username, something simple and
memorable works well - we will use "aiuser" in this tutorial, but you can
choose anything you like.
After completing the setup wizard, you will land on the Ubuntu desktop. Take a
moment to appreciate what you are looking at: a full desktop Linux environment
running on hardware that can execute a trillion AI operations per second.
Chapter 4 - NON-HEADLESS SETUP: WORKING WITH A MONITOR AND KEYBOARD
4.1 Why You Might Want a Non-Headless Setup
A non-headless setup means you are working directly at the machine with a
monitor, keyboard, and mouse attached. This is the most intuitive way to get
started, especially if you are new to Linux or to AI workstations. It gives
you a full graphical desktop environment where you can open a terminal, a web
browser, and graphical applications like LM Studio all in the same workspace.
The trade-off is that you need to be physically present at the machine to use
it. For many research and development workflows, this is perfectly acceptable.
You sit at your desk, you work on your DGX Spark, and you go home when you
are done. Simple and effective.
4.2 Updating the System
The very first thing you should do after the initial setup wizard completes is
update all installed software. NVIDIA ships the DGX Spark with a known-good
software configuration, but security patches and bug fixes accumulate quickly.
Open a terminal (press Ctrl+Alt+T or find the Terminal application in the
application menu) and run the following commands.
The first command refreshes the list of available packages from all configured
software repositories, so Ubuntu knows what updates are available:
sudo apt update
The second command downloads and installs all available updates. The -y flag
answers "yes" automatically to any confirmation prompts:
sudo apt upgrade -y
This process may take several minutes depending on how many updates are
available. After it completes, reboot the system to ensure all updates,
especially kernel updates, take effect:
sudo reboot
4.3 Verifying the GPU Is Recognized
After rebooting, open a terminal and run NVIDIA's System Management Interface
tool to confirm the GPU is properly recognized and the drivers are working:
nvidia-smi
You should see output similar to this (the exact numbers will reflect the
GB10 Blackwell GPU):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xx.xx Driver Version: 570.xx.xx CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA GB10 ... On | 00000000:01:00.0 Off | N/A |
| N/A 45C P0 25W / 170W | 2048MiB / 131072MiB| 0% Default |
+-----------------------------------------------------------------------------+
The key things to verify are that the GPU name is shown (GB10 or similar),
that memory shows approximately 131072 MiB (128 GB), and that the driver
version and CUDA version are displayed correctly. If you see "No devices were
found" or similar errors, something is wrong with the driver installation,
which is unusual on a DGX Spark but can happen after a kernel update.
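If you would rather script this check than eyeball the table, nvidia-smi also has a machine-readable query mode. Here is a minimal sketch; the sample line is hypothetical, so run the query function on the machine itself for real values:

```python
import subprocess

def query_gpu_csv() -> str:
    """Ask nvidia-smi for a machine-readable one-line-per-GPU report."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader,nounits"],
        text=True,
    )

def parse_gpu_line(line: str) -> dict:
    """Split one CSV line from the query above into labelled fields."""
    name, mem_mib, driver = [field.strip() for field in line.split(",")]
    return {"name": name, "memory_mib": int(mem_mib), "driver": driver}

# Hypothetical sample of the kind of line a DGX Spark would report:
sample = "NVIDIA GB10, 131072, 570.00.00"
info = parse_gpu_line(sample)
print(info["name"], info["memory_mib"] // 1024, "GB")
```

A script like this makes a good health check to run after every kernel update, when driver mismatches are most likely to appear.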
4.4 Installing Essential Tools
Before diving into inference engines, install a set of tools that will make
your life easier throughout this tutorial. The following command installs
several utilities in one go:
sudo apt install -y \
    git \
    curl \
    wget \
    htop \
    nvtop \
    net-tools \
    iperf3 \
    python3-pip \
    python3-venv \
    build-essential \
    openssh-server
Let us understand what each of these tools does and why we are installing it.
The git tool is the industry-standard version control system. You will use it
to clone repositories for inference frameworks and to manage your own code.
The curl and wget tools are command-line utilities for downloading files from
the internet. Many installation scripts use curl, and wget is useful for
downloading large files like model weights.
The htop tool is an interactive process viewer that shows CPU usage, memory
usage, and running processes in a colorful, easy-to-read format. It is far
more useful than the basic top command.
The nvtop tool is the GPU equivalent of htop. It shows real-time GPU
utilization, memory usage, and temperature. You will use this constantly to
monitor your inference workloads.
The net-tools package provides classic networking commands like ifconfig and
netstat, which are useful for diagnosing network issues.
The iperf3 tool is a network performance testing utility. You will use it to
verify that the 100GbE connection between your two nodes is working at full
speed.
The python3-pip and python3-venv tools are the Python package manager and
virtual environment manager, respectively. Several of the inference engines
we will install - vLLM, SGLang, and TensorRT-LLM - are Python-based, and we
will write Python client code for all five.
The build-essential package installs the GCC compiler, make, and other tools
needed to compile software from source code. Some inference frameworks require
compilation steps.
The openssh-server package installs the SSH server daemon, which allows you
to connect to this machine remotely from another computer. Even in a
non-headless setup, having SSH available is valuable for scripting and remote
management.
4.5 Configuring SSH for Remote Access
Even if you are using a non-headless setup with a monitor attached, enabling
SSH is a good practice. It allows you to control the machine from your laptop,
copy files to and from it, and run commands without having to physically sit
at the machine.
After installing openssh-server, start the SSH service and configure it to
start automatically on boot:
sudo systemctl enable ssh
sudo systemctl start ssh
Now find the IP address of the machine on your regular network (not the
ConnectX-7 link, which we will configure later):
ip addr show
Look for an entry that shows your regular Ethernet interface (typically named
something like eth0, eno1, or enp3s0) with an IP address in your local network
range (typically 192.168.x.x or 10.x.x.x). Note this IP address - you will
use it to SSH into the machine from your laptop.
From your laptop, you can now connect with:
ssh aiuser@192.168.1.100
Replace 192.168.1.100 with the actual IP address of your Node A. You will be
prompted for the password you set during initial setup.
Chapter 5 - HEADLESS SETUP: SSH, REMOTE ACCESS, AND AUTOMATION
5.1 What Does "Headless" Mean and Why Would You Want It?
A headless setup means the machine runs without a monitor, keyboard, or mouse
attached. You interact with it entirely over the network via SSH. This is the
standard way to operate servers and AI workstations in professional
environments for several good reasons.
First, it saves money. Monitors, keyboards, and mice cost money, and if you
have two DGX Spark units, you do not need two sets of peripherals. One laptop
can manage both machines over SSH.
Second, it is more efficient. Once you are comfortable with the command line,
SSH is faster than working at a physical terminal. You can have multiple SSH
sessions open simultaneously, copy and paste between them, and script complex
operations.
Third, it enables automation. When your machines are managed entirely over the
network, you can write scripts that configure them, start inference servers,
monitor their health, and restart services automatically. This is essential
for production AI deployments.
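As a small taste of that automation, here is a minimal Python sketch that runs the same command on both nodes over SSH. It assumes the key-based SSH setup described later in this chapter and the management IPs used throughout this tutorial:

```python
import subprocess

# Management IPs from this tutorial; adjust to your own network.
NODES = {"node-a": "192.168.1.100", "node-b": "192.168.1.101"}

def ssh_command(host: str, remote_cmd: str, user: str = "aiuser") -> list[str]:
    """Build the argv for running remote_cmd on one node over SSH."""
    return ["ssh", f"{user}@{host}", remote_cmd]

def run_on_all(remote_cmd: str) -> None:
    """Run the same command on every node, one after another."""
    for name, ip in NODES.items():
        print(f"--- {name} ---")
        subprocess.run(ssh_command(ip, remote_cmd), check=True)

# Example (needs key-based SSH, see section 5.3):
# run_on_all("uptime")
```

The same pattern extends naturally to starting inference servers on both machines or collecting nvidia-smi output from each.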
5.2 Completing the Initial Setup Without a Monitor (Node B)
For Node B, you have two options for the initial setup. The first option is to
temporarily connect a monitor and keyboard, complete the Ubuntu setup wizard,
enable SSH, and then disconnect the peripherals. This is the simplest approach.
The second option is to use a technique called "blind configuration." If you
know the machine's IP address (which you can find from your router's DHCP
client list), you can SSH into it immediately after first boot, because Ubuntu
24.04 enables SSH by default in some configurations. However, this is not
guaranteed, so the first option is more reliable.
We will assume you have completed the initial setup wizard on both machines
with a monitor attached, enabled SSH on both, and noted their IP addresses.
From this point forward, all configuration will be done over SSH.
5.3 Setting Up SSH Key Authentication
Typing a password every time you SSH into a machine becomes tedious quickly.
SSH key authentication is more secure and more convenient. It works by
generating a pair of cryptographic keys: a private key that stays on your
laptop and a public key that you copy to the remote machine. When you connect,
the machines perform a cryptographic handshake that proves your identity
without requiring a password.
On your laptop (not on the DGX Spark), generate an SSH key pair if you do not
already have one:
ssh-keygen -t ed25519 -C "dgx-spark-access"
The -t ed25519 flag specifies the Ed25519 algorithm, which is modern, fast,
and secure. The -C flag adds a comment to help you identify the key later.
When prompted for a file location, press Enter to accept the default
(~/.ssh/id_ed25519). When prompted for a passphrase, you can either set one
(more secure) or press Enter for no passphrase (more convenient).
Now copy the public key to both DGX Spark nodes. The ssh-copy-id command
handles this automatically:
ssh-copy-id aiuser@192.168.1.100 # Node A
ssh-copy-id aiuser@192.168.1.101 # Node B
After this, you can SSH into either machine without a password:
ssh aiuser@192.168.1.100
5.4 Setting Up SSH Config for Convenience
Instead of typing IP addresses every time, create an SSH config file on your
laptop that gives friendly names to your machines. Open or create the file
~/.ssh/config on your laptop and add the following:
Host node-a
    HostName 192.168.1.100
    User aiuser
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60
    ServerAliveCountMax 3

Host node-b
    HostName 192.168.1.101
    User aiuser
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60
    ServerAliveCountMax 3
The ServerAliveInterval and ServerAliveCountMax settings tell your SSH client
to send keepalive packets every 60 seconds and to give up after 3 missed
responses. This prevents SSH sessions from dropping when you are running long
jobs and not typing anything.
Now you can connect with simply:
ssh node-a
ssh node-b
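If you end up managing more than two machines, you can generate these Host blocks instead of typing them. A small Python sketch, using the aliases, IPs, and username from this tutorial:

```python
def ssh_config_entry(alias: str, ip: str, user: str = "aiuser",
                     key: str = "~/.ssh/id_ed25519") -> str:
    """Render one Host block in ~/.ssh/config syntax."""
    return (
        f"Host {alias}\n"
        f"    HostName {ip}\n"
        f"    User {user}\n"
        f"    IdentityFile {key}\n"
        f"    ServerAliveInterval 60\n"
        f"    ServerAliveCountMax 3\n"
    )

print(ssh_config_entry("node-a", "192.168.1.100"))
print(ssh_config_entry("node-b", "192.168.1.101"))
```

Redirect the output into ~/.ssh/config (or paste it in) and the `ssh node-a` shorthand works immediately.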
5.5 Configuring Passwordless SSH Between the Two Nodes
For distributed inference frameworks like vLLM and TensorRT-LLM, the two
nodes need to be able to SSH into each other without passwords. This is
because the head node (Node A) will launch processes on the worker node
(Node B) automatically.
On Node A, generate an SSH key pair:
ssh-keygen -t ed25519 -C "node-a-to-node-b" -f ~/.ssh/id_ed25519_cluster
Then copy Node A's public key to Node B. First, display the public key:
cat ~/.ssh/id_ed25519_cluster.pub
Copy the output, then SSH into Node B and add it to the authorized_keys file:
# On Node B:
echo "PASTE_THE_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Now do the reverse: on Node B, generate a key and copy it to Node A's
authorized_keys. After this, both nodes can SSH into each other without
passwords, which is required for MPI-based distributed inference.
5.6 Disabling the Graphical Desktop on Headless Nodes
Running a full graphical desktop environment on a headless machine wastes
memory and CPU cycles. Ubuntu 24.04 uses the GNOME desktop by default, which
can consume 1-2 GB of RAM even when idle. For a headless AI workstation, you
want all available resources dedicated to inference.
To switch to a text-only boot target (which still allows you to start a
graphical session manually if needed), run:
sudo systemctl set-default multi-user.target
This tells systemd (the Linux init system) to boot into a multi-user text
mode by default instead of the graphical desktop. The change takes effect on
the next reboot. To revert to graphical mode if needed:
sudo systemctl set-default graphical.target
After setting multi-user mode, reboot:
sudo reboot
When the machine comes back up, it will present a text login prompt instead
of a graphical desktop. SSH into it from your laptop as usual - the SSH
server starts in both modes.
Chapter 6 - NETWORKING THE TWO NODES: IP ADDRESSES, ROCE, AND JUMBO FRAMES
6.1 Understanding the Two Network Interfaces
Each DGX Spark has at least two network interfaces that we care about:
The management interface is the standard Ethernet port (1GbE or 10GbE) that
connects to your regular office or home network. This is used for internet
access, SSH from your laptop, and downloading models. We will call the IP
addresses on this interface the "management IPs" - for example, 192.168.1.100
for Node A and 192.168.1.101 for Node B.
The high-speed interface is the ConnectX-7 port that connects the two DGX
Spark units directly to each other via the DAC cable. This is used exclusively
for high-speed AI traffic between the nodes. We will assign IP addresses in a
separate subnet to this interface - for example, 10.0.0.1 for Node A and
10.0.0.2 for Node B.
Keeping these two networks separate is important. It ensures that AI traffic
does not compete with management traffic, and it makes routing simpler because
each network has a clear purpose.
6.2 Identifying the ConnectX-7 Interface Name
Linux assigns names to network interfaces automatically. The ConnectX-7
interface will have a name like enp1s0f0np0 or similar, depending on which
PCIe slot it occupies. To find the correct interface name, run:
ip link show
You will see a list of all network interfaces. The ConnectX-7 interface will
typically show a link speed of 100000 Mb/s when the DAC cable is connected.
You can also use:
ethtool <interface_name> | grep Speed
to check the speed of a specific interface. Alternatively, the mlxlink tool
(part of the Mellanox/NVIDIA networking tools) provides detailed ConnectX-7
status:
sudo mlxlink -d /dev/mst/mt4129_pciconf0 --show_module
The interface name for the ConnectX-7 on Node A might be enp1s0f0np0. We
will use this name in examples below, but substitute the actual name you find
on your system.
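If you prefer a scripted check, the kernel exposes each interface's negotiated speed in sysfs, and a few lines of Python can scan for the 100G link. This is a sketch that assumes the standard /sys/class/net layout:

```python
from pathlib import Path

def interfaces_at_speed(target_mbps: int, sysfs: str = "/sys/class/net") -> list[str]:
    """Return interfaces whose negotiated link speed (in Mb/s, as exposed in
    /sys/class/net/<iface>/speed) matches target_mbps. Interfaces without a
    link report -1 or fail the read; those are skipped."""
    matches = []
    for iface in sorted(Path(sysfs).iterdir()):
        try:
            speed = int((iface / "speed").read_text().strip())
        except (OSError, ValueError):
            continue
        if speed == target_mbps:
            matches.append(iface.name)
    return matches

# On a DGX Spark with the DAC cable connected, this should list the
# ConnectX-7 interface (e.g. enp1s0f0np0):
if Path("/sys/class/net").exists():
    print(interfaces_at_speed(100_000))
```

An empty result usually means the cable is unplugged or unseated, since the speed file only reports a value once the link has negotiated.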
6.3 Configuring Static IP Addresses on the ConnectX-7 Interface
Ubuntu 24.04 uses Netplan for network configuration. Netplan is a declarative
network configuration system that reads YAML files and generates configuration
for the underlying network daemon (NetworkManager or systemd-networkd).
On Node A, create a new Netplan configuration file for the ConnectX-7
interface. The file must be in /etc/netplan/ and have a .yaml extension. We
will call it 10-connectx7.yaml (the number prefix determines the order in
which files are processed):
sudo nano /etc/netplan/10-connectx7.yaml
Enter the following configuration. Be very careful with indentation - YAML
is whitespace-sensitive, and incorrect indentation will cause errors:
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: false
      addresses:
        - 10.0.0.1/24
      mtu: 9000
The dhcp4: false line tells Netplan not to request an IP address from a DHCP
server on this interface - we are assigning a static address manually. The
addresses section assigns the IP address 10.0.0.1 with a /24 subnet mask
(which means addresses 10.0.0.1 through 10.0.0.254 are on the same network).
The mtu: 9000 line sets the Maximum Transmission Unit to 9000 bytes, which
are called "jumbo frames."
On Node B, create the same file with the IP address changed to 10.0.0.2:
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: false
      addresses:
        - 10.0.0.2/24
      mtu: 9000
Apply the configuration on both nodes:
sudo netplan apply
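Since the two files differ only in the address, you can also generate them programmatically. A small Python sketch that renders the same configuration (and sidesteps hand-indentation mistakes in the whitespace-sensitive YAML):

```python
def netplan_yaml(iface: str, address: str, mtu: int = 9000) -> str:
    """Render the Netplan config for the ConnectX-7 link as a YAML string."""
    return (
        "network:\n"
        "  version: 2\n"
        "  ethernets:\n"
        f"    {iface}:\n"
        "      dhcp4: false\n"
        "      addresses:\n"
        f"        - {address}\n"
        f"      mtu: {mtu}\n"
    )

# Node A gets 10.0.0.1/24, Node B gets 10.0.0.2/24:
print(netplan_yaml("enp1s0f0np0", "10.0.0.1/24"))
```

Write the output to /etc/netplan/10-connectx7.yaml on each node (with the appropriate address), then run sudo netplan apply as above.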
6.4 Why Jumbo Frames Matter
The standard Ethernet MTU (Maximum Transmission Unit) is 1500 bytes. This
means each network packet can carry at most 1500 bytes of data. When
transferring large amounts of data (like the activation tensors exchanged
between nodes during distributed inference), using 1500-byte packets means
a lot of overhead: each packet has headers, checksums, and other metadata
that do not carry useful data. With 1500-byte packets, this overhead is
relatively large compared to the payload.
Jumbo frames increase the MTU to 9000 bytes, which means each packet carries
six times more data for the same per-packet overhead. For high-bandwidth,
low-latency applications like distributed AI inference, this noticeably
reduces CPU overhead (roughly six times fewer interrupts and header-processing
cycles) and can improve sustained throughput.
The key requirement for jumbo frames is that both endpoints must be configured
with the same MTU. Since we are connecting the two DGX Spark units directly
(without a switch in between), we only need to configure the MTU on the two
machines themselves. If you were using a switch, the switch ports would also
need to be configured for jumbo frames.
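The arithmetic behind this is easy to check. The sketch below assumes IPv4 + TCP framing; RoCEv2 traffic carries slightly different headers, but the proportions are similar:

```python
import math

def payload_efficiency(mtu: int) -> float:
    """Fraction of each on-wire frame that is useful payload, assuming
    IPv4 + TCP headers (40 bytes) inside the MTU and an Ethernet header
    plus FCS (18 bytes) around it; preamble and inter-frame gap ignored."""
    return (mtu - 40) / (mtu + 18)

for mtu in (1500, 9000):
    packets_per_mb = math.ceil(1_000_000 / (mtu - 40))
    print(f"MTU {mtu}: {payload_efficiency(mtu):.2%} payload, "
          f"{packets_per_mb} packets per MB")
```

The per-byte framing gain is modest; the larger win is sending roughly six times fewer packets, which means six times fewer interrupts and far less header processing on the CPU.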
6.5 Verifying the Connection
After applying the Netplan configuration, verify that the two nodes can
communicate over the ConnectX-7 link. From Node A, ping Node B:
ping -c 4 10.0.0.2
You should see responses with very low latency, typically under 0.1 milliseconds
for a direct connection. If the ping fails, check that the cable is properly
seated, that both nodes have the correct IP addresses, and that the interface
is up (ip link show should show "UP" for the ConnectX-7 interface).
Now test the actual bandwidth using iperf3. On Node B, start the iperf3 server:
iperf3 -s
On Node A, run the iperf3 client pointing at Node B's ConnectX-7 IP:
iperf3 -c 10.0.0.2 -t 30 -P 4
The -t 30 flag runs the test for 30 seconds, and -P 4 uses 4 parallel streams.
You should see throughput close to 100 Gbits/sec. If you see significantly
less (say, under 80 Gbits/sec), check that jumbo frames are configured
correctly on both ends and that the cable is rated for 100G.
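iperf3 can also emit its results as JSON with the -J flag, which makes scripted verification easy. A minimal sketch; the sample string is a trimmed, hypothetical fragment of the real report structure:

```python
import json

def received_gbps(iperf_json: str) -> float:
    """Extract receiver-side throughput (Gbit/s) from `iperf3 -c ... -J` output."""
    report = json.loads(iperf_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9

# Trimmed, hypothetical fragment of the structure iperf3 emits with -J:
sample = '{"end": {"sum_received": {"bits_per_second": 97.3e9}}}'
print(f"{received_gbps(sample):.1f} Gbit/s")
```

A small wrapper that runs the test and asserts the result is above, say, 90 Gbit/s makes a handy regression check after any networking change.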
6.6 Configuring RoCE for RDMA
To enable RDMA over the ConnectX-7 interface, we need to configure RoCEv2.
First, install the RDMA user-space libraries:
sudo apt install -y rdma-core ibverbs-utils
Verify that the RDMA device is recognized:
ibv_devices
You should see the ConnectX-7 listed as an RDMA device. Now verify the device
attributes:
ibv_devinfo
This shows the RDMA capabilities of the device, including supported transport
types and maximum message sizes.
To configure RoCEv2 (as opposed to RoCEv1), we need to set the GID (Global
Identifier) index. RoCEv2 uses UDP/IP encapsulation, which is routable and
works with standard Ethernet infrastructure. The configuration is done through
the sysfs filesystem:
# Check available GID entries for the interface
# (substitute your actual interface and port number)
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/1
For NCCL (which vLLM and other frameworks use for multi-node communication),
set the following environment variables to tell NCCL to use RoCEv2:
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=5
export NCCL_SOCKET_IFNAME=enp1s0f0np0
Add these to your ~/.bashrc file on both nodes so they are set automatically
in every new shell session:
echo 'export NCCL_IB_GID_INDEX=3' >> ~/.bashrc
echo 'export NCCL_IB_DISABLE=0' >> ~/.bashrc
echo 'export NCCL_NET_GDR_LEVEL=5' >> ~/.bashrc
echo 'export NCCL_SOCKET_IFNAME=enp1s0f0np0' >> ~/.bashrc
source ~/.bashrc
The NCCL_NET_GDR_LEVEL=5 setting enables GPU Direct RDMA, which allows data
to be transferred directly between the GPU memory on one node and the GPU
memory on another node, bypassing the CPU and system memory entirely. This
is the highest level of RDMA optimization and provides the best performance
for distributed inference.
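When you later launch inference processes from Python rather than an interactive shell, the same settings can be applied programmatically instead of relying on ~/.bashrc. A sketch (the script name in the example is hypothetical):

```python
import os
import subprocess

# Values from this section; the GID index and interface name are
# machine-specific, so verify them on your own nodes first.
NCCL_ENV = {
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_DISABLE": "0",
    "NCCL_NET_GDR_LEVEL": "5",
    "NCCL_SOCKET_IFNAME": "enp1s0f0np0",
}

def launch_with_nccl(cmd: list[str]) -> subprocess.Popen:
    """Start a process with the RoCE-tuned NCCL settings layered on top of
    the current environment."""
    return subprocess.Popen(cmd, env={**os.environ, **NCCL_ENV})

# Example (hypothetical script name):
# proc = launch_with_nccl(["python3", "serve_model.py"])
```

Setting the variables per process keeps the tuning explicit and avoids surprises when a shell is started without your ~/.bashrc.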
Chapter 7 - INFERENCE ENGINE 1: OLLAMA - THE FRIENDLY GIANT
7.1 What Is Ollama and Why Should You Care?
Ollama is an open-source tool that makes running large language models locally
as simple as running a Docker container. It handles model downloading, format
conversion, quantization, and serving through a clean REST API - all with a
single command. If you have ever wanted to run a model like Llama 3, Mistral,
or Qwen on your own hardware without wrestling with Python dependencies and
model format conversions, Ollama is the answer.
Ollama works by wrapping llama.cpp, a highly optimized C++ inference library,
with a user-friendly interface. It maintains a library of pre-quantized models
that you can download with a single command, and it automatically detects and
uses your GPU for acceleration.
For the DGX Spark, Ollama is an excellent starting point. It requires minimal
configuration, works out of the box with the NVIDIA GPU, and provides a REST
API that is easy to call from Python, JavaScript, or any other language. The
trade-off is that Ollama does not support multi-node inference natively - each
DGX Spark runs its own independent Ollama instance. However, you can use both
instances together in clever ways, which we will explore.
7.2 Installing Ollama on Both Nodes
The installation is refreshingly simple. On each node (Node A and Node B),
run the official installation script:
curl -fsSL https://ollama.com/install.sh | sh
This script detects your operating system and architecture (ARM64 in the case
of the DGX Spark's Grace CPU), downloads the appropriate binary, installs it
to /usr/local/bin/ollama, and creates a systemd service that starts Ollama
automatically on boot.
After installation, verify that Ollama is running:
systemctl status ollama
You should see "active (running)" in the output. If it is not running, start
it manually:
sudo systemctl start ollama
7.3 Configuring Ollama to Listen on the Network
By default, Ollama only listens on localhost (127.0.0.1), which means it can
only be accessed from the same machine. To allow Node B to send requests to
Node A's Ollama instance (and vice versa), we need to configure Ollama to
listen on all network interfaces.
Edit the Ollama systemd service file to add the necessary environment variable:
sudo systemctl edit ollama
This opens a text editor with an override file. Add the following content:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Save and close the file. Then reload the systemd configuration and restart
Ollama:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Now Ollama listens on all interfaces, including the ConnectX-7 interface at
10.0.0.1 (on Node A) and 10.0.0.2 (on Node B). This means you can send
inference requests from Node A to Node B's Ollama instance and vice versa.
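Before wiring the two instances together, it is worth confirming that each
node can actually reach the other's Ollama port. A small health-check sketch
using the requests library; it relies on the fact that the root path of a
running Ollama server answers with HTTP 200:

```python
import requests


def ollama_reachable(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on host:port.

    A GET on the root path of an Ollama server returns a short
    "Ollama is running" message, which makes it a cheap health check.
    """
    try:
        response = requests.get(f"http://{host}:{port}/", timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        # Connection refused, timeout, DNS failure, etc.
        return False


if __name__ == "__main__":
    for host in ("localhost", "10.0.0.2"):
        status = "up" if ollama_reachable(host) else "unreachable"
        print(f"{host}:11434 is {status}")
```

If the remote node shows as unreachable, re-check the OLLAMA_HOST override
and any firewall rules on the ConnectX-7 interface.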
7.4 Pulling and Running Your First Model
Let us download and run a model. We will start with Llama 3.2 3B, a capable
but compact model that downloads quickly and runs fast:
ollama pull llama3.2:3b
Ollama downloads the model in GGUF format (llama.cpp's model file format,
which typically stores quantized weights) and saves it under ~/.ollama/models.
The download is a few gigabytes. Once it completes, run the model interactively:
ollama run llama3.2:3b
You will see a prompt where you can type messages and receive responses. This
is the simplest possible way to interact with an LLM on your DGX Spark. Type
/bye to exit the interactive session.
For a more serious model that takes advantage of the DGX Spark's 128GB memory,
try Llama 3.1 70B:
ollama pull llama3.1:70b
This model is approximately 40GB in its quantized form and can run on a single
DGX Spark with memory to spare. Token generation for a model this size is
bound mainly by memory bandwidth rather than raw compute, so expect fewer
tokens per second than with the 3B model, but still a comfortably interactive
experience.
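The ~40GB figure is easy to sanity-check with back-of-envelope arithmetic.
Assuming roughly 4.5 bits per weight, which is typical for a 4-bit K-quant
GGUF (the exact average depends on the quantization scheme):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized model, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Llama 3.1 70B at ~4.5 bits per weight:
print(round(quantized_size_gb(70e9, 4.5), 1))  # prints 39.4 (GB)

# The 3B model at the same quantization is under 2GB:
print(round(quantized_size_gb(3e9, 4.5), 2))
```

The same formula helps you judge at a glance whether any model on the Ollama
library page will fit in the 128GB unified memory pool.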
7.5 Using the Ollama REST API
The real power of Ollama comes from its REST API, which allows you to integrate
LLM inference into your own applications. The API is available at
http://localhost:11434 by default.
The following Python script demonstrates how to send a request to the Ollama
API and stream the response. Streaming means you receive tokens as they are
generated rather than waiting for the entire response to complete, which makes
the interaction feel much more responsive:
import requests
import json


def query_ollama(
    prompt: str,
    model: str = "llama3.2:3b",
    host: str = "localhost",
    port: int = 11434,
    stream: bool = True
) -> str:
    """
    Send a prompt to an Ollama instance and return the generated text.

    Args:
        prompt: The text prompt to send to the model.
        model: The Ollama model name to use for inference.
        host: The hostname or IP address of the Ollama server.
        port: The port number on which Ollama is listening.
        stream: If True, stream the response token by token.

    Returns:
        The complete generated text as a string.
    """
    url = f"http://{host}:{port}/api/generate"

    # Build the request payload according to the Ollama API specification.
    # The 'stream' field controls whether the server sends back partial
    # responses as they are generated or waits for the full completion.
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": stream
    }

    full_response = ""

    # Use a streaming HTTP request so we can process each chunk as it arrives.
    # This is important for user-facing applications where responsiveness matters.
    with requests.post(url, json=payload, stream=stream) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                # Each line from the Ollama streaming API is a JSON object
                # containing a 'response' field with the next token(s).
                chunk = json.loads(line)
                token = chunk.get("response", "")
                full_response += token

                # Print each token immediately so the user sees output
                # appearing in real time, just like ChatGPT's interface.
                print(token, end="", flush=True)

                # The 'done' field signals that generation is complete.
                if chunk.get("done", False):
                    print()  # Add a newline after the response is complete.
                    break

    return full_response


if __name__ == "__main__":
    # Query the local Ollama instance on Node A.
    print("=== Querying local Ollama (Node A) ===")
    response_a = query_ollama(
        prompt="Explain quantum entanglement in simple terms.",
        model="llama3.2:3b",
        host="localhost"
    )

    # Query the remote Ollama instance on Node B via the ConnectX-7 link.
    # Notice that we use the high-speed 10.0.0.2 address, not the
    # management network address. This routes the traffic over the
    # 100GbE direct connection for minimum latency.
    print("\n=== Querying remote Ollama (Node B) ===")
    response_b = query_ollama(
        prompt="What are the applications of quantum computing?",
        model="llama3.2:3b",
        host="10.0.0.2"
    )
This script is straightforward but illustrates a powerful concept: with two
DGX Spark units running Ollama, you can distribute inference requests across
both machines. Node A handles some requests while Node B handles others,
effectively doubling your throughput for workloads that involve many concurrent
users or many independent queries.
7.6 A Load Balancer for Two Ollama Instances
To automatically distribute requests between the two Ollama instances, you
can write a simple round-robin load balancer. This is useful when you want to
serve many users and want to spread the load evenly:
import requests
import itertools
from typing import Iterator


class OllamaLoadBalancer:
    """
    A simple round-robin load balancer for multiple Ollama instances.

    This class cycles through a list of Ollama server addresses and
    sends each request to the next server in the rotation. This ensures
    that no single server is overwhelmed while others sit idle.
    """

    def __init__(self, servers: list[dict]) -> None:
        """
        Initialize the load balancer with a list of server configurations.

        Args:
            servers: A list of dicts, each with 'host' and 'port' keys.
                Example: [{"host": "localhost", "port": 11434},
                          {"host": "10.0.0.2", "port": 11434}]
        """
        self.servers = servers
        # itertools.cycle creates an infinite iterator that cycles through
        # the list: server0, server1, server0, server1, ...
        self._server_cycle: Iterator[dict] = itertools.cycle(servers)

    def _get_next_server(self) -> dict:
        """Return the next server in the rotation."""
        return next(self._server_cycle)

    def generate(
        self,
        prompt: str,
        model: str = "llama3.2:3b"
    ) -> str:
        """
        Send a generation request to the next available server.

        Args:
            prompt: The text prompt for the model.
            model: The model name to use.

        Returns:
            The generated text response.
        """
        server = self._get_next_server()
        url = f"http://{server['host']}:{server['port']}/api/generate"
        print(f"Routing request to {server['host']}:{server['port']}")

        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False  # Non-streaming for simplicity in this example.
        }
        response = requests.post(url, json=payload)
        response.raise_for_status()
        return response.json().get("response", "")


if __name__ == "__main__":
    # Configure the load balancer with both DGX Spark nodes.
    # Node A is accessed via localhost (we are running this script on Node A).
    # Node B is accessed via the high-speed ConnectX-7 interface.
    balancer = OllamaLoadBalancer(
        servers=[
            {"host": "localhost", "port": 11434},
            {"host": "10.0.0.2", "port": 11434}
        ]
    )

    # Simulate five incoming requests. They will alternate between
    # Node A and Node B automatically.
    prompts = [
        "What is machine learning?",
        "Explain neural networks.",
        "What is backpropagation?",
        "Describe transformer architecture.",
        "What is attention mechanism?"
    ]

    for i, prompt in enumerate(prompts):
        print(f"\n--- Request {i + 1} ---")
        print(f"Prompt: {prompt}")
        response = balancer.generate(prompt=prompt, model="llama3.2:3b")
        print(f"Response: {response[:200]}...")  # Print first 200 characters.
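Note that the loop above issues requests one at a time, so one node sits idle
while the other generates. To realize the throughput benefit of two nodes,
dispatch requests concurrently. A minimal sketch using a thread pool; fan_out
is a hypothetical helper, not part of the load balancer class above:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def fan_out(fn: Callable[[T], R], items: list[T], max_workers: int = 2) -> list[R]:
    """Run fn over items concurrently, preserving input order in the results.

    pool.map submits all items at once and yields results in the same
    order as the inputs, even if they complete out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))


# With the OllamaLoadBalancer above you would call, for example:
#   responses = fan_out(balancer.generate, prompts, max_workers=2)
# so that Node A and Node B each work on a request at the same time.
```

Two workers match the two nodes here; with more Ollama instances, raise
max_workers to match the number of servers in the rotation.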
Chapter 8 - INFERENCE ENGINE 2: LM STUDIO - THE GUI POWERHOUSE
8.1 What Is LM Studio?
LM Studio is a desktop application that provides a polished graphical user
interface for downloading, managing, and running large language models locally.
If you have ever wished for a ChatGPT-like interface that runs entirely on your
own hardware, LM Studio is exactly that. It includes a model browser that lets
you search and download models from Hugging Face, a chat interface for
interactive conversations, and a local server that exposes an OpenAI-compatible
API.
LM Studio is particularly valuable for non-headless setups where you have a
monitor connected. It is the most beginner-friendly of the five inference
engines we cover, requiring no command-line interaction for basic use. However,
it also provides enough advanced features to satisfy experienced practitioners.
8.2 Installing LM Studio on the DGX Spark
LM Studio supports Linux ARM64, which is the architecture of the DGX Spark's
Grace CPU. Download the ARM64 AppImage from the LM Studio website:
wget https://releases.lmstudio.ai/linux/arm64/latest/LM_Studio-latest-arm64.AppImage
Make the downloaded file executable:
chmod +x LM_Studio-latest-arm64.AppImage
For a non-headless setup, simply double-click the AppImage in the file manager,
or run it from the terminal:
./LM_Studio-latest-arm64.AppImage
LM Studio will launch with a graphical interface. On first launch, it may ask
you to accept a license agreement and choose a directory for storing models.
The default location (~/.cache/lm-studio/models) is fine, but given the DGX
Spark's 4TB NVMe SSD, you have plenty of space to store many large models.
For a headless setup, LM Studio can be run in server mode without a graphical
interface. This is done using the lms command-line tool that LM Studio installs:
# First, bootstrap the lms CLI tool (run this after launching LM Studio at
# least once so that the binary has been unpacked):
~/.lmstudio/bin/lms bootstrap
# Start the LM Studio server in headless mode:
~/.lmstudio/bin/lms server start --port 1234
8.3 Using LM Studio's GUI
In the graphical interface, the left sidebar has several icons. The first icon
(a magnifying glass) opens the model search interface, where you can browse
and download models from Hugging Face. Search for "llama" or "mistral" to find
popular models. LM Studio shows the model size and quantization level, helping
you choose a model that fits in your 128GB memory.
The second icon (a chat bubble) opens the chat interface. After loading a model
(click the model name in the top bar to load it), you can type messages and
receive responses in a familiar chat format. This is excellent for interactive
exploration and testing.
The third icon (a server icon) opens the local server settings. Enable the
server and it will listen on port 1234 by default, exposing an OpenAI-compatible
API. This is the same API format used by OpenAI's ChatGPT, which means any
code written for OpenAI's API works with LM Studio with minimal changes.
8.4 Using LM Studio's OpenAI-Compatible API
Once the LM Studio server is running (either in GUI mode or headless mode),
you can interact with it using the OpenAI Python library. This is one of the
most important aspects of LM Studio: because it speaks the OpenAI API protocol,
you can swap between LM Studio and actual OpenAI models with a single line
change in your code.
Install the OpenAI Python library if you have not already:
pip install openai
The following script demonstrates how to use LM Studio's API, including how
to switch seamlessly between local and remote models:
from openai import OpenAI


def create_lmstudio_client(
    host: str = "localhost",
    port: int = 1234
) -> OpenAI:
    """
    Create an OpenAI client configured to talk to an LM Studio server.

    LM Studio exposes an OpenAI-compatible API, so we use the standard
    OpenAI Python library but point it at our local LM Studio instance.
    The api_key parameter is required by the library but is not actually
    validated by LM Studio - any non-empty string works.

    Args:
        host: The hostname or IP address of the LM Studio server.
        port: The port number on which LM Studio is listening.

    Returns:
        A configured OpenAI client instance.
    """
    return OpenAI(
        base_url=f"http://{host}:{port}/v1",
        api_key="lm-studio"  # LM Studio ignores this, but it must be set.
    )


def chat_with_model(
    client: OpenAI,
    system_prompt: str,
    user_message: str,
    model: str = "local-model",
    temperature: float = 0.7,
    max_tokens: int = 1024
) -> str:
    """
    Send a chat message to an LM Studio model and return the response.

    This function uses the chat completions API, which is the standard
    way to interact with instruction-tuned models. It supports a system
    prompt (which sets the model's persona and behavior) and a user
    message (the actual question or instruction).

    Args:
        client: The OpenAI client configured for LM Studio.
        system_prompt: Instructions that define the model's behavior.
        user_message: The user's question or instruction.
        model: The model identifier (LM Studio uses the loaded
            model regardless of this value).
        temperature: Controls randomness. 0.0 is deterministic,
            1.0 is highly random. 0.7 is a good default.
        max_tokens: Maximum number of tokens to generate.

    Returns:
        The model's response as a string.
    """
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_message
            }
        ],
        temperature=temperature,
        max_tokens=max_tokens
    )

    # The response is nested inside the completion object.
    # We extract just the text content of the first (and usually only)
    # choice that the model generated.
    return completion.choices[0].message.content


if __name__ == "__main__":
    # Connect to LM Studio running on Node A (local).
    local_client = create_lmstudio_client(host="localhost", port=1234)

    # Connect to LM Studio running on Node B (remote, via ConnectX-7).
    remote_client = create_lmstudio_client(host="10.0.0.2", port=1234)

    system_prompt = (
        "You are a helpful AI assistant specializing in explaining "
        "complex technical concepts in simple, accessible language."
    )
    question = "How does a transformer neural network process text?"

    print("=== Response from Node A (local LM Studio) ===")
    response_local = chat_with_model(
        client=local_client,
        system_prompt=system_prompt,
        user_message=question
    )
    print(response_local)

    print("\n=== Response from Node B (remote LM Studio) ===")
    response_remote = chat_with_model(
        client=remote_client,
        system_prompt=system_prompt,
        user_message=question
    )
    print(response_remote)
Chapter 9 - INFERENCE ENGINE 3: VLLM - THE THROUGHPUT CHAMPION
9.1 What Is vLLM and Why Is It Different?
vLLM (Virtual Large Language Model) is an open-source inference engine
developed by researchers at UC Berkeley. It was created to solve a specific
problem: how do you serve large language models to many users simultaneously
with high throughput and low latency?
The key innovation in vLLM is a technique called PagedAttention. To understand
why PagedAttention matters, we need to briefly understand how LLM inference
works. When a model generates text, it maintains a "key-value cache" (KV cache)
for each token it has processed. This cache stores intermediate computations
that allow the model to attend to previous tokens efficiently. The KV cache
grows with each generated token and can consume enormous amounts of GPU memory.
The problem with naive KV cache management is memory fragmentation. If you are
serving 10 users simultaneously, each with a different conversation length,
the KV caches for those conversations are different sizes. Allocating fixed
blocks of memory for each conversation wastes space when conversations are
short and fails when conversations grow longer than expected.
PagedAttention borrows an idea from operating system virtual memory management:
it divides the KV cache into fixed-size "pages" and allocates them dynamically
as needed, similar to how an OS manages physical memory pages for multiple
processes. This eliminates fragmentation and allows vLLM to serve 2-4x more
concurrent users than naive implementations with the same hardware.
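A toy calculation makes the fragmentation argument concrete. Suppose the naive
allocator reserves the full context length for every conversation, while a
paged allocator (using 16-token pages, vLLM's default block size) wastes at
most one partial page per sequence. The numbers below are illustrative, not
vLLM's actual bookkeeping:

```python
import math


def naive_waste(seq_lens: list[int], max_len: int) -> int:
    """KV slots wasted when every sequence reserves max_len slots up front."""
    return sum(max_len - n for n in seq_lens)


def paged_waste(seq_lens: list[int], page_size: int = 16) -> int:
    """KV slots wasted with paging: only the unused tail of each last page."""
    return sum(math.ceil(n / page_size) * page_size - n for n in seq_lens)


# Ten conversations of very different lengths, max context 8192 tokens:
lens = [120, 450, 2000, 64, 7800, 300, 1024, 88, 4096, 512]
print(naive_waste(lens, 8192))  # tens of thousands of wasted KV slots
print(paged_waste(lens))        # at most page_size - 1 wasted per sequence
```

With paging, wasted memory no longer depends on how long conversations might
grow, only on the page size, which is why vLLM can pack far more concurrent
sequences into the same GPU memory.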
9.2 Installing vLLM
Create a Python virtual environment for vLLM to keep its dependencies isolated
from other tools:
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
Install vLLM. For the DGX Spark's ARM-based Grace CPU with CUDA 12.x, the
standard pip installation should work:
pip install vllm
If the pip installation fails (which can happen on ARM architectures where
pre-built wheels are not available), you may need to build from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
The build from source takes 15-30 minutes as it compiles CUDA kernels. This
is normal and expected. The resulting installation is fully optimized for your
specific GPU architecture.
9.3 Running vLLM as a Single-Node Server
The simplest way to use vLLM is as a single-node OpenAI-compatible API server.
Start the server with a model from Hugging Face (vLLM downloads models
automatically from the Hugging Face Hub):
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192
Let us understand each argument. The --model flag specifies the Hugging Face
model identifier. The --host 0.0.0.0 flag makes the server listen on all
network interfaces, not just localhost. The --port 8000 flag sets the port
number. The --dtype bfloat16 flag tells vLLM to use 16-bit brain floating
point, a numerically robust format that the Blackwell GPU accelerates
natively. The --max-model-len 8192 flag limits the maximum sequence length
(input and output tokens combined) to 8192 tokens, which bounds KV cache
memory usage.
For a model that requires authentication (like Llama 3.1), you need a Hugging
Face account and access token. Set the token as an environment variable:
export HF_TOKEN="your_huggingface_token_here"
9.4 Setting Up Two-Node Distributed Inference with vLLM
This is where things get genuinely exciting. With two DGX Spark units, you
can run a single model that is distributed across both machines. This allows
you to run models that are too large for a single 128GB memory pool, or to
run models faster by parallelizing the computation.
vLLM supports two forms of parallelism for multi-node inference. Tensor
parallelism splits individual matrix operations across multiple GPUs, with
each GPU computing a portion of each operation simultaneously. Pipeline
parallelism splits the model's layers across GPUs, with each GPU processing
a different set of layers in sequence. For two nodes, tensor parallelism
typically gives better performance.
vLLM uses Ray for multi-node coordination. Ray is a distributed computing
framework that handles process management, communication, and fault tolerance.
Install Ray on both nodes:
pip install ray
On Node A (the head node), start the Ray cluster:
ray start --head \
--node-ip-address=10.0.0.1 \
--port=6379 \
--dashboard-host=0.0.0.0
The --node-ip-address flag tells Ray to use the ConnectX-7 interface IP for
cluster communication. This routes all Ray traffic over the high-speed direct
connection rather than the management network.
On Node B (the worker node), join the Ray cluster:
ray start \
--address=10.0.0.1:6379 \
--node-ip-address=10.0.0.2
Now, on Node A, start vLLM with tensor parallelism across both nodes:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-model-len 8192
The --tensor-parallel-size 2 flag tells vLLM to split the model across 2 GPUs
(one on each node). vLLM uses the Ray cluster to coordinate with Node B
automatically. The model weights are split such that each node holds half the
model, and during inference, both nodes compute their portion simultaneously
and exchange results via the ConnectX-7 link.
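Some quick arithmetic shows why this split works. Ignoring activations, the
KV cache, and small replicated tensors such as embeddings, each node's share
of the weights is roughly:

```python
def weights_per_node_gb(n_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate model-weight memory each node holds under tensor parallelism."""
    return n_params * bytes_per_param / tp_size / 1e9


# Llama 3.1 70B in bfloat16 (2 bytes per parameter), split across two nodes:
print(weights_per_node_gb(70e9, 2, 2))  # prints 70.0 (GB per node)
```

At 70GB of weights per node, each 128GB memory pool retains tens of gigabytes
of headroom for the KV cache and activations; a 70B bfloat16 model would not
fit on a single node at all, since its full 140GB of weights exceeds 128GB.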
9.5 Querying the vLLM Server
Once the vLLM server is running, you can query it using the OpenAI Python
library (since vLLM exposes an OpenAI-compatible API) or with direct HTTP
requests. The following script demonstrates both approaches and shows how to
handle the two-node setup:
import asyncio
from typing import AsyncIterator

from openai import AsyncOpenAI


class VLLMClient:
    """
    A client for interacting with a vLLM OpenAI-compatible API server.

    This client is fully asynchronous and demonstrates how to use
    streaming responses for real-time output. The vLLM server is
    accessed via Node A, since it acts as the head node and exposes
    the unified API endpoint.
    """

    def __init__(
        self,
        host: str = "localhost",
        port: int = 8000,
        model: str = "meta-llama/Llama-3.1-70B-Instruct"
    ) -> None:
        """
        Initialize the vLLM client.

        Args:
            host: The hostname or IP of the vLLM server (Node A).
            port: The port number of the vLLM API server.
            model: The model name as registered in the vLLM server.
        """
        self.model = model
        # vLLM does not require authentication by default, but the
        # api_key field must still be set to some non-empty string.
        self.client = AsyncOpenAI(
            base_url=f"http://{host}:{port}/v1",
            api_key="not-needed"
        )

    async def stream_completion(
        self,
        prompt: str,
        max_tokens: int = 512,
        temperature: float = 0.7
    ) -> AsyncIterator[str]:
        """
        Stream a completion from the vLLM server token by token.

        This is an async generator that yields each token as it is
        generated. Using async streaming allows your application to
        remain responsive while waiting for the model to generate text,
        which is especially important for long responses.

        Args:
            prompt: The text prompt to complete.
            max_tokens: Maximum number of tokens to generate.
            temperature: Sampling temperature (0.0 = deterministic).

        Yields:
            Individual tokens as strings.
        """
        stream = await self.client.completions.create(
            model=self.model,
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            stream=True
        )
        async for chunk in stream:
            # Each chunk contains a list of choices. We take the first
            # choice and extract the text delta (the new tokens).
            if chunk.choices and chunk.choices[0].text:
                yield chunk.choices[0].text

    async def chat_completion(
        self,
        messages: list[dict],
        max_tokens: int = 512,
        temperature: float = 0.7
    ) -> str:
        """
        Send a chat completion request and return the full response.

        Args:
            messages: A list of message dicts with 'role' and 'content'.
            max_tokens: Maximum tokens to generate.
            temperature: Sampling temperature.

        Returns:
            The model's response as a string.
        """
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            stream=False
        )
        return response.choices[0].message.content


async def main() -> None:
    """Demonstrate vLLM client usage with both streaming and non-streaming."""
    # Connect to the vLLM server running on Node A.
    # Even though the model is distributed across both nodes,
    # all requests go to Node A's API endpoint. vLLM handles
    # the distribution internally via the Ray cluster.
    client = VLLMClient(
        host="localhost",  # or use the management IP: "192.168.1.100"
        port=8000,
        model="meta-llama/Llama-3.1-70B-Instruct"
    )

    # Demonstrate streaming completion.
    print("=== Streaming Completion ===")
    prompt = "The key advantages of distributed AI inference are:"
    print(f"Prompt: {prompt}")
    print("Response: ", end="")
    async for token in client.stream_completion(
        prompt=prompt,
        max_tokens=256,
        temperature=0.3
    ):
        print(token, end="", flush=True)
    print()

    # Demonstrate chat completion with a system prompt.
    print("\n=== Chat Completion ===")
    messages = [
        {
            "role": "system",
            "content": "You are an expert in distributed computing and AI systems."
        },
        {
            "role": "user",
            "content": "What is tensor parallelism and how does it work?"
        }
    ]
    response = await client.chat_completion(
        messages=messages,
        max_tokens=512,
        temperature=0.5
    )
    print(f"Response: {response}")


if __name__ == "__main__":
    asyncio.run(main())
Chapter 10 - INFERENCE ENGINE 4: SGLANG - THE STRUCTURED GENERATION WIZARD
10.1 What Is SGLang?
SGLang (Structured Generation Language) is an inference framework developed
at UC Berkeley that takes a different approach to LLM serving. While vLLM
focuses on maximizing throughput through efficient memory management, SGLang
focuses on making it easy and efficient to build complex LLM programs that
involve structured outputs, multi-step reasoning, and sophisticated prompting
patterns.
The key innovation in SGLang is RadixAttention, which is an extension of the
KV cache concept. In standard inference, each request has its own KV cache
that is discarded when the request completes. RadixAttention organizes KV
caches in a radix tree (a prefix tree) structure, allowing caches to be shared
between requests that share common prefixes. This is enormously valuable for
applications where many requests share a common system prompt or context, such
as a customer service bot where every conversation starts with the same
instructions.
For example, if you have a 2000-token system prompt that is the same for every
request, standard inference must recompute the KV cache for those 2000 tokens
for every single request. With RadixAttention, the KV cache for those 2000
tokens is computed once and reused for all subsequent requests that share that
prefix. This can reduce latency by 50-80% for such workloads.
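The saving is easy to quantify if we count only the prompt (prefill) tokens
that must actually be processed. The helper below is purely illustrative:

```python
def prefill_tokens(n_requests: int, prefix_len: int, unique_len: int,
                   prefix_cached: bool) -> int:
    """Total prompt tokens that must be prefilled across n_requests.

    With RadixAttention-style prefix caching, the shared prefix is
    computed once; without it, every request recomputes the prefix.
    """
    if prefix_cached:
        return prefix_len + n_requests * unique_len
    return n_requests * (prefix_len + unique_len)


# 100 requests sharing a 2000-token system prompt, 100 unique tokens each:
without = prefill_tokens(100, 2000, 100, prefix_cached=False)  # 210000
with_cache = prefill_tokens(100, 2000, 100, prefix_cached=True)  # 12000
print(f"prefill work reduced by {1 - with_cache / without:.0%}")
```

The larger the shared prefix relative to the unique portion of each request,
the bigger the win, which is why system-prompt-heavy workloads benefit most.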
10.2 Installing SGLang
Create a virtual environment for SGLang:
python3 -m venv ~/sglang-env
source ~/sglang-env/bin/activate
Install SGLang with all optional dependencies:
pip install "sglang[all]"
If you encounter issues with the ARM64 architecture, install from source:
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e ".[all]"
10.3 Starting the SGLang Server
SGLang provides a launch_server script that starts an OpenAI-compatible API
server. On Node A, start the server:
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--dtype bfloat16 \
--mem-fraction-static 0.85
The --mem-fraction-static 0.85 flag tells SGLang to use 85% of GPU memory for
its static allocations, chiefly the model weights and the KV cache pool. The
remaining 15% is left for dynamic allocations during inference. Lowering this
value trades maximum batch size for stability if you run into out-of-memory
errors.
For two-node distributed inference with SGLang, the setup uses torch.distributed
with NCCL as the communication backend. On Node A:
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--tp-size 2 \
--nnodes 2 \
--node-rank 0 \
--dist-init-addr 10.0.0.1:29500 \
--dtype bfloat16
On Node B (run this command simultaneously with the Node A command):
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--tp-size 2 \
--nnodes 2 \
--node-rank 1 \
--dist-init-addr 10.0.0.1:29500 \
--dtype bfloat16
The --tp-size 2 flag sets tensor parallelism to 2 (one GPU per node). The
--nnodes 2 flag specifies the total number of nodes. The --node-rank flag
identifies each node (0 for head, 1 for worker). The --dist-init-addr flag
specifies the address of the head node for distributed initialization, using
the ConnectX-7 IP for low-latency communication.
10.4 Using SGLang's Structured Generation Features
SGLang's most powerful feature is its ability to generate structured outputs
reliably. This is useful when you need the model to produce JSON, follow a
specific format, or make a series of decisions in a structured way. The
following example demonstrates how to use SGLang's Python API to generate
structured JSON output:
import json

import sglang as sgl
from sglang import assistant, gen, system, user


# SGLang uses a decorator-based programming model where you define
# generation programs as Python functions decorated with @sgl.function.
# This allows SGLang to optimize the execution of complex multi-step
# generation tasks.

@sgl.function
def analyze_text(s, text: str) -> None:
    """
    Analyze a piece of text and extract structured information.

    This SGLang program instructs the model to analyze input text and
    produce a structured JSON response containing sentiment, key topics,
    and a summary. SGLang can also enforce a JSON schema through
    constrained decoding; here we simply prompt for JSON and validate
    it after generation.

    Args:
        s: The SGLang state object (injected automatically).
        text: The text to analyze.
    """
    # Set the system prompt that defines the model's behavior.
    s += system(
        "You are a text analysis assistant. Always respond with valid JSON."
    )

    # Provide the user's request with the text to analyze.
    s += user(
        f"Analyze the following text and provide a JSON response with "
        f"fields: 'sentiment' (positive/negative/neutral), "
        f"'key_topics' (list of strings), and 'summary' (one sentence).\n\n"
        f"Text: {text}"
    )

    # The gen() call tells SGLang to generate text here.
    # The max_tokens parameter limits the response length.
    # SGLang can also enforce JSON schema constraints if configured.
    s += assistant(gen("analysis", max_tokens=256))


@sgl.function
def multi_step_reasoning(s, question: str) -> None:
    """
    Perform multi-step chain-of-thought reasoning.

    This program demonstrates SGLang's ability to structure complex
    reasoning tasks. It first generates a step-by-step reasoning chain,
    then uses that reasoning to produce a final answer. This two-step
    approach often produces more accurate results than asking for the
    answer directly.

    Args:
        s: The SGLang state object.
        question: The question to reason about.
    """
    s += system(
        "You are a careful reasoner who thinks step by step before answering."
    )
    s += user(f"Question: {question}\n\nFirst, think through this step by step:")

    # Generate the reasoning chain and store it in the 'reasoning' variable.
    s += assistant(gen("reasoning", max_tokens=512))

    # Now ask for the final answer, which can reference the reasoning above.
    s += user("Based on your reasoning above, what is your final answer?")

    # Generate the final answer and store it in the 'answer' variable.
    s += assistant(gen("answer", max_tokens=128))


def run_sglang_examples() -> None:
    """
    Run the SGLang example programs against the local server.

    This function initializes the SGLang runtime to connect to the
    server we started earlier, then runs both example programs.
    """
    # Initialize the SGLang runtime to connect to the local server.
    # If you want to use the distributed two-node setup, point this
    # at Node A's management IP or localhost if running on Node A.
    sgl.set_default_backend(
        sgl.RuntimeEndpoint("http://localhost:30000")
    )

    # Run the text analysis program. Decorated SGLang functions are
    # executed with .run(), which returns a state object exposing
    # each named gen() capture.
    print("=== Structured Text Analysis ===")
    sample_text = (
        "The new NVIDIA DGX Spark is a remarkable piece of engineering. "
        "It delivers petaFLOP-scale AI performance in a desktop form factor, "
        "making enterprise-grade AI accessible to individual researchers."
    )
    result = analyze_text.run(text=sample_text)

    # Access the generated content by variable name.
    raw_analysis = result["analysis"]
    print(f"Raw output: {raw_analysis}")

    # Attempt to parse the JSON output.
    try:
        parsed = json.loads(raw_analysis)
        print(f"Sentiment: {parsed.get('sentiment', 'N/A')}")
        print(f"Key topics: {parsed.get('key_topics', [])}")
        print(f"Summary: {parsed.get('summary', 'N/A')}")
    except json.JSONDecodeError:
        print("Note: Output was not valid JSON. Adjust the prompt for stricter formatting.")

    # Run the multi-step reasoning program.
    print("\n=== Multi-Step Reasoning ===")
    question = (
        "If two DGX Spark units each have 128GB of unified memory and are "
        "connected via a 100GbE link, what is the theoretical maximum model "
        "size they could run together, and what are the practical limitations?"
    )
    result = multi_step_reasoning.run(question=question)
    print(f"Reasoning:\n{result['reasoning']}")
    print(f"\nFinal Answer:\n{result['answer']}")


if __name__ == "__main__":
    run_sglang_examples()
Chapter 11 - INFERENCE ENGINE 5: TENSORRT-LLM - MAXIMUM PERFORMANCE MODE
11.1 What Is TensorRT-LLM?
TensorRT-LLM is NVIDIA's own high-performance inference library, and it
represents the pinnacle of optimization for NVIDIA hardware. While Ollama,
LM Studio, vLLM, and SGLang are general-purpose frameworks that work across
different hardware, TensorRT-LLM is specifically engineered to extract every
last drop of performance from NVIDIA GPUs.
The way TensorRT-LLM achieves this is through model compilation. Instead of
running a model in its original format (PyTorch weights), TensorRT-LLM compiles
the model into a TensorRT engine - a highly optimized binary that is tailored
to the specific GPU architecture it will run on. This compilation process
applies a battery of optimizations: kernel fusion (combining multiple operations
into a single GPU kernel to reduce memory bandwidth), precision reduction
(converting weights to FP8 or INT4 format), layer optimization (replacing
generic PyTorch operations with hand-written CUDA kernels), and graph
optimization (reordering and eliminating redundant operations).
The result is typically 2-5x faster inference compared to unoptimized PyTorch,
with the exact speedup depending on the model architecture and the specific
GPU. For the DGX Spark's Blackwell GPU, which has dedicated hardware for FP4
and FP8 operations, the speedup can be even more dramatic.
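To see why precision reduction matters on a machine with 128GB of unified
memory, a back-of-envelope calculation helps. The sketch below (our own
illustration, not part of TensorRT-LLM) counts weight storage only; the KV
cache, activations, and runtime overhead come on top of these numbers:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold model weights, in gigabytes.

    This counts weights only. KV cache, activations, and framework
    overhead add more on top, so treat the result as a lower bound.
    """
    return num_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at the precisions TensorRT-LLM can compile to:
for bits, name in [(16, "BF16"), (8, "FP8"), (4, "FP4/INT4")]:
    print(f"{name}: {weight_memory_gb(8e9, bits):.0f} GB")
# BF16: 16 GB, FP8: 8 GB, FP4/INT4: 4 GB
```

Halving the bits halves the weight footprint, which is why the Blackwell
GPU's native FP8 and FP4 support translates directly into larger runnable
models, not just faster ones.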
The trade-off is complexity. TensorRT-LLM requires a compilation step that
can take 30 minutes to several hours for large models, and the compiled engine
is specific to the GPU architecture it was compiled for. You cannot compile
an engine on a Blackwell GPU and run it on an Ampere GPU.
11.2 Installing TensorRT-LLM
TensorRT-LLM is best installed inside a Docker container, as it has complex
dependencies that are pre-configured in NVIDIA's official container images.
The DGX Spark has Docker and the NVIDIA Container Runtime pre-installed.
Pull the official TensorRT-LLM release container (check the NGC catalog for
the current image path and tag):
docker pull nvcr.io/nvidia/tensorrt-llm/release:latest
Alternatively, install via pip in a virtual environment:
python3 -m venv ~/trtllm-env
source ~/trtllm-env/bin/activate
pip install tensorrt-llm
If the pip installation fails on ARM64, use the Docker approach, which is
more reliable:
docker run --gpus all \
--rm \
-it \
-v /home/aiuser/models:/models \
nvcr.io/nvidia/tensorrt-llm:latest \
bash
The -v flag mounts your local models directory inside the container, so models
you download are accessible both inside and outside Docker.
11.3 Building a TensorRT-LLM Engine
Building a TensorRT-LLM engine is a two-step process. First, you convert the
model weights from Hugging Face format to TensorRT-LLM's internal format.
Second, you compile the converted weights into an optimized TensorRT engine.
Step 1: Convert the model weights. This example uses Llama 3.1 8B:
# Inside the TensorRT-LLM Docker container or virtual environment:
# Download the model from Hugging Face first.
# You need the huggingface_hub library for this.
pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='meta-llama/Llama-3.1-8B-Instruct',
local_dir='/models/llama-3.1-8b-hf',
token='your_hf_token_here'
)
"
# Convert the Hugging Face weights to TensorRT-LLM checkpoint format.
# The conversion script ships with the TensorRT-LLM repository under
# examples/llama/; adjust the path to wherever you cloned or installed it.
python3 /path/to/TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /models/llama-3.1-8b-hf \
    --output_dir /models/llama-3.1-8b-trtllm-ckpt \
    --dtype bfloat16 \
    --tp_size 1
Step 2: Compile the TensorRT engine:
trtllm-build \
    --checkpoint_dir /models/llama-3.1-8b-trtllm-ckpt \
    --output_dir /models/llama-3.1-8b-trtllm-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_seq_len 4096
The --gemm_plugin flag enables TensorRT's optimized GEMM (General Matrix
Multiply) kernels; matrix multiplication is the core operation in transformer
inference. The --max_batch_size flag sets the maximum number of requests that
can be processed simultaneously. The --max_input_len and --max_seq_len flags
control the maximum prompt length and total context length, respectively.
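The batch size and sequence length you compile into the engine are not free:
together they determine how much memory the engine reserves for the KV cache.
The sketch below (our own estimate, using Llama 3.1 8B's published dimensions
of 32 layers, 8 KV heads via grouped-query attention, and head dimension 128)
shows the upper bound:

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_batch_size: int,
    max_seq_len: int,
    bytes_per_value: int = 2,  # bfloat16
) -> int:
    """Upper-bound KV cache size for a dense transformer.

    Two tensors (K and V) are stored per layer, per KV head, per
    sequence position, for every sequence in the batch.
    """
    return (
        2 * num_layers * num_kv_heads * head_dim
        * max_batch_size * max_seq_len * bytes_per_value
    )

# Llama 3.1 8B with the build settings above: batch 8, seq len 4096.
size = kv_cache_bytes(32, 8, 128, max_batch_size=8, max_seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # 4.0 GiB
```

Doubling either the batch size or the sequence length doubles this reservation,
which is worth keeping in mind when you size engines for the DGX Spark's
unified memory.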
11.4 Two-Node TensorRT-LLM with MPI
For two-node distributed inference with TensorRT-LLM, we use MPI (Message
Passing Interface), the standard parallel computing communication library.
Install MPI on both nodes:
sudo apt install -y openmpi-bin openmpi-common libopenmpi-dev
For a two-node setup, rebuild the TensorRT engine with tensor parallelism:
# On Node A: convert with tp_size 2
python3 /path/to/TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /models/llama-3.1-70b-hf \
    --output_dir /models/llama-3.1-70b-trtllm-ckpt \
    --dtype bfloat16 \
    --tp_size 2
# Build the engine for 2-GPU tensor parallelism
trtllm-build \
    --checkpoint_dir /models/llama-3.1-70b-trtllm-ckpt \
    --output_dir /models/llama-3.1-70b-trtllm-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 4 \
    --max_input_len 2048 \
    --max_seq_len 4096
Copy the engine to Node B (it must be identical on both nodes):
rsync -avz \
/models/llama-3.1-70b-trtllm-engine/ \
aiuser@10.0.0.2:/models/llama-3.1-70b-trtllm-engine/
Now launch the TensorRT-LLM server across both nodes using mpirun:
mpirun \
-n 2 \
--host 10.0.0.1,10.0.0.2 \
--mca btl_tcp_if_include enp1s0f0np0 \
python3 -m tensorrt_llm.serve \
--engine-dir /models/llama-3.1-70b-trtllm-engine \
--host 0.0.0.0 \
--port 8080
The -n 2 flag launches 2 MPI processes (one per node). The --host flag
specifies the two nodes using their ConnectX-7 IP addresses. The
--mca btl_tcp_if_include flag tells MPI to use the ConnectX-7 interface for
communication, routing all inter-node traffic over the high-speed direct link.
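If you launch this often, it is easy to mistype one of the mpirun flags. A
small helper (our own convenience sketch, not part of TensorRT-LLM or MPI)
that assembles the same invocation programmatically:

```python
import shlex

def build_mpirun_command(
    hosts: list[str],
    interface: str,
    engine_dir: str,
    port: int = 8080,
) -> list[str]:
    """Assemble the mpirun invocation for two-node TensorRT-LLM serving.

    One MPI rank is launched per host, and all MPI TCP traffic is pinned
    to the given network interface (the ConnectX-7 link in our setup).
    """
    return [
        "mpirun",
        "-n", str(len(hosts)),
        "--host", ",".join(hosts),
        "--mca", "btl_tcp_if_include", interface,
        "python3", "-m", "tensorrt_llm.serve",
        "--engine-dir", engine_dir,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]

cmd = build_mpirun_command(
    ["10.0.0.1", "10.0.0.2"],
    "enp1s0f0np0",
    "/models/llama-3.1-70b-trtllm-engine",
)
print(shlex.join(cmd))     # inspect the command before running it
# subprocess.run(cmd)      # uncomment to actually launch
```

Printing the command with shlex.join before executing it makes it easy to
verify the host list and interface name match your actual setup.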
11.5 Querying the TensorRT-LLM Server
The TensorRT-LLM server exposes an OpenAI-compatible API, so the same client
code works as for vLLM and SGLang. The following example adds performance
measurement to help you appreciate the speed difference:
import time
import requests
import json
from dataclasses import dataclass
@dataclass
class InferenceResult:
"""
Container for inference results including performance metrics.
This dataclass bundles the generated text with timing information,
making it easy to compare performance across different inference
engines and configurations.
"""
response_text: str
prompt_tokens: int
completion_tokens: int
total_time_seconds: float
tokens_per_second: float
def query_trtllm_server(
prompt: str,
host: str = "localhost",
port: int = 8080,
max_tokens: int = 256,
temperature: float = 0.7
) -> InferenceResult:
"""
Query the TensorRT-LLM server and measure performance.
This function sends a completion request to the TensorRT-LLM server
and measures the time taken to generate the response. The tokens
per second metric is the key performance indicator for LLM inference:
higher is better, and TensorRT-LLM typically achieves the highest
values of any inference framework on NVIDIA hardware.
Args:
prompt: The text prompt to complete.
host: The TensorRT-LLM server hostname or IP.
port: The server port number.
max_tokens: Maximum tokens to generate.
temperature: Sampling temperature.
Returns:
An InferenceResult with the response and performance metrics.
"""
url = f"http://{host}:{port}/v1/completions"
payload = {
"model": "tensorrt-llm", # TensorRT-LLM uses this as a placeholder.
"prompt": prompt,
"max_tokens": max_tokens,
"temperature": temperature,
"stream": False
}
# Record the start time before sending the request.
start_time = time.perf_counter()
response = requests.post(url, json=payload)
response.raise_for_status()
# Record the end time after receiving the complete response.
end_time = time.perf_counter()
data = response.json()
total_time = end_time - start_time
# Extract usage statistics from the response.
usage = data.get("usage", {})
prompt_tokens = usage.get("prompt_tokens", 0)
completion_tokens = usage.get("completion_tokens", 0)
# Calculate tokens per second. This is the primary performance metric.
# Divide by total time to get the overall throughput including
# network overhead and server processing time.
tokens_per_second = completion_tokens / total_time if total_time > 0 else 0
response_text = data["choices"][0]["text"] if data.get("choices") else ""
return InferenceResult(
response_text=response_text,
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_time_seconds=total_time,
tokens_per_second=tokens_per_second
)
if __name__ == "__main__":
test_prompt = (
"Explain the difference between tensor parallelism and "
"pipeline parallelism in distributed deep learning:"
)
print("Querying TensorRT-LLM server (two-node distributed)...")
result = query_trtllm_server(
prompt=test_prompt,
host="localhost",
port=8080,
max_tokens=256
)
print(f"\nResponse:\n{result.response_text}")
print(f"\nPerformance Metrics:")
print(f" Prompt tokens: {result.prompt_tokens}")
print(f" Completion tokens: {result.completion_tokens}")
print(f" Total time: {result.total_time_seconds:.2f} seconds")
print(f" Throughput: {result.tokens_per_second:.1f} tokens/second")
Chapter 12 - WRITING CODE THAT TALKS TO LOCAL AND REMOTE LLMS
12.1 The Unified Client: One Interface, Five Engines
One of the most powerful patterns when working with multiple inference engines
is to write a unified client that abstracts away the differences between them.
All five engines we have covered (Ollama, LM Studio, vLLM, SGLang, and
TensorRT-LLM) expose either the Ollama native API or an OpenAI-compatible
API. This means we can write a single client class that works with all of
them by selecting the right protocol and endpoint URL.
The following code implements a comprehensive unified LLM client that supports
both local models (via Ollama) and remote models (via any OpenAI-compatible
endpoint). It includes retry logic, error handling, and performance monitoring:
import time
import json
import logging
import requests
from enum import Enum
from dataclasses import dataclass
from typing import Optional, Iterator
from openai import OpenAI
# Configure logging so we can see what the client is doing.
# In production, you would configure this to write to a file.
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("llm_client")
class BackendType(Enum):
"""
Enumeration of supported LLM backend types.
Each backend type corresponds to a different inference engine.
The client uses this to determine which API protocol to use
and how to format requests and parse responses.
"""
OLLAMA = "ollama"
OPENAI_COMPATIBLE = "openai_compatible" # vLLM, LM Studio, SGLang, TensorRT-LLM
@dataclass
class LLMConfig:
"""
Configuration for connecting to an LLM backend.
This dataclass holds all the information needed to connect to
and communicate with an LLM inference server. Separating
configuration from logic makes it easy to switch between
different servers and models.
"""
host: str
port: int
model: str
backend_type: BackendType
api_key: str = "not-required"
timeout_seconds: int = 120
max_retries: int = 3
retry_delay_seconds: float = 1.0
@property
def base_url(self) -> str:
"""Construct the base URL for the API endpoint."""
if self.backend_type == BackendType.OLLAMA:
return f"http://{self.host}:{self.port}"
else:
return f"http://{self.host}:{self.port}/v1"
@dataclass
class ChatMessage:
"""A single message in a conversation."""
role: str # "system", "user", or "assistant"
content: str
def to_dict(self) -> dict:
"""Convert to the dict format expected by API endpoints."""
return {"role": self.role, "content": self.content}
@dataclass
class GenerationResult:
"""
The result of a text generation request.
This dataclass captures both the generated content and metadata
about the generation, including timing information that helps
you understand and optimize your inference pipeline.
"""
content: str
model: str
backend: str
prompt_tokens: int = 0
completion_tokens: int = 0
generation_time_seconds: float = 0.0
@property
def tokens_per_second(self) -> float:
"""Calculate the generation throughput in tokens per second."""
if self.generation_time_seconds > 0 and self.completion_tokens > 0:
return self.completion_tokens / self.generation_time_seconds
return 0.0
def __str__(self) -> str:
return (
f"GenerationResult(\n"
f" backend={self.backend},\n"
f" model={self.model},\n"
f" tokens={self.completion_tokens},\n"
f" speed={self.tokens_per_second:.1f} tok/s\n"
f")"
)
class UnifiedLLMClient:
"""
A unified client for interacting with multiple LLM inference backends.
This client provides a consistent interface for sending requests to
any of the five inference engines covered in this tutorial. It handles
the differences in API protocols, request formats, and response
structures transparently.
The client supports both the Ollama native API and the OpenAI-compatible
API, automatically selecting the correct protocol based on the backend
type specified in the configuration.
Usage example:
# Configure for local Ollama
config = LLMConfig(
host="localhost",
port=11434,
model="llama3.2:3b",
backend_type=BackendType.OLLAMA
)
client = UnifiedLLMClient(config)
result = client.chat([ChatMessage("user", "Hello!")])
"""
def __init__(self, config: LLMConfig) -> None:
"""
Initialize the client with the given configuration.
Args:
config: The LLMConfig specifying which server to connect to.
"""
self.config = config
self._openai_client: Optional[OpenAI] = None
# Only create the OpenAI client for OpenAI-compatible backends.
if config.backend_type == BackendType.OPENAI_COMPATIBLE:
self._openai_client = OpenAI(
base_url=config.base_url,
api_key=config.api_key,
timeout=config.timeout_seconds
)
logger.info(
f"Initialized LLM client: {config.backend_type.value} "
f"at {config.host}:{config.port} "
f"using model '{config.model}'"
)
def _chat_via_ollama(
self,
messages: list[ChatMessage],
temperature: float,
max_tokens: int
) -> GenerationResult:
"""
Send a chat request using the Ollama native API.
The Ollama API uses a different request format than the OpenAI API.
It accepts messages in a similar format but uses different field
names and response structure.
Args:
messages: The conversation history.
temperature: Sampling temperature.
max_tokens: Maximum tokens to generate.
Returns:
A GenerationResult with the response and metadata.
"""
url = f"{self.config.base_url}/api/chat"
payload = {
"model": self.config.model,
"messages": [msg.to_dict() for msg in messages],
"stream": False,
"options": {
"temperature": temperature,
"num_predict": max_tokens
}
}
start_time = time.perf_counter()
response = requests.post(
url,
json=payload,
timeout=self.config.timeout_seconds
)
response.raise_for_status()
elapsed = time.perf_counter() - start_time
data = response.json()
# Extract the response content from the Ollama API response format.
content = data.get("message", {}).get("content", "")
# Ollama provides token counts in the response metadata.
prompt_tokens = data.get("prompt_eval_count", 0)
completion_tokens = data.get("eval_count", 0)
return GenerationResult(
content=content,
model=self.config.model,
backend="ollama",
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
generation_time_seconds=elapsed
)
def _chat_via_openai(
self,
messages: list[ChatMessage],
temperature: float,
max_tokens: int
) -> GenerationResult:
"""
Send a chat request using the OpenAI-compatible API.
This method works with vLLM, LM Studio, SGLang, and TensorRT-LLM,
all of which implement the OpenAI chat completions API.
Args:
messages: The conversation history.
temperature: Sampling temperature.
max_tokens: Maximum tokens to generate.
Returns:
A GenerationResult with the response and metadata.
"""
assert self._openai_client is not None, (
"OpenAI client not initialized. "
"Check that backend_type is OPENAI_COMPATIBLE."
)
start_time = time.perf_counter()
completion = self._openai_client.chat.completions.create(
model=self.config.model,
messages=[msg.to_dict() for msg in messages],
temperature=temperature,
max_tokens=max_tokens
)
elapsed = time.perf_counter() - start_time
content = completion.choices[0].message.content or ""
usage = completion.usage
return GenerationResult(
content=content,
model=self.config.model,
backend=self.config.backend_type.value,
prompt_tokens=usage.prompt_tokens if usage else 0,
completion_tokens=usage.completion_tokens if usage else 0,
generation_time_seconds=elapsed
)
def chat(
self,
messages: list[ChatMessage],
temperature: float = 0.7,
max_tokens: int = 512
) -> GenerationResult:
"""
Send a chat request with automatic retry on failure.
This is the primary public method for sending requests. It
automatically selects the correct API protocol based on the
backend type and retries failed requests up to max_retries times.
Args:
messages: The conversation history as a list of ChatMessages.
temperature: Sampling temperature (0.0 = deterministic).
max_tokens: Maximum number of tokens to generate.
Returns:
A GenerationResult with the response and performance metrics.
Raises:
RuntimeError: If all retry attempts fail.
"""
last_error: Optional[Exception] = None
for attempt in range(self.config.max_retries):
try:
if self.config.backend_type == BackendType.OLLAMA:
result = self._chat_via_ollama(
messages, temperature, max_tokens
)
else:
result = self._chat_via_openai(
messages, temperature, max_tokens
)
logger.info(
f"Request completed: {result.completion_tokens} tokens "
f"at {result.tokens_per_second:.1f} tok/s"
)
return result
            except Exception as error:
                last_error = error
                # Only sleep and retry if attempts remain; after the final
                # attempt we fall through and raise immediately.
                if attempt < self.config.max_retries - 1:
                    logger.warning(
                        f"Request attempt {attempt + 1} failed: {error}. "
                        f"Retrying in {self.config.retry_delay_seconds}s..."
                    )
                    time.sleep(self.config.retry_delay_seconds)
                else:
                    logger.warning(f"Final attempt {attempt + 1} failed: {error}")
raise RuntimeError(
f"All {self.config.max_retries} attempts failed. "
f"Last error: {last_error}"
)
def stream_chat(
self,
messages: list[ChatMessage],
temperature: float = 0.7,
max_tokens: int = 512
) -> Iterator[str]:
"""
Stream a chat response token by token.
This method yields tokens as they are generated, which is
useful for building responsive user interfaces. Note that
streaming is only supported for OpenAI-compatible backends
in this implementation.
Args:
messages: The conversation history.
temperature: Sampling temperature.
max_tokens: Maximum tokens to generate.
Yields:
Individual tokens as strings.
"""
if self.config.backend_type == BackendType.OLLAMA:
# Use the Ollama streaming API.
url = f"{self.config.base_url}/api/chat"
payload = {
"model": self.config.model,
"messages": [msg.to_dict() for msg in messages],
"stream": True,
"options": {"temperature": temperature, "num_predict": max_tokens}
}
with requests.post(url, json=payload, stream=True) as response:
response.raise_for_status()
for line in response.iter_lines():
if line:
chunk = json.loads(line)
token = chunk.get("message", {}).get("content", "")
if token:
yield token
if chunk.get("done", False):
break
else:
# Use the OpenAI streaming API.
assert self._openai_client is not None
stream = self._openai_client.chat.completions.create(
model=self.config.model,
messages=[msg.to_dict() for msg in messages],
temperature=temperature,
max_tokens=max_tokens,
stream=True
)
for chunk in stream:
if chunk.choices and chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
def demonstrate_all_backends() -> None:
"""
Demonstrate the unified client with all five inference backends.
This function creates a client for each backend and sends the same
question to all of them, then compares the responses and performance.
It assumes all servers are running on the appropriate ports as
configured throughout this tutorial.
"""
# Define configurations for all five backends.
# Adjust host addresses and ports to match your actual setup.
configs = [
LLMConfig(
host="localhost",
port=11434,
model="llama3.2:3b",
backend_type=BackendType.OLLAMA
),
LLMConfig(
host="localhost",
port=1234,
model="local-model",
backend_type=BackendType.OPENAI_COMPATIBLE
),
LLMConfig(
host="localhost",
port=8000,
model="meta-llama/Llama-3.1-70B-Instruct",
backend_type=BackendType.OPENAI_COMPATIBLE
),
LLMConfig(
host="localhost",
port=30000,
model="meta-llama/Llama-3.1-8B-Instruct",
backend_type=BackendType.OPENAI_COMPATIBLE
),
LLMConfig(
host="localhost",
port=8080,
model="tensorrt-llm",
backend_type=BackendType.OPENAI_COMPATIBLE
)
]
backend_names = ["Ollama", "LM Studio", "vLLM", "SGLang", "TensorRT-LLM"]
# The same question is sent to all backends for a fair comparison.
question = (
"In one paragraph, explain why unified memory architecture is "
"important for running large language models."
)
messages = [
ChatMessage(
role="system",
content="You are a concise technical expert. Answer in one paragraph."
),
ChatMessage(role="user", content=question)
]
print("=" * 70)
print("COMPARING ALL FIVE INFERENCE BACKENDS")
print("=" * 70)
print(f"Question: {question}\n")
for config, name in zip(configs, backend_names):
print(f"\n--- {name} ---")
try:
client = UnifiedLLMClient(config)
result = client.chat(messages=messages, temperature=0.3, max_tokens=256)
print(f"Response: {result.content}")
print(f"Speed: {result.tokens_per_second:.1f} tokens/second")
except Exception as error:
print(f"Error connecting to {name}: {error}")
print("(Make sure the server is running on the expected port)")
if __name__ == "__main__":
demonstrate_all_backends()
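The five port numbers above recur throughout this tutorial, and scattering
them across scripts invites typos. One way to keep them consistent is a small
registry; the sketch below is our own convention (names and structure are not
part of any library), using the default ports assumed in this chapter:

```python
# Default endpoints assumed throughout this tutorial. Adjust to your setup.
BACKEND_DEFAULTS: dict[str, dict] = {
    "ollama":       {"port": 11434, "protocol": "ollama"},
    "lmstudio":     {"port": 1234,  "protocol": "openai"},
    "vllm":         {"port": 8000,  "protocol": "openai"},
    "sglang":       {"port": 30000, "protocol": "openai"},
    "tensorrt-llm": {"port": 8080,  "protocol": "openai"},
}

def endpoint_for(backend: str, host: str = "localhost") -> str:
    """Return the base URL for a backend by name.

    OpenAI-compatible servers expect the /v1 prefix; Ollama's native
    API does not use it.
    """
    entry = BACKEND_DEFAULTS[backend]
    suffix = "/v1" if entry["protocol"] == "openai" else ""
    return f"http://{host}:{entry['port']}{suffix}"

print(endpoint_for("vllm"))    # http://localhost:8000/v1
print(endpoint_for("ollama"))  # http://localhost:11434
```

Pointing a backend at Node B is then just a matter of passing its ConnectX-7
address as the host argument.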
Chapter 13 - MONITORING, TROUBLESHOOTING, AND KEEPING THINGS RUNNING
13.1 Real-Time GPU Monitoring
Understanding what your GPUs are doing is essential for diagnosing performance
issues and ensuring your inference workloads are running efficiently. The
primary tool for this is nvidia-smi, which you can run in watch mode to get
a continuously updating display:
watch -n 1 nvidia-smi
This refreshes the output every second. You will see GPU utilization
(ideally close to 100% during inference), memory usage (which grows as models
are loaded), temperature (should stay below 85°C for sustained workloads),
and power consumption.
The nvtop tool provides a more visual, htop-like interface:
nvtop
For monitoring both nodes simultaneously from a single terminal, you can use
SSH to run nvidia-smi on Node B and display the output locally:
# In one terminal pane: monitor Node A
watch -n 1 nvidia-smi
# In another terminal pane: monitor Node B via SSH.
# The -t flag allocates a pseudo-terminal, which watch requires.
ssh -t aiuser@10.0.0.2 "watch -n 1 nvidia-smi"
13.2 Monitoring Network Performance
During distributed inference, the ConnectX-7 link is the critical path. If
the network is not performing well, your distributed inference will be slow
regardless of how fast the GPUs are. Monitor network throughput with:
# Watch network interface statistics in real time.
# Replace enp1s0f0np0 with your actual ConnectX-7 interface name.
watch -n 1 "cat /proc/net/dev | grep enp1s0f0np0"
For a more detailed view, use the sar tool (part of the sysstat package):
sudo apt install -y sysstat
sar -n DEV 1 100
This shows network statistics for all interfaces, updated every second, for
100 iterations. Look for the enp1s0f0np0 interface and check that the
rxkB/s and txkB/s values are consistent with your expected workload.
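The counters in /proc/net/dev are cumulative byte totals, so throughput comes
from sampling the file twice and dividing the delta by the interval. A minimal
parser sketch (our own; the sample line below is fabricated for illustration,
with the fields laid out as the kernel emits them):

```python
def parse_proc_net_dev(text: str, interface: str) -> tuple[int, int]:
    """Extract cumulative (rx_bytes, tx_bytes) for one interface
    from the contents of /proc/net/dev."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == interface:
            fields = rest.split()
            # Field 0 is receive bytes; field 8 is transmit bytes.
            return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {interface!r} not found")

def throughput_gbps(delta_bytes: int, interval_s: float) -> float:
    """Convert a byte delta over an interval into gigabits per second."""
    return delta_bytes * 8 / interval_s / 1e9

# In real use, read /proc/net/dev twice and subtract; here we treat the
# counters as a one-second delta for illustration.
sample = "enp1s0f0np0: 1250000000 900 0 0 0 0 0 0 2500000000 1200 0 0 0 0 0 0"
rx, tx = parse_proc_net_dev(sample, "enp1s0f0np0")
print(throughput_gbps(rx, 1.0))  # 10.0 Gbit/s
```

At 10 Gbit/s you are using a tenth of the ConnectX-7 link; sustained numbers
far below what your workload should generate are a sign the traffic is taking
the wrong interface.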
13.3 Setting Up Systemd Services for Inference Engines
For production use, you want your inference servers to start automatically
when the machine boots and to restart automatically if they crash. Systemd
services handle this perfectly. Here is an example service file for vLLM:
sudo nano /etc/systemd/system/vllm-server.service
Enter the following content, adjusting paths and parameters for your setup:
[Unit]
Description=vLLM OpenAI-Compatible Inference Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=aiuser
WorkingDirectory=/home/aiuser
Environment="PATH=/home/aiuser/vllm-env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="HF_TOKEN=your_huggingface_token_here"
ExecStart=/home/aiuser/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable vllm-server
sudo systemctl start vllm-server
View the service logs:
journalctl -u vllm-server -f
The -f flag follows the log in real time, similar to tail -f. This is
invaluable for debugging startup issues.
13.4 A Health Check Script
The following script checks the health of all inference engines and reports
which ones are running and responding correctly. Run this after a reboot or
when you suspect something is not working:
import requests
import subprocess
from dataclasses import dataclass
@dataclass
class ServiceStatus:
"""Status information for a single inference service."""
name: str
host: str
port: int
is_running: bool
response_time_ms: float
error_message: str = ""
def check_service_health(
name: str,
host: str,
port: int,
health_endpoint: str = "/health",
timeout: float = 5.0
) -> ServiceStatus:
"""
Check whether an inference service is running and responding.
This function attempts to connect to the service's health endpoint
and measures the response time. A successful response (HTTP 200)
indicates the service is healthy. Any error indicates a problem
that needs investigation.
Args:
name: Human-readable name of the service.
host: The hostname or IP of the service.
port: The port number.
health_endpoint: The URL path for the health check endpoint.
timeout: Maximum time to wait for a response in seconds.
Returns:
A ServiceStatus object with the health check results.
"""
import time
url = f"http://{host}:{port}{health_endpoint}"
start = time.perf_counter()
try:
response = requests.get(url, timeout=timeout)
elapsed_ms = (time.perf_counter() - start) * 1000
return ServiceStatus(
name=name,
host=host,
port=port,
is_running=response.status_code == 200,
response_time_ms=elapsed_ms
)
except requests.exceptions.ConnectionError:
return ServiceStatus(
name=name,
host=host,
port=port,
is_running=False,
response_time_ms=0.0,
error_message="Connection refused - service may not be running"
)
except requests.exceptions.Timeout:
return ServiceStatus(
name=name,
host=host,
port=port,
is_running=False,
response_time_ms=timeout * 1000,
error_message="Timeout - service is running but not responding"
)
def check_gpu_health() -> dict:
"""
Check GPU health using nvidia-smi.
Returns a dict with GPU temperature, utilization, and memory usage.
This helps identify whether the GPU is overheating or running out
of memory, which are common causes of inference failures.
"""
try:
result = subprocess.run(
[
"nvidia-smi",
"--query-gpu=temperature.gpu,utilization.gpu,memory.used,memory.total",
"--format=csv,noheader,nounits"
],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
values = result.stdout.strip().split(", ")
return {
"temperature_c": int(values[0]),
"utilization_pct": int(values[1]),
"memory_used_mb": int(values[2]),
"memory_total_mb": int(values[3])
}
except Exception as error:
return {"error": str(error)}
return {}
if __name__ == "__main__":
print("=" * 60)
print("DGX SPARK INFERENCE ENGINE HEALTH CHECK")
print("=" * 60)
# Define all services to check.
services_to_check = [
("Ollama (Node A)", "localhost", 11434, "/api/tags"),
("Ollama (Node B)", "10.0.0.2", 11434, "/api/tags"),
("LM Studio (Node A)", "localhost", 1234, "/v1/models"),
("vLLM (Node A)", "localhost", 8000, "/health"),
("SGLang (Node A)", "localhost", 30000, "/health"),
("TensorRT-LLM (Node A)", "localhost", 8080, "/health"),
]
print("\nService Status:")
print("-" * 60)
for name, host, port, endpoint in services_to_check:
status = check_service_health(name, host, port, endpoint)
status_str = "RUNNING" if status.is_running else "DOWN"
if status.is_running:
print(
f" [{status_str:7}] {name:<30} "
f"{status.response_time_ms:.1f}ms"
)
else:
print(
f" [{status_str:7}] {name:<30} "
f"{status.error_message}"
)
print("\nGPU Health (Node A):")
print("-" * 60)
gpu_info = check_gpu_health()
if "error" not in gpu_info:
print(f" Temperature: {gpu_info['temperature_c']}°C")
print(f" Utilization: {gpu_info['utilization_pct']}%")
memory_used_gb = gpu_info['memory_used_mb'] / 1024
memory_total_gb = gpu_info['memory_total_mb'] / 1024
print(f" Memory: {memory_used_gb:.1f}GB / {memory_total_gb:.1f}GB")
else:
print(f" Error: {gpu_info['error']}")
13.5 Common Problems and Solutions
The following describes the most common issues you will encounter and how to
resolve them.
If nvidia-smi shows "No devices were found" after a kernel update, the GPU
driver module may not have been recompiled for the new kernel. Run
"sudo apt install --reinstall nvidia-driver-570" (or whatever the current
driver version is) to reinstall the driver, which triggers recompilation.
If an inference server fails to start with "CUDA out of memory," another
process is using GPU memory. Run "nvidia-smi" to identify the process, then
kill it with "sudo kill -9 <PID>". Also check that you are not trying to load
a model that exceeds the available memory.
If the ConnectX-7 link shows as "down" in "ip link show," check that the DAC
cable is fully seated in both ports. Try unplugging and replugging the cable.
If the problem persists, verify the cable is rated for 100G (QSFP28) and not
40G (QSFP+).
If NCCL reports "Connection refused" during multi-node startup, the firewall
may be blocking the NCCL communication ports. Disable the firewall temporarily
for testing: "sudo ufw disable". If this fixes the problem, add rules to allow
traffic on the NCCL ports (typically 29500 and above) between the two nodes.
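Before touching firewall rules, it is worth confirming which ports are
actually reachable from the other node. A small stdlib probe (our own sketch;
29500 is the rendezvous port PyTorch-style launchers typically default to,
and 10.0.0.2 is Node B in our setup):

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the default rendezvous port on Node B from Node A.
state = "open" if port_is_open("10.0.0.2", 29500) else "closed/filtered"
print(f"10.0.0.2:29500 -> {state}")
```

A "closed/filtered" result while the server process is running on Node B
points at the firewall (or the wrong bind interface) rather than NCCL itself.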
If vLLM's Ray cluster fails to form, ensure that the Ray head node is fully
started before running "ray start" on the worker node. The head node logs
should show "Ray runtime started" before you proceed with the worker.
Chapter 14 - CLOSING THOUGHTS AND NEXT STEPS
14.1 What You Have Accomplished
If you have followed this tutorial to this point, you have done something
genuinely impressive. You have set up two of the most powerful personal AI
workstations available, connected them with a high-speed 100GbE direct link,
configured RDMA for low-latency GPU-to-GPU communication, and installed five
different inference engines that cover the full spectrum from beginner-friendly
to maximum-performance. You have also written Python code that can talk to all
of these engines, both locally and across the network.
This is not a trivial achievement. Many organizations spend months and
significant engineering resources to build AI inference infrastructure at this
level. You now have it running on two machines on your desk.
14.2 Choosing the Right Tool for the Job
Now that you have all five engines available, here is a practical guide to
choosing the right one for different situations.
Ollama is the right choice when you want to quickly experiment with a new
model, when you need the simplest possible setup, or when you are building a
prototype that you want to get running in minutes rather than hours. Its model
library is excellent, and the REST API is simple and well-documented.
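As a reminder of how simple that REST API is, here is a minimal non-streaming
client using only the standard library. The address and model name match this
tutorial's setup; substitute your own:

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.100:11434"  # Node A management address

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to Ollama and return the generated text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]

# Example usage (requires Ollama running with the model pulled):
# print(generate("llama3.2:3b", "Explain RDMA in one sentence."))
```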
LM Studio is the right choice when you want a graphical interface for
interactive model exploration, when you are demonstrating AI capabilities to
non-technical stakeholders, or when you want to quickly compare different
models side by side without writing code.
vLLM is the right choice when you need to serve many concurrent users with
high throughput, when you are building a production API service, or when you
need the flexibility of multi-node tensor parallelism for very large models.
Its PagedAttention makes it the most memory-efficient option for high-concurrency
workloads.
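To see that high-concurrency strength in action, you can fire several requests
at vLLM's OpenAI-compatible endpoint at once; its continuous batching merges
them on the GPU. A sketch using only the standard library — the served model
name is whatever you passed to "vllm serve", so the one below is a placeholder:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

VLLM_URL = "http://192.168.1.100:8000/v1/completions"  # Node A, head node
MODEL = "your-served-model"  # placeholder: use the name you served with vLLM

def build_completion_request(prompt: str, max_tokens: int = 64) -> dict:
    """Payload for the OpenAI-compatible /v1/completions endpoint."""
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt: str) -> str:
    data = json.dumps(build_completion_request(prompt)).encode()
    req = urllib.request.Request(
        VLLM_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# Example usage: eight concurrent requests, batched server-side by vLLM.
# prompts = [f"Write a haiku about GPU number {i}." for i in range(8)]
# with ThreadPoolExecutor(max_workers=8) as pool:
#     for text in pool.map(complete, prompts):
#         print(text.strip(), "\n---")
```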
SGLang is the right choice when your application involves structured output
generation (JSON, XML, specific formats), complex multi-step reasoning chains,
or workloads where many requests share a common prefix (like a shared system
prompt). Its RadixAttention makes it uniquely efficient for these patterns.
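A sketch of structured output against SGLang's OpenAI-compatible endpoint.
The "response_format" field follows the OpenAI JSON-schema convention, which
recent SGLang versions support — verify against your installed version; the
schema and model name here are illustrative:

```python
import json
import urllib.request

SGLANG_URL = "http://192.168.1.100:30000/v1/chat/completions"  # Node A

SCHEMA = {  # the shape we want the model to emit
    "type": "object",
    "properties": {"name": {"type": "string"}, "port": {"type": "integer"}},
    "required": ["name", "port"],
}

def build_structured_request(question: str) -> dict:
    """Chat request that constrains the reply to match SCHEMA."""
    return {
        "model": "default",
        "messages": [{"role": "user", "content": question}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "engine_info", "schema": SCHEMA},
        },
    }

def ask(question: str) -> dict:
    data = json.dumps(build_structured_request(question)).encode()
    req = urllib.request.Request(
        SGLANG_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        content = json.loads(resp.read())["choices"][0]["message"]["content"]
    return json.loads(content)  # parseable JSON under constrained decoding

# Example usage (requires the SGLang server running):
# print(ask("Which engine serves structured output here, and on which port?"))
```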
TensorRT-LLM is the right choice when raw inference speed is the paramount
concern and you are willing to invest time in the compilation process. If you
are running the same model continuously in production and need the absolute
maximum tokens per second, TensorRT-LLM will outperform all other options on
NVIDIA hardware.
14.3 Next Steps and Further Exploration
The setup described in this tutorial is a solid foundation, but there is much
more to explore. Fine-tuning models on your own data using frameworks like
Hugging Face PEFT and LoRA is a natural next step that allows you to customize
models for your specific domain. The DGX Spark's unified memory architecture
makes fine-tuning of 7B-13B parameter models feasible on a single node.
Exploring quantization techniques - specifically GPTQ, AWQ, and GGUF - will
help you fit larger models into the available memory and run them faster.
Quantization reduces the precision of model weights (from 16-bit to 8-bit or
4-bit), trading a small amount of quality for significant reductions in memory
usage and inference time.
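The memory savings are easy to estimate from first principles: weight memory
is roughly parameter count times bits per weight divided by eight. A quick
back-of-the-envelope calculator (weights only — KV cache and activations
come on top):

```python
def weight_memory_gb(n_params_billion: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GB (decimal)."""
    bytes_total = n_params_billion * 1e9 * bits / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"70B model at {bits:>2}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# prints:
# 70B model at 16-bit: ~140 GB
# 70B model at  8-bit: ~70 GB
# 70B model at  4-bit: ~35 GB
```

This is why a 70B model that is hopeless at 16-bit precision becomes
practical once quantized to 4-bit.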
Building a proper model serving pipeline with load balancing, request queuing,
and monitoring using tools like Prometheus and Grafana will prepare your setup
for production use. The health check script in Chapter 13 is a starting point,
but a full observability stack gives you much deeper insight into system
behavior.
Experimenting with multimodal models - models that can process both text and
images - is another exciting direction. The DGX Spark's memory capacity makes
it well-suited for models like LLaVA, Qwen-VL, and similar vision-language
models.
Finally, connecting your two-node DGX Spark cluster to a larger network of
machines, or integrating it with cloud resources for burst capacity, opens up
possibilities for truly large-scale AI workloads. The skills and concepts you
have learned in this tutorial - RDMA networking, distributed inference,
multi-node coordination - are directly applicable to clusters of any size.
The machines are ready. The software is installed. The network is configured.
What you build next is entirely up to you.
APPENDIX: QUICK REFERENCE CARD
NETWORK ADDRESSES
Node A management: 192.168.1.100
Node B management: 192.168.1.101
Node A ConnectX-7: 10.0.0.1
Node B ConnectX-7: 10.0.0.2
INFERENCE ENGINE PORTS
Ollama: 11434 (both nodes)
LM Studio: 1234 (both nodes)
vLLM: 8000 (Node A, head node)
SGLang: 30000 (Node A, head node)
TensorRT-LLM: 8080 (Node A, head node)
ESSENTIAL COMMANDS
Check GPU status: nvidia-smi
Monitor GPU live: watch -n 1 nvidia-smi
Monitor GPU visual: nvtop
Check network: ip link show
Test bandwidth: iperf3 -c 10.0.0.2 -t 30 -P 4
Pull Ollama model: ollama pull llama3.2:3b
Run Ollama model: ollama run llama3.2:3b
Check service: systemctl status <service-name>
View service logs: journalctl -u <service-name> -f
ENVIRONMENT VARIABLES FOR NCCL
NCCL_IB_GID_INDEX=3
NCCL_IB_DISABLE=0
NCCL_NET_GDR_LEVEL=5
NCCL_SOCKET_IFNAME=enp1s0f0np0