(c) Nvidia
Chapter 0 - BEFORE WE BEGIN: WHAT IS THIS ALL ABOUT?
Imagine having a personal AI supercomputer sitting on your desk. Not a cloud
instance you rent by the hour, not a shared cluster you have to queue for, but
a machine that belongs to you, runs entirely offline if you want, and can
execute large language models that would make most laptops weep. Now imagine
having two of them, connected at high speed, working in concert. That is
exactly what this tutorial is about.
The NVIDIA DGX Spark is not a gaming PC with a fancy GPU strapped to it. It
is a purpose-built AI workstation that packs a genuinely remarkable amount of
compute into a compact desktop chassis. When you connect two of them with
NVIDIA's ConnectX-7 networking adapter, you create a small but serious
two-node AI cluster capable of running models that exceed what either machine
could handle alone.
This tutorial assumes you are reasonably comfortable with Linux command-line
basics - you know what a terminal is, you can type commands, and you are not
afraid of a configuration file. Beyond that, no deep expertise is required.
Every concept will be explained from the ground up, including the "why" behind
each decision, not just the "how." By the time you finish reading, you will
understand what you are doing and why it works, not just which commands to
type.
We will cover five different inference engines - Ollama, LM Studio, vLLM,
SGLang, and TensorRT-LLM - because different tools excel at different tasks,
and a well-equipped AI practitioner knows when to reach for which tool. We
will also write actual Python code that communicates with these engines, both
locally and across the network between your two machines.
Let us begin.
Chapter 1 - MEET THE MACHINE: THE NVIDIA DGX SPARK DEEP DIVE
1.1 The Big Picture
The DGX Spark is built around a single chip that represents one of the most
significant architectural leaps in AI hardware in recent years: the NVIDIA GB10
Grace Blackwell Superchip. To understand why this chip matters, we need to
briefly discuss what has historically been the biggest bottleneck in running
large language models on a single machine.
Traditionally, a computer has a CPU (the general-purpose processor that runs
your operating system and applications) and a GPU (the massively parallel
processor that handles AI computations). These two components sit on separate
chips, connected by a PCIe bus. PCIe is fast by everyday standards, but it is
glacially slow compared to the internal buses within each chip. When a large
language model runs, data must constantly shuttle back and forth between CPU
memory (RAM) and GPU memory (VRAM). This shuttle service is expensive in both
time and energy.
The GB10 solves this problem by eliminating the separation entirely. The Grace
CPU and the Blackwell GPU are connected via NVLink-C2C (Chip-to-Chip), a
proprietary interconnect that delivers 900 gigabytes per second of bidirectional
bandwidth. For context, a typical PCIe 5.0 x16 connection delivers roughly
64 GB/s in each direction, or about 128 GB/s bidirectional. The NVLink-C2C
connection is therefore roughly seven times faster. This is not a minor
improvement; it is a qualitative change in what becomes possible.
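To get an intuition for what these numbers mean in practice, here is a back-of-envelope sketch in Python. It treats the 900 GB/s figure as roughly 450 GB/s in each direction and compares it with PCIe 5.0 x16 at about 64 GB/s per direction; the 40 GB payload is an arbitrary example.

```python
def transfer_time_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Seconds needed to move size_gb of data over a link of the given bandwidth."""
    return size_gb / bandwidth_gb_per_s

# Per-direction bandwidths (assumption: the 900 GB/s NVLink-C2C figure is
# bidirectional, so roughly 450 GB/s each way; PCIe 5.0 x16 is ~64 GB/s each way).
PCIE5_X16 = 64.0
NVLINK_C2C = 450.0

size = 40.0  # GB of model weights to move, as an example
print(f"PCIe 5.0 x16: {transfer_time_seconds(size, PCIE5_X16):.3f} s")
print(f"NVLink-C2C:   {transfer_time_seconds(size, NVLINK_C2C):.3f} s")
print(f"Speedup:      {NVLINK_C2C / PCIE5_X16:.1f}x")
```

The absolute numbers matter less than the ratio: every CPU-GPU transfer that a conventional machine pays for over PCIe is several times cheaper here.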
1.2 The Memory Architecture: Why 128GB Unified Memory Is a Game Changer
Because the CPU and GPU are so tightly coupled, NVIDIA designed the DGX Spark
with a single unified memory pool of 128 gigabytes of LPDDR5X memory. This
memory is simultaneously accessible by both the CPU cores and the Blackwell GPU
without any copying. When you load a 70-billion-parameter language model, it
lives in this shared pool and the GPU can access every byte of it at full speed.
To appreciate why this matters, consider a conventional workstation with a
high-end GPU that has 24GB of VRAM. If you want to run a model that requires
48GB, you simply cannot do it on that GPU alone - the model does not fit. You
would need to either quantize the model aggressively (reducing its quality) or
use multiple GPUs. On the DGX Spark, a 48GB model fits comfortably in the
128GB unified pool, and the GPU can access it as if it were native VRAM.
The practical consequence is that the DGX Spark can run models in the 70B
parameter range in 8-bit precision (at FP16, a 70B model needs roughly 140 GB
for its weights alone, slightly more than the pool holds), or models
approaching 200B parameters in 4-bit quantized form. This is extraordinary
for a single desktop machine.
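You can sanity-check what fits with a few lines of Python. This rule of thumb counts model weights only - the KV cache and runtime overhead also need memory, so real-world ceilings are somewhat lower:

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of a model's weights alone, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

UNIFIED_POOL_GB = 128

for params, bits in [(60, 16), (70, 8), (200, 4), (200, 8)]:
    size = weights_gb(params, bits)
    verdict = "fits" if size < UNIFIED_POOL_GB else "does not fit"
    print(f"{params}B at {bits}-bit: {size:.0f} GB -> {verdict}")
```

The same arithmetic explains why quantization matters so much: halving the bits per weight doubles the size of model you can hold in the same pool.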
1.3 The Compute Specifications
Let us look at the full hardware specification of the DGX Spark:
GPU: NVIDIA Blackwell GPU (part of GB10 Superchip)
AI Performance: Up to 1 PFLOPS (petaFLOP per second) at FP4 precision
CPU: 20-core NVIDIA Grace CPU (10 Arm Cortex-X925 + 10 Cortex-A725 cores)
Memory: 128 GB LPDDR5X unified memory (shared CPU + GPU)
Storage: 4 TB NVMe SSD
Networking: NVIDIA ConnectX-7 (100 Gigabit Ethernet)
USB: Multiple USB 3.2 and USB-C ports
Display: DisplayPort output
Power: 170W TDP (Thermal Design Power)
OS: Ubuntu 24.04 LTS (pre-installed)
One teraFLOPS is one trillion floating-point operations per second. One
petaFLOPS is one thousand teraFLOPS, or one quadrillion operations per second.
At FP4 precision (4-bit floating point, commonly used for inference), the DGX
Spark delivers this performance while drawing only 170 watts - less than many
gaming GPUs draw on their own.
The 20-core Grace CPU pairs ten Arm Cortex-X925 performance cores with ten
Cortex-A725 efficiency cores, all modern Armv9 designs, and it is tuned for
the memory-intensive workloads that AI inference demands.
1.4 The Software Foundation
The DGX Spark ships with Ubuntu 24.04 LTS pre-installed. This is important
because Ubuntu 24.04 is a Long-Term Support release, meaning it will receive
security updates and support until 2029. NVIDIA has configured the system with
all necessary GPU drivers, CUDA libraries, and NVIDIA Container Toolkit
pre-installed. You do not need to hunt for drivers or fight with kernel
modules. The machine is ready to run AI workloads out of the box.
The CUDA toolkit on the DGX Spark is version 12.x, which modern inference
frameworks target. The system also includes cuDNN (CUDA Deep Neural Network
library), NCCL (NVIDIA Collective Communications Library, essential for
multi-node communication), and the NVIDIA Container Runtime, which allows
Docker containers to access the GPU directly.
Chapter 2 - THE NERVOUS SYSTEM: CONNECTX-7 AND HIGH-SPEED NETWORKING
2.1 What Is ConnectX-7?
The NVIDIA ConnectX-7 is a network adapter, but calling it "just a network
adapter" is like calling a Formula 1 car "just a car." ConnectX-7 is a
smart network interface card (SmartNIC) that supports both InfiniBand and
Ethernet protocols at speeds up to 400 Gb/s in its highest configurations.
In the DGX Spark, it operates at 100 Gigabit Ethernet (100GbE).
What makes ConnectX-7 special for AI workloads is its support for RDMA -
Remote Direct Memory Access. RDMA allows one machine to read from or write to
the memory of another machine directly, without involving the CPU of the
remote machine. In traditional networking, when machine A sends data to
machine B, machine B's CPU must be interrupted, the data must be copied from
the network buffer into application memory, and then the application can use
it. With RDMA, the data goes directly from machine A's memory to machine B's
memory, bypassing both CPUs entirely.
For distributed AI inference, this is enormously valuable. When two DGX Spark
units are running a model together and need to exchange intermediate results
(called activations) between layers, RDMA allows this exchange to happen at
near-memory speeds rather than through the kernel network stack. The latency
drops from tens of microseconds to a few microseconds, and the CPU is free to
do other work.
2.2 RoCE: RDMA Over Converged Ethernet
InfiniBand is the traditional protocol for RDMA in high-performance computing,
but it requires specialized InfiniBand switches and cables. The DGX Spark uses
a technology called RoCE (RDMA over Converged Ethernet, pronounced "rocky"),
which brings RDMA capabilities to standard Ethernet infrastructure. This means
you can connect two DGX Spark units with a standard 100GbE cable and still
get RDMA performance.
RoCE version 2 (RoCEv2) is the relevant standard here. It encapsulates RDMA
packets inside standard UDP/IP packets, which means they can be routed across
standard Ethernet networks. For a direct connection between two machines, this
is straightforward to configure.
2.3 The Cable You Need
To connect two DGX Spark units directly, you need one of the following:
A DAC (Direct Attach Copper) cable is the simplest option for short distances
up to about 5 meters. It is a passive cable with QSFP28 connectors on each
end that plugs directly into the ConnectX-7 port. DAC cables are inexpensive
and reliable for desk-to-desk connections.
An active optical cable (AOC), or a pair of QSFP28 optical transceivers with
fiber optic cable between them, is appropriate for longer distances, up to
hundreds of meters. This is more expensive but necessary if your two machines
are in different rooms or on different floors.
For most users setting up two DGX Spark units in the same office or lab, a
100GbE DAC cable of 1-3 meters is the right choice. Make sure it is rated for
QSFP28 (100G), not the older QSFP+ (40G) standard.
Chapter 3 - PHYSICAL SETUP: CABLES, POWER, AND FIRST BOOT
3.1 Unboxing and Placement
When your DGX Spark units arrive, give them time to reach room temperature
before powering them on, especially if they were shipped in cold weather.
Condensation inside electronics is not your friend. An hour at room temperature
is sufficient.
Place the units on a stable, flat surface with adequate airflow. The DGX Spark
has intake vents on the sides and exhaust at the rear. Leave at least 10 cm
(4 inches) of clearance on all sides. Do not stack them directly on top of
each other without a spacer, as the bottom unit's exhaust will feed hot air
into the top unit's intake. Side-by-side placement is ideal.
We will call the two machines "Node A" and "Node B" throughout this tutorial.
You can label them with a piece of tape if that helps you keep track.
3.2 Power Connections
Each DGX Spark uses a standard IEC C13 power connector (the same type used by
most desktop computers and monitors). Connect each unit to a power outlet or,
preferably, to a UPS (Uninterruptible Power Supply). A UPS protects against
sudden power loss, which can corrupt filesystems and interrupt long-running
AI jobs. For two machines drawing up to 170W each, a 1000VA UPS is more than
sufficient.
3.3 The ConnectX-7 Network Connection
Locate the ConnectX-7 port on the rear of each DGX Spark. It is a QSFP28
port, which looks like a slightly larger version of a standard SFP+ port.
Connect one end of your 100GbE DAC cable to Node A and the other end to Node B.
The cable is keyed and will only insert in the correct orientation. You should
feel a positive click when it is fully seated.
In addition to the direct ConnectX-7 connection between the two nodes, you
will also want to connect each machine to your regular office or home network
via the standard 1GbE or 10GbE Ethernet port. This management network is used
for internet access, software updates, and SSH access from your laptop. The
ConnectX-7 link is dedicated to high-speed AI traffic between the two nodes.
3.4 Display, Keyboard, and Mouse for Initial Setup
For the very first boot, you need a monitor, keyboard, and mouse connected to
at least one of the machines (Node A is a good choice to start with). The DGX
Spark has a DisplayPort output, so you need either a DisplayPort monitor or a
DisplayPort-to-HDMI adapter. Connect a USB keyboard and mouse to the USB ports.
After the initial setup is complete, you can switch to headless operation and
disconnect the peripherals. We will cover both modes in detail in Chapters 4 and 5.
3.5 First Power-On
Press the power button on Node A. The system will run through a POST (Power-On
Self-Test) and then boot into Ubuntu 24.04. The first boot may take slightly
longer than subsequent boots as the system initializes hardware and expands
the filesystem to fill the 4TB NVMe SSD.
You will be greeted by the Ubuntu initial setup wizard, which walks you through
language selection, keyboard layout, timezone, and user account creation. Create
a user account with a strong password. For the username, something simple and
memorable works well - we will use "aiuser" in this tutorial, but you can
choose anything you like.
After completing the setup wizard, you will land on the Ubuntu desktop. Take a
moment to appreciate what you are looking at: a full desktop Linux environment
running on hardware that can execute a trillion AI operations per second.
Chapter 4 - NON-HEADLESS SETUP: WORKING WITH A MONITOR AND KEYBOARD
4.1 Why You Might Want a Non-Headless Setup
A non-headless setup means you are working directly at the machine with a
monitor, keyboard, and mouse attached. This is the most intuitive way to get
started, especially if you are new to Linux or to AI workstations. It gives
you a full graphical desktop environment where you can open a terminal, a web
browser, and graphical applications like LM Studio all in the same workspace.
The trade-off is that you need to be physically present at the machine to use
it. For many research and development workflows, this is perfectly acceptable.
You sit at your desk, you work on your DGX Spark, and you go home when you
are done. Simple and effective.
4.2 Updating the System
The very first thing you should do after the initial setup wizard completes is
update all installed software. NVIDIA ships the DGX Spark with a known-good
software configuration, but security patches and bug fixes accumulate quickly.
Open a terminal (press Ctrl+Alt+T or find the Terminal application in the
application menu) and run the following commands.
The first command refreshes the list of available packages from all configured
software repositories, so Ubuntu knows what updates are available:
sudo apt update
The second command downloads and installs all available updates. The -y flag
answers "yes" automatically to any confirmation prompts:
sudo apt upgrade -y
This process may take several minutes depending on how many updates are
available. After it completes, reboot the system to ensure all updates,
especially kernel updates, take effect:
sudo reboot
4.3 Verifying the GPU Is Recognized
After rebooting, open a terminal and run NVIDIA's System Management Interface
tool to confirm the GPU is properly recognized and the drivers are working:
nvidia-smi
You should see output similar to this (the exact numbers will reflect the
GB10 Blackwell GPU):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xx.xx Driver Version: 570.xx.xx CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA GB10 ... On | 00000000:01:00.0 Off | N/A |
| N/A 45C P0 25W / 170W | 2048MiB / 131072MiB| 0% Default |
+-----------------------------------------------------------------------------+
The key things to verify are that the GPU name is shown (GB10 or similar),
that memory shows approximately 131072 MiB (128 GB), and that the driver
version and CUDA version are displayed correctly. If you see "No devices were
found" or similar errors, something is wrong with the driver installation,
which is unusual on a DGX Spark but can happen after a kernel update.
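If you would rather script this check than eyeball the table, nvidia-smi also has a machine-readable query mode. Here is a minimal sketch; the sample line is hypothetical, so run the query function on the machine itself for real values:

```python
import subprocess

def query_gpu_csv() -> str:
    """Ask nvidia-smi for a machine-readable one-line-per-GPU report."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader,nounits"],
        text=True,
    )

def parse_gpu_line(line: str) -> dict:
    """Split one CSV line from the query above into labelled fields."""
    name, mem_mib, driver = [field.strip() for field in line.split(",")]
    return {"name": name, "memory_mib": int(mem_mib), "driver": driver}

# Hypothetical sample of the kind of line a DGX Spark would report:
sample = "NVIDIA GB10, 131072, 570.00.00"
info = parse_gpu_line(sample)
print(info["name"], info["memory_mib"] // 1024, "GB")
```

A script like this makes a good health check to run after every kernel update, when driver mismatches are most likely to appear.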
4.4 Installing Essential Tools
Before diving into inference engines, install a set of tools that will make
your life easier throughout this tutorial. The following command installs
several utilities in one go:
sudo apt install -y \
    git \
    curl \
    wget \
    htop \
    nvtop \
    net-tools \
    iperf3 \
    python3-pip \
    python3-venv \
    build-essential \
    openssh-server
Let us understand what each of these tools does and why we are installing it.
The git tool is the industry-standard version control system. You will use it
to clone repositories for inference frameworks and to manage your own code.
The curl and wget tools are command-line utilities for downloading files from
the internet. Many installation scripts use curl, and wget is useful for
downloading large files like model weights.
The htop tool is an interactive process viewer that shows CPU usage, memory
usage, and running processes in a colorful, easy-to-read format. It is far
more useful than the basic top command.
The nvtop tool is the GPU equivalent of htop. It shows real-time GPU
utilization, memory usage, and temperature. You will use this constantly to
monitor your inference workloads.
The net-tools package provides classic networking commands like ifconfig and
netstat, which are useful for diagnosing network issues.
The iperf3 tool is a network performance testing utility. You will use it to
verify that the 100GbE connection between your two nodes is working at full
speed.
The python3-pip and python3-venv tools are the Python package manager and
virtual environment manager, respectively. Several of the inference engines
we will install - vLLM, SGLang, and TensorRT-LLM - are Python-based, and we
will write Python client code for all five.
The build-essential package installs the GCC compiler, make, and other tools
needed to compile software from source code. Some inference frameworks require
compilation steps.
The openssh-server package installs the SSH server daemon, which allows you
to connect to this machine remotely from another computer. Even in a
non-headless setup, having SSH available is valuable for scripting and remote
management.
4.5 Configuring SSH for Remote Access
Even if you are using a non-headless setup with a monitor attached, enabling
SSH is a good practice. It allows you to control the machine from your laptop,
copy files to and from it, and run commands without having to physically sit
at the machine.
After installing openssh-server, start the SSH service and configure it to
start automatically on boot:
sudo systemctl enable ssh
sudo systemctl start ssh
Now find the IP address of the machine on your regular network (not the
ConnectX-7 link, which we will configure later):
ip addr show
Look for an entry that shows your regular Ethernet interface (typically named
something like eth0, eno1, or enp3s0) with an IP address in your local network
range (typically 192.168.x.x or 10.x.x.x). Note this IP address - you will
use it to SSH into the machine from your laptop.
From your laptop, you can now connect with:
ssh aiuser@192.168.1.100
Replace 192.168.1.100 with the actual IP address of your Node A. You will be
prompted for the password you set during initial setup.
Chapter 5 - HEADLESS SETUP: SSH, REMOTE ACCESS, AND AUTOMATION
5.1 What Does "Headless" Mean and Why Would You Want It?
A headless setup means the machine runs without a monitor, keyboard, or mouse
attached. You interact with it entirely over the network via SSH. This is the
standard way to operate servers and AI workstations in professional
environments for several good reasons.
First, it saves money. Monitors, keyboards, and mice cost money, and if you
have two DGX Spark units, you do not need two sets of peripherals. One laptop
can manage both machines over SSH.
Second, it is more efficient. Once you are comfortable with the command line,
SSH is faster than working at a physical terminal. You can have multiple SSH
sessions open simultaneously, copy and paste between them, and script complex
operations.
Third, it enables automation. When your machines are managed entirely over the
network, you can write scripts that configure them, start inference servers,
monitor their health, and restart services automatically. This is essential
for production AI deployments.
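As a small taste of that automation, here is a minimal Python sketch that runs the same command on both nodes over SSH. It assumes the key-based SSH setup described later in this chapter and the management IPs used throughout this tutorial:

```python
import subprocess

# Management IPs from this tutorial; adjust to your own network.
NODES = {"node-a": "192.168.1.100", "node-b": "192.168.1.101"}

def ssh_command(host: str, remote_cmd: str, user: str = "aiuser") -> list[str]:
    """Build the argv for running remote_cmd on one node over SSH."""
    return ["ssh", f"{user}@{host}", remote_cmd]

def run_on_all(remote_cmd: str) -> None:
    """Run the same command on every node, one after another."""
    for name, ip in NODES.items():
        print(f"--- {name} ---")
        subprocess.run(ssh_command(ip, remote_cmd), check=True)

# Example (needs key-based SSH, see section 5.3):
# run_on_all("uptime")
```

The same pattern extends naturally to starting inference servers on both machines or collecting nvidia-smi output from each.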
5.2 Completing the Initial Setup Without a Monitor (Node B)
For Node B, you have two options for the initial setup. The first option is to
temporarily connect a monitor and keyboard, complete the Ubuntu setup wizard,
enable SSH, and then disconnect the peripherals. This is the simplest approach.
The second option is to use a technique called "blind configuration." If you
know the machine's IP address (which you can find from your router's DHCP
client list), you can SSH into it immediately after first boot, because Ubuntu
24.04 enables SSH by default in some configurations. However, this is not
guaranteed, so the first option is more reliable.
We will assume you have completed the initial setup wizard on both machines
with a monitor attached, enabled SSH on both, and noted their IP addresses.
From this point forward, all configuration will be done over SSH.
5.3 Setting Up SSH Key Authentication
Typing a password every time you SSH into a machine becomes tedious quickly.
SSH key authentication is more secure and more convenient. It works by
generating a pair of cryptographic keys: a private key that stays on your
laptop and a public key that you copy to the remote machine. When you connect,
the machines perform a cryptographic handshake that proves your identity
without requiring a password.
On your laptop (not on the DGX Spark), generate an SSH key pair if you do not
already have one:
ssh-keygen -t ed25519 -C "dgx-spark-access"
The -t ed25519 flag specifies the Ed25519 algorithm, which is modern, fast,
and secure. The -C flag adds a comment to help you identify the key later.
When prompted for a file location, press Enter to accept the default
(~/.ssh/id_ed25519). When prompted for a passphrase, you can either set one
(more secure) or press Enter for no passphrase (more convenient).
Now copy the public key to both DGX Spark nodes. The ssh-copy-id command
handles this automatically:
ssh-copy-id aiuser@192.168.1.100 # Node A
ssh-copy-id aiuser@192.168.1.101 # Node B
After this, you can SSH into either machine without a password:
ssh aiuser@192.168.1.100
5.4 Setting Up SSH Config for Convenience
Instead of typing IP addresses every time, create an SSH config file on your
laptop that gives friendly names to your machines. Open or create the file
~/.ssh/config on your laptop and add the following:
Host node-a
    HostName 192.168.1.100
    User aiuser
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60
    ServerAliveCountMax 3

Host node-b
    HostName 192.168.1.101
    User aiuser
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60
    ServerAliveCountMax 3
The ServerAliveInterval and ServerAliveCountMax settings tell your SSH client
to send keepalive packets every 60 seconds and to give up after 3 missed
responses. This prevents SSH sessions from dropping when you are running long
jobs and not typing anything.
Now you can connect with simply:
ssh node-a
ssh node-b
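If you end up managing more than two machines, you can generate these Host blocks instead of typing them. A small Python sketch, using the aliases, IPs, and username from this tutorial:

```python
def ssh_config_entry(alias: str, ip: str, user: str = "aiuser",
                     key: str = "~/.ssh/id_ed25519") -> str:
    """Render one Host block in ~/.ssh/config syntax."""
    return (
        f"Host {alias}\n"
        f"    HostName {ip}\n"
        f"    User {user}\n"
        f"    IdentityFile {key}\n"
        f"    ServerAliveInterval 60\n"
        f"    ServerAliveCountMax 3\n"
    )

print(ssh_config_entry("node-a", "192.168.1.100"))
print(ssh_config_entry("node-b", "192.168.1.101"))
```

Redirect the output into ~/.ssh/config (or paste it in) and the `ssh node-a` shorthand works immediately.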
5.5 Configuring Passwordless SSH Between the Two Nodes
For distributed inference frameworks like vLLM and TensorRT-LLM, the two
nodes need to be able to SSH into each other without passwords. This is
because the head node (Node A) will launch processes on the worker node
(Node B) automatically.
On Node A, generate an SSH key pair:
ssh-keygen -t ed25519 -C "node-a-to-node-b" -f ~/.ssh/id_ed25519_cluster
Then copy Node A's public key to Node B. First, display the public key:
cat ~/.ssh/id_ed25519_cluster.pub
Copy the output, then SSH into Node B and add it to the authorized_keys file:
# On Node B:
echo "PASTE_THE_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Now do the reverse: on Node B, generate a key and copy it to Node A's
authorized_keys. After this, both nodes can SSH into each other without
passwords, which is required for MPI-based distributed inference.
5.6 Disabling the Graphical Desktop on Headless Nodes
Running a full graphical desktop environment on a headless machine wastes
memory and CPU cycles. Ubuntu 24.04 uses the GNOME desktop by default, which
can consume 1-2 GB of RAM even when idle. For a headless AI workstation, you
want all available resources dedicated to inference.
To switch to a text-only boot target (which still allows you to start a
graphical session manually if needed), run:
sudo systemctl set-default multi-user.target
This tells systemd (the Linux init system) to boot into a multi-user text
mode by default instead of the graphical desktop. The change takes effect on
the next reboot. To revert to graphical mode if needed:
sudo systemctl set-default graphical.target
After setting multi-user mode, reboot:
sudo reboot
When the machine comes back up, it will present a text login prompt instead
of a graphical desktop. SSH into it from your laptop as usual - the SSH
server starts in both modes.
Chapter 6 - NETWORKING THE TWO NODES: IP ADDRESSES, ROCE, AND JUMBO FRAMES
6.1 Understanding the Two Network Interfaces
Each DGX Spark has at least two network interfaces that we care about:
The management interface is the standard Ethernet port (1GbE or 10GbE) that
connects to your regular office or home network. This is used for internet
access, SSH from your laptop, and downloading models. We will call the IP
addresses on this interface the "management IPs" - for example, 192.168.1.100
for Node A and 192.168.1.101 for Node B.
The high-speed interface is the ConnectX-7 port that connects the two DGX
Spark units directly to each other via the DAC cable. This is used exclusively
for high-speed AI traffic between the nodes. We will assign IP addresses in a
separate subnet to this interface - for example, 10.0.0.1 for Node A and
10.0.0.2 for Node B.
Keeping these two networks separate is important. It ensures that AI traffic
does not compete with management traffic, and it makes routing simpler because
each network has a clear purpose.
6.2 Identifying the ConnectX-7 Interface Name
Linux assigns names to network interfaces automatically. The ConnectX-7
interface will have a name like enp1s0f0np0 or similar, depending on which
PCIe slot it occupies. To find the correct interface name, run:
ip link show
You will see a list of all network interfaces. The ConnectX-7 interface will
typically show a link speed of 100000 Mb/s when the DAC cable is connected.
You can also use:
ethtool <interface_name> | grep Speed
to check the speed of a specific interface. Alternatively, the mlxlink tool
(part of the Mellanox/NVIDIA networking tools) provides detailed ConnectX-7
status:
sudo mlxlink -d /dev/mst/mt4129_pciconf0 --show_module
The interface name for the ConnectX-7 on Node A might be enp1s0f0np0. We
will use this name in examples below, but substitute the actual name you find
on your system.
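If you prefer a scripted check, the kernel exposes each interface's negotiated speed in sysfs, and a few lines of Python can scan for the 100G link. This is a sketch that assumes the standard /sys/class/net layout:

```python
from pathlib import Path

def interfaces_at_speed(target_mbps: int, sysfs: str = "/sys/class/net") -> list[str]:
    """Return interfaces whose negotiated link speed (in Mb/s, as exposed in
    /sys/class/net/<iface>/speed) matches target_mbps. Interfaces without a
    link report -1 or fail the read; those are skipped."""
    matches = []
    for iface in sorted(Path(sysfs).iterdir()):
        try:
            speed = int((iface / "speed").read_text().strip())
        except (OSError, ValueError):
            continue
        if speed == target_mbps:
            matches.append(iface.name)
    return matches

# On a DGX Spark with the DAC cable connected, this should list the
# ConnectX-7 interface (e.g. enp1s0f0np0):
if Path("/sys/class/net").exists():
    print(interfaces_at_speed(100_000))
```

An empty result usually means the cable is unplugged or unseated, since the speed file only reports a value once the link has negotiated.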
6.3 Configuring Static IP Addresses on the ConnectX-7 Interface
Ubuntu 24.04 uses Netplan for network configuration. Netplan is a declarative
network configuration system that reads YAML files and generates configuration
for the underlying network daemon (NetworkManager or systemd-networkd).
On Node A, create a new Netplan configuration file for the ConnectX-7
interface. The file must be in /etc/netplan/ and have a .yaml extension. We
will call it 10-connectx7.yaml (the number prefix determines the order in
which files are processed):
sudo nano /etc/netplan/10-connectx7.yaml
Enter the following configuration. Be very careful with indentation - YAML
is whitespace-sensitive, and incorrect indentation will cause errors:
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: false
      addresses:
        - 10.0.0.1/24
      mtu: 9000
The dhcp4: false line tells Netplan not to request an IP address from a DHCP
server on this interface - we are assigning a static address manually. The
addresses section assigns the IP address 10.0.0.1 with a /24 subnet mask
(which means addresses 10.0.0.1 through 10.0.0.254 are on the same network).
The mtu: 9000 line sets the Maximum Transmission Unit to 9000 bytes, which
are called "jumbo frames."
On Node B, create the same file with the IP address changed to 10.0.0.2:
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: false
      addresses:
        - 10.0.0.2/24
      mtu: 9000
Apply the configuration on both nodes:
sudo netplan apply
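Since the two files differ only in the address, you can also generate them programmatically. A small Python sketch that renders the same configuration (and sidesteps hand-indentation mistakes in the whitespace-sensitive YAML):

```python
def netplan_yaml(iface: str, address: str, mtu: int = 9000) -> str:
    """Render the Netplan config for the ConnectX-7 link as a YAML string."""
    return (
        "network:\n"
        "  version: 2\n"
        "  ethernets:\n"
        f"    {iface}:\n"
        "      dhcp4: false\n"
        "      addresses:\n"
        f"        - {address}\n"
        f"      mtu: {mtu}\n"
    )

# Node A gets 10.0.0.1/24, Node B gets 10.0.0.2/24:
print(netplan_yaml("enp1s0f0np0", "10.0.0.1/24"))
```

Write the output to /etc/netplan/10-connectx7.yaml on each node (with the appropriate address), then run sudo netplan apply as above.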
6.4 Why Jumbo Frames Matter
The standard Ethernet MTU (Maximum Transmission Unit) is 1500 bytes. This
means each network packet can carry at most 1500 bytes of data. When
transferring large amounts of data (like the activation tensors exchanged
between nodes during distributed inference), using 1500-byte packets means
a lot of overhead: each packet has headers, checksums, and other metadata
that do not carry useful data. With 1500-byte packets, this overhead is
relatively large compared to the payload.
Jumbo frames increase the MTU to 9000 bytes, which means each packet carries
six times more data for the same per-packet overhead. For high-bandwidth,
low-latency applications like distributed AI inference, this noticeably
reduces CPU overhead (roughly six times fewer interrupts and header-processing
cycles) and can improve sustained throughput.
The key requirement for jumbo frames is that both endpoints must be configured
with the same MTU. Since we are connecting the two DGX Spark units directly
(without a switch in between), we only need to configure the MTU on the two
machines themselves. If you were using a switch, the switch ports would also
need to be configured for jumbo frames.
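The arithmetic behind this is easy to check. The sketch below assumes IPv4 + TCP framing; RoCEv2 traffic carries slightly different headers, but the proportions are similar:

```python
import math

def payload_efficiency(mtu: int) -> float:
    """Fraction of each on-wire frame that is useful payload, assuming
    IPv4 + TCP headers (40 bytes) inside the MTU and an Ethernet header
    plus FCS (18 bytes) around it; preamble and inter-frame gap ignored."""
    return (mtu - 40) / (mtu + 18)

for mtu in (1500, 9000):
    packets_per_mb = math.ceil(1_000_000 / (mtu - 40))
    print(f"MTU {mtu}: {payload_efficiency(mtu):.2%} payload, "
          f"{packets_per_mb} packets per MB")
```

The per-byte framing gain is modest; the larger win is sending roughly six times fewer packets, which means six times fewer interrupts and far less header processing on the CPU.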
6.5 Verifying the Connection
After applying the Netplan configuration, verify that the two nodes can
communicate over the ConnectX-7 link. From Node A, ping Node B:
ping -c 4 10.0.0.2
You should see responses with very low latency, typically under 0.1 milliseconds
for a direct connection. If the ping fails, check that the cable is properly
seated, that both nodes have the correct IP addresses, and that the interface
is up (ip link show should show "UP" for the ConnectX-7 interface).
Now test the actual bandwidth using iperf3. On Node B, start the iperf3 server:
iperf3 -s
On Node A, run the iperf3 client pointing at Node B's ConnectX-7 IP:
iperf3 -c 10.0.0.2 -t 30 -P 4
The -t 30 flag runs the test for 30 seconds, and -P 4 uses 4 parallel streams.
You should see throughput close to 100 Gbits/sec. If you see significantly
less (say, under 80 Gbits/sec), check that jumbo frames are configured
correctly on both ends and that the cable is rated for 100G.
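iperf3 can also emit its results as JSON with the -J flag, which makes scripted verification easy. A minimal sketch; the sample string is a trimmed, hypothetical fragment of the real report structure:

```python
import json

def received_gbps(iperf_json: str) -> float:
    """Extract receiver-side throughput (Gbit/s) from `iperf3 -c ... -J` output."""
    report = json.loads(iperf_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9

# Trimmed, hypothetical fragment of the structure iperf3 emits with -J:
sample = '{"end": {"sum_received": {"bits_per_second": 97.3e9}}}'
print(f"{received_gbps(sample):.1f} Gbit/s")
```

A small wrapper that runs the test and asserts the result is above, say, 90 Gbit/s makes a handy regression check after any networking change.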
6.6 Configuring RoCE for RDMA
To enable RDMA over the ConnectX-7 interface, we need to configure RoCEv2.
First, install the RDMA user-space libraries:
sudo apt install -y rdma-core ibverbs-utils
Verify that the RDMA device is recognized:
ibv_devices
You should see the ConnectX-7 listed as an RDMA device. Now verify the device
attributes:
ibv_devinfo
This shows the RDMA capabilities of the device, including supported transport
types and maximum message sizes.
To configure RoCEv2 (as opposed to RoCEv1), we need to set the GID (Global
Identifier) index. RoCEv2 uses UDP/IP encapsulation, which is routable and
works with standard Ethernet infrastructure. The configuration is done through
the sysfs filesystem:
# Check available GID entries for the interface
# (substitute your actual interface and port number)
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/1
For NCCL (which vLLM and other frameworks use for multi-node communication),
set the following environment variables to tell NCCL to use RoCEv2:
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=5
export NCCL_SOCKET_IFNAME=enp1s0f0np0
Add these to your ~/.bashrc file on both nodes so they are set automatically
in every new shell session:
echo 'export NCCL_IB_GID_INDEX=3' >> ~/.bashrc
echo 'export NCCL_IB_DISABLE=0' >> ~/.bashrc
echo 'export NCCL_NET_GDR_LEVEL=5' >> ~/.bashrc
echo 'export NCCL_SOCKET_IFNAME=enp1s0f0np0' >> ~/.bashrc
source ~/.bashrc
The NCCL_NET_GDR_LEVEL=5 setting enables GPU Direct RDMA, which allows data
to be transferred directly between the GPU memory on one node and the GPU
memory on another node, bypassing the CPU and system memory entirely. This
is the highest level of RDMA optimization and provides the best performance
for distributed inference.
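When you later launch inference processes from Python rather than an interactive shell, the same settings can be applied programmatically instead of relying on ~/.bashrc. A sketch (the script name in the example is hypothetical):

```python
import os
import subprocess

# Values from this section; the GID index and interface name are
# machine-specific, so verify them on your own nodes first.
NCCL_ENV = {
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_DISABLE": "0",
    "NCCL_NET_GDR_LEVEL": "5",
    "NCCL_SOCKET_IFNAME": "enp1s0f0np0",
}

def launch_with_nccl(cmd: list[str]) -> subprocess.Popen:
    """Start a process with the RoCE-tuned NCCL settings layered on top of
    the current environment."""
    return subprocess.Popen(cmd, env={**os.environ, **NCCL_ENV})

# Example (hypothetical script name):
# proc = launch_with_nccl(["python3", "serve_model.py"])
```

Setting the variables per process keeps the tuning explicit and avoids surprises when a shell is started without your ~/.bashrc.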
Chapter 7 - INFERENCE ENGINE 1: OLLAMA - THE FRIENDLY GIANT
7.1 What Is Ollama and Why Should You Care?
Ollama is an open-source tool that makes running large language models locally
as simple as running a Docker container. It handles model downloading, format
conversion, quantization, and serving through a clean REST API - all with a
single command. If you have ever wanted to run a model like Llama 3, Mistral,
or Qwen on your own hardware without wrestling with Python dependencies and
model format conversions, Ollama is the answer.
Ollama works by wrapping llama.cpp, a highly optimized C++ inference library,
with a user-friendly interface. It maintains a library of pre-quantized models
that you can download with a single command, and it automatically detects and
uses your GPU for acceleration.
For the DGX Spark, Ollama is an excellent starting point. It requires minimal
configuration, works out of the box with the NVIDIA GPU, and provides a REST
API that is easy to call from Python, JavaScript, or any other language. The
trade-off is that Ollama does not support multi-node inference natively - each
DGX Spark runs its own independent Ollama instance. However, you can use both
instances together in clever ways, which we will explore.
7.2 Installing Ollama on Both Nodes
The installation is refreshingly simple. On each node (Node A and Node B),
run the official installation script:
curl -fsSL https://ollama.com/install.sh | sh
This script detects your operating system and architecture (ARM64 in the case
of the DGX Spark's Grace CPU), downloads the appropriate binary, installs it
to /usr/local/bin/ollama, and creates a systemd service that starts Ollama
automatically on boot.
After installation, verify that Ollama is running:
systemctl status ollama
You should see "active (running)" in the output. If it is not running, start
it manually:
sudo systemctl start ollama
7.3 Configuring Ollama to Listen on the Network
By default, Ollama only listens on localhost (127.0.0.1), which means it can
only be accessed from the same machine. To allow Node B to send requests to
Node A's Ollama instance (and vice versa), we need to configure Ollama to
listen on all network interfaces.
Edit the Ollama systemd service file to add the necessary environment variable:
sudo systemctl edit ollama
This opens a text editor with an override file. Add the following content:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Save and close the file. Then reload the systemd configuration and restart
Ollama:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Now Ollama listens on all interfaces, including the ConnectX-7 interface at
10.0.0.1 (on Node A) and 10.0.0.2 (on Node B). This means you can send
inference requests from Node A to Node B's Ollama instance and vice versa.
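Before wiring the two instances together, it is worth confirming that each
node can actually reach the other's Ollama port. A small health-check sketch
using the requests library; it relies on the fact that the root path of a
running Ollama server answers with HTTP 200:

```python
import requests


def ollama_reachable(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on host:port.

    A GET on the root path of an Ollama server returns a short
    "Ollama is running" message, which makes it a cheap health check.
    """
    try:
        response = requests.get(f"http://{host}:{port}/", timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        # Connection refused, timeout, DNS failure, etc.
        return False


if __name__ == "__main__":
    for host in ("localhost", "10.0.0.2"):
        status = "up" if ollama_reachable(host) else "unreachable"
        print(f"{host}:11434 is {status}")
```

If the remote node shows as unreachable, re-check the OLLAMA_HOST override
and any firewall rules on the ConnectX-7 interface.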
7.4 Pulling and Running Your First Model
Let us download and run a model. We will start with Llama 3.2 3B, a capable
but compact model that downloads quickly and runs fast:
ollama pull llama3.2:3b
Ollama downloads the model in GGUF format (llama.cpp's model file format,
which typically stores quantized weights) and saves it under ~/.ollama/models.
The download is a few gigabytes. Once it completes, run the model interactively:
ollama run llama3.2:3b
You will see a prompt where you can type messages and receive responses. This
is the simplest possible way to interact with an LLM on your DGX Spark. Type
/bye to exit the interactive session.
For a more serious model that takes advantage of the DGX Spark's 128GB memory,
try Llama 3.1 70B:
ollama pull llama3.1:70b
This model is approximately 40GB in its quantized form and can run on a single
DGX Spark with memory to spare. Token generation for a model this size is
bound mainly by memory bandwidth rather than raw compute, so expect fewer
tokens per second than with the 3B model, but still a comfortably interactive
experience.
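The ~40GB figure is easy to sanity-check with back-of-envelope arithmetic.
Assuming roughly 4.5 bits per weight, which is typical for a 4-bit K-quant
GGUF (the exact average depends on the quantization scheme):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized model, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Llama 3.1 70B at ~4.5 bits per weight:
print(round(quantized_size_gb(70e9, 4.5), 1))  # prints 39.4 (GB)

# The 3B model at the same quantization is under 2GB:
print(round(quantized_size_gb(3e9, 4.5), 2))
```

The same formula helps you judge at a glance whether any model on the Ollama
library page will fit in the 128GB unified memory pool.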
7.5 Using the Ollama REST API
The real power of Ollama comes from its REST API, which allows you to integrate
LLM inference into your own applications. The API is available at
http://localhost:11434 by default.
The following Python script demonstrates how to send a request to the Ollama
API and stream the response. Streaming means you receive tokens as they are
generated rather than waiting for the entire response to complete, which makes
the interaction feel much more responsive:
import requests
import json


def query_ollama(
    prompt: str,
    model: str = "llama3.2:3b",
    host: str = "localhost",
    port: int = 11434,
    stream: bool = True
) -> str:
    """
    Send a prompt to an Ollama instance and return the generated text.

    Args:
        prompt: The text prompt to send to the model.
        model: The Ollama model name to use for inference.
        host: The hostname or IP address of the Ollama server.
        port: The port number on which Ollama is listening.
        stream: If True, stream the response token by token.

    Returns:
        The complete generated text as a string.
    """
    url = f"http://{host}:{port}/api/generate"

    # Build the request payload according to the Ollama API specification.
    # The 'stream' field controls whether the server sends back partial
    # responses as they are generated or waits for the full completion.
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": stream
    }

    full_response = ""

    # Use a streaming HTTP request so we can process each chunk as it arrives.
    # This is important for user-facing applications where responsiveness matters.
    with requests.post(url, json=payload, stream=stream) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                # Each line from the Ollama streaming API is a JSON object
                # containing a 'response' field with the next token(s).
                chunk = json.loads(line)
                token = chunk.get("response", "")
                full_response += token

                # Print each token immediately so the user sees output
                # appearing in real time, just like ChatGPT's interface.
                print(token, end="", flush=True)

                # The 'done' field signals that generation is complete.
                if chunk.get("done", False):
                    print()  # Add a newline after the response is complete.
                    break

    return full_response


if __name__ == "__main__":
    # Query the local Ollama instance on Node A.
    print("=== Querying local Ollama (Node A) ===")
    response_a = query_ollama(
        prompt="Explain quantum entanglement in simple terms.",
        model="llama3.2:3b",
        host="localhost"
    )

    # Query the remote Ollama instance on Node B via the ConnectX-7 link.
    # Notice that we use the high-speed 10.0.0.2 address, not the
    # management network address. This routes the traffic over the
    # 100GbE direct connection for minimum latency.
    print("\n=== Querying remote Ollama (Node B) ===")
    response_b = query_ollama(
        prompt="What are the applications of quantum computing?",
        model="llama3.2:3b",
        host="10.0.0.2"
    )
This script is straightforward but illustrates a powerful concept: with two
DGX Spark units running Ollama, you can distribute inference requests across
both machines. Node A handles some requests while Node B handles others,
effectively doubling your throughput for workloads that involve many concurrent
users or many independent queries.
7.6 A Load Balancer for Two Ollama Instances
To automatically distribute requests between the two Ollama instances, you
can write a simple round-robin load balancer. This is useful when you want to
serve many users and want to spread the load evenly:
import requests
import itertools
from typing import Iterator


class OllamaLoadBalancer:
    """
    A simple round-robin load balancer for multiple Ollama instances.

    This class cycles through a list of Ollama server addresses and
    sends each request to the next server in the rotation. This ensures
    that no single server is overwhelmed while others sit idle.
    """

    def __init__(self, servers: list[dict]) -> None:
        """
        Initialize the load balancer with a list of server configurations.

        Args:
            servers: A list of dicts, each with 'host' and 'port' keys.
                Example: [{"host": "localhost", "port": 11434},
                          {"host": "10.0.0.2", "port": 11434}]
        """
        self.servers = servers
        # itertools.cycle creates an infinite iterator that cycles through
        # the list: server0, server1, server0, server1, ...
        self._server_cycle: Iterator[dict] = itertools.cycle(servers)

    def _get_next_server(self) -> dict:
        """Return the next server in the rotation."""
        return next(self._server_cycle)

    def generate(
        self,
        prompt: str,
        model: str = "llama3.2:3b"
    ) -> str:
        """
        Send a generation request to the next available server.

        Args:
            prompt: The text prompt for the model.
            model: The model name to use.

        Returns:
            The generated text response.
        """
        server = self._get_next_server()
        url = f"http://{server['host']}:{server['port']}/api/generate"
        print(f"Routing request to {server['host']}:{server['port']}")

        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False  # Non-streaming for simplicity in this example.
        }
        response = requests.post(url, json=payload)
        response.raise_for_status()
        return response.json().get("response", "")


if __name__ == "__main__":
    # Configure the load balancer with both DGX Spark nodes.
    # Node A is accessed via localhost (we are running this script on Node A).
    # Node B is accessed via the high-speed ConnectX-7 interface.
    balancer = OllamaLoadBalancer(
        servers=[
            {"host": "localhost", "port": 11434},
            {"host": "10.0.0.2", "port": 11434}
        ]
    )

    # Simulate five incoming requests. They will alternate between
    # Node A and Node B automatically.
    prompts = [
        "What is machine learning?",
        "Explain neural networks.",
        "What is backpropagation?",
        "Describe transformer architecture.",
        "What is attention mechanism?"
    ]

    for i, prompt in enumerate(prompts):
        print(f"\n--- Request {i + 1} ---")
        print(f"Prompt: {prompt}")
        response = balancer.generate(prompt=prompt, model="llama3.2:3b")
        print(f"Response: {response[:200]}...")  # Print first 200 characters.
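Note that the loop above issues requests one at a time, so one node sits idle
while the other generates. To realize the throughput benefit of two nodes,
dispatch requests concurrently. A minimal sketch using a thread pool; fan_out
is a hypothetical helper, not part of the load balancer class above:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def fan_out(fn: Callable[[T], R], items: list[T], max_workers: int = 2) -> list[R]:
    """Run fn over items concurrently, preserving input order in the results.

    pool.map submits all items at once and yields results in the same
    order as the inputs, even if they complete out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))


# With the OllamaLoadBalancer above you would call, for example:
#   responses = fan_out(balancer.generate, prompts, max_workers=2)
# so that Node A and Node B each work on a request at the same time.
```

Two workers match the two nodes here; with more Ollama instances, raise
max_workers to match the number of servers in the rotation.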
Chapter 8 - INFERENCE ENGINE 2: LM STUDIO - THE GUI POWERHOUSE
8.1 What Is LM Studio?
LM Studio is a desktop application that provides a polished graphical user
interface for downloading, managing, and running large language models locally.
If you have ever wished for a ChatGPT-like interface that runs entirely on your
own hardware, LM Studio is exactly that. It includes a model browser that lets
you search and download models from Hugging Face, a chat interface for
interactive conversations, and a local server that exposes an OpenAI-compatible
API.
LM Studio is particularly valuable for non-headless setups where you have a
monitor connected. It is the most beginner-friendly of the five inference
engines we cover, requiring no command-line interaction for basic use. However,
it also provides enough advanced features to satisfy experienced practitioners.
8.2 Installing LM Studio on the DGX Spark
LM Studio supports Linux ARM64, which is the architecture of the DGX Spark's
Grace CPU. Download the ARM64 AppImage from the LM Studio website:
wget https://releases.lmstudio.ai/linux/arm64/latest/LM_Studio-latest-arm64.AppImage
Make the downloaded file executable:
chmod +x LM_Studio-latest-arm64.AppImage
For a non-headless setup, simply double-click the AppImage in the file manager,
or run it from the terminal:
./LM_Studio-latest-arm64.AppImage
LM Studio will launch with a graphical interface. On first launch, it may ask
you to accept a license agreement and choose a directory for storing models.
The default location (~/.cache/lm-studio/models) is fine, but given the DGX
Spark's 4TB NVMe SSD, you have plenty of space to store many large models.
For a headless setup, LM Studio can be run in server mode without a graphical
interface. This is done using the lms command-line tool that LM Studio installs:
# First, bootstrap the lms CLI tool (run this after launching LM Studio at
# least once so that the binary has been unpacked):
~/.lmstudio/bin/lms bootstrap
# Start the LM Studio server in headless mode:
~/.lmstudio/bin/lms server start --port 1234
8.3 Using LM Studio's GUI
In the graphical interface, the left sidebar has several icons. The first icon
(a magnifying glass) opens the model search interface, where you can browse
and download models from Hugging Face. Search for "llama" or "mistral" to find
popular models. LM Studio shows the model size and quantization level, helping
you choose a model that fits in your 128GB memory.
The second icon (a chat bubble) opens the chat interface. After loading a model
(click the model name in the top bar to load it), you can type messages and
receive responses in a familiar chat format. This is excellent for interactive
exploration and testing.
The third icon (a server icon) opens the local server settings. Enable the
server and it will listen on port 1234 by default, exposing an OpenAI-compatible
API. This is the same API format used by OpenAI's ChatGPT, which means any
code written for OpenAI's API works with LM Studio with minimal changes.
8.4 Using LM Studio's OpenAI-Compatible API
Once the LM Studio server is running (either in GUI mode or headless mode),
you can interact with it using the OpenAI Python library. This is one of the
most important aspects of LM Studio: because it speaks the OpenAI API protocol,
you can swap between LM Studio and actual OpenAI models with a single line
change in your code.
Install the OpenAI Python library if you have not already:
pip install openai
The following script demonstrates how to use LM Studio's API, including how
to switch seamlessly between local and remote models:
from openai import OpenAI


def create_lmstudio_client(
    host: str = "localhost",
    port: int = 1234
) -> OpenAI:
    """
    Create an OpenAI client configured to talk to an LM Studio server.

    LM Studio exposes an OpenAI-compatible API, so we use the standard
    OpenAI Python library but point it at our local LM Studio instance.
    The api_key parameter is required by the library but is not actually
    validated by LM Studio - any non-empty string works.

    Args:
        host: The hostname or IP address of the LM Studio server.
        port: The port number on which LM Studio is listening.

    Returns:
        A configured OpenAI client instance.
    """
    return OpenAI(
        base_url=f"http://{host}:{port}/v1",
        api_key="lm-studio"  # LM Studio ignores this, but it must be set.
    )


def chat_with_model(
    client: OpenAI,
    system_prompt: str,
    user_message: str,
    model: str = "local-model",
    temperature: float = 0.7,
    max_tokens: int = 1024
) -> str:
    """
    Send a chat message to an LM Studio model and return the response.

    This function uses the chat completions API, which is the standard
    way to interact with instruction-tuned models. It supports a system
    prompt (which sets the model's persona and behavior) and a user
    message (the actual question or instruction).

    Args:
        client: The OpenAI client configured for LM Studio.
        system_prompt: Instructions that define the model's behavior.
        user_message: The user's question or instruction.
        model: The model identifier (LM Studio uses the loaded
            model regardless of this value).
        temperature: Controls randomness. 0.0 is deterministic,
            1.0 is highly random. 0.7 is a good default.
        max_tokens: Maximum number of tokens to generate.

    Returns:
        The model's response as a string.
    """
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_message
            }
        ],
        temperature=temperature,
        max_tokens=max_tokens
    )

    # The response is nested inside the completion object.
    # We extract just the text content of the first (and usually only)
    # choice that the model generated.
    return completion.choices[0].message.content


if __name__ == "__main__":
    # Connect to LM Studio running on Node A (local).
    local_client = create_lmstudio_client(host="localhost", port=1234)

    # Connect to LM Studio running on Node B (remote, via ConnectX-7).
    remote_client = create_lmstudio_client(host="10.0.0.2", port=1234)

    system_prompt = (
        "You are a helpful AI assistant specializing in explaining "
        "complex technical concepts in simple, accessible language."
    )
    question = "How does a transformer neural network process text?"

    print("=== Response from Node A (local LM Studio) ===")
    response_local = chat_with_model(
        client=local_client,
        system_prompt=system_prompt,
        user_message=question
    )
    print(response_local)

    print("\n=== Response from Node B (remote LM Studio) ===")
    response_remote = chat_with_model(
        client=remote_client,
        system_prompt=system_prompt,
        user_message=question
    )
    print(response_remote)
Chapter 9 - INFERENCE ENGINE 3: VLLM - THE THROUGHPUT CHAMPION
9.1 What Is vLLM and Why Is It Different?
vLLM (Virtual Large Language Model) is an open-source inference engine
developed by researchers at UC Berkeley. It was created to solve a specific
problem: how do you serve large language models to many users simultaneously
with high throughput and low latency?
The key innovation in vLLM is a technique called PagedAttention. To understand
why PagedAttention matters, we need to briefly understand how LLM inference
works. When a model generates text, it maintains a "key-value cache" (KV cache)
for each token it has processed. This cache stores intermediate computations
that allow the model to attend to previous tokens efficiently. The KV cache
grows with each generated token and can consume enormous amounts of GPU memory.
The problem with naive KV cache management is memory fragmentation. If you are
serving 10 users simultaneously, each with a different conversation length,
the KV caches for those conversations are different sizes. Allocating fixed
blocks of memory for each conversation wastes space when conversations are
short and fails when conversations grow longer than expected.
PagedAttention borrows an idea from operating system virtual memory management:
it divides the KV cache into fixed-size "pages" and allocates them dynamically
as needed, similar to how an OS manages physical memory pages for multiple
processes. This eliminates fragmentation and allows vLLM to serve 2-4x more
concurrent users than naive implementations with the same hardware.
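A toy calculation makes the fragmentation argument concrete. Suppose the naive
allocator reserves the full context length for every conversation, while a
paged allocator (using 16-token pages, vLLM's default block size) wastes at
most one partial page per sequence. The numbers below are illustrative, not
vLLM's actual bookkeeping:

```python
import math


def naive_waste(seq_lens: list[int], max_len: int) -> int:
    """KV slots wasted when every sequence reserves max_len slots up front."""
    return sum(max_len - n for n in seq_lens)


def paged_waste(seq_lens: list[int], page_size: int = 16) -> int:
    """KV slots wasted with paging: only the unused tail of each last page."""
    return sum(math.ceil(n / page_size) * page_size - n for n in seq_lens)


# Ten conversations of very different lengths, max context 8192 tokens:
lens = [120, 450, 2000, 64, 7800, 300, 1024, 88, 4096, 512]
print(naive_waste(lens, 8192))  # tens of thousands of wasted KV slots
print(paged_waste(lens))        # at most page_size - 1 wasted per sequence
```

With paging, wasted memory no longer depends on how long conversations might
grow, only on the page size, which is why vLLM can pack far more concurrent
sequences into the same GPU memory.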
9.2 Installing vLLM
Create a Python virtual environment for vLLM to keep its dependencies isolated
from other tools:
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
Install vLLM. For the DGX Spark's ARM-based Grace CPU with CUDA 12.x, the
standard pip installation should work:
pip install vllm
If the pip installation fails (which can happen on ARM architectures where
pre-built wheels are not available), you may need to build from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
The build from source takes 15-30 minutes as it compiles CUDA kernels. This
is normal and expected. The resulting installation is fully optimized for your
specific GPU architecture.
9.3 Running vLLM as a Single-Node Server
The simplest way to use vLLM is as a single-node OpenAI-compatible API server.
Start the server with a model from Hugging Face (vLLM downloads models
automatically from the Hugging Face Hub):
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192
Let us understand each argument. The --model flag specifies the Hugging Face
model identifier. The --host 0.0.0.0 flag makes the server listen on all
network interfaces, not just localhost. The --port 8000 flag sets the port
number. The --dtype bfloat16 flag tells vLLM to use 16-bit brain floating
point, a numerically robust format that the Blackwell GPU accelerates
natively. The --max-model-len 8192 flag limits the maximum sequence length
(input and output tokens combined) to 8192 tokens, which bounds KV cache
memory usage.
For a model that requires authentication (like Llama 3.1), you need a Hugging
Face account and access token. Set the token as an environment variable:
export HF_TOKEN="your_huggingface_token_here"
9.4 Setting Up Two-Node Distributed Inference with vLLM
This is where things get genuinely exciting. With two DGX Spark units, you
can run a single model that is distributed across both machines. This allows
you to run models that are too large for a single 128GB memory pool, or to
run models faster by parallelizing the computation.
vLLM supports two forms of parallelism for multi-node inference. Tensor
parallelism splits individual matrix operations across multiple GPUs, with
each GPU computing a portion of each operation simultaneously. Pipeline
parallelism splits the model's layers across GPUs, with each GPU processing
a different set of layers in sequence. For two nodes, tensor parallelism
typically gives better performance.
vLLM uses Ray for multi-node coordination. Ray is a distributed computing
framework that handles process management, communication, and fault tolerance.
Install Ray on both nodes:
pip install ray
On Node A (the head node), start the Ray cluster:
ray start --head \
--node-ip-address=10.0.0.1 \
--port=6379 \
--dashboard-host=0.0.0.0
The --node-ip-address flag tells Ray to use the ConnectX-7 interface IP for
cluster communication. This routes all Ray traffic over the high-speed direct
connection rather than the management network.
On Node B (the worker node), join the Ray cluster:
ray start \
--address=10.0.0.1:6379 \
--node-ip-address=10.0.0.2
Now, on Node A, start vLLM with tensor parallelism across both nodes:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-model-len 8192
The --tensor-parallel-size 2 flag tells vLLM to split the model across 2 GPUs
(one on each node). vLLM uses the Ray cluster to coordinate with Node B
automatically. The model weights are split such that each node holds half the
model, and during inference, both nodes compute their portion simultaneously
and exchange results via the ConnectX-7 link.
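Some quick arithmetic shows why this split works. Ignoring activations, the
KV cache, and small replicated tensors such as embeddings, each node's share
of the weights is roughly:

```python
def weights_per_node_gb(n_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate model-weight memory each node holds under tensor parallelism."""
    return n_params * bytes_per_param / tp_size / 1e9


# Llama 3.1 70B in bfloat16 (2 bytes per parameter), split across two nodes:
print(weights_per_node_gb(70e9, 2, 2))  # prints 70.0 (GB per node)
```

At 70GB of weights per node, each 128GB memory pool retains tens of gigabytes
of headroom for the KV cache and activations; a 70B bfloat16 model would not
fit on a single node at all, since its full 140GB of weights exceeds 128GB.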
9.5 Querying the vLLM Server
Once the vLLM server is running, you can query it using the OpenAI Python
library (since vLLM exposes an OpenAI-compatible API) or with direct HTTP
requests. The following script demonstrates both approaches and shows how to
handle the two-node setup:
import asyncio
from typing import AsyncIterator

from openai import AsyncOpenAI


class VLLMClient:
    """
    A client for interacting with a vLLM OpenAI-compatible API server.

    This client is fully asynchronous and demonstrates how to use
    streaming responses for real-time output. The vLLM server is
    accessed via Node A, since it acts as the head node and exposes
    the unified API endpoint.
    """

    def __init__(
        self,
        host: str = "localhost",
        port: int = 8000,
        model: str = "meta-llama/Llama-3.1-70B-Instruct"
    ) -> None:
        """
        Initialize the vLLM client.

        Args:
            host: The hostname or IP of the vLLM server (Node A).
            port: The port number of the vLLM API server.
            model: The model name as registered in the vLLM server.
        """
        self.model = model
        # vLLM does not require authentication by default, but the
        # api_key field must still be set to some non-empty string.
        self.client = AsyncOpenAI(
            base_url=f"http://{host}:{port}/v1",
            api_key="not-needed"
        )

    async def stream_completion(
        self,
        prompt: str,
        max_tokens: int = 512,
        temperature: float = 0.7
    ) -> AsyncIterator[str]:
        """
        Stream a completion from the vLLM server token by token.

        This is an async generator that yields each token as it is
        generated. Using async streaming allows your application to
        remain responsive while waiting for the model to generate text,
        which is especially important for long responses.

        Args:
            prompt: The text prompt to complete.
            max_tokens: Maximum number of tokens to generate.
            temperature: Sampling temperature (0.0 = deterministic).

        Yields:
            Individual tokens as strings.
        """
        stream = await self.client.completions.create(
            model=self.model,
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            stream=True
        )
        async for chunk in stream:
            # Each chunk contains a list of choices. We take the first
            # choice and extract the text delta (the new tokens).
            if chunk.choices and chunk.choices[0].text:
                yield chunk.choices[0].text

    async def chat_completion(
        self,
        messages: list[dict],
        max_tokens: int = 512,
        temperature: float = 0.7
    ) -> str:
        """
        Send a chat completion request and return the full response.

        Args:
            messages: A list of message dicts with 'role' and 'content'.
            max_tokens: Maximum tokens to generate.
            temperature: Sampling temperature.

        Returns:
            The model's response as a string.
        """
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            stream=False
        )
        return response.choices[0].message.content


async def main() -> None:
    """Demonstrate vLLM client usage with both streaming and non-streaming."""
    # Connect to the vLLM server running on Node A.
    # Even though the model is distributed across both nodes,
    # all requests go to Node A's API endpoint. vLLM handles
    # the distribution internally via the Ray cluster.
    client = VLLMClient(
        host="localhost",  # or use the management IP: "192.168.1.100"
        port=8000,
        model="meta-llama/Llama-3.1-70B-Instruct"
    )

    # Demonstrate streaming completion.
    print("=== Streaming Completion ===")
    prompt = "The key advantages of distributed AI inference are:"
    print(f"Prompt: {prompt}")
    print("Response: ", end="")
    async for token in client.stream_completion(
        prompt=prompt,
        max_tokens=256,
        temperature=0.3
    ):
        print(token, end="", flush=True)
    print()

    # Demonstrate chat completion with a system prompt.
    print("\n=== Chat Completion ===")
    messages = [
        {
            "role": "system",
            "content": "You are an expert in distributed computing and AI systems."
        },
        {
            "role": "user",
            "content": "What is tensor parallelism and how does it work?"
        }
    ]
    response = await client.chat_completion(
        messages=messages,
        max_tokens=512,
        temperature=0.5
    )
    print(f"Response: {response}")


if __name__ == "__main__":
    asyncio.run(main())
Chapter 10 - INFERENCE ENGINE 4: SGLANG - THE STRUCTURED GENERATION WIZARD
10.1 What Is SGLang?
SGLang (Structured Generation Language) is an inference framework developed
at UC Berkeley that takes a different approach to LLM serving. While vLLM
focuses on maximizing throughput through efficient memory management, SGLang
focuses on making it easy and efficient to build complex LLM programs that
involve structured outputs, multi-step reasoning, and sophisticated prompting
patterns.
The key innovation in SGLang is RadixAttention, which is an extension of the
KV cache concept. In standard inference, each request has its own KV cache
that is discarded when the request completes. RadixAttention organizes KV
caches in a radix tree (a prefix tree) structure, allowing caches to be shared
between requests that share common prefixes. This is enormously valuable for
applications where many requests share a common system prompt or context, such
as a customer service bot where every conversation starts with the same
instructions.
For example, if you have a 2000-token system prompt that is the same for every
request, standard inference must recompute the KV cache for those 2000 tokens
for every single request. With RadixAttention, the KV cache for those 2000
tokens is computed once and reused for all subsequent requests that share that
prefix. This can reduce latency by 50-80% for such workloads.
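The saving is easy to quantify if we count only the prompt (prefill) tokens
that must actually be processed. The helper below is purely illustrative:

```python
def prefill_tokens(n_requests: int, prefix_len: int, unique_len: int,
                   prefix_cached: bool) -> int:
    """Total prompt tokens that must be prefilled across n_requests.

    With RadixAttention-style prefix caching, the shared prefix is
    computed once; without it, every request recomputes the prefix.
    """
    if prefix_cached:
        return prefix_len + n_requests * unique_len
    return n_requests * (prefix_len + unique_len)


# 100 requests sharing a 2000-token system prompt, 100 unique tokens each:
without = prefill_tokens(100, 2000, 100, prefix_cached=False)  # 210000
with_cache = prefill_tokens(100, 2000, 100, prefix_cached=True)  # 12000
print(f"prefill work reduced by {1 - with_cache / without:.0%}")
```

The larger the shared prefix relative to the unique portion of each request,
the bigger the win, which is why system-prompt-heavy workloads benefit most.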
10.2 Installing SGLang
Create a virtual environment for SGLang:
python3 -m venv ~/sglang-env
source ~/sglang-env/bin/activate
Install SGLang with all optional dependencies:
pip install "sglang[all]"
If you encounter issues with the ARM64 architecture, install from source:
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e ".[all]"
10.3 Starting the SGLang Server
SGLang provides a launch_server script that starts an OpenAI-compatible API
server. On Node A, start the server:
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--dtype bfloat16 \
--mem-fraction-static 0.85
The --mem-fraction-static 0.85 flag tells SGLang to use 85% of GPU memory for
its static allocations, chiefly the model weights and the KV cache pool. The
remaining 15% is left for dynamic allocations during inference. Lowering this
value trades maximum batch size for stability if you run into out-of-memory
errors.
For two-node distributed inference with SGLang, the setup uses torch.distributed
with NCCL as the communication backend. On Node A:
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--tp-size 2 \
--nnodes 2 \
--node-rank 0 \
--dist-init-addr 10.0.0.1:29500 \
--dtype bfloat16
On Node B (run this command simultaneously with the Node A command):
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--tp-size 2 \
--nnodes 2 \
--node-rank 1 \
--dist-init-addr 10.0.0.1:29500 \
--dtype bfloat16
The --tp-size 2 flag sets tensor parallelism to 2 (one GPU per node). The
--nnodes 2 flag specifies the total number of nodes. The --node-rank flag
identifies each node (0 for head, 1 for worker). The --dist-init-addr flag
specifies the address of the head node for distributed initialization, using
the ConnectX-7 IP for low-latency communication.
10.4 Using SGLang's Structured Generation Features
SGLang's most powerful feature is its ability to generate structured outputs
reliably. This is useful when you need the model to produce JSON, follow a
specific format, or make a series of decisions in a structured way. The
following example demonstrates how to use SGLang's Python API to generate
structured JSON output:
import json

import sglang as sgl
from sglang import assistant, gen, system, user


# SGLang uses a decorator-based programming model where you define
# generation programs as Python functions decorated with @sgl.function.
# This allows SGLang to optimize the execution of complex multi-step
# generation tasks.

@sgl.function
def analyze_text(s, text: str) -> None:
    """
    Analyze a piece of text and extract structured information.

    This SGLang program instructs the model to analyze input text and
    produce a structured JSON response containing sentiment, key topics,
    and a summary. SGLang can also enforce a JSON schema through
    constrained decoding; here we simply prompt for JSON and validate
    it after generation.

    Args:
        s: The SGLang state object (injected automatically).
        text: The text to analyze.
    """
    # Set the system prompt that defines the model's behavior.
    s += system(
        "You are a text analysis assistant. Always respond with valid JSON."
    )

    # Provide the user's request with the text to analyze.
    s += user(
        f"Analyze the following text and provide a JSON response with "
        f"fields: 'sentiment' (positive/negative/neutral), "
        f"'key_topics' (list of strings), and 'summary' (one sentence).\n\n"
        f"Text: {text}"
    )

    # The gen() call tells SGLang to generate text here.
    # The max_tokens parameter limits the response length.
    # SGLang can also enforce JSON schema constraints if configured.
    s += assistant(gen("analysis", max_tokens=256))


@sgl.function
def multi_step_reasoning(s, question: str) -> None:
    """
    Perform multi-step chain-of-thought reasoning.

    This program demonstrates SGLang's ability to structure complex
    reasoning tasks. It first generates a step-by-step reasoning chain,
    then uses that reasoning to produce a final answer. This two-step
    approach often produces more accurate results than asking for the
    answer directly.

    Args:
        s: The SGLang state object.
        question: The question to reason about.
    """
    s += system(
        "You are a careful reasoner who thinks step by step before answering."
    )
    s += user(f"Question: {question}\n\nFirst, think through this step by step:")

    # Generate the reasoning chain and store it in the 'reasoning' variable.
    s += assistant(gen("reasoning", max_tokens=512))

    # Now ask for the final answer, which can reference the reasoning above.
    s += user("Based on your reasoning above, what is your final answer?")

    # Generate the final answer and store it in the 'answer' variable.
    s += assistant(gen("answer", max_tokens=128))


def run_sglang_examples() -> None:
    """
    Run the SGLang example programs against the local server.

    This function initializes the SGLang runtime to connect to the
    server we started earlier, then runs both example programs.
    """
    # Initialize the SGLang runtime to connect to the local server.
    # If you want to use the distributed two-node setup, point this
    # at Node A's management IP or localhost if running on Node A.
    sgl.set_default_backend(
        sgl.RuntimeEndpoint("http://localhost:30000")
    )

    # Run the text analysis program. Decorated SGLang functions are
    # executed with .run(), which returns a state object exposing
    # each named gen() capture.
    print("=== Structured Text Analysis ===")
    sample_text = (
        "The new NVIDIA DGX Spark is a remarkable piece of engineering. "
        "It delivers petaFLOP-scale AI performance in a desktop form factor, "
        "making enterprise-grade AI accessible to individual researchers."
    )
    result = analyze_text.run(text=sample_text)

    # Access the generated content by variable name.
    raw_analysis = result["analysis"]
    print(f"Raw output: {raw_analysis}")

    # Attempt to parse the JSON output.
    try:
        parsed = json.loads(raw_analysis)
        print(f"Sentiment: {parsed.get('sentiment', 'N/A')}")
        print(f"Key topics: {parsed.get('key_topics', [])}")
        print(f"Summary: {parsed.get('summary', 'N/A')}")
    except json.JSONDecodeError:
        print("Note: Output was not valid JSON. Adjust the prompt for stricter formatting.")

    # Run the multi-step reasoning program.
    print("\n=== Multi-Step Reasoning ===")
    question = (
        "If two DGX Spark units each have 128GB of unified memory and are "
        "connected via a 100GbE link, what is the theoretical maximum model "
        "size they could run together, and what are the practical limitations?"
    )
    result = multi_step_reasoning.run(question=question)
    print(f"Reasoning:\n{result['reasoning']}")
    print(f"\nFinal Answer:\n{result['answer']}")


if __name__ == "__main__":
    run_sglang_examples()
Chapter 11 - INFERENCE ENGINE 5: TENSORRT-LLM - MAXIMUM PERFORMANCE MODE
11.1 What Is TensorRT-LLM?
TensorRT-LLM is NVIDIA's own high-performance inference library, and it
represents the pinnacle of optimization for NVIDIA hardware. While Ollama,
LM Studio, vLLM, and SGLang are general-purpose frameworks that work across
different hardware, TensorRT-LLM is specifically engineered to extract every
last drop of performance from NVIDIA GPUs.
The way TensorRT-LLM achieves this is through model compilation. Instead of
running a model in its original format (PyTorch weights), TensorRT-LLM compiles
the model into a TensorRT engine - a highly optimized binary that is tailored
to the specific GPU architecture it will run on. This compilation process
applies a battery of optimizations: kernel fusion (combining multiple operations
into a single GPU kernel to reduce memory bandwidth), precision reduction
(converting weights to FP8 or INT4 format), layer optimization (replacing
generic PyTorch operations with hand-written CUDA kernels), and graph
optimization (reordering and eliminating redundant operations).
The result is typically 2-5x faster inference compared to unoptimized PyTorch,
with the exact speedup depending on the model architecture and the specific
GPU. For the DGX Spark's Blackwell GPU, which has dedicated hardware for FP4
and FP8 operations, the speedup can be even more dramatic.
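To see why precision reduction matters on a machine with 128GB of unified
memory, a back-of-envelope calculation helps. The sketch below (our own
illustration, not part of TensorRT-LLM) counts weight storage only; the KV
cache, activations, and runtime overhead come on top of these numbers:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold model weights, in gigabytes.

    This counts weights only. KV cache, activations, and framework
    overhead add more on top, so treat the result as a lower bound.
    """
    return num_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at the precisions TensorRT-LLM can compile to:
for bits, name in [(16, "BF16"), (8, "FP8"), (4, "FP4/INT4")]:
    print(f"{name}: {weight_memory_gb(8e9, bits):.0f} GB")
# BF16: 16 GB, FP8: 8 GB, FP4/INT4: 4 GB
```

Halving the bits halves the weight footprint, which is why the Blackwell
GPU's native FP8 and FP4 support translates directly into larger runnable
models, not just faster ones.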
The trade-off is complexity. TensorRT-LLM requires a compilation step that
can take 30 minutes to several hours for large models, and the compiled engine
is specific to the GPU architecture it was compiled for. You cannot compile
an engine on a Blackwell GPU and run it on an Ampere GPU.
11.2 Installing TensorRT-LLM
TensorRT-LLM is best installed inside a Docker container, as it has complex
dependencies that are pre-configured in NVIDIA's official container images.
The DGX Spark has Docker and the NVIDIA Container Runtime pre-installed.
Pull the official TensorRT-LLM release container (check the NGC catalog for
the current image path and tag):
docker pull nvcr.io/nvidia/tensorrt-llm/release:latest
Alternatively, install via pip in a virtual environment:
python3 -m venv ~/trtllm-env
source ~/trtllm-env/bin/activate
pip install tensorrt-llm
If the pip installation fails on ARM64, use the Docker approach, which is
more reliable:
docker run --gpus all \
--rm \
-it \
-v /home/aiuser/models:/models \
nvcr.io/nvidia/tensorrt-llm:latest \
bash
The -v flag mounts your local models directory inside the container, so models
you download are accessible both inside and outside Docker.
11.3 Building a TensorRT-LLM Engine
Building a TensorRT-LLM engine is a two-step process. First, you convert the
model weights from Hugging Face format to TensorRT-LLM's internal format.
Second, you compile the converted weights into an optimized TensorRT engine.
Step 1: Convert the model weights. This example uses Llama 3.1 8B:
# Inside the TensorRT-LLM Docker container or virtual environment:
# Download the model from Hugging Face first.
# You need the huggingface_hub library for this.
pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='meta-llama/Llama-3.1-8B-Instruct',
local_dir='/models/llama-3.1-8b-hf',
token='your_hf_token_here'
)
"
# Convert the Hugging Face weights to TensorRT-LLM checkpoint format.
# The conversion script ships with the TensorRT-LLM repository under
# examples/llama/; adjust the path to wherever you cloned or installed it.
python3 /path/to/TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /models/llama-3.1-8b-hf \
    --output_dir /models/llama-3.1-8b-trtllm-ckpt \
    --dtype bfloat16 \
    --tp_size 1
Step 2: Compile the TensorRT engine:
trtllm-build \
    --checkpoint_dir /models/llama-3.1-8b-trtllm-ckpt \
    --output_dir /models/llama-3.1-8b-trtllm-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_seq_len 4096
The --gemm_plugin flag enables TensorRT's optimized GEMM (General Matrix
Multiply) kernels; matrix multiplication is the core operation in transformer
inference. The --max_batch_size flag sets the maximum number of requests that
can be processed simultaneously. The --max_input_len and --max_seq_len flags
control the maximum prompt length and total context length, respectively.
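The batch size and sequence length you compile into the engine are not free:
together they determine how much memory the engine reserves for the KV cache.
The sketch below (our own estimate, using Llama 3.1 8B's published dimensions
of 32 layers, 8 KV heads via grouped-query attention, and head dimension 128)
shows the upper bound:

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_batch_size: int,
    max_seq_len: int,
    bytes_per_value: int = 2,  # bfloat16
) -> int:
    """Upper-bound KV cache size for a dense transformer.

    Two tensors (K and V) are stored per layer, per KV head, per
    sequence position, for every sequence in the batch.
    """
    return (
        2 * num_layers * num_kv_heads * head_dim
        * max_batch_size * max_seq_len * bytes_per_value
    )

# Llama 3.1 8B with the build settings above: batch 8, seq len 4096.
size = kv_cache_bytes(32, 8, 128, max_batch_size=8, max_seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # 4.0 GiB
```

Doubling either the batch size or the sequence length doubles this reservation,
which is worth keeping in mind when you size engines for the DGX Spark's
unified memory.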
11.4 Two-Node TensorRT-LLM with MPI
For two-node distributed inference with TensorRT-LLM, we use MPI (Message
Passing Interface), the standard parallel computing communication library.
Install MPI on both nodes:
sudo apt install -y openmpi-bin openmpi-common libopenmpi-dev
For a two-node setup, rebuild the TensorRT engine with tensor parallelism:
# On Node A: convert with tp_size 2
python3 /path/to/TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /models/llama-3.1-70b-hf \
    --output_dir /models/llama-3.1-70b-trtllm-ckpt \
    --dtype bfloat16 \
    --tp_size 2
# Build the engine for 2-GPU tensor parallelism
trtllm-build \
    --checkpoint_dir /models/llama-3.1-70b-trtllm-ckpt \
    --output_dir /models/llama-3.1-70b-trtllm-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 4 \
    --max_input_len 2048 \
    --max_seq_len 4096
Copy the engine to Node B (it must be identical on both nodes):
rsync -avz \
/models/llama-3.1-70b-trtllm-engine/ \
aiuser@10.0.0.2:/models/llama-3.1-70b-trtllm-engine/
Now launch the TensorRT-LLM server across both nodes using mpirun:
mpirun \
-n 2 \
--host 10.0.0.1,10.0.0.2 \
--mca btl_tcp_if_include enp1s0f0np0 \
python3 -m tensorrt_llm.serve \
--engine-dir /models/llama-3.1-70b-trtllm-engine \
--host 0.0.0.0 \
--port 8080
The -n 2 flag launches 2 MPI processes (one per node). The --host flag
specifies the two nodes using their ConnectX-7 IP addresses. The
--mca btl_tcp_if_include flag tells MPI to use the ConnectX-7 interface for
communication, routing all inter-node traffic over the high-speed direct link.
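If you launch this often, it is easy to mistype one of the mpirun flags. A
small helper (our own convenience sketch, not part of TensorRT-LLM or MPI)
that assembles the same invocation programmatically:

```python
import shlex

def build_mpirun_command(
    hosts: list[str],
    interface: str,
    engine_dir: str,
    port: int = 8080,
) -> list[str]:
    """Assemble the mpirun invocation for two-node TensorRT-LLM serving.

    One MPI rank is launched per host, and all MPI TCP traffic is pinned
    to the given network interface (the ConnectX-7 link in our setup).
    """
    return [
        "mpirun",
        "-n", str(len(hosts)),
        "--host", ",".join(hosts),
        "--mca", "btl_tcp_if_include", interface,
        "python3", "-m", "tensorrt_llm.serve",
        "--engine-dir", engine_dir,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]

cmd = build_mpirun_command(
    ["10.0.0.1", "10.0.0.2"],
    "enp1s0f0np0",
    "/models/llama-3.1-70b-trtllm-engine",
)
print(shlex.join(cmd))     # inspect the command before running it
# subprocess.run(cmd)      # uncomment to actually launch
```

Printing the command with shlex.join before executing it makes it easy to
verify the host list and interface name match your actual setup.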
11.5 Querying the TensorRT-LLM Server
The TensorRT-LLM server exposes an OpenAI-compatible API, so the same client
code works as for vLLM and SGLang. The following example adds performance
measurement to help you appreciate the speed difference:
import time
import requests
import json
from dataclasses import dataclass
@dataclass
class InferenceResult:
"""
Container for inference results including performance metrics.
This dataclass bundles the generated text with timing information,
making it easy to compare performance across different inference
engines and configurations.
"""
response_text: str
prompt_tokens: int
completion_tokens: int
total_time_seconds: float
tokens_per_second: float
def query_trtllm_server(
prompt: str,
host: str = "localhost",
port: int = 8080,
max_tokens: int = 256,
temperature: float = 0.7
) -> InferenceResult:
"""
Query the TensorRT-LLM server and measure performance.
This function sends a completion request to the TensorRT-LLM server
and measures the time taken to generate the response. The tokens
per second metric is the key performance indicator for LLM inference:
higher is better, and TensorRT-LLM typically achieves the highest
values of any inference framework on NVIDIA hardware.
Args:
prompt: The text prompt to complete.
host: The TensorRT-LLM server hostname or IP.
port: The server port number.
max_tokens: Maximum tokens to generate.
temperature: Sampling temperature.
Returns:
An InferenceResult with the response and performance metrics.
"""
url = f"http://{host}:{port}/v1/completions"
payload = {
"model": "tensorrt-llm", # TensorRT-LLM uses this as a placeholder.
"prompt": prompt,
"max_tokens": max_tokens,
"temperature": temperature,
"stream": False
}
# Record the start time before sending the request.
start_time = time.perf_counter()
response = requests.post(url, json=payload)
response.raise_for_status()
# Record the end time after receiving the complete response.
end_time = time.perf_counter()
data = response.json()
total_time = end_time - start_time
# Extract usage statistics from the response.
usage = data.get("usage", {})
prompt_tokens = usage.get("prompt_tokens", 0)
completion_tokens = usage.get("completion_tokens", 0)
# Calculate tokens per second. This is the primary performance metric.
# Divide by total time to get the overall throughput including
# network overhead and server processing time.
tokens_per_second = completion_tokens / total_time if total_time > 0 else 0
response_text = data["choices"][0]["text"] if data.get("choices") else ""
return InferenceResult(
response_text=response_text,
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_time_seconds=total_time,
tokens_per_second=tokens_per_second
)
if __name__ == "__main__":
test_prompt = (
"Explain the difference between tensor parallelism and "
"pipeline parallelism in distributed deep learning:"
)
print("Querying TensorRT-LLM server (two-node distributed)...")
result = query_trtllm_server(
prompt=test_prompt,
host="localhost",
port=8080,
max_tokens=256
)
print(f"\nResponse:\n{result.response_text}")
print(f"\nPerformance Metrics:")
print(f" Prompt tokens: {result.prompt_tokens}")
print(f" Completion tokens: {result.completion_tokens}")
print(f" Total time: {result.total_time_seconds:.2f} seconds")
print(f" Throughput: {result.tokens_per_second:.1f} tokens/second")
Chapter 12 - WRITING CODE THAT TALKS TO LOCAL AND REMOTE LLMS
12.1 The Unified Client: One Interface, Five Engines
One of the most powerful patterns when working with multiple inference engines
is to write a unified client that abstracts away the differences between them.
All five engines we have covered (Ollama, LM Studio, vLLM, SGLang, and
TensorRT-LLM) expose either the Ollama native API or an OpenAI-compatible
API. This means we can write a single client class that works with all of
them by selecting the right protocol and endpoint URL.
The following code implements a comprehensive unified LLM client that supports
both local models (via Ollama) and remote models (via any OpenAI-compatible
endpoint). It includes retry logic, error handling, and performance monitoring:
import time
import json
import logging
import requests
from enum import Enum
from dataclasses import dataclass
from typing import Optional, Iterator
from openai import OpenAI
# Configure logging so we can see what the client is doing.
# In production, you would configure this to write to a file.
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("llm_client")
class BackendType(Enum):
"""
Enumeration of supported LLM backend types.
Each backend type corresponds to a different inference engine.
The client uses this to determine which API protocol to use
and how to format requests and parse responses.
"""
OLLAMA = "ollama"
OPENAI_COMPATIBLE = "openai_compatible" # vLLM, LM Studio, SGLang, TensorRT-LLM
@dataclass
class LLMConfig:
"""
Configuration for connecting to an LLM backend.
This dataclass holds all the information needed to connect to
and communicate with an LLM inference server. Separating
configuration from logic makes it easy to switch between
different servers and models.
"""
host: str
port: int
model: str
backend_type: BackendType
api_key: str = "not-required"
timeout_seconds: int = 120
max_retries: int = 3
retry_delay_seconds: float = 1.0
@property
def base_url(self) -> str:
"""Construct the base URL for the API endpoint."""
if self.backend_type == BackendType.OLLAMA:
return f"http://{self.host}:{self.port}"
else:
return f"http://{self.host}:{self.port}/v1"
@dataclass
class ChatMessage:
"""A single message in a conversation."""
role: str # "system", "user", or "assistant"
content: str
def to_dict(self) -> dict:
"""Convert to the dict format expected by API endpoints."""
return {"role": self.role, "content": self.content}
@dataclass
class GenerationResult:
"""
The result of a text generation request.
This dataclass captures both the generated content and metadata
about the generation, including timing information that helps
you understand and optimize your inference pipeline.
"""
content: str
model: str
backend: str
prompt_tokens: int = 0
completion_tokens: int = 0
generation_time_seconds: float = 0.0
@property
def tokens_per_second(self) -> float:
"""Calculate the generation throughput in tokens per second."""
if self.generation_time_seconds > 0 and self.completion_tokens > 0:
return self.completion_tokens / self.generation_time_seconds
return 0.0
def __str__(self) -> str:
return (
f"GenerationResult(\n"
f" backend={self.backend},\n"
f" model={self.model},\n"
f" tokens={self.completion_tokens},\n"
f" speed={self.tokens_per_second:.1f} tok/s\n"
f")"
)
class UnifiedLLMClient:
"""
A unified client for interacting with multiple LLM inference backends.
This client provides a consistent interface for sending requests to
any of the five inference engines covered in this tutorial. It handles
the differences in API protocols, request formats, and response
structures transparently.
The client supports both the Ollama native API and the OpenAI-compatible
API, automatically selecting the correct protocol based on the backend
type specified in the configuration.
Usage example:
# Configure for local Ollama
config = LLMConfig(
host="localhost",
port=11434,
model="llama3.2:3b",
backend_type=BackendType.OLLAMA
)
client = UnifiedLLMClient(config)
result = client.chat([ChatMessage("user", "Hello!")])
"""
def __init__(self, config: LLMConfig) -> None:
"""
Initialize the client with the given configuration.
Args:
config: The LLMConfig specifying which server to connect to.
"""
self.config = config
self._openai_client: Optional[OpenAI] = None
# Only create the OpenAI client for OpenAI-compatible backends.
if config.backend_type == BackendType.OPENAI_COMPATIBLE:
self._openai_client = OpenAI(
base_url=config.base_url,
api_key=config.api_key,
timeout=config.timeout_seconds
)
logger.info(
f"Initialized LLM client: {config.backend_type.value} "
f"at {config.host}:{config.port} "
f"using model '{config.model}'"
)
def _chat_via_ollama(
self,
messages: list[ChatMessage],
temperature: float,
max_tokens: int
) -> GenerationResult:
"""
Send a chat request using the Ollama native API.
The Ollama API uses a different request format than the OpenAI API.
It accepts messages in a similar format but uses different field
names and response structure.
Args:
messages: The conversation history.
temperature: Sampling temperature.
max_tokens: Maximum tokens to generate.
Returns:
A GenerationResult with the response and metadata.
"""
url = f"{self.config.base_url}/api/chat"
payload = {
"model": self.config.model,
"messages": [msg.to_dict() for msg in messages],
"stream": False,
"options": {
"temperature": temperature,
"num_predict": max_tokens
}
}
start_time = time.perf_counter()
response = requests.post(
url,
json=payload,
timeout=self.config.timeout_seconds
)
response.raise_for_status()
elapsed = time.perf_counter() - start_time
data = response.json()
# Extract the response content from the Ollama API response format.
content = data.get("message", {}).get("content", "")
# Ollama provides token counts in the response metadata.
prompt_tokens = data.get("prompt_eval_count", 0)
completion_tokens = data.get("eval_count", 0)
return GenerationResult(
content=content,
model=self.config.model,
backend="ollama",
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
generation_time_seconds=elapsed
)
def _chat_via_openai(
self,
messages: list[ChatMessage],
temperature: float,
max_tokens: int
) -> GenerationResult:
"""
Send a chat request using the OpenAI-compatible API.
This method works with vLLM, LM Studio, SGLang, and TensorRT-LLM,
all of which implement the OpenAI chat completions API.
Args:
messages: The conversation history.
temperature: Sampling temperature.
max_tokens: Maximum tokens to generate.
Returns:
A GenerationResult with the response and metadata.
"""
assert self._openai_client is not None, (
"OpenAI client not initialized. "
"Check that backend_type is OPENAI_COMPATIBLE."
)
start_time = time.perf_counter()
completion = self._openai_client.chat.completions.create(
model=self.config.model,
messages=[msg.to_dict() for msg in messages],
temperature=temperature,
max_tokens=max_tokens
)
elapsed = time.perf_counter() - start_time
content = completion.choices[0].message.content or ""
usage = completion.usage
return GenerationResult(
content=content,
model=self.config.model,
backend=self.config.backend_type.value,
prompt_tokens=usage.prompt_tokens if usage else 0,
completion_tokens=usage.completion_tokens if usage else 0,
generation_time_seconds=elapsed
)
def chat(
self,
messages: list[ChatMessage],
temperature: float = 0.7,
max_tokens: int = 512
) -> GenerationResult:
"""
Send a chat request with automatic retry on failure.
This is the primary public method for sending requests. It
automatically selects the correct API protocol based on the
backend type and retries failed requests up to max_retries times.
Args:
messages: The conversation history as a list of ChatMessages.
temperature: Sampling temperature (0.0 = deterministic).
max_tokens: Maximum number of tokens to generate.
Returns:
A GenerationResult with the response and performance metrics.
Raises:
RuntimeError: If all retry attempts fail.
"""
last_error: Optional[Exception] = None
for attempt in range(self.config.max_retries):
try:
if self.config.backend_type == BackendType.OLLAMA:
result = self._chat_via_ollama(
messages, temperature, max_tokens
)
else:
result = self._chat_via_openai(
messages, temperature, max_tokens
)
logger.info(
f"Request completed: {result.completion_tokens} tokens "
f"at {result.tokens_per_second:.1f} tok/s"
)
return result
            except Exception as error:
                last_error = error
                # Only sleep and retry if attempts remain; after the final
                # attempt we fall through and raise immediately.
                if attempt < self.config.max_retries - 1:
                    logger.warning(
                        f"Request attempt {attempt + 1} failed: {error}. "
                        f"Retrying in {self.config.retry_delay_seconds}s..."
                    )
                    time.sleep(self.config.retry_delay_seconds)
                else:
                    logger.warning(f"Final attempt {attempt + 1} failed: {error}")
raise RuntimeError(
f"All {self.config.max_retries} attempts failed. "
f"Last error: {last_error}"
)
def stream_chat(
self,
messages: list[ChatMessage],
temperature: float = 0.7,
max_tokens: int = 512
) -> Iterator[str]:
"""
Stream a chat response token by token.
This method yields tokens as they are generated, which is
useful for building responsive user interfaces. Note that
streaming is only supported for OpenAI-compatible backends
in this implementation.
Args:
messages: The conversation history.
temperature: Sampling temperature.
max_tokens: Maximum tokens to generate.
Yields:
Individual tokens as strings.
"""
if self.config.backend_type == BackendType.OLLAMA:
# Use the Ollama streaming API.
url = f"{self.config.base_url}/api/chat"
payload = {
"model": self.config.model,
"messages": [msg.to_dict() for msg in messages],
"stream": True,
"options": {"temperature": temperature, "num_predict": max_tokens}
}
with requests.post(url, json=payload, stream=True) as response:
response.raise_for_status()
for line in response.iter_lines():
if line:
chunk = json.loads(line)
token = chunk.get("message", {}).get("content", "")
if token:
yield token
if chunk.get("done", False):
break
else:
# Use the OpenAI streaming API.
assert self._openai_client is not None
stream = self._openai_client.chat.completions.create(
model=self.config.model,
messages=[msg.to_dict() for msg in messages],
temperature=temperature,
max_tokens=max_tokens,
stream=True
)
for chunk in stream:
if chunk.choices and chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
def demonstrate_all_backends() -> None:
"""
Demonstrate the unified client with all five inference backends.
This function creates a client for each backend and sends the same
question to all of them, then compares the responses and performance.
It assumes all servers are running on the appropriate ports as
configured throughout this tutorial.
"""
# Define configurations for all five backends.
# Adjust host addresses and ports to match your actual setup.
configs = [
LLMConfig(
host="localhost",
port=11434,
model="llama3.2:3b",
backend_type=BackendType.OLLAMA
),
LLMConfig(
host="localhost",
port=1234,
model="local-model",
backend_type=BackendType.OPENAI_COMPATIBLE
),
LLMConfig(
host="localhost",
port=8000,
model="meta-llama/Llama-3.1-70B-Instruct",
backend_type=BackendType.OPENAI_COMPATIBLE
),
LLMConfig(
host="localhost",
port=30000,
model="meta-llama/Llama-3.1-8B-Instruct",
backend_type=BackendType.OPENAI_COMPATIBLE
),
LLMConfig(
host="localhost",
port=8080,
model="tensorrt-llm",
backend_type=BackendType.OPENAI_COMPATIBLE
)
]
backend_names = ["Ollama", "LM Studio", "vLLM", "SGLang", "TensorRT-LLM"]
# The same question is sent to all backends for a fair comparison.
question = (
"In one paragraph, explain why unified memory architecture is "
"important for running large language models."
)
messages = [
ChatMessage(
role="system",
content="You are a concise technical expert. Answer in one paragraph."
),
ChatMessage(role="user", content=question)
]
print("=" * 70)
print("COMPARING ALL FIVE INFERENCE BACKENDS")
print("=" * 70)
print(f"Question: {question}\n")
for config, name in zip(configs, backend_names):
print(f"\n--- {name} ---")
try:
client = UnifiedLLMClient(config)
result = client.chat(messages=messages, temperature=0.3, max_tokens=256)
print(f"Response: {result.content}")
print(f"Speed: {result.tokens_per_second:.1f} tokens/second")
except Exception as error:
print(f"Error connecting to {name}: {error}")
print("(Make sure the server is running on the expected port)")
if __name__ == "__main__":
demonstrate_all_backends()
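The five port numbers above recur throughout this tutorial, and scattering
them across scripts invites typos. One way to keep them consistent is a small
registry; the sketch below is our own convention (names and structure are not
part of any library), using the default ports assumed in this chapter:

```python
# Default endpoints assumed throughout this tutorial. Adjust to your setup.
BACKEND_DEFAULTS: dict[str, dict] = {
    "ollama":       {"port": 11434, "protocol": "ollama"},
    "lmstudio":     {"port": 1234,  "protocol": "openai"},
    "vllm":         {"port": 8000,  "protocol": "openai"},
    "sglang":       {"port": 30000, "protocol": "openai"},
    "tensorrt-llm": {"port": 8080,  "protocol": "openai"},
}

def endpoint_for(backend: str, host: str = "localhost") -> str:
    """Return the base URL for a backend by name.

    OpenAI-compatible servers expect the /v1 prefix; Ollama's native
    API does not use it.
    """
    entry = BACKEND_DEFAULTS[backend]
    suffix = "/v1" if entry["protocol"] == "openai" else ""
    return f"http://{host}:{entry['port']}{suffix}"

print(endpoint_for("vllm"))    # http://localhost:8000/v1
print(endpoint_for("ollama"))  # http://localhost:11434
```

Pointing a backend at Node B is then just a matter of passing its ConnectX-7
address as the host argument.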
Chapter 13 - MONITORING, TROUBLESHOOTING, AND KEEPING THINGS RUNNING
13.1 Real-Time GPU Monitoring
Understanding what your GPUs are doing is essential for diagnosing performance
issues and ensuring your inference workloads are running efficiently. The
primary tool for this is nvidia-smi, which you can run in watch mode to get
a continuously updating display:
watch -n 1 nvidia-smi
This refreshes the output every second. You will see GPU utilization
(ideally close to 100% during inference), memory usage (which grows as models
are loaded), temperature (should stay below 85°C for sustained workloads),
and power consumption.
The nvtop tool provides a more visual, htop-like interface:
nvtop
For monitoring both nodes simultaneously from a single terminal, you can use
SSH to run nvidia-smi on Node B and display the output locally:
# In one terminal pane: monitor Node A
watch -n 1 nvidia-smi
# In another terminal pane: monitor Node B via SSH.
# The -t flag allocates a pseudo-terminal, which watch requires.
ssh -t aiuser@10.0.0.2 "watch -n 1 nvidia-smi"
13.2 Monitoring Network Performance
During distributed inference, the ConnectX-7 link is the critical path. If
the network is not performing well, your distributed inference will be slow
regardless of how fast the GPUs are. Monitor network throughput with:
# Watch network interface statistics in real time.
# Replace enp1s0f0np0 with your actual ConnectX-7 interface name.
watch -n 1 "cat /proc/net/dev | grep enp1s0f0np0"
For a more detailed view, use the sar tool (part of the sysstat package):
sudo apt install -y sysstat
sar -n DEV 1 100
This shows network statistics for all interfaces, updated every second, for
100 iterations. Look for the enp1s0f0np0 interface and check that the
rxkB/s and txkB/s values are consistent with your expected workload.
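The counters in /proc/net/dev are cumulative byte totals, so throughput comes
from sampling the file twice and dividing the delta by the interval. A minimal
parser sketch (our own; the sample line below is fabricated for illustration,
with the fields laid out as the kernel emits them):

```python
def parse_proc_net_dev(text: str, interface: str) -> tuple[int, int]:
    """Extract cumulative (rx_bytes, tx_bytes) for one interface
    from the contents of /proc/net/dev."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == interface:
            fields = rest.split()
            # Field 0 is receive bytes; field 8 is transmit bytes.
            return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {interface!r} not found")

def throughput_gbps(delta_bytes: int, interval_s: float) -> float:
    """Convert a byte delta over an interval into gigabits per second."""
    return delta_bytes * 8 / interval_s / 1e9

# In real use, read /proc/net/dev twice and subtract; here we treat the
# counters as a one-second delta for illustration.
sample = "enp1s0f0np0: 1250000000 900 0 0 0 0 0 0 2500000000 1200 0 0 0 0 0 0"
rx, tx = parse_proc_net_dev(sample, "enp1s0f0np0")
print(throughput_gbps(rx, 1.0))  # 10.0 Gbit/s
```

At 10 Gbit/s you are using a tenth of the ConnectX-7 link; sustained numbers
far below what your workload should generate are a sign the traffic is taking
the wrong interface.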
13.3 Setting Up Systemd Services for Inference Engines
For production use, you want your inference servers to start automatically
when the machine boots and to restart automatically if they crash. Systemd
services handle this perfectly. Here is an example service file for vLLM:
sudo nano /etc/systemd/system/vllm-server.service
Enter the following content, adjusting paths and parameters for your setup:
[Unit]
Description=vLLM OpenAI-Compatible Inference Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=aiuser
WorkingDirectory=/home/aiuser
Environment="PATH=/home/aiuser/vllm-env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="HF_TOKEN=your_huggingface_token_here"
ExecStart=/home/aiuser/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable vllm-server
sudo systemctl start vllm-server
View the service logs:
journalctl -u vllm-server -f
The -f flag follows the log in real time, similar to tail -f. This is
invaluable for debugging startup issues.
13.4 A Health Check Script
The following script checks the health of all inference engines and reports
which ones are running and responding correctly. Run this after a reboot or
when you suspect something is not working:
import requests
import subprocess
from dataclasses import dataclass
@dataclass
class ServiceStatus:
"""Status information for a single inference service."""
name: str
host: str
port: int
is_running: bool
response_time_ms: float
error_message: str = ""
def check_service_health(
name: str,
host: str,
port: int,
health_endpoint: str = "/health",
timeout: float = 5.0
) -> ServiceStatus:
"""
Check whether an inference service is running and responding.
This function attempts to connect to the service's health endpoint
and measures the response time. A successful response (HTTP 200)
indicates the service is healthy. Any error indicates a problem
that needs investigation.
Args:
name: Human-readable name of the service.
host: The hostname or IP of the service.
port: The port number.
health_endpoint: The URL path for the health check endpoint.
timeout: Maximum time to wait for a response in seconds.
Returns:
A ServiceStatus object with the health check results.
"""
import time
url = f"http://{host}:{port}{health_endpoint}"
start = time.perf_counter()
try:
response = requests.get(url, timeout=timeout)
elapsed_ms = (time.perf_counter() - start) * 1000
return ServiceStatus(
name=name,
host=host,
port=port,
is_running=response.status_code == 200,
response_time_ms=elapsed_ms
)
except requests.exceptions.ConnectionError:
return ServiceStatus(
name=name,
host=host,
port=port,
is_running=False,
response_time_ms=0.0,
error_message="Connection refused - service may not be running"
)
except requests.exceptions.Timeout:
return ServiceStatus(
name=name,
host=host,
port=port,
is_running=False,
response_time_ms=timeout * 1000,
error_message="Timeout - service is running but not responding"
)
def check_gpu_health() -> dict:
"""
Check GPU health using nvidia-smi.
Returns a dict with GPU temperature, utilization, and memory usage.
This helps identify whether the GPU is overheating or running out
of memory, which are common causes of inference failures.
"""
try:
result = subprocess.run(
[
"nvidia-smi",
"--query-gpu=temperature.gpu,utilization.gpu,memory.used,memory.total",
"--format=csv,noheader,nounits"
],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
values = result.stdout.strip().split(", ")
return {
"temperature_c": int(values[0]),
"utilization_pct": int(values[1]),
"memory_used_mb": int(values[2]),
"memory_total_mb": int(values[3])
}
except Exception as error:
return {"error": str(error)}
return {}
if __name__ == "__main__":
print("=" * 60)
print("DGX SPARK INFERENCE ENGINE HEALTH CHECK")
print("=" * 60)
# Define all services to check.
services_to_check = [
("Ollama (Node A)", "localhost", 11434, "/api/tags"),
("Ollama (Node B)", "10.0.0.2", 11434, "/api/tags"),
("LM Studio (Node A)", "localhost", 1234, "/v1/models"),
("vLLM (Node A)", "localhost", 8000, "/health"),
("SGLang (Node A)", "localhost", 30000, "/health"),
("TensorRT-LLM (Node A)", "localhost", 8080, "/health"),
]
print("\nService Status:")
print("-" * 60)
for name, host, port, endpoint in services_to_check:
status = check_service_health(name, host, port, endpoint)
status_str = "RUNNING" if status.is_running else "DOWN"
if status.is_running:
print(
f" [{status_str:7}] {name:<30} "
f"{status.response_time_ms:.1f}ms"
)
else:
print(
f" [{status_str:7}] {name:<30} "
f"{status.error_message}"
)
print("\nGPU Health (Node A):")
print("-" * 60)
gpu_info = check_gpu_health()
if "error" not in gpu_info:
print(f" Temperature: {gpu_info['temperature_c']}°C")
print(f" Utilization: {gpu_info['utilization_pct']}%")
memory_used_gb = gpu_info['memory_used_mb'] / 1024
memory_total_gb = gpu_info['memory_total_mb'] / 1024
print(f" Memory: {memory_used_gb:.1f}GB / {memory_total_gb:.1f}GB")
else:
print(f" Error: {gpu_info['error']}")
13.5 Common Problems and Solutions
The following describes the most common issues you will encounter and how to
resolve them.
If nvidia-smi shows "No devices were found" after a kernel update, the GPU
driver module may not have been recompiled for the new kernel. Run
"sudo apt install --reinstall nvidia-driver-570" (or whatever the current
driver version is) to reinstall the driver, which triggers recompilation.
If an inference server fails to start with "CUDA out of memory," another
process is using GPU memory. Run "nvidia-smi" to identify the process, then
kill it with "sudo kill -9 <PID>". Also check that you are not trying to load
a model that exceeds the available memory.
If the ConnectX-7 link shows as "down" in "ip link show," check that the DAC
cable is fully seated in both ports. Try unplugging and replugging the cable.
If the problem persists, verify the cable is rated for 100G (QSFP28) and not
40G (QSFP+).
If NCCL reports "Connection refused" during multi-node startup, the firewall
may be blocking the NCCL communication ports. Disable the firewall temporarily
for testing: "sudo ufw disable". If this fixes the problem, add rules to allow
traffic on the NCCL ports (typically 29500 and above) between the two nodes.
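Before touching firewall rules, it is worth confirming which ports are
actually reachable from the other node. A small stdlib probe (our own sketch;
29500 is the rendezvous port PyTorch-style launchers typically default to,
and 10.0.0.2 is Node B in our setup):

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the default rendezvous port on Node B from Node A.
state = "open" if port_is_open("10.0.0.2", 29500) else "closed/filtered"
print(f"10.0.0.2:29500 -> {state}")
```

A "closed/filtered" result while the server process is running on Node B
points at the firewall (or the wrong bind interface) rather than NCCL itself.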
If vLLM's Ray cluster fails to form, ensure that the Ray head node is fully
started before running "ray start" on the worker node. The head node logs
should show "Ray runtime started" before you proceed with the worker.
Chapter 14 - CLOSING THOUGHTS AND NEXT STEPS
14.1 What You Have Accomplished
If you have followed this tutorial to this point, you have done something
genuinely impressive. You have set up two of the most powerful personal AI
workstations available, connected them with a high-speed 100GbE direct link,
configured RDMA for low-latency GPU-to-GPU communication, and installed five
different inference engines that cover the full spectrum from beginner-friendly
to maximum-performance. You have also written Python code that can talk to all
of these engines, both locally and across the network.
This is not a trivial achievement. Many organizations spend months and
significant engineering resources to build AI inference infrastructure at this
level. You now have it running on two machines on your desk.
14.2 Choosing the Right Tool for the Job
Now that you have all five engines available, here is a practical guide to
choosing the right one for different situations.
Ollama is the right choice when you want to quickly experiment with a new
model, when you need the simplest possible setup, or when you are building a
prototype that you want to get running in minutes rather than hours. Its model
library is excellent, and the REST API is simple and well-documented.
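As a reminder of how simple that REST API is, here is a minimal non-streaming
client using only the standard library. The address and model name match this
tutorial's setup; substitute your own:

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.100:11434"  # Node A management address

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to Ollama and return the generated text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]

# Example usage (requires Ollama running with the model pulled):
# print(generate("llama3.2:3b", "Explain RDMA in one sentence."))
```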
LM Studio is the right choice when you want a graphical interface for
interactive model exploration, when you are demonstrating AI capabilities to
non-technical stakeholders, or when you want to quickly compare different
models side by side without writing code.
vLLM is the right choice when you need to serve many concurrent users with
high throughput, when you are building a production API service, or when you
need the flexibility of multi-node tensor parallelism for very large models.
Its PagedAttention makes it the most memory-efficient option for high-concurrency
workloads.
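To see that high-concurrency strength in action, you can fire several requests
at vLLM's OpenAI-compatible endpoint at once; its continuous batching merges
them on the GPU. A sketch using only the standard library — the served model
name is whatever you passed to "vllm serve", so the one below is a placeholder:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

VLLM_URL = "http://192.168.1.100:8000/v1/completions"  # Node A, head node
MODEL = "your-served-model"  # placeholder: use the name you served with vLLM

def build_completion_request(prompt: str, max_tokens: int = 64) -> dict:
    """Payload for the OpenAI-compatible /v1/completions endpoint."""
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt: str) -> str:
    data = json.dumps(build_completion_request(prompt)).encode()
    req = urllib.request.Request(
        VLLM_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# Example usage: eight concurrent requests, batched server-side by vLLM.
# prompts = [f"Write a haiku about GPU number {i}." for i in range(8)]
# with ThreadPoolExecutor(max_workers=8) as pool:
#     for text in pool.map(complete, prompts):
#         print(text.strip(), "\n---")
```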
SGLang is the right choice when your application involves structured output
generation (JSON, XML, specific formats), complex multi-step reasoning chains,
or workloads where many requests share a common prefix (like a shared system
prompt). Its RadixAttention makes it uniquely efficient for these patterns.
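A sketch of structured output against SGLang's OpenAI-compatible endpoint.
The "response_format" field follows the OpenAI JSON-schema convention, which
recent SGLang versions support — verify against your installed version; the
schema and model name here are illustrative:

```python
import json
import urllib.request

SGLANG_URL = "http://192.168.1.100:30000/v1/chat/completions"  # Node A

SCHEMA = {  # the shape we want the model to emit
    "type": "object",
    "properties": {"name": {"type": "string"}, "port": {"type": "integer"}},
    "required": ["name", "port"],
}

def build_structured_request(question: str) -> dict:
    """Chat request that constrains the reply to match SCHEMA."""
    return {
        "model": "default",
        "messages": [{"role": "user", "content": question}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "engine_info", "schema": SCHEMA},
        },
    }

def ask(question: str) -> dict:
    data = json.dumps(build_structured_request(question)).encode()
    req = urllib.request.Request(
        SGLANG_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        content = json.loads(resp.read())["choices"][0]["message"]["content"]
    return json.loads(content)  # parseable JSON under constrained decoding

# Example usage (requires the SGLang server running):
# print(ask("Which engine serves structured output here, and on which port?"))
```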
TensorRT-LLM is the right choice when raw inference speed is the paramount
concern and you are willing to invest time in the compilation process. If you
are running the same model continuously in production and need the absolute
maximum tokens per second, TensorRT-LLM will outperform all other options on
NVIDIA hardware.
14.3 Next Steps and Further Exploration
The setup described in this tutorial is a solid foundation, but there is much
more to explore. Fine-tuning models on your own data using frameworks like
Hugging Face PEFT and LoRA is a natural next step that allows you to customize
models for your specific domain. The DGX Spark's unified memory architecture
makes fine-tuning of 7B-13B parameter models feasible on a single node.
Exploring quantization techniques - specifically GPTQ, AWQ, and GGUF - will
help you fit larger models into the available memory and run them faster.
Quantization reduces the precision of model weights (from 16-bit to 8-bit or
4-bit), trading a small amount of quality for significant reductions in memory
usage and inference time.
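The memory savings are easy to estimate from first principles: weight memory
is roughly parameter count times bits per weight divided by eight. A quick
back-of-the-envelope calculator (weights only — KV cache and activations
come on top):

```python
def weight_memory_gb(n_params_billion: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GB (decimal)."""
    bytes_total = n_params_billion * 1e9 * bits / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"70B model at {bits:>2}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# prints:
# 70B model at 16-bit: ~140 GB
# 70B model at  8-bit: ~70 GB
# 70B model at  4-bit: ~35 GB
```

This is why a 70B model that is hopeless at 16-bit precision becomes
practical once quantized to 4-bit.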
Building a proper model serving pipeline with load balancing, request queuing,
and monitoring using tools like Prometheus and Grafana will prepare your setup
for production use. The health check script in Chapter 13 is a starting point,
but a full observability stack gives you much deeper insight into system
behavior.
Experimenting with multimodal models - models that can process both text and
images - is another exciting direction. The DGX Spark's memory capacity makes
it well-suited for models like LLaVA, Qwen-VL, and similar vision-language
models.
Finally, connecting your two-node DGX Spark cluster to a larger network of
machines, or integrating it with cloud resources for burst capacity, opens up
possibilities for truly large-scale AI workloads. The skills and concepts you
have learned in this tutorial - RDMA networking, distributed inference,
multi-node coordination - are directly applicable to clusters of any size.
The machines are ready. The software is installed. The network is configured.
What you build next is entirely up to you.
APPENDIX: QUICK REFERENCE CARD
NETWORK ADDRESSES
Node A management: 192.168.1.100
Node B management: 192.168.1.101
Node A ConnectX-7: 10.0.0.1
Node B ConnectX-7: 10.0.0.2
INFERENCE ENGINE PORTS
Ollama: 11434 (both nodes)
LM Studio: 1234 (both nodes)
vLLM: 8000 (Node A, head node)
SGLang: 30000 (Node A, head node)
TensorRT-LLM: 8080 (Node A, head node)
ESSENTIAL COMMANDS
Check GPU status: nvidia-smi
Monitor GPU live: watch -n 1 nvidia-smi
Monitor GPU visual: nvtop
Check network: ip link show
Test bandwidth: iperf3 -c 10.0.0.2 -t 30 -P 4
Pull Ollama model: ollama pull llama3.2:3b
Run Ollama model: ollama run llama3.2:3b
Check service: systemctl status <service-name>
View service logs: journalctl -u <service-name> -f
ENVIRONMENT VARIABLES FOR NCCL
NCCL_IB_GID_INDEX=3
NCCL_IB_DISABLE=0
NCCL_NET_GDR_LEVEL=5
NCCL_SOCKET_IFNAME=enp1s0f0np0