Saturday, May 16, 2026

COMBINING TWO NVIDIA DGX SPARK WORKSTATIONS FOR MULTI-NODE LLM INFERENCE




PREFACE: WHY THIS GUIDE EXISTS

Imagine having access to a system capable of running a 405-billion-parameter AI model on your desk. Not in a data center somewhere in Virginia, not rented by the hour from a cloud provider, but physically sitting in your office, humming quietly, ready to serve your queries with complete data privacy and zero monthly subscription fees. This is not science fiction. This is exactly what two NVIDIA DGX Spark workstations, connected together with a single cable, can deliver today.

This guide will take you from zero — including the very first power-on with no monitor attached — to a fully operational dual-node LLM inference cluster. Along the way you will learn why each piece of hardware and software exists, what problem it solves, and how all the pieces fit together into a coherent, high-performance whole.

This guide is written for headless operation. The DGX Spark is designed to sit on a shelf or in a rack without a monitor, keyboard, or mouse. Every step assumes you are working from a separate laptop or workstation connected to the same network, accessing the DGX Spark systems entirely over SSH.

We cover three major inference frameworks — TensorRT-LLM, vLLM, and SGLang — and explain the performance technologies that make the difference between a sluggish system and one that feels genuinely fast and responsive.


PART ZERO: BEFORE YOU BEGIN — HEADLESS SETUP AND REMOTE ACCESS


CHAPTER 0: INITIAL DISCOVERY AND SSH ACCESS (HEADLESS FIRST BOOT)

This chapter must be read before anything else. The DGX Spark ships with DGX OS (Ubuntu-based) pre-installed. You do not need a monitor for any part of this guide after the very first power-on.

0.1 What You Need on Your Laptop

On your laptop (macOS, Windows with WSL2, or Linux), ensure you have:

  • An SSH client (ssh command, available by default on macOS and Linux; use Windows Terminal + OpenSSH on Windows)
  • nmap (optional, for network scanning)
  • A QSFP56 cable (for the inter-node link — this comes later)
  • Your router's admin panel access (to find DHCP leases)

0.2 First Power-On and Network Discovery

Connect each DGX Spark to your local network via its management Ethernet port (the standard RJ-45 port, separate from the QSFP56 ports). Power on the machine. DGX OS will boot and automatically obtain an IP address via DHCP.

Method A — mDNS (easiest, no router access needed):

The DGX Spark broadcasts its hostname via mDNS. The default hostname follows the pattern dgx-spark-XXXX.local, where XXXX is derived from the MAC address. From your laptop:

# macOS / Linux with avahi-daemon:
ping dgx-spark-XXXX.local

# Or use avahi-browse to discover all DGX Sparks on the network:
avahi-browse -t _ssh._tcp

Method B — Router DHCP lease table:

Log into your router's admin panel (typically 192.168.1.1 or 192.168.0.1). Look for DHCP leases with hostnames matching dgx-spark-* or with NVIDIA MAC address prefixes.

Method C — Network scan:

# Scan your local subnet for SSH-capable hosts:
nmap -p 22 --open 192.168.1.0/24

0.3 First SSH Connection

Once you have the IP address or mDNS hostname:

ssh dgx@dgx-spark-XXXX.local
# or
ssh dgx@192.168.1.XXX

The default credentials on DGX OS are:

  • Username: dgx (or the username you set during initial OS configuration)
  • Password: Set during first-boot wizard (if you connected a monitor briefly) or via the NVIDIA Sync app

Note: If you have the NVIDIA Sync desktop application on your laptop, it can discover DGX Spark devices on your network automatically and manage SSH keys for you. This is the easiest path for initial setup.

0.4 Setting Up SSH Key Authentication (Mandatory for Headless Production)

Password authentication is convenient but insecure for a production server. Set up key-based authentication immediately.

On your laptop, generate a key pair if you do not have one:

ssh-keygen -t ed25519 -C "my-laptop-to-dgx-spark" -f ~/.ssh/id_dgx_spark

Copy the public key to both DGX Spark systems:

ssh-copy-id -i ~/.ssh/id_dgx_spark.pub dgx@192.168.1.XXX   # DGX Spark #1
ssh-copy-id -i ~/.ssh/id_dgx_spark.pub dgx@192.168.1.YYY   # DGX Spark #2

Create a convenient ~/.ssh/config on your laptop so you can use short names:

# ~/.ssh/config on your laptop

Host spark1
    HostName 192.168.1.XXX          # Replace with actual IP of DGX Spark #1
    User dgx
    IdentityFile ~/.ssh/id_dgx_spark
    ServerAliveInterval 60
    ServerAliveCountMax 10
    Compression yes

Host spark2
    HostName 192.168.1.YYY          # Replace with actual IP of DGX Spark #2
    User dgx
    IdentityFile ~/.ssh/id_dgx_spark
    ServerAliveInterval 60
    ServerAliveCountMax 10
    Compression yes

After this, you can simply type ssh spark1 or ssh spark2 from your laptop.

0.5 Hardening SSH for Production

On both DGX Spark systems, edit the SSH server configuration:

sudo nano /etc/ssh/sshd_config

Set or verify these values:

# /etc/ssh/sshd_config — security hardening

PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
X11Forwarding no
MaxAuthTries 3
LoginGraceTime 20
ClientAliveInterval 60
ClientAliveCountMax 10

Restart SSH:

sudo systemctl restart ssh

Install fail2ban to block brute-force attempts:

sudo apt install -y fail2ban
sudo systemctl enable --now fail2ban

0.6 Accessing Remote Services via SSH Tunnels

When inference servers are running, you need to access their web UIs and APIs from your laptop. Use SSH port forwarding:

# Access vLLM API (port 8000) on spark1 from your laptop:
ssh -L 8000:localhost:8000 spark1 -N &

# Access Ray Dashboard (port 8265) on spark1 from your laptop:
ssh -L 8265:localhost:8265 spark1 -N &

# Access SGLang API (port 30000) on spark1 from your laptop:
ssh -L 30000:localhost:30000 spark1 -N &

# Access Open WebUI (port 3000) on spark1 from your laptop:
ssh -L 3000:localhost:3000 spark1 -N &

After running these commands, open your laptop's browser and navigate to http://localhost:8000, http://localhost:8265, etc.

Tip: Create a script ~/bin/tunnel-spark.sh on your laptop for convenience:

#!/usr/bin/env bash
# ~/bin/tunnel-spark.sh
# Opens all useful SSH tunnels to the DGX Spark cluster.
# Each tunnel runs in the background. Kill them with: pkill -f "ssh -L"

set -e

echo "Opening SSH tunnels to spark1..."

ssh -L 8000:localhost:8000 spark1 -N &
TUNNEL_PID_8000=$!

ssh -L 8265:localhost:8265 spark1 -N &
TUNNEL_PID_8265=$!

ssh -L 3000:localhost:3000 spark1 -N &
TUNNEL_PID_3000=$!

ssh -L 30000:localhost:30000 spark1 -N &
TUNNEL_PID_30000=$!

echo "Tunnels open on ports 8000, 8265, 3000, 30000."
echo "Press Ctrl+C to close all tunnels."

# Wait for any tunnel to exit (e.g., on Ctrl+C)
trap "kill $TUNNEL_PID_8000 $TUNNEL_PID_8265 $TUNNEL_PID_3000 $TUNNEL_PID_30000 2>/dev/null; echo 'Tunnels closed.'" EXIT
wait $TUNNEL_PID_8000 $TUNNEL_PID_8265 $TUNNEL_PID_3000 $TUNNEL_PID_30000

Make the script executable:

chmod +x ~/bin/tunnel-spark.sh

0.7 Keeping Sessions Alive with tmux

When running long commands over SSH (model downloads, engine builds), always use tmux so your work survives connection drops:

# Install tmux on both nodes:
sudo apt install -y tmux

# Start a named session:
tmux new -s cluster-setup

# Detach from session (leaves it running):
# Press Ctrl+B, then D

# Reattach to session after reconnecting:
tmux attach -t cluster-setup

PART ONE: UNDERSTANDING YOUR HARDWARE


CHAPTER 1: THE NVIDIA DGX SPARK — A MARVEL OF MINIATURIZATION

Before we connect anything or type a single command, we need to understand what we are working with. The DGX Spark is not simply a powerful gaming PC with a good graphics card. It is a fundamentally different kind of machine, and understanding its architecture is essential for making good decisions about how to configure and use it.

The DGX Spark was unveiled by NVIDIA at CES 2025 (January 2025, initially under the working name Project DIGITS), renamed DGX Spark at GTC in March 2025, and became publicly available in October 2025. It represents NVIDIA's first personal AI supercomputer. The machine is built around the NVIDIA GB10 Grace Blackwell Superchip, which is a single integrated package combining two distinct but deeply interconnected processing units: a Blackwell-generation GPU and a 20-core ARM-based Grace CPU. These two processors are not connected via a traditional PCIe bus as you would find in a conventional desktop workstation. Instead, they communicate through NVIDIA's proprietary NVLink-C2C interconnect, which provides extremely high bandwidth between the CPU and GPU — far exceeding what PCIe 5.0 x16 can deliver.

Why does this matter so much? Because LLM inference is fundamentally a memory-bandwidth-bound workload. When a language model generates text, it must constantly read enormous weight matrices from memory, perform matrix multiplications, and move the results around. The bottleneck is almost never raw compute throughput; it is how fast you can feed data to the compute units. The NVLink-C2C interconnect ensures that data movement within the system is as fast as physically possible.
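
To make the bottleneck concrete, here is a back-of-envelope sketch of the decode-speed ceiling implied by memory bandwidth alone. The ~273 GB/s figure is the DGX Spark's published memory bandwidth; the calculation deliberately ignores compute, KV cache traffic, and batching, so treat it as an upper bound rather than a benchmark:

MEM_BW_GB_S = 273   # DGX Spark LPDDR5x bandwidth (published spec)

def max_decode_tokens_per_s(params_billion: float, bytes_per_param: float) -> float:
    """Every decode step streams all weights once: tokens/s <= bandwidth / model size."""
    return MEM_BW_GB_S / (params_billion * bytes_per_param)

for label, bpp in [("BF16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"70B @ {label}: <= {max_decode_tokens_per_s(70, bpp):.1f} tokens/s per GPU")

# 70B @ BF16: <= 2.0 tokens/s, which is why quantization, batching, and
# speculative decoding matter so much on bandwidth-limited hardware.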

The Grace CPU is a 20-core ARM processor combining 10 Cortex-X925 performance cores with 10 Cortex-A725 efficiency cores. The CPU runs DGX OS, which is a customized Ubuntu Linux distribution optimized for AI workloads.

The system memory is 128 gigabytes of LPDDR5x unified memory shared between the CPU and GPU. The word "unified" is critically important. Unlike a traditional PC where the CPU has its own RAM and the GPU has its own VRAM, the DGX Spark has a single pool of high-bandwidth memory that both processors can access directly. This means that a 70-billion-parameter model, which might require 140 gigabytes of memory in FP16 precision, can be loaded into system memory and accessed by the GPU without any copying or transfer overhead.

The Blackwell GPU inside the GB10 delivers up to 1 petaFLOP of AI performance at FP4 precision. FP4 is a 4-bit floating point format that NVIDIA introduced with the Blackwell architecture, specifically designed for inference workloads where reduced numerical precision is acceptable in exchange for dramatically higher throughput and lower memory consumption.

Power: The GB10 SoC (GPU + CPU together) has a Thermal Design Power (TDP) of 140 W. The total system power — which includes the ConnectX-7 NIC, Wi-Fi, NVMe SSD, USB-C ports, and all other components — is 240 W, supplied by the included external power supply.

The physical form factor is almost comically small given its capabilities: 150 mm × 150 mm × 50.5 mm, roughly the size of a thick hardcover book. It weighs 1.2 kilograms. Storage is provided by a 1 TB or 4 TB NVMe M.2 SSD with self-encryption.

The rear panel features two QSFP56 network connectors driven by an integrated ConnectX-7 Smart NIC. The ConnectX-7 is NVIDIA's advanced network interface controller. In the DGX Spark, the two QSFP56 ports are internally connected to the GB10 SoC via two PCIe Gen5 x4 links, each providing approximately 100 gigabits per second of bandwidth. When you connect two DGX Spark systems directly with a single QSFP56 cable, you get a 200 gigabit per second direct link — the foundation of everything we build in this guide.

    +--------------------------------------------------+
    |              DGX SPARK INTERNALS                 |
    |                                                  |
    |  +------------------+    NVLink-C2C              |
    |  |  Grace CPU       |<=========================> |
    |  |  20-core ARM     |    High-bandwidth          |
    |  |  (10x X925 +     |    coherent interconnect   |
    |  |   10x A725)      |                            |
    |  +------------------+   +--------------------+   |
    |                         |  Blackwell GPU     |   |
    |  128 GB LPDDR5x         |  1 PetaFLOP FP4    |   |
    |  Unified Memory         |  5th Gen Tensors   |   |
    |  (shared CPU + GPU)     +--------------------+   |
    |                                                  |
    |  GB10 SoC TDP: 140 W                             |
    |  Total System Power: 240 W                       |
    |                                                  |
    |  ConnectX-7 Smart NIC                            |
    |  2x QSFP56 ports (2x PCIe Gen5 x4 = 2x ~100G)    |
    +--------------------------------------------------+

CHAPTER 2: THE CONNECTX-7 AND THE QSFP56 LINK — YOUR NEURAL HIGHWAY

The ConnectX-7 Smart NIC is the linchpin of the entire dual-node setup. Without a high-speed, low-latency interconnect, running a single model across two machines would be impractical because the communication overhead would overwhelm any benefit from having additional compute resources.

The ConnectX-7 is not a simple network card. It is a programmable data processing unit that can offload network processing tasks from the main CPU, implement RDMA (Remote Direct Memory Access) protocols, and participate in GPU-to-GPU communication patterns that bypass the CPU entirely.

RDMA (Remote Direct Memory Access): In traditional networking, when machine A sends data to machine B, the sequence involves multiple memory copies and CPU interruptions. With RDMA, machine A's network card writes data directly into machine B's memory without involving machine B's CPU at all. This dramatically reduces latency and CPU overhead — critical when two GPUs need to exchange tensor data hundreds of times per second during inference.

GPUDirect RDMA takes this further by allowing the network card to read data directly from GPU memory and write data directly into GPU memory, bypassing the CPU and system memory entirely. When two DGX Spark systems exchange intermediate tensor values during a distributed forward pass, GPUDirect RDMA ensures those tensors travel from GPU to GPU via the ConnectX-7 without ever touching the CPU.

The ConnectX-7 in the DGX Spark supports RoCE v2 (RDMA over Converged Ethernet version 2). RoCE v2 encapsulates RDMA traffic in standard UDP/IP packets, meaning it can traverse standard Ethernet infrastructure while still providing low-latency, high-throughput RDMA characteristics. This means you do not need specialized InfiniBand switches; a direct QSFP56 cable between two DGX Spark systems is sufficient.

Note on achieving full 200 Gbps: The DGX Spark connects its ConnectX-7 to the GB10 via two PCIe Gen5 x4 links, each providing ~100 Gbps. Achieving the full 200 Gbps requires that both links are properly utilized. In practice, careful configuration (correct MTU, PFC, and NCCL settings) is needed; some users have observed ~185–190 Gbps in real-world testing.

    +------------------+       QSFP56 Cable        +------------------+
    |   DGX SPARK #1   |                           |   DGX SPARK #2   |
    |   (Head Node)    |<=========================>|   (Worker Node)  |
    |   spark1         |    200 Gbps Direct Link   |   spark2         |
    |                  |    RoCE v2 / GPUDirect    |                  |
    |  ConnectX-7 NIC  |    RDMA                   |  ConnectX-7 NIC  |
    |  QSFP56 Port 0   |                           |  QSFP56 Port 0   |
    |  192.168.100.1   |                           |  192.168.100.2   |
    +------------------+                           +------------------+
    |  Blackwell GPU   |                           |  Blackwell GPU   |
    |  128 GB Unified  |                           |  128 GB Unified  |
    |  Memory          |                           |  Memory          |
    +------------------+                           +------------------+

    Combined: 256 GB unified memory total

PART TWO: THE SOFTWARE LANDSCAPE


CHAPTER 3: UNDERSTANDING LLM INFERENCE — WHAT ACTUALLY HAPPENS INSIDE

Before we can configure inference software intelligently, we need to understand what LLM inference actually does at a computational level.

A large language model is a function that takes a sequence of tokens as input and produces a probability distribution over the next token as output. Tokens are subword units, and a typical sentence might contain 10 to 30 tokens. When you ask a language model a question:

  1. Your question is tokenized — converted from text into integer IDs indexing into the model's vocabulary (typically 32,000 to 128,000 tokens).
  2. These IDs pass through an embedding layer converting each integer into a dense vector (typically 4,096 to 16,384 dimensions for large models).
  3. These vectors pass through a series of transformer layers, each applying attention mechanisms and feed-forward networks. A large model might have 80 to 128 such layers.
  4. The output of the final layer passes through a linear projection and softmax to produce a probability distribution over the vocabulary.
  5. A token is sampled from this distribution (greedy, top-k, top-p, etc.).
  6. This new token is appended to the sequence, and the entire process repeats until an end-of-sequence token or maximum length is reached.

This is called autoregressive generation, and it has a fundamental characteristic: each token depends on all previous tokens, so generation is inherently sequential.
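
The six steps can be sketched as a minimal Python loop. This is purely illustrative: model here is a stand-in callable returning random logits, not a real transformer.

import numpy as np

VOCAB, EOS = 32_000, 2        # illustrative vocabulary size and end-of-sequence id

def model(token_ids: list[int]) -> np.ndarray:
    """Stand-in for a real transformer forward pass: logits over the vocabulary."""
    rng = np.random.default_rng(hash(tuple(token_ids)) % 2**32)
    return rng.normal(size=VOCAB)

def generate(prompt_ids: list[int], max_new_tokens: int = 8) -> list[int]:
    seq = list(prompt_ids)                  # steps 1-2: tokenized prompt
    for _ in range(max_new_tokens):
        logits = model(seq)                 # steps 3-4: full forward pass
        next_id = int(np.argmax(logits))    # step 5: greedy sampling
        seq.append(next_id)                 # step 6: append and repeat
        if next_id == EOS:                  # stop on end-of-sequence
            break
    return seq

print(generate([1, 15, 42]))

Note the key property: the loop calls model once per generated token, which is the sequential dependency discussed above.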

There is an important distinction between two phases:

  • Prefill phase: Processes your entire input prompt in a single forward pass. All input tokens are known in advance, so the GPU can process them in parallel, achieving high utilization.
  • Decode phase: Generates one token at a time. Each step requires a full forward pass through the model. The GPU is doing relatively little computation per step but must still load the entire model's weights from memory. This makes the decode phase extremely memory-bandwidth-bound.

The attention mechanism within each transformer layer is particularly important because it is the source of most memory management complexity. During the decode phase, the attention mechanism needs to compare the current token against all previous tokens. To avoid recomputing the key and value vectors for all previous tokens at every step, these vectors are cached in the KV cache (Key-Value cache). The KV cache grows linearly with sequence length and can consume enormous amounts of memory — managing it efficiently is one of the central challenges of LLM serving.


CHAPTER 4: PARALLELISM STRATEGIES — HOW TO SPLIT A MODEL ACROSS MACHINES

When a model is too large to fit on a single GPU, or when you want to improve throughput by using multiple GPUs, you need to distribute the model across multiple devices. There are several fundamentally different ways to do this.

Tensor Parallelism (TP)

Tensor parallelism splits the weight matrices within each transformer layer across multiple devices. Each device holds a fraction of each layer's weights and performs a fraction of each layer's computation.

Consider the feed-forward network within a transformer layer. In a 70B model, these matrices might be 8,192 × 28,672 elements. With TP across 2 GPUs, each GPU holds an 8,192 × 14,336 slice. Each GPU performs its portion of the matrix multiplication independently, and then the results are combined using an all-reduce collective operation before proceeding to the next layer.

Advantage: Distributes both memory requirements and compute requirements evenly. Disadvantage: Requires frequent all-reduce operations at every transformer layer, which are bandwidth-intensive and latency-sensitive. Works best with high-bandwidth, low-latency interconnects — ideally NVLink within a single node.

Important for dual DGX Spark: Each DGX Spark has exactly one GPU. Tensor parallelism is designed to split a model across multiple GPUs within a node using NVLink. When applied across two separate nodes (each with one GPU), the all-reduce must traverse the inter-node network link on every single transformer layer. This is functional but introduces more latency than intra-node NVLink TP. See Chapter 4's recommendation table below.

    LAYER N FEED-FORWARD NETWORK — TENSOR PARALLELISM

    Node 1 (spark1):               Node 2 (spark2):
    +------------------+           +------------------+
    | Weight slice 1   |           | Weight slice 2   |
    | (cols 0-14335)   |           | (14336-28671)    |
    |                  |           |                  |
    | Partial result 1 |           | Partial result 2 |
    +------------------+           +------------------+
              |                              |
              +----------All-Reduce----------+
                         (sum results)
                              |
                    +------------------+
                    |  Final result    |
                    |  (full output)   |
                    +------------------+

    This all-reduce happens at EVERY layer, EVERY token step.
    High bandwidth between nodes is critical.
    For single-GPU-per-node setups, PP is generally preferred.
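
A small NumPy simulation makes the all-reduce concrete. This is a toy sketch, not framework code: it shards the weight matrix along its reduction dimension across two simulated "devices" (the row-parallel half of Megatron-style TP, the variant that needs the sum shown above) and verifies that summing the partial products reproduces the full matmul:

import numpy as np

# Two "GPUs" simulated as array slices. W is split along its input
# (reduction) dimension; each device computes a partial product, and the
# all-reduce is an element-wise sum. Toy shapes stand in for the
# 8,192 x 28,672 FFN matrices discussed above.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # activations: batch x hidden
W = rng.normal(size=(8, 16))       # full weight matrix

x1, x2 = x[:, :4], x[:, 4:]        # each device gets half the hidden dimension
W1, W2 = W[:4, :], W[4:, :]

partial1 = x1 @ W1                 # computed on "spark1"
partial2 = x2 @ W2                 # computed on "spark2"

y = partial1 + partial2            # the all-reduce (sum) across nodes
assert np.allclose(y, x @ W)       # identical to the unsharded result
print("all-reduce result matches the full matmul")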

Pipeline Parallelism (PP)

Pipeline parallelism assigns different layers to different devices. In a model with 126 transformer layers running across 2 nodes, node 1 handles layers 1–63 and node 2 handles layers 64–126.

Advantage: Requires much less inter-node communication — only the output activations of one stage are passed to the next, once per token per forward pass. Much more tolerant of higher-latency interconnects and is the recommended strategy for single-GPU-per-node setups. Disadvantage: Pipeline bubbles — some stages are idle waiting for data. Various techniques like micro-batching reduce but cannot eliminate these bubbles.

Choosing the Right Strategy for Dual DGX Spark

Since each DGX Spark has exactly one Blackwell GPU, the recommended parallelism strategy is:

Strategy            --tensor-parallel-size   --pipeline-parallel-size   All-Reduce per Layer            Recommended For
Pipeline Parallel   1                        2                          No — only at stage boundary     Primary recommendation for 1 GPU/node
Tensor Parallel     2                        1                          Yes — every layer, cross-node   Only if PP is not supported by the framework for your use case
Combined            1                        2                          No                              Same as PP for single-GPU nodes

Why PP is recommended for single-GPU-per-node: With PP=2, only the activations at the layer boundary cross the network once per forward pass. With TP=2 across nodes, an all-reduce must cross the network at every transformer layer (80+ times for a 70B model per token step). Even with a 200 Gbps link, this per-layer overhead accumulates and increases latency compared to PP.
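
A rough traffic estimate shows how large the gap is. The constants below are illustrative assumptions (hidden size 8,192, 80 layers, BF16 activations, roughly two all-reduces per layer in the Megatron pattern); real traffic also depends on batch size and framework details:

hidden, layers, bytes_per_act = 8192, 80, 2    # assumed 70B-class dimensions

pp_bytes = hidden * bytes_per_act              # PP: one activation handoff per token
tp_bytes = layers * 2 * hidden * bytes_per_act # TP: ~2 all-reduces per layer per token

print(f"PP: {pp_bytes / 1024:.0f} KB/token;  TP: {tp_bytes / 1024 / 1024:.1f} MB/token "
      f"({tp_bytes // pp_bytes}x more cross-node traffic)")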


PART THREE: PHYSICAL SETUP AND NETWORK CONFIGURATION


CHAPTER 5: RENAMING YOUR DGX SPARK WORKSTATIONS

Before configuring networking, rename both machines to meaningful, consistent hostnames. This makes every subsequent command, configuration file, and log message readable. We will name them spark1 and spark2.

Why do this first? Hostname changes affect SSH known_hosts entries, /etc/hosts files, and Ray cluster configuration. Doing it before any other setup avoids having to undo configurations later.

5.1 Rename DGX Spark #1 to spark1

SSH into the first machine (using its current mDNS name or IP):

ssh dgx@dgx-spark-XXXX.local

Set the hostname permanently using hostnamectl:

sudo hostnamectl set-hostname spark1

Verify the change:

hostnamectl
# Should show: Static hostname: spark1

Update /etc/hosts to reflect the new hostname (replace the old hostname entry):

sudo nano /etc/hosts

Find the line that reads 127.0.1.1 dgx-spark-XXXX and change it to:

127.0.1.1   spark1

The file should look like this when done:

127.0.0.1   localhost
127.0.1.1   spark1
::1         localhost ip6-localhost ip6-loopback
ff02::1     ip6-allnodes
ff02::2     ip6-allrouters

Apply the hostname change to the current session without rebooting:

exec bash
# Your prompt should now show: dgx@spark1:~$

5.2 Rename DGX Spark #2 to spark2

SSH into the second machine and repeat the process:

ssh dgx@dgx-spark-YYYY.local

sudo hostnamectl set-hostname spark2

sudo nano /etc/hosts
# Change: 127.0.1.1   dgx-spark-YYYY
# To:     127.0.1.1   spark2

exec bash
# Your prompt should now show: dgx@spark2:~$

5.3 Update Your Laptop's SSH Config

Now that the machines have proper hostnames, update ~/.ssh/config on your laptop to use the new hostnames in the Host aliases (the HostName field still uses the IP address until we configure static IPs in the next chapter):

Host spark1
    HostName 192.168.1.XXX
    User dgx
    IdentityFile ~/.ssh/id_dgx_spark
    ServerAliveInterval 60
    ServerAliveCountMax 10

Host spark2
    HostName 192.168.1.YYY
    User dgx
    IdentityFile ~/.ssh/id_dgx_spark
    ServerAliveInterval 60
    ServerAliveCountMax 10

5.4 Reboot Both Machines

A reboot ensures the hostname change is fully propagated to all system services:

# On spark1:
sudo reboot

# On spark2:
sudo reboot

Wait 60–90 seconds, then reconnect:

ssh spark1
ssh spark2

CHAPTER 6: STEP-BY-STEP PHYSICAL SETUP AND NETWORK CONFIGURATION

Step 1: Physical Cable Connection

Connect one end of the QSFP56 cable to port 0 of the ConnectX-7 NIC on spark1, and the other end to port 0 of the ConnectX-7 NIC on spark2. The ports are on the rear panel. You should hear or feel a click when the cable is properly seated.

Cable types:

  • Direct Attach Copper (DAC): Passive copper cable, suitable for distances up to ~5 meters. Less expensive, no power consumption. Best for same-desk or same-rack setups.
  • Active Optical Cable (AOC): Uses fiber optics, suitable for distances up to 100 meters or more. Required if the machines are in different racks or rooms.

Step 2: Verifying Link Status

After connecting the cable, verify the link is up. SSH into spark1:

# List all network interfaces:
ip link show

# Check RDMA adapter status (requires MLNX_OFED, pre-installed on DGX OS):
ibstat

Look for an interface with a name like enp1s0f0np0 or similar. In the ibstat output, you should see:

  • State: Active
  • Physical state: LinkUp
  • Rate: 200 (Gb/s)

Finding your exact interface name: Run ip link show and look for the interface that is NOT lo (loopback) and NOT your management Ethernet. It will have a name like enp1s0f0np0. The corresponding RDMA device name (used in NCCL and ibstat) will be similar but prefixed differently, e.g., rocep1s0f0. Confirm with: ls /sys/class/infiniband/

Step 3: Configuring Network Interfaces with Static IP Addresses

We assign static IP addresses to the QSFP56 interfaces using the 192.168.100.0/24 subnet. This subnet is dedicated to inter-node communication and is separate from your management network.

On spark1, create the Netplan configuration:

sudo nano /etc/netplan/60-dgx-interconnect.yaml
# /etc/netplan/60-dgx-interconnect.yaml — spark1
network:
  version: 2
  ethernets:
    enp1s0f0np0:          # Replace with your actual interface name
      addresses:
        - 192.168.100.1/24
      mtu: 9000
      optional: true      # Prevents boot delay if cable is temporarily disconnected

On spark2, create the same file with a different address:

sudo nano /etc/netplan/60-dgx-interconnect.yaml
# /etc/netplan/60-dgx-interconnect.yaml — spark2
network:
  version: 2
  ethernets:
    enp1s0f0np0:          # Replace with your actual interface name
      addresses:
        - 192.168.100.2/24
      mtu: 9000
      optional: true

Set correct permissions (Netplan requires strict permissions):

sudo chmod 600 /etc/netplan/60-dgx-interconnect.yaml

Apply the configuration on both nodes:

sudo netplan try    # Test for 120 seconds, auto-reverts if you don't confirm
sudo netplan apply  # Apply permanently

Why MTU 9000? The standard Ethernet MTU is 1500 bytes. "Jumbo frames" with MTU 9000 reduce the CPU overhead of processing many small packets and improve throughput for large data transfers — exactly what we need when transferring tensor data between GPUs. Both endpoints must have the same MTU.
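
To confirm jumbo frames work end-to-end (not just on one side), send a don't-fragment ping sized to exactly fill a 9,000-byte frame: 9000 minus 20 bytes of IP header and 8 bytes of ICMP header leaves an 8,972-byte payload. Below is a small Python wrapper around the standard Linux ping; the interface and peer address are the values configured above, so adjust them when running on spark2:

#!/usr/bin/env python3
"""check_jumbo.py — verify that MTU 9000 is active and usable end-to-end."""

import pathlib
import subprocess
import sys

IFACE = "enp1s0f0np0"     # replace with your actual interface name
PEER = "192.168.100.2"    # the other node's interconnect address

mtu = int(pathlib.Path(f"/sys/class/net/{IFACE}/mtu").read_text())
print(f"{IFACE} local MTU: {mtu}")

# -M do sets the don't-fragment bit, so an oversized frame fails loudly
# instead of being silently fragmented. 8972 = 9000 - 20 (IP) - 8 (ICMP).
result = subprocess.run(
    ["ping", "-c", "3", "-M", "do", "-s", "8972", PEER],
    capture_output=True, text=True,
)
print(result.stdout)
if result.returncode != 0:
    sys.exit("Jumbo-frame ping failed; check the MTU on BOTH ends")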

Step 4: Enabling RoCE v2 and Configuring Priority Flow Control

RoCE v2 requires a lossless Ethernet network. Priority Flow Control (PFC) allows a receiver to signal to a sender to pause transmission temporarily when its buffers are filling up, preventing packet loss.

Configure PFC on both nodes:

# Enable PFC on traffic class 3 (standard class for RoCE traffic):
sudo mlnx_qos -i enp1s0f0np0 --pfc 0,0,0,1,0,0,0,0

Make PFC persistent across reboots by creating a systemd service:

sudo nano /etc/systemd/system/roce-pfc.service
# /etc/systemd/system/roce-pfc.service
[Unit]
Description=Configure RoCE Priority Flow Control
After=network.target
Wants=network.target

[Service]
Type=oneshot
ExecStart=/usr/bin/mlnx_qos -i enp1s0f0np0 --pfc 0,0,0,1,0,0,0,0
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now roce-pfc.service

Enable the nvidia-peermem kernel module (required for GPUDirect RDMA):

sudo modprobe nvidia-peermem

Make it persistent:

echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf

Step 5: Verifying Connectivity and Bandwidth

Test basic connectivity:

# From spark1, ping spark2:
ping -c 4 192.168.100.2
# Expected: round-trip times of 0.05 to 0.2 ms

Test RDMA bandwidth using ib_send_bw from the perftest package:

# Install perftest if not already present:
sudo apt install -y perftest

# On spark2 (server side) — run first:
ib_send_bw -d rocep1s0f0 -i 1 -F --report_gbits

# On spark1 (client side) — run second:
ib_send_bw -d rocep1s0f0 -i 1 -F --report_gbits 192.168.100.2

Expected result: bandwidth approaching 185–200 Gbps. If you see significantly lower numbers, check MTU configuration and PFC settings.

Finding your RDMA device name: Run ibstat and look for the CA (Channel Adapter) name. Alternatively: ls /sys/class/infiniband/ — the device listed there is your RDMA device name.

Step 6: Configuring Passwordless SSH Between Nodes

Many distributed tools require passwordless SSH between the nodes themselves (not just from your laptop). Set this up on both nodes:

On spark1:

ssh-keygen -t ed25519 -C "spark1-to-spark2" -f ~/.ssh/id_spark_cluster -N ""
ssh-copy-id -i ~/.ssh/id_spark_cluster.pub dgx@192.168.100.2

On spark2:

ssh-keygen -t ed25519 -C "spark2-to-spark1" -f ~/.ssh/id_spark_cluster -N ""
ssh-copy-id -i ~/.ssh/id_spark_cluster.pub dgx@192.168.100.1

Test both directions:

# From spark1:
ssh dgx@192.168.100.2 hostname   # Should print: spark2

# From spark2:
ssh dgx@192.168.100.1 hostname   # Should print: spark1

Step 7: Configuring /etc/hosts for Reliable Name Resolution

Add entries to /etc/hosts on both nodes:

sudo nano /etc/hosts

Add these lines (in addition to the existing entries):

# DGX Spark Cluster — Inter-Node Network
192.168.100.1   spark1
192.168.100.2   spark2

Step 8: Configuring Firewall Rules

Open the necessary ports on both nodes:

# Allow all traffic on the inter-node subnet (safe for a direct point-to-point link):
sudo ufw allow from 192.168.100.0/24 to any comment "DGX Spark inter-node"

# If you want more granular control, open specific ports:
# vLLM API server:
sudo ufw allow 8000/tcp comment "vLLM API"
# Ray GCS (Global Control Store):
sudo ufw allow 6379/tcp comment "Ray GCS"
# Ray Dashboard:
sudo ufw allow 8265/tcp comment "Ray Dashboard"
# SGLang API server:
sudo ufw allow 30000/tcp comment "SGLang API"
# SGLang distributed init:
sudo ufw allow 20000/tcp comment "SGLang dist-init"
# PyTorch distributed:
sudo ufw allow 29500/tcp comment "PyTorch distributed"
# TensorRT-LLM / trtllm-serve:
sudo ufw allow 8001/tcp comment "trtllm-serve HTTP"
# Triton gRPC (if using Triton):
sudo ufw allow 8002/tcp comment "Triton gRPC"

sudo ufw reload

Step 9: Downloading Models (Headless)

Models must be downloaded to both nodes. Use the Hugging Face CLI:

# Install on both nodes:
pip install "huggingface_hub[cli]"

# Authenticate (requires a Hugging Face account and access token):
huggingface-cli login
# Enter your HF token when prompted

# Download a model to /models/ (run on BOTH nodes):
sudo mkdir -p /models
sudo chown dgx:dgx /models

# Example: Llama 3.1 70B Instruct
huggingface-cli download \
    meta-llama/Llama-3.1-70B-Instruct \
    --local-dir /models/llama-3.1-70b-instruct \
    --local-dir-use-symlinks False

# Example: Llama 3.1 8B (for speculative decoding draft model)
huggingface-cli download \
    meta-llama/Llama-3.1-8B-Instruct \
    --local-dir /models/llama-3.1-8b-instruct \
    --local-dir-use-symlinks False

Storage tip: The 70B model in BF16 is approximately 140 GB. The 4 TB NVMe SSD option is strongly recommended if you plan to store multiple models. Use df -h /models to check available space.

Syncing models between nodes: If you download on spark1 and want to copy to spark2 over the high-speed link:

rsync -avP --progress \
    /models/llama-3.1-70b-instruct/ \
    dgx@192.168.100.2:/models/llama-3.1-70b-instruct/

PART FOUR: INFERENCE FRAMEWORK DEEP DIVES


CHAPTER 7: TensorRT-LLM — NVIDIA'S OPTIMIZED INFERENCE ENGINE

TensorRT-LLM is NVIDIA's flagship inference optimization library. Unlike vLLM and SGLang, which work with standard PyTorch model weights, TensorRT-LLM requires a compilation step that converts your model into a highly optimized TensorRT engine. This compilation step takes time (minutes to hours depending on model size), but the resulting engine is significantly faster than what you can achieve with a general-purpose framework.

TensorRT-LLM performs a comprehensive set of optimizations during compilation: it fuses multiple operations into single GPU kernels, selects the optimal algorithm for each matrix multiplication, applies quantization to reduce memory bandwidth requirements, and generates code specifically optimized for the Blackwell architecture with its fifth-generation Tensor Cores and FP4 support.

Installing TensorRT-LLM on DGX Spark

The recommended installation method is NVIDIA's pre-built Docker containers from the NGC registry. These containers include TensorRT-LLM along with all dependencies, pre-compiled for the DGX Spark's ARM64 architecture and Blackwell GPU.

Verify Docker and the NVIDIA Container Toolkit are installed (pre-installed on DGX OS):

docker --version
nvidia-smi

Pull the TensorRT-LLM container:

docker pull nvcr.io/nvidia/tensorrt-llm/release:latest

Verify it works:

docker run --rm --gpus all \
    nvcr.io/nvidia/tensorrt-llm/release:latest \
    nvidia-smi

Understanding the TensorRT-LLM Workflow

The TensorRT-LLM workflow has three distinct phases:

  1. Model conversion: Convert a Hugging Face checkpoint to TensorRT-LLM's internal checkpoint format.
  2. Engine building: Compile the checkpoint into an optimized TensorRT engine (GPU-architecture-specific).
  3. Inference serving: Load the compiled engine and serve inference requests via trtllm-serve.

Converting a Model Checkpoint (Single Node, 70B)

Start a Docker container with access to your model files:

docker run -it --gpus all \
    --network host \
    -v /models:/models \
    -v /output:/output \
    nvcr.io/nvidia/tensorrt-llm/release:latest \
    bash

Inside the container, navigate to the Llama conversion scripts:

cd /app/tensorrt_llm/examples/llama

Convert the Hugging Face checkpoint to TensorRT-LLM format with FP8 quantization:

python3 convert_checkpoint.py \
    --model_dir /models/llama-3.1-70b-instruct \
    --output_dir /output/llama-70b-trtllm-checkpoint \
    --dtype bfloat16 \
    --use_fp8 \
    --tp_size 1 \
    --pp_size 1

Building the TensorRT Engine (Single Node, 70B)

trtllm-build \
    --checkpoint_dir /output/llama-70b-trtllm-checkpoint \
    --output_dir /output/llama-70b-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 8 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --enable_chunked_context

Flag explanations:

  • --gemm_plugin bfloat16: Enables TensorRT-LLM's custom GEMM plugin with optimized matrix multiplication kernels.
  • --max_batch_size 8: Maximum concurrent requests. Larger values improve throughput but require more memory.
  • --max_input_len 4096 / --max_output_len 2048: Maximum sequence lengths. Larger values allow longer conversations but increase memory requirements.
  • --paged_kv_cache enable: Enables paged KV cache management (see Chapter 10).
  • --use_paged_context_fmha enable: Enables paged context Flash Multi-Head Attention.
  • --enable_chunked_context: Improves memory efficiency for long input sequences.

Serving with trtllm-serve (Single Node, 70B)

The trtllm-serve CLI is the standard serving interface for TensorRT-LLM engines. The model path (or Hugging Face name) is a positional argument, followed by --engine_dir pointing to the pre-built engine:

trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
    --engine_dir /output/llama-70b-engine \
    --tp_size 1 \
    --pp_size 1 \
    --gpus_per_node 1 \
    --max_batch_size 8 \
    --max_num_tokens 4096 \
    --port 8000 \
    --log_level info

Multi-Node Configuration with TensorRT-LLM (405B, Two Nodes)

For a 405B model across two DGX Spark nodes, use pipeline parallelism (PP=2). Each node handles approximately half the transformer layers.

Memory check for 405B:

Precision    Weight Memory   Fits in 256 GB total?
FP16         ~810 GB         No
FP8          ~405 GB         No
INT4 / FP4   ~203 GB         Yes (~53 GB headroom for KV cache)

For two DGX Spark nodes, FP4 or INT4 quantization is required for the 405B model. FP8 alone is insufficient for the combined 256 GB unified memory.

Convert the 405B checkpoint with pipeline parallelism and FP8 (as a base; use FP4 if your TRT-LLM version supports --use_fp4):

python3 convert_checkpoint.py \
    --model_dir /models/llama-3.1-405b \
    --output_dir /output/llama-405b-trtllm-checkpoint \
    --dtype bfloat16 \
    --use_fp8 \
    --tp_size 1 \
    --pp_size 2

Note on FP4: For the 405B model to fit in 256 GB combined, FP4 or INT4 quantization is required. Use --use_fp4 if supported in your TensorRT-LLM container version. Check the release notes for your specific container. FP8 alone (~405 GB) does not fit.

Build the engine:

trtllm-build \
    --checkpoint_dir /output/llama-405b-trtllm-checkpoint \
    --output_dir /output/llama-405b-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 4 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable

Create an MPI hostfile (OpenMPI format; requires OpenMPI to be installed):

# Install OpenMPI if not present:
sudo apt install -y openmpi-bin openmpi-common libopenmpi-dev

# Create the hostfile:
cat <<EOF | sudo tee /etc/mpi/hostfile
192.168.100.1 slots=1
192.168.100.2 slots=1
EOF

Launch the inference server across both nodes using mpirun and trtllm-serve:

# The model path is a positional argument to trtllm-serve.
# --engine_dir points to the pre-built engine directory.
# --pp_size 2 matches the pp_size used during engine build.
# --gpus_per_node 1 matches the DGX Spark (1 GPU per node).

mpirun -n 2 \
    --hostfile /etc/mpi/hostfile \
    --allow-run-as-root \
    --mca btl_tcp_if_include enp1s0f0np0 \
    --mca oob_tcp_if_include enp1s0f0np0 \
    -x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    -x NCCL_IB_HCA=rocep1s0f0 \
    -x NCCL_IB_ROCE_VERSION_NUM=2 \
    -x NCCL_NET_GDR_LEVEL=5 \
    -x NCCL_NET_GDR_READ=1 \
    trtllm-serve meta-llama/Llama-3.1-405B-Instruct \
        --engine_dir /output/llama-405b-engine \
        --tp_size 1 \
        --pp_size 2 \
        --gpus_per_node 1 \
        --max_batch_size 4 \
        --max_num_tokens 2048 \
        --port 8000 \
        --log_level info

Slurm-based multi-node deployment example (for HPC clusters):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --container-image=nvcr.io/nvidia/tensorrt-llm/release:latest
#SBATCH --container-mounts=/models:/models,/output:/output
#SBATCH --container-workdir /workspace

# trtllm-llmapi-launch is a wrapper provided by TensorRT-LLM that
# handles MPI rank assignment and distributed initialization automatically.
srun bash -c "trtllm-llmapi-launch trtllm-serve meta-llama/Llama-3.1-405B-Instruct \
    --engine_dir /output/llama-405b-engine \
    --tp_size 1 \
    --pp_size 2 \
    --gpus_per_node 1 \
    --max_batch_size 4 \
    --max_num_tokens 2048 \
    --port 8000"

Test the TensorRT-LLM server:

# From your laptop (with SSH tunnel open on port 8000):
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-405B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain quantum entanglement in simple terms."}
        ],
        "max_tokens": 500,
        "temperature": 0.7
    }'

CHAPTER 8: vLLM — THE VERSATILE WORKHORSE

vLLM is the most widely used open-source LLM inference framework as of 2025. It combines excellent performance with broad model compatibility, a clean OpenAI-compatible API, and an active development community. Unlike TensorRT-LLM, vLLM does not require a compilation step — you can load a Hugging Face model directly and start serving inference requests within minutes.

The key innovations that make vLLM fast are PagedAttention (its implementation of paged KV cache management) and continuous batching. vLLM was one of the first frameworks to implement these techniques together, and they remain central to its performance advantage.

Setting Up the vLLM Ray Cluster

vLLM uses Ray as its distributed execution backend for multi-node inference. Ray manages worker processes on each node and coordinates distributed tensor operations.

Install Ray and vLLM on both nodes:

# Create a virtual environment on both nodes:
python3 -m venv /opt/vllm-env
source /opt/vllm-env/bin/activate

# Install Ray and vLLM:
pip install "ray[default]" vllm

# Add to shell profile for convenience:
echo 'source /opt/vllm-env/bin/activate' >> ~/.bashrc

Set NCCL environment variables on both nodes:

Create /etc/profile.d/nccl-spark.sh on both nodes:

sudo nano /etc/profile.d/nccl-spark.sh
# /etc/profile.d/nccl-spark.sh — NCCL configuration for DGX Spark cluster
# Loaded for all login shells on both nodes

export NCCL_SOCKET_IFNAME=enp1s0f0np0
export GLOO_SOCKET_IFNAME=enp1s0f0np0
export NCCL_IB_HCA=rocep1s0f0
export NCCL_IB_ROCE_VERSION_NUM=2
export NCCL_NET_GDR_LEVEL=5
export NCCL_NET_GDR_READ=1
export NCCL_IB_DISABLE=0
# Note: NCCL_IB_GID_INDEX is intentionally NOT set here.
# NCCL >= 2.21 selects the GID index dynamically.
# If you experience NCCL errors on older versions, run 'show_gids' to find
# the correct index for your RoCE v2 interface and set it explicitly:
# export NCCL_IB_GID_INDEX=<your_index>
sudo chmod +x /etc/profile.d/nccl-spark.sh
source /etc/profile.d/nccl-spark.sh

Start the Ray head node on spark1:

# On spark1:
# --block keeps the process in the foreground; & backgrounds it so the
# terminal remains usable. Use tmux to keep this running after disconnect.
source /opt/vllm-env/bin/activate
ray start --head \
    --node-ip-address=192.168.100.1 \
    --port=6379 \
    --dashboard-host=0.0.0.0 \
    --dashboard-port=8265 \
    --block &

Connect spark2 as a worker node:

# On spark2:
source /opt/vllm-env/bin/activate
ray start \
    --address=192.168.100.1:6379 \
    --node-ip-address=192.168.100.2 \
    --block &

Verify the cluster:

# On spark1:
ray status

Expected output:

======== Cluster status: 2025-10-16 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node(s) with resources: {'GPU': 1.0, 'CPU': 20.0, 'memory': ...}
 1 node(s) with resources: {'GPU': 1.0, 'CPU': 20.0, 'memory': ...}

Access the Ray Dashboard from your laptop (with SSH tunnel open): http://localhost:8265

Launching vLLM for Multi-Node Inference

For dual DGX Spark systems with one GPU per node, the recommended parallelism strategy is pipeline parallelism (--pipeline-parallel-size 2, --tensor-parallel-size 1). This sends activations across the network only once per forward pass at the layer boundary, rather than performing an all-reduce at every transformer layer.

For a 70B model using pipeline parallelism (PP=2, recommended for 1 GPU/node). Note that some vLLM versions do not support speculative decoding together with pipeline parallelism; if the server refuses to start, drop the --speculative-model and --num-speculative-tokens flags:

# On spark1 (after Ray cluster is running):
source /opt/vllm-env/bin/activate
python3 -m vllm.entrypoints.openai.api_server \
    --model /models/llama-3.1-70b-instruct \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --speculative-model /models/llama-3.1-8b-instruct \
    --num-speculative-tokens 5 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --disable-log-requests

For a 405B model using pipeline parallelism (PP=2, FP4 required):

python3 -m vllm.entrypoints.openai.api_server \
    --model /models/llama-3.1-405b \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --quantization fp4 \
    --enable-prefix-caching \
    --max-num-seqs 4 \
    --disable-log-requests

Parallelism flag reference for dual DGX Spark (1 GPU per node):

Use Case                --tensor-parallel-size   --pipeline-parallel-size   Notes
70B — recommended       1                        2                          PP: one activation transfer per forward pass
405B — only option      1                        2                          FP4/INT4 quantization required
70B — TP across nodes   2                        1                          All-reduce every layer; higher inter-node traffic; not recommended for 1 GPU/node

Note on TP=2 across nodes: Setting --tensor-parallel-size 2 with one GPU per node is supported by vLLM via Ray, but requires an all-reduce collective across the inter-node network at every transformer layer (80+ times per token for a 70B model). For single-GPU-per-node setups, pipeline parallelism is the recommended approach as it requires only one activation transfer per forward pass.

Testing vLLM

# From your laptop (with SSH tunnel open on port 8000):
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-70b-instruct",
        "messages": [
            {"role": "user", "content": "Write a Python function to compute the Fibonacci sequence."}
        ],
        "max_tokens": 500,
        "temperature": 0.1
    }'
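
The same endpoint can be exercised from Python with the openai SDK (assuming pip install openai on your laptop). The API key is a placeholder, since vLLM does not require one by default, and the model name must match the path passed to --model:

from openai import OpenAI

# Points at the SSH tunnel opened earlier; vLLM ignores the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="/models/llama-3.1-70b-instruct",   # must match the served --model path
    messages=[{"role": "user", "content": "Write a haiku about unified memory."}],
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].message.content)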

Connecting Open WebUI to vLLM (Headless Browser Access)

Run Open WebUI on spark1 (accessible via SSH tunnel from your laptop):

docker run -d \
    --name open-webui \
    --restart unless-stopped \
    -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://192.168.100.1:8000/v1 \
    -e OPENAI_API_KEY=not-needed \
    ghcr.io/open-webui/open-webui:main

From your laptop, open the SSH tunnel for port 3000:

ssh -L 3000:localhost:3000 spark1 -N &

Then navigate to http://localhost:3000 in your laptop's browser.


CHAPTER 9: SGLang — THE STRUCTURED GENERATION SPECIALIST

SGLang (Structured Generation Language) is a newer but rapidly maturing inference framework that excels in scenarios involving structured outputs, tool calling, and complex multi-turn interactions. If TensorRT-LLM is the performance-maximizing specialist and vLLM is the versatile workhorse, SGLang is the intelligent orchestrator that excels at complex reasoning and agentic workflows.

The core innovation in SGLang is RadixAttention, a KV cache management system that uses a radix tree data structure to automatically identify and reuse shared prefixes across requests. For agentic applications where the system prompt is long and consistent, this can reduce prefill time by 80% or more.

SGLang also implements zero-overhead scheduling, where the CPU prepares the next batch of requests while the GPU is still processing the current batch. This overlapping ensures the GPU is never idle waiting for the CPU.
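
Returning to RadixAttention: SGLang's radix tree is more sophisticated than this, but a toy prefix tree conveys the core idea, namely that KV entries for shared prompt prefixes are computed once and reused. This is conceptual Python, not SGLang's implementation:

class PrefixNode:
    """One token edge in a toy prefix tree; stands in for cached KV state."""
    def __init__(self):
        self.children: dict[int, "PrefixNode"] = {}

class PrefixCache:
    def __init__(self):
        self.root = PrefixNode()

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens were already cached; cache the rest."""
        node, reused = self.root, 0
        for t in tokens:
            if t in node.children:
                node, reused = node.children[t], reused + 1
            else:
                child = PrefixNode()
                node.children[t] = child
                node = child
        return reused

cache = PrefixCache()
system_prompt = list(range(500))              # a 500-token system prompt
print(cache.match_and_insert(system_prompt + [900, 901]))   # 0 (cold cache)
print(cache.match_and_insert(system_prompt + [950, 951]))   # 500 (prefix reused)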

Installing and Configuring SGLang for Multi-Node Inference

Pull the official SGLang Docker image on both nodes:

docker pull lmsysorg/sglang:latest

Start the SGLang server on spark1 (node rank 0 — head node):

docker run -d \
    --name sglang-head \
    --restart unless-stopped \
    --gpus all \
    --network host \
    --shm-size 16g \
    --privileged \
    -v /models:/models \
    -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    -e NCCL_IB_HCA=rocep1s0f0 \
    -e NCCL_IB_ROCE_VERSION_NUM=2 \
    -e NCCL_NET_GDR_LEVEL=5 \
    -e NCCL_NET_GDR_READ=1 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path /models/llama-3.1-70b-instruct \
        --tp 2 \
        --dist-init-addr 192.168.100.1:20000 \
        --nnodes 2 \
        --node-rank 0 \
        --host 0.0.0.0 \
        --port 30000 \
        --mem-fraction-static 0.85 \
        --max-running-requests 16 \
        --chunked-prefill-size 4096

Start the SGLang worker on spark2 (node rank 1 — worker node):

docker run -d \
    --name sglang-worker \
    --restart unless-stopped \
    --gpus all \
    --network host \
    --shm-size 16g \
    --privileged \
    -v /models:/models \
    -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    -e NCCL_IB_HCA=rocep1s0f0 \
    -e NCCL_IB_ROCE_VERSION_NUM=2 \
    -e NCCL_NET_GDR_LEVEL=5 \
    -e NCCL_NET_GDR_READ=1 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path /models/llama-3.1-70b-instruct \
        --tp 2 \
        --dist-init-addr 192.168.100.1:20000 \
        --nnodes 2 \
        --node-rank 1

Note: The worker node (node-rank 1) does not expose an HTTP API endpoint. All client requests go to the head node (spark1:30000). The worker participates only in distributed computation.

Flag explanations:

  • --tp 2: Total tensor parallel size across ALL nodes combined. SGLang distributes this evenly (1 GPU per node × 2 nodes = 2).
  • --dist-init-addr 192.168.100.1:20000: Head node address for distributed initialization. Port 20000 must be open in the firewall.
  • --nnodes 2: Total number of nodes.
  • --node-rank 0/1: Rank of this node (0 = head, 1 = worker).
  • --network host: Container uses host network stack directly — required for RDMA.
  • --privileged: Required for RDMA device access.
  • --shm-size 16g: Allocates 16 GB shared memory for inter-process communication within the container.
  • --mem-fraction-static 0.85: Fraction of memory reserved for model weights and KV cache.
  • --chunked-prefill-size 4096: Processes long prompts in 4096-token chunks, reducing TTFT.

Monitor SGLang startup:

# Watch logs on spark1:
docker logs -f sglang-head

# Watch logs on spark2:
docker logs -f sglang-worker

The server is ready when you see: The server is fired up and ready to roll!

Test SGLang:

# From your laptop (with SSH tunnel open on port 30000):
curl -X POST http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-70b-instruct",
        "messages": [
            {"role": "user", "content": "Hello! What can you do?"}
        ],
        "max_tokens": 200
    }'

SGLang with Speculative Decoding (EAGLE)

docker run -d \
    --name sglang-head-eagle \
    --gpus all \
    --network host \
    --shm-size 16g \
    --privileged \
    -v /models:/models \
    -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    -e NCCL_IB_HCA=rocep1s0f0 \
    -e NCCL_IB_ROCE_VERSION_NUM=2 \
    -e NCCL_NET_GDR_LEVEL=5 \
    -e NCCL_NET_GDR_READ=1 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path /models/llama-3.1-70b-instruct \
        --tp 2 \
        --dist-init-addr 192.168.100.1:20000 \
        --nnodes 2 \
        --node-rank 0 \
        --host 0.0.0.0 \
        --port 30000 \
        --mem-fraction-static 0.80 \
        --speculative-algorithm EAGLE \
        --speculative-draft-model-path /models/eagle3-llama3.1-70b \
        --speculative-num-steps 5 \
        --speculative-eagle-topk 8 \
        --chunked-prefill-size 4096

SGLang Structured Output Example

One of SGLang's most powerful features is constrained generation — guaranteeing that the model's output conforms to a specified JSON schema. This is invaluable for agentic applications:

#!/usr/bin/env python3
"""
sglang_structured_output.py
Demonstrates SGLang constrained generation with JSON schema.
"""

import json
import requests

SGLANG_URL = "http://localhost:30000"  # Via SSH tunnel from your laptop

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "skills": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "age", "skills"]
}

response = requests.post(
    f"{SGLANG_URL}/v1/chat/completions",
    json={
        "model": "llama-3.1-70b-instruct",
        "messages": [
            {
                "role": "user",
                "content": (
                    "Extract information about: Alice is a 32-year-old software "
                    "engineer who knows Python, Rust, and CUDA."
                )
            }
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "person_info",
                "schema": schema,
                "strict": True
            }
        },
        "max_tokens": 200
    }
)

result = json.loads(response.json()["choices"][0]["message"]["content"])
print(json.dumps(result, indent=2))
# Expected output (structure guaranteed by the schema):
# {"name": "Alice", "age": 32, "skills": ["Python", "Rust", "CUDA"]}

PART FIVE: PERFORMANCE OPTIMIZATION TECHNOLOGIES


CHAPTER 10: SPECULATIVE DECODING — MAKING THE GPU WORK HARDER

Speculative decoding is one of the most elegant performance optimization techniques in modern LLM inference. To understand it, we must first understand why LLM inference is slow.

During the decode phase, the GPU must load the entire model's weights from memory to generate a single token. For a 70B model in BF16 precision, that is 140 gigabytes of data that must be read from memory per token generation step. The GPU's compute units are largely idle while waiting for data to arrive from memory. This is the fundamental memory-bandwidth bottleneck of autoregressive LLM decoding.

Speculative decoding exploits this idle compute capacity. The key insight is that while the large target model is loading its weights for one token, the GPU's compute units could be doing useful work.

The mechanism:

  1. A small, fast draft model (typically 1–8B parameters) generates K candidate tokens very quickly.
  2. The large target model performs a single forward pass that processes all K candidate tokens simultaneously (in parallel, like a prefill operation).
  3. The target model accepts the longest prefix of the candidate tokens that it would have generated on its own.
  4. If the draft model guessed correctly for all K tokens: K tokens generated for ~1 model load cost.
  5. If the draft model guessed correctly for M tokens then diverged: M+1 tokens generated for ~1 model load cost.

Critical property: The output of speculative decoding is mathematically identical to what the target model would have produced without speculative decoding. This is a lossless optimization.

    WITHOUT SPECULATIVE DECODING:
    Step 1: Load 70B weights → Generate token "The"      (0.51s)
    Step 2: Load 70B weights → Generate token "quick"    (0.51s)
    Step 3: Load 70B weights → Generate token "brown"    (0.51s)
    Step 4: Load 70B weights → Generate token "fox"      (0.51s)
    Total: 4 × 0.51s = 2.04s for 4 tokens

    WITH SPECULATIVE DECODING (K=4 draft tokens):
    Draft: Load 7B weights → Propose ["The", "quick", "brown", "fox"]  (0.05s)
    Verify: Load 70B weights → Verify all 4 tokens in parallel          (0.55s)
    If all accepted: 4 tokens in 0.60s  →  3.4× speedup

Practical speedups of 2× to 4× are achievable with acceptance rates of 70–90%.
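
A simplified model shows where those numbers come from. Assume each draft token is accepted independently with probability p (real acceptance is correlated and workload-dependent), K draft tokens per round, one bonus token from the verify pass, and a draft cost of ~10% of a target forward pass:

def expected_speedup(p: float, k: int, draft_cost: float = 0.1) -> float:
    """Tokens per round divided by cost per round, in units of one target pass."""
    accepted = sum(p**i for i in range(1, k + 1))   # E[# accepted draft tokens]
    tokens_per_round = accepted + 1                 # +1 bonus token from the verify pass
    cost_per_round = 1 + k * draft_cost             # 1 target pass + k cheap draft steps
    return tokens_per_round / cost_per_round

for p in (0.7, 0.8, 0.9):
    print(f"acceptance p={p:.1f}, K=5 drafts: ~{expected_speedup(p, 5):.1f}x speedup")

Under these assumptions the model reproduces the lower end of the range quoted above (~2.0x to ~3.1x); higher acceptance rates and cheaper drafts (e.g., EAGLE heads) push it toward the upper end.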

EAGLE-3 is an advanced variant that attaches a lightweight autoregressive prediction head directly to the target model's internal layers, eliminating the need for a separate draft model. EAGLE-3 achieves acceptance rates above 90% on many benchmarks, leading to speedups of 3× to 5×.

Configuring Speculative Decoding in vLLM

Add these flags to your vLLM launch command:

--speculative-model /models/llama-3.1-8b-instruct \
--num-speculative-tokens 5 \
--speculative-draft-tensor-parallel-size 1

The optimal number of speculative tokens depends on the acceptance rate of your specific model pair and workload. A value of 4 to 6 is typically a good starting point.

Configuring Speculative Decoding in SGLang

Add these flags to your SGLang launch command:

--speculative-algorithm EAGLE \
--speculative-draft-model-path /models/eagle3-llama3.1-70b \
--speculative-num-steps 5 \
--speculative-eagle-topk 8

CHAPTER 11: KV CACHE, PAGED ATTENTION, AND CONTINUOUS BATCHING

These three technologies work together as a system.

The KV Cache in Depth

During the decode phase, the attention mechanism at each layer computes attention scores between the current token and all previous tokens. Without caching, the key and value vectors for all previous tokens would be recomputed from scratch at every step — quadratic complexity in sequence length. The KV cache stores these vectors so they only need to be computed once.

Memory requirements: Llama 3.1 70B has 80 layers and 64 query heads, but uses grouped-query attention (GQA) with only 8 KV heads and a head dimension of 128. Using BF16 precision:

  • KV cache per token per layer: 2 (K and V) × 8 × 128 × 2 bytes = 4,096 bytes = 4 KB
  • KV cache per token (all layers): 80 × 4 KB = 320 KB
  • KV cache for one 4,096-token sequence: 4,096 × 320 KB ≈ 1.3 GB
  • KV cache for 32 concurrent 4,096-token requests: ≈ 42 GB, a significant fraction of the 128 GB unified memory once the model weights are loaded alongside it

This illustrates why KV cache management is so critical.
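
A quick way to sanity-check these numbers for any model is a few lines of Python. The helper below is a back-of-the-envelope sketch (the GQA parameters shown are Llama 3.1 70B's published values; substitute your own model's):

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, n_seqs, dtype_bytes=2):
    """KV cache size: 2 (K and V) × kv_heads × head_dim × dtype_bytes per token per layer."""
    per_token = 2 * kv_heads * head_dim * dtype_bytes * layers
    return per_token * seq_len * n_seqs

# Llama 3.1 70B with GQA: 80 layers, 8 KV heads, head dim 128, BF16 (2 bytes)
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, n_seqs=32)
print(f"{size / 1e9:.1f} GB")   # ≈ 42.9 GB for 32 concurrent 4,096-token sequences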

Paged Attention in Depth

Traditional KV cache management allocates a contiguous block of memory for each request at the start, sized for the maximum possible sequence length. If you allocate for 4,096 tokens but the actual response is 500 tokens, you waste 87.8% of the allocated memory.

Paged attention divides the KV cache into fixed-size blocks (typically 16 or 32 tokens per block) and allocates blocks on demand as the sequence grows. A logical block table maps the sequence's token positions to physical memory blocks, which do not need to be contiguous — exactly like an operating system's virtual memory paging system.

    TRADITIONAL KV CACHE (3 requests, max_len=4096):
    +------------------+------------------+------------------+
    | Request 1        | Request 2        | Request 3        |
    | 4096 tok alloc'd | 4096 tok alloc'd | 4096 tok alloc'd |
    | (200 used)       | (1500 used)      | (50 used)        |
    +------------------+------------------+------------------+
    Waste: ~87% of allocated memory

    PAGED ATTENTION (block size = 16 tokens):
    Block 0:  Req1 tok 0-15      Block 8:  Req2 tok 0-15
    Block 1:  Req1 tok 16-31     Block 9:  Req2 tok 16-31
    ...                           ...
    Block 12: Req1 tok 192-207   Block 96: Req2 tok 1488-1503
    Block 13: Req3 tok 0-15      [free blocks available]
    Waste: at most 15 tokens per request (last partial block)

Benefits:

  • Near-zero memory waste
  • Enables memory sharing between requests with common prefixes (prefix caching)
  • Allows serving many more concurrent users
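
A toy allocator makes the block-table mechanics concrete. This sketch is illustrative only (the class and method names are invented for the example); it tracks block ownership, not the actual key/value tensors:

# paged_kv_sketch.py — toy block-table allocator mirroring paged attention.
class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # request id -> [physical block ids]
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req_id):
        table = self.tables.setdefault(req_id, [])
        used = self.lengths.get(req_id, 0)
        if used == len(table) * self.block_size:  # last block is full
            table.append(self.free.pop(0))        # allocate one more on demand
        self.lengths[req_id] = used + 1

    def release(self, req_id):
        # Finished request: return its blocks to the pool immediately.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

kv = PagedKVCache(num_blocks=64)
for _ in range(50):
    kv.append_token("req1")   # 50 tokens -> ceil(50/16) = 4 blocks, not 4096
print(kv.tables["req1"])      # e.g. [0, 1, 2, 3]; need not stay contiguous
kv.release("req1")
print(len(kv.free))           # 64 — all blocks back in the pool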

Prefix Caching

Prefix caching reuses KV cache blocks for common prompt prefixes across multiple requests. If you have a system prompt sent with every request, its KV cache is computed once and reused for all subsequent requests — eliminating the prefill cost for the system prompt entirely.

Enable in vLLM with --enable-prefix-caching. In SGLang, prefix caching is enabled by default via RadixAttention.

System prompt design for maximum cache efficiency:

FIXED PREFIX (cached after first request):
"You are a helpful, accurate, and thoughtful AI assistant. You provide
clear, well-organized responses. You acknowledge uncertainty when you
are not sure about something. You do not make up facts. [... rest of
fixed instructions ...]"

VARIABLE SUFFIX (not cached, appended per request):
"The current date is 2025-10-16. The user's name is Alice."

Every character of the system prompt that changes between requests invalidates the cached KV cache for that portion.
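
As a concrete illustration, the sketch below sends two requests that share the same system prompt through the OpenAI-compatible endpoint, assuming the vLLM server from this guide is reachable over an SSH tunnel on port 8000 and was launched with --enable-prefix-caching. Total request time is a coarse signal; a streaming TTFT measurement (see Chapter 13) is more precise:

#!/usr/bin/env python3
# prefix_cache_demo.py — rough demonstration of prefix-cache reuse.
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"
SYSTEM = (
    "You are a helpful, accurate, and thoughtful AI assistant. "
    "You provide clear, well-organized responses."
)  # fixed prefix — identical bytes on every request

def timed_request(user_msg):
    t0 = time.time()
    r = requests.post(URL, json={
        "model": "llama-3.1-70b-instruct",
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": 50,
        "temperature": 0,
    }, timeout=120)
    r.raise_for_status()
    return time.time() - t0

# First request computes and caches the system prompt's KV blocks ...
print(f"cold: {timed_request('Summarize paged attention.'):.2f}s")
# ... the second reuses them, so prefill covers only the user message.
print(f"warm: {timed_request('What is RoCE v2 in one sentence?'):.2f}s")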

Continuous Batching in Depth

Static batching processes requests in fixed-size batches, waiting for all sequences to complete before starting the next batch. LLM output lengths are highly variable; the GPU sits idle waiting for the longest sequence in a batch to finish.

Continuous batching (in-flight batching) operates at the token level. At each generation step, the scheduler examines all active and waiting requests and selects a set to process. When a request finishes, its slot is immediately freed and a new request takes its place.

Result: GPU utilization of 80–95% compared to 30–60% with static batching. Throughput improvements of up to 23× have been reported.

Both vLLM and SGLang implement continuous batching by default. TensorRT-LLM calls its implementation "in-flight batching."
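
The scheduling idea itself fits in a few lines. The following toy simulation (names invented for the example) shows the key property: a finished sequence frees its slot immediately, and a queued request is admitted at the very next decode step instead of waiting for the whole batch to drain:

# continuous_batching_sketch.py — toy token-level scheduler.
import random

random.seed(0)
waiting = [f"req{i}" for i in range(8)]       # queued requests
active = {}                                   # request -> tokens still to generate
MAX_ACTIVE = 4                                # batch slots (cf. --max-num-seqs)

step = 0
while waiting or active:
    # Admit queued requests into any free slots before this decode step.
    while waiting and len(active) < MAX_ACTIVE:
        active[waiting.pop(0)] = random.randint(2, 6)   # variable output length
    # One decode step: every active sequence advances by exactly one token.
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:
            del active[req]                   # slot freed immediately
    step += 1

print(f"all 8 requests finished in {step} decode steps")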


CHAPTER 12: QUANTIZATION — FITTING MORE MODEL INTO LESS MEMORY

Quantization represents model weights and activations using lower-precision numerical formats, reducing memory consumption and often improving inference speed.

The Blackwell GPU in the DGX Spark has native hardware support for FP8 and FP4 arithmetic — dedicated tensor core units that perform these operations at full speed without emulation overhead.

Memory requirements by precision for common models:

Model             FP16      BF16      FP8       INT4/FP4
Llama 3.1 8B      16 GB     16 GB     8 GB      4 GB
Llama 3.1 70B     140 GB    140 GB    70 GB     35 GB
Llama 3.1 405B    ~810 GB   ~810 GB   ~405 GB   ~203 GB

For two DGX Spark nodes (256 GB combined unified memory), assuming pipeline parallelism splits the weights evenly across the two nodes:

Model             Recommended Precision   Weights per Node   KV Cache Headroom per Node
Llama 3.1 70B     FP8                     35 GB              ~93 GB
Llama 3.1 70B     FP4                     ~18 GB             ~110 GB
Llama 3.1 405B    FP4 / INT4              ~100 GB            ~28 GB

Quality impact:

  • FP8: Negligible impact on output quality for most tasks (benchmark scores within 1% of FP16).
  • FP4: Slightly larger impact but acceptable for many applications, especially with calibration techniques like GPTQ or AWQ.
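
The table above can be reproduced with a rough fit calculator. The sketch below assumes pipeline parallelism splits weights evenly and reserves 10% of each node's memory for activations and runtime overhead; both are assumptions of this sketch, so its headroom figures come out slightly lower than the table's:

def weights_gb(params_b, bits):
    """Approximate weight memory in GB for a dense model."""
    return params_b * bits / 8               # params (billions) × bytes/param

def fits(params_b, bits, nodes=2, mem_per_node_gb=128, reserve_frac=0.10):
    """Rough fit check: weights split across nodes (PP), minus a runtime
    reserve, with whatever remains available for KV cache."""
    per_node = weights_gb(params_b, bits) / nodes
    budget = mem_per_node_gb * (1 - reserve_frac)
    return per_node <= budget, budget - per_node   # (fits?, KV headroom/node)

for bits, name in [(16, "BF16"), (8, "FP8"), (4, "FP4")]:
    ok, headroom = fits(405, bits)
    print(f"405B {name}: {'fits' if ok else 'does not fit'}"
          f" ({headroom:+.0f} GB KV headroom per node)")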

Enable FP8 in vLLM:

--quantization fp8

Enable FP4 in vLLM (requires Blackwell GPU and compatible vLLM version):

--quantization fp4

Enable quantization in TensorRT-LLM: Specified at checkpoint-conversion and engine-build time (for example, the --use_fp8 flag passed to convert_checkpoint.py, as shown in Addendum D), not at serve time.


PART SIX: CONFIGURATION GUIDES FOR SPECIFIC USE CASES


CHAPTER 13: OPTIMAL CONFIGURATION FOR AN LLM CHATBOT

A chatbot application requires:

  • Low TTFT (Time to First Token): Under 500 ms for a good user experience.
  • Streaming token rate: 20 to 100 tokens per second.
  • Concurrent user capacity: Serving 5 to 20 users simultaneously with acceptable latency.

Recommended vLLM configuration for a chatbot (70B model, dual DGX Spark):

# On spark1, after Ray cluster is running:
source /opt/vllm-env/bin/activate
python3 -m vllm.entrypoints.openai.api_server \
    --model /models/llama-3.1-70b-instruct \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --speculative-model /models/llama-3.1-8b-instruct \
    --num-speculative-tokens 5 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --disable-log-requests

Flag rationale:

  • 70B model: Faster than 405B, excellent quality for conversational use cases.
  • --tensor-parallel-size 1 --pipeline-parallel-size 2: Correct for 1 GPU per node; PP minimizes inter-node communication.
  • --max-model-len 16384: Generous for most conversations without wasting KV cache memory.
  • --gpu-memory-utilization 0.85: 15% safety margin prevents OOM crashes in production.
  • --max-num-seqs 32: Limits concurrent sequences to maintain low latency.
  • --max-num-batched-tokens 8192: Controls the throughput/latency tradeoff.
  • Speculative decoding with the 8B draft model: Reduces per-token latency, making responses feel more responsive.

CHAPTER 14: OPTIMAL CONFIGURATION FOR AGENTIC AI

Agentic AI systems have fundamentally different requirements from chatbots. An AI agent is a system where the LLM acts as a reasoning engine that can plan, use tools, and execute multi-step tasks autonomously.

Key differences from chatbot workloads:

  • The LLM is called many times per user request (not just once).
  • Each call involves a long, consistent system prompt describing the agent's capabilities and tools.
  • Structured output requirements (the agent must return actions in a specific format).
  • LLM calls are often sequential — latency compounds.

SGLang is particularly well-suited for agentic applications because of its RadixAttention prefix caching, support for constrained generation, and efficient handling of tool-calling patterns.

Recommended SGLang configuration for agentic AI (70B model, dual DGX Spark):

# On spark1:
docker run -d \
    --name sglang-agentic \
    --restart unless-stopped \
    --gpus all \
    --network host \
    --shm-size 16g \
    --privileged \
    -v /models:/models \
    -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    -e NCCL_IB_HCA=rocep1s0f0 \
    -e NCCL_IB_ROCE_VERSION_NUM=2 \
    -e NCCL_NET_GDR_LEVEL=5 \
    -e NCCL_NET_GDR_READ=1 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path /models/llama-3.1-70b-instruct \
        --tp 2 \
        --dist-init-addr 192.168.100.1:20000 \
        --nnodes 2 \
        --node-rank 0 \
        --host 0.0.0.0 \
        --port 30000 \
        --mem-fraction-static 0.85 \
        --max-running-requests 16 \
        --enable-cache-report \
        --chunked-prefill-size 4096

Tool Calling Architecture for Agentic AI:

#!/usr/bin/env python3
"""
agentic_loop.py
Demonstrates a safe agentic tool-calling loop using SGLang.

SECURITY NOTE on safe_calculate():
  This function parses the expression into an AST, whitelists only safe
  node types, and then calls eval() on the compiled AST. The eval() call
  is intentional and controlled: only expressions whose every AST node
  has been verified against the whitelist reach eval(). This is NOT the
  same as calling eval() on raw user input.

  For production use with complex mathematical expressions, consider
  replacing this with a dedicated safe math library such as:
    - sympy:   result = sympy.sympify(expression)
    - asteval: from asteval import Interpreter; aeval = Interpreter()
    - numexpr: import numexpr; numexpr.evaluate(expression)
"""

import ast
import json
import requests

SGLANG_URL = "http://localhost:30000"  # Via SSH tunnel from your laptop

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information about a topic.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query string."
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": (
                "Evaluate a simple arithmetic expression. "
                "Supports: +, -, *, /, **, and parentheses. "
                "Example: '(3 + 4) * 2'"
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "A safe arithmetic expression to evaluate."
                    }
                },
                "required": ["expression"]
            }
        }
    }
]


def execute_web_search(query: str) -> str:
    """Placeholder: replace with your actual search API call."""
    return f"[Search results for '{query}': This is a placeholder. Integrate a real search API.]"


def safe_calculate(expression: str) -> str:
    """
    Evaluate a simple arithmetic expression safely.

    Approach: parse the expression into an AST, walk every node and
    verify it belongs to an explicit whitelist of safe arithmetic node
    types, then compile and eval() the validated AST.

    The eval() here operates on a pre-validated AST — not on raw user
    input — so it cannot execute imports, function calls, attribute
    access, or any other Python construct outside the whitelist.

    Whitelisted node types:
      Expression, BinOp, UnaryOp, Constant (numbers only),
      Add, Sub, Mult, Div, FloorDiv, Mod, Pow, USub, UAdd

    Note: ast.Num is deprecated since Python 3.8 and removed in 3.12;
    ast.Constant covers all literal values in modern Python.
    """
    # Allowed AST node types — strictly arithmetic only
    ALLOWED_NODES = (
        ast.Expression,
        ast.BinOp,
        ast.UnaryOp,
        ast.Constant,   # Covers numeric literals (replaces deprecated ast.Num)
        ast.Add,
        ast.Sub,
        ast.Mult,
        ast.Div,
        ast.FloorDiv,
        ast.Mod,
        ast.Pow,
        ast.USub,
        ast.UAdd,
    )

    try:
        tree = ast.parse(expression, mode='eval')
    except SyntaxError as e:
        return f"Error: Invalid expression syntax: {e}"

    # Validate every node in the AST against the whitelist
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return (
                f"Error: Expression contains disallowed operation: "
                f"'{type(node).__name__}'. Only basic arithmetic is permitted."
            )

    # Additional check: Constant nodes must be numeric (not strings, booleans, etc.)
    # bool is a subclass of int in Python, so it must be rejected explicitly.
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and (
            isinstance(node.value, bool)
            or not isinstance(node.value, (int, float))
        ):
            return (
                f"Error: Non-numeric constant '{node.value}' is not allowed."
            )

    try:
        # eval() on a pre-validated, whitelisted AST — not on raw input
        result = eval(compile(tree, "<safe_arithmetic>", "eval"))  # noqa: S307
        return str(result)
    except ZeroDivisionError:
        return "Error: Division by zero."
    except OverflowError:
        return "Error: Result is too large to compute."
    except Exception as e:
        return f"Error evaluating expression: {e}"


def run_agent(user_message: str, max_iterations: int = 10) -> str:
    """
    Run an agentic loop until the model produces a final response
    without any tool calls, or until max_iterations is reached.
    """
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant with access to tools. "
                "Use tools when you need current information or need to perform calculations. "
                "Always reason step by step before calling a tool."
            )
        },
        {"role": "user", "content": user_message}
    ]

    for iteration in range(max_iterations):
        response = requests.post(
            f"{SGLANG_URL}/v1/chat/completions",
            json={
                "model": "llama-3.1-70b-instruct",
                "messages": messages,
                "tools": TOOLS,
                "tool_choice": "auto",
                "max_tokens": 1000,
                "temperature": 0.1
            },
            timeout=120
        )
        response.raise_for_status()
        data = response.json()

        message = data["choices"][0]["message"]
        messages.append(message)

        if message.get("tool_calls"):
            for tool_call in message["tool_calls"]:
                tool_name = tool_call["function"]["name"]
                tool_args = json.loads(tool_call["function"]["arguments"])

                if tool_name == "search_web":
                    result = execute_web_search(tool_args["query"])
                elif tool_name == "calculate":
                    result = safe_calculate(tool_args["expression"])
                else:
                    result = f"Error: Unknown tool '{tool_name}'"

                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": result
                })
        else:
            # No tool calls — model has produced its final response
            return message["content"]

    return "Error: Maximum iterations reached without a final response."


if __name__ == "__main__":
    answer = run_agent(
        "What is 2 to the power of 32, and what is it commonly used for in computing?"
    )
    print(answer)

CHAPTER 15: NCCL CONFIGURATION FOR MAXIMUM INTER-NODE PERFORMANCE

NCCL (NVIDIA Collective Communications Library) handles the collective communication operations (all-reduce, all-gather, broadcast, etc.) used during distributed inference. Getting NCCL configured correctly is the difference between achieving near-theoretical bandwidth and getting only a fraction of it.

Create /etc/nccl.conf on both nodes:

sudo nano /etc/nccl.conf
# /etc/nccl.conf — NCCL configuration for DGX Spark dual-node cluster
# This file is read by NCCL at runtime on both nodes.

# Network interface for socket-based communication (fallback and bootstrap):
NCCL_SOCKET_IFNAME=enp1s0f0np0

# RDMA Host Channel Adapter device name:
NCCL_IB_HCA=rocep1s0f0

# RoCE version (2 = routable, UDP/IP encapsulation):
NCCL_IB_ROCE_VERSION_NUM=2

# GPUDirect RDMA level (5 = full GPU-to-NIC direct access):
NCCL_NET_GDR_LEVEL=5

# Enable GPUDirect RDMA for read operations:
NCCL_NET_GDR_READ=1

# Explicitly enable InfiniBand/RoCE transport (0 = enabled):
NCCL_IB_DISABLE=0

# Algorithm: Ring is efficient for two-node all-reduce:
NCCL_ALGO=Ring

# Protocol: Simple is efficient for large inter-node messages (tensor transfers).
# Use LL (Low Latency) only for intra-node NVLink communication.
NCCL_PROTO=Simple

# Communication buffer size (4 MB — appropriate for large tensor transfers):
NCCL_BUFFSIZE=4194304

# Socket thread pool configuration (fallback path):
NCCL_SOCKET_NTHREADS=4
NCCL_NSOCKS_PERTHREAD=4

# GID Index: INTENTIONALLY NOT SET.
# NCCL >= 2.21 selects the GID index dynamically based on the configured
# IP address and NCCL_IB_ROCE_VERSION_NUM. Hardcoding this value can cause
# "ibv_modify_qp failed with error Invalid argument" errors if the index
# does not match your interface's RoCE v2 GID.
#
# To find the correct index for your system, run:
#   show_gids
# Look for the entry matching your interface (enp1s0f0np0) with "V2" in
# the VER column and your IP address (192.168.100.1 or .2) in the IPv4 column.
# If you must set it explicitly (NCCL < 2.21 only), uncomment:
# NCCL_IB_GID_INDEX=<index_from_show_gids>

# Address family for dynamic GID selection (NCCL >= 2.21):
# Uncomment if you need to restrict dynamic GID selection to IPv4:
# NCCL_IB_ADDR_FAMILY=AF_INET

# Address range for dynamic GID selection (NCCL >= 2.21, CIDR format):
# Uncomment and set to your inter-node subnet if needed:
# NCCL_IB_ADDR_RANGE=192.168.100.0/24

Finding the Correct GID Index (If Needed)

# Run on both nodes to see the GID table:
show_gids

# Example output:
# DEV         PORT  INDEX  GID                                    IPv4            VER   DEV
# ---         ----  -----  ---                                    ----            ---   ---
# rocep1s0f0  1     0      fe80:0000:0000:0000:...                               V1    enp1s0f0np0
# rocep1s0f0  1     1      fe80::...                                              V2    enp1s0f0np0
# rocep1s0f0  1     2      0000:0000:0000:0000:0000:ffff:c0a8:6401 192.168.100.1  V1    enp1s0f0np0
# rocep1s0f0  1     3      ::ffff:192.168.100.1                    192.168.100.1  V2    enp1s0f0np0
#
# The correct index for RoCE v2 with your IP is INDEX=3 in this example.
# Your system may differ — always verify with show_gids.
# With NCCL >= 2.21, this is selected automatically; no manual setting needed
# unless you encounter "ibv_modify_qp" errors.

Verifying NCCL Performance

Clone and build the nccl-tests suite:

# Install OpenMPI development headers:
sudo apt install -y openmpi-bin openmpi-common libopenmpi-dev

# Clone nccl-tests:
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests

# Build for ARM64 (DGX Spark uses aarch64, NOT x86_64):
make MPI=1 \
    MPI_HOME=/usr/lib/aarch64-linux-gnu/openmpi \
    NCCL_HOME=/usr/local/cuda \
    CUDA_HOME=/usr/local/cuda

Run the all-reduce bandwidth test:

mpirun -n 2 \
    --host 192.168.100.1:1,192.168.100.2:1 \
    --mca btl_tcp_if_include enp1s0f0np0 \
    --mca oob_tcp_if_include enp1s0f0np0 \
    ./build/all_reduce_perf \
        -b 1M -e 1G -f 2 \
        -g 1 -c 1 -n 100

Expected: For a two-node ring all-reduce over a 200 Gbps link (25 GB/s raw), an effective all-reduce bandwidth of approximately 12–15 GB/s is typical. Ring all-reduce moves each byte over the link twice (a reduce-scatter phase followed by an all-gather phase), so roughly half the raw link bandwidth is the practical ceiling even before protocol overhead. If you see significantly lower numbers, enable debugging:

NCCL_DEBUG=INFO mpirun -n 2 \
    --host 192.168.100.1:1,192.168.100.2:1 \
    --mca btl_tcp_if_include enp1s0f0np0 \
    ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 1 -c 1 -n 5

Look for NET/IB in the output (indicates RDMA is being used). If you see NET/Socket, RDMA is not working — check that nvidia-peermem is loaded and NCCL_IB_HCA points to the correct device.
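
Once nccl-tests passes, a quick end-to-end check from Python confirms that frameworks built on PyTorch will pick up the same RDMA path. A minimal sketch (the file name and the choice of port 29500 are illustrative; run one copy on each node via torchrun):

#!/usr/bin/env python3
# all_reduce_smoke.py — minimal NCCL smoke test via PyTorch distributed.
# Launch on BOTH nodes (node_rank 0 on spark1, 1 on spark2):
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0|1> \
#            --master_addr=192.168.100.1 --master_port=29500 all_reduce_smoke.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # picks up /etc/nccl.conf settings
rank = dist.get_rank()
torch.cuda.set_device(0)                  # one GPU per node
t = torch.full((1024, 1024), float(rank + 1), device="cuda")
dist.all_reduce(t)                        # ring all-reduce over the QSFP56 link
# With two ranks contributing 1.0 and 2.0, every element should now be 3.0.
print(f"rank {rank}: value = {t[0, 0].item()} (expected 3.0)")
dist.destroy_process_group()

Run it with NCCL_DEBUG=INFO the first time and confirm NET/IB appears, just as with nccl-tests.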


PART SEVEN: PRODUCTION DEPLOYMENT


CHAPTER 16: DOCKER COMPOSE FOR PRODUCTION DEPLOYMENT

For a production deployment that is reliable, automatically restarts after failures, and is easy to manage, Docker Compose combined with systemd provides the best foundation.

spark1 — Head Node Docker Compose

sudo mkdir -p /opt/spark-cluster
sudo nano /opt/spark-cluster/docker-compose.yml
# /opt/spark-cluster/docker-compose.yml — spark1 (HEAD NODE)
# The 'version' key is deprecated in modern Docker Compose (V2) and is omitted.

services:
  ray-head:
    image: nvcr.io/nvidia/vllm:latest
    container_name: ray-head
    network_mode: host
    privileged: true
    shm_size: "16g"
    volumes:
      - /models:/models:ro
      - /etc/nccl.conf:/etc/nccl.conf:ro
      - /tmp/ray:/tmp/ray
    environment:
      - NCCL_SOCKET_IFNAME=enp1s0f0np0
      - GLOO_SOCKET_IFNAME=enp1s0f0np0
      - NCCL_IB_HCA=rocep1s0f0
      - NCCL_IB_ROCE_VERSION_NUM=2
      - NCCL_NET_GDR_LEVEL=5
      - NCCL_NET_GDR_READ=1
      - NCCL_IB_DISABLE=0
      - VLLM_HOST_IP=192.168.100.1
    # Use a shell script entrypoint to sequence Ray startup and vLLM launch.
    # The 'command' uses a bash -c heredoc to avoid YAML folded-scalar
    # newline-collapsing issues with multi-line shell commands.
    command:
      - bash
      - -c
      - |
        ray start \
          --head \
          --node-ip-address=192.168.100.1 \
          --port=6379 \
          --dashboard-host=0.0.0.0 \
          --dashboard-port=8265 \
          --block &
        echo "Waiting for Ray to initialize..."
        sleep 15
        echo "Starting vLLM server..."
        python3 -m vllm.entrypoints.openai.api_server \
          --model /models/llama-3.1-70b-instruct \
          --tensor-parallel-size 1 \
          --pipeline-parallel-size 2 \
          --distributed-executor-backend ray \
          --host 0.0.0.0 \
          --port 8000 \
          --max-model-len 16384 \
          --gpu-memory-utilization 0.85 \
          --enable-prefix-caching \
          --speculative-model /models/llama-3.1-8b-instruct \
          --num-speculative-tokens 5 \
          --max-num-seqs 32 \
          --disable-log-requests
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    # Note: no 'network_mode: host' here. Compose ignores port mappings under
    # host networking, so the two cannot be combined. On the default bridge
    # network the container still reaches vLLM via the host's 192.168.100.1.
    environment:
      - OPENAI_API_BASE_URL=http://192.168.100.1:8000/v1
      - OPENAI_API_KEY=not-needed
    ports:
      - "3000:8080"
    restart: unless-stopped
    depends_on:
      ray-head:
        condition: service_healthy

spark2 — Worker Node Docker Compose

sudo mkdir -p /opt/spark-cluster
sudo nano /opt/spark-cluster/docker-compose.yml
# /opt/spark-cluster/docker-compose.yml — spark2 (WORKER NODE)
# The 'version' key is deprecated in modern Docker Compose (V2) and is omitted.

services:
  ray-worker:
    image: nvcr.io/nvidia/vllm:latest
    container_name: ray-worker
    network_mode: host
    privileged: true
    shm_size: "16g"
    volumes:
      - /models:/models:ro
      - /etc/nccl.conf:/etc/nccl.conf:ro
      - /tmp/ray:/tmp/ray
    environment:
      - NCCL_SOCKET_IFNAME=enp1s0f0np0
      - GLOO_SOCKET_IFNAME=enp1s0f0np0
      - NCCL_IB_HCA=rocep1s0f0
      - NCCL_IB_ROCE_VERSION_NUM=2
      - NCCL_NET_GDR_LEVEL=5
      - NCCL_NET_GDR_READ=1
      - NCCL_IB_DISABLE=0
      - VLLM_HOST_IP=192.168.100.2
    command:
      - bash
      - -c
      - |
        echo "Waiting for head node to be ready..."
        sleep 20
        ray start \
          --address=192.168.100.1:6379 \
          --node-ip-address=192.168.100.2 \
          --block
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Systemd Service for Automatic Boot-Time Startup

Create a systemd service on both nodes that starts Docker Compose after the network is ready:

sudo nano /etc/systemd/system/spark-cluster.service
# /etc/systemd/system/spark-cluster.service
# Starts the DGX Spark cluster services at boot time.
# Install on both spark1 and spark2.

[Unit]
Description=DGX Spark Cluster Services
After=docker.service network-online.target
Wants=docker.service network-online.target
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/spark-cluster
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=300
TimeoutStopSec=120
Restart=no

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable spark-cluster.service

Starting and stopping the cluster:

# Start:
sudo systemctl start spark-cluster

# Stop:
sudo systemctl stop spark-cluster

# Check status:
sudo systemctl status spark-cluster

# View logs:
sudo journalctl -u spark-cluster -f

PART EIGHT: MONITORING AND TROUBLESHOOTING


CHAPTER 17: MONITORING YOUR CLUSTER (HEADLESS)

All monitoring is done remotely via SSH from your laptop.

GPU Monitoring

# Real-time GPU status (run on either node via SSH):
ssh spark1 'watch -n 1 nvidia-smi'

# DCGM monitoring (pre-installed on DGX OS):
ssh spark1 'dcgmi dmon -e 203,204,1002,1003,1004'
# Metrics: GPU util (203), memory util (204), memory bandwidth (1002-1004)

Ray Dashboard

With the SSH tunnel open (ssh -L 8265:localhost:8265 spark1 -N &), navigate to: http://localhost:8265

The Ray Dashboard shows:

  • Cluster resource utilization
  • Running and pending tasks
  • Worker node status
  • Memory usage per node

vLLM Metrics (Prometheus)

vLLM exposes Prometheus metrics at /metrics:

# From your laptop (with tunnel open on port 8000):
curl http://localhost:8000/metrics | grep vllm

# Key metrics to watch:
# vllm:num_requests_running    — active requests
# vllm:num_requests_waiting    — queued requests
# vllm:gpu_cache_usage_perc    — KV cache utilization
# vllm:avg_prompt_throughput_toks_per_s   — prefill speed
# vllm:avg_generation_throughput_toks_per_s — decode speed
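
For quick headless monitoring without a full Prometheus stack, a small polling loop over /metrics works well. This is a sketch (the file name is invented for the example; the regex assumes the standard Prometheus text format, with or without labels):

#!/usr/bin/env python3
# kv_cache_watch.py — poll key vLLM metrics every few seconds.
import re
import time
import requests

METRICS_URL = "http://localhost:8000/metrics"   # via SSH tunnel
WATCH = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)

while True:
    text = requests.get(METRICS_URL, timeout=5).text
    values = []
    for name in WATCH:
        # Matches 'name{labels...} value' or 'name value' lines.
        m = re.search(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+(\S+)", text, re.M)
        values.append(f"{name}={m.group(1) if m else 'n/a'}")
    print("  ".join(values))
    time.sleep(5)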

Inter-Node Bandwidth Monitoring

# Monitor RDMA interface statistics (run on spark1):
ssh spark1 'watch -n 1 "perfquery -x rocep1s0f0 1"'

Container Log Monitoring

# Follow vLLM/Ray logs on spark1:
ssh spark1 'docker logs -f ray-head'

# Follow worker logs on spark2:
ssh spark2 'docker logs -f ray-worker'

# Follow SGLang logs:
ssh spark1 'docker logs -f sglang-head'
ssh spark2 'docker logs -f sglang-worker'

CHAPTER 18: TROUBLESHOOTING COMMON ISSUES

Problem: Cannot SSH to DGX Spark After Renaming

Symptom: SSH fails with "Host key verification failed" after hostname change.

Solution: Remove the old host key from your laptop's known_hosts:

ssh-keygen -R dgx-spark-XXXX.local
ssh-keygen -R 192.168.1.XXX
# Then reconnect:
ssh spark1

Problem: vLLM Fails to Detect Both GPUs / Ray Workers

Symptom: vLLM reports only 1 GPU available, or fails to start workers on spark2.

Solution:

# Check Ray cluster status:
ssh spark1 'source /opt/vllm-env/bin/activate && ray status'

# Verify VLLM_HOST_IP is set correctly on both nodes:
ssh spark1 'echo $VLLM_HOST_IP'   # Should be 192.168.100.1
ssh spark2 'echo $VLLM_HOST_IP'   # Should be 192.168.100.2

# Check Ray logs:
ssh spark1 'cat /tmp/ray/session_latest/logs/gcs_server.out | tail -50'

# Verify firewall allows port 6379:
ssh spark1 'sudo ufw status | grep 6379'

# Force Ray to spread workers across nodes using valid JSON:
export VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options": {"strategy": "SPREAD"}}'

Problem: NCCL Communication Is Slow or Fails

Symptom: Inference is much slower than expected, or NCCL timeout errors appear.

Solution:

# Enable NCCL debugging:
NCCL_DEBUG=INFO python3 -m vllm.entrypoints.openai.api_server ...

# Look for "NET/IB" in output (RDMA working) vs "NET/Socket" (RDMA not working)

# Verify nvidia-peermem is loaded:
lsmod | grep nvidia_peermem
# If not loaded:
sudo modprobe nvidia-peermem

# Verify RDMA device name:
ls /sys/class/infiniband/
ibstat

# Check GID index (for NCCL < 2.21 or if you see ibv_modify_qp errors):
show_gids
# Find the entry with your IP and "V2" — note the INDEX value
# If NCCL < 2.21, set: export NCCL_IB_GID_INDEX=<that_index>

# Verify MTU is 9000 on both interfaces:
ip link show enp1s0f0np0 | grep mtu

Problem: Out of Memory Errors

Symptom: Inference server crashes with CUDA/memory OOM errors.

Solution:

# Reduce KV cache allocation:
--gpu-memory-utilization 0.75  # Instead of 0.85 or 0.90

# Reduce maximum context length:
--max-model-len 8192  # Instead of 16384

# Reduce concurrent requests:
--max-num-seqs 16  # Instead of 32

# Use more aggressive quantization:
--quantization fp8  # Or fp4 for Blackwell

# Check current memory usage:
ssh spark1 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv'

Problem: High Time to First Token (TTFT)

Symptom: Users experience long pauses before the first token appears.

Solution:

# Enable chunked prefill (allows decode to start before full prompt is processed):
# In vLLM: add --enable-chunked-prefill
# In SGLang: add --chunked-prefill-size 4096

# Enable prefix caching:
# In vLLM: add --enable-prefix-caching
# In SGLang: enabled by default via RadixAttention

# Reduce max batch size to prevent long prefill from blocking decode:
--max-num-batched-tokens 4096  # Instead of 8192

# Verify speculative decoding is configured correctly:
# Check acceptance rate in vLLM metrics:
curl http://localhost:8000/metrics | grep speculative

Problem: Speculative Decoding Not Providing Speedup

Symptom: Enabling speculative decoding does not improve throughput or latency.

Solution:

# Check acceptance rate:
curl http://localhost:8000/metrics | grep spec_decode_draft_acceptance_rate

# If acceptance rate < 50%:
# - The draft model may not be well-matched to the target model
# - Try reducing the number of speculative tokens:
--num-speculative-tokens 3  # Instead of 5

# Ensure draft model is from the same model family:
# Good: Llama-3.1-8B as draft for Llama-3.1-70B target
# Bad:  Mistral-7B as draft for Llama-3.1-70B target

Problem: SGLang Worker Node Not Connecting

Symptom: SGLang head node starts but worker never connects; head node hangs.

Solution:

# Verify port 20000 is open on spark1:
ssh spark1 'sudo ufw status | grep 20000'
# If not open:
ssh spark1 'sudo ufw allow 20000/tcp'

# Check that dist-init-addr matches spark1's inter-node IP:
# Must be 192.168.100.1:20000, NOT the management IP

# Verify both nodes use --network host in Docker:
# Without host networking, RDMA devices are not accessible

# Check worker logs:
ssh spark2 'docker logs sglang-worker 2>&1 | tail -30'

ADDENDUM A: NETWORK CONFIGURATION CHEATSHEET

================================================================================
  NETWORK CONFIGURATION CHEATSHEET
  DGX Spark Dual-Node Cluster
================================================================================

INTERFACE NAMES (verify with: ip link show / ls /sys/class/infiniband/)
  Network interface:    enp1s0f0np0   (may differ on your system)
  RDMA device:          rocep1s0f0    (may differ on your system)

IP ADDRESSES
  spark1 (head):        192.168.100.1/24
  spark2 (worker):      192.168.100.2/24
  Subnet:               192.168.100.0/24

MTU
  Both interfaces:      9000 (jumbo frames — required for performance)

CABLE
  Type:                 QSFP56 DAC (<=5m) or AOC (<=100m)
  Speed:                200 Gbps (theoretical); ~185-190 Gbps typical
  Port:                 QSFP56 Port 0 on each machine

QUICK COMMANDS
  Check link:           ip link show enp1s0f0np0
  Check RDMA:           ibstat
  Check MTU:            ip link show enp1s0f0np0 | grep mtu
  Test ping:            ping -c 4 192.168.100.2
  Test RDMA BW:         ib_send_bw -d rocep1s0f0 -i 1 -F --report_gbits [IP]
  Show GID table:       show_gids
  Check PFC:            mlnx_qos -i enp1s0f0np0
  Set PFC:              sudo mlnx_qos -i enp1s0f0np0 --pfc 0,0,0,1,0,0,0,0
  Load peermem:         sudo modprobe nvidia-peermem
  Check peermem:        lsmod | grep nvidia_peermem

NETPLAN FILE LOCATION
  /etc/netplan/60-dgx-interconnect.yaml

PORTS TO OPEN IN FIREWALL
  22      SSH
  6379    Ray GCS (head node only)
  8000    vLLM API / trtllm-serve
  8265    Ray Dashboard (head node only)
  20000   SGLang dist-init (head node only)
  29500   PyTorch distributed
  30000   SGLang API (head node only)
  8001    Triton gRPC (if using Triton)
  8002    Triton metrics (if using Triton)
================================================================================

ADDENDUM B: vLLM CONFIGURATION CHEATSHEET

================================================================================
  vLLM CONFIGURATION CHEATSHEET
================================================================================

INSTALLATION
  python3 -m venv /opt/vllm-env
  source /opt/vllm-env/bin/activate
  pip install "ray[default]" vllm

RAY CLUSTER STARTUP
  # spark1 (head):
  ray start --head --node-ip-address=192.168.100.1 --port=6379 \
            --dashboard-host=0.0.0.0 --dashboard-port=8265 --block &

  # spark2 (worker):
  ray start --address=192.168.100.1:6379 --node-ip-address=192.168.100.2 --block &

  # Check status:
  ray status

  # Stop Ray on a node:
  ray stop

RAY DASHBOARD (from laptop with SSH tunnel)
  ssh -L 8265:localhost:8265 spark1 -N &
  # Open: http://localhost:8265

vLLM LAUNCH — 70B MODEL, RECOMMENDED (PP=2, 1 GPU/node)
  python3 -m vllm.entrypoints.openai.api_server \
    --model /models/llama-3.1-70b-instruct \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --speculative-model /models/llama-3.1-8b-instruct \
    --num-speculative-tokens 5 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --disable-log-requests

vLLM LAUNCH — 405B MODEL (PP=2, FP4 required)
  python3 -m vllm.entrypoints.openai.api_server \
    --model /models/llama-3.1-405b \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --quantization fp4 \
    --enable-prefix-caching \
    --max-num-seqs 4 \
    --disable-log-requests

PARALLELISM QUICK REFERENCE (1 GPU per node — RECOMMENDED)
  70B:   --tensor-parallel-size 1 --pipeline-parallel-size 2
  405B:  --tensor-parallel-size 1 --pipeline-parallel-size 2

  Note: TP=2 across two single-GPU nodes is supported but requires
  an all-reduce at every transformer layer across the inter-node link.
  PP=2 is recommended as it requires only one activation transfer
  per forward pass at the layer boundary.

KEY FLAGS
  --gpu-memory-utilization 0.85   KV cache memory fraction (leave 15% headroom)
  --enable-prefix-caching         Reuse KV cache for common prefixes
  --enable-chunked-prefill        Reduce TTFT for long prompts
  --quantization fp8/fp4          Enable quantization (Blackwell native)
  --max-num-seqs N                Max concurrent sequences
  --max-num-batched-tokens N      Max tokens per batch step

TEST API
  curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"llama-3.1-70b-instruct",
         "messages":[{"role":"user","content":"Hello!"}],
         "max_tokens":100}'

METRICS
  curl http://localhost:8000/metrics | grep vllm

FORCE SPREAD PLACEMENT (if Ray packs workers on one node)
  # Use valid JSON with proper shell quoting:
  export VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options": {"strategy": "SPREAD"}}'
================================================================================

ADDENDUM C: SGLang CONFIGURATION CHEATSHEET

================================================================================
  SGLang CONFIGURATION CHEATSHEET
================================================================================

INSTALLATION
  docker pull lmsysorg/sglang:latest

LAUNCH — 70B MODEL, TWO NODES (TP=2)
  # spark1 (node-rank 0 — head, serves API on port 30000):
  docker run -d --name sglang-head --gpus all --network host \
    --shm-size 16g --privileged -v /models:/models \
    -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    -e NCCL_IB_HCA=rocep1s0f0 \
    -e NCCL_IB_ROCE_VERSION_NUM=2 \
    -e NCCL_NET_GDR_LEVEL=5 \
    -e NCCL_NET_GDR_READ=1 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
      --model-path /models/llama-3.1-70b-instruct \
      --tp 2 \
      --dist-init-addr 192.168.100.1:20000 \
      --nnodes 2 --node-rank 0 \
      --host 0.0.0.0 --port 30000 \
      --mem-fraction-static 0.85 \
      --max-running-requests 16 \
      --chunked-prefill-size 4096

  # spark2 (node-rank 1 — worker, no HTTP API exposed):
  docker run -d --name sglang-worker --gpus all --network host \
    --shm-size 16g --privileged -v /models:/models \
    -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    -e NCCL_IB_HCA=rocep1s0f0 \
    -e NCCL_IB_ROCE_VERSION_NUM=2 \
    -e NCCL_NET_GDR_LEVEL=5 \
    -e NCCL_NET_GDR_READ=1 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
      --model-path /models/llama-3.1-70b-instruct \
      --tp 2 \
      --dist-init-addr 192.168.100.1:20000 \
      --nnodes 2 --node-rank 1

WITH EAGLE SPECULATIVE DECODING
  Add to head node command:
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path /models/eagle3-llama3.1-70b \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8

KEY FLAGS
  --tp N                    Total tensor parallel size (all nodes combined)
  --dist-init-addr IP:PORT  Head node address for distributed init (port 20000)
  --nnodes N                Total number of nodes
  --node-rank N             This node's rank (0=head, 1,2,...=workers)
  --mem-fraction-static F   Memory fraction for weights+KV cache (0.0-1.0)
  --max-running-requests N  Max concurrent requests (controls latency)
  --chunked-prefill-size N  Chunk size for long prompt prefill
  --enable-cache-report     Log prefix cache hit rates

MONITOR LOGS
  docker logs -f sglang-head    # On spark1
  docker logs -f sglang-worker  # On spark2

TEST API (from laptop with SSH tunnel on port 30000)
  curl -X POST http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"llama-3.1-70b-instruct",
         "messages":[{"role":"user","content":"Hello!"}],
         "max_tokens":100}'

STOP
  docker stop sglang-head sglang-worker
  docker rm sglang-head sglang-worker
================================================================================

ADDENDUM D: TensorRT-LLM CONFIGURATION CHEATSHEET

================================================================================
  TensorRT-LLM CONFIGURATION CHEATSHEET
================================================================================

INSTALLATION
  docker pull nvcr.io/nvidia/tensorrt-llm/release:latest

START CONTAINER (interactive)
  docker run -it --gpus all --network host \
    -v /models:/models -v /output:/output \
    nvcr.io/nvidia/tensorrt-llm/release:latest bash

WORKFLOW: Convert → Build → Serve

STEP 1: CONVERT CHECKPOINT (70B, FP8, single node)
  cd /app/tensorrt_llm/examples/llama
  python3 convert_checkpoint.py \
    --model_dir /models/llama-3.1-70b-instruct \
    --output_dir /output/llama-70b-ckpt \
    --dtype bfloat16 \
    --use_fp8 \
    --tp_size 1 --pp_size 1

STEP 2: BUILD ENGINE (70B, single node)
  trtllm-build \
    --checkpoint_dir /output/llama-70b-ckpt \
    --output_dir /output/llama-70b-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 8 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --enable_chunked_context

STEP 2b: SERVE (70B, single node)
  # Model path is a POSITIONAL argument; --engine_dir points to built engine.
  trtllm-serve meta-llama/Llama-3.1-70B-Instruct \
    --engine_dir /output/llama-70b-engine \
    --tp_size 1 \
    --pp_size 1 \
    --gpus_per_node 1 \
    --max_batch_size 8 \
    --max_num_tokens 4096 \
    --port 8000 \
    --log_level info

STEP 3: CONVERT CHECKPOINT (405B, FP4/FP8, two nodes, PP=2)
  python3 convert_checkpoint.py \
    --model_dir /models/llama-3.1-405b \
    --output_dir /output/llama-405b-ckpt \
    --dtype bfloat16 \
    --use_fp8 \
    --tp_size 1 --pp_size 2
  # Note: Use --use_fp4 if available in your container version.
  # FP8 alone (~405 GB) does NOT fit in 256 GB combined memory.
  # FP4 (~203 GB) is required for 405B on dual DGX Spark.

STEP 4: BUILD ENGINE (405B, two nodes)
  trtllm-build \
    --checkpoint_dir /output/llama-405b-ckpt \
    --output_dir /output/llama-405b-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 4 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable

STEP 5: CREATE MPI HOSTFILE (OpenMPI format)
  sudo apt install -y openmpi-bin openmpi-common libopenmpi-dev
  sudo mkdir -p /etc/mpi
  cat <<EOF | sudo tee /etc/mpi/hostfile
  192.168.100.1 slots=1
  192.168.100.2 slots=1
  EOF

STEP 6: SERVE (405B, two nodes via mpirun)
  # Model path is a POSITIONAL argument to trtllm-serve.
  mpirun -n 2 \
    --hostfile /etc/mpi/hostfile \
    --allow-run-as-root \
    --mca btl_tcp_if_include enp1s0f0np0 \
    --mca oob_tcp_if_include enp1s0f0np0 \
    -x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    -x NCCL_IB_HCA=rocep1s0f0 \
    -x NCCL_IB_ROCE_VERSION_NUM=2 \
    -x NCCL_NET_GDR_LEVEL=5 \
    -x NCCL_NET_GDR_READ=1 \
    trtllm-serve meta-llama/Llama-3.1-405B-Instruct \
      --engine_dir /output/llama-405b-engine \
      --tp_size 1 \
      --pp_size 2 \
      --gpus_per_node 1 \
      --max_batch_size 4 \
      --max_num_tokens 2048 \
      --port 8000 \
      --log_level info

MEMORY REQUIREMENTS (405B)
  FP16:  ~810 GB  — Does NOT fit in 256 GB combined
  FP8:   ~405 GB  — Does NOT fit in 256 GB combined
  FP4:   ~203 GB  — FITS (~53 GB headroom for KV cache)
  → FP4 or INT4 quantization is REQUIRED for 405B on dual DGX Spark

TEST
  curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"meta-llama/Llama-3.1-405B-Instruct",
         "messages":[{"role":"user","content":"Hello!"}],
         "max_tokens":100}'
================================================================================

ADDENDUM E: NCCL CONFIGURATION CHEATSHEET

================================================================================
  NCCL CONFIGURATION CHEATSHEET
================================================================================

CONFIG FILE: /etc/nccl.conf (on both nodes)
  NCCL_SOCKET_IFNAME=enp1s0f0np0
  NCCL_IB_HCA=rocep1s0f0
  NCCL_IB_ROCE_VERSION_NUM=2
  NCCL_NET_GDR_LEVEL=5
  NCCL_NET_GDR_READ=1
  NCCL_IB_DISABLE=0
  NCCL_ALGO=Ring
  NCCL_PROTO=Simple
  NCCL_BUFFSIZE=4194304
  NCCL_SOCKET_NTHREADS=4
  NCCL_NSOCKS_PERTHREAD=4
  # NCCL_IB_GID_INDEX — do NOT set unless NCCL < 2.21
  # Find correct value with: show_gids (look for your IP + VER=V2)
  # NCCL_IB_ADDR_FAMILY=AF_INET  (optional: restrict dynamic GID to IPv4)
  # NCCL_IB_ADDR_RANGE=192.168.100.0/24  (optional: restrict GID subnet)

ENVIRONMENT FILE: /etc/profile.d/nccl-spark.sh (on both nodes)
  export NCCL_SOCKET_IFNAME=enp1s0f0np0
  export GLOO_SOCKET_IFNAME=enp1s0f0np0
  export NCCL_IB_HCA=rocep1s0f0
  export NCCL_IB_ROCE_VERSION_NUM=2
  export NCCL_NET_GDR_LEVEL=5
  export NCCL_NET_GDR_READ=1
  export NCCL_IB_DISABLE=0

KEY VARIABLES EXPLAINED
  NCCL_SOCKET_IFNAME      Network interface for bootstrap/fallback
  NCCL_IB_HCA             RDMA device name (find with: ibstat)
  NCCL_IB_ROCE_VERSION_NUM=2  Use RoCE v2 (UDP/IP encapsulation)
  NCCL_NET_GDR_LEVEL=5    Full GPUDirect RDMA (GPU to NIC direct)
  NCCL_NET_GDR_READ=1     GPU memory readable by NIC for sends
  NCCL_IB_DISABLE=0       Enable InfiniBand/RoCE transport
  NCCL_ALGO=Ring          Ring algorithm (efficient for 2 nodes)
  NCCL_PROTO=Simple       Simple protocol (efficient for large inter-node msgs)
  NCCL_IB_ADDR_FAMILY     IP family for dynamic GID selection (NCCL >= 2.21)
  NCCL_IB_ADDR_RANGE      CIDR subnet for dynamic GID selection (NCCL >= 2.21)

DEBUGGING
  NCCL_DEBUG=INFO         Show transport selection and warnings
  NCCL_DEBUG=TRACE        Show all NCCL operations (very verbose)
  NCCL_DEBUG_SUBSYS=NET   Show only network-related debug info

GOOD SIGN IN DEBUG OUTPUT
  "NET/IB"     RDMA is being used  (correct)

BAD SIGN IN DEBUG OUTPUT
  "NET/Socket" Falling back to TCP (RDMA not working — investigate)

BANDWIDTH TEST (build nccl-tests first)
  # Build (ARM64 — DGX Spark is aarch64):
  make MPI=1 MPI_HOME=/usr/lib/aarch64-linux-gnu/openmpi \
       NCCL_HOME=/usr/local/cuda CUDA_HOME=/usr/local/cuda

  # Run all-reduce test:
  mpirun -n 2 \
    --host 192.168.100.1:1,192.168.100.2:1 \
    --mca btl_tcp_if_include enp1s0f0np0 \
    ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 1 -c 1 -n 100

  # Expected: ~12-15 GB/s effective all-reduce bandwidth
  # (200 Gbps = 25 GB/s raw; ring all-reduce 2-node efficiency ~50%)

GID INDEX (if needed for NCCL < 2.21 or ibv_modify_qp errors)
  show_gids
  # Find row: your interface + your IP + VER=V2
  # Note the INDEX value
  # Set: export NCCL_IB_GID_INDEX=<that_value>
  # With NCCL >= 2.21, this is selected automatically — do not set unless needed.
================================================================================

ADDENDUM F: SSH AND REMOTE ACCESS CHEATSHEET

================================================================================
  SSH AND REMOTE ACCESS CHEATSHEET
================================================================================

INITIAL DISCOVERY (before static IPs are configured)
  # mDNS discovery:
  ping dgx-spark-XXXX.local
  avahi-browse -t _ssh._tcp

  # Network scan:
  nmap -p 22 --open 192.168.1.0/24

LAPTOP SSH CONFIG (~/.ssh/config)
  Host spark1
      HostName 192.168.1.XXX        # Management network IP
      User dgx
      IdentityFile ~/.ssh/id_dgx_spark
      ServerAliveInterval 60
      ServerAliveCountMax 10

  Host spark2
      HostName 192.168.1.YYY
      User dgx
      IdentityFile ~/.ssh/id_dgx_spark
      ServerAliveInterval 60
      ServerAliveCountMax 10

KEY SETUP
  # Generate key:
  ssh-keygen -t ed25519 -C "laptop-to-dgx" -f ~/.ssh/id_dgx_spark

  # Copy to both nodes:
  ssh-copy-id -i ~/.ssh/id_dgx_spark.pub dgx@192.168.1.XXX
  ssh-copy-id -i ~/.ssh/id_dgx_spark.pub dgx@192.168.1.YYY

  # Set permissions:
  chmod 700 ~/.ssh
  chmod 600 ~/.ssh/config

SSH TUNNELS (open from laptop to access remote services)
  # vLLM API:
  ssh -L 8000:localhost:8000 spark1 -N &

  # Ray Dashboard:
  ssh -L 8265:localhost:8265 spark1 -N &

  # SGLang API:
  ssh -L 30000:localhost:30000 spark1 -N &

  # Open WebUI:
  ssh -L 3000:localhost:3000 spark1 -N &

  # All at once (using tunnel script — see Chapter 0.6):
  ~/bin/tunnel-spark.sh

TMUX (survive connection drops)
  tmux new -s work          # Start named session
  tmux attach -t work       # Reattach after disconnect
  Ctrl+B, D                 # Detach (leave running)
  tmux ls                   # List sessions

HOSTNAME RENAME
  sudo hostnamectl set-hostname spark1   # or spark2
  sudo nano /etc/hosts                   # Update 127.0.1.1 entry
  sudo reboot

SSHD HARDENING (/etc/ssh/sshd_config)
  PermitRootLogin no
  PasswordAuthentication no
  PubkeyAuthentication yes
  MaxAuthTries 3
  LoginGraceTime 20
  ClientAliveInterval 60
  ClientAliveCountMax 10

FAIL2BAN
  sudo apt install -y fail2ban
  sudo systemctl enable --now fail2ban

CLEAR OLD HOST KEY (after hostname/IP change)
  ssh-keygen -R old-hostname-or-ip
================================================================================

ADDENDUM G: COMPLETE SETUP CHECKLIST

================================================================================
  COMPLETE SETUP CHECKLIST — DGX SPARK DUAL-NODE CLUSTER
================================================================================

PHASE 0: INITIAL ACCESS
  [ ] Power on both DGX Spark systems
  [ ] Discover IPs via mDNS (dgx-spark-XXXX.local) or router DHCP table
  [ ] SSH into both systems using default credentials
  [ ] Set up SSH key authentication from your laptop to both nodes
  [ ] Configure ~/.ssh/config on your laptop
  [ ] Install tmux on both nodes
  [ ] Harden SSH (disable password auth, install fail2ban)

PHASE 1: HOSTNAME CONFIGURATION
  [ ] Rename spark1: sudo hostnamectl set-hostname spark1
  [ ] Update /etc/hosts on spark1 (127.0.1.1 entry)
  [ ] Rename spark2: sudo hostnamectl set-hostname spark2
  [ ] Update /etc/hosts on spark2 (127.0.1.1 entry)
  [ ] Reboot both nodes
  [ ] Verify: ssh spark1 'hostname'  =>  spark1
  [ ] Verify: ssh spark2 'hostname'  =>  spark2

PHASE 2: PHYSICAL AND NETWORK SETUP
  [ ] Connect QSFP56 cable between spark1 port 0 and spark2 port 0
  [ ] Verify link: ibstat (State: Active, Rate: 200)
  [ ] Configure Netplan on spark1: 192.168.100.1/24, MTU 9000
  [ ] Configure Netplan on spark2: 192.168.100.2/24, MTU 9000
  [ ] Apply Netplan: sudo netplan apply (both nodes)
  [ ] Configure PFC: sudo mlnx_qos -i enp1s0f0np0 --pfc 0,0,0,1,0,0,0,0
  [ ] Create roce-pfc.service systemd unit (both nodes)
  [ ] Load nvidia-peermem: sudo modprobe nvidia-peermem
  [ ] Make nvidia-peermem persistent: /etc/modules-load.d/
  [ ] Test ping: ping -c 4 192.168.100.2 (from spark1)
  [ ] Test RDMA BW: ib_send_bw (expect ~185-200 Gbps)
  [ ] Set up passwordless SSH between nodes (spark1 <-> spark2)
  [ ] Update /etc/hosts on both nodes (add spark1/spark2 entries)
  [ ] Configure firewall rules (all required ports, including 20000 for SGLang)
  [ ] Create /etc/nccl.conf on both nodes
  [ ] Create /etc/profile.d/nccl-spark.sh on both nodes

PHASE 3: SOFTWARE INSTALLATION
  [ ] Install pip, huggingface_hub on both nodes
  [ ] Download models to /models/ on both nodes
  [ ] Install vLLM + Ray on both nodes (in /opt/vllm-env)
  [ ] Pull SGLang Docker image on both nodes
  [ ] Pull TensorRT-LLM Docker image on both nodes
  [ ] Build nccl-tests (ARM64 build flags: aarch64-linux-gnu/openmpi)

PHASE 4: VALIDATION
  [ ] Run nccl-tests all_reduce_perf (expect ~12-15 GB/s)
  [ ] Start Ray cluster (spark1 head, spark2 worker)
  [ ] Verify: ray status shows 2 nodes with GPU resources
  [ ] Launch vLLM 70B (TP=1, PP=2)
  [ ] Test vLLM API with curl
  [ ] Launch SGLang 70B (TP=2, two nodes, dist-init-addr port 20000)
  [ ] Test SGLang API with curl
  [ ] Open SSH tunnels from laptop
  [ ] Access Ray Dashboard at http://localhost:8265
  [ ] Deploy Open WebUI, access at http://localhost:3000

PHASE 5: PRODUCTION HARDENING
  [ ] Create Docker Compose files on both nodes
  [ ] Verify Docker Compose command blocks use 'bash -c |' literal scalar
  [ ] Create systemd spark-cluster.service on both nodes
  [ ] Enable systemd service: sudo systemctl enable spark-cluster
  [ ] Test automatic restart: sudo reboot, verify services start
  [ ] Set up monitoring (DCGM, vLLM metrics, Ray Dashboard)
  [ ] Document your specific interface names and GID indices
  [ ] Verify VLLM_DISTRIBUTED_EXECUTOR_CONFIG uses valid JSON if set
================================================================================

CONCLUSION: YOUR PERSONAL AI SUPERCOMPUTER CLUSTER

You have now completed a comprehensive journey from the very first power-on of a headless DGX Spark workstation to a fully operational, production-grade dual-node LLM inference cluster — accessible entirely over SSH from your laptop.

With two DGX Spark systems connected by a QSFP56 cable and configured as described in this guide, you have a system capable of running 405-billion-parameter models locally, with complete data privacy and no cloud costs. You have three inference frameworks at your disposal: TensorRT-LLM for maximum performance when you need every last token per second, vLLM for versatile production serving with broad model compatibility, and SGLang for agentic applications that require structured outputs and efficient prefix reuse.

You have applied speculative decoding to reduce per-token latency by 2× to 4×, paged attention to maximize the number of concurrent users your system can serve, continuous batching to keep your GPUs busy at all times, and prefix caching to eliminate redundant computation for common prompt prefixes. You have configured NCCL for maximum inter-node bandwidth using RoCE v2 and GPUDirect RDMA, ensuring that the 200 Gbps link between your two nodes is used as efficiently as possible. You have chosen pipeline parallelism as the recommended strategy for single-GPU-per-node setups, minimizing inter-node communication overhead. And you have set up the entire system to operate headlessly — no monitor, no keyboard, no mouse — accessible from anywhere on your network with a single ssh spark1 command.

The most important thing to remember is that every configuration decision in this guide was made for a reason, and understanding those reasons is what allows you to adapt intelligently when your requirements change. When you need to serve more users, you know to increase the batch size and reduce the per-request memory allocation. When you need lower latency, you know to enable speculative decoding and chunked prefill. When you need to run a larger model, you know to use more aggressive quantization and pipeline parallelism.

Your personal AI supercomputer cluster is ready. The frontier-scale models are loaded. The only question now is: what will you build with it?