Wednesday, January 07, 2026

MASTERING LLM DEVELOPMENT ON THE NVIDIA DGX SPARK: A GUIDE TO INFERENCE, FINE-TUNING, AND PRODUCTION DEPLOYMENT

INTRODUCTION: UNLEASHING THE POWER OF PERSONAL AI SUPERCOMPUTING

Welcome to an exciting journey into the world of large language model development on the NVIDIA DGX Spark. This tutorial will transform you from a curious learner into a capable practitioner who can build, deploy, and fine-tune LLM applications with confidence. The DGX Spark represents a revolutionary shift in AI development, bringing supercomputer-level capabilities to your desk. With its GB10 Grace Blackwell Superchip delivering up to one petaFLOP of AI performance, 128GB of unified memory, and a sophisticated architecture designed specifically for AI workloads, this device enables you to work with models ranging from 7 billion to 200 billion parameters entirely locally.

Throughout this tutorial, we will explore the complete lifecycle of LLM application development. You will learn how to set up vLLM and SGLang, two cutting-edge inference frameworks that maximize the DGX Spark's capabilities. We will dive deep into programmatic access patterns, showing you how to build client applications that communicate with your DGX Spark over a network. You will master advanced techniques like request batching and speculative decoding that dramatically improve throughput and reduce latency. Finally, we will tackle fine-tuning, demonstrating how to customize models for your specific use cases using parameter-efficient methods like LoRA and QLoRA.

The approach taken here emphasizes practical, production-ready code. Every concept will be illustrated with working examples, and we will build toward a comprehensive running example that demonstrates all the techniques in a cohesive application. By the end of this tutorial, you will possess the knowledge and confidence to leverage the DGX Spark for serious AI development work.

UNDERSTANDING THE DGX SPARK ARCHITECTURE

Before we dive into installation and coding, let us establish a solid understanding of what makes the DGX Spark special. The heart of this system is the GB10 Grace Blackwell Superchip, which represents a fundamental rethinking of how CPUs and GPUs work together for AI workloads.

The Blackwell GPU component features 6,144 CUDA cores and 192 fifth-generation Tensor Cores optimized for FP4 precision. These Tensor Cores are specifically designed for the matrix multiplication operations that dominate LLM inference and training. The FP4 precision capability is particularly important because it allows the system to process four-bit quantized models with native hardware acceleration, dramatically increasing throughput for large models.

Paired with the GPU is a 20-core Arm CPU featuring a mix of high-performance Cortex-X925 cores and efficient Cortex-A725 cores. This heterogeneous CPU design allows the system to handle both compute-intensive tasks and background operations efficiently.

What truly sets the DGX Spark apart is its unified memory architecture. Using NVLink-C2C technology, the system creates a coherent memory space between the CPU and GPU. The 128GB of LPDDR5x memory is accessible to both processors without explicit data transfers. This eliminates the traditional bottleneck of moving data between CPU and GPU memory spaces. For LLM inference, this means that large model weights can be loaded once and accessed by both processors as needed. The 273 GB/s memory bandwidth ensures that this unified memory can feed data to the compute units at the rates they demand.
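
A quick way to see this unified memory from software is to query the device from PyTorch. The snippet below is a small sketch that assumes a CUDA-enabled PyTorch build is already available (for example, inside one of the NGC containers used later in this tutorial); the total memory reported by the GPU should be close to the full 128GB pool rather than a separate, smaller device memory.

import torch

# On the DGX Spark, the GPU reports the shared LPDDR5x pool as its memory.
props = torch.cuda.get_device_properties(0)
print(f"Device: {props.name}")
print(f"Total memory visible to the GPU: {props.total_memory / 1e9:.1f} GB")

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Currently free: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")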

The ConnectX-7 Smart NIC provides high-performance networking capabilities. This is crucial for our tutorial because it enables efficient client-server communication patterns. When you run inference servers on the DGX Spark and connect from client machines on the same network, this networking hardware ensures minimal latency and maximum throughput.

The system runs NVIDIA DGX OS, a customized Ubuntu Linux distribution that comes preconfigured with the NVIDIA AI software stack. This includes optimized CUDA libraries, container runtime support, and the tools needed for AI development. The operating system is designed to extract maximum performance from the hardware without requiring extensive manual configuration.

SETTING UP YOUR DGX SPARK FOR LLM DEVELOPMENT

Now that we understand the hardware, let us prepare the DGX Spark for our work. The setup process involves ensuring that the base system is properly configured, installing container runtime components, and preparing the networking environment.

First, verify that your DGX Spark is running the latest DGX OS. You can check this by opening a terminal and examining the system information. The OS should be based on Ubuntu 22.04 or later. Ensure that the NVIDIA drivers are version 580.95.05 or newer, as these drivers include critical support for the Blackwell GB10 GPU architecture.

Next, confirm that Docker is installed and properly configured with the NVIDIA Container Toolkit. This toolkit allows Docker containers to access the GPU resources. You can verify the installation by running a simple test. Open a terminal and execute the following command:

docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

This command pulls a basic CUDA container and runs the nvidia-smi utility inside it. If everything is configured correctly, you should see output displaying your GB10 GPU information, including its name, driver version, CUDA version, and current utilization. The output will show that the GPU has approximately 128GB of memory available, reflecting the unified memory architecture.

If the docker command fails with permission errors, you may need to add your user to the docker group. This can be done with the following command, followed by logging out and back in:

sudo usermod -aG docker $USER

With Docker confirmed working, we need to ensure that the system can pull containers from NVIDIA GPU Cloud. NGC is NVIDIA's registry for optimized containers. While many NGC containers are publicly accessible, you may want to configure authentication for access to all resources. Visit ngc.nvidia.com, create an account if you do not have one, and generate an API key. Then configure Docker to use this key:

docker login nvcr.io

When prompted, use the literal string $oauthtoken as the username and paste your NGC API key as the password. This authentication will allow Docker to pull containers from NVIDIA's private repositories.

Now let us prepare the networking environment. For this tutorial, we assume you have a client machine on the same local network as your DGX Spark. Identify the IP address of your DGX Spark by running:

ip addr show

Look for the network interface that is connected to your local network. It will typically be named something like eth0 or enp0s1. Note the inet address, which will be in a format like 192.168.1.100. This is the IP address that your client machines will use to connect to services running on the DGX Spark.

Ensure that your firewall allows incoming connections on the ports we will use. The default ports for vLLM and SGLang are typically 8000 and 30000 respectively. You can configure the firewall using ufw:

sudo ufw allow 8000/tcp
sudo ufw allow 30000/tcp
sudo ufw allow 8001/tcp

With these foundational steps complete, your DGX Spark is ready for installing the inference frameworks.

INSTALLING VLLM ON THE DGX SPARK

The vLLM framework is designed for high-throughput, memory-efficient LLM inference. It implements several advanced techniques including PagedAttention for efficient KV cache management and continuous batching for maximizing GPU utilization. Installing vLLM on the DGX Spark requires special attention to the Blackwell architecture and CUDA 13.0 support.

NVIDIA provides optimized containers for vLLM that are specifically built for the DGX Spark's GB10 GPU. This is the recommended installation method because it includes all necessary dependencies with versions that are tested and optimized for the hardware.

To install vLLM using the NGC container approach, we will pull the official NVIDIA vLLM image. Open a terminal on your DGX Spark and execute:

docker pull nvcr.io/nvidia/vllm:25.11-py3

This command downloads a container image that includes vLLM, PyTorch 2.9.0 with CUDA 13.0 bindings, Triton 3.5.0, and all necessary CUDA libraries optimized for Blackwell. The download is substantial, typically around 15-20 GB, so it may take some time depending on your internet connection.

Once the pull completes, verify that the container works correctly by running a simple test:

docker run --rm --gpus all nvcr.io/nvidia/vllm:25.11-py3 python -c "import vllm; print(vllm.__version__)"

This command starts a container, imports the vLLM library, prints its version, and then exits. You should see a version number printed, confirming that vLLM is installed and importable within the container environment.

For those who prefer a native installation rather than using containers, there is a one-command installation script available specifically for DGX Spark. This script automates the compilation of vLLM with all the necessary Blackwell-specific optimizations. To use this approach, you would execute:

curl -fsSL https://raw.githubusercontent.com/nvidia/vllm-dgx-spark/main/install.sh | bash

This script handles the installation of CUDA 13.0 support, compilation of MOE kernels, and configuration of PyTorch with the correct CUDA bindings. The installation process can take 30 minutes to an hour because it compiles C++ and CUDA code. The script requires approximately 50GB of free disk space and at least 8GB of RAM during the build process.

For this tutorial, we will use the container-based approach because it provides a consistent, reproducible environment and avoids potential compilation issues. The container also makes it easy to upgrade to newer versions of vLLM as they become available.

INSTALLING SGLANG ON THE DGX SPARK

SGLang is a high-performance serving framework that excels at structured generation, prefix caching, and speculative decoding. Like vLLM, NVIDIA provides an optimized container for SGLang that is specifically built for the DGX Spark.

To install SGLang, pull the official NGC container:

docker pull nvcr.io/nvidia/sglang:latest

The SGLang container is similarly sized to the vLLM container and includes all necessary dependencies for running on Blackwell architecture. Once the download completes, verify the installation:

docker run --rm --gpus all nvcr.io/nvidia/sglang:latest python -c "import sglang; print(sglang.__version__)"

You should see the SGLang version number printed, confirming successful installation.

SGLang can also be installed natively using pip or uv. The uv package manager is a fast Python package installer that SGLang recommends. To install SGLang natively, you would first install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then create a virtual environment and install SGLang:

uv venv sglang --python 3.12
source sglang/bin/activate
uv pip install "sglang[all]>=0.4.4.post1"

However, as with vLLM, we will use the container approach for consistency and to ensure we have all the Blackwell-specific optimizations.

Now that both frameworks are installed, we can begin exploring how to use them for inference.

LAUNCHING A VLLM INFERENCE SERVER

With vLLM installed, let us launch an inference server that will accept requests from client machines on the network. The vLLM server implements an OpenAI-compatible API, which means you can use the standard OpenAI Python client library to interact with it.

To start a vLLM server, we need to specify which model to load and configure the server parameters. For this example, we will use the Meta Llama 3.1 8B Instruct model, which is a capable instruction-following model that fits comfortably in the DGX Spark's memory.

Before we can use the model, we need to download it from Hugging Face. The DGX Spark will cache the model locally after the first download. You will need a Hugging Face account and may need to accept Meta's license agreement for Llama models. Set up your Hugging Face token as an environment variable:

export HF_TOKEN=your_huggingface_token_here

Now launch the vLLM server using Docker:

docker run -d \
  --name vllm-server \
  --gpus all \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:25.11-py3 \
  python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --max-model-len 8192

Let us break down what each part of this command does. The docker run command starts a new container. The -d flag runs it in detached mode, meaning it runs in the background. The --name flag gives the container a friendly name we can reference later. The --gpus all flag makes all GPUs available to the container, though the DGX Spark has only one GPU.

The -p 8000:8000 flag maps port 8000 inside the container to port 8000 on the host, making the server accessible from other machines on the network. The -e flag passes the Hugging Face token as an environment variable. The -v flag mounts the Hugging Face cache directory from the host into the container, so downloaded models persist across container restarts.

The remaining arguments configure vLLM itself. The --model argument specifies which model to load from Hugging Face. The --host 0.0.0.0 option makes the server listen on all network interfaces, not just localhost. The --dtype bfloat16 option selects 16-bit brain floating point precision, which provides a good balance between memory usage and accuracy. The --max-model-len 8192 option sets the maximum context length to 8192 tokens.

After running this command, vLLM will begin downloading the model if it is not already cached. For an 8B parameter model, this is approximately 16GB of data. Once the download completes, vLLM will load the model into GPU memory and start the server. You can monitor the progress by viewing the container logs:

docker logs -f vllm-server

You will see extensive output as vLLM initializes. Look for a line that says something like "Uvicorn running on http://0.0.0.0:8000". This indicates that the server is ready to accept requests. Press Control-C to stop following the logs; the server will continue running in the background.

To verify that the server is working, you can send a test request using curl from the DGX Spark itself:

curl http://localhost:8000/v1/models

This should return a JSON response listing the available models, confirming that the server is responding to requests.
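
If you plan to script against the server, for example to wait until the model has finished loading before starting a benchmark, a small readiness check is useful. The sketch below assumes the requests library is installed on the machine running it and simply polls the same /v1/models endpoint until it answers.

import time
import requests

def wait_for_server(base_url="http://localhost:8000", timeout=600):
    """Poll the OpenAI-compatible /v1/models endpoint until the server responds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/v1/models", timeout=5)
            if r.status_code == 200:
                print("Server is ready, serving:", r.json()["data"][0]["id"])
                return True
        except requests.RequestException:
            pass  # server is still starting up
        time.sleep(5)
    return False

if __name__ == "__main__":
    wait_for_server()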

LAUNCHING AN SGLANG INFERENCE SERVER

Now let us set up an SGLang server alongside our vLLM server. SGLang offers some unique capabilities that complement vLLM, particularly for structured generation and speculative decoding. We will run SGLang on a different port to avoid conflicts.

Launch the SGLang server with the following command:

docker run -d \
  --name sglang-server \
  --gpus all \
  -p 30000:30000 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/sglang:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --dtype bfloat16 \
  --context-length 8192

The structure of this command is very similar to the vLLM launch command. We use port 30000 instead of 8000, and the SGLang server is launched using sglang.launch_server instead of vllm.entrypoints.openai.api_server. The arguments are slightly different in naming but serve the same purposes.

Monitor the SGLang server startup:

docker logs -f sglang-server

You will see SGLang initialize its runtime, load the model, and start the server. Look for output indicating that the server is listening on port 30000. Once you see this, the server is ready.

Verify the SGLang server with a curl request:

curl http://localhost:30000/v1/models

You should receive a JSON response similar to the vLLM server, confirming that SGLang is operational.

Now we have both vLLM and SGLang running on the DGX Spark, each serving the same model but with different optimization strategies. In the following sections, we will build client applications that leverage these servers.

BUILDING A BASIC CLIENT APPLICATION FOR VLLM

With the vLLM server running on the DGX Spark, let us create a client application that runs on a separate computer within the same network. This client will send inference requests to the vLLM server and display the results.

The client application uses the OpenAI Python library because vLLM implements an OpenAI-compatible API. This means that code written for OpenAI's API works with minimal modifications when pointed at a vLLM server.

First, on your client machine, install the OpenAI library. Create a new Python virtual environment to keep dependencies isolated:

python3 -m venv llm_client_env
source llm_client_env/bin/activate
pip install openai

Now create a Python script called vllm_basic_client.py. This script will connect to the vLLM server and perform a simple inference request:

from openai import OpenAI

# Configure the client to point to your vLLM server
# Replace 192.168.1.100 with your DGX Spark's actual IP address
client = OpenAI(
    base_url="http://192.168.1.100:8000/v1",
    api_key="not-needed"
)

# Define a prompt for the model
prompt = "Explain the concept of quantum entanglement in simple terms that a high school student could understand."

# Send a chat completion request
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful physics teacher who explains complex concepts clearly."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.7,
    max_tokens=500
)

# Extract and display the response
answer = response.choices[0].message.content
print("Question:", prompt)
print("\nAnswer:", answer)
print("\nTokens used:", response.usage.total_tokens)

Let us examine this code in detail. We begin by importing the OpenAI class from the openai library. We then create an instance of this class, but instead of using OpenAI's servers, we point it to our vLLM server by setting the base_url parameter to the DGX Spark's IP address and port. The api_key parameter is required by the OpenAI library but vLLM does not actually validate it, so we can use any string.

We define a prompt asking for an explanation of quantum entanglement. This is a good test prompt because it requires the model to demonstrate both knowledge and the ability to simplify complex information.

The chat.completions.create method sends the request to the vLLM server. We pass several parameters. The model parameter must match the model name that the vLLM server loaded. The messages parameter is a list of message dictionaries following the chat format. We include a system message that sets the context for the model's behavior and a user message containing the actual question.

The temperature parameter controls randomness in the output. A value of 0.7 provides a good balance between creativity and coherence. The max_tokens parameter limits the length of the response to 500 tokens.

The response object contains the model's output along with metadata. We extract the actual text content from response.choices[0].message.content. The response.usage.total_tokens field tells us how many tokens were used, which is useful for understanding the computational cost of the request.

Run this script on your client machine:

python vllm_basic_client.py

You should see the question printed, followed by a detailed explanation of quantum entanglement, and finally the token count. The response should be coherent and appropriate for a high school audience, demonstrating that the model is functioning correctly.

This basic example establishes the fundamental pattern for interacting with the vLLM server. In the next sections, we will build on this foundation to implement more advanced features.
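
One useful variation on this pattern is streaming, which lets your application display text as the model generates it instead of waiting for the complete response. The sketch below uses the standard stream=True option of the OpenAI client against the same vLLM server; the endpoint and model name are the same assumptions as in the previous example.

from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.100:8000/v1", api_key="not-needed")

# Request a streamed chat completion; chunks arrive as tokens are generated.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "List three benefits of unified memory for AI workloads."}],
    max_tokens=200,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Streaming does not change the total generation time, but it dramatically improves perceived latency because the first tokens appear almost immediately.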

BUILDING A BASIC CLIENT APPLICATION FOR SGLANG

Now let us create a client for the SGLang server. SGLang offers two client interfaces: an OpenAI-compatible API similar to vLLM, and a native SGLang API that provides additional features. We will explore both.

First, let us use the OpenAI-compatible interface. Create a script called sglang_basic_client.py:

from openai import OpenAI

# Configure the client to point to your SGLang server
client = OpenAI(
    base_url="http://192.168.1.100:30000/v1",
    api_key="not-needed"
)

# Define a prompt
prompt = "Write a short poem about the beauty of mathematics."

# Send a chat completion request
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a creative poet with a love for mathematics."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.8,
    max_tokens=200
)

# Display the response
poem = response.choices[0].message.content
print("Prompt:", prompt)
print("\nPoem:\n", poem)

This script is nearly identical to the vLLM client, with the only difference being the base_url pointing to port 30000 where SGLang is running. This demonstrates the power of the OpenAI-compatible API standard: the same client code works with different inference backends.

Run the script:

python sglang_basic_client.py

You should receive a creative poem about mathematics, demonstrating that the SGLang server is functioning correctly.

Now let us explore SGLang's native API, which provides more control and access to SGLang-specific features. Install the SGLang client library on your client machine:

pip install "sglang[all]"

Create a new script called sglang_native_client.py:

import sglang as sgl

# Set the default backend to your remote SGLang server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://192.168.1.100:30000"))

# Define an SGLang function for generation
@sgl.function
def generate_story(s, topic, length):
    s += f"Write a {length} story about {topic}."
    s += sgl.gen("story", max_tokens=300, temperature=0.7)

# Run the function
state = generate_story.run(topic="a robot learning to paint", length="short")

# Access the generated content
print("Generated Story:")
print(state["story"])

This example demonstrates SGLang's functional programming approach. We use sgl.set_default_backend to configure the connection to our remote server. Then we define a function using the @sgl.function decorator. This function takes parameters and uses them to construct a prompt dynamically.

Inside the function, we use the special s object to build the prompt. The += operator appends to the prompt. The sgl.gen call marks a point where the model should generate text, and we give this generated text a name ("story") so we can retrieve it later.

We call the function with run, passing our parameters. The function executes on the remote SGLang server, and we get back a state object containing the results. We can access the generated story using dictionary-style indexing with the name we assigned.

Run this script:

python sglang_native_client.py

You should see a creative short story about a robot learning to paint. The native SGLang API provides more structure and makes it easier to build complex prompting workflows, which we will explore further in the advanced sections.
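
Because the prompt state s persists across calls, an SGLang function can chain several generation steps, feeding earlier outputs into later prompts. The following is an illustrative sketch rather than part of the tutorial's running example: it first generates a title, then a story that uses it.

import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://192.168.1.100:30000"))

@sgl.function
def titled_story(s, topic):
    # Step 1: generate a short title and capture it under the name "title".
    s += f"Suggest a short, catchy title for a story about {topic}.\nTitle:"
    s += sgl.gen("title", max_tokens=16, temperature=0.7, stop="\n")
    # Step 2: the generated title is now part of the prompt state,
    # so the next generation step can build on it.
    s += "\nNow write a three-sentence story with that title.\nStory:"
    s += sgl.gen("story", max_tokens=150, temperature=0.7)

state = titled_story.run(topic="a lighthouse keeper who befriends a whale")
print("Title:", state["title"].strip())
print("Story:", state["story"].strip())

Because each sgl.gen result is named, multi-step workflows like this remain easy to inspect and debug.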

IMPLEMENTING REQUEST BATCHING FOR IMPROVED THROUGHPUT

One of the key advantages of both vLLM and SGLang is their ability to efficiently batch multiple requests. While the servers handle batching automatically through continuous batching mechanisms, we can maximize throughput by sending multiple requests concurrently from the client side.

Let us create a client that sends multiple requests in parallel and measures the performance improvement. Create a script called batched_inference_client.py:

from openai import OpenAI
import concurrent.futures
import time

# Configure the client
client = OpenAI(
    base_url="http://192.168.1.100:8000/v1",
    api_key="not-needed"
)

# Define a set of diverse prompts
prompts = [
    "Explain how photosynthesis works.",
    "What are the main causes of climate change?",
    "Describe the water cycle in nature.",
    "How do vaccines help prevent diseases?",
    "What is the theory of evolution?",
    "Explain the concept of gravity.",
    "How does the human immune system work?",
    "What causes earthquakes?",
    "Describe the process of DNA replication.",
    "How do airplanes generate lift?"
]

# Function to send a single request
def send_request(prompt):
    try:
        response = client.chat.completions.create(
            model="meta-llama/Meta-Llama-3.1-8B-Instruct",
            messages=[
                {"role": "system", "content": "You are a knowledgeable science educator."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=200
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error processing prompt '{prompt}': {e}")
        return None

# Sequential processing
print("Processing requests sequentially...")
start_time = time.time()
sequential_results = []
for prompt in prompts:
    result = send_request(prompt)
    sequential_results.append(result)
sequential_time = time.time() - start_time
print(f"Sequential processing took {sequential_time:.2f} seconds")

# Parallel processing
print("\nProcessing requests in parallel...")
start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    parallel_results = list(executor.map(send_request, prompts))
parallel_time = time.time() - start_time
print(f"Parallel processing took {parallel_time:.2f} seconds")

# Calculate speedup
speedup = sequential_time / parallel_time
print(f"\nSpeedup from batching: {speedup:.2f}x")

# Display one example result
print("\nExample result:")
print(f"Prompt: {prompts[0]}")
if parallel_results[0]:
    print(f"Response: {parallel_results[0]}")

This script demonstrates the power of concurrent request submission. We define a list of ten different science questions. The send_request function encapsulates the logic for sending a single request to the vLLM server, with error handling to gracefully manage any failures.

We first process all requests sequentially using a simple for loop. This represents the baseline performance without any batching benefits. We measure the total time taken.

Then we process the same requests using concurrent.futures.ThreadPoolExecutor. This Python standard library module allows us to execute functions in parallel using a thread pool. We create a pool with ten worker threads, matching the number of requests. The executor.map method applies our send_request function to each prompt in parallel.

From the client's perspective, we are sending ten separate HTTP requests. However, the vLLM server's continuous batching mechanism automatically groups these concurrent requests and processes them together on the GPU. This dramatically improves throughput because the GPU can process multiple sequences in parallel, sharing the computational overhead of loading model weights and performing matrix operations.

Run this script:

python batched_inference_client.py

You will see output showing the time taken for sequential processing, then parallel processing, and finally the speedup factor. On a typical setup, you should see a speedup of 3x to 5x or more, depending on the specific model and request characteristics. The exact speedup depends on factors like the length of the prompts, the requested output length, and the GPU's ability to parallelize the workload.

The script also displays one example result to verify that the responses are still high quality. The batching optimization is transparent to the application logic; the responses are identical to what you would get from sequential processing.

This batching approach is crucial for production applications that need to serve multiple users concurrently. By sending requests in parallel, you allow the inference server to maximize GPU utilization and minimize per-request latency.
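
If your client application is itself asynchronous, for example a web service built on an async framework, you can achieve the same concurrency without a thread pool by using the async variant of the OpenAI client. A minimal sketch, assuming the same server address and model name as above:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://192.168.1.100:8000/v1", api_key="not-needed")

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Explain how photosynthesis works.", "What causes earthquakes?"]
    # asyncio.gather issues the requests concurrently; the server batches them.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(prompt, "->", answer[:80], "...")

if __name__ == "__main__":
    asyncio.run(main())

Either approach produces the same server-side continuous batching; choose whichever matches your application's concurrency model.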

IMPLEMENTING BATCH INFERENCE WITH SGLANG NATIVE API

SGLang's native API provides a more elegant approach to batch inference through its run_batch method. This allows you to submit multiple requests as a batch and receive all results together. Let us create a client that demonstrates this capability.

Create a script called sglang_batch_client.py:

import sglang as sgl
import time

# Set the default backend to your remote SGLang server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://192.168.1.100:30000"))

# Define an SGLang function for answering questions
@sgl.function
def answer_question(s, question):
    s += sgl.system("You are a knowledgeable science educator.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=200, temperature=0.7))

# Prepare a list of questions
questions = [
    "Explain how photosynthesis works.",
    "What are the main causes of climate change?",
    "Describe the water cycle in nature.",
    "How do vaccines help prevent diseases?",
    "What is the theory of evolution?",
    "Explain the concept of gravity.",
    "How does the human immune system work?",
    "What causes earthquakes?",
    "Describe the process of DNA replication.",
    "How do airplanes generate lift?"
]

# Run batch inference
print("Processing batch of questions...")
start_time = time.time()

try:
    results = answer_question.run_batch(
        [{"question": q} for q in questions],
        progress_bar=True
    )
    batch_time = time.time() - start_time

    print(f"\nBatch processing took {batch_time:.2f} seconds")
    print(f"Average time per question: {batch_time / len(questions):.2f} seconds")

    # Display results
    print("\n" + "="*80)
    print("RESULTS")
    print("="*80)
    for i, (question, state) in enumerate(zip(questions, results)):
        print(f"\nQuestion {i+1}: {question}")
        print(f"Answer: {state['answer']}")
        print("-"*80)
except Exception as e:
    print(f"Error during batch processing: {e}")

This script showcases SGLang's batch processing capabilities. We define an SGLang function that uses the chat template methods (sgl.system, sgl.user, sgl.assistant) to structure the conversation properly. The sgl.gen call within sgl.assistant specifies where the model should generate the answer.

The run_batch method accepts a list of dictionaries, where each dictionary contains the parameters for one function call. We use a list comprehension to create this list from our questions. The progress_bar=True parameter displays a progress indicator during processing.

The run_batch method handles all the complexity of sending requests to the server, managing the batching, and collecting results. It returns a list of state objects, one for each input.

Run this script:

python sglang_batch_client.py

You will see a progress bar as the batch processes, followed by timing information and all the results. The batch processing approach is particularly efficient because SGLang can optimize the execution of all requests together, sharing computation where possible.

IMPLEMENTING SPECULATIVE DECODING WITH SGLANG

Speculative decoding is an advanced optimization technique that can significantly accelerate inference without sacrificing output quality. The technique uses a smaller, faster draft model to propose multiple tokens, which are then verified in parallel by the larger target model. This reduces the number of sequential forward passes required, decreasing latency.
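
To build intuition before configuring the server, here is a deliberately simplified sketch of a single speculative step with greedy verification. The draft_model and target_model objects and their next_token and score_tokens methods are hypothetical placeholders; real implementations such as SGLang also handle probabilistic acceptance, KV-cache reuse, and batching.

def speculative_step(target_model, draft_model, context, k=5):
    """Illustrative only: propose k tokens with the draft model, verify with the target."""
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    draft_tokens, draft_context = [], list(context)
    for _ in range(k):
        token = draft_model.next_token(draft_context)   # hypothetical helper
        draft_tokens.append(token)
        draft_context.append(token)

    # 2. The large target model scores all k proposed positions in a single
    #    forward pass, producing its own prediction at each position plus one extra.
    target_tokens = target_model.score_tokens(context, draft_tokens)  # hypothetical helper

    # 3. Accept the longest prefix where draft and target agree; at the first
    #    disagreement, take the target's token instead and stop.
    accepted = []
    for i, token in enumerate(draft_tokens):
        if token == target_tokens[i]:
            accepted.append(token)
        else:
            accepted.append(target_tokens[i])
            break
    else:
        accepted.append(target_tokens[k])  # bonus token when every proposal matched
    return accepted  # between 1 and k+1 tokens per target forward pass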

SGLang has excellent support for speculative decoding. To use this feature, we need to configure the SGLang server with both a target model and a draft model. Let us restart the SGLang server with speculative decoding enabled.

First, stop the existing SGLang server:

docker stop sglang-server
docker rm sglang-server

Now launch a new SGLang server with speculative decoding configuration. For this example, we will use Llama 3.1 8B as the target model and Llama 3.2 1B as the draft model:

docker run -d \
  --name sglang-server \
  --gpus all \
  -p 30000:30000 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/sglang:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --speculative-draft-model-path meta-llama/Llama-3.2-1B-Instruct \
  --speculative-num-steps 5 \
  --host 0.0.0.0 \
  --port 30000 \
  --dtype bfloat16 \
  --context-length 8192

The new parameters configure speculative decoding. The --speculative-draft-model-path option specifies the smaller draft model. The --speculative-num-steps parameter controls how many tokens the draft model proposes before verification. A value of 5 is a good starting point, balancing speculation depth with verification overhead.

Monitor the server startup:

docker logs -f sglang-server

You will see SGLang load both the target model and the draft model. The draft model is much smaller, so it loads quickly. Once both models are loaded, the server is ready.

Now let us create a client that demonstrates the performance improvement from speculative decoding. Create a script called speculative_decoding_client.py:

from openai import OpenAI
import time

# Configure the client
client = OpenAI(
    base_url="http://192.168.1.100:30000/v1",
    api_key="not-needed"
)

# Define a prompt that requires substantial generation
prompt = """Write a detailed technical explanation of how a convolutional neural network 
processes an image, including the roles of convolutional layers, pooling layers, 
and fully connected layers. Include specific examples of filter operations."""

# Function to generate and time a response
def generate_with_timing(prompt, num_runs=3):
    times = []
    response_text = None
    tokens_generated = 0
    
    for i in range(num_runs):
        try:
            start_time = time.time()
            response = client.chat.completions.create(
                model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                messages=[
                    {"role": "system", "content": "You are an expert in deep learning and computer vision."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.7,
                max_tokens=500
            )
            elapsed_time = time.time() - start_time
            times.append(elapsed_time)
            
            if i == 0:
                response_text = response.choices[0].message.content
                tokens_generated = response.usage.completion_tokens
        except Exception as e:
            print(f"Error during generation run {i+1}: {e}")
            continue
    
    if not times:
        return None, 0, 0
    
    avg_time = sum(times) / len(times)
    tokens_per_second = tokens_generated / avg_time if avg_time > 0 else 0
    
    return response_text, avg_time, tokens_per_second

# Generate response
print("Generating response with speculative decoding enabled...")
response, avg_time, tokens_per_sec = generate_with_timing(prompt)

if response:
    print(f"\nAverage generation time: {avg_time:.2f} seconds")
    print(f"Tokens per second: {tokens_per_sec:.2f}")
    print(f"\nGenerated response:\n{response}")
else:
    print("Failed to generate response")

This script measures the performance of speculative decoding. We define a prompt that requires a lengthy, detailed response, which gives speculative decoding more opportunity to demonstrate its benefits. The generate_with_timing function runs the generation multiple times and calculates the average time and throughput, with error handling for robustness.

Run this script:

python speculative_decoding_client.py

You will see the generation time and tokens per second. The exact numbers depend on the specific prompt and model, but with speculative decoding enabled, you should see significantly higher tokens per second compared to standard autoregressive generation.

The beauty of speculative decoding is that it is essentially transparent to the client. The output distribution matches what you would get without speculation (and is identical under greedy decoding); the technique only changes the internal processing strategy. The draft model proposes tokens that the target model then verifies, accepting correct predictions and rejecting incorrect ones. This allows multiple tokens to be accepted per forward pass of the target model, dramatically improving throughput.
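
If you want to quantify the gain on your own hardware, you can reuse the same endpoints to compare throughput: the vLLM server on port 8000 (no speculation in this setup) against the SGLang server on port 30000 (speculation enabled). The two servers differ in more than just speculation, so treat the numbers as indicative rather than a controlled experiment. A minimal sketch, assuming both servers are running as configured earlier:

from openai import OpenAI
import time

PROMPT = "Explain, in detail, how transformers use attention to model long-range dependencies."

def tokens_per_second(base_url: str) -> float:
    client = OpenAI(base_url=base_url, api_key="not-needed")
    start = time.time()
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.0,   # greedy decoding for a fairer comparison
        max_tokens=400,
    )
    elapsed = time.time() - start
    return response.usage.completion_tokens / elapsed

print(f"vLLM   (no speculation): {tokens_per_second('http://192.168.1.100:8000/v1'):.1f} tokens/s")
print(f"SGLang (speculative):    {tokens_per_second('http://192.168.1.100:30000/v1'):.1f} tokens/s")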

FINE-TUNING LLMS WITH LORA ON THE DGX SPARK

Now that we have mastered inference, let us explore fine-tuning. Fine-tuning allows you to customize a pre-trained model for your specific use case by training it on your own data. The DGX Spark's 128GB of unified memory makes it capable of fine-tuning models up to 70 billion parameters using parameter-efficient techniques.

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that dramatically reduces memory requirements and training time. Instead of updating all model weights, LoRA freezes the original model and injects small trainable matrices into the attention layers. This can reduce trainable parameters by over 90 percent while maintaining high performance.
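
Before wiring up the full training pipeline, it helps to see exactly what LoRA adds to a layer. The following is a minimal, illustrative sketch of a LoRA-wrapped linear layer in plain PyTorch; it is not how the PEFT library implements adapters internally, but the arithmetic is the same: the frozen weight is augmented by a low-rank product scaled by alpha divided by the rank.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: output = base(x) + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # freeze original weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)     # down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)    # up-projection
        nn.init.zeros_(self.lora_B.weight)                           # start as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable LoRA parameters for one 4096x4096 layer: {trainable:,}")  # only A and B train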

Let us set up a fine-tuning environment on the DGX Spark. We will use a dedicated fine-tuning container that includes all necessary tools. First, create a directory structure for our fine-tuning project:

mkdir -p ~/llm-finetuning/data
mkdir -p ~/llm-finetuning/output
cd ~/llm-finetuning

Now let us prepare a sample dataset. For this tutorial, we will create a small dataset in Alpaca format, which is a common format for instruction fine-tuning. Create a file called data/training_data.json:

[
    {
        "instruction": "Explain the concept of machine learning to a beginner.",
        "input": "",
        "output": "Machine learning is a branch of artificial intelligence where computers learn from data without being explicitly programmed. Instead of following fixed rules, machine learning algorithms identify patterns in data and use those patterns to make predictions or decisions. For example, a machine learning system can learn to recognize cats in photos by studying thousands of cat images, rather than being given explicit rules about what makes a cat."
    },
    {
        "instruction": "What is the difference between supervised and unsupervised learning?",
        "input": "",
        "output": "Supervised learning uses labeled data, where each example has a known correct answer. The algorithm learns to map inputs to outputs by studying these examples. For instance, training a spam detector with emails labeled as spam or not spam. Unsupervised learning works with unlabeled data, finding hidden patterns or structures without predefined answers. An example is clustering customers into groups based on purchasing behavior without knowing the groups in advance."
    },
    {
        "instruction": "Describe what a neural network is.",
        "input": "",
        "output": "A neural network is a computational model inspired by the human brain. It consists of layers of interconnected nodes called neurons. Each neuron receives inputs, applies a mathematical transformation, and passes the result to the next layer. The network learns by adjusting the strength of connections between neurons during training. Neural networks excel at recognizing complex patterns in data, making them powerful tools for tasks like image recognition, natural language processing, and game playing."
    },
    {
        "instruction": "What is deep learning?",
        "input": "",
        "output": "Deep learning is a subset of machine learning that uses neural networks with multiple layers (hence 'deep'). These deep neural networks can automatically learn hierarchical representations of data. For example, in image recognition, early layers might detect edges, middle layers detect shapes, and later layers detect complex objects. Deep learning has achieved remarkable success in areas like computer vision, speech recognition, and natural language understanding."
    },
    {
        "instruction": "Explain what reinforcement learning is.",
        "input": "",
        "output": "Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize cumulative rewards over time. It's like training a dog with treats - the agent learns which actions lead to positive outcomes. This approach has been successfully used in game playing (like AlphaGo), robotics, and autonomous systems."
    }
]

This is a minimal dataset for demonstration purposes. In a real fine-tuning scenario, you would want at least 1000 high-quality examples covering the full range of tasks you want the model to perform.
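
Before launching a long training run, it is worth confirming that every record has the fields the formatting function will expect. A small validation sketch for the Alpaca-style file created above:

import json

REQUIRED_KEYS = {"instruction", "input", "output"}

with open("data/training_data.json") as f:
    records = json.load(f)

problems = []
for i, record in enumerate(records):
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"Record {i}: missing keys {sorted(missing)}")
    elif not record["instruction"].strip() or not record["output"].strip():
        problems.append(f"Record {i}: empty instruction or output")

print(f"Loaded {len(records)} records, found {len(problems)} problem(s)")
for problem in problems:
    print(" -", problem)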

Now let us create a fine-tuning script. Create a file called finetune_lora.py:

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import os

# Configuration
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./output/llama-3.1-8b-lora"
DATA_PATH = "./data/training_data.json"

# LoRA configuration
lora_config = LoraConfig(
    r=16,                    # Rank of the low-rank matrices
    lora_alpha=32,           # Scaling factor (typically 2x rank)
    target_modules=[         # Which modules to apply LoRA to
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ],
    lora_dropout=0.05,       # Dropout for regularization
    bias="none",             # Don't train bias parameters
    task_type="CAUSAL_LM"    # Causal language modeling task
)

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    fp16=False,
    bf16=True,               # Use bfloat16 for better numerical stability
    gradient_checkpointing=True,
    optim="adamw_torch"
)

def format_instruction(example):
    """Format examples in instruction-following format."""
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

def main():
    print("Loading tokenizer and model...")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True
    )
    
    # Prepare model for LoRA training
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    
    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
    
    # Load and prepare dataset
    print("Loading dataset...")
    dataset = load_dataset("json", data_files=DATA_PATH)
    dataset = dataset.map(format_instruction, remove_columns=dataset["train"].column_names)
    
    # Tokenize dataset
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=512,
            padding="max_length"
        )
    
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text"]
    )
    
    # Create data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        data_collator=data_collator
    )
    
    # Train
    print("Starting training...")
    trainer.train()
    
    # Save final model
    print("Saving model...")
    trainer.save_model(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    
    print(f"Training complete! Model saved to {OUTPUT_DIR}")

if __name__ == "__main__":
    main()

This script implements a complete LoRA fine-tuning pipeline. Let us examine the key components:

The LoraConfig specifies the LoRA hyperparameters. The rank r=16 determines the dimensionality of the low-rank matrices. Higher ranks provide more capacity but require more memory. The lora_alpha scaling factor is typically set to 2x the rank. The target_modules list specifies which layers receive LoRA adapters; we target all the projection layers in the attention mechanism and the MLP layers.

The TrainingArguments configure the training process. We use a batch size of 4 with gradient accumulation over 4 steps, giving an effective batch size of 16. The learning rate of 2e-4 is a good starting point for LoRA. We enable bfloat16 precision and gradient checkpointing to reduce memory usage.

The format_instruction function converts our dataset into the instruction-following format that the model expects. The tokenize_function prepares the text for the model by converting it to token IDs.

Now let us run the fine-tuning. We will use a Docker container with all the necessary dependencies:

docker run --rm -it \
  --gpus all \
  -v ~/llm-finetuning:/workspace \
  -w /workspace \
  -e HF_TOKEN=$HF_TOKEN \
  nvcr.io/nvidia/pytorch:25.01-py3 \
  bash

Inside the container, install the required packages:

pip install transformers datasets peft accelerate bitsandbytes

Now run the fine-tuning script:

python finetune_lora.py

You will see the model loading, followed by training progress. The script will print the number of trainable parameters, which should be only a small fraction of the total (typically well under two percent for this configuration), demonstrating LoRA's efficiency.

Training on this small dataset will complete quickly, typically in a few minutes. In a real scenario with a larger dataset, training might take several hours, but the DGX Spark's powerful GPU ensures efficient processing.

FINE-TUNING LARGE MODELS WITH QLORA

For larger models that do not fit in memory even with LoRA, we can use QLoRA (Quantized LoRA). QLoRA quantizes the base model to 4-bit precision while keeping the LoRA adapters in higher precision. This dramatically reduces memory usage, allowing the DGX Spark to fine-tune models up to 70 billion parameters.
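
As a rough back-of-the-envelope check (illustrative arithmetic, not a measurement), the numbers below show why a 70 billion parameter model becomes tractable within 128GB once the base weights are held in 4-bit NF4 and only LoRA adapters are trained:

# Rough QLoRA memory estimate for a 70B model; all figures are approximations.
params = 70e9

base_weights_nf4 = params * 0.5 / 1e9        # ~0.5 bytes per parameter in NF4   -> ~35 GB
lora_params = 0.01 * params                  # assume adapters are ~1% of parameters
adapters_bf16 = lora_params * 2 / 1e9        # adapters kept in bfloat16          -> ~1.4 GB
gradients_bf16 = lora_params * 2 / 1e9       # gradients only for the adapters    -> ~1.4 GB
optimizer_8bit = lora_params * 2 / 1e9       # two Adam states at one byte each   -> ~1.4 GB

print(f"Base weights (NF4):       ~{base_weights_nf4:.0f} GB")
print(f"LoRA adapters (bf16):     ~{adapters_bf16:.1f} GB")
print(f"Adapter gradients (bf16): ~{gradients_bf16:.1f} GB")
print(f"Optimizer states (8-bit): ~{optimizer_8bit:.1f} GB")
# Activations, workspace buffers, and quantization overhead add more on top,
# which is why gradient checkpointing and a per-device batch size of 1 are used.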

Let us create a QLoRA fine-tuning script for a larger model. Create a file called finetune_qlora.py:

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import os

# Configuration
MODEL_NAME = "meta-llama/Meta-Llama-3.1-70B-Instruct"
OUTPUT_DIR = "./output/llama-3.1-70b-qlora"
DATA_PATH = "./data/training_data.json"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# QLoRA configuration (higher rank for larger model)
lora_config = LoraConfig(
    r=64,                    # Higher rank for 70B model
    lora_alpha=128,          # 2x rank
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ],
    lora_dropout=0.1,        # Higher dropout for regularization
    bias="none",
    task_type="CAUSAL_LM"
)

# Training arguments optimized for QLoRA
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-4,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    logging_steps=5,
    save_steps=50,
    save_total_limit=2,
    fp16=False,
    bf16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",  # Memory-efficient optimizer
    max_grad_norm=0.3
)

def format_instruction(example):
    """Format examples in instruction-following format."""
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

def main():
    print("Loading tokenizer and model...")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    
    # Load model with 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )
    
    # Prepare model for QLoRA training
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    
    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
    
    # Load and prepare dataset
    print("Loading dataset...")
    dataset = load_dataset("json", data_files=DATA_PATH)
    dataset = dataset.map(format_instruction, remove_columns=dataset["train"].column_names)
    
    # Tokenize dataset
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=512,
            padding="max_length"
        )
    
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text"]
    )
    
    # Create data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        data_collator=data_collator
    )
    
    # Train
    print("Starting training...")
    trainer.train()
    
    # Save final model
    print("Saving model...")
    trainer.save_model(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    
    print(f"Training complete! Model saved to {OUTPUT_DIR}")

if __name__ == "__main__":
    main()

The key differences from the LoRA script are:

  1. We use BitsAndBytesConfig to enable 4-bit quantization with NF4 (Normal Float 4) quantization type
  2. We use a higher LoRA rank (64 instead of 16) because larger models benefit from more adapter capacity
  3. We use a smaller batch size (1) with more gradient accumulation to manage memory
  4. We use the paged_adamw_8bit optimizer which stores optimizer states in 8-bit format

To run this script, you would use the same Docker container approach as before. Note that downloading and loading a 70B model takes significantly longer than an 8B model, so be patient during initialization.

USING A FINE-TUNED MODEL FOR INFERENCE

After fine-tuning, you need to load your custom model for inference. There are two approaches: merging the LoRA weights into the base model, or loading the LoRA adapter separately.

Let us create a script that loads a fine-tuned model and runs inference. Create a file called inference_finetuned.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Configuration
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
LORA_WEIGHTS = "./output/llama-3.1-8b-lora"

def load_model_and_tokenizer():
    """Load the base model with LoRA adapter."""
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    
    print("Loading base model...")
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    
    print("Loading LoRA adapter...")
    model = PeftModel.from_pretrained(base_model, LORA_WEIGHTS)
    
    # Optionally merge the adapter into the base model for faster inference
    # model = model.merge_and_unload()
    
    return model, tokenizer

def generate_response(model, tokenizer, instruction, input_text=""):
    """Generate a response using the fine-tuned model."""
    # Format the prompt
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode and extract only the response part
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = full_text.split("### Response:\n")[-1]
    
    return response

def main():
    # Load model
    model, tokenizer = load_model_and_tokenizer()
    
    # Test with some prompts
    test_instructions = [
        "Explain the concept of machine learning to a beginner.",
        "What is deep learning?",
        "Describe what reinforcement learning is."
    ]
    
    print("\n" + "="*80)
    print("TESTING FINE-TUNED MODEL")
    print("="*80)
    
    for instruction in test_instructions:
        print(f"\nInstruction: {instruction}")
        try:
            response = generate_response(model, tokenizer, instruction)
            print(f"Response: {response}")
        except Exception as e:
            print(f"Error generating response: {e}")
        print("-"*80)

if __name__ == "__main__":
    main()

This script demonstrates how to load and use a fine-tuned model. The PeftModel.from_pretrained method loads the LoRA adapter on top of the base model. During inference, both the base model and the adapter are used together.

If you want faster inference, you can merge the adapter into the base model using model.merge_and_unload(). This creates a single model with the LoRA weights merged in, eliminating the overhead of the adapter mechanism during inference.

Run the script:

python inference_finetuned.py

You should see responses that reflect the style and knowledge from your training data.

SERVING A FINE-TUNED MODEL WITH VLLM

To serve your fine-tuned model at scale, you can load it into vLLM. The vLLM server supports LoRA adapters, allowing you to serve multiple fine-tuned versions of the same base model efficiently.

First, let us merge the LoRA weights into the base model to create a standalone model that vLLM can load directly. Create a script called merge_lora.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
LORA_WEIGHTS = "./output/llama-3.1-8b-lora"
OUTPUT_DIR = "./output/llama-3.1-8b-merged"

print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(base_model, LORA_WEIGHTS)

print("Merging adapter into base model...")
model = model.merge_and_unload()

print("Saving merged model...")
model.save_pretrained(OUTPUT_DIR)

print("Saving tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"Merged model saved to {OUTPUT_DIR}")

Run this script to create the merged model:

python merge_lora.py

Now you can launch a vLLM server with your fine-tuned model:

docker run -d \
  --name vllm-finetuned \
  --gpus all \
  -p 8001:8001 \
  -v ~/llm-finetuning/output/llama-3.1-8b-merged:/model \
  nvcr.io/nvidia/vllm:25.11-py3 \
  python -m vllm.entrypoints.openai.api_server \
  --model /model \
  --served-model-name meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8001 \
  --dtype bfloat16 \
  --max-model-len 8192

This launches vLLM serving your fine-tuned model on port 8001. The --served-model-name flag keeps the model ID identical to the base model's, so you can reuse all the client code we developed earlier by changing only the port number.
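Merging is the simplest route, but if you would rather keep the adapter separate (for example, to serve several adapters on one base model), vLLM can also load LoRA adapters directly. The following is a minimal sketch, assuming your vLLM build supports the --enable-lora and --lora-modules flags and that the base model weights are accessible inside the container as in your earlier vLLM setup:

docker run -d \
  --name vllm-lora \
  --gpus all \
  -p 8001:8001 \
  -v ~/llm-finetuning/output/llama-3.1-8b-lora:/adapter \
  nvcr.io/nvidia/vllm:25.11-py3 \
  python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules finetuned-lora=/adapter \
  --host 0.0.0.0 \
  --port 8001 \
  --dtype bfloat16 \
  --max-model-len 8192

Clients would then pass model="finetuned-lora" in their requests to use the adapter; requests that name the base model bypass it.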

COMPREHENSIVE PRODUCTION EXAMPLE: MULTI-MODEL INFERENCE SYSTEM

Let us bring everything together into a comprehensive production-ready example. We will build a system that serves multiple models (base and fine-tuned) and intelligently routes requests based on the task type.

Create a file called production_inference_system.py:

from openai import OpenAI
import concurrent.futures
import time
from typing import List, Optional
from dataclasses import dataclass
from enum import Enum

class ModelType(Enum):
    """Types of models available in the system."""
    BASE_MODEL = "base"
    FINETUNED_MODEL = "finetuned"
    SPECULATIVE_MODEL = "speculative"

@dataclass
class InferenceRequest:
    """Represents a single inference request."""
    prompt: str
    model_type: ModelType
    temperature: float = 0.7
    max_tokens: int = 500
    system_message: str = "You are a helpful assistant."

@dataclass
class InferenceResponse:
    """Represents the response from an inference request."""
    prompt: str
    response: str
    model_type: ModelType
    tokens_used: int
    latency: float

class MultiModelInferenceSystem:
    """Production-ready multi-model inference system."""
    
    def __init__(self, base_url_base: str, base_url_finetuned: str, base_url_speculative: str):
        """
        Initialize the inference system with multiple model endpoints.
        
        Args:
            base_url_base: URL for the base model server
            base_url_finetuned: URL for the fine-tuned model server
            base_url_speculative: URL for the speculative decoding server
        """
        self.clients = {
            ModelType.BASE_MODEL: OpenAI(base_url=base_url_base, api_key="not-needed"),
            ModelType.FINETUNED_MODEL: OpenAI(base_url=base_url_finetuned, api_key="not-needed"),
            ModelType.SPECULATIVE_MODEL: OpenAI(base_url=base_url_speculative, api_key="not-needed")
        }
        
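        # These names must match the model IDs each server reports at its /v1/models endpoint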
        self.model_names = {
            ModelType.BASE_MODEL: "meta-llama/Meta-Llama-3.1-8B-Instruct",
            ModelType.FINETUNED_MODEL: "meta-llama/Meta-Llama-3.1-8B-Instruct",
            ModelType.SPECULATIVE_MODEL: "meta-llama/Meta-Llama-3.1-8B-Instruct"
        }
    
    def route_request(self, prompt: str) -> ModelType:
        """
        Intelligently route requests to appropriate models based on content.
        
        Args:
            prompt: The user's prompt
            
        Returns:
            The model type to use for this request
        """
        # Simple keyword-based routing (in production, use a classifier)
        ml_keywords = ["machine learning", "neural network", "deep learning", "training", "model"]
        long_generation_keywords = ["write a story", "detailed explanation", "comprehensive"]
        
        prompt_lower = prompt.lower()
        
        # Route ML-related queries to fine-tuned model
        if any(keyword in prompt_lower for keyword in ml_keywords):
            return ModelType.FINETUNED_MODEL
        
        # Route long generations to speculative decoding
        if any(keyword in prompt_lower for keyword in long_generation_keywords):
            return ModelType.SPECULATIVE_MODEL
        
        # Default to base model
        return ModelType.BASE_MODEL
    
    def process_single_request(self, request: InferenceRequest) -> Optional[InferenceResponse]:
        """
        Process a single inference request.
        
        Args:
            request: The inference request to process
            
        Returns:
            The inference response or None if error occurs
        """
        try:
            client = self.clients[request.model_type]
            model_name = self.model_names[request.model_type]
            
            start_time = time.time()
            
            response = client.chat.completions.create(
                model=model_name,
                messages=[
                    {"role": "system", "content": request.system_message},
                    {"role": "user", "content": request.prompt}
                ],
                temperature=request.temperature,
                max_tokens=request.max_tokens
            )
            
            latency = time.time() - start_time
            
            return InferenceResponse(
                prompt=request.prompt,
                response=response.choices[0].message.content,
                model_type=request.model_type,
                tokens_used=response.usage.total_tokens,
                latency=latency
            )
        except Exception as e:
            print(f"Error processing request: {e}")
            return None
    
    def process_batch(self, requests: List[InferenceRequest], max_workers: int = 10) -> List[InferenceResponse]:
        """
        Process multiple requests in parallel.
        
        Args:
            requests: List of inference requests
            max_workers: Maximum number of parallel workers
            
        Returns:
            List of inference responses (excluding failed requests)
        """
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [executor.submit(self.process_single_request, req) for req in requests]
            results = []
            for future in concurrent.futures.as_completed(futures):
                result = future.result()
                if result is not None:
                    results.append(result)
        
        return results
    
    def auto_route_and_process(self, prompts: List[str], **kwargs) -> List[InferenceResponse]:
        """
        Automatically route prompts to appropriate models and process them.
        
        Args:
            prompts: List of prompts to process
            **kwargs: Additional arguments for InferenceRequest
            
        Returns:
            List of inference responses
        """
        requests = []
        for prompt in prompts:
            model_type = self.route_request(prompt)
            requests.append(InferenceRequest(prompt=prompt, model_type=model_type, **kwargs))
        
        return self.process_batch(requests)
    
    def print_statistics(self, responses: List[InferenceResponse]):
        """Print statistics about the processed requests."""
        if not responses:
            print("No responses to analyze")
            return
        
        total_requests = len(responses)
        total_tokens = sum(r.tokens_used for r in responses)
        avg_latency = sum(r.latency for r in responses) / total_requests
        
        model_counts = {}
        for r in responses:
            model_counts[r.model_type] = model_counts.get(r.model_type, 0) + 1
        
        print("\n" + "="*80)
        print("INFERENCE STATISTICS")
        print("="*80)
        print(f"Total requests processed: {total_requests}")
        print(f"Total tokens used: {total_tokens:,}")
        print(f"Average latency: {avg_latency:.2f} seconds")
        print(f"\nModel distribution:")
        for model_type, count in model_counts.items():
            print(f"  {model_type.value}: {count} requests ({100*count/total_requests:.1f}%)")
        print("="*80)

def main():
    """Main function demonstrating the production inference system."""
    
    # Initialize the system with multiple model endpoints
    # In production, these would point to different servers or different models
    system = MultiModelInferenceSystem(
        base_url_base="http://192.168.1.100:8000/v1",
        base_url_finetuned="http://192.168.1.100:8001/v1",
        base_url_speculative="http://192.168.1.100:30000/v1"
    )
    
    # Define a diverse set of test prompts
    test_prompts = [
        "What is the capital of France?",
        "Explain how machine learning algorithms learn from data.",
        "Write a detailed explanation of neural network architectures.",
        "What are the main causes of climate change?",
        "Describe the process of training a deep learning model.",
        "How does photosynthesis work?",
        "What is the difference between supervised and unsupervised learning?",
        "Explain the water cycle.",
        "Write a comprehensive guide to model evaluation metrics in machine learning.",
        "What causes earthquakes?"
    ]
    
    print("Processing requests with automatic routing...")
    start_time = time.time()
    
    responses = system.auto_route_and_process(
        test_prompts,
        temperature=0.7,
        max_tokens=300
    )
    
    total_time = time.time() - start_time
    
    # Display results
    print("\n" + "="*80)
    print("RESULTS")
    print("="*80)
    
    for i, response in enumerate(responses, 1):
        print(f"\n[Request {i}] Model: {response.model_type.value}")
        print(f"Prompt: {response.prompt}")
        print(f"Response: {response.response[:200]}...")  # Truncate for readability
        print(f"Tokens: {response.tokens_used}, Latency: {response.latency:.2f}s")
        print("-"*80)
    
    # Print statistics
    system.print_statistics(responses)
    
    if total_time > 0 and len(responses) > 0:
        print(f"\nTotal processing time: {total_time:.2f} seconds")
        print(f"Throughput: {len(responses)/total_time:.2f} requests/second")

if __name__ == "__main__":
    main()

This comprehensive example demonstrates a production-ready inference system with the following features:

  1. Multi-model support: The system can route requests to different models based on the task
  2. Intelligent routing: Requests are automatically routed to the most appropriate model
  3. Batch processing: Multiple requests are processed in parallel for maximum throughput
  4. Performance monitoring: The system tracks latency, token usage, and model distribution
  5. Error handling: Robust error handling ensures the system continues even if individual requests fail
  6. Clean architecture: The code uses dataclasses and type hints for maintainability

To run this example, ensure you have all three servers running (base model on port 8000, fine-tuned model on port 8001, and SGLang with speculative decoding on port 30000), then execute:

python production_inference_system.py

You will see the system process all requests in parallel, automatically route each one to an appropriate model, and then display comprehensive statistics about the batch.
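Before running the full batch, it can help to confirm that all three endpoints are reachable. Here is a minimal connectivity-check sketch, reusing the example IP address from above and the OpenAI client's models.list() call; adjust the address and ports to your own setup:

from openai import OpenAI

# Quick reachability check for each inference server
endpoints = {"base": 8000, "finetuned": 8001, "speculative": 30000}

for name, port in endpoints.items():
    client = OpenAI(base_url=f"http://192.168.1.100:{port}/v1", api_key="not-needed")
    try:
        model_ids = [m.id for m in client.models.list().data]
        print(f"{name} (port {port}): serving {model_ids}")
    except Exception as e:
        print(f"{name} (port {port}): unreachable ({e})")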

CONCLUSION AND BEST PRACTICES

Congratulations! You have now mastered the complete workflow of LLM development on the NVIDIA DGX Spark, from basic inference to advanced fine-tuning and production deployment.

Here are the key best practices to remember:

For Inference:

  • Use vLLM for maximum throughput with continuous batching
  • Use SGLang when you need structured generation or speculative decoding
  • Always send requests in parallel when possible to maximize GPU utilization
  • Monitor token usage and latency to optimize your application
  • Implement proper error handling for production robustness

For Fine-Tuning:

  • Start with LoRA for models up to 13B parameters
  • Use QLoRA for larger models (13B+)
  • Ensure your dataset is high-quality with at least 1000 diverse examples
  • Use appropriate hyperparameters: rank 16-64, learning rate 1e-4 to 2e-4
  • Enable gradient checkpointing and bfloat16 to reduce memory usage
  • Save checkpoints regularly during long training runs (the sketch after this list shows these settings together)
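As a compact recap of these recommendations, here is a minimal configuration sketch using peft's LoraConfig and transformers' TrainingArguments. The exact trainer class in your earlier fine-tuning script may differ, and the values here are illustrative starting points rather than tuned settings:

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,                          # rank in the 16-64 range recommended above
    lora_alpha=64,                 # commonly set to about 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections for Llama-style models
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./output/llama-3.1-8b-lora",
    learning_rate=2e-4,            # within the 1e-4 to 2e-4 range
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,                     # bfloat16 training
    gradient_checkpointing=True,   # trade recomputation for memory
    save_strategy="steps",
    save_steps=200,                # checkpoint regularly during long runs
    logging_steps=10,
)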

For Production:

  • Implement intelligent request routing based on task type
  • Use multiple models for different use cases
  • Monitor performance metrics continuously
  • Implement proper error handling and retry logic (see the retry sketch after this list)
  • Use Docker containers for reproducible deployments
  • Test thoroughly before deploying to production
  • Keep models and dependencies up to date
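For the retry logic mentioned above, here is a minimal sketch of exponential backoff with jitter, wrapping the process_single_request method from production_inference_system.py; the helper name and default values are illustrative:

import random
import time

def process_with_retry(system, request, max_attempts=3, base_delay=1.0):
    """Retry a failed request with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        result = system.process_single_request(request)
        if result is not None:
            return result
        if attempt < max_attempts:
            # Back off exponentially, with jitter to avoid synchronized retries
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"Attempt {attempt} failed, retrying in {delay:.1f}s...")
            time.sleep(delay)
    return None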

For Networking:

  • Ensure firewall rules allow necessary ports
  • Use appropriate IP addresses for your network configuration
  • Monitor network latency and bandwidth usage
  • Consider load balancing for high-traffic scenarios

The NVIDIA DGX Spark provides an exceptional platform for LLM development, combining powerful hardware with a unified memory architecture that simplifies working with large models. By following the patterns and techniques demonstrated in this tutorial, you can build sophisticated AI applications that leverage the full capabilities of this remarkable system.
