Hitchhiker's Guide to AI, Software Architecture, and Everything Else: RUNNING HUGGINGFACE MODELS EVERYWHERE: A COMPLETE GUIDE TO ADAPTING MODELS FOR OLLAMA, MLX, LM STUDIO, AND LLAMA.CPP

WHY THIS MATTERS, AND WHERE WE ARE HEADING

The landscape of local large language model inference has changed dramatically over the past two years. What once required expensive cloud APIs and specialized hardware is now routinely done on a laptop. A MacBook Pro with Apple Silicon, a gaming PC with a mid-range GPU, or even a machine running purely on CPU can serve a capable language model in real time, with full privacy, zero API costs, and no rate limits. The tools that made this possible form a small but powerful ecosystem: llama.cpp at the foundation, Ollama as a developer-friendly server layer, LM Studio as a polished graphical interface, and Apple's MLX framework for those running on Apple Silicon. The one thing all of these tools have in common is that they do not natively consume the raw model files you download from HuggingFace. They need models in specific formats, and bridging that gap is exactly what this guide teaches you to do.

HuggingFace is the world's largest open repository of machine learning models. At the time of writing it hosts hundreds of thousands of models, from tiny classifiers to massive 70-billion-parameter language models. When a research lab or a company releases a new open-weight model, HuggingFace is almost always the first place it appears. The weights are typically stored in the safetensors format (a safe, fast binary format developed by HuggingFace) or the older PyTorch .bin format, accompanied by configuration files, tokenizer files, and sometimes generation configuration files. These files are designed to be consumed by the transformers Python library, which is the dominant framework for research and fine-tuning. They are not, however, what llama.cpp or Ollama expects.

This guide walks you through the entire pipeline, from downloading a model off HuggingFace all the way to running it in four different local inference environments. You will understand not just the commands to type but also why each step exists, what each file does, and what choices you are making when you pick one quantization level over another. By the end, you will be able to take virtually any open-weight transformer model from HuggingFace and run it locally with confidence.

The guide is structured as a journey. We start by understanding the formats and the tools. We then go through the conversion pipeline for the llama.cpp family, which covers llama.cpp itself, Ollama, and LM Studio, since they all share the same underlying format. After that, we take a separate path for Apple's MLX framework, which has its own conversion toolchain and its own format. Throughout, we use concrete, real examples with actual commands you can copy and run. The final part of the guide presents a complete Python automation script that performs the entire pipeline with a single command, handling platform detection, dependency management, conversion, quantization, template detection, and validation automatically.

CHAPTER ONE: UNDERSTANDING THE FORMATS AND THE ECOSYSTEM

Before touching a single command, it is worth spending time understanding what you are actually doing when you convert a model. This understanding will save you hours of debugging and will help you make better decisions about quantization, context length, and tool selection.

The HuggingFace Model Zoo and What It Contains

When you browse HuggingFace and find a model you want to run locally — say Meta's Llama 3.1 8B Instruct, or Mistral AI's Mistral 7B Instruct v0.3, or Google's Gemma 2 9B — you are looking at a repository that typically contains several categories of files.

The model weights themselves are the largest files. In modern repositories these are stored as safetensors files, which are named things like model-00001-of-00004.safetensors for large models that are split across multiple shards, or model.safetensors for smaller models that fit in a single file. Older repositories may use pytorch_model.bin files instead, or a mix of both. The safetensors format was designed to be faster to load and safer to share than pickle-based PyTorch files, and it has become the standard.

The configuration file, config.json, is a JSON document that describes the model architecture. It tells the loading code how many layers the model has, how wide the hidden states are, how many attention heads exist, what activation function is used, and dozens of other architectural details. This file is critical for conversion because the conversion scripts read it to understand what they are dealing with.

The tokenizer files tell the model how to turn text into numbers and numbers back into text. A tokenizer typically consists of a tokenizer.json file, a tokenizer_config.json, and sometimes a tokenizer.model file (a SentencePiece binary used by older LLaMA-family models). The tokenizer_config.json deserves special attention because it contains a field called chat_template. This is a Jinja2 template string that defines the exact text format the model was trained with for multi-turn conversations. It specifies how system prompts, user messages, and assistant responses are wrapped with special tokens and delimiters. This template is the authoritative source of truth for how to format prompts for any given model, and it becomes critical when you configure Ollama's Modelfile later in this guide.
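
To make the role of chat_template concrete, the following is a minimal sketch that pulls the template string out of a downloaded model directory. The helper name read_chat_template is hypothetical, and the handling of a list of named templates is an assumption about how some repositories store more than one template; it is not a description of what any conversion tool does internally.

```python
import json
from pathlib import Path
from typing import Optional


def read_chat_template(model_dir: str) -> Optional[str]:
    """Return the chat_template string from tokenizer_config.json, or None."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    with config_path.open(encoding="utf-8") as f:
        tokenizer_config = json.load(f)

    template = tokenizer_config.get("chat_template")

    # Assumption: some repositories store a list of {"name", "template"}
    # entries instead of a single string; prefer the one named "default".
    if isinstance(template, list):
        for entry in template:
            if entry.get("name") == "default":
                return entry.get("template")
        return None
    return template
```

Calling read_chat_template("./Mistral-7B-Instruct-v0.3") and printing the result shows the exact Jinja2 text you will later translate into a Go template for Ollama.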

Understanding GGUF: The Universal Format for Local Inference

GGUF is the binary model format created by Georgi Gerganov, the author of llama.cpp, as a successor to earlier formats called GGML, GGMF, and GGJT. It is the standard format for llama.cpp-based inference and is consumed by llama.cpp, Ollama, and LM Studio alike.

GGUF is a single binary file that contains everything needed to run a model: the architecture metadata, the tokenizer vocabulary and rules, all the model weights, and any other configuration. This self-contained nature is one of GGUF's greatest practical advantages. You do not need a separate config.json or tokenizer.json sitting next to it; everything is embedded. This makes GGUF files easy to share, easy to move around, and easy to load in any application that supports the format.
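
Because everything lives in one file, a quick sanity check on a GGUF is possible by reading only its first few bytes. The sketch below assumes the header layout used by GGUF version 2 and later (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian); the function name read_gguf_header is made up for this illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header without loading any weights."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        raw = f.read(20)
        if len(raw) < 20:
            raise ValueError("truncated GGUF header")
        # Assumed layout (GGUF v2+): uint32 version, uint64 tensor count,
        # uint64 metadata key/value count, little-endian.
        version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return {
        "version": version,
        "tensors": tensor_count,
        "metadata_entries": kv_count,
    }
```

A file that fails the magic check was either not converted correctly or is not a GGUF at all, which is worth ruling out before debugging anything further downstream.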

The format is also designed with quantization in mind. Quantization is the process of reducing the numerical precision of the model weights. A full-precision model stores each weight as a 32-bit floating point number (float32), or more commonly these days as a 16-bit floating point (float16 or bfloat16). Quantization reduces this to fewer bits — say 8 bits or 4 bits — which dramatically reduces the file size and memory requirements at the cost of some accuracy. A 7-billion-parameter model in float16 requires roughly 14 gigabytes of memory. The same model quantized to 4 bits requires only about 4 to 5 gigabytes, making it runnable on a machine with 8 gigabytes of RAM.
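
The arithmetic behind these figures is simply parameters times bits per weight, divided by eight to get bytes. The sketch below reproduces it; the function name is illustrative, and the estimate covers weights only, not the KV cache or runtime buffers.

```python
def estimate_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Weights-only estimates for a 7-billion-parameter model.
for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {estimate_weight_size_gb(7e9, bits):.1f} GB")
```

A pure 4-bit encoding works out to 3.5 GB for 7 billion parameters. Practical 4-bit schemes also store per-block scale factors, which is why real files land in the 4 to 5 GB range quoted above.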

GGUF supports a rich set of quantization schemes. The naming convention follows a logical pattern. Q4_K_M means 4-bit quantization using the K-quant method at medium quality. The K-quants apply different precision levels to different parts of the model: more sensitive tensors get slightly higher precision, while less sensitive tensors get lower precision. This produces better quality than naive uniform quantization at the same average bit width.

Q2_K uses approximately 2.6 bits per weight and produces the smallest possible files, but quality degrades noticeably, especially for complex reasoning tasks. It is useful only when memory is extremely constrained. Q3_K_M uses approximately 3.3 bits per weight and represents a significant step up in quality while still being very compact. Q4_K_M uses approximately 4.8 bits per weight and is by far the most popular quantization level in the community. It strikes an excellent balance between file size, memory usage, and output quality, and the quality loss compared to the original float16 model is minimal for most practical tasks. Q5_K_M uses approximately 5.7 bits per weight and is the choice when you want near-original quality and have a bit more memory to spare. Q6_K uses approximately 6.6 bits per weight and is very close to lossless. Q8_0 uses 8 bits per weight and is essentially lossless for most purposes, with quality nearly indistinguishable from the original float16 model.
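
One way to turn these figures into a decision is to pick the highest-precision level whose weights fit a memory budget. The sketch below uses the approximate bits-per-weight values quoted above; the helper pick_quantization and the idea of a single "budget" number are simplifications for illustration, and a real budget should leave headroom for the KV cache and the rest of the system.

```python
from typing import Optional

# Approximate bits per weight, taken from the figures quoted in the text.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.3,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.0,
}


def pick_quantization(num_params: float, budget_gb: float) -> Optional[str]:
    """Return the highest-precision level whose weights fit in budget_gb."""
    best = None
    # Walk from the smallest to the largest level, keeping the last that fits.
    for name, bits in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: kv[1]):
        size_gb = num_params * bits / 8 / 1e9
        if size_gb <= budget_gb:
            best = name
    return best


print(pick_quantization(7e9, 4.5))
```

For a 7-billion-parameter model and a 4.5 GB budget this selects Q4_K_M, since its estimated 4.2 GB fits while Q5_K_M at roughly 5 GB does not.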

The llama.cpp Ecosystem: Three Tools, One Format

llama.cpp is a C++ inference engine for large language models. It was originally written to run LLaMA models on consumer hardware, but it has grown into a comprehensive inference framework that supports dozens of model architectures and runs on CPUs, NVIDIA GPUs, AMD GPUs, Apple Silicon, and even some mobile devices. It is the engine that powers both Ollama and LM Studio.

Ollama is a tool that wraps llama.cpp in a clean, Docker-like interface. You run ollama serve to start a local server, and then you can pull models with ollama pull, run them with ollama run, and interact with them via a REST API. Ollama exposes two API styles: its native API at /api/chat and /api/generate, and an OpenAI-compatible API at /v1/chat/completions and /v1/models. The OpenAI-compatible endpoints at the /v1/ prefix are what you should use when pointing tools or libraries that expect the OpenAI API format at your local Ollama instance. Ollama manages model storage, versioning, and serving automatically.

LM Studio is a desktop application with a graphical user interface that also uses llama.cpp under the hood. It provides a chat interface, a model browser that connects to HuggingFace, and a local server mode. It is the most accessible option for users who prefer not to work in the terminal.

llama.cpp itself is the underlying engine that you interact with directly when you want maximum control. You compile it yourself, run the inference binary directly, and manage everything manually. It is the right choice for developers who want to integrate LLM inference into their own applications or who need features not yet exposed by Ollama or LM Studio.

All three of these tools consume GGUF files. This means that once you have converted your HuggingFace model to GGUF, you can use it in any of the three environments with minimal additional work. For very large models such as 70B-parameter variants, llama.cpp can produce split GGUF files (named model-00001-of-00003.gguf and so on) to keep individual file sizes manageable. llama-cli loads a split model when you pass it the path to the first shard. Support for split files in Ollama and LM Studio depends on the version you run; if an import fails, merge the shards back into a single file with llama.cpp's llama-gguf-split tool (using its --merge option) and load that file instead.

The MLX Ecosystem: Apple Silicon's Native Path

Apple's MLX framework is a different beast entirely. MLX is a numerical computation framework designed specifically for Apple Silicon (M1, M2, M3, and M4 series chips). It exploits the unified memory architecture of these chips, where the CPU and GPU share the same physical memory pool, to run models with exceptional efficiency. A model that would require a discrete GPU on other hardware can run entirely in the unified memory on an Apple Silicon Mac.

MLX uses its own model format. When you convert a model using the mlx-lm package, the output directory contains safetensors weight files (using the same container format as HuggingFace, but with MLX-specific quantization applied to the tensor values when quantization is requested), a config.json that includes quantization metadata such as the number of bits and group size, and the standard tokenizer files copied from the source model. This format is distinct from GGUF and is not interchangeable with it.
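
To check whether a converted MLX directory is quantized, you can inspect its config.json. The sketch below assumes the quantization metadata sits under a top-level "quantization" key with "bits" and "group_size" entries, which matches the description above but should be verified against your own output; the function name is hypothetical.

```python
def describe_mlx_quantization(config: dict) -> str:
    """Summarize the quantization block of an MLX config.json, if present."""
    # Assumption: quantized models carry {"quantization": {"bits", "group_size"}}.
    quant = config.get("quantization")
    if not quant:
        return "not quantized"
    return f"{quant['bits']}-bit, group size {quant['group_size']}"


print(describe_mlx_quantization({"quantization": {"bits": 4, "group_size": 64}}))
```

In practice you would load the dictionary with json.load from the config.json inside the MLX output directory and pass it to this function.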

For Apple Silicon users, MLX often provides strong inference performance thanks to its tight hardware-specific optimizations for the GPU cores and the unified memory architecture. MLX runs on the CPU and GPU; it does not target the Apple Neural Engine. llama.cpp's Metal backend has also matured significantly, so the performance gap between the two has narrowed. The best choice depends on the specific model and workload; many practitioners benchmark both and use whichever is faster for their use case.

CHAPTER TWO: SETTING UP YOUR ENVIRONMENT

The conversion process requires a Python environment with several packages installed. This section walks you through setting up everything you need before you touch a single model file.

Installing the Core Dependencies

You will need Python 3.9 or later. Python 3.11 is a solid choice at the time of writing, as it is well-supported by all the packages involved. If you are using a Mac with Apple Silicon and plan to use MLX, Python 3.12 is also fine.

The first thing to do is create a virtual environment to keep your conversion tools isolated from your system Python. This is good practice that prevents dependency conflicts.

On Linux or macOS:

python3 -m venv llm-convert
source llm-convert/bin/activate

On Windows (Command Prompt):

python -m venv llm-convert
llm-convert\Scripts\activate.bat

On Windows (PowerShell):

python -m venv llm-convert
llm-convert\Scripts\Activate.ps1

If PowerShell blocks the activation script, run this first to allow local scripts:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Now install the HuggingFace Hub CLI and the transformers library:

pip install huggingface_hub transformers sentencepiece numpy requests tqdm

For PyTorch, the installation command depends on your hardware. On a machine with an NVIDIA GPU, visit https://pytorch.org/get-started/locally/ to get the exact command for your CUDA version. For a CPU-only installation that is sufficient for conversion:

pip install torch --index-url https://download.pytorch.org/whl/cpu

Cloning and Building llama.cpp

The conversion script for GGUF lives inside the llama.cpp repository. You need to clone it and install its Python dependencies.

git clone --depth 1 https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

The --depth 1 flag creates a shallow clone, downloading only the latest commit rather than the full history. This is significantly faster for a repository of llama.cpp's size and is all you need for conversion and compilation.

To compile the C++ binaries, you use CMake. The compilation flags differ by platform. On Linux with an NVIDIA GPU:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j4

On macOS (Apple Silicon or Intel), Metal support is enabled automatically without any extra flag. You do need Xcode Command Line Tools installed (xcode-select --install):

cmake -B build
cmake --build build --config Release -j4

On Linux with an AMD GPU using ROCm (recent llama.cpp versions use -DGGML_HIP=ON; older checkouts used -DGGML_HIPBLAS=ON):

cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j4

On Windows with MSVC (run from a Developer Command Prompt for Visual Studio):

cmake -B build
cmake --build build --config Release -j4

The -j4 flag enables parallel compilation using 4 threads and works on all platforms with CMake 3.12 or later. You can increase this number to match your CPU core count. On Linux you can use $(nproc) and on macOS $(sysctl -n hw.ncpu) to get the core count dynamically, but a fixed number like 4 or 8 works fine on any platform.

After compilation, the binaries on Linux and macOS are in build/bin/. On Windows they are in build\bin\Release\. The key binaries are llama-quantize (for quantization) and llama-cli (for direct inference testing). On Windows, these have a .exe extension.
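
If you script the later steps, it helps to resolve these locations in one place. The following is a small sketch of that lookup; llama_binary is a hypothetical helper, and the directory layout is the one described above rather than something queried from CMake.

```python
from pathlib import PurePosixPath, PureWindowsPath


def llama_binary(llama_dir: str, name: str, system: str) -> str:
    """Return the expected path of a compiled llama.cpp binary.

    system is the value of platform.system(): "Windows", "Linux" or "Darwin".
    """
    if system == "Windows":
        # MSVC builds place binaries under build\bin\Release with a .exe suffix.
        path = PureWindowsPath(llama_dir) / "build" / "bin" / "Release" / f"{name}.exe"
        return str(path)
    # Linux and macOS builds place binaries under build/bin.
    return str(PurePosixPath(llama_dir) / "build" / "bin" / name)


print(llama_binary("llama.cpp", "llama-quantize", "Linux"))
```

Passing platform.system() as the third argument selects the right layout for the machine the script runs on.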

Installing Ollama

Ollama has a one-line installer for Linux and macOS:

curl -fsSL https://ollama.com/install.sh | sh

On Windows, download and run the installer from https://ollama.com. After installation, start the Ollama server in a terminal window that you leave open:

ollama serve

This starts a local HTTP server on port 11434.

Installing mlx-lm (Apple Silicon Only)

If you are on an Apple Silicon Mac, install the mlx-lm package:

pip install mlx-lm

This single package contains both the conversion tools and the inference runtime. It will pull in the mlx framework as a dependency automatically. This package will not install or function correctly on non-Apple-Silicon hardware; skip this step if you are on any other platform.

CHAPTER THREE: DOWNLOADING A MODEL FROM HUGGINGFACE

Before you can convert anything, you need the model files on your local machine.

Using the HuggingFace CLI

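The cleanest and most reliable method is the huggingface-cli download command. We use Mistral 7B Instruct v0.3 as the running example throughout this guide, because it is a well-known, capable model released under the permissive Apache 2.0 license. If the repository page asks you to accept access conditions before downloading, log in first as described later in this section.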

huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
    --local-dir ./Mistral-7B-Instruct-v0.3 \
    --local-dir-use-symlinks False

The --local-dir flag specifies where to save the files. The --local-dir-use-symlinks False flag tells the CLI to copy the actual files rather than creating symlinks into the HuggingFace cache, which makes the directory self-contained and easier to work with. Recent huggingface_hub releases always write real files to --local-dir and treat this flag as deprecated, so you can omit it if the CLI prints a deprecation warning. If a download is interrupted, simply re-run the same command; the HuggingFace Hub CLI resumes from where it left off.

For models that require you to agree to a license (called "gated" models, such as Meta's Llama 3 family), you first need to log in with your HuggingFace account:

huggingface-cli login

This will prompt you for a HuggingFace access token, which you can generate at https://huggingface.co/settings/tokens. Once logged in, the download command works the same way.

After the download completes, let us look at what we have:

config.json
generation_config.json
model-00001-of-00003.safetensors
model-00002-of-00003.safetensors
model-00003-of-00003.safetensors
model.safetensors.index.json
special_tokens_map.json
tokenizer.json
tokenizer.model
tokenizer_config.json

The three safetensors files together contain all the model weights. The config.json describes the architecture. The tokenizer files describe the vocabulary and text encoding. This is the complete set of ingredients that the conversion script needs.
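
Before converting, it is worth confirming that every shard listed in model.safetensors.index.json actually arrived. The sketch below reads the index and reports missing shard files; the function name summarize_shards is illustrative, and the check only tests for file existence, not file integrity.

```python
import json
from pathlib import Path


def summarize_shards(model_dir: str) -> dict:
    """List the shard files referenced by model.safetensors.index.json."""
    root = Path(model_dir)
    with (root / "model.safetensors.index.json").open(encoding="utf-8") as f:
        index = json.load(f)

    # weight_map maps each tensor name to the shard file that stores it.
    shards = sorted(set(index["weight_map"].values()))
    missing = [name for name in shards if not (root / name).exists()]
    total_bytes = index.get("metadata", {}).get("total_size")
    return {"shards": shards, "missing": missing, "total_bytes": total_bytes}
```

An empty "missing" list for ./Mistral-7B-Instruct-v0.3 means all three shards are present; anything else points to an interrupted download that should be resumed before running the conversion script.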

A Quick Inspection of config.json

It is worth spending a moment looking at the config.json to understand what you are working with:

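import json

# Load the architecture description that ships with the model.
with open("./Mistral-7B-Instruct-v0.3/config.json", encoding="utf-8") as f:
    config = json.load(f)

# These fields identify the model family and its overall shape.
print(f"Architecture: {config['architectures']}")
print(f"Hidden size:  {config['hidden_size']}")
print(f"Num layers:   {config['num_hidden_layers']}")
print(f"Vocab size:   {config['vocab_size']}")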

Running this produces output like:

Architecture: ['MistralForCausalLM']
Hidden size:  4096
Num layers:   32
Vocab size:   32768

This tells you that the conversion script needs to handle the MistralForCausalLM architecture, which llama.cpp's convert_hf_to_gguf.py fully supports. Knowing the architecture name is useful when troubleshooting conversion errors, because the error messages often reference the architecture class.

CHAPTER FOUR: CONVERTING TO GGUF FOR LLAMA.CPP, OLLAMA, AND LM STUDIO

This is the heart of the conversion process. The llama.cpp repository contains a Python script called convert_hf_to_gguf.py that reads a HuggingFace model directory and produces a GGUF file.

Step 1: Running the Conversion Script

Navigate to the llama.cpp directory and run the conversion script, pointing it at your downloaded model:

cd llama.cpp

python convert_hf_to_gguf.py \
    ../Mistral-7B-Instruct-v0.3 \
    --outfile ../Mistral-7B-Instruct-v0.3-F16.gguf \
    --outtype f16

The first positional argument is the path to the HuggingFace model directory. The --outfile argument specifies the path and filename for the output GGUF file. The --outtype argument specifies the output data type for the weights. The valid options are f32 (32-bit float), f16 (16-bit float), bf16 (bfloat16), q8_0 (8-bit quantization applied during conversion), and auto (uses the model's native dtype, usually f16 or bf16). Using f16 is the recommended starting point because it preserves full precision and gives you a clean base from which to produce multiple quantization levels.

The script will print progress as it processes each tensor. The output GGUF file in f16 format will be approximately 14.5 GB for a 7B model. This is your base GGUF, from which you will create quantized versions.

Step 2: Quantizing the GGUF File

Now comes the step that makes the model practical for most hardware: quantization. The llama-quantize binary takes the f16 GGUF and produces a quantized version. The reason you always quantize from the f16 base rather than from a previously quantized file is to avoid compounding quality loss: each quantization step introduces a small error, and quantizing an already-quantized file would stack those errors.

On Linux and macOS:

./build/bin/llama-quantize \
    ../Mistral-7B-Instruct-v0.3-F16.gguf \
    ../Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
    Q4_K_M

On Windows (from the llama.cpp directory):

build\bin\Release\llama-quantize.exe ^
    ..\Mistral-7B-Instruct-v0.3-F16.gguf ^
    ..\Mistral-7B-Instruct-v0.3-Q4_K_M.gguf ^
    Q4_K_M

The three arguments are: the input GGUF file, the output GGUF file, and the quantization type. The process takes a few minutes on a modern CPU and produces a file that is roughly 4.4 GB for a 7B model at Q4_K_M. If you want multiple quantization levels, simply run the quantize command again from the same f16 base with different output filenames and quantization types.
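
Producing several quantization levels from the same f16 base is easy to script. The sketch below only builds the argument lists; the function name and the convention of replacing an "-F16" suffix in the file name are assumptions chosen to match the file names used in this guide.

```python
def quantize_commands(quantize_bin: str, base_gguf: str, quant_types: list) -> list:
    """Build one llama-quantize argument list per type, all from the same base."""
    # Derive output names by swapping the "-F16" suffix for the quantization type.
    prefix = base_gguf.removesuffix(".gguf").removesuffix("-F16")
    commands = []
    for qtype in quant_types:
        output = f"{prefix}-{qtype}.gguf"
        # Argument order: input file, output file, quantization type.
        commands.append([quantize_bin, base_gguf, output, qtype])
    return commands


for cmd in quantize_commands(
    "./build/bin/llama-quantize",
    "../Mistral-7B-Instruct-v0.3-F16.gguf",
    ["Q4_K_M", "Q5_K_M", "Q8_0"],
):
    print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) from the llama.cpp directory. Every command reads the f16 file, so no level is derived from another quantized file.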

Step 3: Verifying the GGUF File with llama.cpp

Before loading the model into Ollama or LM Studio, it is a good idea to verify it works with llama.cpp directly. On Linux and macOS:

./build/bin/llama-cli \
    -m ../Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
    -p "[INST] What is the capital of France? [/INST]" \
    -n 100 \
    --log-disable

On Windows:

build\bin\Release\llama-cli.exe ^
    -m ..\Mistral-7B-Instruct-v0.3-Q4_K_M.gguf ^
    -p "[INST] What is the capital of France? [/INST]" ^
    -n 100 ^
    --log-disable

The --log-disable flag suppresses the verbose model-loading messages so that only the generated text appears in the output. A successful run produces coherent output. If you see garbage output or the model immediately produces end-of-sequence tokens, the most common cause is an incorrect prompt format. Different model families use different chat templates, and using the wrong one is a very common mistake. We address this in detail in the Ollama section.

CHAPTER FIVE: LOADING YOUR GGUF MODEL IN OLLAMA

Ollama's model management system is inspired by Docker. Models are identified by names, stored in a local registry, and served by a background daemon. To bring a custom GGUF model into this system, you use a Modelfile.

Understanding the Modelfile and Chat Templates

A Modelfile is a plain text file with a specific syntax. The most important instruction is FROM, which tells Ollama where to find the model weights. For a custom GGUF, FROM points to the local file path. Using an absolute path in the FROM line is strongly recommended. Ollama's documentation describes relative paths as resolved against the Modelfile's location, not against the directory from which you run ollama create, and confusing the two is a frequent source of "file not found" errors.

The TEMPLATE instruction defines the chat template, which is the text structure that wraps user messages and assistant responses. The TEMPLATE value is a Go template string enclosed in triple quotes. A critical formatting rule that trips up many users is that newlines within the TEMPLATE block must be actual newline characters in the file, not the escape sequence \n. The Go template engine does not interpret \n as a newline; it would pass the literal characters backslash and n through to the model, corrupting the prompt format. When you write a Modelfile by hand, always press Enter to create real line breaks inside the TEMPLATE block.

The BOS (beginning of sequence) token is stored in the GGUF file's metadata and is prepended automatically by llama.cpp at the start of each inference call. You should not include the BOS token in your Modelfile TEMPLATE, because doing so would result in a double BOS token, which degrades model behavior. This is why Ollama's official templates for all models do not include BOS tokens even though the models' raw Jinja2 templates in tokenizer_config.json may include them.

An important subtlety about Llama-family models: both Llama 2 and Llama 3 use the same architecture class name (LlamaForCausalLM) in config.json, but they use completely different chat templates. You can reliably distinguish them by the vocabulary size: Llama 2 has a vocab size of 32,000, while Llama 3 has a vocab size of 128,256. Always check which generation of Llama you are working with before writing the Modelfile.
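
The vocabulary-size check can be expressed as a few lines of Python. This is a sketch of the heuristic described above, with a hypothetical function name; many third-party models also use LlamaForCausalLM with other vocabulary sizes, which is why the function returns "unknown" rather than guessing.

```python
def llama_generation(config: dict) -> str:
    """Classify a config.json dict as Llama 2, Llama 3, or neither."""
    if "LlamaForCausalLM" not in config.get("architectures", []):
        return "not-llama"
    vocab_size = config.get("vocab_size")
    if vocab_size == 32000:
        return "llama2"
    if vocab_size == 128256:
        return "llama3"
    # Same architecture class but a different tokenizer: inspect the
    # chat_template in tokenizer_config.json instead of guessing.
    return "unknown"


print(llama_generation({"architectures": ["LlamaForCausalLM"], "vocab_size": 128256}))
```

When the result is "unknown", the chat_template field in tokenizer_config.json remains the authoritative source for the prompt format.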

Here is a complete Modelfile for our Mistral 7B Instruct v0.3 model. Replace /absolute/path/to/ with the actual absolute path on your system. The TEMPLATE block uses actual newlines as required:

FROM /absolute/path/to/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

TEMPLATE """[INST] {{ if .System }}{{ .System }}
{{ end }}{{ .Prompt }} [/INST]"""

PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192

The TEMPLATE block uses Go template syntax. The .System variable is replaced with the system prompt if one is provided, and .Prompt is replaced with the user's message. The PARAMETER stop lines tell Ollama to end generation as soon as the output contains one of these strings. The repeat_penalty parameter (default 1.1) reduces the likelihood of the model repeating itself, which is particularly helpful with smaller quantized models. The num_ctx parameter sets the context window size in tokens; Mistral 7B v0.3 supports up to 32,768 tokens, but larger contexts consume more memory, so 8,192 is a practical default.
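
If you generate Modelfiles from a script, the newline rule is easy to get right because a Python "\n" inside a normal string literal is written to disk as a real line break. The sketch below assembles the Mistral Modelfile shown above; render_modelfile is a hypothetical helper, and it rejects templates that contain a literal backslash followed by n.

```python
def render_modelfile(gguf_path: str, template: str, stops: list, params: dict) -> str:
    """Assemble Modelfile text with real newlines inside the TEMPLATE block."""
    if "\\n" in template:
        # A literal backslash-n would be passed through to the model verbatim.
        raise ValueError("template contains a literal \\n; use real newlines")
    lines = [f"FROM {gguf_path}", "", f'TEMPLATE """{template}"""', ""]
    lines += [f'PARAMETER stop "{stop}"' for stop in stops]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


# The "\n" below is an actual newline character in the resulting string.
MISTRAL_TEMPLATE = (
    "[INST] {{ if .System }}{{ .System }}\n"
    "{{ end }}{{ .Prompt }} [/INST]"
)

print(render_modelfile(
    "/absolute/path/to/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
    MISTRAL_TEMPLATE,
    ["[INST]", "[/INST]"],
    {"temperature": 0.7, "top_p": 0.9, "repeat_penalty": 1.1, "num_ctx": 8192},
))
```

Writing the returned string to a file named Modelfile gives the same content as the hand-written version, with the template spanning two physical lines.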

Here is the correct Modelfile TEMPLATE for a Llama 2 Chat model. Note the <<SYS>> block for system prompts, which is specific to Llama 2:

FROM /absolute/path/to/Llama-2-7b-chat-Q4_K_M.gguf

TEMPLATE """[INST] {{ if .System }}<<SYS>>
{{ .System }}
<</SYS>>

{{ end }}{{ .Prompt }} [/INST]"""

PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096

Here is the correct Modelfile TEMPLATE for a Llama 3 Instruct model. The begin_of_text token is absent because it is the BOS token and is handled automatically by llama.cpp from the GGUF metadata:

FROM /absolute/path/to/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192

Here are the correct templates for several other popular model families.

Gemma 2 Instruct (also works for Gemma 3):

FROM /absolute/path/to/gemma-2-9b-it-Q4_K_M.gguf

TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
"""

PARAMETER stop "<end_of_turn>"
PARAMETER temperature 0.7
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192

Qwen2 / Qwen2.5 / Yi (ChatML format):

FROM /absolute/path/to/Qwen2-7B-Instruct-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192

Phi-3 / Phi-3.5 / Phi-3.5 MoE:

FROM /absolute/path/to/Phi-3-mini-4k-instruct-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""

PARAMETER stop "<|end|>"
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096

DeepSeek V3 / R1 (note: these models use full-width Unicode vertical bars ｜ in their special tokens, not ASCII pipes |):

FROM /absolute/path/to/DeepSeek-R1-Q4_K_M.gguf

TEMPLATE """{{ if .System }}{{ .System }}{{ end }}<｜User｜>{{ .Prompt }}<｜Assistant｜>"""

PARAMETER stop "<｜User｜>"
PARAMETER stop "<｜Assistant｜>"
PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192

Creating and Running the Model in Ollama

Save your Modelfile as a file named Modelfile (no extension) in any convenient directory. Make sure the Ollama server is running (ollama serve in a separate terminal), then run:

ollama create mistral-7b-custom -f ./Modelfile

Ollama will process the GGUF file and register it in its local model library. Now you can run the model interactively:

ollama run mistral-7b-custom

You can also query the model via the REST API. Ollama provides two API styles. The native Ollama API:

# Linux and macOS (bash or zsh); Command Prompt and PowerShell variants follow below
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-custom",
    "messages": [{"role": "user", "content": "Explain attention in transformers."}],
    "stream": false
  }'

The OpenAI-compatible API (use this when integrating with tools that expect the OpenAI format):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "mistral-7b-custom",
    "messages": [{"role": "user", "content": "Explain attention in transformers."}]
  }'

On Windows Command Prompt, curl uses the real curl binary (not an alias), but single quotes are not interpreted, so use escaped double quotes:

curl http://localhost:11434/api/chat ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"mistral-7b-custom\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"stream\":false}"

On Windows PowerShell 5.1, curl is an alias for Invoke-WebRequest. Use curl.exe to call the real binary, or use Invoke-RestMethod which handles JSON natively:

Invoke-RestMethod -Uri "http://localhost:11434/api/chat" `
  -Method POST `
  -ContentType "application/json" `
  -Body '{"model":"mistral-7b-custom","messages":[{"role":"user","content":"Hello"}],"stream":false}'
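
The same native API request can be sent from Python without any third-party packages, which sidesteps the shell quoting differences entirely. This is a minimal sketch using only the standard library; the function names are illustrative, and it assumes a non-streaming response whose reply text sits under message.content.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST request for Ollama's native /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    # Assumption: non-streaming replies carry the text under message.content.
    return body["message"]["content"]
```

With the server running, chat("mistral-7b-custom", "Hello") returns the model's reply as a string.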

You can list all registered models with:

ollama list

CHAPTER SIX: LOADING YOUR GGUF MODEL IN LM STUDIO

LM Studio is a desktop application available for Mac, Windows, and Linux that provides a graphical interface for downloading, managing, and chatting with models. Under the hood it uses llama.cpp, so it consumes the same GGUF files you have already created.

The Model Directory Structure

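LM Studio organizes models in a specific directory structure on your filesystem. The default location has changed between releases (newer releases default to ~/.lmstudio/models), so confirm the path shown in the My Models view before copying files. On macOS and Linux, the models directory used in this guide is at: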

~/.cache/lm-studio/models/

On Windows it is at:

C:\Users\<YourUsername>\.cache\lm-studio\models\

Within this directory, models are organized in a two-level hierarchy that mirrors the HuggingFace naming convention: publisher/model-name. To make your converted GGUF available in LM Studio, copy or move it into an appropriate subdirectory. On Linux or macOS:

mkdir -p ~/.cache/lm-studio/models/mistralai/Mistral-7B-Instruct-v0.3-GGUF/
cp ../Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
   ~/.cache/lm-studio/models/mistralai/Mistral-7B-Instruct-v0.3-GGUF/

After copying the file, open LM Studio. Click on "My Models" in the left sidebar, and your model will appear in the list. LM Studio automatically detects the model architecture from the GGUF metadata and configures the chat template appropriately for well-known model families.

LM Studio also provides a local server mode that exposes an OpenAI-compatible API for chat completions, just like Ollama. This means you can use LM Studio as the backend for any application that supports the OpenAI chat completions API, simply by changing the base URL.
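
Switching backends by changing only the base URL can be sketched as follows. The port numbers are the usual defaults (11434 for Ollama, 1234 for LM Studio) and are assumptions about your setup, since both can be reconfigured; the function names and the placeholder bearer token are likewise illustrative.

```python
import json
import urllib.request

# Assumed default base URLs; adjust the ports if you changed them.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def openai_chat_request(backend: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat completions request for the chosen backend."""
    url = f"{BASE_URLS[backend]}/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Local servers generally ignore the key, but OpenAI clients send one.
        "Authorization": "Bearer not-needed",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response dictionary."""
    return response_body["choices"][0]["message"]["content"]
```

The request is sent with urllib.request.urlopen, and the decoded JSON response is passed to extract_reply. Only the backend argument changes between Ollama and LM Studio; the payload and the response parsing stay the same.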

CHAPTER SEVEN: CONVERTING TO MLX FOR APPLE SILICON

If you are running on an Apple Silicon Mac (M1, M2, M3, or M4 series), you have access to a second, parallel conversion path that produces models optimized for Apple's unified memory architecture. The mlx-lm package handles both conversion and inference.

Why MLX Is Different and Why It Matters

Apple Silicon chips have a unified memory architecture, meaning the CPU and GPU share the same physical memory pool. There is no separate GPU memory; instead, the GPU can access the same RAM that the CPU uses. This has a profound implication for LLM inference: the size of the model you can run is bounded by total system RAM rather than by a separate, smaller GPU memory pool. In practice macOS and other applications need headroom, and the KV cache adds to the footprint, so a 14 GB model is a very tight fit on a 16 GB Mac and far more comfortable on a machine with 24 GB or more.
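The arithmetic behind such size figures is simple enough to sketch. The helper below estimates the size of the weights alone from the parameter count and the bits stored per weight, then checks it against a memory budget. The usable_fraction default of 0.75 is purely an illustrative assumption (the operating system, other applications, and the KV cache all need room), not a documented macOS limit.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_unified_memory(
    n_params: float,
    bits_per_weight: float,
    ram_gb: float,
    usable_fraction: float = 0.75,  # assumption: leave ~25% for OS, apps, KV cache
) -> bool:
    """Return True if the weights fit inside the assumed usable share of RAM."""
    return weights_size_gb(n_params, bits_per_weight) <= ram_gb * usable_fraction


# A 7-billion-parameter model at 16 bits per weight:
print(weights_size_gb(7e9, 16))              # about 14 GB of weights
print(fits_in_unified_memory(7e9, 16, 16))   # tight on a 16 GB machine
print(fits_in_unified_memory(7e9, 4.5, 16))  # 4-bit quantization leaves headroom
```

This is an estimate of the weights only; actual memory use also grows with context length.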

MLX is Apple's numerical computation framework designed to exploit this unified memory architecture. It uses lazy evaluation, optional function compilation, and Metal kernels tuned for the GPU cores in Apple Silicon; it runs on the CPU and GPU and does not target the Apple Neural Engine. The mlx-lm package converts HuggingFace models into MLX's native format and provides a complete inference runtime.

Step 1: Converting a HuggingFace Model to MLX Format

python -m mlx_lm.convert \
    --hf-path ./Mistral-7B-Instruct-v0.3 \
    --mlx-path ./Mistral-7B-Instruct-v0.3-MLX

Without any quantization flag, this converts the weights to MLX format at 16-bit precision (float16 or bfloat16, depending on the mlx-lm version and the source model's dtype).

Step 2: Quantizing During Conversion

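To produce a 4-bit quantized MLX model, pass -q to enable quantization and --q-bits to set the bit width (--q-bits on its own does not turn quantization on):

python -m mlx_lm.convert \
    --hf-path ./Mistral-7B-Instruct-v0.3 \
    --mlx-path ./Mistral-7B-Instruct-v0.3-MLX-4bit \
    -q --q-bits 4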

For 8-bit quantization, keep the -q flag and change the bit width:

python -m mlx_lm.convert \
    --hf-path ./Mistral-7B-Instruct-v0.3 \
    --mlx-path ./Mistral-7B-Instruct-v0.3-MLX-8bit \
    -q --q-bits 8

Step 3: Running Inference with mlx-lm

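Once converted, you can run the model directly. The generate command applies the model's chat template to the prompt by default, so pass the plain user message rather than hand-written [INST] tags:

python -m mlx_lm.generate \
    --model ./Mistral-7B-Instruct-v0.3-MLX-4bit \
    --prompt "What is the capital of France?" \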
    --max-tokens 200 \
    --temp 0.7

You can also use mlx-lm from Python code. The apply_chat_template method on the tokenizer is the cleanest way to format prompts correctly, because it reads the Jinja2 template from tokenizer_config.json and applies it automatically:

from mlx_lm import load, generate

model, tokenizer = load("./Mistral-7B-Instruct-v0.3-MLX-4bit")

messages = [{"role": "user", "content": "Explain quantum entanglement in one paragraph."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=300)
print(response)

Using Pre-Converted MLX Models from the MLX Community

The MLX Community on HuggingFace at https://huggingface.co/mlx-community maintains a large collection of pre-converted MLX models. Many popular models are already available there, which means you can skip the conversion step entirely:

huggingface-cli download mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --local-dir ./Mistral-7B-Instruct-v0.3-MLX-4bit

CHAPTER EIGHT: A COMPLETE WORKED EXAMPLE FROM START TO FINISH

To solidify everything we have covered, let us walk through a complete, end-to-end example using Google's Gemma 2 9B Instruct.

Downloading Gemma 2 9B Instruct

Gemma 2 requires you to accept Google's usage terms on HuggingFace before downloading. After doing so at https://huggingface.co/google/gemma-2-9b-it, log in and download:

huggingface-cli login
huggingface-cli download google/gemma-2-9b-it \
    --local-dir ./gemma-2-9b-it \
    --local-dir-use-symlinks False

Converting to GGUF and Quantizing

cd llama.cpp
python convert_hf_to_gguf.py \
    ../gemma-2-9b-it \
    --outfile ../gemma-2-9b-it-F16.gguf \
    --outtype f16

./build/bin/llama-quantize \
    ../gemma-2-9b-it-F16.gguf \
    ../gemma-2-9b-it-Q4_K_M.gguf \
    Q4_K_M

The resulting Q4_K_M file will be approximately 5.8 GB.

Creating the Ollama Modelfile for Gemma 2

Save the following as a file named Modelfile-Gemma2. Remember that the blank lines inside the TEMPLATE block are literal newlines, not \n escape sequences:

FROM /absolute/path/to/gemma-2-9b-it-Q4_K_M.gguf

TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
"""

PARAMETER stop "<end_of_turn>"
PARAMETER temperature 0.7
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192

Then register the model with Ollama and start an interactive session:

ollama create gemma2-9b-custom -f ./Modelfile-Gemma2
ollama run gemma2-9b-custom
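To see exactly what text the Gemma 2 template above sends to the model, it can help to reproduce it outside Ollama. The sketch below is a plain Python re-implementation of that one template for inspection purposes; render_gemma_prompt is an illustrative name, and this is not how Ollama itself renders templates.

```python
def render_gemma_prompt(prompt: str, system: str = "") -> str:
    """Mimic the Gemma 2 Modelfile TEMPLATE shown above for a single turn."""
    parts = ["<start_of_turn>user\n"]
    if system:
        # Gemma has no system role here; the system text is folded into the
        # user turn, followed by one blank line.
        parts.append(system + "\n\n")
    parts.append(prompt + "<end_of_turn>\n")
    # The trailing newline after "model" is where generation begins.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


# repr() makes the literal newline characters visible.
print(repr(render_gemma_prompt("Hello", system="Be brief.")))
```

Comparing this string with the output of the tokenizer's chat template is a quick way to catch a missing newline or a misplaced turn marker.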

Converting Gemma 2 to MLX on Apple Silicon

python -m mlx_lm.convert \
    --hf-path ./gemma-2-9b-it \
    --mlx-path ./gemma-2-9b-it-MLX-4bit \
    -q --q-bits 4

Running inference with the correct chat template applied automatically:

from mlx_lm import load, generate

model, tokenizer = load("./gemma-2-9b-it-MLX-4bit")

messages = [{"role": "user", "content": "Tell me something fascinating about prime numbers."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=300)
print(response)

CHAPTER NINE: TROUBLESHOOTING COMMON PROBLEMS

The conversion script fails with "unknown architecture" or "Model architecture not supported". This happens when convert_hf_to_gguf.py does not recognize the model architecture. The fix is to update your llama.cpp clone to the latest version, since new architecture support is added frequently:

cd llama.cpp
git fetch origin
git reset --hard origin/HEAD
pip install -r requirements.txt

Then re-run the conversion. Note that the script was previously named convert.py in older versions of llama.cpp; if you have a very old clone, updating will also rename it correctly.
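If you are unsure which script name your clone contains, a quick check like the following sketch can tell you. The names convert_hf_to_gguf.py and convert.py come from this guide; the hyphenated convert-hf-to-gguf.py is included on the assumption that some intermediate versions used it. The function name find_convert_script is illustrative.

```python
from pathlib import Path
from typing import Optional

# Newest name first; the hyphenated variant is an assumption about
# intermediate llama.cpp versions, convert.py is the oldest name.
CANDIDATE_NAMES = (
    "convert_hf_to_gguf.py",
    "convert-hf-to-gguf.py",
    "convert.py",
)


def find_convert_script(llama_dir: Path) -> Optional[Path]:
    """Return the first conversion script found in the clone, or None."""
    for name in CANDIDATE_NAMES:
        candidate = llama_dir / name
        if candidate.is_file():
            return candidate
    return None


# Example: print(find_convert_script(Path("llama.cpp")))
```

If only an older name turns up, that is a strong hint the clone predates support for recent architectures and should be updated.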

The model generates garbage or repetitive output. This is almost always a chat template problem. The model is receiving input in a format it was not trained with. Double-check the template in your Modelfile against the model's tokenizer_config.json on HuggingFace. Look for the chat_template field in that JSON file; it is the authoritative Jinja2 template that you need to translate into Ollama's Go template syntax. Remember that newlines in the Modelfile TEMPLATE block must be actual newline characters, not the \n escape sequence. Also remember that for Llama-family models, Llama 2 and Llama 3 use completely different templates despite sharing the same architecture class name; distinguish them by their vocabulary size (32,000 for Llama 2, 128,256 for Llama 3). Adding PARAMETER repeat_penalty 1.1 can also help if the model is looping.
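The vocabulary-size check mentioned above is easy to run by hand. The sketch below reads vocab_size from a downloaded model's config.json and maps it to a template family using the two values quoted in this chapter; the function name and the "unknown" fallback are illustrative choices, and fine-tunes with extended vocabularies may need manual inspection.

```python
import json
from pathlib import Path


def llama_template_family(config_path: Path) -> str:
    """Guess 'llama2' or 'llama3' from the vocab_size field in config.json."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    vocab_size = int(config.get("vocab_size", 0))
    if vocab_size >= 128000:   # Llama 3 family uses 128,256
        return "llama3"
    if vocab_size == 32000:    # Llama 2 uses 32,000
        return "llama2"
    return "unknown"           # anything else: inspect chat_template by hand


# Example: print(llama_template_family(Path("./Meta-Llama-3.1-8B-Instruct/config.json")))
```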

The model runs out of memory during conversion. The conversion script loads all model weights into memory at once. For large models (13B parameters and above), this can require 30 GB or more of RAM. If you run out of memory, consider using a machine with more RAM for conversion, or look for pre-converted GGUF files on HuggingFace. The user bartowski at https://huggingface.co/bartowski maintains high-quality GGUF conversions of many popular models.

The quantized model is much slower than expected. Check whether llama.cpp was compiled with the appropriate GPU acceleration flags. On Apple Silicon, make sure you did not accidentally disable Metal. On Linux with an NVIDIA GPU, make sure you compiled with -DGGML_CUDA=ON. For Ollama, GPU support is configured automatically during installation on most systems.

MLX conversion fails or mlx-lm is not found. The mlx-lm package only works on Apple Silicon Macs. If you are on any other platform, use the GGUF path with llama.cpp, Ollama, or LM Studio instead.

DeepSeek model output is garbled. DeepSeek V3 and R1 use full-width Unicode vertical bars (｜, U+FF5C) in their special tokens (<｜User｜> and <｜Assistant｜>), not ASCII pipes (|). Make sure your Modelfile uses the correct Unicode characters. Copy them from this guide or from the model's tokenizer_config.json rather than typing them manually.
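Because the two characters look almost identical on screen, a small check can save a lot of debugging. The sketch below rewrites ASCII-pipe versions of the two tokens named above into their full-width form; fix_deepseek_tokens is an illustrative helper, and it deliberately touches only the User and Assistant tokens so that ChatML-style tokens such as <|im_start|> are left alone.

```python
import re

FULLWIDTH_BAR = "\uff5c"  # U+FF5C FULLWIDTH VERTICAL LINE

# Matches only the ASCII-pipe look-alikes of the two DeepSeek role tokens.
_ASCII_ROLE_TOKEN = re.compile(r"<\|(User|Assistant)\|>")


def fix_deepseek_tokens(text: str) -> str:
    """Replace <|User|> / <|Assistant|> (ASCII pipes) with full-width versions."""
    return _ASCII_ROLE_TOKEN.sub(
        lambda m: f"<{FULLWIDTH_BAR}{m.group(1)}{FULLWIDTH_BAR}>",
        text,
    )


broken = "<|User|>Hello<|Assistant|>"
fixed = fix_deepseek_tokens(broken)
print(fixed != broken)                       # True: the pipes were replaced
print([hex(ord(c)) for c in fixed[:2]])      # ['0x3c', '0xff5c']
```

Running a Modelfile's contents through this function, or simply printing the code points as in the last line, confirms which character is actually in the file.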

CHAPTER TEN: CHOOSING THE RIGHT TOOL FOR YOUR SITUATION

If you are on Apple Silicon and want strong performance with tight hardware integration, MLX is an excellent primary option. The unified memory architecture and Apple's hardware-specific optimizations give MLX consistent advantages on Apple hardware. llama.cpp's Metal backend has also matured significantly, so benchmarking both for your specific model and workload is worthwhile. Use mlx-lm for MLX conversion and inference, and check the MLX Community on HuggingFace first to see if your model is already converted.

If you are on any hardware and want a developer-friendly server with both a native API and an OpenAI-compatible API (/v1/chat/completions), Ollama is the right choice. It is easy to install, easy to manage, and its API compatibility makes it straightforward to integrate into existing applications.

If you want a graphical interface with no command-line work, LM Studio is the answer. It is particularly useful for non-technical users or for quickly experimenting with different models and parameters without writing any code.

If you are building a custom application or need maximum control over inference parameters, compiling and using llama.cpp directly gives you the most flexibility.

These tools are not mutually exclusive. Many practitioners use Ollama for development and API access, LM Studio for quick experiments, and llama.cpp directly for benchmarking and custom integrations. The GGUF format that you create once can be used in all three of these environments without any additional conversion.

CHAPTER ELEVEN: AUTOMATING THE ENTIRE PIPELINE

Everything described in the preceding chapters can be automated with a single Python script. The script presented in this chapter, convert_model.py, handles the complete pipeline: it detects your operating system and GPU hardware, verifies dependencies, clones and compiles llama.cpp, downloads the model from HuggingFace, converts it to GGUF, quantizes it to one or more levels, detects the correct chat template from the model's tokenizer configuration (including correctly distinguishing Llama 2 from Llama 3 by vocabulary size, and handling DeepSeek's full-width Unicode tokens), generates a Modelfile, registers the model with Ollama, converts to MLX on Apple Silicon, sets up the LM Studio directory structure, runs validation inference tests, and produces a summary report.

You invoke it with a single command:

python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 --quant Q4_K_M Q8_0

The full script, along with its dependency file and a plain-text README, follows immediately below.

FILE 1 OF 3 — requirements_convert.txt

# Requirements for convert_model.py
# Install with: pip install -r requirements_convert.txt
#
# After installing these, also install PyTorch separately because the
# correct command depends on your hardware:
#
#   CPU-only (sufficient for conversion on any platform):
#     pip install torch --index-url https://download.pytorch.org/whl/cpu
#
#   NVIDIA GPU (visit https://pytorch.org for the right CUDA version):
#     pip install torch --index-url https://download.pytorch.org/whl/cu121
#
#   Apple Silicon (standard install, includes MPS support):
#     pip install torch
#
# mlx-lm is Apple Silicon ONLY and is intentionally NOT listed here
# because `pip install mlx-lm` fails on non-Apple hardware.
# On Apple Silicon, install it separately after activating your venv:
#   pip install mlx-lm

huggingface_hub>=0.23.0
transformers>=4.40.0
sentencepiece>=0.1.98
numpy>=1.26.0
requests>=2.31.0
tqdm>=4.66.0

FILE 2 OF 3 — convert_model.py

#!/usr/bin/env python3
"""
convert_model.py  —  Universal HuggingFace → Local Inference Converter
=======================================================================
Automates the full pipeline for converting any HuggingFace language model
to run locally on llama.cpp, Ollama, LM Studio, and Apple MLX.

Pipeline steps
--------------
1.  Environment and hardware detection
2.  Dependency verification
3.  llama.cpp clone / update and compilation
4.  Model download from HuggingFace
5.  Architecture validation and chat-template detection
6.  GGUF conversion (F16 base)
7.  Quantization to one or more levels
8.  Ollama Modelfile generation
9.  Ollama model registration and API test
10. MLX conversion and inference test (Apple Silicon only)
11. LM Studio directory setup
12. llama-cli validation test
13. Summary report

Usage
-----
    python convert_model.py --model <hf-repo-id> [options]

Quick examples
--------------
    # Basic — Q4_K_M only, all steps:
    python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3

    # Multiple quant levels, gated model:
    python convert_model.py \\
        --model meta-llama/Meta-Llama-3.1-8B-Instruct \\
        --hf-token hf_xxx \\
        --quant Q4_K_M Q5_K_M Q8_0 \\
        --ollama-name llama31-8b

    # Apple Silicon: also produce MLX model:
    python convert_model.py \\
        --model google/gemma-2-9b-it \\
        --quant Q4_K_M \\
        --mlx --mlx-bits 4

    # GGUF only, skip Ollama / LM Studio:
    python convert_model.py \\
        --model Qwen/Qwen2-7B-Instruct \\
        --skip-ollama --skip-lmstudio

Run with --help for the full option reference.
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import platform
import re
import shutil
import subprocess
import sys
import textwrap
import time
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------

LOG_FORMAT = "%(asctime)s  %(levelname)-8s  %(message)s"
logging.basicConfig(
    level=logging.INFO,
    format=LOG_FORMAT,
    datefmt="%H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log = logging.getLogger("convert_model")

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------

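# The repository now lives under the ggml-org organization; the older
# ggerganov/llama.cpp URL redirects here.
LLAMA_CPP_REPO = "https://github.com/ggml-org/llama.cpp.git"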

ALL_QUANT_LEVELS: List[str] = [
    "Q2_K",
    "Q3_K_M",
    "Q4_K_M",
    "Q5_K_M",
    "Q6_K",
    "Q8_0",
]

DEFAULT_QUANT_LEVELS: List[str] = ["Q4_K_M"]

# Approximate effective bits-per-weight for rough size estimation.
# K-quants mix several tensor precisions, so real files vary by model.
QUANT_BPW: Dict[str, float] = {
    "Q2_K":   3.3,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K":   6.6,
    "Q8_0":   8.5,
    "F16":   16.0,
}

# Vocabulary sizes used to distinguish Llama 2 from Llama 3.
# Both use LlamaForCausalLM; vocab size is the reliable differentiator.
LLAMA2_VOCAB_SIZE: int = 32000
LLAMA3_VOCAB_SIZE_MIN: int = 128000  # Llama 3 uses 128256

# Maximum context size cap for auto-detection.
# Set high enough to accommodate 128 K-context models (Llama 3.1, Qwen2, etc.)
# while still preventing absurdly large defaults on machines with limited RAM.
MAX_AUTO_CTX: int = 131072

# ---------------------------------------------------------------------------
# Ollama chat templates
# ---------------------------------------------------------------------------
#
# IMPORTANT — newline handling:
#   The template strings below use Python \n escape sequences, which produce
#   ACTUAL newline characters in the Python string.  When written to a file,
#   these become literal newlines in the Modelfile TEMPLATE block, which is
#   exactly what Ollama requires.  The Go template engine does NOT interpret
#   the two-character sequence backslash-n as a newline.
#
# IMPORTANT — BOS token:
#   No template includes a BOS (beginning-of-sequence) token.  llama.cpp
#   reads the BOS token from the GGUF metadata and prepends it automatically.
#   Including it in the template would produce a double-BOS and degrade output.
#
# Each entry:
#   "template" : Go-template string for the Modelfile TEMPLATE block
#   "stop"     : stop token strings for PARAMETER stop lines
#   "notes"    : human-readable description
# ---------------------------------------------------------------------------

OLLAMA_TEMPLATES: Dict[str, Dict] = {
    # ------------------------------------------------------------------
    # Llama 3 / 3.1 / 3.2 / 3.3 Instruct
    # ------------------------------------------------------------------
    "llama3": {
        "template": (
            "{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n"
            "{{ .System }}<|eot_id|>{{ end }}"
            "<|start_header_id|>user<|end_header_id|>\n\n"
            "{{ .Prompt }}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        ),
        "stop": [
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|eot_id|>",
        ],
        "notes": "Llama 3 / 3.1 / 3.2 / 3.3 Instruct (vocab ≥ 128 000)",
    },
    # ------------------------------------------------------------------
    # Llama 2 Chat
    # ------------------------------------------------------------------
    "llama2": {
        "template": (
            "[INST] {{ if .System }}<<SYS>>\n"
            "{{ .System }}\n"
            "<</SYS>>\n\n"
            "{{ end }}{{ .Prompt }} [/INST]"
        ),
        "stop": ["[INST]", "[/INST]"],
        "notes": "Llama 2 Chat (vocab = 32 000)",
    },
    # ------------------------------------------------------------------
    # Mistral / Mixtral Instruct
    # ------------------------------------------------------------------
    "mistral": {
        "template": (
            "[INST] {{ if .System }}{{ .System }}\n"
            "{{ end }}{{ .Prompt }} [/INST]"
        ),
        "stop": ["[INST]", "[/INST]"],
        "notes": "Mistral / Mixtral Instruct",
    },
    # ------------------------------------------------------------------
    # Gemma / Gemma 2 / Gemma 3 Instruct
    # ------------------------------------------------------------------
    "gemma": {
        "template": (
            "<start_of_turn>user\n"
            "{{ if .System }}{{ .System }}\n\n{{ end }}"
            "{{ .Prompt }}<end_of_turn>\n"
            "<start_of_turn>model\n"
        ),
        "stop": ["<end_of_turn>"],
        "notes": "Gemma / Gemma 2 / Gemma 3 Instruct",
    },
    # ------------------------------------------------------------------
    # Qwen2 / Qwen2.5 / Yi / generic ChatML
    # ------------------------------------------------------------------
    "qwen2": {
        "template": (
            "{{ if .System }}<|im_start|>system\n{{ .System }}<|im_end|>\n{{ end }}"
            "<|im_start|>user\n{{ .Prompt }}<|im_end|>\n"
            "<|im_start|>assistant\n"
        ),
        "stop": ["<|im_end|>", "<|im_start|>"],
        "notes": "Qwen2 / Qwen2.5 / Yi / ChatML",
    },
    # ------------------------------------------------------------------
    # Phi-3 / Phi-3.5 / Phi-3.5 MoE
    # ------------------------------------------------------------------
    "phi3": {
        "template": (
            "{{ if .System }}<|system|>\n{{ .System }}<|end|>\n{{ end }}"
            "<|user|>\n{{ .Prompt }}<|end|>\n"
            "<|assistant|>\n"
        ),
        "stop": ["<|end|>"],
        "notes": "Phi-3 / Phi-3.5 / Phi-3.5 MoE Instruct",
    },
    # ------------------------------------------------------------------
    # DeepSeek V3 / R1
    # NOTE: These models use full-width Unicode vertical bars (U+FF5C) ｜
    # in their special tokens, NOT ASCII pipes |.  The characters below
    # are the correct Unicode codepoints.
    # ------------------------------------------------------------------
    "deepseek": {
        "template": (
            "{{ if .System }}{{ .System }}\n\n{{ end }}"
            # The user turn must be introduced by the <｜User｜> token.
            "<\uff5cUser\uff5c>{{ .Prompt }}<\uff5cAssistant\uff5c>"
        ),
        "stop": ["<\uff5cUser\uff5c>", "<\uff5cAssistant\uff5c>"],
        "notes": "DeepSeek V3 / R1 (full-width Unicode vertical bars)",
    },
    # ------------------------------------------------------------------
    # Cohere Command-R / Command-R+
    # BOS token is intentionally absent (handled by llama.cpp from GGUF).
    # ------------------------------------------------------------------
    "command": {
        "template": (
            "{{ if .System }}"
            "<|START_OF_TURN_TOKEN|><|SYSTEM_PROMPT_TOKEN|>"
            "{{ .System }}<|END_OF_TURN_TOKEN|>"
            "{{ end }}"
            "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
            "{{ .Prompt }}<|END_OF_TURN_TOKEN|>"
            "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
        ),
        "stop": ["<|END_OF_TURN_TOKEN|>"],
        "notes": "Cohere Command-R / Command-R+",
    },
    # ------------------------------------------------------------------
    # Falcon Instruct
    # ------------------------------------------------------------------
    "falcon": {
        "template": (
            "{{ if .System }}System: {{ .System }}\n{{ end }}"
            "User: {{ .Prompt }}\nAssistant:"
        ),
        "stop": ["User:", "Assistant:"],
        "notes": "Falcon Instruct",
    },
    # ------------------------------------------------------------------
    # Generic ChatML fallback
    # ------------------------------------------------------------------
    "chatml": {
        "template": (
            "{{ if .System }}<|im_start|>system\n{{ .System }}<|im_end|>\n{{ end }}"
            "<|im_start|>user\n{{ .Prompt }}<|im_end|>\n"
            "<|im_start|>assistant\n"
        ),
        "stop": ["<|im_end|>", "<|im_start|>"],
        "notes": "Generic ChatML (fallback for fine-tuned models)",
    },
}

# ---------------------------------------------------------------------------
# Architecture → template key mapping
#
# Matched case-insensitively against the architectures list in config.json.
# "llama" is intentionally absent: Llama 2 vs Llama 3 is resolved by
# vocabulary size in TemplateDetector._resolve_llama_version(), not here.
# ---------------------------------------------------------------------------
ARCH_TO_TEMPLATE: Dict[str, str] = {
    "mistral":          "mistral",
    "mixtral":          "mistral",
    "gemma":            "gemma",   # catches Gemma, Gemma2, Gemma3
    "qwen2":            "qwen2",
    "qwen":             "qwen2",
    "phi3":             "phi3",
    "phimoe":           "phi3",    # Phi-3.5 MoE
    "deepseek":         "deepseek",
    "cohere":           "command",
    "command":          "command",
    "falcon":           "falcon",
    "yi":               "qwen2",
    "internlm":         "chatml",
    "baichuan":         "chatml",
    "chatglm":          "chatml",
    "starcoder":        "chatml",
    "codellama":        "llama2",  # Code Llama uses Llama 2 template
}

# Repository name substrings → template key (checked after arch mapping).
REPO_TO_TEMPLATE: Dict[str, str] = {
    "llama-3":          "llama3",
    "llama3":           "llama3",
    "llama-2":          "llama2",
    "llama2":           "llama2",
    "codellama":        "llama2",
    "mistral":          "mistral",
    "mixtral":          "mistral",
    "gemma":            "gemma",
    "qwen2":            "qwen2",
    "qwen2.5":          "qwen2",
    "phi-3":            "phi3",
    "phi3":             "phi3",
    "deepseek":         "deepseek",
    "command-r":        "command",
    "falcon":           "falcon",
    "yi-":              "qwen2",
    "internlm":         "chatml",
    "baichuan":         "chatml",
}

# Default context window sizes by template family (tokens).
FAMILY_CTX: Dict[str, int] = {
    "llama3":   8192,
    "llama2":   4096,
    "mistral":  8192,
    "gemma":    8192,
    "qwen2":    8192,
    "phi3":     4096,
    "deepseek": 8192,
    "command":  8192,
    "falcon":   2048,
    "chatml":   4096,
}

# Default generation parameters written to every Modelfile.
MODELFILE_DEFAULTS: Dict[str, object] = {
    "temperature":    0.7,
    "top_p":          0.9,
    "repeat_penalty": 1.1,
}


# ===========================================================================
# 1. Environment Detection
# ===========================================================================

class Environment:
    """Detects OS, CPU architecture, and available GPU hardware."""

    def __init__(self) -> None:
        self.os_name: str = platform.system()        # Darwin / Linux / Windows
        self.machine: str = platform.machine()        # x86_64 / arm64 / AMD64
        self.is_macos: bool = self.os_name == "Darwin"
        self.is_linux: bool = self.os_name == "Linux"
        self.is_windows: bool = self.os_name == "Windows"
        self.is_apple_silicon: bool = (
            self.is_macos and self.machine in ("arm64", "aarch64")
        )
        self.has_nvidia: bool = self._detect_nvidia()
        self.has_amd: bool = self._detect_amd()
        self.python_exe: str = sys.executable
        self.cpu_count: int = os.cpu_count() or 4

    # ------------------------------------------------------------------
    def _detect_nvidia(self) -> bool:
        try:
            result = subprocess.run(
                ["nvidia-smi"],
                capture_output=True,
                timeout=10,
            )
            return result.returncode == 0
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return False

    def _detect_amd(self) -> bool:
        try:
            result = subprocess.run(
                ["rocm-smi"],
                capture_output=True,
                timeout=10,
            )
            return result.returncode == 0
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return False

    # ------------------------------------------------------------------
    def cmake_extra_flags(self) -> List[str]:
        """Return CMake flags appropriate for this platform's GPU."""
        if self.has_nvidia:
            return ["-DGGML_CUDA=ON"]
        if self.has_amd:
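            # Current llama.cpp uses GGML_HIP for ROCm builds; older trees
            # used GGML_HIPBLAS.  The script always builds a fresh clone.
            return ["-DGGML_HIP=ON"]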
        # Metal is the default on macOS; no extra flag needed.
        return []

    def llama_bin_dir(self, llama_root: Path) -> Path:
        if self.is_windows:
            return llama_root / "build" / "bin" / "Release"
        return llama_root / "build" / "bin"

    def bin_ext(self) -> str:
        return ".exe" if self.is_windows else ""

    def describe(self) -> str:
        gpu_info: List[str] = []
        if self.is_apple_silicon:
            gpu_info.append("Apple Silicon (Metal)")
        if self.has_nvidia:
            gpu_info.append("NVIDIA (CUDA)")
        if self.has_amd:
            gpu_info.append("AMD (ROCm)")
        if not gpu_info:
            gpu_info.append("CPU only")
        return (
            f"OS={self.os_name} arch={self.machine} "
            f"GPU={', '.join(gpu_info)} CPUs={self.cpu_count}"
        )


# ===========================================================================
# 2. Dependency Manager
# ===========================================================================

class DependencyManager:
    """Checks for required external tools and Python packages."""

    def __init__(self, env: Environment) -> None:
        self.env = env

    # ------------------------------------------------------------------
    def check_git(self) -> None:
        if not shutil.which("git"):
            raise RuntimeError(
                "git is not installed or not on PATH. "
                "Install git from https://git-scm.com and retry."
            )
        log.info("git: OK")

    def check_cmake(self) -> None:
        if not shutil.which("cmake"):
            raise RuntimeError(
                "cmake is not installed or not on PATH. "
                "Install cmake from https://cmake.org and retry."
            )
        log.info("cmake: OK")

    def check_ollama(self) -> bool:
        """Return True if ollama is installed and the server is reachable."""
        if not shutil.which("ollama"):
            log.warning(
                "ollama not found on PATH.  Ollama steps will be skipped. "
                "Install from https://ollama.com"
            )
            return False
        try:
            import urllib.request
            urllib.request.urlopen(
                "http://localhost:11434/api/tags", timeout=5
            )
            log.info("ollama: OK (server reachable at localhost:11434)")
            return True
        except Exception:
            log.warning(
                "ollama is installed but the server is not reachable. "
                "Start it with: ollama serve"
            )
            return False

    def check_python_packages(self, packages: List[str]) -> None:
        """Verify that required Python packages are importable."""
        import importlib
        missing: List[str] = []
        for pkg in packages:
            try:
                importlib.import_module(pkg)
            except ImportError:
                missing.append(pkg)
        if missing:
            raise RuntimeError(
                f"Missing Python packages: {', '.join(missing)}. "
                f"Run: pip install {' '.join(missing)}"
            )
        log.info(f"Python packages {packages}: OK")

    def check_mlx(self) -> bool:
        """Return True if mlx-lm is available (Apple Silicon only)."""
        if not self.env.is_apple_silicon:
            return False
        import importlib
        try:
            importlib.import_module("mlx_lm")
            log.info("mlx-lm: OK")
            return True
        except ImportError:
            log.warning(
                "mlx-lm not found.  Install with: pip install mlx-lm  "
                "(Apple Silicon only).  MLX steps will be skipped."
            )
            return False

    def run_all(self) -> Dict[str, bool]:
        """Run all dependency checks. Returns feature-availability dict."""
        self.check_git()
        self.check_cmake()
        self.check_python_packages([
            "huggingface_hub",
            "transformers",
            "sentencepiece",
            "numpy",
            "requests",
        ])
        ollama_ok = self.check_ollama()
        mlx_ok = self.check_mlx()
        return {"ollama": ollama_ok, "mlx": mlx_ok}


# ===========================================================================
# 3. llama.cpp Manager
# ===========================================================================

class LlamaCppManager:
    """Clones (or updates) and compiles llama.cpp."""

    def __init__(self, env: Environment, llama_dir: Path) -> None:
        self.env = env
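        # Resolve to an absolute path: several commands run with
        # cwd=self.llama_dir, and a relative build path would otherwise be
        # interpreted relative to that directory.
        self.llama_dir = llama_dir.resolve()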

    # ------------------------------------------------------------------
    def ensure_repo(self) -> None:
        """
        Clone llama.cpp if not present, or fetch + hard-reset to origin/HEAD.
        Using fetch + reset --hard is more robust than `git pull --ff-only`
        because it succeeds even when the local branch has diverged or has
        uncommitted modifications.
        """
        if (self.llama_dir / ".git").exists():
            log.info(
                f"llama.cpp repo found at {self.llama_dir}; "
                "fetching latest changes..."
            )
            try:
                self._run(
                    ["git", "fetch", "origin"],
                    cwd=self.llama_dir,
                )
                self._run(
                    ["git", "reset", "--hard", "origin/HEAD"],
                    cwd=self.llama_dir,
                )
            except subprocess.CalledProcessError as e:
                log.warning(
                    f"git update failed ({e}); continuing with existing clone. "
                    "If conversion fails, delete the llama.cpp directory and retry."
                )
        else:
            log.info(f"Cloning llama.cpp into {self.llama_dir} (shallow)...")
            self._run([
                "git", "clone", "--depth", "1",
                LLAMA_CPP_REPO, str(self.llama_dir),
            ])

    def install_python_deps(self) -> None:
        req_file = self.llama_dir / "requirements.txt"
        if req_file.exists():
            log.info("Installing llama.cpp Python requirements...")
            self._run([
                self.env.python_exe, "-m", "pip", "install",
                "-r", str(req_file), "--quiet",
            ])
        else:
            log.warning(
                "llama.cpp requirements.txt not found; "
                "skipping Python dep install for llama.cpp."
            )

    def build(self) -> None:
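        """
        Configure and compile llama.cpp.
        Skips the build when llama-quantize already exists, so binaries are
        not rebuilt after ensure_repo() pulls new sources.  Delete the build
        directory to force a rebuild.
        """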
        quantize_bin = (
            self.env.llama_bin_dir(self.llama_dir)
            / f"llama-quantize{self.env.bin_ext()}"
        )
        if quantize_bin.exists():
            log.info("llama.cpp already compiled; skipping build.")
            return

        build_dir = self.llama_dir / "build"
        log.info("Configuring llama.cpp with CMake...")
        cmake_flags = self.env.cmake_extra_flags()
        self._run(
            ["cmake", "-B", str(build_dir)] + cmake_flags,
            cwd=self.llama_dir,
        )

        log.info(
            f"Compiling llama.cpp with {self.env.cpu_count} threads "
            "(this may take several minutes)..."
        )
        self._run(
            [
                "cmake", "--build", str(build_dir),
                "--config", "Release",
                "-j", str(self.env.cpu_count),
            ],
            cwd=self.llama_dir,
        )
        log.info("llama.cpp compilation complete.")

    # ------------------------------------------------------------------
    def convert_script(self) -> Path:
        script = self.llama_dir / "convert_hf_to_gguf.py"
        if not script.exists():
            raise FileNotFoundError(
                f"convert_hf_to_gguf.py not found at {script}. "
                "Make sure llama.cpp cloned successfully."
            )
        return script

    def quantize_bin(self) -> Path:
        p = (
            self.env.llama_bin_dir(self.llama_dir)
            / f"llama-quantize{self.env.bin_ext()}"
        )
        if not p.exists():
            raise FileNotFoundError(
                f"llama-quantize not found at {p}. "
                "Make sure llama.cpp compiled successfully."
            )
        return p

    def cli_bin(self) -> Optional[Path]:
        p = (
            self.env.llama_bin_dir(self.llama_dir)
            / f"llama-cli{self.env.bin_ext()}"
        )
        return p if p.exists() else None

    # ------------------------------------------------------------------
    def _run(self, cmd: List[str], cwd: Optional[Path] = None) -> None:
        log.debug(f"Running: {' '.join(str(c) for c in cmd)}")
        subprocess.run(
            cmd,
            cwd=str(cwd) if cwd else None,
            check=True,
        )


# ===========================================================================
# 4. Model Downloader
# ===========================================================================

class ModelDownloader:
    """Downloads a model from HuggingFace Hub."""

    # Patterns to download (everything needed for PyTorch/safetensors conversion).
    INCLUDE_PATTERNS: List[str] = [
        "*.safetensors",
        "*.bin",
        "*.json",
        "*.txt",
        "*.model",
        "*.tiktoken",
    ]

    # Patterns to skip (non-PyTorch framework weights).
    EXCLUDE_PATTERNS: List[str] = [
        "*.msgpack",
        "*.h5",
        "flax_model*",
        "tf_model*",
        "rust_model*",
        "*.ot",
    ]

    def __init__(
        self,
        repo_id: str,
        local_dir: Path,
        hf_token: Optional[str],
    ) -> None:
        self.repo_id = repo_id
        self.local_dir = local_dir
        self.hf_token = hf_token

    # ------------------------------------------------------------------
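    def _has_weight_files(self) -> bool:
        """
        Return True if the weights in local_dir look complete.
        Sharded checkpoints must contain every shard named in their index
        file; otherwise a single weight file is enough.
        """
        for index_name in (
            "model.safetensors.index.json",
            "pytorch_model.bin.index.json",
        ):
            index_path = self.local_dir / index_name
            if index_path.exists():
                try:
                    with open(index_path, encoding="utf-8") as f:
                        shards = set(json.load(f).get("weight_map", {}).values())
                except (OSError, ValueError):
                    return False
                return bool(shards) and all(
                    (self.local_dir / s).exists() for s in shards
                )
        for pattern in ("*.safetensors", "*.bin"):
            if any(self.local_dir.glob(pattern)):
                return True
        return False

    def download(self) -> Path:
        """
        Download the model.  Skips if config.json exists and the weight
        files look complete.  An interrupted download is detected via the
        shard index and handed back to snapshot_download, which fetches
        only the missing files.
        """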
        config_path = self.local_dir / "config.json"
        if config_path.exists() and self._has_weight_files():
            log.info(
                f"Model directory {self.local_dir} already contains model files; "
                "skipping download."
            )
            return self.local_dir

        log.info(f"Downloading {self.repo_id} to {self.local_dir}...")
        self.local_dir.mkdir(parents=True, exist_ok=True)

        try:
            from huggingface_hub import snapshot_download
        except ImportError as e:
            raise RuntimeError(
                "huggingface_hub not installed.  Run: pip install huggingface_hub"
            ) from e

        snapshot_download(
            repo_id=self.repo_id,
            local_dir=str(self.local_dir),
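            # local_dir_use_symlinks is deprecated (ignored since
            # huggingface_hub 0.23, where local_dir always gets real files),
            # so it is no longer passed here.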
            allow_patterns=self.INCLUDE_PATTERNS,
            ignore_patterns=self.EXCLUDE_PATTERNS,
            token=self.hf_token,
        )
        log.info(f"Download complete: {self.local_dir}")
        return self.local_dir

    # ------------------------------------------------------------------
    def load_config(self) -> Dict:
        config_path = self.local_dir / "config.json"
        if not config_path.exists():
            raise FileNotFoundError(
                f"config.json not found in {self.local_dir}. "
                "Download may have failed or been incomplete."
            )
        with open(config_path, encoding="utf-8") as f:
            return json.load(f)

    def load_tokenizer_config(self) -> Dict:
        tc_path = self.local_dir / "tokenizer_config.json"
        if not tc_path.exists():
            log.warning(
                "tokenizer_config.json not found; "
                "template detection will rely on architecture and repo name only."
            )
            return {}
        with open(tc_path, encoding="utf-8") as f:
            return json.load(f)

    # ------------------------------------------------------------------
    def estimate_param_count(self, config: Dict) -> Optional[int]:
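        """
        Rough parameter-count estimate from config.json.
        Returns None if required fields are absent.
        Assumes full multi-head attention and a SwiGLU feed-forward block,
        so GQA models (Llama 3, Mistral) are slightly overestimated.
        An untied output head (lm_head), norms, and biases are not counted.
        """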
        try:
            h = config.get("hidden_size", 0)
            n_layers = config.get("num_hidden_layers", 0)
            vocab = config.get("vocab_size", 0)
            intermediate = config.get("intermediate_size", h * 4)
            if not (h and n_layers and vocab):
                return None
            attn = 4 * h * h           # Q, K, V, O projections (full MHA)
            ffn = 3 * h * intermediate  # gate, up, down (SwiGLU)
            return vocab * h + n_layers * (attn + ffn)
        except Exception:
            return None

    def validate_architecture(self, config: Dict) -> None:
        """
        Warn if the model does not look like a causal language model.
        This is a heuristic check; convert_hf_to_gguf.py will give a
        definitive error if the architecture is truly unsupported.
        """
        archs = config.get("architectures", [])
        if not archs:
            log.warning(
                "config.json has no 'architectures' field.  "
                "Conversion may fail if this is not a causal LM."
            )
            return
        arch_lower = archs[0].lower()
        # Rough heuristic: text-generation models usually end in ForCausalLM
        # (decoder-only) or ForConditionalGeneration (encoder-decoder or
        # multimodal).
        if "forcausallm" not in arch_lower and "forconditionalgeneration" not in arch_lower:
            log.warning(
                f"Architecture '{archs[0]}' does not look like a text-generation "
                "model.  Conversion may fail.  Encoder-only and vision models "
                "are generally not usable for chat with llama.cpp."
            )
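

# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): a GQA-aware variant of
# ModelDownloader.estimate_param_count().  The function name is hypothetical.
# It assumes a Llama-style decoder whose config.json provides
# num_attention_heads, num_key_value_heads and tie_word_embeddings, and it
# ignores norm weights and biases.
#
# Worked example with the Llama 3 8B config values
# (hidden_size=4096, num_hidden_layers=32, vocab_size=128256,
#  num_attention_heads=32, num_key_value_heads=8, intermediate_size=14336,
#  tie_word_embeddings=False):
#   attention per layer = 2*4096*4096 + 2*4096*1024 =    41,943,040
#   feed-forward/layer  = 3*4096*14336              =   176,160,768
#   32 layers                                       = 6,979,321,856
#   embeddings + lm_head = 2*128256*4096            = 1,050,673,152
#   total                                           = 8,029,995,008
# ---------------------------------------------------------------------------
from typing import Dict, Optional


def estimate_param_count_gqa(config: Dict) -> Optional[int]:
    """Estimate parameters, accounting for grouped-query attention."""
    h = config.get("hidden_size", 0)
    n_layers = config.get("num_hidden_layers", 0)
    vocab = config.get("vocab_size", 0)
    n_heads = config.get("num_attention_heads", 0)
    if not (h and n_layers and vocab and n_heads):
        return None
    # Without GQA every query head has its own K/V head.
    n_kv = config.get("num_key_value_heads") or n_heads
    intermediate = config.get("intermediate_size", h * 4)
    kv_dim = (h // n_heads) * n_kv
    attn = 2 * h * h + 2 * h * kv_dim   # Q and O full size; K and V shrunk
    ffn = 3 * h * intermediate          # gate, up, down (SwiGLU)
    # An untied output head duplicates the embedding matrix.
    embed_copies = 1 if config.get("tie_word_embeddings", False) else 2
    return embed_copies * vocab * h + n_layers * (attn + ffn)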


# ===========================================================================
# 5. GGUF Converter
# ===========================================================================

class GGUFConverter:
    """Converts a HuggingFace model directory to a float16 GGUF base file."""

    def __init__(
        self,
        llama_mgr: LlamaCppManager,
        model_dir: Path,
        output_dir: Path,
        model_name: str,
        split_max_size: Optional[str] = None,
    ) -> None:
        self.llama_mgr = llama_mgr
        self.model_dir = model_dir
        self.output_dir = output_dir
        self.model_name = model_name
        self.split_max_size = split_max_size  # e.g. "4G" or None

    # ------------------------------------------------------------------
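    def _shards(self, out_path: Path) -> List[Path]:
        """
        Return the split files belonging to out_path, if any.  With
        --split-max-size the converter writes <stem>-00001-of-0000N.gguf
        files instead of out_path itself.
        """
        return sorted(self.output_dir.glob(f"{out_path.stem}-*-of-*.gguf"))

    def convert_f16(self) -> Path:
        """
        Convert the HuggingFace model to a float16 GGUF base file.
        Returns the path to the produced GGUF file (the first shard when
        the output was split).
        Skips conversion if the output already exists.
        """
        out_path = self.output_dir / f"{self.model_name}-F16.gguf"
        existing = [out_path] if out_path.exists() else self._shards(out_path)
        if existing:
            log.info(
                f"F16 GGUF already exists at {existing[0]}; skipping conversion."
            )
            return existing[0]

        log.info(f"Converting {self.model_dir} to GGUF (f16)...")
        script = self.llama_mgr.convert_script()
        cmd = [
            sys.executable,
            str(script),
            str(self.model_dir),
            "--outfile", str(out_path),
            "--outtype", "f16",
        ]
        if self.split_max_size:
            cmd += ["--split-max-size", self.split_max_size]

        subprocess.run(cmd, check=True)

        # A split conversion produces shards rather than out_path itself.
        produced = [out_path] if out_path.exists() else self._shards(out_path)
        if not produced:
            raise RuntimeError(
                f"Conversion appeared to succeed but {out_path} was not created. "
                "Check the output above for errors."
            )
        size_gb = sum(p.stat().st_size for p in produced) / 1e9
        log.info(f"F16 GGUF created: {produced[0]} ({size_gb:.2f} GB)")
        return produced[0]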


# ===========================================================================
# 6. Quantizer
# ===========================================================================

class Quantizer:
    """Produces quantized GGUF files from an F16 GGUF base."""

    def __init__(
        self,
        llama_mgr: LlamaCppManager,
        f16_gguf: Path,
        output_dir: Path,
        model_name: str,
    ) -> None:
        self.llama_mgr = llama_mgr
        self.f16_gguf = f16_gguf
        self.output_dir = output_dir
        self.model_name = model_name

    # ------------------------------------------------------------------
    def quantize(self, quant_type: str) -> Optional[Path]:
        """
        Quantize the F16 GGUF to the given quantization type.
        Returns the output path on success, None on failure.
        Skips if the output file already exists.
        """
        if quant_type not in ALL_QUANT_LEVELS:
            log.error(
                f"Unknown quantization type: {quant_type}. "
                f"Valid options: {ALL_QUANT_LEVELS}"
            )
            return None

        out_path = self.output_dir / f"{self.model_name}-{quant_type}.gguf"
        if out_path.exists():
            log.info(f"{quant_type} GGUF already exists at {out_path}; skipping.")
            return out_path

        log.info(f"Quantizing to {quant_type}...")
        try:
            quantize_bin = self.llama_mgr.quantize_bin()
            cmd = [
                str(quantize_bin),
                str(self.f16_gguf),
                str(out_path),
                quant_type,
            ]
            subprocess.run(cmd, check=True)
        except (FileNotFoundError, subprocess.CalledProcessError) as e:
            log.error(f"Quantization to {quant_type} failed: {e}")
            return None

        if not out_path.exists():
            log.error(
                f"Quantization to {quant_type} appeared to succeed "
                f"but {out_path} was not created."
            )
            return None

        size_gb = out_path.stat().st_size / 1e9
        log.info(f"{quant_type} GGUF created: {out_path} ({size_gb:.2f} GB)")
        return out_path

    def quantize_all(self, quant_types: List[str]) -> Dict[str, Path]:
        """
        Quantize to all requested types.  Continues on per-level failures.
        Returns {quant_type: path} for successful quantizations only.
        """
        results: Dict[str, Path] = {}
        for qt in quant_types:
            path = self.quantize(qt)
            if path is not None:
                results[qt] = path
            else:
                log.warning(f"Skipping {qt} due to quantization failure.")
        return results
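

# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the pipeline): a rough GGUF file-size
# estimate per quantization level, useful when choosing which levels to
# request.  The function name and the table below are assumptions made for
# illustration.  F16 and Q8_0 follow directly from their storage layout;
# the K-quant figures are approximate averages, because llama-quantize
# mixes tensor types within one file.
#
# Example: a model with about 8.03 billion parameters comes out at roughly
# 4.9 GB for Q4_K_M and roughly 8.5 GB for Q8_0.
# ---------------------------------------------------------------------------
from typing import Optional

APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,     # 32 int8 weights plus one fp16 scale per block
    "Q5_K_M": 5.7,   # approximate average
    "Q4_K_M": 4.85,  # approximate average
}


def estimate_gguf_size_gb(n_params: int, quant_type: str) -> Optional[float]:
    """Return the estimated file size in GB, or None for unknown levels."""
    bpw = APPROX_BITS_PER_WEIGHT.get(quant_type)
    if bpw is None or n_params <= 0:
        return None
    return n_params * bpw / 8 / 1e9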


# ===========================================================================
# 7. Template Detector
# ===========================================================================

class TemplateDetector:
    """
    Detects the correct Ollama chat template for a model by examining:
      1. Architecture class name (with Llama 2 / Llama 3 disambiguation)
      2. Repository name substrings
      3. Heuristic inspection of the chat_template Jinja2 string
    Falls back to generic ChatML if no match is found.
    """

    def __init__(
        self,
        repo_id: str,
        config: Dict,
        tokenizer_config: Dict,
    ) -> None:
        self.repo_id = repo_id.lower()
        self.config = config
        self.tokenizer_config = tokenizer_config
        self.architectures: List[str] = [
            a.lower() for a in config.get("architectures", [])
        ]
        self.vocab_size: int = config.get("vocab_size", 0)

    # ------------------------------------------------------------------
    def _is_llama_arch(self) -> bool:
        return any("llama" in a for a in self.architectures)

    def _resolve_llama_version(self) -> str:
        """
        Distinguish Llama 2 from Llama 3 by vocabulary size.
          Llama 2:  vocab_size == 32 000
          Llama 3:  vocab_size >= 128 000
        """
        if self.vocab_size >= LLAMA3_VOCAB_SIZE_MIN:
            log.info(
                f"LlamaForCausalLM vocab_size={self.vocab_size} "
                f"(≥ {LLAMA3_VOCAB_SIZE_MIN}): identified as Llama 3"
            )
            return "llama3"
        if self.vocab_size == LLAMA2_VOCAB_SIZE:
            log.info(
                f"LlamaForCausalLM vocab_size={self.vocab_size} "
                f"(== {LLAMA2_VOCAB_SIZE}): identified as Llama 2"
            )
            return "llama2"
        # Unknown Llama variant — default to Llama 3 with a warning.
        log.warning(
            f"LlamaForCausalLM with unexpected vocab_size={self.vocab_size}. "
            "Defaulting to Llama 3 template.  "
            "Review the generated Modelfile and adjust TEMPLATE if needed."
        )
        return "llama3"

    # ------------------------------------------------------------------
    def detect(self) -> Tuple[str, Dict]:
        """Return (template_key, template_dict). Falls back to 'chatml'."""

        # 1. Llama architecture — must be resolved by vocab size.
        if self._is_llama_arch():
            key = self._resolve_llama_version()
            log.info(f"Template detected via architecture + vocab size: {key}")
            return key, OLLAMA_TEMPLATES[key]

        # 2. Other architecture class names.
        for arch in self.architectures:
            for substr, tmpl_key in ARCH_TO_TEMPLATE.items():
                if substr in arch:
                    log.info(
                        f"Template detected via architecture '{arch}': {tmpl_key}"
                    )
                    return tmpl_key, OLLAMA_TEMPLATES[tmpl_key]

        # 3. Repository name substrings.
        for substr, tmpl_key in REPO_TO_TEMPLATE.items():
            if substr in self.repo_id:
                log.info(
                    f"Template detected via repo name '{self.repo_id}': {tmpl_key}"
                )
                return tmpl_key, OLLAMA_TEMPLATES[tmpl_key]

        # 4. Heuristic inspection of the Jinja2 chat_template string.
        chat_tmpl = self.tokenizer_config.get("chat_template", "")
        if isinstance(chat_tmpl, str) and chat_tmpl:
            if "im_start" in chat_tmpl:
                log.info("Template detected via chat_template content: qwen2 (ChatML)")
                return "qwen2", OLLAMA_TEMPLATES["qwen2"]
            if "eot_id" in chat_tmpl:
                log.info("Template detected via chat_template content: llama3")
                return "llama3", OLLAMA_TEMPLATES["llama3"]
            if "start_of_turn" in chat_tmpl:
                log.info("Template detected via chat_template content: gemma")
                return "gemma", OLLAMA_TEMPLATES["gemma"]
            if "[INST]" in chat_tmpl and "<<SYS>>" in chat_tmpl:
                log.info("Template detected via chat_template content: llama2")
                return "llama2", OLLAMA_TEMPLATES["llama2"]
            if "[INST]" in chat_tmpl:
                log.info("Template detected via chat_template content: mistral")
                return "mistral", OLLAMA_TEMPLATES["mistral"]
            if "<|end|>" in chat_tmpl:
                log.info("Template detected via chat_template content: phi3")
                return "phi3", OLLAMA_TEMPLATES["phi3"]
            # Full-width vertical bar check for DeepSeek R1/V3.
            if "\uff5c" in chat_tmpl:
                log.info(
                    "Template detected via chat_template content: deepseek "
                    "(full-width vertical bars)"
                )
                return "deepseek", OLLAMA_TEMPLATES["deepseek"]

        log.warning(
            "Could not detect chat template automatically.  "
            "Falling back to generic ChatML.  "
            "Review the generated Modelfile and adjust TEMPLATE if needed."
        )
        return "chatml", OLLAMA_TEMPLATES["chatml"]

    # ------------------------------------------------------------------
    def context_size(
        self, template_key: str, override: Optional[int] = None
    ) -> int:
        """
        Return the context window size to use.
        Priority: CLI override > model's max_position_embeddings > family default.
        Capped at MAX_AUTO_CTX to prevent absurd defaults.
        """
        if override is not None and override > 0:
            return override
        model_max = self.config.get("max_position_embeddings", 0)
        family_default = FAMILY_CTX.get(template_key, 4096)
        if model_max and model_max > 0:
            return min(model_max, MAX_AUTO_CTX)
        return family_default


# ===========================================================================
# 8. Modelfile Generator
# ===========================================================================

class ModelfileGenerator:
    """Generates an Ollama Modelfile for a converted GGUF model."""

    def __init__(
        self,
        gguf_path: Path,
        template_key: str,
        template_dict: Dict,
        context_size: int,
        system_prompt: Optional[str] = None,
    ) -> None:
        self.gguf_path = gguf_path
        self.template_key = template_key
        self.template_dict = template_dict
        self.context_size = context_size
        self.system_prompt = system_prompt

    # ------------------------------------------------------------------
    def generate(self, output_path: Path) -> Path:
        """
        Write the Modelfile to output_path and return it.

        The template string from OLLAMA_TEMPLATES uses Python \\n escape
        sequences, which are ACTUAL newline characters in the Python string.
        When written to the file they become literal newlines in the Modelfile
        TEMPLATE block — exactly what Ollama requires.
        """
        lines: List[str] = []
        lines.append(f"FROM {self.gguf_path.resolve()}")
        lines.append("")
        lines.append(f'TEMPLATE """{self.template_dict["template"]}"""')
        lines.append("")
        for stop_token in self.template_dict["stop"]:
            lines.append(f'PARAMETER stop "{stop_token}"')
        lines.append(f"PARAMETER num_ctx {self.context_size}")
        for param, value in MODELFILE_DEFAULTS.items():
            lines.append(f"PARAMETER {param} {value}")
        if self.system_prompt:
            lines.append("")
            lines.append(f'SYSTEM """{self.system_prompt}"""')
        lines.append("")
        lines.append(f"# Template family : {self.template_dict['notes']}")
        lines.append("# Generated by    : convert_model.py")

        # End the file with a newline so the last line is a complete line.
        content = "\n".join(lines) + "\n"
        output_path.write_text(content, encoding="utf-8")
        log.info(f"Modelfile written to {output_path}")
        return output_path


# ===========================================================================
# 9. Ollama Manager
# ===========================================================================

class OllamaManager:
    """Registers and tests models with the Ollama server."""

    def __init__(self, env: Environment) -> None:
        self.env = env

    # ------------------------------------------------------------------
    def create_model(self, model_name: str, modelfile_path: Path) -> bool:
        """Register a model with Ollama. Returns True on success."""
        log.info(f"Registering model '{model_name}' with Ollama...")
        cmd = [
            "ollama", "create", model_name,
            "-f", str(modelfile_path.resolve()),
        ]
        try:
            subprocess.run(cmd, check=True)
            log.info(f"Model '{model_name}' registered successfully.")
            return True
        except (FileNotFoundError, subprocess.CalledProcessError) as e:
            # FileNotFoundError means the ollama executable is not on PATH.
            log.error(f"ollama create failed: {e}")
            return False

    def test_model(self, model_name: str, retries: int = 2) -> Optional[str]:
        """
        Run a quick inference test via the Ollama REST API.
        Retries up to `retries` times to allow for model load time.
        Returns the response text, or None on failure.
        """
        import urllib.request
        import urllib.error

        payload = json.dumps({
            "model": model_name,
            "messages": [
                {
                    "role": "user",
                    "content": (
                        "Reply with exactly one sentence confirming "
                        "you are working correctly."
                    ),
                }
            ],
            "stream": False,
        }).encode("utf-8")

        for attempt in range(retries + 1):
            req = urllib.request.Request(
                "http://localhost:11434/api/chat",
                data=payload,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            try:
                with urllib.request.urlopen(req, timeout=120) as resp:
                    data = json.loads(resp.read().decode("utf-8"))
                    text = data.get("message", {}).get("content", "")
                    log.info(f"Ollama test response: {text[:200]}")
                    return text
            # URLError, HTTPError, and socket timeouts raised while waiting
            # for the response all derive from OSError.
            except OSError as e:
                if attempt < retries:
                    log.warning(
                        f"Ollama API test attempt {attempt + 1} failed: {e}. "
                        "Retrying in 5 s..."
                    )
                    time.sleep(5)
                else:
                    log.error(
                        f"Ollama API test failed after {retries + 1} attempts: {e}"
                    )
        return None


# ===========================================================================
# 10. MLX Converter
# ===========================================================================

class MLXConverter:
    """Converts a HuggingFace model to MLX format (Apple Silicon only)."""

    def __init__(
        self,
        model_dir: Path,
        output_dir: Path,
        model_name: str,
    ) -> None:
        self.model_dir = model_dir
        self.output_dir = output_dir
        self.model_name = model_name

    # ------------------------------------------------------------------
    def convert(self, q_bits: int = 4) -> Path:
        """
        Convert the model to MLX format.
        q_bits=0  → float16 (no quantization)
        q_bits=4  → 4-bit quantization
        q_bits=8  → 8-bit quantization
        Returns the output directory path.
        Skips if the output directory already contains config.json.
        """
        suffix = f"-MLX-{q_bits}bit" if q_bits > 0 else "-MLX-f16"
        out_path = self.output_dir / f"{self.model_name}{suffix}"

        if (out_path / "config.json").exists():
            log.info(f"MLX model already exists at {out_path}; skipping.")
            return out_path

        log.info(
            f"Converting to MLX format "
            f"({'float16' if q_bits == 0 else f'{q_bits}-bit quantized'})..."
        )
        cmd = [
            sys.executable, "-m", "mlx_lm.convert",
            "--hf-path", str(self.model_dir),
            "--mlx-path", str(out_path),
        ]
        # mlx_lm.convert only quantizes when -q/--quantize is passed;
        # --q-bits merely selects the bit width.
        if q_bits > 0:
            cmd += ["-q", "--q-bits", str(q_bits)]

        subprocess.run(cmd, check=True)

        if not (out_path / "config.json").exists():
            raise RuntimeError(
                f"MLX conversion appeared to succeed but "
                f"{out_path}/config.json was not created."
            )
        log.info(f"MLX model created at {out_path}")
        return out_path

    def test(self, mlx_model_path: Path) -> Optional[str]:
        """Run a quick inference test with the MLX model."""
        log.info("Running MLX inference test...")
        cmd = [
            sys.executable, "-m", "mlx_lm.generate",
            "--model", str(mlx_model_path),
            "--prompt", "Hello, please confirm you are working with one sentence.",
            "--max-tokens", "60",
            "--temp", "0.0",
        ]
        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=120,
                check=True,
            )
            output = result.stdout.strip()
            log.info(f"MLX test output: {output[:300]}")
            return output
        except subprocess.CalledProcessError as e:
            log.error(f"MLX inference test failed: {e.stderr[:500]}")
            return None
        except subprocess.TimeoutExpired:
            log.error("MLX inference test timed out after 120 s.")
            return None


# ===========================================================================
# 11. LM Studio Setup
# ===========================================================================

class LMStudioSetup:
    """Copies or symlinks GGUF files into the LM Studio model directory."""

    def __init__(self, env: Environment, repo_id: str) -> None:
        self.env = env
        parts = repo_id.split("/", 1)
        self.publisher = parts[0] if len(parts) == 2 else "local"
        self.model_slug = parts[1] if len(parts) == 2 else repo_id

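    def lmstudio_models_dir(self) -> Path:
        """
        Return the LM Studio models directory.  Recent LM Studio releases
        use ~/.lmstudio/models; older ones used ~/.cache/lm-studio/models.
        The location can be changed in the app, so verify it there if the
        installed models do not show up.
        """
        new_home = Path.home() / ".lmstudio"
        if new_home.exists():
            return new_home / "models"
        return Path.home() / ".cache" / "lm-studio" / "models"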

    def target_dir(self) -> Path:
        return (
            self.lmstudio_models_dir()
            / self.publisher
            / f"{self.model_slug}-GGUF"
        )

    # ------------------------------------------------------------------
    def install(self, gguf_paths: List[Path]) -> List[Path]:
        """
        Install GGUF files to the LM Studio directory.
        On Linux/macOS: creates symlinks (avoids duplicating large files).
          Falls back to copying if symlinks fail (e.g. cross-device).
        On Windows: always copies (symlinks require elevated privileges).
        Returns list of destination paths.
        """
        target = self.target_dir()
        target.mkdir(parents=True, exist_ok=True)
        installed: List[Path] = []

        for src in gguf_paths:
            dst = target / src.name
            if dst.exists() or dst.is_symlink():
                log.info(f"LM Studio: {dst.name} already present; skipping.")
                installed.append(dst)
                continue

            if self.env.is_windows:
                log.info(
                    f"LM Studio: copying {src.name} → {dst} "
                    f"({src.stat().st_size / 1e9:.2f} GB)"
                )
                shutil.copy2(src, dst)
            else:
                try:
                    os.symlink(src.resolve(), dst)
                    log.info(f"LM Studio: symlinked {src.name} → {dst}")
                except OSError:
                    log.info(
                        f"LM Studio: symlink failed; copying {src.name} → {dst} "
                        f"({src.stat().st_size / 1e9:.2f} GB)"
                    )
                    shutil.copy2(src, dst)
            installed.append(dst)

        log.info(f"LM Studio models directory: {target}")
        return installed


# ===========================================================================
# 12. Validator
# ===========================================================================

class Validator:
    """Runs validation inference tests on converted models."""

    def __init__(self, env: Environment, llama_mgr: LlamaCppManager) -> None:
        self.env = env
        self.llama_mgr = llama_mgr

    # ------------------------------------------------------------------
    def test_gguf_with_llamacpp(
        self,
        gguf_path: Path,
        prompt: str = "Hello, please respond with one sentence to confirm you work.",
        max_tokens: int = 80,
    ) -> Optional[str]:
        """
        Run llama-cli on the GGUF file and return the generated text.
        Returns None if llama-cli is not available or inference fails.
        """
        cli = self.llama_mgr.cli_bin()
        if cli is None:
            log.warning("llama-cli not found; skipping llama.cpp validation test.")
            return None

        log.info(f"Validating GGUF with llama-cli: {gguf_path.name}")
        cmd = [
            str(cli),
            "-m", str(gguf_path),
            "-p", prompt,
            "-n", str(max_tokens),
            "--log-disable",
            "-c", "512",
        ]
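        try:
            # stdin is closed because recent llama-cli builds may start an
            # interactive chat for models with a chat template; EOF keeps
            # the run from blocking until the timeout.
            result = subprocess.run(
                cmd,
                stdin=subprocess.DEVNULL,
                capture_output=True,
                text=True,
                timeout=180,
                check=True,
            )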
            output = result.stdout.strip()
            log.info(f"llama-cli output: {output[:300]}")
            return output
        except subprocess.CalledProcessError as e:
            log.error(f"llama-cli validation failed: {e.stderr[:500]}")
            return None
        except subprocess.TimeoutExpired:
            log.error("llama-cli validation timed out after 180 s.")
            return None


# ===========================================================================
# 13. Report Generator
# ===========================================================================

class ReportGenerator:
    """Produces a plain-text summary report of the conversion run."""

    def __init__(self, repo_id: str, output_dir: Path) -> None:
        self.repo_id = repo_id
        self.output_dir = output_dir
        self.entries: List[str] = []

    def add(self, line: str) -> None:
        self.entries.append(line)

    def write(self) -> Path:
        report_path = self.output_dir / "conversion_report.txt"
        header = [
            "=" * 72,
            "  MODEL CONVERSION REPORT",
            f"  Model     : {self.repo_id}",
            f"  Generated : {time.strftime('%Y-%m-%d %H:%M:%S')}",
            "=" * 72,
            "",
        ]
        content = "\n".join(header + self.entries + [""])
        report_path.write_text(content, encoding="utf-8")
        log.info(f"Report written to {report_path}")
        return report_path

    def print_summary(self) -> None:
        print()
        print("=" * 72)
        print("  CONVERSION SUMMARY")
        print("=" * 72)
        for line in self.entries:
            print(line)
        print("=" * 72)


# ===========================================================================
# 14. Argument Parser
# ===========================================================================

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description=(
            "Convert a HuggingFace model for local inference with "
            "llama.cpp, Ollama, LM Studio, and/or Apple MLX."
        ),
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=textwrap.dedent("""
        Examples
        --------
          # Basic — Q4_K_M, all steps:
          python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3

          # Multiple quant levels:
          python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 \\
              --quant Q4_K_M Q8_0

          # Gated model, custom Ollama name, custom context size:
          python convert_model.py \\
              --model meta-llama/Meta-Llama-3.1-8B-Instruct \\
              --hf-token hf_YourTokenHere \\
              --ollama-name llama31-8b \\
              --quant Q4_K_M Q5_K_M Q8_0 \\
              --context-size 32768

          # Apple Silicon — also produce MLX model:
          python convert_model.py --model google/gemma-2-9b-it \\
              --quant Q4_K_M --mlx --mlx-bits 4

          # GGUF only — skip Ollama and LM Studio:
          python convert_model.py --model Qwen/Qwen2-7B-Instruct \\
              --skip-ollama --skip-lmstudio

          # Large model — split GGUF into 4 GB shards:
          python convert_model.py --model meta-llama/Llama-2-70b-chat-hf \\
              --quant Q4_K_M --split-max-size 4G

          # Keep the F16 base GGUF after quantization:
          python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 \\
              --keep-f16
        """),
    )

    parser.add_argument(
        "--model", required=True,
        help="HuggingFace repo ID, e.g. mistralai/Mistral-7B-Instruct-v0.3",
    )
    parser.add_argument(
        "--quant", nargs="+", default=DEFAULT_QUANT_LEVELS,
        choices=ALL_QUANT_LEVELS,
        metavar="QUANT",
        help=(
            f"Quantization level(s) to produce. "
            f"Choices: {ALL_QUANT_LEVELS}. "
            f"Default: {DEFAULT_QUANT_LEVELS}"
        ),
    )
    parser.add_argument(
        "--output-dir", default=None,
        help="Directory for all output files. Default: ./<model-name>-converted/",
    )
    parser.add_argument(
        "--llama-dir", default="./llama.cpp",
        help="Where to clone/find llama.cpp. Default: ./llama.cpp",
    )
    parser.add_argument(
        "--hf-token", default=None,
        help="HuggingFace access token for gated models.",
    )
    parser.add_argument(
        "--ollama-name", default=None,
        help=(
            "Name to register in Ollama. "
            "Default: derived from repo ID (lowercased, slashes → hyphens)."
        ),
    )
    parser.add_argument(
        "--ollama-quant", default=None,
        choices=ALL_QUANT_LEVELS,
        help=(
            "Which quantization level to register with Ollama. "
            "Default: first --quant value."
        ),
    )
    parser.add_argument(
        "--context-size", type=int, default=None,
        help=(
            "Override the auto-detected context window size (tokens). "
            "Default: read from model config, capped at 131072."
        ),
    )
    parser.add_argument(
        "--split-max-size", default=None,
        metavar="SIZE",
        help=(
            "Split the F16 GGUF into shards of at most SIZE "
            "(e.g. 4G or 4096M).  Useful for very large models (70B+). "
            "Default: no splitting."
        ),
    )
    parser.add_argument(
        "--skip-ollama", action="store_true",
        help="Skip Ollama model registration.",
    )
    parser.add_argument(
        "--skip-lmstudio", action="store_true",
        help="Skip LM Studio directory setup.",
    )
    parser.add_argument(
        "--skip-validate", action="store_true",
        help="Skip validation inference tests.",
    )
    parser.add_argument(
        "--keep-f16", action="store_true",
        help=(
            "Keep the F16 base GGUF after quantization. "
            "By default it is deleted to save disk space."
        ),
    )
    parser.add_argument(
        "--mlx", action="store_true",
        help=(
            "Also convert to MLX format (Apple Silicon only; "
            "ignored on other platforms)."
        ),
    )
    parser.add_argument(
        "--mlx-bits", type=int, default=4, choices=[0, 4, 8],
        help=(
            "MLX quantization bits: 0=float16 (no quantization), "
            "4=4-bit, 8=8-bit. Default: 4"
        ),
    )
    parser.add_argument(
        "--system-prompt", default=None,
        help="Optional system prompt to embed in the Ollama Modelfile.",
    )
    parser.add_argument(
        "--no-build", action="store_true",
        help="Skip llama.cpp compilation (use if already compiled).",
    )
    parser.add_argument(
        "--verbose", action="store_true",
        help="Enable verbose/debug logging.",
    )

    return parser.parse_args()


def safe_model_name(repo_id: str) -> str:
    """Convert a HuggingFace repo ID to a filesystem-safe name."""
    return re.sub(r"[^a-zA-Z0-9._-]", "-", repo_id)


# ===========================================================================
# 15. Main Orchestrator
# ===========================================================================

def main() -> int:  # noqa: C901  (complexity is acceptable for an orchestrator)
    args = parse_args()

    if args.verbose:
        log.setLevel(logging.DEBUG)

    # ------------------------------------------------------------------
    # Step 0: Setup
    # ------------------------------------------------------------------
    log.info("=" * 60)
    log.info("  HuggingFace → Local Inference Converter")
    log.info("=" * 60)

    env = Environment()
    log.info(f"Environment: {env.describe()}")

    model_name = safe_model_name(args.model)
    output_dir = (
        Path(args.output_dir) if args.output_dir
        else Path(f"./{model_name}-converted")
    )
    output_dir.mkdir(parents=True, exist_ok=True)
    log.info(f"Output directory: {output_dir.resolve()}")

    llama_dir = Path(args.llama_dir)
    model_dir = output_dir / "hf_model"

    report = ReportGenerator(args.model, output_dir)
    report.add(f"Model          : {args.model}")
    report.add(f"Environment    : {env.describe()}")
    report.add(f"Output dir     : {output_dir.resolve()}")
    report.add(f"Quant levels   : {args.quant}")
    report.add("")

    # ------------------------------------------------------------------
    # Step 1: Dependency checks
    # ------------------------------------------------------------------
    log.info("--- Step 1: Checking dependencies ---")
    dep_mgr = DependencyManager(env)
    try:
        features = dep_mgr.run_all()
    except RuntimeError as e:
        log.error(f"Dependency check failed: {e}")
        return 1

    ollama_available = features["ollama"] and not args.skip_ollama
    mlx_available = features["mlx"] and args.mlx
    report.add(f"Ollama available : {ollama_available}")
    report.add(f"MLX available    : {mlx_available}")
    report.add("")

    # ------------------------------------------------------------------
    # Step 2: Clone / update and build llama.cpp
    # ------------------------------------------------------------------
    log.info("--- Step 2: Preparing llama.cpp ---")
    llama_mgr = LlamaCppManager(env, llama_dir)
    try:
        llama_mgr.ensure_repo()
        llama_mgr.install_python_deps()
        if not args.no_build:
            llama_mgr.build()
    except (RuntimeError, subprocess.CalledProcessError, FileNotFoundError) as e:
        log.error(f"llama.cpp setup failed: {e}")
        return 1

    # ------------------------------------------------------------------
    # Step 3: Download model
    # ------------------------------------------------------------------
    log.info("--- Step 3: Downloading model ---")
    downloader = ModelDownloader(args.model, model_dir, args.hf_token)
    try:
        downloader.download()
        config = downloader.load_config()
        tokenizer_config = downloader.load_tokenizer_config()
    except Exception as e:
        log.error(f"Model download or config loading failed: {e}")
        return 1

    architectures = config.get("architectures", ["unknown"])
    downloader.validate_architecture(config)
    param_count = downloader.estimate_param_count(config)
    param_str = f"~{param_count / 1e9:.1f}B" if param_count else "unknown"
    log.info(f"Architecture : {architectures}")
    log.info(f"Parameters   : {param_str}")
    report.add(f"Architecture   : {architectures}")
    report.add(f"Parameters     : {param_str}")
    report.add("")

    # ------------------------------------------------------------------
    # Step 4: Detect chat template
    # ------------------------------------------------------------------
    log.info("--- Step 4: Detecting chat template ---")
    detector = TemplateDetector(args.model, config, tokenizer_config)
    template_key, template_dict = detector.detect()
    ctx_size = detector.context_size(template_key, override=args.context_size)
    log.info(f"Template family : {template_key} ({template_dict['notes']})")
    log.info(f"Context size    : {ctx_size} tokens")
    report.add(f"Template family : {template_key}")
    report.add(f"Template notes  : {template_dict['notes']}")
    report.add(f"Context size    : {ctx_size} tokens")
    report.add("")

    # ------------------------------------------------------------------
    # Step 5: Convert to F16 GGUF
    # ------------------------------------------------------------------
    log.info("--- Step 5: Converting to GGUF (F16) ---")
    converter = GGUFConverter(
        llama_mgr, model_dir, output_dir, model_name,
        split_max_size=args.split_max_size,
    )
    try:
        f16_gguf = converter.convert_f16()
    except (RuntimeError, subprocess.CalledProcessError) as e:
        log.error(f"GGUF conversion failed: {e}")
        return 1
    report.add(
        f"F16 GGUF       : {f16_gguf.name} "
        f"({f16_gguf.stat().st_size / 1e9:.2f} GB)"
    )

    # ------------------------------------------------------------------
    # Step 6: Quantize
    # ------------------------------------------------------------------
    log.info("--- Step 6: Quantizing ---")
    quantizer = Quantizer(llama_mgr, f16_gguf, output_dir, model_name)
    quant_paths: Dict[str, Path] = quantizer.quantize_all(args.quant)

    if not quant_paths:
        log.error("All quantization levels failed.  Cannot continue.")
        return 1

    report.add("Quantized GGUFs :")
    for qt, p in quant_paths.items():
        report.add(f"  {qt}: {p.name} ({p.stat().st_size / 1e9:.2f} GB)")
    report.add("")

    # Optionally remove the large F16 base to save disk space.
    if not args.keep_f16 and f16_gguf.exists():
        log.info(f"Removing F16 base GGUF to save space: {f16_gguf.name}")
        f16_gguf.unlink()
        report.add("F16 base GGUF deleted (use --keep-f16 to retain it).")
        report.add("")

    # ------------------------------------------------------------------
    # Step 7: Validate GGUF with llama-cli
    # ------------------------------------------------------------------
    if not args.skip_validate:
        log.info("--- Step 7: Validating GGUF with llama-cli ---")
        validator = Validator(env, llama_mgr)
        # Use the first successfully quantized level for validation.
        first_qt, first_gguf = next(iter(quant_paths.items()))
        if first_gguf.exists():
            val_result = validator.test_gguf_with_llamacpp(first_gguf)
            if val_result:
                report.add(f"llama-cli validation ({first_qt}): PASSED")
                report.add(f"  Response: {val_result[:200]}")
            else:
                report.add(f"llama-cli validation ({first_qt}): FAILED or skipped")
            report.add("")

    # ------------------------------------------------------------------
    # Step 8: Generate Modelfile and register with Ollama
    # ------------------------------------------------------------------
    ollama_name = (
        args.ollama_name
        or re.sub(r"[^a-z0-9-]", "-", args.model.lower())
    )

    # Determine which quant level to use for Ollama.
    # Prefer the explicitly requested level; fall back to first available.
    ollama_quant = args.ollama_quant or args.quant[0]
    if ollama_quant not in quant_paths:
        log.warning(
            f"Requested Ollama quant '{ollama_quant}' was not produced.  "
            f"Falling back to first available: {next(iter(quant_paths))}"
        )
        ollama_quant = next(iter(quant_paths))
    ollama_gguf = quant_paths[ollama_quant]

    log.info("--- Step 8: Generating Modelfile ---")
    mf_gen = ModelfileGenerator(
        gguf_path=ollama_gguf,
        template_key=template_key,
        template_dict=template_dict,
        context_size=ctx_size,
        system_prompt=args.system_prompt,
    )
    modelfile_path = output_dir / "Modelfile"
    mf_gen.generate(modelfile_path)
    report.add(f"Modelfile      : {modelfile_path}")

    if ollama_available:
        log.info("Registering model with Ollama...")
        ollama_mgr = OllamaManager(env)
        success = ollama_mgr.create_model(ollama_name, modelfile_path)
        if success:
            report.add(f"Ollama model   : {ollama_name} (registered)")
            report.add(f"Ollama GGUF    : {ollama_gguf.name} ({ollama_quant})")
            report.add("")

            if not args.skip_validate:
                log.info("Running Ollama API inference test...")
                test_response = ollama_mgr.test_model(ollama_name)
                if test_response:
                    report.add("Ollama API test: PASSED")
                    report.add(f"  Response: {test_response[:200]}")
                else:
                    report.add("Ollama API test: FAILED")
                report.add("")
        else:
            report.add("Ollama registration: FAILED (see logs above)")
            report.add("")
    else:
        report.add(
            "Ollama not available.  To register manually once "
            "'ollama serve' is running:"
        )
        report.add(f"  ollama create {ollama_name} -f {modelfile_path}")
        report.add("")

    # ------------------------------------------------------------------
    # Step 9: LM Studio setup
    # ------------------------------------------------------------------
    lmstudio_target: Optional[Path] = None
    lmstudio_installed: List[Path] = []

    if not args.skip_lmstudio:
        log.info("--- Step 9: Setting up LM Studio directory ---")
        lms = LMStudioSetup(env, args.model)
        lmstudio_target = lms.target_dir()
        try:
            lmstudio_installed = lms.install(list(quant_paths.values()))
            report.add("LM Studio :")
            report.add(f"  Directory : {lmstudio_target}")
            for p in lmstudio_installed:
                report.add(f"  File      : {p.name}")
            report.add("")
        except Exception as e:
            log.warning(f"LM Studio setup failed (non-fatal): {e}")
            report.add(f"LM Studio setup: FAILED ({e})")
            report.add("")

    # ------------------------------------------------------------------
    # Step 10: MLX conversion (Apple Silicon only)
    # ------------------------------------------------------------------
    mlx_model_path: Optional[Path] = None

    if mlx_available:
        log.info("--- Step 10: Converting to MLX format ---")
        mlx_conv = MLXConverter(model_dir, output_dir, model_name)
        try:
            mlx_model_path = mlx_conv.convert(q_bits=args.mlx_bits)
            mlx_quant_desc = (
                "float16" if args.mlx_bits == 0 else f"{args.mlx_bits}-bit"
            )
            report.add(f"MLX model      : {mlx_model_path}")
            report.add(f"  Quantization : {mlx_quant_desc}")
            report.add("")

            if not args.skip_validate:
                log.info("Running MLX inference test...")
                mlx_result = mlx_conv.test(mlx_model_path)
                if mlx_result:
                    report.add("MLX inference test: PASSED")
                    report.add(f"  Output: {mlx_result[:200]}")
                else:
                    report.add("MLX inference test: FAILED")
                report.add("")
        except (RuntimeError, subprocess.CalledProcessError) as e:
            log.error(f"MLX conversion failed: {e}")
            report.add(f"MLX conversion: FAILED ({e})")
            report.add("")

    # ------------------------------------------------------------------
    # Step 11: Usage instructions
    # ------------------------------------------------------------------
    report.add("=" * 68)
    report.add("USAGE INSTRUCTIONS")
    report.add("=" * 68)
    report.add("")

    cli_bin_path = llama_mgr.cli_bin()
    cli_path_str = (
        str(cli_bin_path) if cli_bin_path
        else "llama.cpp/build/bin/llama-cli"
    )
    report.add("llama.cpp direct inference:")
    for qt, p in quant_paths.items():
        report.add(
            f'  {cli_path_str} -m "{p}" '
            f'-p "Your prompt here" -n 200 --log-disable'
        )
    report.add("")

    if not args.skip_ollama:
        report.add("Ollama (interactive):")
        report.add(f"  ollama run {ollama_name}")
        report.add("")
        report.add("Ollama native API:")
        report.add(
            f'  curl http://localhost:11434/api/chat \\\n'
            f'    -H "Content-Type: application/json" \\\n'
            f'    -d \'{{"model":"{ollama_name}",'
            f'"messages":[{{"role":"user","content":"Hello"}}],'
            f'"stream":false}}\''
        )
        report.add("")
        report.add("Ollama OpenAI-compatible API:")
        report.add(
            f'  curl http://localhost:11434/v1/chat/completions \\\n'
            f'    -H "Content-Type: application/json" \\\n'
            f'    -H "Authorization: Bearer ollama" \\\n'
            f'    -d \'{{"model":"{ollama_name}",'
            f'"messages":[{{"role":"user","content":"Hello"}}]}}\''
        )
        report.add("")

    if not args.skip_lmstudio and lmstudio_target:
        report.add("LM Studio:")
        report.add(f"  Open LM Studio → My Models → {lmstudio_target}")
        report.add("")

    if mlx_model_path is not None:
        report.add("MLX (Apple Silicon):")
        report.add(
            f'  python -m mlx_lm.generate --model "{mlx_model_path}" '
            f'--prompt "Your prompt" --max-tokens 200'
        )
        report.add("")

    # ------------------------------------------------------------------
    # Final: Write report and print summary
    # ------------------------------------------------------------------
    report_path = report.write()
    report.print_summary()
    log.info(f"Full report saved to: {report_path}")
    log.info("Conversion pipeline complete.")
    return 0


# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        print("\n\nInterrupted by user.  Exiting.", file=sys.stderr)
        sys.exit(130)

FILE 3 OF 3 — README_AUTOMATION.txt

========================================================================
  HUGGINGFACE → LOCAL INFERENCE CONVERTER
  Installation, Deployment, and Usage Guide
========================================================================

Files in this package
---------------------
  convert_model.py         Main automation script
  requirements_convert.txt Python package dependencies
  README_AUTOMATION.txt    This file


------------------------------------------------------------------------
WHAT THE SCRIPT DOES
------------------------------------------------------------------------

convert_model.py automates the complete pipeline for converting any
HuggingFace causal language model to run locally on:

  - llama.cpp  (direct C++ inference engine)
  - Ollama     (Docker-like model server with native and OpenAI-compatible APIs)
  - LM Studio  (graphical desktop application)
  - Apple MLX  (Apple Silicon native inference, optional)

Pipeline steps performed automatically:
  1.  Detect OS, CPU architecture, and GPU hardware
  2.  Verify git, cmake, and required Python packages
  3.  Clone llama.cpp (shallow) or update an existing clone
  4.  Compile llama.cpp with appropriate GPU flags
  5.  Download the model from HuggingFace Hub
  6.  Validate the model architecture
  7.  Detect the correct Ollama chat template
  8.  Convert to float16 GGUF base file
  9.  Quantize to one or more levels (Q2_K through Q8_0)
  10. Run a llama-cli validation inference test
  11. Generate an Ollama Modelfile with the correct template
  12. Register the model with Ollama and run an API test
  13. Set up the LM Studio model directory
  14. Convert to MLX format (Apple Silicon, when requested)
  15. Write a plain-text summary report


------------------------------------------------------------------------
SYSTEM REQUIREMENTS
------------------------------------------------------------------------

Supported operating systems:
  Linux   (x86_64; tested on Ubuntu 22.04 and Debian 12)
  macOS   (Intel and Apple Silicon; macOS 13 Ventura or later)
  Windows (10 and 11, x86_64)

Minimum hardware:
  8 GB RAM   for 7B models at Q4_K_M quantization
  16 GB RAM  recommended for 13B models
  30+ GB RAM for 30B+ models

Disk space during conversion:
  You need at least 2× the model's float16 size free, plus room for
  each quantized output, all at the same time.  The F16 GGUF is only
  deleted after quantization has finished.
  Example for a 7B model:
    ~14 GB  downloaded safetensors
    ~14 GB  F16 GGUF (deleted after quantization by default)
    ~ 4 GB  Q4_K_M GGUF (kept)
  Peak usage: ~32 GB.  After cleanup: ~18 GB.
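
The disk-space arithmetic can be sketched as a rough estimator.  This is
an illustration only: the function name is hypothetical and not part of
convert_model.py, and the bytes-per-parameter figures are assumptions
(2 bytes for float16, about 0.6 bytes for Q4_K_M).  It counts the
quantized output alongside the two float16 copies, because all three
exist on disk until the F16 GGUF is deleted.

```python
# Rough disk-space estimate for one conversion run (illustrative sketch).
# Assumptions: float16 uses 2 bytes per parameter; Q4_K_M averages
# roughly 0.6 bytes per parameter.  Real file sizes differ slightly.
from typing import Dict


def estimate_disk_gb(params_billion: float,
                     quant_bytes_per_param: float = 0.6) -> Dict[str, float]:
    """Return approximate sizes in GB for a model with the given parameter count."""
    f16_gb = params_billion * 2.0                  # safetensors and F16 GGUF are each this size
    quant_gb = params_billion * quant_bytes_per_param
    peak_gb = f16_gb + f16_gb + quant_gb           # download + F16 GGUF + quantized output
    after_gb = f16_gb + quant_gb                   # F16 GGUF deleted, download kept
    return {
        "f16": f16_gb,
        "quant": quant_gb,
        "peak": peak_gb,
        "after_cleanup": after_gb,
    }
```

Calling estimate_disk_gb(7.0) gives the figures for a 7B model.  Add one
more quant_gb term per additional --quant level you request.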

Required software (must be installed before running the script):
  Python 3.9 or later   https://python.org
  git                   https://git-scm.com
  cmake 3.12 or later   https://cmake.org

Optional software (the script checks for these and skips gracefully):
  Ollama                https://ollama.com
  LM Studio             https://lmstudio.ai
  NVIDIA CUDA toolkit   for GPU acceleration on NVIDIA hardware
  AMD ROCm              for GPU acceleration on AMD hardware


------------------------------------------------------------------------
INSTALLATION
------------------------------------------------------------------------

Step 1: Install Python (if not already installed).

  Linux (Debian/Ubuntu):
    sudo apt update && sudo apt install python3 python3-pip python3-venv

  macOS (using Homebrew):
    brew install python

  Windows:
    Download and run the installer from https://python.org/downloads/
    During installation, check "Add Python to PATH".

Step 2: Install git and cmake.

  Linux (Debian/Ubuntu):
    sudo apt install git cmake build-essential

  macOS:
    xcode-select --install
    brew install cmake

  Windows:
    Install git from https://git-scm.com/download/win
    Install cmake from https://cmake.org/download/
    Install Visual Studio Build Tools from:
      https://visualstudio.microsoft.com/visual-cpp-build-tools/
    Select the "Desktop development with C++" workload.

Step 3: Create and activate a Python virtual environment.

  Linux / macOS:
    python3 -m venv llm-convert
    source llm-convert/bin/activate

  Windows (Command Prompt):
    python -m venv llm-convert
    llm-convert\Scripts\activate.bat

  Windows (PowerShell):
    python -m venv llm-convert
    llm-convert\Scripts\Activate.ps1

  If PowerShell blocks the activation script:
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Step 4: Install Python dependencies.

  pip install -r requirements_convert.txt

  Then install PyTorch separately (choose the right command):

    CPU only (works everywhere; sufficient for conversion):
      pip install torch --index-url https://download.pytorch.org/whl/cpu

    NVIDIA GPU (example for CUDA 12.1; visit pytorch.org for your version):
      pip install torch --index-url https://download.pytorch.org/whl/cu121

    Apple Silicon (standard install; includes MPS support):
      pip install torch

  For Apple Silicon MLX support (optional; Apple Silicon ONLY):
    pip install mlx-lm
    Note: this package will fail to install on non-Apple hardware.

Step 5: Install Ollama (optional but recommended).

  Linux:
    curl -fsSL https://ollama.com/install.sh | sh

  macOS:
    Download the app from https://ollama.com, or with Homebrew:
    brew install ollama

  Windows:
    Download and run the installer from https://ollama.com

  Start the Ollama server in a separate terminal (leave it running):
    ollama serve

  The script checks whether the server is reachable at startup and
  skips Ollama steps gracefully if it is not.

Step 6: Install LM Studio (optional; graphical interface).

  Download from https://lmstudio.ai and run the installer.
  No additional configuration is needed; the script copies or symlinks
  GGUF files into the correct directory automatically.


------------------------------------------------------------------------
HUGGINGFACE AUTHENTICATION (for gated models)
------------------------------------------------------------------------

Some models (Llama 3, Gemma, etc.) require you to accept a license
agreement on HuggingFace before downloading.

  1. Create a free account at https://huggingface.co
  2. Visit the model's page and click "Agree and access repository"
  3. Generate an access token at https://huggingface.co/settings/tokens
     (select "Read" permission)
  4. Pass the token to the script: --hf-token hf_YourTokenHere
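
If you prefer not to type the token into your shell history, a small
wrapper can read it from an environment variable and build the command
for you.  This is a sketch under assumptions: the helper name is
hypothetical, and HF_TOKEN is used here simply as the variable name to
read from.  The token is still passed to the script as an argument, so
it remains visible in the process list while the script runs.

```python
# Build the convert_model.py command line, taking the token from the
# environment instead of typing it on the command line (illustrative sketch).
import os
import sys
from typing import List, Mapping, Optional


def build_convert_command(repo_id: str,
                          env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return argv for convert_model.py; add --hf-token only if a token is set."""
    env = os.environ if env is None else env
    cmd = [sys.executable, "convert_model.py", "--model", repo_id]
    token = env.get("HF_TOKEN", "").strip()
    if token:
        cmd += ["--hf-token", token]
    return cmd
```

To run it: subprocess.run(build_convert_command("meta-llama/Meta-Llama-3.1-8B-Instruct"), check=True)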


------------------------------------------------------------------------
USAGE
------------------------------------------------------------------------

Basic (Q4_K_M only, all steps):

  python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3

Multiple quantization levels:

  python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 \
      --quant Q4_K_M Q8_0

Gated model with token, custom Ollama name, custom context size:

  python convert_model.py \
      --model meta-llama/Meta-Llama-3.1-8B-Instruct \
      --hf-token hf_YourTokenHere \
      --ollama-name llama31-8b \
      --quant Q4_K_M Q5_K_M Q8_0 \
      --context-size 32768

Apple Silicon — also produce MLX model:

  python convert_model.py --model google/gemma-2-9b-it \
      --quant Q4_K_M --mlx --mlx-bits 4

GGUF only — skip Ollama and LM Studio:

  python convert_model.py --model Qwen/Qwen2-7B-Instruct \
      --skip-ollama --skip-lmstudio

Large model — split GGUF into 4 GB shards:

  python convert_model.py --model meta-llama/Llama-2-70b-chat-hf \
      --quant Q4_K_M --split-max-size 4G

Keep the F16 base GGUF (deleted by default to save disk space):

  python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 \
      --keep-f16

Use a pre-existing llama.cpp directory:

  python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 \
      --llama-dir /path/to/existing/llama.cpp

Skip llama.cpp compilation (if already compiled):

  python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 \
      --no-build

Add a system prompt to the Ollama Modelfile:

  python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 \
      --system-prompt "You are a helpful coding assistant."

Enable verbose debug logging:

  python convert_model.py --model mistralai/Mistral-7B-Instruct-v0.3 \
      --verbose


------------------------------------------------------------------------
COMMAND-LINE REFERENCE
------------------------------------------------------------------------

--model REPO_ID          (required) HuggingFace repo ID
--quant LEVEL [LEVEL]    Quantization levels to produce.
                         Choices: Q2_K Q3_K_M Q4_K_M Q5_K_M Q6_K Q8_0
                         Default: Q4_K_M
--output-dir DIR         Output directory.
                         Default: ./<model-name>-converted/
--llama-dir DIR          llama.cpp directory. Default: ./llama.cpp
--hf-token TOKEN         HuggingFace access token for gated models
--ollama-name NAME       Ollama model name.
                         Default: derived from repo ID
--ollama-quant LEVEL     Which quant level to register with Ollama.
                         Default: first --quant value
--context-size N         Override auto-detected context window (tokens).
                         Default: read from model config, cap 131072
--split-max-size SIZE    Split F16 GGUF into shards (e.g. 4G, 4096M).
                         Useful for 70B+ models. Default: no splitting
--skip-ollama            Skip Ollama registration
--skip-lmstudio          Skip LM Studio directory setup
--skip-validate          Skip validation inference tests
--keep-f16               Keep F16 base GGUF (deleted by default)
--mlx                    Convert to MLX format (Apple Silicon only)
--mlx-bits {0,4,8}       MLX quantization: 0=float16, 4=4-bit, 8=8-bit
                         Default: 4
--system-prompt TEXT     System prompt for Ollama Modelfile
--no-build               Skip llama.cpp compilation
--verbose                Enable debug logging


------------------------------------------------------------------------
OUTPUT FILES
------------------------------------------------------------------------

After a successful run, the output directory contains:

  hf_model/                    Downloaded HuggingFace model files
  <model>-F16.gguf             Float16 base GGUF (deleted unless --keep-f16)
  <model>-Q4_K_M.gguf          Quantized GGUF (one per --quant level)
  Modelfile                    Ollama Modelfile (for manual use or review)
  <model>-MLX-4bit/            MLX model directory (if --mlx was used)
  conversion_report.txt        Plain-text summary of the conversion run

GGUF files are also linked or copied to:
  Linux / macOS:
    ~/.cache/lm-studio/models/<publisher>/<model>-GGUF/
    (symlinks are used to avoid duplicating large files)
  Windows:
    C:\Users\<user>\.cache\lm-studio\models\<publisher>\<model>-GGUF\
    (files are copied; symlinks require elevated privileges on Windows)
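
The link-or-copy behaviour described above can be sketched as follows.
This is a minimal illustration, not the script's actual LMStudioSetup
code; the function name is hypothetical.  It tries a symlink first and
falls back to a copy when the platform refuses.

```python
# Place a GGUF file in a target directory as a symlink, falling back to
# a full copy when symlinks are not permitted (illustrative sketch).
import os
import shutil
from pathlib import Path


def link_or_copy(src: Path, dest_dir: Path) -> Path:
    """Return the path of the link or copy created inside dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.is_symlink() or dest.exists():
        dest.unlink()                    # replace a stale link or file
    try:
        os.symlink(src.resolve(), dest)
    except (OSError, NotImplementedError):
        shutil.copy2(src, dest)          # e.g. Windows without symlink privilege
    return dest
```

A symlink costs no extra disk space, but it breaks if the output
directory is later moved or deleted; a copy is independent but doubles
the space used by each GGUF file.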


------------------------------------------------------------------------
AFTER CONVERSION: RUNNING YOUR MODEL
------------------------------------------------------------------------

With llama.cpp directly (Linux / macOS):
  ./llama.cpp/build/bin/llama-cli \
      -m ./<model>-converted/<model>-Q4_K_M.gguf \
      -p "Your prompt here" -n 200 --log-disable

With llama.cpp directly (Windows):
  llama.cpp\build\bin\Release\llama-cli.exe ^
      -m .\<model>-converted\<model>-Q4_K_M.gguf ^
      -p "Your prompt here" -n 200 --log-disable

With Ollama (interactive chat):
  ollama run <your-ollama-name>

With Ollama (native REST API, bash/zsh on Linux / macOS):
  curl http://localhost:11434/api/chat \
    -H "Content-Type: application/json" \
    -d '{"model":"<name>","messages":[{"role":"user","content":"Hello"}],"stream":false}'

With Ollama (OpenAI-compatible API — for tools expecting OpenAI format):
  curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ollama" \
    -d '{"model":"<name>","messages":[{"role":"user","content":"Hello"}]}'

With Ollama (Windows Command Prompt — use escaped double quotes):
  curl http://localhost:11434/api/chat ^
    -H "Content-Type: application/json" ^
    -d "{\"model\":\"<name>\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"stream\":false}"

With Ollama (Windows PowerShell 5.1 — use Invoke-RestMethod or curl.exe):
  Invoke-RestMethod -Uri "http://localhost:11434/api/chat" `
    -Method POST `
    -ContentType "application/json" `
    -Body '{"model":"<name>","messages":[{"role":"user","content":"Hello"}],"stream":false}'

  Note: In PowerShell 5.1, `curl` is an alias for Invoke-WebRequest,
  not the curl binary.  Use `curl.exe` to call the real curl, or use
  Invoke-RestMethod as shown above.  In PowerShell 7+, `curl` calls
  the real binary directly.

With LM Studio:
  Open LM Studio → click "My Models" → select your model → Load Model

With MLX (Apple Silicon only):
  python -m mlx_lm.generate \
      --model ./<model>-converted/<model>-MLX-4bit \
      --prompt "Your prompt here" --max-tokens 200


------------------------------------------------------------------------
TROUBLESHOOTING
------------------------------------------------------------------------

Problem: "unknown architecture" or "Model architecture not supported"
         during GGUF conversion.
Solution: Update llama.cpp to the latest version.  Be aware that
  git reset --hard discards any local changes in the checkout:
    cd llama.cpp
    git fetch origin
    git reset --hard origin/HEAD
    pip install -r requirements.txt
  Then re-run the script with --no-build if already compiled.
  Note: older llama.cpp versions shipped the converter as convert.py
  or convert-hf-to-gguf.py; after updating, the script is named
  convert_hf_to_gguf.py.

Problem: The model generates garbage or repetitive output.
Solution: The chat template is likely wrong.  Review the generated
  Modelfile in the output directory.  The TEMPLATE block must use
  actual newline characters, not the two-character sequence \n.
  Compare the template against the model's tokenizer_config.json
  chat_template field on HuggingFace.
  For Llama-family models: Llama 2 (vocab_size=32000) uses the <<SYS>>
  format; Llama 3 (vocab_size=128256) uses start_header_id/eot_id.
  For DeepSeek R1/V3: the special tokens use full-width Unicode vertical
  bars (｜, U+FF5C), not ASCII pipes (|).  Copy them from the model's
  tokenizer_config.json rather than typing them manually.
  Adding PARAMETER repeat_penalty 1.1 can also help with looping.
  Edit the Modelfile and re-register:
    ollama create <name> -f ./Modelfile
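
The two template mistakes described above can be checked mechanically
before re-registering.  The following is a minimal sketch, not part of
the conversion script: find_template_problems is a hypothetical helper,
and it scans the whole Modelfile text with simple heuristics, so treat
its output as hints rather than a verdict.

```python
import re

FULLWIDTH_BAR = "\uff5c"  # the bar DeepSeek uses inside its special tokens


def find_template_problems(modelfile_text: str) -> list[str]:
    """Return warnings for two common template mistakes (heuristic)."""
    problems = []
    # A backslash followed by the letter n means the newline was written
    # as an escape sequence instead of a real line break.
    if "\\n" in modelfile_text:
        problems.append("literal \\n sequence found; use real newlines")
    # DeepSeek-style role tokens typed with ASCII pipes, e.g. <|User|>.
    if re.search(r"<\|(User|Assistant)\|>", modelfile_text):
        problems.append(
            f"role token uses ASCII | instead of {FULLWIDTH_BAR} (U+FF5C)"
        )
    return problems


if __name__ == "__main__":
    # Assumes a Modelfile in the current directory.
    with open("Modelfile", encoding="utf-8") as handle:
        for warning in find_template_problems(handle.read()):
            print("WARNING:", warning)
```

An empty result does not prove the template is right; it only rules out
these two specific errors.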

Problem: Out of memory during conversion.
Solution: The conversion script loads all weights at once.  For large
  models (13B+), use a machine with more RAM, or download a pre-converted
  GGUF from https://huggingface.co/bartowski where many popular models
  are already available.  For 70B+ models, use --split-max-size 4G to
  split the output GGUF into manageable shards.

Problem: llama.cpp compilation fails on Windows.
Solution: Make sure Visual Studio Build Tools are installed with the
  "Desktop development with C++" workload.  Run the script from a
  "Developer Command Prompt for Visual Studio".  CMake's -j flag for
  parallel compilation works on Windows with CMake 3.12 or later.

Problem: Ollama server not reachable.
Solution: Start the server in a separate terminal: ollama serve
  The server must be running before the script reaches Step 8.
  The script still generates the Modelfile and prints the manual
  registration command even if Ollama is not available.

Problem: mlx-lm not found on Apple Silicon.
Solution: Install it: pip install mlx-lm
  Make sure you are in the activated virtual environment.

Problem: HuggingFace download fails with 401 Unauthorized.
Solution: The model is gated.  Accept the license on HuggingFace,
  generate a token at https://huggingface.co/settings/tokens,
  and pass it with --hf-token hf_YourToken.

Problem: Download was interrupted; files are incomplete.
Solution: Re-run the same command.  The HuggingFace Hub CLI resumes
  interrupted downloads automatically.  The script detects complete
  downloads by checking for both config.json and at least one weight
  file; if either is missing it re-runs the download.
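
The completeness check described above can be sketched as follows.
This is an illustration only: download_looks_complete is a
hypothetical name, and the weight-file extensions are an assumption
(safetensors and the older PyTorch .bin format).

```python
from pathlib import Path

# Assumed weight-file patterns; the actual script may check others.
WEIGHT_PATTERNS = ("*.safetensors", "*.bin")


def download_looks_complete(model_dir: str) -> bool:
    """True if config.json and at least one weight file are present."""
    root = Path(model_dir)
    if not root.is_dir():
        return False
    has_config = (root / "config.json").is_file()
    # next(..., None) stops at the first match instead of listing all.
    has_weights = any(
        next(root.glob(pattern), None) is not None
        for pattern in WEIGHT_PATTERNS
    )
    return has_config and has_weights
```

A presence check like this cannot tell a truncated shard from a whole
one, which is why re-running the download (and letting the Hub client
resume) is the safer fix after an interruption.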

Problem: CUDA not detected even though an NVIDIA GPU is present.
Solution: Make sure nvidia-smi is on your PATH and the CUDA toolkit
  is installed.  If not on PATH, add it or pass --no-build and compile
  llama.cpp manually with -DGGML_CUDA=ON.

Problem: LM Studio does not show the model.
Solution: Make sure LM Studio is not running when the script installs
  files, then restart LM Studio.  Click "My Models" in the left sidebar.

Problem: The script says "Architecture does not look like a causal LM".
Solution: The model may be an encoder-only model (e.g. BERT, ViT) or a
  vision-language model.  This pipeline targets decoder-only causal LMs.
  Check the model card on HuggingFace to confirm it is a text-generation
  model.  If it is, the warning is a false positive and conversion may
  still succeed.

Problem: KeyboardInterrupt / Ctrl+C during a long step.
Solution: The script exits cleanly with code 130.  Partially created
  output files may remain.  Re-running the script will skip already-
  completed steps (download, F16 conversion, and per-level quantization
  are all idempotent).


------------------------------------------------------------------------
SUPPORTED MODEL FAMILIES AND TEMPLATE DETECTION
------------------------------------------------------------------------

Detection uses three tiers:
  1. Architecture class name from config.json
  2. Repository name substrings
  3. Heuristic inspection of the chat_template Jinja2 string

Supported families:

  Llama 3 / 3.1 / 3.2 / 3.3 (Meta)
    Detected by: LlamaForCausalLM + vocab_size >= 128000
    Template   : start_header_id / eot_id tokens

  Llama 2 Chat (Meta)
    Detected by: LlamaForCausalLM + vocab_size == 32000
    Template   : [INST] / [/INST] with <<SYS>> system block

  Code Llama (Meta)
    Detected by: CodeLlamaForCausalLM or "codellama" in repo name
    Template   : Llama 2 format (same template family)

  Mistral / Mixtral (Mistral AI)
    Detected by: MistralForCausalLM / MixtralForCausalLM
    Template   : [INST] / [/INST] (no <<SYS>> block)

  Gemma / Gemma 2 / Gemma 3 (Google)
    Detected by: GemmaForCausalLM / Gemma2ForCausalLM / Gemma3ForCausalLM
    Template   : start_of_turn / end_of_turn markers

  Qwen2 / Qwen2.5 (Alibaba) and Yi (01.AI)
    Detected by: Qwen2ForCausalLM / YiForCausalLM
    Template   : ChatML (im_start / im_end tokens)

  Phi-3 / Phi-3.5 / Phi-3.5 MoE (Microsoft)
    Detected by: Phi3ForCausalLM / PhiMoEForCausalLM
    Template   : <|user|> / <|assistant|> / <|end|> tokens

  DeepSeek V3 / R1 (DeepSeek AI)
    Detected by: DeepseekV2ForCausalLM / DeepseekV3ForCausalLM
                 or full-width vertical bars in chat_template
    Template   : <｜User｜> / <｜Assistant｜> (Unicode U+FF5C bars)
    WARNING    : Do NOT use ASCII | pipes for these tokens.

  Command-R / Command-R+ (Cohere)
    Detected by: CohereForCausalLM
    Template   : START_OF_TURN_TOKEN / END_OF_TURN_TOKEN

  Falcon (Technology Innovation Institute)
    Detected by: FalconForCausalLM
    Template   : User: / Assistant: plain text markers

  InternLM / Baichuan / ChatGLM / StarCoder (various)
    Detected by: architecture class name substring
    Template   : Generic ChatML (im_start / im_end)

  Generic ChatML (fallback)
    Used for fine-tuned models not matching the above families.
    Review the generated Modelfile and adjust TEMPLATE if needed.
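
The three tiers can be pictured with the short sketch below.  It is a
simplified illustration, not the script's actual code: detect_family
and the returned labels are hypothetical, and only a few families are
covered.  One ordering choice is an assumption of this sketch: the
full-width-bar check runs first, so that DeepSeek-style templates on
models that reuse another family's architecture class are not
mislabeled.

```python
def detect_family(config: dict, repo_id: str, chat_template: str = "") -> str:
    """Hypothetical three-tier lookup; returns a template family label."""
    archs = config.get("architectures") or [""]
    arch = archs[0]
    vocab = config.get("vocab_size", 0)

    # Assumption: a full-width bar (U+FF5C) is a strong DeepSeek signal,
    # so it is checked before the architecture name.
    if "\uff5c" in chat_template:
        return "deepseek"

    # Tier 1: architecture class name (plus vocab size for Llama).
    if arch == "LlamaForCausalLM":
        if vocab >= 128000:
            return "llama3"
        if vocab == 32000:
            return "llama2"
    tier1 = {
        "MistralForCausalLM": "mistral",
        "MixtralForCausalLM": "mistral",
        "Qwen2ForCausalLM": "chatml",
        "Phi3ForCausalLM": "phi3",
        "CohereForCausalLM": "command-r",
        "FalconForCausalLM": "falcon",
    }
    if arch in tier1:
        return tier1[arch]
    if arch.startswith("Gemma"):
        return "gemma"
    if arch.startswith("Deepseek"):
        return "deepseek"

    # Tier 2: repository name substrings.
    if "codellama" in repo_id.lower():
        return "llama2"

    # Tier 3: heuristic inspection of the chat_template string.
    if "<|im_start|>" in chat_template:
        return "chatml"

    return "chatml"  # generic ChatML fallback
```

Whatever the detection returns, the generated Modelfile remains the
source of truth and is worth reviewing by hand.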


------------------------------------------------------------------------
QUANTIZATION SELECTION GUIDE
------------------------------------------------------------------------

Q2_K   ~2.6 bits/weight  Smallest files; noticeable quality loss.
                         Use only when memory is very tight (< 4 GB).

Q3_K_M ~3.3 bits/weight  Good for 4–6 GB memory budgets.
                         Acceptable quality for conversational tasks.

Q4_K_M ~4.8 bits/weight  RECOMMENDED default.  Best quality/size balance.
                         Minimal quality loss vs float16 for most tasks.

Q5_K_M ~5.7 bits/weight  Near-original quality.  Recommended for complex
                         reasoning, coding, and factual recall tasks.

Q6_K   ~6.6 bits/weight  Very close to lossless.  Good for precision tasks
                         where memory allows.

Q8_0   ~8.5 bits/weight  Essentially lossless.  Use when quality is
                         paramount and memory is not a constraint.
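
To turn the bits-per-weight figures into a file size, multiply by the
parameter count and divide by 8.  The sketch below does that
arithmetic using the approximate figures from the guide above; the
function names are hypothetical, and the result covers weights only
(the KV cache and runtime buffers need additional memory).

```python
# Approximate bits per weight, copied from the guide above.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.3,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def estimate_gguf_gb(n_params: float, quant: str) -> float:
    """Rough weight-only size in GB (10**9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def largest_quant_that_fits(n_params: float, budget_gb: float):
    """Return the highest-precision level whose estimate fits, else None."""
    by_precision = sorted(
        BITS_PER_WEIGHT, key=BITS_PER_WEIGHT.get, reverse=True
    )
    for quant in by_precision:
        if estimate_gguf_gb(n_params, quant) <= budget_gb:
            return quant
    return None


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        size = estimate_gguf_gb(7e9, quant)
        print(f"{quant:7s} ~{size:.1f} GB for a 7B model")
```

For a 7-billion-parameter model at Q4_K_M this gives
7e9 x 4.8 / 8 = 4.2e9 bytes, or about 4.2 GB of weights.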


------------------------------------------------------------------------
NOTES ON OLLAMA API ENDPOINTS
------------------------------------------------------------------------

Ollama exposes two API styles:

  Native Ollama API:
    POST http://localhost:11434/api/chat
    POST http://localhost:11434/api/generate
    GET  http://localhost:11434/api/tags

  OpenAI-compatible API (use this with OpenAI client libraries):
    POST http://localhost:11434/v1/chat/completions
    GET  http://localhost:11434/v1/models

When integrating with tools or libraries that expect the OpenAI API
format, point them at the /v1/ endpoints, not /api/chat.  OpenAI client
libraries usually take a base URL that includes the /v1 prefix, so
configure http://localhost:11434/v1; the API key can be any non-empty
string (e.g. "ollama").
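
The difference between the two styles shows up in the URL, the request
body, and where the reply text sits in the response.  The sketch below
uses only the Python standard library; build_chat_request and
extract_reply are hypothetical helper names, and the default port
11434 is assumed.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # assumed default address


def build_chat_request(model, prompt, openai_compatible=False):
    """Build (but do not send) a chat request for either API style."""
    messages = [{"role": "user", "content": prompt}]
    headers = {"Content-Type": "application/json"}
    if openai_compatible:
        url = f"{OLLAMA_HOST}/v1/chat/completions"
        payload = {"model": model, "messages": messages}
        # Any non-empty key is accepted; "ollama" is a placeholder.
        headers["Authorization"] = "Bearer ollama"
    else:
        url = f"{OLLAMA_HOST}/api/chat"
        # The native API streams by default, so turn that off here.
        payload = {"model": model, "messages": messages, "stream": False}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers=headers, method="POST"
    )


def extract_reply(response_json, openai_compatible=False):
    """Pull the assistant text out of a non-streaming response."""
    if openai_compatible:
        return response_json["choices"][0]["message"]["content"]
    return response_json["message"]["content"]
```

With the server running, pass the request to urllib.request.urlopen,
decode the body with json.load, and hand the result to extract_reply
using the same openai_compatible value.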

Thursday, May 28, 2026
