Thursday, May 29, 2025

INTRODUCTION TO OLLAMA AND USING HUGGINGFACE MODELS

MOTIVATION


Ollama is a local tool that lets you run large language models (LLMs) on your own machine with minimal setup and full control. It runs models that have been quantized into GGUF format, the format used by the llama.cpp runtime. Ollama works well on macOS (including Apple Silicon), Linux (with optional CUDA acceleration), and Windows (natively or via WSL). It gives you a clean, scriptable interface similar to Docker, but for models.


You can chat with models from the terminal, connect to them over HTTP, and even bring your own fine-tuned models from HuggingFace—once they are converted into GGUF format.


RUNNING OLLAMA


Once installed, you can start running a prebuilt model by name. For example:


ollama run mistral


To download the model first, use:


ollama pull mistral


You can view all installed models:


ollama list


To delete one:


ollama rm mistral


To build and register a custom model:


ollama create mymodel -f Modelfile


You can start the REST server manually with:


ollama serve
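If you installed Ollama as a system service, the server is usually already running in the background. Either way, you can verify it is reachable by asking it to list the locally installed models over HTTP:


curl http://localhost:11434/api/tags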


USING HUGGINGFACE MODELS IN OLLAMA


Ollama does not directly accept HuggingFace models in .bin or .safetensors formats. You must first convert the model to GGUF format using tools provided by the llama.cpp community or third-party converters.


Once you have a GGUF file, you create a Modelfile to describe it. Here’s a minimal Modelfile:


FROM ./model.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER stop "User:"
PARAMETER num_ctx 4096
PARAMETER num_predict 300


You then register the model, giving it its name, using:


ollama create my-hf-model -f Modelfile


And run it with:


ollama run my-hf-model



COMMAND LINE OPTIONS


Ollama exposes a set of generation parameters. Depending on your version, these can be supplied as arguments to ollama run, set interactively with /set parameter inside a session, or given defaults in the Modelfile (see below):


--temperature
Controls randomness. Lower is more deterministic. Use values like 0.2 to 1.0.

--top-p
Nucleus sampling. Restricts tokens to those whose cumulative probability is at most p. Typical values are 0.8 to 0.95.

--top-k
Top-k sampling. Only the k most likely tokens are considered. Example: --top-k 40

--num-predict
Specifies how many tokens to generate. This is the response length.

--ctx-size
Sets the context window size in tokens. The default is 2048 or 4096, but models may support 8192, 16384 or more.

--repeat-penalty
Discourages repeated tokens. Use 1.1 to 1.3 to reduce repetition.

--stop
Stop token or phrase. Output ends when this string is generated. You can use multiple --stop entries.

--seed
Fixes the generation seed for reproducibility.

--verbose
Prints loading diagnostics, token generation time, and memory stats.

--system
Injects a system prompt at the start of the session.

--instruct
Toggles instruction-following mode even for base models.

--template
Applies a named prompt formatting template defined in ~/.ollama/templates.

--format json
Wraps the output in machine-readable JSON, ideal for scripting or pipelines.
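For example, assuming a model named llama2 is already pulled, you can pass a prompt directly on the command line and constrain the reply to JSON:


ollama run llama2 --format json "Return a JSON object listing three prime numbers."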


EXAMPLE USING ADVANCED FLAGS


ollama run llama2 \
  --system "You are a wise assistant who gives clear and precise answers." \
  --temperature 0.6 \
  --top-p 0.9 \
  --repeat-penalty 1.2 \
  --ctx-size 8192 \
  --num-predict 400 \
  --stop "<|user|>" \
  --verbose


SETTINGS INSIDE MODELFILE


You can include default parameters inside your Modelfile so they don’t need to be repeated in every CLI call.


FROM ./model.Q4_K_M.gguf
PARAMETER temperature 0.4
PARAMETER top_p 0.92
PARAMETER stop "###"
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
PARAMETER num_ctx 8192


Register the model with ollama create llama2-custom -f Modelfile; the name you pass to create is the name you use with ollama run. These defaults apply whenever you run the model unless they are overridden at runtime.
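You can also inspect or override these values from inside an interactive session; /show parameters and /set parameter are built-in commands of the ollama run prompt:


ollama run llama2-custom
>>> /show parameters
>>> /set parameter temperature 0.2
>>> Explain the difference between heat and temperature.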


RESOURCE CONTROL


Ollama uses memory-mapping and quantized execution layers. You can control some runtime behavior using environment variables.


OLLAMA_NUM_GPU_LAYERS

Specifies how many layers to load on the GPU. Higher values need more VRAM.

Example: export OLLAMA_NUM_GPU_LAYERS=40


OLLAMA_MAX_CONTEXT

Overrides the default maximum context size if your model supports more.

Example: export OLLAMA_MAX_CONTEXT=8192


OLLAMA_MODELS

Custom path where models are stored. Useful for SSDs or external drives.

Example: export OLLAMA_MODELS=/mnt/fastdisk/ollama


OMP_NUM_THREADS

Control the number of CPU threads for generation.

Example: export OMP_NUM_THREADS=8
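For example, to keep models on a faster disk and cap the CPU thread count before starting the server (exact variable names can differ between Ollama releases, so check ollama serve --help for your version):


export OLLAMA_MODELS=/mnt/fastdisk/ollama

export OMP_NUM_THREADS=8

ollama serve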


REST API


Ollama provides a local server at http://localhost:11434. You can send a POST request to generate output.


Example using curl:


curl http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Explain entropy in physics.", "stream": false, "options": {"temperature": 0.5}}'


You can also set top_p, top_k, num_predict, and stop sequences inside the options object; stream and format are top-level fields of the JSON payload.
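For scripting, the same endpoint is easy to call from Python. Below is a minimal sketch using the requests library; with "stream": false the server returns one JSON object whose response field contains the generated text (assuming the llama2 model has already been pulled):


import requests

payload = {
    "model": "llama2",                       # any model already pulled into Ollama
    "prompt": "Explain entropy in physics.",
    "stream": False,                         # one JSON object instead of a token stream
    "options": {                             # sampling settings live under "options"
        "temperature": 0.5,
        "top_p": 0.9,
        "num_predict": 200,
        "stop": ["User:"],
    },
}

response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
response.raise_for_status()
print(response.json()["response"])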



TEMPLATES


You can define custom prompt templates in the configuration directory ~/.ollama/templates.


A template can define system instructions, user/assistant prefixes, and format markers. You can invoke it with --template template_name.
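Templates can also be embedded directly in a Modelfile with the TEMPLATE directive, which uses Go template syntax; a minimal sketch for a plain User/Assistant format:


FROM ./model.Q4_K_M.gguf
TEMPLATE """{{ if .System }}{{ .System }}
{{ end }}User: {{ .Prompt }}
Assistant: """
PARAMETER stop "User:"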



VIEWING MODEL DETAILS


To inspect a model:


ollama show llama2


This prints model metadata including the architecture, parameter count, context length, quantization format, and any default parameters.


WHERE TO FIND GGUF MODELS


You can download quantized GGUF models that are ready for Ollama from:


https://ollama.com/library

https://huggingface.co/TheBloke

https://ggml.ai/models


Make sure the context size and quantization format (Q4_K_M, Q8_0, etc.) are compatible with your hardware and usage.


SUMMARY


Ollama is a versatile tool for running quantized LLMs locally. It wraps the power of llama.cpp in a clean, consistent CLI and REST API interface. You can run high-performance instruction-tuned models like Mistral or LLaMA 2, convert your own HuggingFace models to GGUF, control generation with fine-grained parameters, and even automate via API.


With full support for stop sequences, randomness tuning, context windows up to 16K or higher, and GPU-aware execution, Ollama is one of the most developer-friendly local LLM runtimes available today.




ADDENDUM - CONVERTING HUGGINGFACE MODELS INTO GGUF


Let’s now walk through the complete step-by-step process to convert a HuggingFace model into GGUF format so it can be used with Ollama. This includes every necessary step, from downloading the model to final GGUF creation, and preparing the Modelfile to use it in Ollama.


This procedure assumes you’re working on a Unix-like system (macOS or Linux), but it also works on Windows via WSL.



STEP 1: INSTALL DEPENDENCIES


You’ll need the following tools installed:

1. Python 3.10 or later

2. PyTorch plus the HuggingFace transformers, sentencepiece, and huggingface_hub libraries

3. git

4. cmake and g++

5. llama.cpp (cloned from GitHub)


To install dependencies:


pip install torch transformers sentencepiece huggingface_hub


To clone and build llama.cpp:


git clone https://github.com/ggerganov/llama.cpp

cd llama.cpp

make


This builds the conversion and quantization tools. (Recent llama.cpp releases use CMake instead: cmake -B build && cmake --build build --config Release, which places the binaries in build/bin.)



STEP 2: DOWNLOAD THE MODEL FROM HUGGINGFACE


Choose a HuggingFace model that is compatible with llama.cpp. That typically means LLaMA 2, Mistral, or Falcon-like models.


You can use the huggingface-cli tool or Python:


Example: Download NousResearch/Llama-2-7b-chat-hf


In Python:


from transformers import AutoTokenizer, AutoModelForCausalLM


model_id = "NousResearch/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("./llama2")

tokenizer.save_pretrained("./llama2")



This saves the model weights (.safetensors or .bin, depending on your transformers version) and the tokenizer files to ./llama2. Note that from_pretrained loads the entire model into memory, which for a 7B model takes a substantial amount of RAM.
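If you only need the files on disk, you can avoid loading the model into memory altogether and download the repository snapshot instead; a sketch using the huggingface_hub library installed earlier:


from huggingface_hub import snapshot_download

# Download the raw repository files (weights, tokenizer, configs)
# without instantiating the model in RAM.
snapshot_download(repo_id="NousResearch/Llama-2-7b-chat-hf", local_dir="./llama2")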



STEP 3: CONVERT TO GGUF


Now return to your llama.cpp directory and run the conversion script (convert.py in older checkouts, convert_hf_to_gguf.py in newer ones).


Make sure the tokenizer and model directories are available. Then run:


python3 convert.py ../llama2 --outfile ./llama2-f16.gguf --outtype f16


This converts the HuggingFace model to a 16-bit (float16) GGUF file, the usual starting point for quantization. If you want to keep full precision instead (at roughly twice the file size), produce a 32-bit version:


python3 convert.py ../llama2 --outfile ./llama2-f32.gguf --outtype f32


Check that the resulting .gguf file is created. This is your base model file.



STEP 4: QUANTIZE THE GGUF MODEL


You can now reduce the size of the model to something usable on a laptop using quantize.


Inside the llama.cpp directory (the binary is named llama-quantize in newer builds):


./quantize ./llama2-f16.gguf ./llama2.Q4_K_M.gguf Q4_K_M


This creates a quantized file that Ollama can use.


Available quantization types include:

Q2_K   (ultra small, lowest accuracy)
Q4_0, Q4_K_S, Q4_K_M (balanced for quality and size)
Q5_K_M, Q6_K (larger, higher accuracy)
Q8_0   (largest, near-lossless quality, highest memory use)


The .gguf file is now ready for Ollama.
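Before wiring the file into Ollama, you can optionally sanity-check it with llama.cpp's own command-line tool (called main in older builds and llama-cli in newer ones), for example:


./main -m ./llama2.Q4_K_M.gguf -p "Hello, how are you?" -n 32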



STEP 5: CREATE A MODELFILE FOR OLLAMA


Create a file called Modelfile in the same directory as your .gguf file:


FROM ./llama2.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_predict 300
PARAMETER num_ctx 4096
PARAMETER stop "User:"


You can adjust these parameters based on your use case and system performance.



STEP 6: REGISTER AND RUN THE MODEL IN OLLAMA


In the same directory as the Modelfile:


ollama create llama2-local -f Modelfile


This registers the model in Ollama’s internal registry.


You can now run the model interactively:


ollama run llama2-local


Or invoke it with custom parameters:


ollama run llama2-local --temperature 0.4 --num-predict 500


You can list and manage models:


ollama list

ollama rm llama2-local


TROUBLESHOOTING AND TIPS

1. If the tokenizer is not recognized during conversion, make sure your model directory contains tokenizer_config.json, tokenizer.model, and special_tokens_map.json.

2. If the conversion script fails, check that the model is a decoder-only (causal LM) architecture and not an encoder-decoder model like T5; a quick way to verify this is sketched after this list.

3. The best quantization type depends on your system:

Use Q4_K_M for general laptops (8GB RAM or M1/M2 Mac).

Use Q5_K or Q8_0 if you have >12GB VRAM or a powerful GPU.

4. If inference is too slow, try reducing num_ctx or lowering num_predict.

5. You can put multiple models into different directories and register them separately with different names.
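Regarding point 2, a quick way to verify that a model is decoder-only is to inspect the architectures field of its config.json; a small sketch, assuming the model directory from Step 2:


import json

# Read the architecture declared in the downloaded model's config.json.
with open("./llama2/config.json") as f:
    config = json.load(f)

architectures = config.get("architectures", [])
print(architectures)

# Causal (decoder-only) models report classes ending in "ForCausalLM",
# e.g. "LlamaForCausalLM"; encoder-decoder models such as T5 report
# "T5ForConditionalGeneration" and cannot be converted with this workflow.
if any(name.endswith("ForCausalLM") for name in architectures):
    print("Decoder-only model: conversion should work.")
else:
    print("Not a causal LM: llama.cpp conversion will likely fail.")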



SUMMARY


To convert a HuggingFace model to GGUF for Ollama:

Download and save the model from HuggingFace

Convert it to GGUF using llama.cpp/convert.py

Quantize it using quantize

Write a Modelfile with generation parameters

Register the model using ollama create

Run it via ollama run




MAKEFILE

Here is a complete plain ASCII Makefile that automates the full process of:

1. Downloading a HuggingFace model

2. Converting it to GGUF format using llama.cpp

3. Quantizing it (e.g. to Q4_K_M)

4. Creating a Modelfile

5. Registering the model in Ollama


Everything runs from the command line, assuming you have installed python3, git, llama.cpp, ollama, and all dependencies. Keep in mind that recipe lines in a Makefile must be indented with a real tab character.


MODEL_NAME := NousResearch/Llama-2-7b-chat-hf
OUTPUT_DIR := output
QUANT_TYPE := Q4_K_M
CONVERT_SCRIPT := llama.cpp/convert.py
GGUF_NAME := $(OUTPUT_DIR)/model-f16.gguf
QUANTIZED_NAME := $(OUTPUT_DIR)/model.$(QUANT_TYPE).gguf
MODELFILENAME := $(OUTPUT_DIR)/Modelfile
OLLAMA_NAME := llama2-local

.PHONY: all download convert quantize modelfile create run clean

all: run

$(OUTPUT_DIR):
	mkdir -p $(OUTPUT_DIR)

download: $(OUTPUT_DIR)
	python3 -c "\
	import transformers; \
	model = transformers.AutoModelForCausalLM.from_pretrained('$(MODEL_NAME)', trust_remote_code=True); \
	tokenizer = transformers.AutoTokenizer.from_pretrained('$(MODEL_NAME)', trust_remote_code=True); \
	model.save_pretrained('$(OUTPUT_DIR)'); \
	tokenizer.save_pretrained('$(OUTPUT_DIR)')"

convert: download
	python3 $(CONVERT_SCRIPT) $(OUTPUT_DIR) --outfile $(GGUF_NAME) --outtype f16

quantize: convert
	cd llama.cpp && ./quantize ../$(GGUF_NAME) ../$(QUANTIZED_NAME) $(QUANT_TYPE)

modelfile: quantize
	echo "FROM ./model.$(QUANT_TYPE).gguf" > $(MODELFILENAME)
	echo "PARAMETER temperature 0.7" >> $(MODELFILENAME)
	echo "PARAMETER top_p 0.9" >> $(MODELFILENAME)
	echo "PARAMETER repeat_penalty 1.1" >> $(MODELFILENAME)
	echo "PARAMETER num_predict 300" >> $(MODELFILENAME)
	echo "PARAMETER num_ctx 4096" >> $(MODELFILENAME)
	echo "PARAMETER stop \"User:\"" >> $(MODELFILENAME)

create: modelfile
	cd $(OUTPUT_DIR) && ollama create $(OLLAMA_NAME) -f Modelfile

run: create
	ollama run $(OLLAMA_NAME)

clean:
	rm -rf $(OUTPUT_DIR)
