Thursday, May 29, 2025

INTRODUCTION TO OLLAMA AND USING HUGGINGFACE MODELS

MOTIVATION


Ollama is a local tool that lets you run large language models (LLMs) on your own machine with minimal setup and full control. It runs models that have been quantized into GGUF format, the format used by the llama.cpp runtime. Ollama works well on macOS (including Apple Silicon), Linux (with optional CUDA acceleration), and Windows (natively or via WSL). It gives you a clean, scriptable interface similar to Docker, but for models.


You can chat with models from the terminal, connect to them over HTTP, and even bring your own fine-tuned models from HuggingFace—once they are converted into GGUF format.


RUNNING OLLAMA


Once installed, you can start running a prebuilt model by name. For example:


ollama run mistral


To download the model first, use:


ollama pull mistral


You can view all installed models:


ollama list


To delete one:


ollama rm mistral


To build and register a custom model:


ollama create mymodel -f Modelfile


You can start the REST server manually with:


ollama serve
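If you installed Ollama as a system service, the server is usually already running in the background. Either way, you can verify it is reachable by asking it to list the locally installed models over HTTP:


curl http://localhost:11434/api/tags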


USING HUGGINGFACE MODELS IN OLLAMA


Ollama does not directly accept HuggingFace models in .bin or .safetensors formats. You must first convert the model to GGUF format using tools provided by the llama.cpp community or third-party converters.


Once you have a GGUF file, you create a Modelfile to describe it. Here’s a minimal Modelfile:


FROM ./model.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER stop "User:"
PARAMETER num_ctx 4096
PARAMETER num_predict 300


You then register the model, giving it its name, using:


ollama create my-hf-model -f Modelfile


And run it with:


ollama run my-hf-model



COMMAND LINE OPTIONS


Ollama exposes a set of generation parameters. Depending on your version, these can be supplied as arguments to ollama run, set interactively with /set parameter inside a session, or given defaults in the Modelfile (see below):


--temperature
Controls randomness. Lower is more deterministic. Use values like 0.2 to 1.0.

--top-p
Nucleus sampling. Restricts tokens to those whose cumulative probability is at most p. Typical values are 0.8 to 0.95.

--top-k
Top-k sampling. Only the k most likely tokens are considered. Example: --top-k 40

--num-predict
Specifies how many tokens to generate. This is the response length.

--ctx-size
Sets the context window size in tokens. The default is 2048 or 4096, but models may support 8192, 16384 or more.

--repeat-penalty
Discourages repeated tokens. Use 1.1 to 1.3 to reduce repetition.

--stop
Stop token or phrase. Output ends when this string is generated. You can use multiple --stop entries.

--seed
Fixes the generation seed for reproducibility.

--verbose
Prints loading diagnostics, token generation time, and memory stats.

--system
Injects a system prompt at the start of the session.

--instruct
Toggles instruction-following mode even for base models.

--template
Applies a named prompt formatting template defined in ~/.ollama/templates.

--format json
Wraps the output in machine-readable JSON, ideal for scripting or pipelines.
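For example, assuming a model named llama2 is already pulled, you can pass a prompt directly on the command line and constrain the reply to JSON:


ollama run llama2 --format json "Return a JSON object listing three prime numbers."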


EXAMPLE USING ADVANCED FLAGS


ollama run llama2 \
  --system "You are a wise assistant who gives clear and precise answers." \
  --temperature 0.6 \
  --top-p 0.9 \
  --repeat-penalty 1.2 \
  --ctx-size 8192 \
  --num-predict 400 \
  --stop "<|user|>" \
  --verbose


SETTINGS INSIDE MODELFILE


You can include default parameters inside your Modelfile so they don’t need to be repeated in every CLI call.


FROM ./model.Q4_K_M.gguf
PARAMETER temperature 0.4
PARAMETER top_p 0.92
PARAMETER stop "###"
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
PARAMETER num_ctx 8192


Register the model with ollama create llama2-custom -f Modelfile; the name you pass to create is the name you use with ollama run. These defaults apply whenever you run the model unless they are overridden at runtime.
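You can also inspect or override these values from inside an interactive session; /show parameters and /set parameter are built-in commands of the ollama run prompt:


ollama run llama2-custom
>>> /show parameters
>>> /set parameter temperature 0.2
>>> Explain the difference between heat and temperature.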


RESOURCE CONTROL


Ollama uses memory-mapping and quantized execution layers. You can control some runtime behavior using environment variables.


OLLAMA_NUM_GPU_LAYERS

Specifies how many layers to load on the GPU. Higher values need more VRAM.

Example: export OLLAMA_NUM_GPU_LAYERS=40


OLLAMA_MAX_CONTEXT

Overrides the default maximum context size if your model supports more.

Example: export OLLAMA_MAX_CONTEXT=8192


OLLAMA_MODELS

Custom path where models are stored. Useful for SSDs or external drives.

Example: export OLLAMA_MODELS=/mnt/fastdisk/ollama


OMP_NUM_THREADS

Control the number of CPU threads for generation.

Example: export OMP_NUM_THREADS=8
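For example, to keep models on a faster disk and cap the CPU thread count before starting the server (exact variable names can differ between Ollama releases, so check ollama serve --help for your version):


export OLLAMA_MODELS=/mnt/fastdisk/ollama

export OMP_NUM_THREADS=8

ollama serve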


REST API


Ollama provides a local server at http://localhost:11434. You can send a POST request to generate output.


Example using curl:


curl http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Explain entropy in physics.", "stream": false, "options": {"temperature": 0.5}}'


You can also set top_p, top_k, num_predict, and stop sequences inside the options object; stream and format are top-level fields of the JSON payload.
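For scripting, the same endpoint is easy to call from Python. Below is a minimal sketch using the requests library; with "stream": false the server returns one JSON object whose response field contains the generated text (assuming the llama2 model has already been pulled):


import requests

payload = {
    "model": "llama2",                       # any model already pulled into Ollama
    "prompt": "Explain entropy in physics.",
    "stream": False,                         # one JSON object instead of a token stream
    "options": {                             # sampling settings live under "options"
        "temperature": 0.5,
        "top_p": 0.9,
        "num_predict": 200,
        "stop": ["User:"],
    },
}

response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
response.raise_for_status()
print(response.json()["response"])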



TEMPLATES


You can define custom prompt templates in the configuration directory ~/.ollama/templates.


A template can define system instructions, user/assistant prefixes, and format markers. You can invoke it with --template template_name.
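Templates can also be embedded directly in a Modelfile with the TEMPLATE directive, which uses Go template syntax; a minimal sketch for a plain User/Assistant format:


FROM ./model.Q4_K_M.gguf
TEMPLATE """{{ if .System }}{{ .System }}
{{ end }}User: {{ .Prompt }}
Assistant: """
PARAMETER stop "User:"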



VIEWING MODEL DETAILS


To inspect a model:


ollama show llama2


This prints model metadata including the architecture, parameter count, context length, quantization format, and any default parameters.


WHERE TO FIND GGUF MODELS


You can download quantized GGUF models that are ready for Ollama from:


https://ollama.com/library

https://huggingface.co/TheBloke

https://ggml.ai/models


Make sure the context size and quantization format (Q4_K_M, Q8_0, etc.) are compatible with your hardware and usage.


SUMMARY


Ollama is a versatile tool for running quantized LLMs locally. It wraps the power of llama.cpp in a clean, consistent CLI and REST API interface. You can run high-performance instruction-tuned models like Mistral or LLaMA 2, convert your own HuggingFace models to GGUF, control generation with fine-grained parameters, and even automate via API.


With full support for stop sequences, randomness tuning, context windows up to 16K or higher, and GPU-aware execution, Ollama is one of the most developer-friendly local LLM runtimes available today.




ADDENDUM - CONVERTING HUGGINGFACE MODELS INTO GGUF


Let’s now walk through the complete step-by-step process to convert a HuggingFace model into GGUF format so it can be used with Ollama. This includes every necessary step, from downloading the model to final GGUF creation, and preparing the Modelfile to use it in Ollama.


This procedure assumes you’re working on a Unix-like system (macOS or Linux), but it also works on Windows via WSL.



STEP 1: INSTALL DEPENDENCIES


You’ll need the following tools installed:

1. Python 3.10 or later

2. PyTorch plus the HuggingFace transformers, sentencepiece, and huggingface_hub libraries

3. git

4. cmake and g++

5. llama.cpp (cloned from GitHub)


To install dependencies:


pip install torch transformers sentencepiece huggingface_hub


To clone and build llama.cpp:


git clone https://github.com/ggerganov/llama.cpp

cd llama.cpp

make


This builds the conversion and quantization tools. (Recent llama.cpp releases use CMake instead: cmake -B build && cmake --build build --config Release, which places the binaries in build/bin.)



STEP 2: DOWNLOAD THE MODEL FROM HUGGINGFACE


Choose a HuggingFace model that is compatible with llama.cpp. That typically means LLaMA 2, Mistral, or Falcon-like models.


You can use the huggingface-cli tool or Python:


Example: Download NousResearch/Llama-2-7b-chat-hf


In Python:


from transformers import AutoTokenizer, AutoModelForCausalLM


model_id = "NousResearch/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("./llama2")

tokenizer.save_pretrained("./llama2")



This saves the model weights (.safetensors or .bin, depending on your transformers version) and the tokenizer files to ./llama2. Note that from_pretrained loads the entire model into memory, which for a 7B model takes a substantial amount of RAM.
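If you only need the files on disk, you can avoid loading the model into memory altogether and download the repository snapshot instead; a sketch using the huggingface_hub library installed earlier:


from huggingface_hub import snapshot_download

# Download the raw repository files (weights, tokenizer, configs)
# without instantiating the model in RAM.
snapshot_download(repo_id="NousResearch/Llama-2-7b-chat-hf", local_dir="./llama2")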



STEP 3: CONVERT TO GGUF


Now return to your llama.cpp directory and run the conversion script (convert.py in older checkouts, convert_hf_to_gguf.py in newer ones).


Make sure the tokenizer and model directories are available. Then run:


python3 convert.py ../llama2 --outfile ./llama2-f16.gguf --outtype f16


This converts the HuggingFace model to a 16-bit (float16) GGUF file, the usual starting point for quantization. If you want to keep full precision instead (at roughly twice the file size), produce a 32-bit version:


python3 convert.py ../llama2 --outfile ./llama2-f32.gguf --outtype f32


Check that the resulting .gguf file is created. This is your base model file.



STEP 4: QUANTIZE THE GGUF MODEL


You can now reduce the size of the model to something usable on a laptop using quantize.


Inside the llama.cpp directory (the binary is named llama-quantize in newer builds):


./quantize ./llama2-f16.gguf ./llama2.Q4_K_M.gguf Q4_K_M


This creates a quantized file that Ollama can use.


Available quantization types include:

Q2_K   (ultra small, lowest accuracy)
Q4_0, Q4_K_S, Q4_K_M (balanced for quality and size)
Q5_K_M, Q6_K (larger, higher accuracy)
Q8_0   (largest, near-lossless quality, highest memory use)


The .gguf file is now ready for Ollama.
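Before wiring the file into Ollama, you can optionally sanity-check it with llama.cpp's own command-line tool (called main in older builds and llama-cli in newer ones), for example:


./main -m ./llama2.Q4_K_M.gguf -p "Hello, how are you?" -n 32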



STEP 5: CREATE A MODELFILE FOR OLLAMA


Create a file called Modelfile in the same directory as your .gguf file:


FROM ./llama2.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_predict 300
PARAMETER num_ctx 4096
PARAMETER stop "User:"


You can adjust these parameters based on your use case and system performance.



STEP 6: REGISTER AND RUN THE MODEL IN OLLAMA


In the same directory as the Modelfile:


ollama create llama2-local -f Modelfile


This registers the model in Ollama’s internal registry.


You can now run the model interactively:


ollama run llama2-local


Or invoke it with custom parameters:


ollama run llama2-local --temperature 0.4 --num-predict 500


You can list and manage models:


ollama list

ollama rm llama2-local


TROUBLESHOOTING AND TIPS

1. If the tokenizer is not recognized during conversion, make sure your model directory contains tokenizer_config.json, tokenizer.model, and special_tokens_map.json.

2. If the conversion script fails, check that the model is a decoder-only (causal LM) architecture and not an encoder-decoder model like T5; a quick way to verify this is sketched after this list.

3. The best quantization type depends on your system:

Use Q4_K_M for general laptops (8GB RAM or M1/M2 Mac).

Use Q5_K or Q8_0 if you have >12GB VRAM or a powerful GPU.

4. If inference is too slow, try reducing num_ctx or lowering num_predict.

5. You can put multiple models into different directories and register them separately with different names.
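Regarding point 2, a quick way to verify that a model is decoder-only is to inspect the architectures field of its config.json; a small sketch, assuming the model directory from Step 2:


import json

# Read the architecture declared in the downloaded model's config.json.
with open("./llama2/config.json") as f:
    config = json.load(f)

architectures = config.get("architectures", [])
print(architectures)

# Causal (decoder-only) models report classes ending in "ForCausalLM",
# e.g. "LlamaForCausalLM"; encoder-decoder models such as T5 report
# "T5ForConditionalGeneration" and cannot be converted with this workflow.
if any(name.endswith("ForCausalLM") for name in architectures):
    print("Decoder-only model: conversion should work.")
else:
    print("Not a causal LM: llama.cpp conversion will likely fail.")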



SUMMARY


To convert a HuggingFace model to GGUF for Ollama:

Download and save the model from HuggingFace

Convert it to GGUF using llama.cpp/convert.py

Quantize it using quantize

Write a Modelfile with generation parameters

Register the model using ollama create

Run it via ollama run




MAKEFILE

Here is a complete plain ASCII Makefile that automates the full process of:

1. Downloading a HuggingFace model

2. Converting it to GGUF format using llama.cpp

3. Quantizing it (e.g. to Q4_K_M)

4. Creating a Modelfile

5. Registering the model in Ollama


Everything runs from the command line, assuming you have installed python3, git, llama.cpp, ollama, and all dependencies. Keep in mind that recipe lines in a Makefile must be indented with a real tab character.


MODEL_NAME := NousResearch/Llama-2-7b-chat-hf
OUTPUT_DIR := output
QUANT_TYPE := Q4_K_M
CONVERT_SCRIPT := llama.cpp/convert.py
GGUF_NAME := $(OUTPUT_DIR)/model-f16.gguf
QUANTIZED_NAME := $(OUTPUT_DIR)/model.$(QUANT_TYPE).gguf
MODELFILENAME := $(OUTPUT_DIR)/Modelfile
OLLAMA_NAME := llama2-local

.PHONY: all download convert quantize modelfile create run clean

all: run

$(OUTPUT_DIR):
	mkdir -p $(OUTPUT_DIR)

download: $(OUTPUT_DIR)
	python3 -c "\
	import transformers; \
	model = transformers.AutoModelForCausalLM.from_pretrained('$(MODEL_NAME)', trust_remote_code=True); \
	tokenizer = transformers.AutoTokenizer.from_pretrained('$(MODEL_NAME)', trust_remote_code=True); \
	model.save_pretrained('$(OUTPUT_DIR)'); \
	tokenizer.save_pretrained('$(OUTPUT_DIR)')"

convert: download
	python3 $(CONVERT_SCRIPT) $(OUTPUT_DIR) --outfile $(GGUF_NAME) --outtype f16

quantize: convert
	cd llama.cpp && ./quantize ../$(GGUF_NAME) ../$(QUANTIZED_NAME) $(QUANT_TYPE)

modelfile: quantize
	echo "FROM ./model.$(QUANT_TYPE).gguf" > $(MODELFILENAME)
	echo "PARAMETER temperature 0.7" >> $(MODELFILENAME)
	echo "PARAMETER top_p 0.9" >> $(MODELFILENAME)
	echo "PARAMETER repeat_penalty 1.1" >> $(MODELFILENAME)
	echo "PARAMETER num_predict 300" >> $(MODELFILENAME)
	echo "PARAMETER num_ctx 4096" >> $(MODELFILENAME)
	echo "PARAMETER stop \"User:\"" >> $(MODELFILENAME)

create: modelfile
	cd $(OUTPUT_DIR) && ollama create $(OLLAMA_NAME) -f Modelfile

run: create
	ollama run $(OLLAMA_NAME)

clean:
	rm -rf $(OUTPUT_DIR)
