MOTIVATION
Ollama is a local tool that allows you to run large language models (LLMs) on your own machine with minimal setup and full control. It supports models that have been quantized into GGUF format, which is used by the llama.cpp runtime. Ollama runs well on macOS (including Apple Silicon), Linux with CUDA, and Windows via WSL. It gives you a clean, scriptable interface similar to Docker, but for models.
You can chat with models from the terminal, connect to them over HTTP, and even bring your own fine-tuned models from HuggingFace—once they are converted into GGUF format.
RUNNING OLLAMA
Once installed, you can start running a prebuilt model by name. For example:
ollama run mistral
To download the model first, use:
ollama pull mistral
You can view all installed models:
ollama list
To remove one:
ollama rm mistral
To build and register a custom model:
ollama create mymodel -f Modelfile
You can start the REST server manually with:
ollama serve
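Once the server is running (via ollama serve or the desktop app), you can confirm it is reachable and list the installed models by querying the /api/tags endpoint. A minimal sketch using only Python's standard library, assuming the default address http://localhost:11434:
import json
import urllib.request
# Ask the local Ollama server which models are installed.
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    data = json.load(resp)
for model in data.get("models", []):
    print(model["name"], model.get("size", 0), "bytes")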
USING HUGGINGFACE MODELS IN OLLAMA
Ollama does not directly accept HuggingFace models in .bin or .safetensors formats. You must first convert the model to GGUF format using tools provided by the llama.cpp community or third-party converters.
Once you have a GGUF file, you create a Modelfile to describe it. Here's a minimal Modelfile (the model's name is supplied later by ollama create, not inside the file):
FROM ./model.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER stop "User:"
PARAMETER num_ctx 4096
PARAMETER num_predict 300
You then register the model using:
ollama create my-hf-model -f Modelfile
And run it with:
ollama run my-hf-model
GENERATION PARAMETERS AND CLI FLAGS
Ollama supports a set of generation parameters that control sampling behavior. They are set with PARAMETER lines in a Modelfile, in the options object of a REST API request, or interactively inside a running session with /set parameter <name> <value>; the ollama run command itself accepts only a few flags, such as --format and --verbose. The most useful parameters and flags are listed below; a Python example that passes these parameters through the API follows this list.
temperature
Controls randomness. Lower is more deterministic. Typical values are 0.2 to 1.0.
top_p
Nucleus sampling. Restricts candidates to the smallest set of tokens whose cumulative probability reaches p. Typical values are 0.8 to 0.95.
top_k
Top-k sampling. Only the k most likely tokens are considered. Example: PARAMETER top_k 40
num_predict
Maximum number of tokens to generate, i.e. the response length.
num_ctx
Sets the context window size in tokens. Common defaults are 2048 or 4096, but many models support 8192, 16384 or more.
repeat_penalty
Discourages repeated tokens. Values of 1.1 to 1.3 reduce repetition.
stop
Stop token or phrase. Output ends when this string is generated. You can define multiple stop values.
seed
Fixes the random seed for reproducible output.
system prompt
Set with the SYSTEM directive in the Modelfile, with /set system inside an interactive session, or via the system field of an API request.
template
The prompt format is defined with the TEMPLATE directive in the Modelfile (see TEMPLATES below). There is no separate instruction-mode switch; instruction-following behavior comes from the model and its template.
--verbose
An ollama run flag that prints loading diagnostics and token generation timing.
--format json
Available as an ollama run flag and as a format field in the API; constrains the model to reply with valid JSON, ideal for scripting or pipelines.
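As mentioned above, these parameters can be passed per request through the options object of the REST API. A minimal sketch using Python's standard library; the endpoint and field names follow the Ollama API, while the prompt and the specific values are only illustrative:
import json
import urllib.request
payload = {
    "model": "llama2",
    "prompt": "Summarize the second law of thermodynamics in two sentences.",
    "stream": False,  # return a single JSON object instead of a token stream
    "options": {
        "temperature": 0.6,
        "top_p": 0.9,
        "top_k": 40,
        "num_predict": 200,
        "repeat_penalty": 1.2,
        "num_ctx": 4096,
        "stop": ["User:"],
        "seed": 42,
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])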
EXAMPLE USING ADVANCED PARAMETERS
Inside an interactive session you can combine a system prompt with custom sampling parameters:
ollama run llama2 --verbose
>>> /set system "You are a wise assistant who gives clear and precise answers."
>>> /set parameter temperature 0.6
>>> /set parameter top_p 0.9
>>> /set parameter repeat_penalty 1.2
>>> /set parameter num_ctx 8192
>>> /set parameter num_predict 400
>>> /set parameter stop "<|user|>"
SETTINGS INSIDE MODELFILE
You can include default parameters inside your Modelfile so they don't need to be set again in every session or API call.
FROM ./model.Q4_K_M.gguf
PARAMETER temperature 0.4
PARAMETER top_p 0.92
PARAMETER stop "###"
PARAMETER repeat_penalty 1.15
PARAMETER num_predict 256
PARAMETER num_ctx 8192
These defaults will apply when you run the model unless overridden at runtime.
RESOURCE CONTROL
Ollama memory-maps quantized model files and can offload layers to the GPU. You can control some runtime behavior with parameters (in the Modelfile or the API options object) and environment variables.
num_gpu
Specifies how many layers to offload to the GPU. Higher values need more VRAM.
Example: PARAMETER num_gpu 40
num_ctx
Overrides the default context size if your model supports more.
Example: PARAMETER num_ctx 8192
OLLAMA_MODELS
Environment variable setting the path where models are stored. Useful for SSDs or external drives; restart the server after changing it.
Example: export OLLAMA_MODELS=/mnt/fastdisk/ollama
num_thread
Controls the number of CPU threads used for generation.
Example: PARAMETER num_thread 8
REST API
Ollama provides a local server at http://localhost:11434. You can send a POST request to generate output.
Example using curl:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Explain entropy in physics.", "options": {"temperature": 0.5}}'
You can also include top_p, top_k, num_predict, and stop sequences inside the options object, and request JSON output with the format field, all in the same payload.
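For multi-turn conversations there is also a /api/chat endpoint that takes a list of role-tagged messages and, by default, streams its reply as newline-delimited JSON chunks. A short sketch with the standard library; the conversation content is only illustrative:
import json
import urllib.request
payload = {
    "model": "llama2",
    "messages": [
        {"role": "system", "content": "You answer concisely."},
        {"role": "user", "content": "What is entropy?"},
    ],
    # "stream" defaults to true: each response line is one JSON chunk
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        print(chunk["message"]["content"], end="", flush=True)
print()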
TEMPLATES
Prompt templates are defined with the TEMPLATE directive in a Modelfile, using Go template syntax.
A template defines where the system prompt, the user prompt, and the model's reply go, through placeholders such as {{ .System }}, {{ .Prompt }}, and {{ .Response }}, together with any role prefixes or format markers the model expects. Because the template is stored with the model, it is applied automatically whenever you run it.
VIEWING MODEL DETAILS
To inspect a model:
ollama show llama2
This prints model metadata such as architecture, parameter count, quantization, context length, and the configured default parameters. Use ollama show llama2 --modelfile to print the underlying Modelfile.
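The same details are available programmatically from the /api/show endpoint, which takes the model name in a POST body. A small sketch; the request and response fields shown here follow the current API and may differ slightly between Ollama versions:
import json
import urllib.request
req = urllib.request.Request(
    "http://localhost:11434/api/show",
    data=json.dumps({"name": "llama2"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    info = json.load(resp)
# Typical fields include the Modelfile text, default parameters, and the prompt template.
print(info.get("parameters", ""))
print(info.get("template", ""))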
WHERE TO FIND GGUF MODELS
You can download quantized GGUF models that are ready for Ollama from:
https://ollama.com/library
https://huggingface.co/TheBloke
https://ggml.ai/models
Make sure the context size and quantization format (Q4_K_M, Q8_0, etc.) are compatible with your hardware and usage.
SUMMARY
Ollama is a versatile tool for running quantized LLMs locally. It wraps the power of llama.cpp in a clean, consistent CLI and REST API interface. You can run high-performance instruction-tuned models like Mistral or LLaMA 2, convert your own HuggingFace models to GGUF, control generation with fine-grained parameters, and even automate via API.
With full support for stop sequences, randomness tuning, context windows up to 16K or higher, and GPU-aware execution, Ollama is one of the most developer-friendly local LLM runtimes available today.
ADDENDUM - CONVERTING HUGGINGFACE MODELS INTO GGUF
Let’s now walk through the complete step-by-step process to convert a HuggingFace model into GGUF format so it can be used with Ollama. This includes every necessary step, from downloading the model to final GGUF creation, and preparing the Modelfile to use it in Ollama.
This procedure assumes you’re working on a Unix-like system (macOS or Linux), but also works in Windows via WSL.
STEP 1: INSTALL DEPENDENCIES
You’ll need the following tools installed:
1. Python 3.10 or later
2. PyTorch plus the HuggingFace transformers, sentencepiece, and huggingface_hub libraries
3. git
4. cmake and g++
5. llama.cpp (cloned from GitHub)
To install dependencies:
pip install torch transformers sentencepiece huggingface_hub
To clone and build llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
This builds the necessary conversion and quantization tools.
STEP 2: DOWNLOAD THE MODEL FROM HUGGINGFACE
Choose a HuggingFace model that is compatible with llama.cpp. That typically means LLaMA 2, Mistral, or Falcon-like models.
You can use the transformers CLI or Python:
Example: Download NousResearch/Llama-2-7b-chat-hf
In Python:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "NousResearch/Llama-2-7b-chat-hf"
# Load the weights in their native precision (fp16 for this model) to reduce RAM use.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Write the weights and tokenizer files to a local directory for conversion.
model.save_pretrained("./llama2")
tokenizer.save_pretrained("./llama2")
This saves the model weights (.safetensors or .bin) and the tokenizer files to ./llama2. Loading the full model this way needs enough RAM to hold the weights; a lighter-weight alternative follows.
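If you only need the files on disk for conversion, you can skip loading the model entirely and download the repository files directly with huggingface_hub. A sketch of that alternative; the target directory matches the one used in the conversion step:
from huggingface_hub import snapshot_download
# Download the raw repository files (weights, tokenizer, config) without
# loading the model into memory.
snapshot_download(
    repo_id="NousResearch/Llama-2-7b-chat-hf",
    local_dir="./llama2",
)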
STEP 3: CONVERT TO GGUF
Now return to your llama.cpp directory and run the convert.py script.
Make sure the tokenizer and model directories are available. Then run:
python3 convert.py ../llama2 --outfile ./llama2-f16.gguf
This converts the HuggingFace model to 16-bit GGUF format (float16). You can also keep full 32-bit precision, at roughly twice the file size, if you want the highest-fidelity source for later quantization:
python3 convert.py ../llama2 --outfile ./llama2-f32.gguf --outtype f32
Check that the resulting .gguf file is created. This is your base model file.
STEP 4: QUANTIZE THE GGUF MODEL
You can now reduce the size of the model to something usable on a laptop using quantize.
Inside llama.cpp:
./quantize ./llama2-f16.gguf ./llama2.Q4_K_M.gguf Q4_K_M
This creates a quantized file that Ollama can use.
Available quantization types include:
• Q2_K (ultra small, lowest accuracy)
• Q4_0, Q4_K, Q4_K_M (balanced for quality and size)
• Q5_K, Q6_K (larger, higher accuracy)
• Q8_0 (largest, best quality; needs the most memory, roughly 7 GB of RAM or VRAM for a 7B model)
The .gguf file is now ready for Ollama.
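As a rough way to judge which quantization fits your hardware, a quantized file's size is approximately the parameter count times the effective bits per weight, divided by eight. The bits-per-weight figures in this sketch are approximations, not exact values:
# Rough size estimate for quantized GGUF files; bits-per-weight values are
# approximate and vary slightly between models.
BITS_PER_WEIGHT = {
    "Q2_K": 3.35,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}
def estimated_size_gb(params_billion: float, quant: str) -> float:
    # params (in billions) * bits per weight / 8 gives billions of bytes, i.e. GB
    return params_billion * BITS_PER_WEIGHT[quant] / 8
for quant in BITS_PER_WEIGHT:
    print(f"7B model at {quant}: ~{estimated_size_gb(7, quant):.1f} GB")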
STEP 5: CREATE A MODELFILE FOR OLLAMA
Create a file called Modelfile in the same directory as your .gguf file:
FROM ./llama2.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_predict 300
PARAMETER num_ctx 4096
PARAMETER stop "User:"
You can adjust these parameters based on your use case and system performance.
STEP 6: REGISTER AND RUN THE MODEL IN OLLAMA
In the same directory as the Modelfile:
ollama create llama2-local -f Modelfile
This registers the model in Ollama’s internal registry.
You can now run the model interactively:
ollama run llama2-local
Or adjust generation parameters inside an interactive session:
ollama run llama2-local
>>> /set parameter temperature 0.4
>>> /set parameter num_predict 500
You can list and manage models:
ollama list
ollama rm llama2-local
TROUBLESHOOTING AND TIPS
1. If the tokenizer is not recognized during conversion, make sure your model directory contains tokenizer_config.json, tokenizer.model, and special_tokens_map.json (a small check script follows this list).
2. If convert.py fails, check that the model is decoder-only (a causal LM) and not an encoder-decoder model like T5.
3. The best quantization type depends on your system:
• Use Q4_K_M for general laptops (8GB RAM or M1/M2 Mac).
• Use Q5_K or Q8_0 if you have >12GB VRAM or a powerful GPU.
4. If inference is too slow, try reducing num_ctx or limiting the num_predict value.
5. You can put multiple models into different directories and register them separately with different names.
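Relating to tip 1 above, a quick way to verify a model directory before running convert.py is a short check script; the expected file list mirrors the tip and can vary by model family:
import os
import sys
# Files llama.cpp's convert.py typically expects alongside the weights.
REQUIRED = ["config.json", "tokenizer_config.json", "tokenizer.model", "special_tokens_map.json"]
model_dir = sys.argv[1] if len(sys.argv) > 1 else "./llama2"
missing = [name for name in REQUIRED if not os.path.exists(os.path.join(model_dir, name))]
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files are present in", model_dir)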
SUMMARY
To convert a HuggingFace model to GGUF for Ollama:
• Download and save the model from HuggingFace
• Convert it to GGUF using llama.cpp/convert.py
• Quantize it using quantize
• Write a Modelfile with generation parameters
• Register the model using ollama create
• Run it via ollama run
MAKEFILE
Here is a complete plain ASCII Makefile that automates the full process of:
1. Downloading a HuggingFace model
2. Converting it to GGUF format using llama.cpp
3. Quantizing it (e.g. to Q4_K_M)
4. Creating a Modelfile
5. Registering the model in Ollama
Everything runs from the command line, assuming you have installed python3, git, llama.cpp, ollama, and all dependencies.
MODEL_NAME := NousResearch/Llama-2-7b-chat-hf
OUTPUT_DIR := output
QUANT_TYPE := Q4_K_M
CONVERT_SCRIPT := llama.cpp/convert.py
GGUF_NAME := $(OUTPUT_DIR)/model-f16.gguf
QUANTIZED_NAME := $(OUTPUT_DIR)/model.$(QUANT_TYPE).gguf
MODELFILENAME := $(OUTPUT_DIR)/Modelfile
OLLAMA_NAME := llama2-local
.PHONY: all download convert quantize modelfile create run clean
all: run
$(OUTPUT_DIR):
	mkdir -p $(OUTPUT_DIR)
download: $(OUTPUT_DIR)
	python3 -c "\
	import transformers; \
	model = transformers.AutoModelForCausalLM.from_pretrained('$(MODEL_NAME)', trust_remote_code=True); \
	tokenizer = transformers.AutoTokenizer.from_pretrained('$(MODEL_NAME)', trust_remote_code=True); \
	model.save_pretrained('$(OUTPUT_DIR)'); \
	tokenizer.save_pretrained('$(OUTPUT_DIR)')"
convert: download
	python3 $(CONVERT_SCRIPT) $(OUTPUT_DIR) --outfile $(GGUF_NAME)
quantize: convert
	cd llama.cpp && ./quantize ../$(GGUF_NAME) ../$(QUANTIZED_NAME) $(QUANT_TYPE)
modelfile: quantize
	echo "FROM ./model.$(QUANT_TYPE).gguf" > $(MODELFILENAME)
	echo "PARAMETER temperature 0.7" >> $(MODELFILENAME)
	echo "PARAMETER top_p 0.9" >> $(MODELFILENAME)
	echo "PARAMETER repeat_penalty 1.1" >> $(MODELFILENAME)
	echo "PARAMETER num_predict 300" >> $(MODELFILENAME)
	echo "PARAMETER num_ctx 4096" >> $(MODELFILENAME)
	echo "PARAMETER stop \"User:\"" >> $(MODELFILENAME)
create: modelfile
	cd $(OUTPUT_DIR) && ollama create $(OLLAMA_NAME) -f Modelfile
run: create
	ollama run $(OLLAMA_NAME)
clean:
	rm -rf $(OUTPUT_DIR)
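A short usage note: save this as Makefile in the directory that contains your llama.cpp checkout (recipe lines must be indented with real tab characters), then run make to execute the whole pipeline, or make clean to remove the output directory. Because download, convert, quantize, and modelfile are phony targets rather than file targets, rerunning make repeats every step; that keeps the sketch simple at the cost of some redundant work.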