INTRODUCTION TO FINE-TUNING LOCAL LLMS
The ability to run and customize large language models on your own hardware has become increasingly accessible. Fine-tuning allows you to adapt pre-trained models to your specific domain, writing style, or task requirements without the massive computational resources needed for training from scratch.
This tutorial will guide you through the complete process of fine-tuning local LLMs using popular tools like Ollama and Apple MLX, ensuring you understand each step deeply enough to apply these techniques to your own projects.
Fine-tuning is fundamentally different from training a model from scratch.
When you fine-tune, you start with a model that already understands language and has general knowledge. Your goal is to teach it specialized knowledge or behaviors by continuing the training process on a carefully curated dataset. This approach requires significantly less data and computational power than initial training, making it practical for individual developers and small teams.
The landscape of local LLM fine-tuning has evolved rapidly. Tools like Ollama have made it remarkably simple to run models locally, while frameworks like Apple MLX provide hardware-optimized training capabilities for Mac users. Understanding when to use each tool and how they complement each other is essential for efficient fine-tuning workflows.
UNDERSTANDING THE FINE-TUNING PROCESS
Before diving into specific tools, you need to understand what happens during fine-tuning at a conceptual level. The pre-trained model has learned patterns from billions of tokens of text. Fine-tuning adjusts the model's weights based on your specific dataset, essentially teaching it to prioritize certain patterns or knowledge domains over others.
There are several approaches to fine-tuning. Full fine-tuning updates all parameters in the model, which provides maximum flexibility but requires substantial memory and computational resources. Parameter-efficient fine-tuning methods like LoRA (Low-Rank Adaptation) update only a small subset of parameters, dramatically reducing resource requirements while maintaining effectiveness for most tasks. For local fine-tuning, parameter-efficient methods are typically the most practical choice.
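To make the resource difference concrete, the following sketch counts trainable parameters for a single weight matrix under full fine-tuning and under a LoRA update of the form W + BA. The 4096 x 4096 shape and the rank of 16 are illustrative assumptions, not values taken from a specific model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # full fine-tuning: every entry of W is trainable
    lora = rank * (d_in + d_out)   # LoRA: B is d_out x rank, A is rank x d_in
    return full, lora


# Assumed, illustrative dimensions for one projection matrix.
full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```

Under these assumptions the adapter trains well under one percent of the parameters of the matrix it modifies, which is where most of the memory saving comes from.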
The quality of your fine-tuning results depends heavily on your training data. You need examples that represent the behavior you want the model to learn. For instruction-following tasks, this means pairs of instructions and desired responses. For domain adaptation, you need representative text from your target domain. The data must be formatted correctly and be of sufficient quality and quantity to produce meaningful improvements.
PREREQUISITES AND ENVIRONMENT SETUP
Before beginning the fine-tuning process, you need to prepare your development environment with the necessary tools and dependencies. The specific requirements vary depending on which approach you choose, but some fundamentals apply across all methods.
For Ollama-based fine-tuning, you need a system with adequate RAM and ideally a GPU, though CPU-only operation is possible for smaller models. Ollama itself is straightforward to install on Linux, macOS, and Windows. You will also need Python for data preparation and potentially for running fine-tuning scripts.
If you plan to use Apple MLX, you need a Mac with Apple Silicon (M1, M2, M3, or later). MLX is specifically optimized for these chips and takes advantage of the unified memory architecture. The installation process for MLX is simple through pip, but you should ensure you have a recent version of macOS for optimal compatibility.
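If you are unsure whether your Python interpreter is running natively on Apple Silicon, a quick check like the one below can help. This is a minimal sketch using only the standard library; the function name is hypothetical, and note that an interpreter running under Rosetta reports an Intel architecture even on an Apple Silicon Mac.

```python
import platform


def is_apple_silicon() -> bool:
    """Return True when running natively on an Apple Silicon Mac."""
    # macOS reports "Darwin"; native Apple Silicon reports "arm64".
    return platform.system() == "Darwin" and platform.machine() == "arm64"


print("Apple Silicon detected:", is_apple_silicon())
```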
Setting up your Python environment properly is crucial. Creating a virtual environment isolates your project dependencies and prevents conflicts. Here is how you would set up a basic environment for fine-tuning work. First, create the virtual environment:
python3 -m venv llm_finetuning_env
Next, activate the virtual environment:
source llm_finetuning_env/bin/activate
Now upgrade pip to the latest version:
pip install --upgrade pip
Install the core PyTorch and Hugging Face libraries:
pip install torch transformers datasets
Install Ollama Python bindings if you plan to use Ollama:
pip install ollama
Install MLX and MLX-LM if you are on Apple Silicon:
pip install mlx mlx-lm
This creates an isolated environment and installs the core libraries you will need. The transformers library from Hugging Face provides essential utilities for working with language models. The datasets library simplifies loading and processing training data. Torch is the underlying deep learning framework used by many fine-tuning tools.
PREPARING YOUR TRAINING DATA
Data preparation is perhaps the most critical step in the fine-tuning process. The quality and format of your training data directly determine the quality of your fine-tuned model. You need to understand the expected format for your chosen fine-tuning method and transform your raw data accordingly.
Most fine-tuning approaches expect data in a conversational or instruction format. The standard format includes a system message that sets the context, a user message containing the input or question, and an assistant message with the desired response. This format teaches the model to respond appropriately to different types of inputs.
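Let me show you how to structure training data properly. Suppose you want to fine-tune a model to answer questions about a specific product domain. Your data should be in JSONL format, where each line is a valid JSON object representing one training example. The example below is pretty-printed across several lines for readability; in the actual file, each example must occupy a single line: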
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant specialized in industrial automation systems."
    },
    {
      "role": "user",
      "content": "What is a PLC and how does it differ from a DCS?"
    },
    {
      "role": "assistant",
      "content": "A PLC (Programmable Logic Controller) is a ruggedized computer used for industrial automation. It differs from a DCS (Distributed Control System) primarily in architecture and application scope. PLCs are typically used for discrete control tasks and sequential operations, while DCS systems are designed for continuous process control across large facilities."
    }
  ]
}
Each training example should demonstrate the exact behavior you want the model to learn. If you want the model to be concise, your examples should be concise. If you want detailed technical explanations, your examples should demonstrate that style. The model learns by imitation, so consistency in your training data is essential.
Creating high-quality training data often requires significant effort. You might start with existing documentation, customer support transcripts, or expert-written content. However, you typically need to clean and reformat this data. Here is a Python script that demonstrates how to convert raw question-answer pairs into the proper format. First, import the necessary module:
import json
Now define a function to create a single training example:
def create_training_example(system_prompt, question, answer):
    """
    Creates a properly formatted training example for fine-tuning.
    Args:
        system_prompt: The system message that sets context
        question: The user's question or input
        answer: The desired assistant response
    Returns:
        A dictionary formatted for fine-tuning
    """
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer}
        ]
    }
This function encapsulates the logic for creating a single training example. It takes three parameters and returns a properly structured dictionary. The function includes comprehensive documentation explaining its purpose and parameters, following clean code principles.
Now let me show you how to use this function to convert multiple question-answer pairs into a complete training dataset:
def convert_qa_pairs_to_training_data(qa_pairs, system_prompt, output_file):
    """
    Converts a list of question-answer pairs into JSONL training data.
    Args:
        qa_pairs: List of tuples containing (question, answer)
        system_prompt: The system message to use for all examples
        output_file: Path to the output JSONL file
    """
    with open(output_file, 'w', encoding='utf-8') as f:
        for question, answer in qa_pairs:
            example = create_training_example(system_prompt, question, answer)
            f.write(json.dumps(example, ensure_ascii=False) + '\n')
    print(f"Created {len(qa_pairs)} training examples in {output_file}")
This function handles the batch conversion process. It opens the output file with UTF-8 encoding to properly handle international characters. For each question-answer pair, it creates a training example and writes it as a JSON line. The ensure_ascii parameter is set to False to preserve non-ASCII characters in their original form.
Here is how you would use these functions in practice. First, define your question-answer data:
qa_data = [
    (
        "How do I reset the controller?",
        "To reset the controller, press and hold the reset button for 3 seconds until the LED blinks twice."
    ),
    (
        "What is the maximum operating temperature?",
        "The maximum operating temperature is 85 degrees Celsius in ambient conditions."
    ),
    (
        "How often should I perform maintenance?",
        "Regular maintenance should be performed every 6 months or after 2000 operating hours, whichever comes first."
    )
]
Define the system message:
system_message = "You are a technical support assistant for industrial equipment."
Call the conversion function:
convert_qa_pairs_to_training_data(qa_data, system_message, "training_data.jsonl")
This example demonstrates creating a small training dataset from question-answer pairs. In a real scenario, you would have many more examples, but the process remains the same. The script provides a clean, reusable way to format your data consistently.
The amount of training data you need depends on your task complexity and how different it is from the base model's capabilities. For simple style adaptation, you might need only a few dozen high-quality examples. For teaching new domain knowledge, you typically need hundreds or thousands of examples. Quality always trumps quantity, so focus on creating excellent examples rather than gathering massive amounts of mediocre data.
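Whatever the dataset size, it helps to hold back a small portion of the examples for validation so that you can monitor training on data the model has not seen. The sketch below shuffles a JSONL file and writes separate training and validation files. The function name, the 10 percent default, and the fixed seed are assumptions chosen for illustration.

```python
import json
import random


def split_jsonl(input_path, train_path, valid_path, valid_fraction=0.1, seed=0):
    """Shuffle JSONL examples and write train/validation files.

    Returns a tuple (number_of_training_examples, number_of_validation_examples).
    """
    with open(input_path, "r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        examples = [json.loads(line) for line in f if line.strip()]

    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(examples)

    # Keep at least one validation example when there is more than one example.
    n_valid = max(1, int(len(examples) * valid_fraction)) if len(examples) > 1 else 0
    valid, train = examples[:n_valid], examples[n_valid:]

    for path, rows in ((train_path, train), (valid_path, valid)):
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(train), len(valid)
```

A call such as split_jsonl("all_data.jsonl", "training_data.jsonl", "validation_data.jsonl") would produce the two files, where all_data.jsonl is a hypothetical file holding every example.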
FINE-TUNING WITH OLLAMA
Ollama has become popular for running local LLMs because of its simplicity and Docker-like interface. Ollama focuses on inference and does not train models itself. A Modelfile can customize a model's system prompt and sampling parameters, but it does not change the weights. To fine-tune for Ollama, you train with an external tool and then package the result so that Ollama can load it.
The Ollama ecosystem works with GGUF, the model file format used by llama.cpp. GGUF files commonly store quantized weights, which makes them well suited to CPU and consumer GPU inference. To fine-tune for Ollama, you typically train using standard tools and then convert the result to GGUF format. Some training libraries can also export to GGUF directly, which removes the separate conversion step.
One practical approach is using the Unsloth library, which provides efficient fine-tuning capabilities and can export to formats that Ollama understands. Unsloth optimizes memory usage and training speed, making it suitable for local fine-tuning on consumer hardware. Let me walk you through a complete fine-tuning workflow using this approach.
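First, you need to install Unsloth and its dependencies. The commands below reflect one published set of installation instructions; Unsloth's recommended steps change between releases, so check the project README if they fail. Install Unsloth from the GitHub repository: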
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Install additional required dependencies:
pip install --no-deps xformers trl peft accelerate bitsandbytes
Now you can write a fine-tuning script. This script loads a base model, prepares it for efficient fine-tuning using LoRA, trains on your data, and saves the result. Start by importing the necessary libraries:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
Define configuration parameters:
max_seq_length = 2048
dtype = None
load_in_4bit = True
Load the base model with optimizations:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
Configure LoRA for parameter-efficient fine-tuning:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)
Load and prepare the training dataset:
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
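Define a function to format the prompts. This relies on the tokenizer shipping a chat template. Base (non-instruct) checkpoints often do not include one, in which case apply_chat_template raises an error; you should then either start from an instruct variant or attach a template first (Unsloth offers get_chat_template in unsloth.chat_templates for this):
def format_prompts(examples):
    """
    Formats the dataset examples into the prompt structure expected by the model.
    This function converts the messages format into a single text string.
    """
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)
    return {"text": texts}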
Apply the formatting function to the dataset:
dataset = dataset.map(format_prompts, batched=True)
Configure training parameters:
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
)
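Initialize the trainer. The keyword arguments below match older TRL releases; newer TRL versions move dataset_text_field, max_seq_length, and packing into an SFTConfig object and rename tokenizer to processing_class, so adjust the call to your installed version:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)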
Execute the training process:
print("Starting fine-tuning process...")
trainer_stats = trainer.train()
Save the fine-tuned model:
model.save_pretrained("fine_tuned_model")
tokenizer.save_pretrained("fine_tuned_model")
print("Fine-tuning complete. Model saved to fine_tuned_model directory.")
This script demonstrates a complete fine-tuning workflow with several important considerations. The load_in_4bit parameter enables quantization during training, which dramatically reduces memory requirements. This allows you to fine-tune 7B parameter models on consumer GPUs with 8-16GB of VRAM.
The LoRA configuration is crucial for efficient fine-tuning. The rank parameter controls the expressiveness of the adaptation. A rank of 16 is a good starting point, balancing between model capacity and memory usage. The target_modules specify which parts of the model to adapt. For most transformer models, adapting the attention and feed-forward projections provides good results.
The training arguments deserve careful attention. The batch size and gradient accumulation steps together determine your effective batch size. With a per-device batch size of 2 and gradient accumulation of 4, your effective batch size is 8. This matters because larger effective batch sizes generally lead to more stable training, but you are limited by available memory.
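The arithmetic can be written out as a short sketch. The values mirror the training arguments used in the script; the single-device count is an assumption.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_devices = 1  # assumption: training on a single GPU

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Each optimizer step consumes one effective batch.
examples_seen = effective_batch_size * max_steps

print(effective_batch_size, examples_seen)  # 8 480
```

With max_steps set to 60, the run processes 480 examples in total, so a dataset smaller than that is seen more than once.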
The learning rate is another critical hyperparameter. A learning rate of 2e-4 works well for many fine-tuning tasks, but you might need to adjust it based on your specific situation. If training loss decreases too slowly, try increasing the learning rate. If loss oscillates or increases, reduce it.
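Keep in mind that the configured learning rate is only the peak value. With lr_scheduler_type set to "linear" and warmup_steps set to 5, the rate ramps up and then decays toward zero. The sketch below approximates that shape in plain Python so you can see the value at any step; it is an illustration of a linear warmup and decay schedule, not the trainer's own implementation.

```python
def linear_schedule_lr(step, peak_lr=2e-4, warmup_steps=5, total_steps=60):
    """Approximate a linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from the peak down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 30, 60):
    print(step, linear_schedule_lr(step))
```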
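After training completes, you need to convert the model to a format Ollama can use. Note that calling save_pretrained on a LoRA-wrapped model writes only the adapter weights, not a complete model, so the adapters must first be merged into the base weights. Unsloth provides a helper for this, for example model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit"); check the Unsloth documentation for the options available in your installed version. The merged directory is in Hugging Face format and can then be converted to GGUF with the llama.cpp conversion script (recent llama.cpp checkouts name it convert_hf_to_gguf.py):
python convert-hf-to-gguf.py merged_model --outtype q8_0 --outfile fine_tuned_model.gguf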
The outtype parameter specifies the quantization level. The q8_0 format uses 8-bit quantization, providing a good balance between model size and quality. Once you have the GGUF file, you can create an Ollama Modelfile to make it accessible through Ollama:
FROM ./fine_tuned_model.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
SYSTEM You are a helpful assistant specialized in industrial automation systems.
Save this as Modelfile, then create the Ollama model:
ollama create my-finetuned-model -f Modelfile
Now you can use your fine-tuned model through Ollama just like any other model:
ollama run my-finetuned-model "What is a PLC?"
This workflow demonstrates the complete process from training to deployment. The key advantage of this approach is that you end up with a model that runs efficiently on local hardware through Ollama's optimized inference engine.
FINE-TUNING WITH APPLE MLX
For developers working on Apple Silicon Macs, MLX provides an excellent alternative that is specifically optimized for these systems. MLX is a machine learning framework developed by Apple that takes full advantage of the unified memory architecture in Apple Silicon, allowing you to work with larger models than would be possible with traditional frameworks.
The MLX ecosystem includes mlx-lm, a package specifically designed for working with language models. It provides utilities for fine-tuning, inference, and model conversion. The performance on Apple Silicon can be remarkable, with M1 Max and higher chips capable of fine-tuning 7B parameter models at reasonable speeds.
Setting up for MLX fine-tuning requires installing the MLX packages and preparing your data in the expected format. MLX-lm expects training data in a similar conversational format to what we discussed earlier. Here is how you would fine-tune a model using MLX. Start by importing the necessary libraries:
import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.tuner import train, evaluate
import json
Define the configuration dictionary:
config = {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    "train_data": "training_data.jsonl",
    "valid_data": "validation_data.jsonl",
    "adapter_file": "adapters.npz",
    "iters": 100,
    "steps_per_eval": 10,
    "val_batches": 5,
    "learning_rate": 1e-5,
    "batch_size": 2,
    "lora_layers": 16,
}
The configuration dictionary contains all the parameters needed for fine-tuning. The model parameter specifies which base model to use. MLX has a growing collection of pre-converted models optimized for Apple Silicon. The training and validation data files should be in JSONL format with the same structure we created earlier.
The lora_layers parameter determines how many transformer layers will have LoRA adapters applied. A typical 7B model such as Mistral 7B has 32 transformer layers, so setting this to 16 adapts only the last 16 of them. The learning rate of 1e-5 is more conservative than the 2e-4 used in the Unsloth example; treat it as a starting point and tune it for your data.
Before starting fine-tuning, you should verify your data is properly formatted. Here is a function to validate your training data:
def validate_training_data(file_path):
    """
    Validates that training data is properly formatted for MLX fine-tuning.
    Args:
        file_path: Path to the JSONL training data file
    Returns:
        True if valid, raises exception otherwise
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            try:
                data = json.loads(line)
                if "messages" not in data:
                    raise ValueError(f"Line {line_num}: Missing 'messages' key")
                messages = data["messages"]
                if not isinstance(messages, list):
                    raise ValueError(f"Line {line_num}: 'messages' must be a list")
                for msg in messages:
                    if "role" not in msg or "content" not in msg:
                        raise ValueError(f"Line {line_num}: Invalid message format")
                    if msg["role"] not in ["system", "user", "assistant"]:
                        raise ValueError(f"Line {line_num}: Invalid role")
            except json.JSONDecodeError as e:
                raise ValueError(f"Line {line_num}: Invalid JSON - {str(e)}")
    print(f"Validation successful: {file_path}")
    return True
Call the validation function:
validate_training_data("training_data.jsonl")
This validation function checks each line of your training data to ensure it meets the required format. It verifies that the JSON is valid, that the messages key exists, and that each message has the correct structure. Running this before fine-tuning can save you from discovering formatting issues after training has already started.
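Now you can execute the fine-tuning process with MLX. The train function in mlx_lm.tuner does not accept a plain dictionary; in current releases it expects an optimizer, dataset objects, and a TrainingArgs instance, and its signature has changed between versions. The most dependable entry point is therefore the LoRA command that ships with mlx-lm. It reads its data from a directory containing train.jsonl and valid.jsonl, so first copy training_data.jsonl and validation_data.jsonl to data/train.jsonl and data/valid.jsonl. The settings from the configuration above then translate to the following command (the flag for the number of layers is --num-layers in recent releases and --lora-layers in older ones; run the command with --help to confirm the options for your version):
python -m mlx_lm.lora --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --train --data data --iters 100 --steps-per-eval 10 --val-batches 5 --learning-rate 1e-5 --batch-size 2 --num-layers 16
When training finishes, the adapters are written to the adapter output location (an adapters directory by default in recent releases, a single adapters.npz file in older ones).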
The MLX fine-tuning process is remarkably efficient on Apple Silicon. The unified memory architecture means the entire model can be kept in memory without transfers between CPU and GPU memory, which significantly speeds up training. During training, you will see periodic evaluation metrics that help you monitor progress.
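After fine-tuning completes, the LoRA adapters are saved to disk. These adapters are much smaller than the full model, typically tens to a few hundred megabytes even for large models. To use your fine-tuned model, you load the base model and apply the adapters. The adapter_path argument should point to wherever your mlx-lm version wrote the adapters: older releases produce a single adapters.npz file as shown here, while recent releases write an adapters directory containing adapters.safetensors. Import the necessary functions:
from mlx_lm import load, generate
Load the model with adapters:
model, tokenizer = load(
    "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    adapter_path="adapters.npz"
)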
Generate a response:
prompt = "What is a PLC and how does it work?"
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=200,
    temp=0.7
)
print(response)
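This code loads the base model and applies your fine-tuned adapters, then generates a response to a prompt. The generation parameters like max_tokens and temperature control the output characteristics. You can adjust these based on your needs. Note that the way temperature is passed has changed across mlx-lm releases: older versions accept temp directly as shown here, while newer ones expect a sampler object built with make_sampler from mlx_lm.sample_utils.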
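One significant advantage of the MLX approach is the ability to easily merge the adapters back into the base model for faster inference. mlx-lm provides a fuse command for this. Assuming your adapters were saved to a directory named adapters, fuse them into the base model and save the result:
python -m mlx_lm.fuse --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --adapter-path adapters --save-path fused_model
The resulting fused_model directory can then be loaded with load("fused_model") without specifying adapter_path.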
The fused model eliminates the overhead of applying adapters during inference, resulting in faster generation speeds. This is particularly useful if you plan to use the model extensively in production.
UNDERSTANDING QUANTIZATION IN FINE-TUNING
Quantization is the process of reducing the precision of model weights, typically from 32-bit or 16-bit floating point to 8-bit or even 4-bit integers. This dramatically reduces model size and memory requirements, making it possible to run larger models on consumer hardware. However, the relationship between quantization and fine-tuning requires careful consideration.
There are two main scenarios where quantization intersects with fine-tuning. The first is fine-tuning on top of a quantized base model, the approach popularized by QLoRA (Quantized Low-Rank Adaptation) and used by tools like Unsloth and MLX: the base weights are frozen in 4-bit form and only small higher-precision adapters are trained, which reduces memory requirements. This should not be confused with quantization-aware training, a separate technique that simulates quantization during training so that the weights themselves become robust to it. The second scenario is post-training quantization, where you fine-tune in full precision and then quantize the result for deployment.
For local fine-tuning, training adapters on a quantized base is typically the better choice because it allows you to work with larger models given your hardware constraints. Techniques like QLoRA are designed to preserve most of the model quality even when the base weights are held in 4-bit precision.
Here is how quantization affects your fine-tuning workflow. When you load a model with 4-bit quantization, the base model weights are stored in 4-bit format, but the LoRA adapters are trained in higher precision. This hybrid approach provides the memory benefits of quantization while maintaining the training quality of higher precision.
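A rough back-of-the-envelope estimate shows why this matters. The sketch below computes the memory needed just to hold the base weights of a 7B parameter model at different precisions. It deliberately ignores quantization metadata, activations, adapter weights, and optimizer state, so treat the numbers as lower bounds rather than measured requirements.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Estimate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is what brings a 7B model within reach of consumer hardware.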
The choice of quantization level depends on your priorities. 8-bit quantization (q8_0 in GGUF format) provides excellent quality with moderate size reduction. 4-bit quantization (q4_0 or q4_K_M) offers more aggressive compression with some quality tradeoff. For most fine-tuning tasks, the quality difference is minimal, making 4-bit quantization an excellent choice.
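After fine-tuning, you might want to experiment with different quantization levels for deployment. The llama.cpp conversion script only writes a small set of output types (such as f16, bf16, and q8_0); the K-quant formats are produced in a second step by the separate llama-quantize tool (named quantize in older llama.cpp builds). In the commands below, fine_tuned_model stands for the directory holding the complete, merged model weights. Convert directly to 8-bit quantization:
python convert-hf-to-gguf.py fine_tuned_model --outtype q8_0 --outfile model_q8.gguf
For the K-quants, first convert to a 16-bit GGUF file:
python convert-hf-to-gguf.py fine_tuned_model --outtype f16 --outfile model_f16.gguf
Quantize to 4-bit mixed quantization:
./llama-quantize model_f16.gguf model_q4.gguf Q4_K_M
Quantize to 5-bit mixed quantization:
./llama-quantize model_f16.gguf model_q5.gguf Q5_K_M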
You can then test each quantized version to find the best balance between size and quality for your specific use case. The K_M variants use mixed quantization, applying different quantization levels to different parts of the model for optimal quality-to-size ratio.
EVALUATING YOUR FINE-TUNED MODEL
After fine-tuning completes, thorough evaluation is essential to determine whether the model has learned the desired behaviors. Evaluation should test both the specific capabilities you trained for and ensure the model has not lost general capabilities from the base model.
The most straightforward evaluation approach is qualitative testing with representative prompts. Create a test set of questions or tasks that cover the range of behaviors you want the model to exhibit. Compare the fine-tuned model's responses to the base model's responses to see the improvement.
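Here is a script for systematic qualitative evaluation. Import the necessary modules, including the generate function from mlx_lm that the evaluation function calls:
import json
from typing import List, Dict
from mlx_lm import generate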
Define a function to load test prompts:
def load_test_prompts(file_path: str) -> List[Dict]:
    """
    Loads test prompts from a JSONL file.
    Args:
        file_path: Path to the test prompts file
    Returns:
        List of test prompt dictionaries
    """
    prompts = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            prompts.append(json.loads(line))
    return prompts
Define a function to evaluate model responses:
def evaluate_model_responses(model, tokenizer, test_prompts: List[Dict]):
    """
    Evaluates model responses on a set of test prompts.
    Args:
        model: The fine-tuned model
        tokenizer: The model's tokenizer
        test_prompts: List of test prompt dictionaries
    """
    results = []
    for prompt_data in test_prompts:
        prompt = prompt_data["prompt"]
        expected_behavior = prompt_data.get("expected_behavior", "")
        response = generate(
            model,
            tokenizer,
            prompt=prompt,
            max_tokens=300,
            temp=0.7
        )
        result = {
            "prompt": prompt,
            "response": response,
            "expected_behavior": expected_behavior
        }
        results.append(result)
        print(f"\nPrompt: {prompt}")
        print(f"Response: {response}")
        print(f"Expected: {expected_behavior}")
        print("-" * 80)
    return results
This evaluation script loads test prompts and generates responses, displaying them for manual review. For more rigorous evaluation, you might implement automated metrics or use another LLM to judge response quality.
Quantitative evaluation is also important, especially for tasks with clear right and wrong answers. If you are fine-tuning for a specific task like classification or information extraction, you can compute standard metrics like accuracy, precision, and recall:
def calculate_accuracy(predictions: List[str], ground_truth: List[str]) -> float:
    """
    Calculates accuracy for classification tasks.
    Args:
        predictions: List of predicted labels
        ground_truth: List of correct labels
    Returns:
        Accuracy as a float between 0 and 1
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("Predictions and ground truth must have same length")
    correct = sum(1 for pred, truth in zip(predictions, ground_truth)
                  if pred.strip().lower() == truth.strip().lower())
    return correct / len(predictions)
Example usage:
predictions = ["PLC", "DCS", "SCADA", "PLC"]
ground_truth = ["PLC", "DCS", "SCADA", "HMI"]
accuracy = calculate_accuracy(predictions, ground_truth)
print(f"Accuracy: {accuracy:.2%}")
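Accuracy alone can hide how the model behaves on an individual label. As a complement, the sketch below computes precision and recall for one chosen label, using the same case-insensitive comparison as the accuracy function. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration.

```python
from typing import List, Tuple


def precision_recall(predictions: List[str], ground_truth: List[str],
                     positive_label: str) -> Tuple[float, float]:
    """Compute precision and recall for a single label."""
    pos = positive_label.strip().lower()
    preds = [p.strip().lower() for p in predictions]
    truths = [t.strip().lower() for t in ground_truth]

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for p, t in zip(preds, truths) if p == pos and t == pos)
    fp = sum(1 for p, t in zip(preds, truths) if p == pos and t != pos)
    fn = sum(1 for p, t in zip(preds, truths) if p != pos and t == pos)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


precision, recall = precision_recall(
    ["PLC", "DCS", "SCADA", "PLC"], ["PLC", "DCS", "SCADA", "HMI"], "PLC"
)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

For the sample labels, the model predicts PLC twice but only one of those is correct, so precision for PLC is 0.5 while recall is 1.0.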
For generation tasks, you might use metrics like BLEU or ROUGE to compare generated text to reference text, though these metrics have limitations and should be combined with human evaluation.
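To give a feel for what such overlap metrics measure, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It uses naive whitespace tokenization and lowercasing, which are simplifying assumptions; established implementations handle tokenization, stemming, and longer n-grams more carefully.

```python
from collections import Counter


def unigram_f1(generated: str, reference: str) -> float:
    """Compute a ROUGE-1-style F1 score from unigram overlap."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((gen & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("the plc controls motors", "the plc controls pumps"))
```

A score like this rewards shared wording, not correctness, which is one reason the limitations mentioned above matter in practice.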
Another important aspect of evaluation is checking for regression. Your fine-tuned model should maintain the base model's general capabilities while adding the new specialized knowledge. Test the model on general questions unrelated to your fine-tuning domain to ensure it still performs well:
general_test_prompts = [
    "Explain the concept of recursion in programming.",
    "What are the main causes of climate change?",
    "How does photosynthesis work?"
]
If the model's performance on general questions has degraded significantly, you might need to adjust your fine-tuning approach. This could mean reducing the learning rate, using fewer training steps, or including more diverse examples in your training data.
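One way to organize this regression check is to collect base and fine-tuned responses side by side for later review. The sketch below is framework-agnostic: base_generate and tuned_generate are hypothetical callables that you would wrap around whichever inference call you use, and the function only pairs up their outputs.

```python
from typing import Callable, Dict, List


def compare_models(prompts: List[str],
                   base_generate: Callable[[str], str],
                   tuned_generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Collect base and fine-tuned responses for each prompt."""
    rows = []
    for prompt in prompts:
        rows.append({
            "prompt": prompt,
            "base": base_generate(prompt),    # response from the original model
            "tuned": tuned_generate(prompt),  # response from the fine-tuned model
        })
    return rows


# Stand-in callables for illustration only; replace with real model calls.
demo = compare_models(
    ["How does photosynthesis work?"],
    base_generate=lambda p: f"[base answer to: {p}]",
    tuned_generate=lambda p: f"[tuned answer to: {p}]",
)
print(demo[0]["base"])
print(demo[0]["tuned"])
```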
ADVANCED FINE-TUNING TECHNIQUES
Once you have mastered basic fine-tuning, several advanced techniques can further improve your results. These techniques address common challenges like catastrophic forgetting, data scarcity, and training instability.
Catastrophic forgetting occurs when fine-tuning causes the model to lose capabilities it had before training. One mitigation strategy is mixing your specialized training data with general examples from the base model's training distribution. This helps the model maintain broad capabilities while learning new specialized knowledge:
def create_mixed_dataset(specialized_data: List[Dict],
                         general_data: List[Dict],
                         mix_ratio: float = 0.2) -> List[Dict]:
    """
    Creates a mixed dataset combining specialized and general examples.
    Args:
        specialized_data: Your domain-specific training examples
        general_data: General examples from diverse domains
        mix_ratio: Proportion of general examples to include
    Returns:
        Combined dataset with mixed examples
    """
    import random
    num_general = int(len(specialized_data) * mix_ratio)
    sampled_general = random.sample(general_data,
                                    min(num_general, len(general_data)))
    mixed_data = specialized_data + sampled_general
    random.shuffle(mixed_data)
    return mixed_data
This function takes your specialized training data and mixes in a proportion of general examples. A mix ratio of 0.2 adds general examples equal to 20 percent of the number of specialized examples, so general examples make up roughly 17 percent (0.2 / 1.2) of the combined dataset. Experiment with different ratios to find the best balance for your use case.
Another advanced technique is curriculum learning, where you organize training examples from simple to complex. This can improve learning efficiency and final model quality:
def sort_by_complexity(examples: List[Dict]) -> List[Dict]:
    """
    Sorts training examples by complexity for curriculum learning.
    Args:
        examples: List of training examples
    Returns:
        Examples sorted by increasing complexity
    """
    def estimate_complexity(example: Dict) -> int:
        """Estimates complexity based on response length and vocabulary."""
        response = example["messages"][-1]["content"]
        words = response.split()
        unique_words = len(set(words))
        return len(words) + unique_words
    return sorted(examples, key=estimate_complexity)
This function provides a simple complexity estimate based on response length and vocabulary diversity. More sophisticated approaches might consider syntactic complexity or domain-specific difficulty metrics.
For scenarios with limited training data, data augmentation can help. You can create variations of your existing examples through paraphrasing or by using another LLM to generate similar examples:
def augment_training_example(example: Dict, num_variations: int = 2) -> List[Dict]:
    """
    Creates augmented variations of a training example.
    Args:
        example: Original training example
        num_variations: Number of variations to create
    Returns:
        List containing original and augmented examples
    """
    augmented = [example]
    original_question = example["messages"][1]["content"]
    original_answer = example["messages"][2]["content"]
    paraphrase_prompts = [
        f"Rephrase this question while keeping the same meaning: {original_question}",
        f"Ask the same question in a different way: {original_question}"
    ]
    # Note: You would use an LLM to generate actual paraphrases
    # This is a simplified example showing the structure
    return augmented
Data augmentation should be used carefully to avoid introducing noise or inconsistencies into your training data. Always review augmented examples before including them in your training set.
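To make the structure more concrete, here is a hedged sketch of how paraphrases might be merged into the training set once you have a way to produce them. The names `augment_with_paraphrases` and `paraphrase_fn` are hypothetical: `paraphrase_fn` stands in for whatever you use to generate paraphrases (an LLM call, a manual list, and so on) and is assumed to take a question and return a list of rewordings.

```python
from typing import Callable, Dict, List


def augment_with_paraphrases(
    example: Dict,
    paraphrase_fn: Callable[[str], List[str]],
) -> List[Dict]:
    """Hypothetical helper: returns the original example plus one copy
    per distinct paraphrased question, keeping the answer unchanged."""
    messages = example["messages"]
    # Locate the first user message instead of assuming a fixed position.
    user_index = next(
        i for i, message in enumerate(messages) if message["role"] == "user"
    )
    original_question = messages[user_index]["content"]
    augmented = [example]
    seen = {original_question.strip().lower()}
    for candidate in paraphrase_fn(original_question):
        normalized = candidate.strip().lower()
        if not normalized or normalized in seen:
            continue  # Skip empty or duplicate paraphrases.
        seen.add(normalized)
        # Copy each message so the original example is left untouched.
        new_messages = [dict(message) for message in messages]
        new_messages[user_index]["content"] = candidate.strip()
        augmented.append({"messages": new_messages})
    return augmented
```

The duplicate check here is only a case-insensitive string comparison, so it does not catch paraphrases that change the meaning; manual review is still needed.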
DEPLOYMENT CONSIDERATIONS
After successfully fine-tuning your model, you need to consider how to deploy it for actual use. The deployment approach depends on your requirements for latency, throughput, privacy, and resource availability.
For local deployment with Ollama, you have already seen how to create a Modelfile and register your model. This approach is excellent for personal use or small-scale applications. Ollama provides a simple REST API that you can use to integrate the model into applications. Import the necessary modules:
import requests
import json
Define a function to query the Ollama model:
def query_ollama_model(model_name: str, prompt: str,
                       base_url: str = "http://localhost:11434") -> str:
    """
    Queries an Ollama model via its REST API.
    Args:
        model_name: Name of the Ollama model
        prompt: The prompt to send to the model
        base_url: Base URL of the Ollama server
    Returns:
        The model's response as a string
    """
    url = f"{base_url}/api/generate"
    payload = {
        "model": model_name,
        "prompt": prompt,
        "stream": False
    }
    # A timeout keeps the call from hanging forever if the server is unreachable;
    # adjust the value to suit your model and hardware
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    result = response.json()
    return result["response"]
Example usage:
response = query_ollama_model("my-finetuned-model", "What is a PLC?")
print(response)
This function provides a clean interface for querying your Ollama model from Python applications. Setting stream to False returns the complete response at once, while setting it to True enables streaming for real-time output.
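When streaming is enabled, the server sends a sequence of newline-delimited JSON objects, each carrying a fragment of text in its `response` field and a `done` flag on the final chunk. The sketch below shows one way to reassemble those fragments. The helper name `iter_stream_text` is hypothetical, and the function works on any iterable of lines so it can be tried without a running server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical helper: yields text fragments from streamed JSON lines.

    Each non-empty line is expected to be a JSON object with a
    "response" fragment and a "done" flag.
    """
    for raw_line in lines:
        if not raw_line:
            continue  # Skip blank lines between chunks.
        chunk = json.loads(raw_line)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # The final chunk marks the end of the stream.


# Simulated stream, standing in for a real HTTP response.
sample_lines = [
    b'{"response": "Hel", "done": false}',
    b'',
    b'{"response": "lo", "done": true}',
]
streamed_text = "".join(iter_stream_text(sample_lines))
```

With the requests library, the lines would typically come from `requests.post(url, json=payload, stream=True)` followed by `response.iter_lines()`, with `"stream": True` set in the payload.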
For MLX models on Apple Silicon, you can create a simple inference server. Import Flask and MLX libraries:
from flask import Flask, request, jsonify
from mlx_lm import load, generate
Initialize the Flask app:
app = Flask(__name__)
Load the model at startup:
model, tokenizer = load(
    "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    adapter_path="adapters.npz"
)
Define the API endpoint:
@app.route('/generate', methods=['POST'])
def generate_response():
    """
    API endpoint for generating model responses.
    Expects JSON payload with 'prompt' field.
    Returns JSON with 'response' field.
    """
    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True)
    if not isinstance(data, dict) or 'prompt' not in data:
        return jsonify({'error': 'Missing prompt field'}), 400
    prompt = data['prompt']
    max_tokens = data.get('max_tokens', 200)
    temperature = data.get('temperature', 0.7)
    response = generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=max_tokens,
        temp=temperature
    )
    return jsonify({'response': response})
Run the server:
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
This creates a simple Flask server that loads your fine-tuned model once at startup and then serves inference requests. The API accepts JSON payloads with the prompt and optional generation parameters.
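Because the endpoint passes client-supplied values straight to the model, it can help to validate them first. The following is a sketch of one possible validator, not part of Flask or MLX: the function name `validate_generation_request` and the limits (2048 tokens, temperature between 0.0 and 2.0) are assumptions chosen for illustration.

```python
from typing import Any, Dict, Optional, Tuple

# Illustrative limits only; choose values that suit your model and hardware.
MAX_TOKENS_LIMIT = 2048
MAX_TEMPERATURE = 2.0


def validate_generation_request(
    data: Any,
) -> Tuple[Optional[Dict], Optional[str]]:
    """Hypothetical helper: returns (params, None) for a valid payload,
    or (None, error_message) for an invalid one."""
    if not isinstance(data, dict):
        return None, "Request body must be a JSON object"
    prompt = data.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        return None, "Field 'prompt' must be a non-empty string"
    max_tokens = data.get("max_tokens", 200)
    # bool is a subclass of int in Python, so reject it explicitly.
    if (isinstance(max_tokens, bool) or not isinstance(max_tokens, int)
            or not 1 <= max_tokens <= MAX_TOKENS_LIMIT):
        return None, "Field 'max_tokens' must be an integer in the allowed range"
    temperature = data.get("temperature", 0.7)
    if (isinstance(temperature, bool)
            or not isinstance(temperature, (int, float))
            or not 0.0 <= temperature <= MAX_TEMPERATURE):
        return None, "Field 'temperature' must be a number in the allowed range"
    params = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": float(temperature),
    }
    return params, None
```

Inside the endpoint, a non-empty error message would map to a 400 response and the returned parameters would be passed on to the generation call.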
For production deployments, you should consider additional factors like request queuing, batching, caching, and monitoring. Here is a more robust inference wrapper that includes basic caching. Import additional modules:
import hashlib
Define the cached inference class:
class CachedModelInference:
    """
    Wrapper for model inference with response caching.
    """
    def __init__(self, model, tokenizer, cache_size=128):
        """
        Initializes the cached inference wrapper.
        Args:
            model: The language model
            tokenizer: The model's tokenizer
            cache_size: Maximum number of cached responses
        """
        self.model = model
        self.tokenizer = tokenizer
        self.cache = {}
        self.cache_size = cache_size
    def _hash_prompt(self, prompt: str, max_tokens: int, temp: float) -> str:
        """Creates a hash key for caching."""
        key_string = f"{prompt}_{max_tokens}_{temp}"
        return hashlib.md5(key_string.encode()).hexdigest()
    def generate(self, prompt: str, max_tokens: int = 200,
                 temp: float = 0.7) -> str:
        """
        Generates a response with caching.
        Args:
            prompt: The input prompt
            max_tokens: Maximum tokens to generate
            temp: Temperature for sampling
        Returns:
            Generated response string
        """
        cache_key = self._hash_prompt(prompt, max_tokens, temp)
        if cache_key in self.cache:
            return self.cache[cache_key]
        response = generate(
            self.model,
            self.tokenizer,
            prompt=prompt,
            max_tokens=max_tokens,
            temp=temp
        )
        if len(self.cache) >= self.cache_size:
            # Remove oldest entry (dicts preserve insertion order)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        self.cache[cache_key] = response
        return response
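This cached inference class stores responses for repeated prompts, which can significantly reduce latency for common queries. The cache size is configurable, and the implementation uses a simple FIFO eviction policy. Keep in mind that with a nonzero temperature the cache returns the first sampled response for a given prompt and parameter combination rather than a fresh sample, so caching is best suited to cases where repeated identical answers are acceptable.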
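If frequently repeated prompts should stay cached longer than rarely used ones, a least-recently-used (LRU) policy is a common alternative to FIFO. The sketch below swaps in that technique using `collections.OrderedDict`. The class name `LRUResponseCache` and the `generate_fn` parameter are hypothetical; `generate_fn` stands in for any function that maps a prompt to a response, which keeps the sketch independent of a specific model library.

```python
from collections import OrderedDict
from typing import Callable


class LRUResponseCache:
    """Hypothetical wrapper: caches responses with LRU eviction."""

    def __init__(self, generate_fn: Callable[[str], str],
                 cache_size: int = 128):
        self.generate_fn = generate_fn
        self.cache_size = cache_size
        self._entries = OrderedDict()

    def generate(self, prompt: str) -> str:
        if prompt in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(prompt)
            return self._entries[prompt]
        response = self.generate_fn(prompt)
        self._entries[prompt] = response
        if len(self._entries) > self.cache_size:
            # Evict the least recently used entry (the first one).
            self._entries.popitem(last=False)
        return response
```

The difference from FIFO only shows up when an old entry is requested again: FIFO would still evict it first, while LRU moves it to the back of the queue.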
MONITORING AND ITERATION
Fine-tuning is rarely a one-time process. You should establish monitoring to track how your model performs in real-world use and iterate based on feedback. Collect examples where the model fails or produces suboptimal responses, then use these to create additional training data for future fine-tuning iterations.
Here is a simple logging system for tracking model performance. Import necessary modules:
import datetime
import csv
Define the performance logger class:
class ModelPerformanceLogger:
    """
    Logs model interactions for performance monitoring.
    """
    def __init__(self, log_file: str = "model_interactions.csv"):
        """
        Initializes the performance logger.
        Args:
            log_file: Path to the CSV log file
        """
        self.log_file = log_file
        self._initialize_log_file()
    def _initialize_log_file(self):
        """Creates the log file with headers if it doesn't exist."""
        try:
            # Mode 'x' fails if the file already exists, so headers are written only once
            with open(self.log_file, 'x', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow([
                    'timestamp', 'prompt', 'response',
                    'user_rating', 'notes'
                ])
        except FileExistsError:
            pass
    def log_interaction(self, prompt: str, response: str,
                        user_rating: int = None, notes: str = ""):
        """
        Logs a model interaction.
        Args:
            prompt: The input prompt
            response: The model's response
            user_rating: Optional rating from 1-5
            notes: Optional notes about the interaction
        """
        timestamp = datetime.datetime.now().isoformat()
        with open(self.log_file, 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            # A user_rating of None is written as an empty field
            writer.writerow([timestamp, prompt, response, user_rating, notes])
    def get_low_rated_interactions(self, threshold: int = 3):
        """
        Retrieves interactions with low ratings for review.
        Args:
            threshold: Maximum rating to include
        Returns:
            List of low-rated interactions
        """
        low_rated = []
        # newline='' lets the csv module handle line breaks inside quoted responses
        with open(self.log_file, 'r', newline='', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                if row['user_rating'] and int(row['user_rating']) <= threshold:
                    low_rated.append(row)
        return low_rated
This logger tracks all interactions with your model, including optional user ratings. You can periodically review low-rated interactions to identify areas where the model needs improvement.
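Beyond reviewing individual low-rated interactions, a quick aggregate view can show whether ratings are trending in the right direction between fine-tuning iterations. The following sketch assumes rows shaped like those produced by `csv.DictReader` on the log file, with `user_rating` stored as a string that is empty when no rating was given. The function name `summarize_ratings` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_ratings(rows: Iterable[Dict[str, str]]) -> Dict:
    """Hypothetical helper: summarizes ratings from logged interactions."""
    counts = Counter()
    unrated = 0
    for row in rows:
        rating = (row.get("user_rating") or "").strip()
        if rating:
            counts[int(rating)] += 1
        else:
            unrated += 1  # Interactions logged without a rating.
    rated = sum(counts.values())
    # Average is None when no interaction has been rated yet.
    average = (
        sum(value * n for value, n in counts.items()) / rated
        if rated else None
    )
    return {
        "rated": rated,
        "unrated": unrated,
        "average": average,
        "counts": dict(counts),
    }
```

Comparing this summary before and after a new fine-tuning round gives a rough signal only; a small number of ratings can swing the average considerably.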
Based on logged interactions, you can create new training examples to address identified weaknesses:
def create_training_from_corrections(interaction_log: str,
                                     corrections_file: str,
                                     output_file: str):
    """
    Creates new training data from corrected model responses.
    Args:
        interaction_log: Path to the interaction log CSV
        corrections_file: Path to file with corrected responses
        output_file: Path for output training data JSONL
    """
    import csv
    import json
    # Map each corrected interaction's timestamp to its corrected response
    corrections = {}
    with open(corrections_file, 'r', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            corrections[row['timestamp']] = row['corrected_response']
    training_examples = []
    with open(interaction_log, 'r', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row['timestamp'] in corrections:
                example = {
                    "messages": [
                        {
                            "role": "system",
                            "content": "You are a helpful assistant."
                        },
                        {
                            "role": "user",
                            "content": row['prompt']
                        },
                        {
                            "role": "assistant",
                            "content": corrections[row['timestamp']]
                        }
                    ]
                }
                training_examples.append(example)
    # Write one JSON object per line (JSONL)
    with open(output_file, 'w', encoding='utf-8') as f:
        for example in training_examples:
            f.write(json.dumps(example, ensure_ascii=False) + '\n')
    print(f"Created {len(training_examples)} training examples from corrections")
This function takes logged interactions and a file of corrections, then generates new training data. This creates a feedback loop where real-world usage directly improves the model through subsequent fine-tuning iterations.
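The corrections file is expected to be a CSV with at least two columns, `timestamp` and `corrected_response`, where each timestamp matches a row in the interaction log. The sketch below illustrates that join on in-memory text so the file layout is visible. The helper name `match_corrections` and the sample rows are made up for illustration.

```python
import csv
import io
from typing import List, Tuple


def match_corrections(log_text: str,
                      corrections_text: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: pairs logged prompts with corrected responses,
    joined on the timestamp column."""
    corrections = {
        row["timestamp"]: row["corrected_response"]
        for row in csv.DictReader(io.StringIO(corrections_text, newline=""))
    }
    pairs = []
    for row in csv.DictReader(io.StringIO(log_text, newline="")):
        if row["timestamp"] in corrections:
            pairs.append((row["prompt"], corrections[row["timestamp"]]))
    return pairs


# Made-up sample data showing the expected column layout.
sample_log = (
    "timestamp,prompt,response,user_rating,notes\r\n"
    "2024-01-01T10:00:00,What is a PLC?,A PLC is a type of printer.,1,wrong\r\n"
    "2024-01-01T10:05:00,What is SCADA?,A supervisory control system.,5,\r\n"
)
sample_corrections = (
    "timestamp,corrected_response\r\n"
    "2024-01-01T10:00:00,A PLC is a programmable logic controller.\r\n"
)
matched = match_corrections(sample_log, sample_corrections)
```

Only the first logged interaction has a correction, so only that prompt is paired with a new answer; uncorrected interactions are left out of the new training data.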
CONCLUSION
Fine-tuning local LLMs has become accessible to individual developers and small teams through tools like Ollama and Apple MLX. The key to successful fine-tuning lies in understanding the fundamentals: preparing high-quality training data, choosing appropriate hyperparameters, and thoroughly evaluating results.
Start with a small, high-quality dataset and iterate based on results. Use parameter-efficient methods like LoRA to make fine-tuning practical on consumer hardware. Leverage quantization to work with larger models within your memory constraints. Monitor your deployed model and use real-world feedback to continuously improve through additional fine-tuning iterations.
The techniques covered in this tutorial provide a solid foundation for fine-tuning LLMs for your specific needs. Whether you are adapting a model to a specialized domain, teaching it a particular writing style, or improving its performance on specific tasks, these approaches will help you achieve your goals efficiently on local hardware.
Remember that fine-tuning is both a science and an art. While the technical steps are straightforward, achieving optimal results requires experimentation, careful evaluation, and iteration. Use the code examples and techniques presented here as starting points, and adapt them to your specific requirements and constraints.