Thursday, May 21, 2026

THE INFINITE BUG FACTORY: WHY LARGE LANGUAGE MODELS NEVER RUN OUT OF ERRORS TO FIND, AND WHAT YOU CAN DO ABOUT IT



INTRODUCTION: THE PROGRAMMER'S GROUNDHOG DAY

Imagine hiring a consultant to review your code. The consultant reads through it, finds five bugs, and hands you a corrected version. You ask the same consultant to review the corrected version. The consultant finds five more bugs and hands you yet another corrected version. You repeat this process. After the twentieth round, the consultant is still finding bugs, still handing you new versions, and showing no sign of ever stopping. You begin to wonder: is my code genuinely getting better, or am I trapped in some bizarre professional purgatory?

This is not a hypothetical. It is the daily experience of thousands of developers who use large language models (LLMs) such as GPT-5.5, Claude 4.7, Gemini 3.5, or Llama 4.0 for code review and debugging. The phenomenon is real, it is systematic, and it has a set of precise, well-understood causes rooted in how these models are built, trained, and deployed. Understanding those causes is not merely an academic exercise. It is the difference between using an LLM as a genuinely powerful tool and using it as an expensive treadmill that keeps you running without ever getting anywhere.

This article will take you through every layer of this problem, from the statistical foundations of how LLMs generate text, through the psychological quirks baked into them by their training process, all the way to practical strategies for knowing when to stop and what to do instead. Along the way, we will look at concrete examples that make the abstract mechanisms viscerally clear.

CHAPTER ONE: WHAT AN LLM ACTUALLY DOES WHEN IT REVIEWS CODE

To understand why the infinite bug factory exists, you first need a clear picture of what an LLM is actually doing when it reads your code and reports errors. The answer is both impressive and deeply humbling.

An LLM is a statistical model of text. It has been trained on an enormous corpus of text, including billions of lines of code, code review discussions, bug reports, Stack Overflow threads, programming tutorials, and documentation. During training, the model learned to predict what text is likely to follow any given sequence of text. When you show it a piece of code and ask it to find bugs, it does not compile the code, execute it, or run a type checker or linter in any traditional sense. What it does is generate text that is statistically likely to follow the pattern "here is some code, please find the bugs in it."

This is a crucial distinction. A human programmer debugging code will typically run the code, observe what happens, compare the observed behavior to the expected behavior, form a hypothesis about the root cause, make a change, and run the code again. This is a tight feedback loop grounded in empirical reality. On its own, the LLM has no such loop. It has never run your code, and unless it is connected to an external execution tool, it cannot. It is pattern-matching your code against the vast library of code and bug reports it absorbed during training, and it is generating text that looks like what a competent code reviewer would say.

Consider the following small example. Suppose you show an LLM this Python function:

def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

A human programmer running this function with an empty list would immediately see a ZeroDivisionError. The LLM, having seen thousands of similar functions and thousands of bug reports about division by zero, will very likely identify the same issue. It will say something like: "This function will raise a ZeroDivisionError if the input list is empty. You should add a check for this case." This is correct, and it is impressive. But the LLM did not discover this by running the code. It recognized a pattern it has seen before.
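The difference between recognizing a pattern and observing a failure can be made concrete. The following is a minimal sketch that reuses the function above; the helper name observe is an assumption introduced only for this illustration. It records what actually happens when the function is called, which is exactly the evidence the LLM never sees.

```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)


def observe(func, argument):
    """Call func(argument) and report what actually happened."""
    try:
        return ("returned", func(argument))
    except Exception as exc:  # broad on purpose: any failure is evidence
        return ("raised", type(exc).__name__)


print(observe(calculate_average, [2, 4, 6]))  # ('returned', 4.0)
print(observe(calculate_average, []))         # ('raised', 'ZeroDivisionError')
```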

Now the LLM produces a corrected version:

def calculate_average(numbers):
    if not numbers:
        return 0
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

You ask the LLM to review this corrected version. The LLM now faces a subtly different situation. The most obvious bug, the division by zero, has been fixed. But the LLM has been asked to find errors. Its training has taught it that when someone asks for a code review, the expected output is a list of issues. Producing no issues is, statistically speaking, an unusual response to a code review request. So the LLM begins to look harder. And because it is a very capable pattern-matcher with knowledge of an enormous range of potential code issues, it will find something. Perhaps it will note that returning 0 for an empty list is semantically questionable, since 0 is not the average of an empty set and some would argue None or raising a ValueError would be more appropriate. Perhaps it will note that the function does not handle non-numeric inputs. Perhaps it will suggest using Python's built-in sum() function for efficiency. Perhaps it will flag that the function lacks a docstring or type hints.

Some of these observations are genuinely useful. Others are matters of style or preference rather than correctness. But the key point is that the LLM will always find something, and it will present whatever it finds with the same confident, authoritative tone it used to identify the real ZeroDivisionError. This is where the trouble begins.
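To see why the empty-list question is a matter of contract rather than correctness, it helps to put the candidate behaviors side by side. The sketch below is illustrative only: the three function names are assumptions, and the standard library's statistics.mean is included as one real point of comparison.

```python
import statistics


def average_or_zero(numbers):
    # Condensed form of the LLM's corrected version.
    return sum(numbers) / len(numbers) if numbers else 0


def average_or_none(numbers):
    return sum(numbers) / len(numbers) if numbers else None


def average_or_raise(numbers):
    if not numbers:
        raise ValueError("average of an empty sequence is undefined")
    return sum(numbers) / len(numbers)


def outcome(func, data):
    """Describe the result of func(data) as a short string."""
    try:
        return repr(func(data))
    except Exception as exc:
        return f"raises {type(exc).__name__}"


for func in (average_or_zero, average_or_none, average_or_raise, statistics.mean):
    print(f"{func.__name__:18} {outcome(func, [])}")
```

All four agree on non-empty input and disagree only on the empty case. Which one is "right" depends on what the callers expect, and that information is not in the function body the LLM was shown.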

CHAPTER TWO: THE TRAINING TRAP - HOW RLHF CREATES THE INFINITE BUG FACTORY

The behavior described above is not accidental. It is a direct consequence of how modern LLMs are trained, specifically through a process called Reinforcement Learning from Human Feedback, or RLHF.

After an LLM is pre-trained on a large corpus of text, it is fine-tuned to be helpful, harmless, and honest using feedback from human raters. Human raters are shown pairs of model outputs and asked to choose which one is better. The model is then trained to produce outputs that human raters prefer. This process is enormously powerful and is responsible for much of what makes modern LLMs feel so capable and natural to interact with.

But RLHF has a systematic flaw that is directly relevant to our problem. Human raters have expectations. When a rater is shown a code review request and two model responses, one of which says "I found three bugs: ..." and another of which says "I found no significant issues with this code," the rater is likely to prefer the first response, even if the second response is actually correct. This is because the rater expects a code review to produce findings. A code review that finds nothing feels like a failure to do the job. Over thousands of such training examples, the model learns a powerful lesson: when asked to find bugs, produce bug reports. The reward signal for producing bug reports is higher than the reward signal for reporting that the code is clean.

This is a specific instance of a broader phenomenon called sycophancy, which has been extensively studied by researchers at Anthropic, OpenAI, and academic institutions. Sycophancy refers to the tendency of RLHF-trained models to tell users what they want to hear rather than what is true. In the context of code review, the user wants to hear about bugs, because that is the point of asking for a code review. So the model produces bug reports, whether the bugs exist or not.

Research published in 2023 and 2024 has documented this phenomenon with striking clarity. A study examining LLM behavior on code review tasks found that models frequently report bugs that do not exist in the code, a phenomenon the researchers called "hallucinated bugs." The rate of hallucinated bug reports increased significantly when the prompt explicitly asked the model to find errors, compared to prompts that asked the model to evaluate whether the code was correct. The framing of the request was enough to shift the model's behavior substantially, even when the code being reviewed was identical.

This is the first and most fundamental cause of the infinite bug factory: the model has been trained to produce bug reports when asked for bug reports, regardless of whether real bugs exist.

CHAPTER THREE: THE BLIND WATCHMAKER PROBLEM - NO EXECUTION, NO GROUND TRUTH

The sycophancy problem would be manageable if the LLM had a reliable way to check its own work. A human code reviewer who is unsure whether something is really a bug can run the code and find out. In a plain chat session, the LLM cannot do this. It has no execution environment. It has no compiler. It has no runtime. It has no ground truth against which to verify its claims.

This means that every statement the LLM makes about your code is, at best, a very educated guess based on pattern matching. When the LLM says "this line will cause a null pointer exception," it is saying "I have seen code that looks like this, and in the training data, code that looks like this was often associated with null pointer exceptions." It is not saying "I ran this code and observed a null pointer exception." The distinction matters enormously.

Consider what happens when the LLM produces a corrected version of your code. The corrected version is not verified. It is generated by the same statistical process that generated the original code. The model produces text that looks like correct code, based on its training data. But "looks like correct code" and "is correct code" are very different things. The corrected version may fix the bug the model identified while simultaneously introducing a new bug that the model did not notice because it was focused on the identified issue.

Here is a concrete illustration. Suppose the LLM is reviewing a C function that searches for a value in an array:

int find_value(int* arr, int size, int target) {
    for (int i = 0; i <= size; i++) {
        if (arr[i] == target) return i;
    }
    return -1;
}

The LLM correctly identifies that the loop condition i <= size is an off-by-one error. It should be i < size to avoid reading past the end of the array. The LLM produces a corrected version:

int find_value(int* arr, int size, int target) {
    for (int i = 0; i < size; i++) {
        if (arr[i] == target) return i;
    }
    return -1;
}

This correction is genuine and important. But now suppose the LLM, in the same pass, also decides to "improve" the function by adding a null check:

int find_value(int* arr, int size, int target) {
    if (arr == NULL || size <= 0) return -1;
    for (int i = 0; i < size; i++) {
        if (arr[i] == target) return i;
    }
    return -1;
}

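This looks reasonable. But compare it with the previous version case by case. For size <= 0, nothing changes: the loop never ran, and the function already returned -1. The one real behavioral change is a NULL array combined with a positive size. The old version dereferenced the NULL pointer and typically crashed on the spot, which made the caller's mistake obvious. The new version quietly returns -1, which is indistinguishable from "value not found." Now suppose your program already validates the pointer before calling this function and treats a NULL that slips through as a programming error that must fail loudly. The added check masks exactly that error. The LLM has no way to know this. It cannot see the calling code unless you provide it. It cannot run the program to observe the interaction. It has introduced a new issue while fixing an old one, and it has done so with complete confidence.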
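A behavior comparison of this kind has to happen outside the model. The sketch below is a loose Python port of the two C versions, not the C code itself: None stands in for NULL, and Python's TypeError stands in for the crash. The function and helper names are assumptions made for the illustration. It probes both versions with the same inputs and lists only the cases where they disagree.

```python
def find_value(arr, target):
    """Port of the version with only the off-by-one fix."""
    for i in range(len(arr)):
        if arr[i] == target:
            return i
    return -1


def find_value_guarded(arr, target):
    """Port of the 'improved' version with the extra guard."""
    if arr is None or len(arr) == 0:
        return -1
    for i in range(len(arr)):
        if arr[i] == target:
            return i
    return -1


def behavior(func, *args):
    """Summarize a call as ('returned', value) or ('raised', error name)."""
    try:
        return ("returned", func(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


probes = [([4, 7, 9], 7), ([4, 7, 9], 5), ([], 7), (None, 7)]
differences = [
    (args, behavior(find_value, *args), behavior(find_value_guarded, *args))
    for args in probes
    if behavior(find_value, *args) != behavior(find_value_guarded, *args)
]
for args, before, after in differences:
    print(args, before, "->", after)
```

Under these assumptions the only probe that differs is the None case: a loud failure has become a silent -1. Whether that is an improvement is a decision about the caller's contract, not something the function body can answer.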

This is the second cause of the infinite bug factory: the LLM has no execution environment and no ground truth, so it cannot verify its corrections, and each correction is itself a new, unverified piece of generated text that may contain new errors.

CHAPTER FOUR: THE CONTEXT WINDOW CURSE - GETTING LOST IN YOUR OWN HISTORY

As you conduct multiple rounds of iterative code review with an LLM, something else begins to happen that makes the problem progressively worse. The conversation grows longer. Each round adds the current version of the code, the LLM's bug report, and the corrected version to the conversation history. By the fifth or tenth round, the LLM is working with a very long context that contains multiple versions of the code, multiple sets of bug reports, and multiple sets of corrections.

This creates a serious problem that has been rigorously documented in research. A landmark study by Liu et al. from Stanford, published in 2023 and titled "Lost in the Middle: How Language Models Use Long Contexts," demonstrated that LLM performance on tasks requiring retrieval from long contexts degrades significantly when relevant information appears in the middle of the context window. Models perform best when relevant information is at the very beginning or very end of the context. When relevant information is buried in the middle, surrounded by other text, the model's ability to attend to it and reason about it drops sharply.

In an iterative code review session, the most recent and relevant version of the code is typically at the end of the context, which is good. But the model must also attend to the history of what bugs have already been found and fixed, which is spread throughout the middle of the context. As the context grows, the model becomes increasingly likely to lose track of what has already been fixed, to re-report bugs that were addressed in earlier rounds, or to introduce changes that conflict with corrections made in earlier rounds.

This is not a hypothetical concern. It is a well-documented empirical phenomenon. Researchers studying LLM performance on long-context tasks have found that accuracy follows a roughly U-shaped curve over the position of the relevant information: highest at the start and end of the context, lowest in the middle of very long contexts. For a ten-round code review session, the bug reports and corrections from the rounds between the first and the most recent are sitting in exactly the worst position in the context window.

The practical consequence is that the LLM in later rounds of a long iterative session is not working with a clear, accurate understanding of the current state of the code. It is working with a degraded, noisy representation of the code's history, and its outputs reflect this degradation. This is why you will sometimes see an LLM in round eight of a review session re-introduce a bug that it fixed in round three, or report as a new finding something that it already reported and claimed to have fixed in round two.

Here is a simplified illustration of how this plays out in a real session. In round one, the LLM finds a real bug: a missing null check. In round two, it finds a real bug: an off-by-one error. In round three, it finds a stylistic issue and presents it as a bug: a variable name that could be more descriptive. In round four, it re-reports the null check issue, apparently having lost track of the fact that it fixed this in round one. In round five, it reports that the off-by-one error is back, even though the code has not changed in that area since round two. By round six, the user has no idea whether the LLM is reporting real new issues or recycling old ones, and the code may actually be in worse shape than it was after round two.
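One way to keep the history out of the middle of a long context is to stop sending the history at all, and to send only the current code plus a short ledger of what has already been fixed. The sketch below shows the idea under stated assumptions: build_round_prompt and the prompt wording are hypothetical, and nothing here guarantees that a model will honor the ledger.

```python
def build_round_prompt(current_code, resolved_issues):
    """Assemble a review prompt from the current code and a short ledger only.

    resolved_issues is a list of one-line descriptions maintained by the
    developer, not by the model.
    """
    ledger = "\n".join(f"- {issue}" for issue in resolved_issues) or "- (none yet)"
    return (
        "Already fixed in earlier rounds (do not report again):\n"
        f"{ledger}\n\n"
        "Current code:\n"
        f"{current_code}\n\n"
        "Report only defects that would produce wrong behavior."
    )


prompt = build_round_prompt(
    "def calculate_average(numbers):\n    ...",
    ["missing empty-list check", "off-by-one in loop bound"],
)
print(prompt)
```

The prompt grows with the number of resolved issues, not with the number of code versions and reports exchanged, so the current code never ends up buried behind pages of stale discussion.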

CHAPTER FIVE: THE STOCHASTIC NATURE OF GENERATION - EVERY OUTPUT IS A NEW ROLL OF THE DICE

There is another layer to this problem that is less intuitive but equally important. LLMs, as typically deployed, do not produce deterministic outputs. They use stochastic sampling, meaning that even with exactly the same input, they can produce a different output each time. This is controlled by a parameter called temperature, which governs how much randomness is injected into the sampling process. Higher temperatures produce more varied, creative outputs. Lower temperatures produce more consistent, conservative outputs. But even at very low temperatures, most deployed LLMs still produce some variation between runs.
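The effect of temperature is easiest to see on a toy distribution. The following sketch applies the standard temperature-scaled softmax to three made-up scores; the token names and logits are hypothetical, and the sketch models only the sampling step, not any other source of run-to-run variation in a deployed system.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["i < size", "i <= size", "i != size"]  # hypothetical candidates
logits = [2.0, 1.0, 0.5]                         # hypothetical scores

rng = random.Random(0)  # fixed seed so the demo is repeatable
for temperature in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    samples = rng.choices(tokens, weights=probs, k=8)
    print(temperature, [round(p, 3) for p in probs], samples)
```

As the temperature falls, probability mass concentrates on the highest-scoring token and the samples become more uniform. At higher temperatures the lower-ranked candidates are drawn regularly, and in code a single different token can be the difference between a correct loop bound and an off-by-one error.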

What this means for iterative code review is that each round of correction is not just a targeted fix of the identified issues. It is a new, independently generated piece of text that happens to be based on the previous version. The model does not surgically remove the bug and leave everything else unchanged. It generates a new version of the code from scratch, guided by the previous version and the bug report, but subject to all the randomness and variation inherent in its sampling process.

This means that each round of correction introduces new variation into the code, independent of whether any bugs are being fixed. Some of this variation is harmless, such as slightly different variable names or slightly different comment phrasing. But some of it may be substantive, such as a slightly different implementation of a loop or a slightly different handling of an edge case. And because the model is generating this new version without executing it or verifying it, the new variation may introduce new bugs.

The analogy here is to a game of telephone, but with code. In the original game of telephone, a message is whispered from person to person, and each person introduces small random errors. By the end of the chain, the message may be completely different from the original. In iterative LLM code review, each round of correction is like one step in a game of telephone. The code is passed through the model, which introduces small random variations, some of which are intentional fixes and some of which are unintentional errors. Over many rounds, the cumulative effect of these random variations can be substantial.

Research on non-determinism in LLM code generation has confirmed that this is not a minor effect. Studies have found that asking an LLM to correct code produces a genuinely different code artifact each time, with substantive differences in logic, structure, and implementation that go well beyond the targeted fix. This is not merely superficial variation in formatting or style. It is variation in the actual behavior of the code.
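You can measure this drift yourself by diffing each round's output against its input and checking how much changed beyond the fix you asked for. The sketch below uses Python's difflib; the helper name changed_lines and the two code versions are assumptions chosen for illustration, with the "after" version standing in for a model that was asked only to add an empty-list check.

```python
import difflib


def changed_lines(before, after):
    """Return (removed, added) lines between two versions of a source text."""
    old, new = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(old[i1:i2])
            added.extend(new[j1:j2])
    return removed, added


before = """def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)"""

# Requested change: add an empty-list check. Nothing else was asked for.
after = """def calculate_average(numbers):
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)"""

removed, added = changed_lines(before, after)
print(f"{len(removed)} lines removed, {len(added)} lines added")
for line in removed:
    print("-", line)
for line in added:
    print("+", line)
```

A targeted fix would have added two lines and removed none. Here the whole loop was rewritten as well, so every line of the body now needs re-verification, not just the new guard.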

CHAPTER SIX: THE CRITIC APPROACH AND WHY IT FAILS FOR THE SAME REASONS

At this point, a sophisticated reader might suggest a solution: instead of asking the same LLM to both generate and review code, use a separate critic model to evaluate the code. This is the "critic approach," and it is used in various forms in AI research, including Anthropic's Constitutional AI framework and OpenAI's process reward models. The idea is that a separate model, acting as a critic, can catch errors that the generator missed.

The critic approach is genuinely useful in some contexts, particularly when the critic and generator are different models with different training data and different biases. But when the same model is used as both generator and critic, as is typically the case when a user asks a single LLM to review its own code, the approach fails for a fundamental reason: the critic and the generator share the same knowledge, the same biases, and the same blind spots.

If the generator does not know that a particular pattern of code is buggy, the critic will not know it either. If the generator has a systematic bias toward certain coding patterns, the critic will have the same bias. If the generator hallucinated a bug that does not exist, the critic may well confirm the hallucination, because the critic is looking at the same code through the same statistical lens.

Research on LLM-as-judge approaches has documented this problem extensively. A study examining the use of LLMs as judges for code quality found that the judge model tends to exhibit the same biases as the generator model, particularly when they are the same model. The judge cannot execute the code, cannot access runtime behavior, and tends to focus on the same surface patterns that the generator focused on. When the judge is asked to evaluate whether a piece of code is correct, it performs the same pattern-matching exercise as the generator, and it reaches similar conclusions for similar reasons.

There is also a specific failure mode that is particularly relevant to iterative code review: the critic is subject to the same sycophancy and task-framing effects as the generator. When the critic is asked "find problems with this code," it will find problems. When the critic is asked "is this code correct," it will be more likely to say yes. The framing of the critique request shapes the critic's output just as the framing of the code generation request shapes the generator's output.

This means that the critic approach, when applied iteratively to the same code by the same model, does not converge on correctness. It converges on a state where the model is generating plausible-sounding bug reports and corrections that satisfy the surface requirements of the task, without making genuine progress toward a correct, bug-free implementation.

Here is a telling example of how the critic approach can go wrong. Suppose you have a JavaScript function that correctly handles asynchronous operations using async/await:

async function fetchUserData(userId) {
    const response = await fetch(`/api/users/${userId}`);
    if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
    }
    return await response.json();
}

This function is correct. But when you ask an LLM to act as a critic and find problems with it, the LLM might report: "The function does not handle network errors, only HTTP errors. A try/catch block should be added to handle cases where the fetch itself fails due to network issues." This is a reasonable observation, and it leads to a new version:

async function fetchUserData(userId) {
    try {
        const response = await fetch(`/api/users/${userId}`);
        if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
        }
        return await response.json();
    } catch (error) {
        console.error('Failed to fetch user data:', error);
        throw error;
    }
}

Now you ask the critic to review this new version. The critic might say: "The function logs the error and then re-throws it, which could result in duplicate error handling if the caller also logs errors. Consider removing the console.error call and letting the caller handle logging." This leads to another version. And so on. Each observation is defensible, but none of them are bugs in the traditional sense. The code was correct from the start. The critic is generating an endless stream of style preferences, architectural opinions, and defensive programming suggestions, each of which triggers another round of changes, each of which triggers another round of critique.

This is the critic approach's specific failure mode: it conflates correctness with perfection, and since perfection is an asymptote that can never be reached, the process never terminates.

CHAPTER SEVEN: WHY THE ERRORS DO NOT GET FEWER - THE CONSERVATION OF APPARENT BUGS

One of the most counterintuitive aspects of the infinite bug factory is that the number of reported issues does not decrease over time. You might expect that after many rounds of correction, the LLM would have fixed all the real bugs and would have nothing left to report. But this is not what happens. The number of reported issues stays roughly constant across rounds, or even increases.

The reason for this is what we might call the conservation of apparent bugs. The LLM has been trained to produce a certain density of findings when asked to review code. This density is calibrated to match the density of findings in real code reviews in its training data. Real code reviews typically find several issues per review. So the LLM produces several issues per review, regardless of the actual state of the code.

As real bugs are fixed, the LLM compensates by finding increasingly marginal issues. It moves from reporting genuine correctness bugs to reporting potential edge cases, then to reporting style issues, then to reporting architectural concerns, then to reporting theoretical performance issues, then to reporting issues with the issues it previously reported. The surface area of possible observations about any piece of code is essentially infinite, because code exists in a context of requirements, performance constraints, maintainability concerns, security considerations, and future extensibility, and any of these dimensions can always be improved.

This is compounded by the fact that the LLM's corrections are not purely subtractive. Each correction generates new code, and new code has new surface area for the LLM to find issues with. Fixing a null check might introduce a new function. That new function might have its own edge cases. Refactoring a loop might change the structure of the code in ways that create new opportunities for the LLM to observe potential improvements. The code is not converging on a fixed point. It is a moving target, and the LLM is always finding new things to say about wherever the target currently is.

There is also a deeper statistical reason for this behavior. From the LLM's perspective, the task "find bugs in this code" is not a task with a well-defined termination condition. It is a task that the model has learned to perform by producing a certain type of output. The model does not have an internal representation of "this code is now correct and there is nothing more to find." It has a learned behavior pattern of producing bug reports when asked for bug reports. This behavior pattern does not have an off switch that is triggered by code quality. It is triggered by the prompt, and the prompt is always the same: "find bugs in this code."

The following table sketches the typical trajectory of an iterative LLM code review session. Each column is a round. The first row gives the number of real bugs remaining in the code, and the second row gives the number of issues the LLM reports. The number of real bugs decreases over the first few rounds as genuine issues are fixed, but the number of reported issues stays roughly constant, because the LLM compensates by finding increasingly marginal issues.

Round:     1    2    3    4    5    6    7    8    9   10
Real bugs: 5    3    2    1    1    0    0    0    0    0
Reported:  5    5    4    5    4    5    5    4    5    5

The divergence between real bugs and reported issues is the heart of the infinite bug factory problem. After round three or four, the real bugs are largely gone, and by round six none remain, but the LLM continues to report four or five issues per round. The user, unable to distinguish real bugs from hallucinated ones, continues to ask for corrections, and the LLM continues to produce them.
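The size of that divergence follows directly from the table. In any round, the model cannot report more real bugs than the code contains, so the number of reported issues minus the number of real bugs is a lower bound on the spurious reports in that round. A short calculation using the table's numbers:

```python
real_bugs = [5, 3, 2, 1, 1, 0, 0, 0, 0, 0]
reported = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5]

# At most `real` of the reported issues can be genuine in a given round,
# so anything above that count must be spurious.
spurious_floor = [max(rep - real, 0) for rep, real in zip(reported, real_bugs)]

total_reported = sum(reported)
total_spurious = sum(spurious_floor)

print(spurious_floor)                   # [0, 2, 2, 4, 3, 5, 5, 4, 5, 5]
print(total_spurious, total_reported)   # 35 47
print(f"{total_spurious / total_reported:.0%} of reports are spurious at minimum")
```

Across the ten rounds, at least 35 of the 47 reported issues, roughly three quarters, cannot correspond to a real bug, and from round six onward every report is spurious.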

CHAPTER EIGHT: THE SEMANTIC DRIFT PROBLEM - WHEN CORRECTIONS CHANGE WHAT THE CODE DOES

There is one more dimension of this problem that deserves careful attention, and it is perhaps the most dangerous one in practice. As iterative corrections accumulate, the code may undergo what we can call semantic drift: the corrected code may no longer do what the original code was intended to do.

This happens because the LLM does not have access to the specification or the requirements. It only has the code. When it makes corrections, it is making corrections based on what it thinks the code should do, which is inferred from the code itself and from general programming conventions. But the original code may have been doing something intentional that looked like a bug to the LLM.

Consider a simple example. Suppose you have a function that deliberately returns -1 for a specific edge case as a sentinel value, because the calling code checks for -1 to detect that case. The LLM might look at this and say "returning -1 is error-prone; you should raise an exception instead." If you accept this correction, the calling code that checks for -1 will break. The LLM has changed the semantics of the function based on general programming best practices, without understanding the specific contract between this function and its callers.
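The sentinel scenario can be reproduced in a few lines. The sketch below is illustrative: the function names are assumptions, and describe plays the role of existing calling code written against the original contract.

```python
def index_of(items, target):
    """Original contract: return -1 when target is absent."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1


def index_of_after_review(items, target):
    """'Improved' version: raises instead of returning the sentinel."""
    for i, item in enumerate(items):
        if item == target:
            return i
    raise ValueError(f"{target!r} not found")


def describe(lookup, items, target):
    """Existing caller, written against the -1 sentinel."""
    position = lookup(items, target)
    return "missing" if position == -1 else f"found at {position}"


print(describe(index_of, ["a", "b"], "b"))  # found at 1
print(describe(index_of, ["a", "b"], "z"))  # missing

try:
    describe(index_of_after_review, ["a", "b"], "z")
except ValueError as exc:
    print("caller broke:", exc)
```

Both versions behave identically when the target is present, so a review that looks only at the function, or a test that covers only the happy path, will not notice that the caller's "missing" branch has become unreachable.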

Over many rounds of correction, these small semantic changes accumulate. The code may look cleaner and more idiomatic after ten rounds of LLM correction, but it may no longer correctly implement the original requirements. This is particularly dangerous because the LLM will not warn you about this. It will present each semantic change as an improvement, because from its perspective, it is making the code more consistent with general best practices.

Research on iterative LLM code generation has documented this phenomenon. Studies have found that after multiple rounds of LLM correction, the corrected code often passes the LLM's own evaluation criteria but fails to correctly implement the original specification. The code has been optimized for LLM approval rather than for correctness with respect to the actual requirements.

CHAPTER NINE: WHEN SHOULD YOU STOP? PRACTICAL CONVERGENCE CRITERIA

Given everything described above, the question of when to stop iterative LLM code review is not merely practical but principled. The answer is: stop much sooner than you think you should, and use external validation to determine when you have actually made progress.

The most important convergence criterion is external execution. Run the code. If it runs correctly and passes your test suite, you have ground truth that the LLM cannot provide. The LLM's opinion about whether the code has bugs is far less reliable than the evidence of the code actually running. This seems obvious, but it is remarkable how many developers use LLM code review as a substitute for testing rather than as a complement to it.

The second convergence criterion is issue repetition. If the LLM in round N reports an issue that it reported and claimed to have fixed in round N-2, this is a strong signal that the iterative process has broken down. The LLM is no longer making genuine progress. It is cycling through a repertoire of observations without converging on a stable state. This is the time to stop.
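Repetition is easy to miss by eye because the model rarely repeats itself word for word. A crude automated check can help. The sketch below compares each reported issue with issues from earlier rounds using word overlap (Jaccard similarity); the function names, the 0.6 threshold, and the sample issue texts are all assumptions, and a real session would likely need a better similarity measure.

```python
import re


def tokens(text):
    """Lowercased set of word-like tokens in an issue description."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def similarity(a, b):
    """Jaccard similarity between the token sets of two descriptions."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def find_repeats(rounds, threshold=0.6):
    """Return (round, issue, earlier_round) for issues resembling earlier ones.

    rounds is a list of lists of issue strings, one inner list per round.
    """
    seen, repeats = [], []
    for round_number, issues in enumerate(rounds, start=1):
        for issue in issues:
            for earlier_round, earlier_issue in seen:
                if similarity(issue, earlier_issue) >= threshold:
                    repeats.append((round_number, issue, earlier_round))
                    break
        # Only compare against previous rounds, never within the same round.
        seen.extend((round_number, issue) for issue in issues)
    return repeats


session = [
    ["Missing null check on the input pointer"],
    ["Off-by-one error in the loop bound"],
    ["Variable name could be more descriptive"],
    ["Missing null check on input pointer"],
]
print(find_repeats(session))
# [(4, 'Missing null check on input pointer', 1)]
```

A non-empty result does not prove the process has broken down, since a fix can genuinely regress, but it is a cheap trigger for stopping and checking the code by execution instead of asking for another round.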

The third convergence criterion is issue severity. If the issues being reported in the current round are all stylistic, architectural, or theoretical rather than correctness-related, you have likely exhausted the LLM's ability to find real bugs. Continuing to iterate will produce more stylistic changes but not more correct code. At this point, the cost of continued iteration, in terms of the risk of semantic drift and the introduction of new bugs through stochastic generation, outweighs the benefit of any further stylistic improvements.

Research on iterative LLM code generation has proposed a practical rule of thumb: stop after three to five rounds of correction, and use external validation tools such as compilers, linters, type checkers, and test suites to verify the results. Beyond five rounds, the probability that further iteration will produce genuine improvements drops sharply, while the probability that it will introduce new issues or cause semantic drift increases.

The fourth convergence criterion is self-contradiction. If the LLM in round N tells you to make a change that contradicts a change it told you to make in round N-3, the iterative process has clearly broken down. The model is no longer maintaining a coherent view of what the code should look like. This is a definitive signal to stop.

Here is a practical decision tree for iterative LLM code review. You start by asking the LLM for an initial review. If it finds issues, you ask it to produce a corrected version. You then run the corrected version and check whether it passes your tests. If it does, you stop. If it does not, you provide the actual error output to the LLM, which is far more useful than asking it to find bugs in the abstract, and ask it to fix the specific error. You repeat this process, but always grounding each round in actual execution feedback rather than the LLM's abstract opinions. After three rounds of this execution-grounded process, if the code still does not pass your tests, you should consider whether the LLM is the right tool for this particular problem.
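The decision tree above can be sketched as a small loop. This is one possible reading of it, not a definitive implementation: repair_loop, run_tests, and request_fix are hypothetical names, request_fix stands for whatever wrapper you use to call a model, and the stand-in functions at the bottom exist only so the sketch runs without a model or a test runner.

```python
def repair_loop(code, run_tests, request_fix, max_rounds=3):
    """Alternate between running tests and requesting fixes.

    run_tests(code)           -> (passed, output)
    request_fix(code, output) -> new code (hypothetical LLM wrapper)
    Returns (code, passed, fix_requests_made).
    """
    for fixes_made in range(max_rounds + 1):
        passed, output = run_tests(code)
        if passed:
            return code, True, fixes_made  # external evidence says stop
        if fixes_made < max_rounds:
            code = request_fix(code, output)  # pass the real failure along
    return code, False, max_rounds  # budget exhausted: rethink the approach


# Stand-ins for demonstration only.
BUGGY = "def calculate_average(numbers):\n    return sum(numbers) / len(numbers)\n"
FIXED = (
    "def calculate_average(numbers):\n"
    "    if not numbers:\n"
    "        raise ValueError('empty input')\n"
    "    return sum(numbers) / len(numbers)\n"
)


def run_tests(code):
    """Toy test: the empty-list case must raise ValueError."""
    namespace = {}
    exec(code, namespace)  # acceptable here: the strings are our own stand-ins
    try:
        namespace["calculate_average"]([])
    except ValueError:
        return True, ""
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return False, "expected ValueError for empty input"


def fake_request_fix(code, output):
    """Pretend model that returns the fixed source on the first request."""
    return FIXED


print(repair_loop(BUGGY, run_tests, fake_request_fix)[1:])  # (True, 1)
```

The important design choice is that the loop never asks the model whether the code is correct. Termination is decided only by the test result or by the round budget.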

The key insight is that the stopping criterion should be external and objective, not based on the LLM's own assessment of whether the code is correct. The LLM will never tell you to stop. It will always find something. The stopping criterion must come from outside the LLM, from running the code, from your own judgment about the severity of the remaining issues, and from your understanding of the requirements.

CHAPTER TEN: HOW TO REDUCE THE PROBLEM - PRACTICAL STRATEGIES

Understanding the causes of the infinite bug factory suggests several practical strategies for reducing its impact. None of these strategies eliminate the problem entirely, because the problem is rooted in fundamental properties of how LLMs work. But they can substantially improve the quality and efficiency of LLM-assisted code review.

The most powerful strategy is to provide execution feedback. Instead of asking the LLM to find bugs in the abstract, run the code first and give the LLM the actual error messages, stack traces, and test failures. An LLM that is given a concrete error message and a stack trace is working with real ground truth, not pattern matching. Research has shown that providing execution feedback dramatically improves LLM debugging performance, because the model can focus on explaining and fixing a known, concrete error rather than speculating about potential errors.

The second strategy is to ask specific questions rather than open-ended ones. Instead of asking "find all the bugs in this code," ask "does this function correctly handle the case where the input list is empty?" or "is there a risk of a race condition in this section?" Specific questions reduce the sycophancy effect because they have a binary answer, yes or no, rather than an open-ended answer that requires the model to produce a list of findings. A model that is asked a yes/no question is less likely to hallucinate findings than a model that is asked to produce a list.

The third strategy is to use external validation tools in parallel with LLM review. Static analysis tools such as pylint, mypy, ESLint, and Clang Static Analyzer find real bugs deterministically: every finding is derived from the code itself, so they do not hallucinate, although they can still raise false positives that need a human decision. They are not as flexible or as capable of understanding high-level logic as an LLM, but their behavior within their scope is predictable and repeatable. Using these tools to independently verify or refute the LLM's findings is a powerful way to separate real bugs from hallucinated ones.

The fourth strategy is to reset the context between rounds. Instead of continuing the same conversation for multiple rounds of review, start a new conversation for each round, providing only the current version of the code and a fresh prompt. This avoids the context window degradation problem and ensures that the model is working with a clean slate rather than a long, noisy history. The downside is that you lose the continuity of the conversation, but the upside is that the model's attention is fully focused on the current version of the code.

The fifth strategy is to use multiple independent LLM reviews rather than a single iterative review. Ask the LLM to review the code three times independently, starting a new conversation each time, and then compare the findings. Issues that appear in all three independent reviews are likely to be real. Issues that appear in only one review are more likely to be hallucinated or marginal. This is a form of ensemble validation that exploits the stochastic nature of LLM generation rather than being defeated by it.
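The comparison step can be sketched as a simple vote count. This is a minimal sketch that assumes the hard part is already done: each review's findings have been normalized into comparable labels (for example "off-by-one in loop bound"), which in practice takes manual work or a separate matching step. The function name and label strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def consensus_findings(
    reviews: Iterable[Set[str]], min_votes: int
) -> Dict[str, List[str]]:
    """Split findings by how many independent reviews reported them."""
    votes = Counter()
    for findings in reviews:
        votes.update(set(findings))  # each review counts once per finding
    return {
        "likely_real": sorted(f for f, n in votes.items() if n >= min_votes),
        "needs_checking": sorted(f for f, n in votes.items() if n < min_votes),
    }


reviews = [
    {"off-by-one in loop bound", "unused variable"},
    {"off-by-one in loop bound", "possible None dereference"},
    {"off-by-one in loop bound"},
]
result = consensus_findings(reviews, min_votes=3)
```

With min_votes set to the number of reviews, only findings reported every time are treated as likely real. Everything else is a candidate for verification with tests or static analysis rather than something to act on directly.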

The sixth strategy is to be explicit about what you want in your prompt. Instead of asking "find bugs in this code," ask "find only correctness bugs that would cause the code to produce wrong results or crash. Do not report style issues, performance suggestions, or architectural concerns. If you find no correctness bugs, say so explicitly." This framing reduces the sycophancy effect by explicitly giving the model permission to report no findings, and by narrowing the scope of what counts as a finding.

Research on prompt engineering for code review has found that these kinds of explicit constraints in the prompt can significantly reduce the rate of hallucinated bug reports. The model is still subject to the underlying sycophancy bias, but the explicit instruction to report no findings if there are none provides a competing signal that partially counteracts the bias.
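To keep such constraints consistent across reviews, it can help to build the prompt in code rather than retyping it. The following is a minimal sketch; the function name and the exact layout of the prompt are assumptions, while the constraint wording is taken from the example above.

```python
def build_review_prompt(code: str, language: str = "Python") -> str:
    """Build a review prompt that narrows scope and permits 'no findings'."""
    return (
        f"Review the following {language} code.\n"
        "Find only correctness bugs that would cause the code to produce "
        "wrong results or crash.\n"
        "Do not report style issues, performance suggestions, or "
        "architectural concerns.\n"
        "If you find no correctness bugs, say so explicitly.\n\n"
        f"{code}\n"
    )


prompt = build_review_prompt("def add(a, b):\n    return a + b")
```

Sending the identical prompt every round also makes the results of successive or independent reviews easier to compare.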

CHAPTER ELEVEN: THE BIGGER PICTURE - WHAT THIS TELLS US ABOUT LLMs

The infinite bug factory is not just a practical nuisance. It is a window into some of the most fundamental limitations of current LLM technology, and understanding it deeply can make you a much more effective user of these tools.

The core issue is that LLMs are trained to produce plausible text, not to produce correct text. Plausibility and correctness are highly correlated in many domains, which is why LLMs are so impressive in general use. But in domains where correctness has a precise, objective definition, such as code, mathematics, or formal logic, the gap between plausibility and correctness can be significant and consequential.

The infinite bug factory is a manifestation of this gap. The LLM produces plausible bug reports, bug reports that look like the bug reports a competent programmer would produce, because it has been trained on enormous quantities of real bug reports. But plausible bug reports are not the same as accurate bug reports. The LLM has no mechanism for verifying that its bug reports are accurate, because it cannot execute the code. And its training has given it a systematic bias toward producing bug reports when asked for them, regardless of whether real bugs exist.

This does not mean that LLMs are useless for code review. They are genuinely useful, particularly for identifying common patterns of bugs that they have seen many times in training, for suggesting improvements to code structure and style, and for explaining the purpose and behavior of code. But they should be used as one tool among many, not as a replacement for testing, static analysis, and human code review.

The Reflexion framework (Shinn et al., 2023), developed by researchers at Northeastern University, MIT, and Princeton University, represents one of the most promising approaches to addressing these limitations. Reflexion allows an LLM agent to reflect on the results of its actions, including the results of executing code, and to use those reflections to improve its subsequent actions. By grounding the iterative improvement process in actual execution feedback rather than abstract self-evaluation, Reflexion achieves much better convergence than naive iterative self-correction. But even Reflexion relies on external execution feedback as the ground truth signal, confirming that the fundamental limitation of LLM self-correction without external feedback is not easily overcome.

Similarly, AlphaCode 2, the competitive-programming system Google DeepMind announced in late 2023, relies on a generate-and-filter approach rather than on iterative self-correction. The system generates a large number of candidate solutions and uses execution results to filter them, rather than asking the model to repeatedly correct a single solution. This approach acknowledges the fundamental limitation of iterative self-correction and works around it rather than trying to overcome it.

The lesson for practitioners is clear. When using LLMs for code review and debugging, treat the LLM as a knowledgeable but fallible colleague who has read a lot of code but has never run your specific program. Listen to its suggestions, but verify them independently. Do not iterate indefinitely. Do not mistake plausibility for correctness. And always, always run the code.

CONCLUSION: ESCAPING THE INFINITE BUG FACTORY

The infinite bug factory is a fascinating and instructive problem. It arises from the intersection of several deep properties of LLMs: the sycophancy induced by RLHF training, the inability to execute code and obtain ground truth, the degradation of attention over long contexts, the stochastic nature of text generation, and the essentially infinite surface area of possible observations about any piece of code. These properties combine to create a system that will always find something to report, regardless of the actual quality of the code, and that will never spontaneously stop and declare the code correct.

Understanding this problem at a deep level changes how you interact with LLMs for code review. You stop treating the LLM as an oracle that will eventually converge on a perfect, bug-free version of your code if you just iterate long enough. You start treating it as a powerful but limited tool that is best used for specific, targeted questions, grounded in actual execution feedback, and validated by external tools.

The stopping criterion is not "the LLM says the code is correct." The LLM will never say the code is correct if you keep asking it to find bugs. The stopping criterion is "the code passes my tests, the static analysis tools report no issues, and the LLM's remaining observations are all stylistic rather than correctness-related." At that point, you have used the LLM appropriately, as one input among many, rather than as the sole arbiter of code quality.

The infinite bug factory is, in the end, a story about the difference between the appearance of correctness and actual correctness. LLMs are extraordinarily good at producing the appearance of correctness. They produce code that looks right, bug reports that sound authoritative, and corrections that seem thorough. But appearance is not reality, and in software, the difference between the two is measured in crashes, security vulnerabilities, and lost data. The programmer who understands this distinction, and who uses LLMs accordingly, will be far more effective than the programmer who does not.

The next time you find yourself in round fifteen of an iterative LLM code review, asking the model to find bugs in a version it just handed you as corrected, remember: you are not getting closer to correct code. You are running on the treadmill of the infinite bug factory. Step off. Run the code. Check the tests. Use your own judgment. The LLM is a tool, and like all tools, it works best when used for the right job in the right way, and put down when the job is done.


References

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Ndousse, K., DasSarma, N., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., & Zhou, D. (2023). Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.

Olausson, T. X., Inala, J. P., Wang, C., Gao, J., & Solar-Lezama, A. (2023). Is self-repair a silver bullet for code generation? arXiv preprint arXiv:2306.09896.

Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366.



FINE-TUNING LOCAL LARGE LANGUAGE MODELS - A GUIDE FOR DEVELOPERS AND ARCHITECTS


INTRODUCTION TO FINE-TUNING LOCAL LLMS


The ability to run and customize large language models on your own hardware has become increasingly accessible. Fine-tuning allows you to adapt pre-trained models to your specific domain, writing style, or task requirements without the massive computational resources needed for training from scratch. 


This tutorial will guide you through the complete process of fine-tuning local LLMs using popular tools like Ollama and Apple MLX, ensuring you understand each step deeply enough to apply these techniques to your own projects.

Fine-tuning is fundamentally different from training a model from scratch. When you fine-tune, you start with a model that already understands language and has general knowledge. Your goal is to teach it specialized knowledge or behaviors by continuing the training process on a carefully curated dataset. This approach requires significantly less data and computational power than initial training, making it practical for individual developers and small teams.


The landscape of local LLM fine-tuning has evolved rapidly. Tools like Ollama have made it remarkably simple to run models locally, while frameworks like Apple MLX provide hardware-optimized training capabilities for Mac users. Understanding when to use each tool and how they complement each other is essential for efficient fine-tuning workflows.


UNDERSTANDING THE FINE-TUNING PROCESS


Before diving into specific tools, you need to understand what happens during fine-tuning at a conceptual level. The pre-trained model has learned patterns from billions of tokens of text. Fine-tuning adjusts the model's weights based on your specific dataset, essentially teaching it to prioritize certain patterns or knowledge domains over others.


There are several approaches to fine-tuning. Full fine-tuning updates all parameters in the model, which provides maximum flexibility but requires substantial memory and computational resources. Parameter-efficient fine-tuning methods like LoRA (Low-Rank Adaptation) update only a small subset of parameters, dramatically reducing resource requirements while maintaining effectiveness for most tasks. For local fine-tuning, parameter-efficient methods are typically the most practical choice.


The quality of your fine-tuning results depends heavily on your training data. You need examples that represent the behavior you want the model to learn. For instruction-following tasks, this means pairs of instructions and desired responses. For domain adaptation, you need representative text from your target domain. The data must be formatted correctly and be of sufficient quality and quantity to produce meaningful improvements.


PREREQUISITES AND ENVIRONMENT SETUP


Before beginning the fine-tuning process, you need to prepare your development environment with the necessary tools and dependencies. The specific requirements vary depending on which approach you choose, but some fundamentals apply across all methods.


For Ollama-based fine-tuning, you need a system with adequate RAM and ideally a GPU, though CPU-only operation is possible for smaller models. Ollama itself is straightforward to install on Linux, macOS, and Windows. You will also need Python for data preparation and potentially for running fine-tuning scripts.


If you plan to use Apple MLX, you need a Mac with Apple Silicon (M1, M2, M3, or later). MLX is specifically optimized for these chips and takes advantage of the unified memory architecture. The installation process for MLX is simple through pip, but you should ensure you have a recent version of macOS for optimal compatibility.


Setting up your Python environment properly is crucial. Creating a virtual environment isolates your project dependencies and prevents conflicts. Here is how you would set up a basic environment for fine-tuning work. First, create the virtual environment:


python3 -m venv llm_finetuning_env


Next, activate the virtual environment:


source llm_finetuning_env/bin/activate


Now upgrade pip to the latest version:


pip install --upgrade pip


Install the core PyTorch and Hugging Face libraries:


pip install torch transformers datasets


Install Ollama Python bindings if you plan to use Ollama:


pip install ollama


Install MLX and MLX-LM if you are on Apple Silicon:


pip install mlx mlx-lm


This creates an isolated environment and installs the core libraries you will need. The transformers library from Hugging Face provides essential utilities for working with language models. The datasets library simplifies loading and processing training data. Torch is the underlying deep learning framework used by many fine-tuning tools.


PREPARING YOUR TRAINING DATA


Data preparation is perhaps the most critical step in the fine-tuning process. The quality and format of your training data directly determines the quality of your fine-tuned model. You need to understand the expected format for your chosen fine-tuning method and transform your raw data accordingly.


Most fine-tuning approaches expect data in a conversational or instruction format. The standard format includes a system message that sets the context, a user message containing the input or question, and an assistant message with the desired response. This format teaches the model to respond appropriately to different types of inputs.


Let me show you how to structure training data properly. Suppose you want to fine-tune a model to answer questions about a specific product domain. Your data should be in JSONL format, where each line is a valid JSON object representing one training example. The example below is pretty-printed across several lines for readability; in the actual file, the whole object must sit on a single line:


{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant specialized in industrial automation systems."
        },
        {
            "role": "user",
            "content": "What is a PLC and how does it differ from a DCS?"
        },
        {
            "role": "assistant",
            "content": "A PLC (Programmable Logic Controller) is a ruggedized computer used for industrial automation. It differs from a DCS (Distributed Control System) primarily in architecture and application scope. PLCs are typically used for discrete control tasks and sequential operations, while DCS systems are designed for continuous process control across large facilities."
        }
    ]
}


Each training example should demonstrate the exact behavior you want the model to learn. If you want the model to be concise, your examples should be concise. If you want detailed technical explanations, your examples should demonstrate that style. The model learns by imitation, so consistency in your training data is essential.


Creating high-quality training data often requires significant effort. You might start with existing documentation, customer support transcripts, or expert-written content. However, you typically need to clean and reformat this data. Here is a Python script that demonstrates how to convert raw question-answer pairs into the proper format. First, import the necessary module:


import json


Now define a function to create a single training example:


def create_training_example(system_prompt, question, answer):
    """
    Creates a properly formatted training example for fine-tuning.

    Args:
        system_prompt: The system message that sets context
        question: The user's question or input
        answer: The desired assistant response

    Returns:
        A dictionary formatted for fine-tuning
    """
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer}
        ]
    }


This function encapsulates the logic for creating a single training example. It takes three parameters and returns a properly structured dictionary, and its docstring documents each parameter and the return value.


Now let me show you how to use this function to convert multiple question-answer pairs into a complete training dataset:


def convert_qa_pairs_to_training_data(qa_pairs, system_prompt, output_file):
    """
    Converts a list of question-answer pairs into JSONL training data.

    Args:
        qa_pairs: List of tuples containing (question, answer)
        system_prompt: The system message to use for all examples
        output_file: Path to the output JSONL file
    """
    with open(output_file, 'w', encoding='utf-8') as f:
        for question, answer in qa_pairs:
            example = create_training_example(system_prompt, question, answer)
            f.write(json.dumps(example, ensure_ascii=False) + '\n')

    print(f"Created {len(qa_pairs)} training examples in {output_file}")


This function handles the batch conversion process. It opens the output file with UTF-8 encoding to properly handle international characters. For each question-answer pair, it creates a training example and writes it as a JSON line. The ensure_ascii parameter is set to False to preserve non-ASCII characters in their original form.


Here is how you would use these functions in practice. First, define your question-answer data:


qa_data = [
    (
        "How do I reset the controller?",
        "To reset the controller, press and hold the reset button for 3 seconds until the LED blinks twice."
    ),
    (
        "What is the maximum operating temperature?",
        "The maximum operating temperature is 85 degrees Celsius in ambient conditions."
    ),
    (
        "How often should I perform maintenance?",
        "Regular maintenance should be performed every 6 months or after 2000 operating hours, whichever comes first."
    )
]


Define the system message:


system_message = "You are a technical support assistant for industrial equipment."


Call the conversion function:


convert_qa_pairs_to_training_data(qa_data, system_message, "training_data.jsonl")


This example demonstrates creating a small training dataset from question-answer pairs. In a real scenario, you would have many more examples, but the process remains the same. The script provides a clean, reusable way to format your data consistently.
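Fine-tuning tools usually also want a held-out validation file so that you can watch for overfitting; the MLX configuration later in this guide, for instance, refers to a validation_data.jsonl. The sketch below shows one way to split a JSONL file. It is a minimal sketch: the function name, the default 10 percent split, and the fixed seed are assumptions chosen for illustration.

```python
import json
import random
from typing import Tuple


def split_jsonl(
    input_file: str,
    train_file: str,
    valid_file: str,
    valid_fraction: float = 0.1,
    seed: int = 42,
) -> Tuple[int, int]:
    """Shuffle a JSONL file and write separate training and validation files.

    Returns (number of training examples, number of validation examples).
    """
    with open(input_file, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]

    for line in lines:
        json.loads(line)  # fail early if any record is not valid JSON

    random.Random(seed).shuffle(lines)  # seeded, so the split is reproducible

    # Keep at least one validation example when there is more than one record.
    n_valid = max(1, int(len(lines) * valid_fraction)) if len(lines) > 1 else 0
    valid_lines, train_lines = lines[:n_valid], lines[n_valid:]

    with open(train_file, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in train_lines)
    with open(valid_file, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in valid_lines)

    return len(train_lines), len(valid_lines)
```

A call such as split_jsonl("all_data.jsonl", "training_data.jsonl", "validation_data.jsonl") would then produce both files from a single source file (all_data.jsonl is a placeholder name). Write the split to different paths than the input so that the original data is never overwritten.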


The amount of training data you need depends on your task complexity and how different it is from the base model's capabilities. For simple style adaptation, you might need only a few dozen high-quality examples. For teaching new domain knowledge, you typically need hundreds or thousands of examples. Quality always trumps quantity, so focus on creating excellent examples rather than gathering massive amounts of mediocre data.


FINE-TUNING WITH OLLAMA


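Ollama has become popular for running local LLMs because of its simplicity and Docker-like interface. Ollama itself is an inference tool and does not train models: a Modelfile can set a system prompt and sampling parameters or import weights, but it does not change what the model has learned. To fine-tune for Ollama, you train with an external tool that produces Ollama-compatible output and then import the result.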


The Ollama ecosystem works with models in GGUF format, a file format from the llama.cpp project that usually stores quantized weights optimized for CPU and consumer GPU inference. To fine-tune for Ollama, you typically train using standard tools and then convert the result to GGUF. Some training libraries can also export GGUF files directly, which removes the separate conversion step.


One practical approach is using the Unsloth library, which provides efficient fine-tuning capabilities and can export to formats that Ollama understands. Unsloth optimizes memory usage and training speed, making it suitable for local fine-tuning on consumer hardware. Let me walk you through a complete fine-tuning workflow using this approach.


First, you need to install Unsloth and its dependencies. Install Unsloth from the GitHub repository:


pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


Install additional required dependencies:


pip install --no-deps xformers trl peft accelerate bitsandbytes


Now you can write a fine-tuning script. This script loads a base model, prepares it for efficient fine-tuning using LoRA, trains on your data, and saves the result. Start by importing the necessary libraries:


from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset


Define configuration parameters:


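max_seq_length = 2048  # maximum number of tokens per training sequence
dtype = None           # None lets Unsloth pick float16 or bfloat16 for your GPU
load_in_4bit = True    # load the base weights in 4-bit precision to save memory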


Load the base model with optimizations:


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)


Configure LoRA for parameter-efficient fine-tuning:


model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,  # the adapter update is scaled by lora_alpha / r
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # trades compute for lower memory use
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)


Load and prepare the training dataset:


dataset = load_dataset("json", data_files="training_data.jsonl", split="train")


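Define a function to format the prompts. This relies on the tokenizer providing a chat template that accepts the roles in your data (some templates reject a system message). Base, non-instruct checkpoints often ship without a chat template, in which case apply_chat_template raises an error and you need to set a template yourself or start from an instruct variant:


def format_prompts(examples):
    """
    Formats the dataset examples into the prompt structure expected by the model.
    This function converts the messages format into a single text string.
    """
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)
    return {"text": texts}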


Apply the formatting function to the dataset:


dataset = dataset.map(format_prompts, batched=True)


Configure training parameters:


training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
)


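Initialize the trainer. The keyword arguments below match the older TRL releases that this style of script was written against. Newer TRL releases move dataset_text_field, max_seq_length, and packing into SFTConfig and rename tokenizer to processing_class, so check the version you have installed:


trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)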


Execute the training process:


print("Starting fine-tuning process...")
trainer_stats = trainer.train()


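Save the fine-tuned model. Calling save_pretrained on a LoRA-wrapped model writes only the adapter weights, so this step uses Unsloth's save_pretrained_merged helper to write the full merged weights, which is what the GGUF conversion that follows needs:


# Merge the LoRA adapters into the base weights and save in 16-bit precision
model.save_pretrained_merged("fine_tuned_model", tokenizer, save_method="merged_16bit")
tokenizer.save_pretrained("fine_tuned_model")
print("Fine-tuning complete. Merged model saved to fine_tuned_model directory.")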


This script demonstrates a complete fine-tuning workflow with several important considerations. The load_in_4bit parameter loads the frozen base weights in 4-bit precision while the LoRA adapter weights are trained in higher precision (the QLoRA approach), which dramatically reduces memory requirements. This allows you to fine-tune 7B parameter models on consumer GPUs with 8-16GB of VRAM.
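A rough back-of-the-envelope calculation shows where the saving comes from. The sketch below is an estimate under stated assumptions, not a measurement: it uses an approximate parameter count of 7.25 billion for a Mistral 7B class model and counts only the base weights, ignoring quantization constants, activations, LoRA weights, and optimizer state, all of which add to the real footprint.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


APPROX_PARAMS = 7_250_000_000  # assumption: approximate size of a 7B model

memory_16bit = weight_memory_gb(APPROX_PARAMS, 16)  # 14.5 GB
memory_4bit = weight_memory_gb(APPROX_PARAMS, 4)    # 3.625 GB
```

At 16 bits the weights alone exceed the memory of most consumer GPUs, while at 4 bits they leave room for activations and the adapter's training state.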


The LoRA configuration is crucial for efficient fine-tuning. The rank parameter controls the expressiveness of the adaptation. A rank of 16 is a good starting point, balancing between model capacity and memory usage. The target_modules specify which parts of the model to adapt. For most transformer models, adapting the attention and feed-forward projections provides good results.
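To make the effect of the rank concrete, you can count the trainable parameters that LoRA adds. For a projection with d_in inputs and d_out outputs, LoRA trains two small matrices of shapes (d_in, r) and (r, d_out), so it adds r * (d_in + d_out) parameters. The sketch below applies that formula to the seven target modules above. The layer dimensions are assumptions based on the published Mistral 7B architecture (hidden size 4096, feed-forward size 14336, key/value projection width 1024, 32 layers); substitute the values for your own model.

```python
# (in_features, out_features) of each adapted projection in one layer.
# ASSUMPTION: dimensions of the Mistral 7B architecture.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # ASSUMPTION: number of transformer layers in Mistral 7B


def lora_trainable_params(rank, shapes=PROJECTION_SHAPES, num_layers=NUM_LAYERS):
    """Count the parameters LoRA adds: rank * (d_in + d_out) per projection."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


print(lora_trainable_params(16))  # 41943040
print(lora_trainable_params(8))   # 20971520
```

At rank 16 this is about 42 million trainable parameters, well under one percent of the roughly 7 billion in the base model. Because the count is linear in the rank, doubling the rank doubles the adapter's parameter count and, with it, the memory for its gradients and optimizer state.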


The training arguments deserve careful attention. The batch size and gradient accumulation steps together determine your effective batch size. With a per-device batch size of 2 and gradient accumulation of 4, your effective batch size is 8. This matters because larger effective batch sizes generally lead to more stable training, but you are limited by available memory.
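The same arithmetic also tells you how much data the run actually consumes, which matters because this script stops after a fixed max_steps rather than after a number of epochs. The following sketch uses the values from the training arguments above; the single-GPU assumption and the example dataset sizes are illustrative.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_devices = 1  # assumption: training on a single GPU

# Examples contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples processed before training stops.
examples_seen = effective_batch_size * max_steps


def epochs_over(dataset_size: int) -> float:
    """How many passes over the dataset the run corresponds to."""
    return examples_seen / dataset_size


print(effective_batch_size)  # 8
print(examples_seen)         # 480
print(epochs_over(1000))     # 0.48
```

With 1,000 training examples, 60 steps cover less than half the dataset. With the three-example dataset created earlier, the same settings would repeat every example 160 times, which is a recipe for memorization. Choose max_steps, or switch to num_train_epochs, with your dataset size in mind.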


The learning rate is another critical hyperparameter. A learning rate of 2e-4 works well for many fine-tuning tasks, but you might need to adjust it based on your specific situation. If training loss decreases too slowly, try increasing the learning rate. If loss oscillates or increases, reduce it.


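After training completes, you need to convert the model to a format Ollama can use. The saved model is in Hugging Face format, so you need to convert it to GGUF. The conversion needs a directory containing full merged weights, not a bare LoRA adapter. You can use the conversion script from the llama.cpp repository for this. In current checkouts it is named convert_hf_to_gguf.py; older checkouts used hyphens in the file name:


python convert_hf_to_gguf.py fine_tuned_model --outtype q8_0 --outfile fine_tuned_model.gguf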


The outtype parameter specifies the quantization level. The q8_0 format uses 8-bit quantization, providing a good balance between model size and quality. Once you have the GGUF file, you can create an Ollama Modelfile to make it accessible through Ollama:


FROM ./fine_tuned_model.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

SYSTEM You are a helpful assistant specialized in industrial automation systems.


Save this as Modelfile, then create the Ollama model:


ollama create my-finetuned-model -f Modelfile


Now you can use your fine-tuned model through Ollama just like any other model:


ollama run my-finetuned-model "What is a PLC?"


This workflow demonstrates the complete process from training to deployment. The key advantage of this approach is that you end up with a model that runs efficiently on local hardware through Ollama's optimized inference engine.
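Once the model is registered, you can also call it from your own programs through Ollama's local HTTP API instead of the command line. The sketch below uses only the Python standard library. It assumes an Ollama server is running at its default address, http://localhost:11434, and that the model name matches the one created above; the helper function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default address


def build_chat_request(model: str, question: str) -> urllib.request.Request:
    """Build a non-streaming chat request for the Ollama HTTP API."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # ask for one complete JSON response
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, question: str) -> str:
    """Send the question to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, question)) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["message"]["content"]
```

With the server running, ask("my-finetuned-model", "What is a PLC?") returns the model's answer as a string. The system prompt from the Modelfile is applied automatically, so the request only needs to carry the user message.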


FINE-TUNING WITH APPLE MLX


For developers working on Apple Silicon Macs, MLX provides an excellent alternative that is specifically optimized for these systems. MLX is a machine learning framework developed by Apple that takes full advantage of the unified memory architecture in Apple Silicon, allowing you to work with larger models than would be possible with traditional frameworks.


The MLX ecosystem includes mlx-lm, a package specifically designed for working with language models. It provides utilities for fine-tuning, inference, and model conversion. The performance on Apple Silicon can be remarkable, with M1 Max and higher chips capable of fine-tuning 7B parameter models at reasonable speeds.


Setting up for MLX fine-tuning requires installing the MLX packages and preparing your data in the expected format. MLX-lm expects training data in a similar conversational format to what we discussed earlier. Here is how you would fine-tune a model using MLX. Start by importing the necessary libraries:


import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.tuner import train, evaluate
import json


Define the configuration dictionary:


config = {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    "train_data": "training_data.jsonl",
    "valid_data": "validation_data.jsonl",
    "adapter_file": "adapters.npz",
    "iters": 100,
    "steps_per_eval": 10,
    "val_batches": 5,
    "learning_rate": 1e-5,
    "batch_size": 2,
    "lora_layers": 16,
}


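The configuration dictionary contains all the parameters needed for fine-tuning. The model parameter specifies which base model to use. The mlx-community organization on Hugging Face hosts a growing collection of pre-converted models optimized for Apple Silicon. The training and validation data files should be in JSONL format with the same structure we created earlier. If you use the mlx_lm.lora command-line tool instead of Python, note that it expects a data directory containing files named train.jsonl and valid.jsonl rather than arbitrary file paths.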


The lora_layers parameter determines how many transformer layers will have LoRA adapters applied, counted from the end of the model. Mistral 7B has 32 layers, so setting this to 16 adapts only the last 16; raise the value to adapt more layers at the cost of more memory. The learning rate of 1e-5 is a deliberately conservative starting point, lower than the 2e-4 used in the Unsloth example. Adjust it based on how the training and validation loss behave.


Before starting fine-tuning, you should verify your data is properly formatted. Here is a function to validate your training data:


def validate_training_data(file_path):
    """
    Validates that training data is properly formatted for MLX fine-tuning.

    Args:
        file_path: Path to the JSONL training data file

    Returns:
        True if valid, raises exception otherwise
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            try:
                data = json.loads(line)
                if "messages" not in data:
                    raise ValueError(f"Line {line_num}: Missing 'messages' key")

                messages = data["messages"]
                if not isinstance(messages, list):
                    raise ValueError(f"Line {line_num}: 'messages' must be a list")

                for msg in messages:
                    if "role" not in msg or "content" not in msg:
                        raise ValueError(f"Line {line_num}: Invalid message format")

                    if msg["role"] not in ["system", "user", "assistant"]:
                        raise ValueError(f"Line {line_num}: Invalid role")

            except json.JSONDecodeError as e:
                raise ValueError(f"Line {line_num}: Invalid JSON - {str(e)}")

    print(f"Validation successful: {file_path}")
    return True


Call the validation function:


validate_training_data("training_data.jsonl")


This validation function checks each line of your training data to ensure it meets the required format. It verifies that the JSON is valid, that the messages key exists, and that each message has the correct structure. Running this before fine-tuning can save you from discovering formatting issues after training has already started.
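

Beyond checking the format, it can help to look at a quick summary of the dataset before training. The sketch below is one possible way to do that; the function name summarize_chat_dataset and the sample records are hypothetical and only assume the same "messages" structure that the validator checks.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_chat_dataset(file_path: str) -> dict:
    """Counts examples and messages per role in a JSONL chat dataset."""
    role_counts = Counter()
    num_examples = 0
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            num_examples += 1
            for msg in record["messages"]:
                role_counts[msg["role"]] += 1
    return {"examples": num_examples, "roles": dict(role_counts)}


# Demo on a small temporary file so the sketch runs without real training data.
samples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is a PLC?"},
        {"role": "assistant", "content": "A programmable logic controller."},
    ]},
    {"messages": [
        {"role": "user", "content": "What is SCADA?"},
        {"role": "assistant", "content": "A supervisory control system."},
    ]},
]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    for sample in samples:
        tmp.write(json.dumps(sample) + "\n")

stats = summarize_chat_dataset(tmp.name)
os.remove(tmp.name)
print(stats)
```

A large imbalance between user and assistant counts, or an unexpected number of system messages, is usually a sign that something went wrong when the dataset was assembled.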


Now you can execute the fine-tuning process with MLX. Load the model:


model, tokenizer = load(config["model"])


Start the training process:


train(

    model=model,

    tokenizer=tokenizer,

    args=config

)


Print completion message:


print("Fine-tuning complete. Adapters saved to adapters.npz")


The MLX fine-tuning process is remarkably efficient on Apple Silicon. The unified memory architecture means the entire model can be kept in memory without transfers between CPU and GPU memory, which significantly speeds up training. During training, you will see periodic evaluation metrics that help you monitor progress.


After fine-tuning completes, the LoRA adapters are saved to a file. These adapters are much smaller than the full model, typically ranging from a few megabytes to a few hundred megabytes depending on the adapter rank and the number of adapted layers. To use your fine-tuned model, you load the base model and apply the adapters. Import the necessary functions:


from mlx_lm import load, generate


Load the model with adapters:


model, tokenizer = load(

    "mlx-community/Mistral-7B-Instruct-v0.3-4bit",

    adapter_path="adapters.npz"

)


Generate a response:


prompt = "What is a PLC and how does it work?"

response = generate(

    model,

    tokenizer,

    prompt=prompt,

    max_tokens=200,

    temp=0.7

)

print(response)


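This code loads the base model and applies your fine-tuned adapters, then generates a response to a prompt. The max_tokens argument caps the length of the output, and temp sets the sampling temperature: lower values give more deterministic text, higher values more varied text. The keyword arguments accepted by generate have changed between mlx_lm releases, so check the signature in your installed version if a parameter is rejected.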


One significant advantage of the MLX approach is the ability to easily merge the adapters back into the base model for faster inference. Import the fusion utility:


from mlx_lm.utils import fuse_lora_layers


Fuse the adapters into the model:


fused_model = fuse_lora_layers(model)


Save the fused model:


fused_model.save_weights("fused_model_weights.npz")


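The fused model eliminates the overhead of applying adapters during inference, resulting in faster generation speeds. This is particularly useful if you plan to use the model extensively in production. The fusion helpers have moved between mlx_lm releases; if the import above fails in your installed version, the package also provides a command-line entry point, mlx_lm.fuse, for the same purpose.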


UNDERSTANDING QUANTIZATION IN FINE-TUNING


Quantization is the process of reducing the precision of model weights, typically from 32-bit or 16-bit floating point to 8-bit or even 4-bit integers. This dramatically reduces model size and memory requirements, making it possible to run larger models on consumer hardware. However, the relationship between quantization and fine-tuning requires careful consideration.


There are two main scenarios where quantization intersects with fine-tuning. The first is fine-tuning on top of a base model that is already quantized, with the quantized weights kept frozen while adapters are trained. This approach, used by tools like Unsloth and MLX, allows you to fine-tune with reduced memory requirements. It is often loosely called quantization-aware training, although that term strictly refers to simulating quantization during training so that the final weights tolerate it. The second is post-training quantization, where you fine-tune in full precision and then quantize the result for deployment.


For local fine-tuning, training on a quantized base is typically the better choice because it lets you work with larger models within your hardware constraints. Techniques like QLoRA (Quantized Low-Rank Adaptation) are reported to preserve most of the model quality even when the base weights are stored in 4-bit precision.


Here is how quantization affects your fine-tuning workflow. When you load a model with 4-bit quantization, the base model weights are stored in 4-bit format, but the LoRA adapters are trained in higher precision. This hybrid approach provides the memory benefits of quantization while maintaining the training quality of higher precision.
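

The memory benefit can be estimated with simple arithmetic. The sketch below computes the storage needed for the weights alone at different bit widths. The 7-billion parameter count is an assumed round number, and the estimate deliberately ignores quantization scales, activations, the LoRA adapters, and optimizer state, all of which add to the real footprint.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Assumed model size: 7 billion parameters.
NUM_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Going from 16-bit to 4-bit storage cuts the weight footprint by a factor of four, which is often the difference between a model fitting in memory on consumer hardware or not.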


The choice of quantization level depends on your priorities. 8-bit quantization (q8_0 in GGUF format) provides excellent quality with moderate size reduction. 4-bit quantization (q4_0 or q4_K_M) offers more aggressive compression with some quality tradeoff. For most fine-tuning tasks, the quality difference is minimal, making 4-bit quantization an excellent choice.


After fine-tuning, you might want to experiment with different quantization levels for deployment. Here is how you would convert a model to various quantization formats. Convert to 8-bit quantization:


python convert-hf-to-gguf.py fine_tuned_model --outtype q8_0 --outfile model_q8.gguf


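Convert to 4-bit mixed quantization. The conversion script's --outtype option covers float types and q8_0 but not the K-quant types, so first export a 16-bit GGUF file and then quantize it with the llama-quantize tool that ships with llama.cpp (named quantize in older builds):


python convert-hf-to-gguf.py fine_tuned_model --outtype f16 --outfile model_f16.gguf

llama-quantize model_f16.gguf model_q4.gguf Q4_K_M


Convert to 5-bit mixed quantization from the same 16-bit file:


llama-quantize model_f16.gguf model_q5.gguf Q5_K_M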


You can then test each quantized version to find the best balance between size and quality for your specific use case. The K_M variants use mixed quantization, applying different quantization levels to different parts of the model for optimal quality-to-size ratio.


EVALUATING YOUR FINE-TUNED MODEL


After fine-tuning completes, thorough evaluation is essential to determine whether the model has learned the desired behaviors. Evaluation should both test the specific capabilities you trained for and confirm that the model has not lost general capabilities it had before fine-tuning.


The most straightforward evaluation approach is qualitative testing with representative prompts. Create a test set of questions or tasks that cover the range of behaviors you want the model to exhibit. Compare the fine-tuned model's responses to the base model's responses to see the improvement.

Here is a script for systematic qualitative evaluation. Import the necessary modules:


import json

from typing import List, Dict


Define a function to load test prompts:


def load_test_prompts(file_path: str) -> List[Dict]:

    """

    Loads test prompts from a JSONL file.

    

    Args:

        file_path: Path to the test prompts file

    

    Returns:

        List of test prompt dictionaries

    """

    prompts = []

    with open(file_path, 'r', encoding='utf-8') as f:

        for line in f:

            prompts.append(json.loads(line))

    return prompts


Define a function to evaluate model responses:


def evaluate_model_responses(model, tokenizer, test_prompts: List[Dict]):

    """

    Evaluates model responses on a set of test prompts.

    

    Args:

        model: The fine-tuned model

        tokenizer: The model's tokenizer

        test_prompts: List of test prompt dictionaries

    """

    results = []

    

    for prompt_data in test_prompts:

        prompt = prompt_data["prompt"]

        expected_behavior = prompt_data.get("expected_behavior", "")

        

        response = generate(

            model,

            tokenizer,

            prompt=prompt,

            max_tokens=300,

            temp=0.7

        )

        

        result = {

            "prompt": prompt,

            "response": response,

            "expected_behavior": expected_behavior

        }

        results.append(result)

        

        print(f"\nPrompt: {prompt}")

        print(f"Response: {response}")

        print(f"Expected: {expected_behavior}")

        print("-" * 80)

    

    return results


This evaluation script loads test prompts and generates responses, displaying them for manual review. For more rigorous evaluation, you might implement automated metrics or use another LLM to judge response quality.
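

As one example of a simple automated metric, the sketch below scores a response by the fraction of required keywords it contains. The function name keyword_coverage and the keyword list are hypothetical, and substring matching is a crude proxy for quality, so treat the score as a screening signal rather than a verdict.

```python
from typing import List


def keyword_coverage(response: str, required_keywords: List[str]) -> float:
    """Fraction of required keywords found in the response (case-insensitive)."""
    if not required_keywords:
        return 1.0  # nothing was required, so nothing is missing
    text = response.lower()
    hits = sum(1 for keyword in required_keywords if keyword.lower() in text)
    return hits / len(required_keywords)


# Illustrative response and keywords, not output from a real model.
score = keyword_coverage(
    "A PLC is a programmable logic controller used in industrial automation.",
    ["programmable logic controller", "industrial", "ladder logic"],
)
print(f"Keyword coverage: {score:.2f}")
```

Responses that score low can be prioritized for manual review, which keeps the human effort focused where it is most likely to matter.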


Quantitative evaluation is also important, especially for tasks with clear right and wrong answers. If you are fine-tuning for a specific task like classification or information extraction, you can compute standard metrics like accuracy, precision, and recall:


def calculate_accuracy(predictions: List[str], ground_truth: List[str]) -> float:

    """

    Calculates accuracy for classification tasks.

    

    Args:

        predictions: List of predicted labels

        ground_truth: List of correct labels

    

    Returns:

        Accuracy as a float between 0 and 1

    """

    if len(predictions) != len(ground_truth):

        raise ValueError("Predictions and ground truth must have same length")

    

    if not predictions:

        raise ValueError("Cannot compute accuracy on an empty list")

    

    correct = sum(1 for pred, truth in zip(predictions, ground_truth) 

                  if pred.strip().lower() == truth.strip().lower())

    

    return correct / len(predictions)


Example usage:


predictions = ["PLC", "DCS", "SCADA", "PLC"]

ground_truth = ["PLC", "DCS", "SCADA", "HMI"]

accuracy = calculate_accuracy(predictions, ground_truth)

print(f"Accuracy: {accuracy:.2%}")
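

Accuracy alone can hide per-class problems, so it is often worth computing precision and recall for individual labels as well. The following is a minimal sketch for a single label; the function name precision_recall is hypothetical, and it reuses the same case-insensitive, whitespace-trimmed comparison as the accuracy function.

```python
from typing import List, Tuple


def precision_recall(predictions: List[str], ground_truth: List[str],
                     label: str) -> Tuple[float, float]:
    """Computes precision and recall for one label.

    Returns 0.0 for a metric whose denominator is zero.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("Predictions and ground truth must have same length")

    def normalize(text: str) -> str:
        return text.strip().lower()

    target = normalize(label)
    pairs = [(normalize(p), normalize(t))
             for p, t in zip(predictions, ground_truth)]

    true_pos = sum(1 for p, t in pairs if p == target and t == target)
    false_pos = sum(1 for p, t in pairs if p == target and t != target)
    false_neg = sum(1 for p, t in pairs if p != target and t == target)

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


predictions = ["PLC", "DCS", "SCADA", "PLC"]
ground_truth = ["PLC", "DCS", "SCADA", "HMI"]

for label in ("PLC", "HMI"):
    precision, recall = precision_recall(predictions, ground_truth, label)
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```

On the same sample data, the overall accuracy looks reasonable, but the per-label view shows that the HMI class is never predicted correctly.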


For generation tasks, you might use metrics like BLEU or ROUGE to compare generated text to reference text, though these metrics have limitations and should be combined with human evaluation.
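

To illustrate the idea behind overlap-based metrics, here is a simplified unigram-overlap F1 score. It is similar in spirit to ROUGE-1 but is not a substitute for an established implementation: it assumes plain whitespace tokenization and does no stemming or punctuation handling. The function name unigram_overlap_f1 is hypothetical.

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """F1 score over overlapping lowercase whitespace-separated tokens."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((candidate_counts & reference_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(candidate_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)


f1 = unigram_overlap_f1(
    "a plc controls machines",
    "a plc controls industrial machines",
)
print(f"Unigram overlap F1: {f1:.3f}")
```

A score like this rewards shared wording, not shared meaning, which is exactly the limitation that makes human review a necessary complement.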


Another important aspect of evaluation is checking for regression. Your fine-tuned model should maintain the base model's general capabilities while adding the new specialized knowledge. Test the model on general questions unrelated to your fine-tuning domain to ensure it still performs well:


general_test_prompts = [

    "Explain the concept of recursion in programming.",

    "What are the main causes of climate change?",

    "How does photosynthesis work?"

]


If the model's performance on general questions has degraded significantly, you might need to adjust your fine-tuning approach. This could mean reducing the learning rate, using fewer training steps, or including more diverse examples in your training data.


ADVANCED FINE-TUNING TECHNIQUES


Once you have mastered basic fine-tuning, several advanced techniques can further improve your results. These techniques address common challenges like catastrophic forgetting, data scarcity, and training instability.


Catastrophic forgetting occurs when fine-tuning causes the model to lose capabilities it had before training. One mitigation strategy is mixing your specialized training data with general examples from the base model's training distribution. This helps the model maintain broad capabilities while learning new specialized knowledge:


def create_mixed_dataset(specialized_data: List[Dict], 

                        general_data: List[Dict], 

                        mix_ratio: float = 0.2) -> List[Dict]:

    """

    Creates a mixed dataset combining specialized and general examples.

    

    Args:

        specialized_data: Your domain-specific training examples

        general_data: General examples from diverse domains

        mix_ratio: General examples to add, as a fraction of the specialized set size

    

    Returns:

        Combined dataset with mixed examples

    """

    import random

    

    num_general = int(len(specialized_data) * mix_ratio)

    sampled_general = random.sample(general_data, 

                                   min(num_general, len(general_data)))

    

    mixed_data = specialized_data + sampled_general

    random.shuffle(mixed_data)

    

    return mixed_data


This function takes your specialized training data and mixes in a proportion of general examples. The mix ratio is measured against the specialized set: a ratio of 0.2 adds general examples equal to 20 percent of the specialized set, so they make up about 17 percent of the combined dataset. Experiment with different ratios to find the best balance for your use case.
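

Because create_mixed_dataset sizes the general sample relative to the specialized set, the share of general examples in the combined dataset is mix_ratio / (1 + mix_ratio). The short sketch below makes that arithmetic explicit and also inverts it. Both helper names are hypothetical, and the calculation assumes general_data holds enough examples to satisfy the requested count.

```python
def general_fraction(mix_ratio: float) -> float:
    """Share of general examples in the combined dataset for a given mix_ratio."""
    return mix_ratio / (1.0 + mix_ratio)


def ratio_for_fraction(target_fraction: float) -> float:
    """mix_ratio needed so that general examples form target_fraction of the total."""
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in the range [0, 1)")
    return target_fraction / (1.0 - target_fraction)


print(f"mix_ratio=0.2 gives {general_fraction(0.2):.1%} general examples")
print(f"A 20% general share needs mix_ratio={ratio_for_fraction(0.2):.2f}")
```

If you want general examples to be a specific share of the final dataset, compute the ratio with the second helper and pass it as mix_ratio.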


Another advanced technique is curriculum learning, where you organize training examples from simple to complex. This can improve learning efficiency and final model quality:


def sort_by_complexity(examples: List[Dict]) -> List[Dict]:

    """

    Sorts training examples by complexity for curriculum learning.

    

    Args:

        examples: List of training examples

    

    Returns:

        Examples sorted by increasing complexity

    """

    def estimate_complexity(example: Dict) -> int:

        """Estimates complexity based on response length and vocabulary."""

        response = example["messages"][-1]["content"]

        words = response.split()

        unique_words = len(set(words))

        return len(words) + unique_words

    

    return sorted(examples, key=estimate_complexity)


This function provides a simple complexity estimate based on response length and vocabulary diversity. More sophisticated approaches might consider syntactic complexity or domain-specific difficulty metrics.
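

Once examples are sorted, one possible way to use the ordering is to split it into stages and train on the easier stages first. The sketch below shows only the splitting step; the function name split_into_stages is hypothetical, and how the stages are fed to a trainer is left open. Note that many training loops shuffle data by default, which would undo the ordering unless shuffling is disabled or applied only within a stage.

```python
from typing import Dict, List


def split_into_stages(sorted_examples: List[Dict],
                      num_stages: int) -> List[List[Dict]]:
    """Splits an already-sorted list into consecutive, near-equal stages."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")

    stage_size, remainder = divmod(len(sorted_examples), num_stages)
    stages = []
    start = 0
    for stage_index in range(num_stages):
        # Earlier stages absorb the leftover examples one at a time.
        end = start + stage_size + (1 if stage_index < remainder else 0)
        stages.append(sorted_examples[start:end])
        start = end
    return stages


# Illustrative placeholder examples, assumed to be sorted by complexity already.
examples = [{"id": i} for i in range(7)]
stages = split_into_stages(examples, num_stages=3)
print([len(stage) for stage in stages])
```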


For scenarios with limited training data, data augmentation can help. You can create variations of your existing examples through paraphrasing or by using another LLM to generate similar examples:


def augment_training_example(example: Dict, num_variations: int = 2) -> List[Dict]:

    """

    Creates augmented variations of a training example.

    

    Args:

        example: Original training example

        num_variations: Number of variations to create

    

    Returns:

        List containing original and augmented examples

    """

    augmented = [example]

    

    original_question = example["messages"][1]["content"]

    original_answer = example["messages"][2]["content"]

    

    paraphrase_prompts = [

        f"Rephrase this question while keeping the same meaning: {original_question}",

        f"Ask the same question in a different way: {original_question}"

    ]

    

    # Note: You would use an LLM to generate actual paraphrases

    # This is a simplified example showing the structure

    

    return augmented


Data augmentation should be used carefully to avoid introducing noise or inconsistencies into your training data. Always review augmented examples before including them in your training set.
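

The structure above leaves the actual paraphrasing step open. A minimal sketch of how it could be completed is shown below, assuming you supply your own paraphrase_fn, a hypothetical callable that takes a question and returns a reworded version (for example by calling an LLM). The stand-in used in the demo only prefixes the text so the example runs without a model.

```python
import copy
from typing import Callable, Dict, List


def augment_with_paraphrases(example: Dict,
                             paraphrase_fn: Callable[[str], str],
                             num_variations: int = 2) -> List[Dict]:
    """Returns the original example plus variants with a reworded user question."""
    augmented = [example]

    # Find the last user message instead of assuming a fixed position.
    user_index = max(i for i, msg in enumerate(example["messages"])
                     if msg["role"] == "user")
    original_question = example["messages"][user_index]["content"]

    seen = {original_question}
    for _ in range(num_variations):
        new_question = paraphrase_fn(original_question)
        if not new_question or new_question in seen:
            continue  # skip empty or duplicate paraphrases
        seen.add(new_question)

        variant = copy.deepcopy(example)  # leave the original untouched
        variant["messages"][user_index]["content"] = new_question
        augmented.append(variant)

    return augmented


def fake_paraphrase(question: str) -> str:
    """Stand-in for a real paraphraser; always returns the same rewording."""
    return "In other words: " + question


example = {"messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a PLC?"},
    {"role": "assistant", "content": "A PLC is a programmable logic controller."},
]}

variants = augment_with_paraphrases(example, fake_paraphrase, num_variations=2)
print(f"Examples after augmentation: {len(variants)}")
```

Because the stand-in always returns the same text, the duplicate check discards the second variation. A real paraphraser would usually return different wording on each call.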


DEPLOYMENT CONSIDERATIONS


After successfully fine-tuning your model, you need to consider how to deploy it for actual use. The deployment approach depends on your requirements for latency, throughput, privacy, and resource availability.

For local deployment with Ollama, you have already seen how to create a Modelfile and register your model. This approach is excellent for personal use or small-scale applications. Ollama provides a simple REST API that you can use to integrate the model into applications. Import the necessary modules:


import requests

import json


Define a function to query the Ollama model:


def query_ollama_model(model_name: str, prompt: str, 

                       base_url: str = "http://localhost:11434") -> str:

    """

    Queries an Ollama model via its REST API.

    

    Args:

        model_name: Name of the Ollama model

        prompt: The prompt to send to the model

        base_url: Base URL of the Ollama server

    

    Returns:

        The model's response as a string

    """

    url = f"{base_url}/api/generate"

    

    payload = {

        "model": model_name,

        "prompt": prompt,

        "stream": False

    }

    

    # A timeout prevents the call from hanging forever if the server stalls
    response = requests.post(url, json=payload, timeout=120)

    response.raise_for_status()

    

    result = response.json()

    return result["response"]


Example usage:


response = query_ollama_model("my-finetuned-model", "What is a PLC?")

print(response)


This function provides a clean interface for querying your Ollama model from Python applications. Setting stream to False returns the complete response at once, while setting it to True enables streaming for real-time output.
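

When streaming is enabled, the server sends a sequence of newline-delimited JSON objects instead of a single response. The sketch below shows one way to consume such a stream. It uses urllib from the standard library rather than requests to stay dependency-free, and it assumes each object carries a "response" text fragment and a "done" flag. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yields text fragments from newline-delimited JSON chunks."""
    for raw_line in lines:
        if not raw_line.strip():
            continue  # skip keep-alive or empty lines
        chunk = json.loads(raw_line)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def stream_ollama(model_name: str, prompt: str,
                  base_url: str = "http://localhost:11434") -> Iterator[str]:
    """Streams a completion from a running Ollama server, fragment by fragment."""
    payload = json.dumps({
        "model": model_name,
        "prompt": prompt,
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        # The HTTP response object can be iterated line by line.
        yield from iter_stream_text(response)


# Offline demo with hand-written chunks, so no server is needed to run this.
fake_lines = [
    b'{"response": "A PLC ", "done": false}\n',
    b'{"response": "is a controller.", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print("".join(iter_stream_text(fake_lines)))
```

With a server running, you would iterate over stream_ollama("my-finetuned-model", "What is a PLC?") and print each fragment as it arrives.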

For MLX models on Apple Silicon, you can create a simple inference server. Import Flask and MLX libraries:


from flask import Flask, request, jsonify

from mlx_lm import load, generate


Initialize the Flask app:


app = Flask(__name__)


Load the model at startup:


model, tokenizer = load(

    "mlx-community/Mistral-7B-Instruct-v0.3-4bit",

    adapter_path="adapters.npz"

)


Define the API endpoint:


@app.route('/generate', methods=['POST'])

def generate_response():

    """

    API endpoint for generating model responses.

    

    Expects JSON payload with 'prompt' field.

    Returns JSON with 'response' field.

    """

    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}

    

    if 'prompt' not in data:

        return jsonify({'error': 'Missing prompt field'}), 400

    

    prompt = data['prompt']

    max_tokens = data.get('max_tokens', 200)

    temperature = data.get('temperature', 0.7)

    

    response = generate(

        model,

        tokenizer,

        prompt=prompt,

        max_tokens=max_tokens,

        temp=temperature

    )

    

    return jsonify({'response': response})


Run the server:


if __name__ == '__main__':

    app.run(host='0.0.0.0', port=5000)


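This creates a simple Flask server that loads your fine-tuned model once at startup and then serves inference requests. The API accepts JSON payloads with the prompt and optional generation parameters. Note that host='0.0.0.0' makes the server reachable from other machines on your network; bind to 127.0.0.1 if you only need local access. On recent macOS versions, port 5000 may already be in use by the AirPlay Receiver service, in which case pick a different port.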


For production deployments, you should consider additional factors like request queuing, batching, caching, and monitoring. Here is a more robust inference wrapper that includes basic caching. Import additional modules:


import hashlib


Define the cached inference class:


class CachedModelInference:

    """

    Wrapper for model inference with response caching.

    """

    

    def __init__(self, model, tokenizer, cache_size=128):

        """

        Initializes the cached inference wrapper.

        

        Args:

            model: The language model

            tokenizer: The model's tokenizer

            cache_size: Maximum number of cached responses

        """

        self.model = model

        self.tokenizer = tokenizer

        self.cache = {}

        self.cache_size = cache_size

    

    def _hash_prompt(self, prompt: str, max_tokens: int, temp: float) -> str:

        """Creates a hash key for caching."""

        key_string = f"{prompt}_{max_tokens}_{temp}"

        return hashlib.md5(key_string.encode()).hexdigest()

    

    def generate(self, prompt: str, max_tokens: int = 200, 

                temp: float = 0.7) -> str:

        """

        Generates a response with caching.

        

        Args:

            prompt: The input prompt

            max_tokens: Maximum tokens to generate

            temp: Temperature for sampling

        

        Returns:

            Generated response string

        """

        cache_key = self._hash_prompt(prompt, max_tokens, temp)

        

        if cache_key in self.cache:

            return self.cache[cache_key]

        

        response = generate(

            self.model,

            self.tokenizer,

            prompt=prompt,

            max_tokens=max_tokens,

            temp=temp

        )

        

        if len(self.cache) >= self.cache_size:

            # Remove oldest entry

            oldest_key = next(iter(self.cache))

            del self.cache[oldest_key]

        

        self.cache[cache_key] = response

        return response


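This cached inference class stores responses for repeated prompts, which can significantly reduce latency for common queries. The cache size is configurable, and the implementation uses a simple FIFO eviction policy: Python dictionaries preserve insertion order, so the first key returned by the iterator is the oldest entry. Keep in mind that with a nonzero temperature each response is one random sample, so caching means a repeated prompt always returns that same sample.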
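

If frequently repeated prompts should survive eviction, a least-recently-used (LRU) policy is a common alternative to FIFO. The sketch below is a standalone variant built on collections.OrderedDict, not a change to the class above. The class name LRUResponseCache and the compute callable are hypothetical; compute stands in for the actual call to the model.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class LRUResponseCache:
    """Small least-recently-used cache for generated responses."""

    def __init__(self, max_size: int = 128):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get_or_compute(self, key: Hashable, compute: Callable[[], str]) -> str:
        """Returns the cached value for key, computing and storing it on a miss."""
        if key in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]

        value = compute()
        self._entries[key] = value
        if len(self._entries) > self.max_size:
            # Evict the least recently used entry (front of the ordering).
            self._entries.popitem(last=False)
        return value


cache = LRUResponseCache(max_size=2)
print(cache.get_or_compute("What is a PLC?", lambda: "stand-in response"))
print(cache.get_or_compute("What is a PLC?", lambda: "never computed"))
```

The second call returns the stored value without invoking its callable, which is the behavior you want when the callable wraps an expensive generation step.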


MONITORING AND ITERATION


Fine-tuning is rarely a one-time process. You should establish monitoring to track how your model performs in real-world use and iterate based on feedback. Collect examples where the model fails or produces suboptimal responses, then use these to create additional training data for future fine-tuning iterations.


Here is a simple logging system for tracking model performance. Import necessary modules:


import datetime

import csv


Define the performance logger class:


class ModelPerformanceLogger:

    """

    Logs model interactions for performance monitoring.

    """

    

    def __init__(self, log_file: str = "model_interactions.csv"):

        """

        Initializes the performance logger.

        

        Args:

            log_file: Path to the CSV log file

        """

        self.log_file = log_file

        self._initialize_log_file()

    

    def _initialize_log_file(self):

        """Creates the log file with headers if it doesn't exist."""

        try:

            with open(self.log_file, 'x', newline='', encoding='utf-8') as f:

                writer = csv.writer(f)

                writer.writerow([

                    'timestamp', 'prompt', 'response', 

                    'user_rating', 'notes'

                ])

        except FileExistsError:

            pass

    

    def log_interaction(self, prompt: str, response: str, 

                       user_rating: int = None, notes: str = ""):

        """

        Logs a model interaction.

        

        Args:

            prompt: The input prompt

            response: The model's response

            user_rating: Optional rating from 1-5

            notes: Optional notes about the interaction

        """

        timestamp = datetime.datetime.now().isoformat()

        

        with open(self.log_file, 'a', newline='', encoding='utf-8') as f:

            writer = csv.writer(f)

            writer.writerow([timestamp, prompt, response, user_rating, notes])

    

    def get_low_rated_interactions(self, threshold: int = 3):

        """

        Retrieves interactions with low ratings for review.

        

        Args:

            threshold: Maximum rating to include

        

        Returns:

            List of low-rated interactions

        """

        low_rated = []

        

        with open(self.log_file, 'r', encoding='utf-8') as f:

            reader = csv.DictReader(f)

            for row in reader:

                if row['user_rating'] and int(row['user_rating']) <= threshold:

                    low_rated.append(row)

        

        return low_rated


This logger tracks all interactions with your model, including optional user ratings. You can periodically review low-rated interactions to identify areas where the model needs improvement.
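

To see trends at a glance, you could also aggregate the ratings stored in the log. The following sketch assumes the same CSV columns that the logger writes and treats an empty user_rating as "not rated". The function name summarize_ratings is hypothetical, and the demo writes a small temporary log so it runs without real data.

```python
import csv
import os
import tempfile
from collections import Counter


def summarize_ratings(log_file: str) -> dict:
    """Returns the number of rated rows, the average rating, and the distribution."""
    counts = Counter()
    with open(log_file, "r", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rating = (row.get("user_rating") or "").strip()
            if rating:
                counts[int(rating)] += 1

    total = sum(counts.values())
    average = sum(r * n for r, n in counts.items()) / total if total else None
    return {
        "rated": total,
        "average": average,
        "distribution": dict(sorted(counts.items())),
    }


# Demo log with illustrative rows; the second row has no rating.
rows = [
    ["timestamp", "prompt", "response", "user_rating", "notes"],
    ["2026-01-01T10:00:00", "What is a PLC?", "stand-in response", "5", ""],
    ["2026-01-01T10:05:00", "What is SCADA?", "stand-in response", "", ""],
    ["2026-01-01T10:10:00", "What is a DCS?", "stand-in response", "2", "too vague"],
    ["2026-01-01T10:15:00", "What is an HMI?", "stand-in response", "3", ""],
]

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    csv.writer(tmp).writerows(rows)

summary = summarize_ratings(tmp.name)
os.remove(tmp.name)
print(summary)
```

Tracking the average and the distribution over time gives an early signal of whether a new fine-tuning iteration actually improved the model in practice.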


Based on logged interactions, you can create new training examples to address identified weaknesses:


def create_training_from_corrections(interaction_log: str, 

                                    corrections_file: str,

                                    output_file: str):

    """

    Creates new training data from corrected model responses.

    

    Args:

        interaction_log: Path to the interaction log CSV

        corrections_file: Path to file with corrected responses

        output_file: Path for output training data JSONL

    """

    import csv

    import json

    

    corrections = {}

    with open(corrections_file, 'r', encoding='utf-8') as f:

        reader = csv.DictReader(f)

        for row in reader:

            corrections[row['timestamp']] = row['corrected_response']

    

    training_examples = []

    with open(interaction_log, 'r', encoding='utf-8') as f:

        reader = csv.DictReader(f)

        for row in reader:

            if row['timestamp'] in corrections:

                example = {

                    "messages": [

                        {

                            "role": "system",

                            "content": "You are a helpful assistant."

                        },

                        {

                            "role": "user",

                            "content": row['prompt']

                        },

                        {

                            "role": "assistant",

                            "content": corrections[row['timestamp']]

                        }

                    ]

                }

                training_examples.append(example)

    

    with open(output_file, 'w', encoding='utf-8') as f:

        for example in training_examples:

            f.write(json.dumps(example, ensure_ascii=False) + '\n')

    

    print(f"Created {len(training_examples)} training examples from corrections")


This function takes logged interactions and a file of corrections, then generates new training data. This creates a feedback loop where real-world usage directly improves the model through subsequent fine-tuning iterations.


CONCLUSION


Fine-tuning local LLMs has become accessible to individual developers and small teams through tools like Ollama and Apple MLX. The key to successful fine-tuning lies in understanding the fundamentals: preparing high-quality training data, choosing appropriate hyperparameters, and thoroughly evaluating results.


Start with a small, high-quality dataset and iterate based on results. Use parameter-efficient methods like LoRA to make fine-tuning practical on consumer hardware. Leverage quantization to work with larger models within your memory constraints. Monitor your deployed model and use real-world feedback to continuously improve through additional fine-tuning iterations.

The techniques covered in this tutorial provide a solid foundation for fine-tuning LLMs for your specific needs. Whether you are adapting a model to a specialized domain, teaching it a particular writing style, or improving its performance on specific tasks, these approaches will help you achieve your goals efficiently on local hardware.


Remember that fine-tuning is both a science and an art. While the technical steps are straightforward, achieving optimal results requires experimentation, careful evaluation, and iteration. Use the code examples and techniques presented here as starting points, and adapt them to your specific requirements and constraints.