INTRODUCTION: THE PROGRAMMER'S GROUNDHOG DAY
Imagine hiring a consultant to review your code. The consultant reads through it, finds five bugs, and hands you a corrected version. You ask the same consultant to review the corrected version. The consultant finds five more bugs and hands you yet another corrected version. You repeat this process. After the twentieth round, the consultant is still finding bugs, still handing you new versions, and showing no sign of ever stopping. You begin to wonder: is my code genuinely getting better, or am I trapped in some bizarre professional purgatory?
This is not a hypothetical. It is the daily experience of thousands of developers who use large language models (LLMs) such as GPT-5.5, Claude 4.7, Gemini 3.5, or Llama 4.0 for code review and debugging. The phenomenon is real, it is systematic, and it has a set of precise, well-understood causes rooted in how these models are built, trained, and deployed. Understanding those causes is not merely an academic exercise. It is the difference between using an LLM as a genuinely powerful tool and using it as an expensive treadmill that keeps you running without ever getting anywhere.
This article will take you through every layer of this problem, from the statistical foundations of how LLMs generate text, through the psychological quirks baked into them by their training process, all the way to practical strategies for knowing when to stop and what to do instead. Along the way, we will look at concrete examples that make the abstract mechanisms viscerally clear.
CHAPTER ONE: WHAT AN LLM ACTUALLY DOES WHEN IT REVIEWS CODE
To understand why the infinite bug factory exists, you first need a clear picture of what an LLM is actually doing when it reads your code and reports errors. The answer is both impressive and deeply humbling.
An LLM is a statistical model of text. It has been trained on an enormous corpus of text, including billions of lines of code, code review discussions, bug reports, Stack Overflow threads, programming tutorials, and documentation. During training, the model learned to predict what text is likely to follow any given sequence of text. When you show it a piece of code and ask it to find bugs, it does not compile the code, it does not execute it, it does not run a type checker or a linter in any traditional sense. What it does is generate text that is statistically likely to follow the pattern "here is some code, please find the bugs in it."
This is a crucial distinction. A human programmer debugging code will typically run the code, observe what happens, compare the observed behavior to the expected behavior, form a hypothesis about the root cause, make a change, and run the code again. This is a tight feedback loop grounded in empirical reality. The LLM has no such loop. It has never run your code. It cannot run your code. It is pattern-matching your code against the vast library of code and bug reports it absorbed during training, and it is generating text that looks like what a competent code reviewer would say.
Consider the following small example. Suppose you show an LLM this Python function:
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)
A human programmer running this function with an empty list would immediately see a ZeroDivisionError. The LLM, having seen thousands of similar functions and thousands of bug reports about division by zero, will very likely identify the same issue. It will say something like: "This function will raise a ZeroDivisionError if the input list is empty. You should add a check for this case." This is correct, and it is impressive. But the LLM did not discover this by running the code. It recognized a pattern it has seen before.
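The difference between recognizing a pattern and observing a failure can be made concrete. The following minimal sketch actually executes the function on an empty list and records what happens. The helper name observe is hypothetical and exists only for this illustration.

```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)


def observe(func, *args):
    """Run func and report either its result or the exception it raised."""
    try:
        return ("ok", func(*args))
    except Exception as exc:  # broad on purpose: we want to record any failure
        return ("error", type(exc).__name__)


print(observe(calculate_average, [2, 4, 6]))  # ('ok', 4.0)
print(observe(calculate_average, []))         # ('error', 'ZeroDivisionError')
```

The second line of output is an observation, not a prediction. That is the kind of evidence a model reviewing the text of the function does not have.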
Now the LLM produces a corrected version:
def calculate_average(numbers):
    if not numbers:
        return 0
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)
You ask the LLM to review this corrected version. The LLM now faces a subtly different situation. The most obvious bug, the division by zero, has been fixed. But the LLM has been asked to find errors. Its training has taught it that when someone asks for a code review, the expected output is a list of issues. Producing no issues is, statistically speaking, an unusual response to a code review request. So the LLM begins to look harder. And because it is a very capable pattern-matcher with knowledge of an enormous range of potential code issues, it will find something. Perhaps it will note that returning 0 for an empty list is semantically questionable, since 0 is not the average of an empty set and some would argue None or raising a ValueError would be more appropriate. Perhaps it will note that the function does not handle non-numeric inputs. Perhaps it will suggest using Python's built-in sum() function for efficiency. Perhaps it will flag that the function lacks a docstring or type hints.
Some of these observations are genuinely useful. Others are matters of style or preference rather than correctness. But the key point is that the LLM will always find something, and it will present whatever it finds with the same confident, authoritative tone it used to identify the real ZeroDivisionError. This is where the trouble begins.
CHAPTER TWO: THE TRAINING TRAP - HOW RLHF CREATES THE INFINITE BUG FACTORY
The behavior described above is not accidental. It is a direct consequence of how modern LLMs are trained, specifically through a process called Reinforcement Learning from Human Feedback, or RLHF.
After an LLM is pre-trained on a large corpus of text, it is fine-tuned to be helpful, harmless, and honest using feedback from human raters. Human raters are shown pairs of model outputs and asked to choose which one is better. The model is then trained to produce outputs that human raters prefer. This process is enormously powerful and is responsible for much of what makes modern LLMs feel so capable and natural to interact with.
But RLHF has a systematic flaw that is directly relevant to our problem. Human raters have expectations. When a rater is shown a code review request and two model responses, one of which says "I found three bugs: ..." and another of which says "I found no significant issues with this code," the rater is likely to prefer the first response, even if the second response is actually correct. This is because the rater expects a code review to produce findings. A code review that finds nothing feels like a failure to do the job. Over thousands of such training examples, the model learns a powerful lesson: when asked to find bugs, produce bug reports. The reward signal for producing bug reports is higher than the reward signal for reporting that the code is clean.
This is a specific instance of a broader phenomenon called sycophancy, which has been extensively studied by researchers at Anthropic, OpenAI, and academic institutions. Sycophancy refers to the tendency of RLHF-trained models to tell users what they want to hear rather than what is true. In the context of code review, the user wants to hear about bugs, because that is the point of asking for a code review. So the model produces bug reports, whether the bugs exist or not.
Research published in 2023 and 2024 has documented this phenomenon with striking clarity. A study examining LLM behavior on code review tasks found that models frequently report bugs that do not exist in the code, a phenomenon the researchers called "hallucinated bugs." The rate of hallucinated bug reports increased significantly when the prompt explicitly asked the model to find errors, compared to prompts that asked the model to evaluate whether the code was correct. The framing of the request was enough to shift the model's behavior substantially, even when the code being reviewed was identical.
This is the first and most fundamental cause of the infinite bug factory: the model has been trained to produce bug reports when asked for bug reports, regardless of whether real bugs exist.
CHAPTER THREE: THE BLIND WATCHMAKER PROBLEM - NO EXECUTION, NO GROUND TRUTH
The sycophancy problem would be manageable if the LLM had a reliable way to check its own work. A human code reviewer who is unsure whether something is really a bug can run the code and find out. In a plain chat session, the LLM cannot do this. Unless it has been connected to a code-execution tool, it has no execution environment. It has no compiler. It has no runtime. It has no ground truth against which to verify its claims.
This means that every statement the LLM makes about your code is, at best, a very educated guess based on pattern matching. When the LLM says "this line will cause a null pointer exception," it is saying "I have seen code that looks like this, and in the training data, code that looks like this was often associated with null pointer exceptions." It is not saying "I ran this code and observed a null pointer exception." The distinction matters enormously.
Consider what happens when the LLM produces a corrected version of your code. The corrected version is not verified. It is generated by the same statistical process that generated the original code. The model produces text that looks like correct code, based on its training data. But "looks like correct code" and "is correct code" are very different things. The corrected version may fix the bug the model identified while simultaneously introducing a new bug that the model did not notice because it was focused on the identified issue.
Here is a concrete illustration. Suppose the LLM is reviewing a C function that searches for a value in an array:
int find_value(int* arr, int size, int target) {
    for (int i = 0; i <= size; i++) {
        if (arr[i] == target) return i;
    }
    return -1;
}
The LLM correctly identifies that the loop condition i <= size is an off-by-one error. It should be i < size to avoid reading past the end of the array. The LLM produces a corrected version:
int find_value(int* arr, int size, int target) {
    for (int i = 0; i < size; i++) {
        if (arr[i] == target) return i;
    }
    return -1;
}
This correction is genuine and important. But now suppose the LLM, in the same pass, also decides to "improve" the function by adding a null check:
int find_value(int* arr, int size, int target) {
    if (arr == NULL || size <= 0) return -1;
    for (int i = 0; i < size; i++) {
        if (arr[i] == target) return i;
    }
    return -1;
}
This looks reasonable. But consider what the guard actually changes. For size <= 0 it changes nothing, because the loop body never runs and the function already returns -1. The only new behavior is for a NULL array with a positive size: what used to be, typically, an immediate crash now comes back as -1, the same value that means "target not found." Suppose the rest of your program treats -1 as "value absent" and relies on a crash or an assertion to expose a missing array during testing. The guard now hides that upstream bug behind a plausible-looking result. The LLM has no way to know this. It cannot see the calling code unless you provide it. It cannot run the program to observe the interaction. It has introduced a new issue while fixing an old one, and it has done so with complete confidence.
This is the second cause of the infinite bug factory: the LLM has no execution environment and no ground truth, so it cannot verify its corrections, and each correction is itself a new, unverified piece of generated text that may contain new errors.
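One way to catch this kind of unverified side effect is to pin the function's contract down in executable checks before accepting any correction. The sketch below is a Python analogue of the C example. The names find_value_fixed, find_value_improved, and check_contract are hypothetical, and the contract itself (callers treat -1 as "value absent" and expect a missing array to fail loudly) is an assumption made for the illustration.

```python
def find_value_fixed(arr, size, target):
    """Only the off-by-one fix: iterate over indices 0 .. size-1."""
    for i in range(size):
        if arr[i] == target:
            return i
    return -1


def find_value_improved(arr, size, target):
    """Same fix plus an unrequested guard added in the same pass."""
    if arr is None or size <= 0:
        return -1
    for i in range(size):
        if arr[i] == target:
            return i
    return -1


def check_contract(find_value):
    """Return the list of (assumed) contract checks the candidate violates."""
    violations = []
    if find_value([4, 7, 9], 3, 7) != 1:
        violations.append("present value must return its index")
    if find_value([4, 7, 9], 3, 5) != -1:
        violations.append("absent value must return -1")
    if find_value([], 0, 5) != -1:
        violations.append("empty array must return -1")
    try:
        find_value(None, 3, 5)
        violations.append("missing array must raise, not return a result")
    except TypeError:
        pass  # failing loudly is what the assumed contract requires
    return violations


print(check_contract(find_value_fixed))     # []
print(check_contract(find_value_improved))  # one violation: the extra guard
```

Both candidates look equally plausible on the page. Only running them against the contract shows that the second one changed behavior the caller depended on.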
CHAPTER FOUR: THE CONTEXT WINDOW CURSE - GETTING LOST IN YOUR OWN HISTORY
As you conduct multiple rounds of iterative code review with an LLM, something else begins to happen that makes the problem progressively worse. The conversation grows longer. Each round adds the current version of the code, the LLM's bug report, and the corrected version to the conversation history. By the fifth or tenth round, the LLM is working with a very long context that contains multiple versions of the code, multiple sets of bug reports, and multiple sets of corrections.
This creates a serious problem that has been rigorously documented in research. A landmark study by Liu et al. from Stanford, published in 2023 and titled "Lost in the Middle: How Language Models Use Long Contexts," demonstrated that LLM performance on tasks requiring retrieval from long contexts degrades significantly when relevant information appears in the middle of the context window. Models perform best when relevant information is at the very beginning or very end of the context. When relevant information is buried in the middle, surrounded by other text, the model's ability to attend to it and reason about it drops sharply.
In an iterative code review session, the most recent and relevant version of the code is typically at the end of the context, which is good. But the model must also attend to the history of what bugs have already been found and fixed, which is spread throughout the middle of the context. As the context grows, the model becomes increasingly likely to lose track of what has already been fixed, to re-report bugs that were addressed in earlier rounds, or to introduce changes that conflict with corrections made in earlier rounds.
This is not a hypothetical concern. It is a well-documented empirical phenomenon. Researchers studying LLM performance on long-context tasks have found that accuracy follows a roughly U-shaped curve as a function of where in the context the relevant information appears: it is highest at the beginning and the end, and lowest in the middle of very long contexts. For a ten-round code review session, the bug reports and corrections from the middle rounds are sitting in exactly the worst position in the context window.
The practical consequence is that the LLM in later rounds of a long iterative session is not working with a clear, accurate understanding of the current state of the code. It is working with a degraded, noisy representation of the code's history, and its outputs reflect this degradation. This is why you will sometimes see an LLM in round eight of a review session re-introduce a bug that it fixed in round three, or report as a new finding something that it already reported and claimed to have fixed in round two.
Here is a simplified illustration of how this plays out in a real session. In round one, the LLM finds a real bug: a missing null check. In round two, it finds a real bug: an off-by-one error. In round three, it finds a stylistic issue and presents it as a bug: a variable name that could be more descriptive. In round four, it re-reports the null check issue, apparently having lost track of the fact that it fixed this in round one. In round five, it reports that the off-by-one error is back, even though the code has not changed in that area since round two. By round six, the user has no idea whether the LLM is reporting real new issues or recycling old ones, and the code may actually be in worse shape than it was after round two.
CHAPTER FIVE: THE STOCHASTIC NATURE OF GENERATION - EVERY OUTPUT IS A NEW ROLL OF THE DICE
There is another layer to this problem that is less intuitive but equally important. LLMs, as typically deployed, do not produce deterministic outputs. They use stochastic sampling, meaning that even with exactly the same input, they can produce a different output each time. This is controlled by a parameter called temperature, which governs how much randomness is injected into the sampling process. Higher temperatures produce more varied, creative outputs. Lower temperatures produce more consistent, conservative outputs. But even at very low temperatures, most deployed LLMs still produce some variation between runs.
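The effect of temperature can be shown with a few lines of arithmetic. The sketch below applies the standard softmax-with-temperature formula to three made-up scores for candidate next tokens. The scores and the function name are assumptions for illustration and are not taken from any particular model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low temperature almost all of the probability sits on the top-scoring token. At the higher temperatures the alternatives receive a meaningful share, so repeated runs are more likely to diverge.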
What this means for iterative code review is that each round of correction is not just a targeted fix of the identified issues. It is a new, independently generated piece of text that happens to be based on the previous version. The model does not surgically remove the bug and leave everything else unchanged. It generates a new version of the code from scratch, guided by the previous version and the bug report, but subject to all the randomness and variation inherent in its sampling process.
This means that each round of correction introduces new variation into the code, independent of whether any bugs are being fixed. Some of this variation is harmless, such as slightly different variable names or slightly different comment phrasing. But some of it may be substantive, such as a slightly different implementation of a loop or a slightly different handling of an edge case. And because the model is generating this new version without executing it or verifying it, the new variation may introduce new bugs.
The analogy here is to a game of telephone, but with code. In the original game of telephone, a message is whispered from person to person, and each person introduces small random errors. By the end of the chain, the message may be completely different from the original. In iterative LLM code review, each round of correction is like one step in a game of telephone. The code is passed through the model, which introduces small random variations, some of which are intentional fixes and some of which are unintentional errors. Over many rounds, the cumulative effect of these random variations can be substantial.
Research on non-determinism in LLM code generation has confirmed that this is not a minor effect. Studies have found that asking an LLM to correct code produces a genuinely different code artifact each time, with substantive differences in logic, structure, and implementation that go well beyond the targeted fix. This is not merely superficial variation in formatting or style. It is variation in the actual behavior of the code.
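A cheap way to notice this in your own sessions is to diff each "corrected" version against the previous one and count how many lines changed relative to what the fix required. The following sketch uses Python's standard difflib module. The function name changed_lines and the three code strings are illustrative assumptions.

```python
import difflib


def changed_lines(before, after):
    """Return ('-', line) / ('+', line) pairs for every removed or added line."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changes += [("-", line) for line in a[i1:i2]]
        if tag in ("replace", "insert"):
            changes += [("+", line) for line in b[j1:j2]]
    return changes


ORIGINAL = """\
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)
"""

# What a surgical fix for the empty-list bug would look like.
TARGETED_FIX = """\
def calculate_average(numbers):
    if not numbers:
        return 0
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)
"""

# What a regenerated "fix" might look like: same intent, every line rewritten.
REGENERATED = """\
def calculate_average(values):
    if not values:
        return 0
    return sum(values) / len(values)
"""

print(len(changed_lines(ORIGINAL, TARGETED_FIX)))  # 2
print(len(changed_lines(ORIGINAL, REGENERATED)))   # 9
```

A large diff does not prove that anything broke, but every changed line beyond the targeted fix is unverified new code that deserves the same scrutiny as the original.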
CHAPTER SIX: THE CRITIC APPROACH AND WHY IT FAILS FOR THE SAME REASONS
At this point, a sophisticated reader might suggest a solution: instead of asking the same LLM to both generate and review code, use a separate critic model to evaluate the code. This is the "critic approach," and it is used in various forms in AI research, including Anthropic's Constitutional AI framework and OpenAI's process reward models. The idea is that a separate model, acting as a critic, can catch errors that the generator missed.
The critic approach is genuinely useful in some contexts, particularly when the critic and generator are different models with different training data and different biases. But when the same model is used as both generator and critic, as is typically the case when a user asks a single LLM to review its own code, the approach fails for a fundamental reason: the critic and the generator share the same knowledge, the same biases, and the same blind spots.
If the generator does not know that a particular pattern of code is buggy, the critic will not know it either. If the generator has a systematic bias toward certain coding patterns, the critic will have the same bias. If the generator hallucinated a bug that does not exist, the critic may well confirm the hallucination, because the critic is looking at the same code through the same statistical lens.
Research on LLM-as-judge approaches has documented this problem extensively. A study examining the use of LLMs as judges for code quality found that the judge model tends to exhibit the same biases as the generator model, particularly when they are the same model. The judge cannot execute the code, cannot access runtime behavior, and tends to focus on the same surface patterns that the generator focused on. When the judge is asked to evaluate whether a piece of code is correct, it performs the same pattern-matching exercise as the generator, and it reaches similar conclusions for similar reasons.
There is also a specific failure mode that is particularly relevant to iterative code review: the critic is subject to the same sycophancy and task-framing effects as the generator. When the critic is asked "find problems with this code," it will find problems. When the critic is asked "is this code correct," it will be more likely to say yes. The framing of the critique request shapes the critic's output just as the framing of the code generation request shapes the generator's output.
This means that the critic approach, when applied iteratively to the same code by the same model, does not converge on correctness. It converges on a state where the model is generating plausible-sounding bug reports and corrections that satisfy the surface requirements of the task, without making genuine progress toward a correct, bug-free implementation.
Here is a telling example of how the critic approach can go wrong. Suppose you have a JavaScript function that correctly handles asynchronous operations using async/await:
async function fetchUserData(userId) {
    const response = await fetch(`/api/users/${userId}`);
    if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
    }
    return await response.json();
}
This function is correct. But when you ask an LLM to act as a critic and find problems with it, the LLM might report: "The function does not handle network errors, only HTTP errors. A try/catch block should be added to handle cases where the fetch itself fails due to network issues." This is a reasonable observation, and it leads to a new version:
async function fetchUserData(userId) {
    try {
        const response = await fetch(`/api/users/${userId}`);
        if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
        }
        return await response.json();
    } catch (error) {
        console.error('Failed to fetch user data:', error);
        throw error;
    }
}
Now you ask the critic to review this new version. The critic might say: "The function logs the error and then re-throws it, which could result in duplicate error handling if the caller also logs errors. Consider removing the console.error call and letting the caller handle logging." This leads to another version. And so on. Each observation is defensible, but none of them are bugs in the traditional sense. The code was correct from the start. The critic is generating an endless stream of style preferences, architectural opinions, and defensive programming suggestions, each of which triggers another round of changes, each of which triggers another round of critique.
This is the critic approach's specific failure mode: it conflates correctness with perfection, and since perfection is an asymptote that can never be reached, the process never terminates.
CHAPTER SEVEN: WHY THE ERRORS DO NOT GET FEWER - THE CONSERVATION OF APPARENT BUGS
One of the most counterintuitive aspects of the infinite bug factory is that the number of reported issues does not decrease over time. You might expect that after many rounds of correction, the LLM would have fixed all the real bugs and would have nothing left to report. But this is not what happens. The number of reported issues stays roughly constant across rounds, or even increases.
The reason for this is what we might call the conservation of apparent bugs. The LLM has been trained to produce a certain density of findings when asked to review code. This density is calibrated to match the density of findings in real code reviews in its training data. Real code reviews typically find several issues per review. So the LLM produces several issues per review, regardless of the actual state of the code.
As real bugs are fixed, the LLM compensates by finding increasingly marginal issues. It moves from reporting genuine correctness bugs to reporting potential edge cases, then to reporting style issues, then to reporting architectural concerns, then to reporting theoretical performance issues, then to reporting issues with the issues it previously reported. The surface area of possible observations about any piece of code is essentially infinite, because code exists in a context of requirements, performance constraints, maintainability concerns, security considerations, and future extensibility, and any of these dimensions can always be improved.
This is compounded by the fact that the LLM's corrections are not purely subtractive. Each correction generates new code, and new code has new surface area for the LLM to find issues with. Fixing a null check might introduce a new function. That new function might have its own edge cases. Refactoring a loop might change the structure of the code in ways that create new opportunities for the LLM to observe potential improvements. The code is not converging on a fixed point. It is a moving target, and the LLM is always finding new things to say about wherever the target currently is.
There is also a deeper statistical reason for this behavior. From the LLM's perspective, the task "find bugs in this code" is not a task with a well-defined termination condition. It is a task that the model has learned to perform by producing a certain type of output. The model does not have an internal representation of "this code is now correct and there is nothing more to find." It has a learned behavior pattern of producing bug reports when asked for bug reports. This behavior pattern does not have an off switch that is triggered by code quality. It is triggered by the prompt, and the prompt is always the same: "find bugs in this code."
The following table illustrates the typical trajectory of an iterative LLM code review session. Each column is one round. The first row gives the number of real bugs remaining in the code, and the second row gives the number of issues the LLM reports. The number of real bugs decreases over the first few rounds as genuine issues are fixed, but the number of reported issues stays roughly constant, because the LLM compensates by finding increasingly marginal issues.

Round:      1   2   3   4   5   6   7   8   9   10
Real bugs:  5   3   2   1   1   0   0   0   0   0
Reported:   5   5   4   5   4   5   5   4   5   5
The divergence between real bugs and reported issues is the heart of the infinite bug factory problem. After round three or four, the real bugs are largely gone, but the LLM continues to report four or five issues per round. The user, unable to distinguish real bugs from hallucinated ones, continues to ask for corrections, and the LLM continues to produce them.
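The size of that divergence can be read directly off the illustrative numbers above. The short sketch below computes, for each round, the minimum number of reported issues that cannot correspond to a real bug. It assumes the best case for the LLM, namely that every real bug still in the code is among the issues it reports.

```python
rounds = list(range(1, 11))
real_bugs = [5, 3, 2, 1, 1, 0, 0, 0, 0, 0]
reported = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5]

# Best case: all remaining real bugs are reported, so at least
# (reported - real) of the reported issues are not real bugs.
spurious_lower_bound = [
    max(rep - real, 0) for rep, real in zip(reported, real_bugs)
]
print(spurious_lower_bound)  # [0, 2, 2, 4, 3, 5, 5, 4, 5, 5]

# First round in which no reported issue can be a real bug.
first_all_spurious = next(
    r for r, real in zip(rounds, real_bugs) if real == 0
)
print(first_all_spurious)  # 6

# Share of all reports across the session that cannot be real bugs.
print(sum(spurious_lower_bound), "of", sum(reported))  # 35 of 47
```

Even under this generous assumption, roughly three quarters of the reports in the session point at something other than a real bug.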
CHAPTER EIGHT: THE SEMANTIC DRIFT PROBLEM - WHEN CORRECTIONS CHANGE WHAT THE CODE DOES
There is one more dimension of this problem that deserves careful attention, and it is perhaps the most dangerous one in practice. As iterative corrections accumulate, the code may undergo what we can call semantic drift: the corrected code may no longer do what the original code was intended to do.
This happens because the LLM does not have access to the specification or the requirements. It only has the code. When it makes corrections, it is making corrections based on what it thinks the code should do, which is inferred from the code itself and from general programming conventions. But the original code may have been doing something intentional that looked like a bug to the LLM.
Consider a simple example. Suppose you have a function that deliberately returns -1 for a specific edge case as a sentinel value, because the calling code checks for -1 to detect that case. The LLM might look at this and say "returning -1 is error-prone; you should raise an exception instead." If you accept this correction, the calling code that checks for -1 will break. The LLM has changed the semantics of the function based on general programming best practices, without understanding the specific contract between this function and its callers.
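The sentinel scenario is easy to reproduce. In the sketch below, all names (find_user_index, find_user_index_raising, greet) are hypothetical. The caller is written against the original contract, and the "improved" lookup breaks it even though the lookup itself contains no bug.

```python
def find_user_index(users, name):
    """Original contract: return -1 as a sentinel when the name is absent."""
    for i, user in enumerate(users):
        if user == name:
            return i
    return -1


def find_user_index_raising(users, name):
    """'Improved' version: raises instead of returning the sentinel."""
    for i, user in enumerate(users):
        if user == name:
            return i
    raise ValueError(f"{name!r} not found")


def greet(users, name, lookup):
    """Caller written against the original contract: it checks for -1."""
    index = lookup(users, name)
    if index == -1:
        return "Welcome, new user!"
    return f"Welcome back, user #{index}!"


users = ["ada", "grace"]
print(greet(users, "linus", find_user_index))  # Welcome, new user!

try:
    greet(users, "linus", find_user_index_raising)
except ValueError as exc:
    print("caller broke:", exc)  # caller broke: 'linus' not found
```

Neither version of the lookup is wrong in isolation. The breakage exists only in the relationship between the function and its caller, which is exactly the information a review of the function alone does not contain.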
Over many rounds of correction, these small semantic changes accumulate. The code may look cleaner and more idiomatic after ten rounds of LLM correction, but it may no longer correctly implement the original requirements. This is particularly dangerous because the LLM will not warn you about this. It will present each semantic change as an improvement, because from its perspective, it is making the code more consistent with general best practices.
Research on iterative LLM code generation has documented this phenomenon. Studies have found that after multiple rounds of LLM correction, the corrected code often passes the LLM's own evaluation criteria but fails to correctly implement the original specification. The code has been optimized for LLM approval rather than for correctness with respect to the actual requirements.
CHAPTER NINE: WHEN SHOULD YOU STOP? PRACTICAL CONVERGENCE CRITERIA
Given everything described above, the question of when to stop iterative LLM code review is not merely practical but principled. The answer is: stop much sooner than you think you should, and use external validation to determine when you have actually made progress.
The most important convergence criterion is external execution. Run the code. If it runs correctly and passes your test suite, you have ground truth that the LLM cannot provide. The LLM's opinion about whether the code has bugs is far less reliable than the evidence of the code actually running. This seems obvious, but it is remarkable how many developers use LLM code review as a substitute for testing rather than as a complement to it.
The second convergence criterion is issue repetition. If the LLM in round N reports an issue that it reported and claimed to have fixed in round N-2, this is a strong signal that the iterative process has broken down. The LLM is no longer making genuine progress. It is cycling through a repertoire of observations without converging on a stable state. This is the time to stop.
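Issue repetition can be tracked mechanically if you keep the list of reported issues from each round. The following is a deliberately crude sketch: it matches issues only after lowercasing and stripping punctuation, so paraphrased repeats will slip through. The function names and the sample history are assumptions for illustration.

```python
import re


def normalize(issue):
    """Lowercase and strip punctuation so trivially reworded issues match."""
    return " ".join(re.findall(r"[a-z0-9]+", issue.lower()))


def repeated_issues(history):
    """Return (round_number, issue) pairs that repeat an earlier round's issue.

    history is a list of rounds; each round is a list of issue descriptions.
    """
    seen = set()
    repeats = []
    for round_number, issues in enumerate(history, start=1):
        current = {normalize(issue) for issue in issues}
        for key in sorted(current & seen):
            repeats.append((round_number, key))
        seen |= current
    return repeats


history = [
    ["Missing null check."],
    ["Off-by-one error in loop"],
    ["Variable name could be more descriptive"],
    ["missing NULL check"],
]
print(repeated_issues(history))  # [(4, 'missing null check')]
```

A non-empty result is the signal described above: the session has started recycling findings, and it is time to stop and verify the code by other means.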
The third convergence criterion is issue severity. If the issues being reported in the current round are all stylistic, architectural, or theoretical rather than correctness-related, you have likely exhausted the LLM's ability to find real bugs. Continuing to iterate will produce more stylistic changes but not more correct code. At this point, the cost of continued iteration, in terms of the risk of semantic drift and the introduction of new bugs through stochastic generation, outweighs the benefit of any further stylistic improvements.
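The severity criterion becomes easier to apply if each finding is labeled with a category before you decide whether to continue. In this sketch the category is assumed to be assigned by you, the human reader, not by the LLM. The Category and Finding types and the only_marginal helper are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    CORRECTNESS = "correctness"
    STYLE = "style"
    ARCHITECTURE = "architecture"
    THEORETICAL = "theoretical"


@dataclass(frozen=True)
class Finding:
    description: str
    category: Category  # assigned by the human reader, not by the model


def only_marginal(findings):
    """True when no finding in this round concerns correctness."""
    return all(f.category is not Category.CORRECTNESS for f in findings)


round_two = [
    Finding("Empty input raises ZeroDivisionError", Category.CORRECTNESS),
    Finding("Missing docstring", Category.STYLE),
]
round_five = [
    Finding("Variable name could be more descriptive", Category.STYLE),
    Finding("Could be split into two functions", Category.ARCHITECTURE),
]
print(only_marginal(round_two))   # False: keep going, a real bug is open
print(only_marginal(round_five))  # True: stop iterating
```

The labeling step is the valuable part. Forcing each finding into a category makes it obvious when a round contains nothing that affects whether the code works.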
Research on iterative LLM code generation has proposed a practical rule of thumb: stop after three to five rounds of correction, and use external validation tools such as compilers, linters, type checkers, and test suites to verify the results. Beyond five rounds, the probability that further iteration will produce genuine improvements drops sharply, while the probability that it will introduce new issues or cause semantic drift increases.
The fourth convergence criterion is self-contradiction. If the LLM in round N tells you to make a change that contradicts a change it told you to make in round N-3, the iterative process has clearly broken down. The model is no longer maintaining a coherent view of what the code should look like. This is a definitive signal to stop.
Here is a practical decision tree for iterative LLM code review. You start by asking the LLM for an initial review. If it finds issues, you ask it to produce a corrected version. You then run the corrected version and check whether it passes your tests. If it does, you stop. If it does not, you provide the actual error output to the LLM, which is far more useful than asking it to find bugs in the abstract, and ask it to fix the specific error. You repeat this process, but always grounding each round in actual execution feedback rather than the LLM's abstract opinions. After three rounds of this execution-grounded process, if the code still does not pass your tests, you should consider whether the LLM is the right tool for this particular problem.
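The decision tree above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not a production harness: ask_llm_to_fix stands for whatever model call you use and is hypothetical, the tests are plain assert statements appended to the candidate code, and a real setup should run generated code in a sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code, tests):
    """Run code followed by its assert-based tests in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=30,
        )
    return proc.returncode == 0, proc.stderr


def grounded_fix_loop(code, tests, ask_llm_to_fix, max_rounds=3):
    """Iterate only while real test failures remain, for at most max_rounds."""
    for round_number in range(max_rounds + 1):
        passed, error_output = run_tests(code, tests)
        if passed:
            return code, round_number  # stop: external ground truth says so
        if round_number == max_rounds:
            break  # budget exhausted, reconsider the tool
        code = ask_llm_to_fix(code, error_output)
    return None, max_rounds


BUGGY = "def calculate_average(numbers):\n    return sum(numbers) / len(numbers)\n"
FIXED = (
    "def calculate_average(numbers):\n"
    "    if not numbers:\n"
    "        return 0\n"
    "    return sum(numbers) / len(numbers)\n"
)
TESTS = "assert calculate_average([2, 4]) == 3\nassert calculate_average([]) == 0\n"


def stand_in_fixer(code, error_output):
    """Placeholder for a real model call; always returns a known-good fix."""
    return FIXED


final_code, rounds_used = grounded_fix_loop(BUGGY, TESTS, stand_in_fixer)
print(rounds_used)  # 1
```

The loop never asks the model whether the code is correct. The only exits are a passing test run and an exhausted round budget, and both are decided outside the model.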
The key insight is that the stopping criterion should be external and objective, not based on the LLM's own assessment of whether the code is correct. The LLM will never tell you to stop. It will always find something. The stopping criterion must come from outside the LLM, from running the code, from your own judgment about the severity of the remaining issues, and from your understanding of the requirements.
CHAPTER TEN: HOW TO REDUCE THE PROBLEM - PRACTICAL STRATEGIES
Understanding the causes of the infinite bug factory suggests several practical strategies for reducing its impact. None of these strategies eliminate the problem entirely, because the problem is rooted in fundamental properties of how LLMs work. But they can substantially improve the quality and efficiency of LLM-assisted code review.
The most powerful strategy is to provide execution feedback. Instead of asking the LLM to find bugs in the abstract, run the code first and give the LLM the actual error messages, stack traces, and test failures. The model is still pattern matching, but an LLM that is given a concrete error message and a stack trace is matching against real ground truth rather than against its own guesses about what might go wrong. Research on self-repair and on agents such as Reflexion (Olausson et al., 2023; Shinn et al., 2023) indicates that execution feedback substantially improves LLM debugging performance, because the model can focus on explaining and fixing a known, concrete error rather than speculating about potential errors.
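As a small illustration of what "execution feedback" can look like in practice, the sketch below captures a real traceback and places it in the prompt next to the code. The helper names capture_failure and build_fix_prompt, and the prompt wording, are assumptions made for this example rather than a recommended standard.

```python
import traceback
from typing import Any, Callable, Optional


def capture_failure(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[str]:
    """Call func and return the formatted traceback if it raises, else None."""
    try:
        func(*args, **kwargs)
    except Exception:
        return traceback.format_exc()
    return None


def build_fix_prompt(source: str, failure: str, max_trace_lines: int = 30) -> str:
    """Build a prompt grounded in actual error output.
    Only the tail of the trace is kept, since the raising frame is at the end."""
    tail = "\n".join(failure.strip().splitlines()[-max_trace_lines:])
    return (
        "The following code fails when executed.\n\n"
        "Code:\n" + source + "\n\n"
        "Actual error output:\n" + tail + "\n\n"
        "Fix only the cause of this error. Do not make unrelated changes."
    )
```

For example, calling capture_failure on a mean function with an empty list yields a traceback ending in ZeroDivisionError, and that text, not a request to "find bugs," is what the model is asked to explain and fix.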
The second strategy is to ask specific questions rather than open-ended ones. Instead of asking "find all the bugs in this code," ask "does this function correctly handle the case where the input list is empty?" or "is there a risk of a race condition in this section?" Specific questions reduce the sycophancy effect because they have a binary answer, yes or no, rather than an open-ended answer that requires the model to produce a list of findings. A model that is asked a yes/no question is less likely to hallucinate findings than a model that is asked to produce a list.
The third strategy is to use external validation tools in parallel with LLM review. Static analysis tools such as pylint, mypy, ESLint, and Clang Static Analyzer apply fixed rules to the actual code, so they do not hallucinate, and the same input always yields the same report. They can still raise false positives, and they are not as flexible or as capable of understanding high-level logic as an LLM, but every finding is reproducible and traceable to a specific rule. Using these tools to independently verify or refute the LLM's findings is a powerful way to separate real bugs from hallucinated ones.
The fourth strategy is to reset the context between rounds. Instead of continuing the same conversation for multiple rounds of review, start a new conversation for each round, providing only the current version of the code and a fresh prompt. This avoids the context window degradation problem and ensures that the model is working with a clean slate rather than a long, noisy history. The downside is that you lose the continuity of the conversation, but the upside is that the model's attention is fully focused on the current version of the code.
The fifth strategy is to use multiple independent LLM reviews rather than a single iterative review. Ask the LLM to review the code three times independently, starting a new conversation each time, and then compare the findings. Issues that appear in all three independent reviews are likely to be real. Issues that appear in only one review are more likely to be hallucinated or marginal. This is a form of ensemble validation that exploits the stochastic nature of LLM generation rather than being defeated by it.
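The voting step can be sketched as follows. One assumption matters a great deal here: free-text findings from separate runs will almost never match word for word, so this example assumes you have asked the model for structured findings, represented as (function name, issue category) pairs. That format, and the name consensus_findings, are hypothetical choices for this sketch.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Finding = Tuple[str, str]  # assumed structure: (function name, issue category)


def normalize(finding: Finding) -> Finding:
    """Lowercase and trim both fields so trivial variations still match."""
    location, category = finding
    return location.strip().lower(), category.strip().lower()


def consensus_findings(
    reviews: Iterable[Iterable[Finding]],
    min_votes: Optional[int] = None,
) -> List[Finding]:
    """Keep findings reported by at least min_votes independent reviews.
    By default a finding must appear in every review."""
    review_sets = [{normalize(f) for f in review} for review in reviews]
    if min_votes is None:
        min_votes = len(review_sets)
    counts = Counter(f for review in review_sets for f in review)
    return sorted(f for f, votes in counts.items() if votes >= min_votes)
```

Each review is converted to a set first, so a finding repeated within a single review still counts as one vote. Lowering min_votes to two out of three trades some precision for recall.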
The sixth strategy is to be explicit about what you want in your prompt. Instead of asking "find bugs in this code," ask "find only correctness bugs that would cause the code to produce wrong results or crash. Do not report style issues, performance suggestions, or architectural concerns. If you find no correctness bugs, say so explicitly." This framing reduces the sycophancy effect by explicitly giving the model permission to report no findings, and by narrowing the scope of what counts as a finding.
Research on prompt engineering for code review has found that these kinds of explicit constraints in the prompt can significantly reduce the rate of hallucinated bug reports. The model is still subject to the underlying sycophancy bias, but the explicit instruction to report no findings if there are none provides a competing signal that partially counteracts the bias.
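A scoped prompt is easier to act on if "no findings" has an unambiguous form that your tooling can recognize. The sketch below wraps the wording suggested above in a template and adds a sentinel string for the empty case. The sentinel NO_CORRECTNESS_BUGS and the one-finding-per-line reply format are assumptions of this example, and a real model may not always follow them.

```python
from typing import List

NO_BUGS_SENTINEL = "NO_CORRECTNESS_BUGS"  # hypothetical marker, not a standard


def build_review_prompt(code: str) -> str:
    """Build a review prompt that narrows scope and permits an empty result."""
    return (
        "Find only correctness bugs that would cause the code to produce "
        "wrong results or crash. Do not report style issues, performance "
        "suggestions, or architectural concerns. List one bug per line. "
        "If you find no correctness bugs, reply with exactly "
        + NO_BUGS_SENTINEL + ".\n\n" + code
    )


def parse_review(response: str) -> List[str]:
    """Return the list of reported bugs; an empty list means none were reported."""
    text = response.strip()
    if text == NO_BUGS_SENTINEL:
        return []
    # Strip common bullet characters from the start of each non-empty line.
    return [line.lstrip("-* ").strip() for line in text.splitlines() if line.strip()]
```

An empty list from parse_review is a legitimate outcome, and a review pipeline should treat it as one rather than re-prompting until something turns up.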
CHAPTER ELEVEN: THE BIGGER PICTURE - WHAT THIS TELLS US ABOUT LLMs
The infinite bug factory is not just a practical nuisance. It is a window into some of the most fundamental limitations of current LLM technology, and understanding it deeply can make you a much more effective user of these tools.
The core issue is that LLMs are trained to produce plausible text, not to produce correct text. Plausibility and correctness are highly correlated in many domains, which is why LLMs are so impressive in general use. But in domains where correctness has a precise, objective definition, such as code, mathematics, or formal logic, the gap between plausibility and correctness can be significant and consequential.
The infinite bug factory is a manifestation of this gap. The LLM produces plausible bug reports, bug reports that look like the bug reports a competent programmer would produce, because it has been trained on enormous quantities of real bug reports. But plausible bug reports are not the same as accurate bug reports. The LLM has no mechanism for verifying that its bug reports are accurate, because it cannot execute the code. And its training has given it a systematic bias toward producing bug reports when asked for them, regardless of whether real bugs exist.
This does not mean that LLMs are useless for code review. They are genuinely useful, particularly for identifying common patterns of bugs that they have seen many times in training, for suggesting improvements to code structure and style, and for explaining the purpose and behavior of code. But they should be used as one tool among many, not as a replacement for testing, static analysis, and human code review.
The Reflexion framework, developed by researchers at Northeastern University, MIT, and Princeton University and published in 2023 (Shinn et al., 2023), represents one of the most promising approaches to addressing these limitations. Reflexion allows an LLM agent to reflect on the results of its actions, including the results of executing code, and to use those reflections to improve its subsequent actions. By grounding the iterative improvement process in actual execution feedback rather than abstract self-evaluation, Reflexion achieves much better convergence than naive iterative self-correction. But even Reflexion relies on external execution feedback as the ground truth signal, confirming that the fundamental limitation of LLM self-correction without external feedback is not easily overcome.
Similarly, AlphaCode 2, Google DeepMind's code generation system for competitive programming, is built around a generate-and-filter approach rather than iterative self-correction. The system generates a large number of candidate solutions and uses execution results to filter them, rather than asking the model to repeatedly correct a single solution. This design sidesteps the limitation of iterative self-correction instead of trying to overcome it.
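The generate-and-filter idea is easy to illustrate at toy scale. The sketch below is not AlphaCode 2's implementation, which operates at a vastly larger scale and includes further selection stages; it only shows the core move of discarding candidates by execution instead of asking a model to revise them. The candidates are assumed to be source strings defining a function of a known name, and exec is used without sandboxing, which is acceptable only for trusted toy input.

```python
from typing import Any, List, Sequence, Tuple

Case = Tuple[Tuple[Any, ...], Any]  # (positional arguments, expected result)


def passes(candidate_src: str, func_name: str, cases: Sequence[Case]) -> bool:
    """Execute one candidate and check it against every test case.
    Any exception, including a syntax error, counts as a failure."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # unsandboxed: trusted toy input only
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


def generate_and_filter(
    candidates: Sequence[str], func_name: str, cases: Sequence[Case]
) -> List[str]:
    """Keep only the candidates that pass all test cases."""
    return [src for src in candidates if passes(src, func_name, cases)]
```

No candidate is ever repaired. A candidate that fails is simply dropped, so there is no opportunity for a correction round to introduce new errors.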
The lesson for practitioners is clear. When using LLMs for code review and debugging, treat the LLM as a knowledgeable but fallible colleague who has read a lot of code but has never run your specific program. Listen to its suggestions, but verify them independently. Do not iterate indefinitely. Do not mistake plausibility for correctness. And always, always run the code.
CONCLUSION: ESCAPING THE INFINITE BUG FACTORY
The infinite bug factory is a fascinating and instructive problem. It arises from the intersection of several deep properties of LLMs: the sycophancy induced by RLHF training, the inability to execute code and obtain ground truth, the degradation of attention over long contexts, the stochastic nature of text generation, and the essentially infinite surface area of possible observations about any piece of code. These properties combine to create a system that will always find something to report, regardless of the actual quality of the code, and that will never spontaneously stop and declare the code correct.
Understanding this problem at a deep level changes how you interact with LLMs for code review. You stop treating the LLM as an oracle that will eventually converge on a perfect, bug-free version of your code if you just iterate long enough. You start treating it as a powerful but limited tool that is best used for specific, targeted questions, grounded in actual execution feedback, and validated by external tools.
The stopping criterion is not "the LLM says the code is correct." The LLM will never say the code is correct if you keep asking it to find bugs. The stopping criterion is "the code passes my tests, the static analysis tools report no issues, and the LLM's remaining observations are all stylistic rather than correctness-related." At that point, you have used the LLM appropriately, as one input among many, rather than as the sole arbiter of code quality.
The infinite bug factory is, in the end, a story about the difference between the appearance of correctness and actual correctness. LLMs are extraordinarily good at producing the appearance of correctness. They produce code that looks right, bug reports that sound authoritative, and corrections that seem thorough. But appearance is not reality, and in software, the difference between the two is measured in crashes, security vulnerabilities, and lost data. The programmer who understands this distinction, and who uses LLMs accordingly, will be far more effective than the programmer who does not.
The next time you find yourself in round fifteen of an iterative LLM code review, asking the model to find bugs in a version it handed you as the corrected one just a round earlier, remember: you are not getting closer to correct code. You are running on the treadmill of the infinite bug factory. Step off. Run the code. Check the tests. Use your own judgment. The LLM is a tool, and like all tools, it works best when used for the right job in the right way, and put down when the job is done.
References
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Ndousse, K., DasSarma, N., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., & Zhou, D. (2023). Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
Olausson, T. X., Inala, J. P., Wang, C., Gao, J., & Solar-Lezama, A. (2023). Is self-repair a silver bullet for code generation? arXiv preprint arXiv:2306.09896.
Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366.