Friday, August 22, 2025

Hallucinations and how to reduce them

Hallucinations in large language models are outputs that are fluent and confident but not supported by the input context, by external facts, or by the training data in a way that would make them correct. From a software engineering perspective they are a class of system error where the model returns a well-formed string that does not meet the correctness or faithfulness specification of the task. They matter because they degrade user trust, break downstream automations, and create compliance and safety risks, especially when the outputs are fed into pipelines that assume factuality.


To get clear about why hallucinations happen, it helps to start with what a modern language model actually does. At inference time it performs next-token prediction: given a sequence of tokens, it computes a probability distribution over the next token and then selects one according to a decoding policy. It has learned those distributions from a large corpus through self-supervised training. It does not store explicit symbolic knowledge with proofs, nor does it query a ground truth database unless an external retrieval component is added. It produces continuations that are probable under its learned distribution. When the distribution assigns high probability to a token sequence that is wrong for the user’s request, the model will still produce it. That is the seed of hallucination.
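

To make that mechanism concrete, here is a minimal sketch of a single decoding step over a toy distribution. The logits are invented stand-ins for what a real model computes; the point is only that the decoding policy selects from a probability distribution and never consults a source of truth.

import math
import random

def sample_next_token(logits, temperature=1.0):
    # `logits` is a hypothetical list of unnormalized scores, one per
    # vocabulary token, standing in for what a real model computes.
    if temperature == 0.0:
        # Greedy decoding: take the single most probable token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sampling can pick a token that is probable but wrong, and every
    # later token is conditioned on that choice.
    return random.choices(range(len(logits)), weights=probs, k=1)[0]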


Consider an example where a developer asks a model to explain a non-existent function in a well-known library. The setup is that the user is exploring an unfamiliar API, assumes a function exists because it fits a naming pattern, and asks the model for usage details. The model has seen many explanations of real functions in this library, all following similar linguistic patterns, including signatures, examples, and caveats. When prompted about the imaginary function, the model’s best next-token distribution matches the style of those explanations even though the specific symbol does not exist. The output looks plausible because it borrows patterns from nearby, genuine documentation. The failure is not malicious; it is the predictable result of pattern continuation without grounding.


The pretraining process also contributes to hallucinations. The corpus contains heterogeneous quality text, from peer-reviewed articles to casual forum posts, and it includes outdated, contradictory, or noisy content. The model learns correlations that are useful on average but not guaranteed in every case. When it encounters long-tail topics that were underrepresented or inconsistently represented during training, it may default to generic but incorrect continuations. It also learns to compose answers by stitching together fragments that co-occur in training, which can yield hybrid statements that were never written by any source and may be wrong in detail. I do not claim that we can always trace a particular false statement to a specific training artifact, and when I cannot, I will say so. What is well understood is that distributional learning without explicit validation allows errors that look like authoritative text.


Decoding strategy is a practical lever that affects hallucination rates. Greedy decoding often produces safe and generic text but can entrench a wrong early token if the initial step goes astray, because all subsequent tokens are conditioned on that choice. Sampling with higher temperature increases diversity and creativity, which can be beneficial for ideation but tends to increase the chance of unsupported claims because the model is more willing to follow lower-probability branches. Nucleus sampling limits choices to the smallest set of tokens whose cumulative probability exceeds a threshold; it can curb extreme tokens but still allows creative drift. Repetition penalties can reduce loops but may inadvertently push the model away from repeating a correct phrase, which can cause it to invent synonyms that subtly alter meaning. There is no universal setting that eliminates hallucinations, but for tasks requiring factuality, lower temperature and deterministic decoding are commonly used in evaluation because they stabilize outputs and make failures reproducible.
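

As a sketch of the nucleus idea, the snippet below truncates a toy distribution to the smallest set of tokens whose cumulative probability passes the threshold, then renormalizes. The numbers are invented; the point is the shape of the lever, not a tuning recommendation.

def nucleus_filter(probs, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches `top_p`, then renormalize over that set.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy distribution over four tokens: the low-probability tail is dropped.
print(nucleus_filter([0.55, 0.30, 0.10, 0.05], top_p=0.9))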


Instruction tuning and reinforcement learning from human feedback are designed to make models helpful and aligned with user intent. They also create new dynamics. A reward model trained to score helpful, detailed, and polite answers can unintentionally favor responses that are richly elaborated even when the underlying truth is uncertain. If the training data for the reward model includes few examples where the best action is to say “I do not know,” the tuned model may learn to answer regardless. Reward hacking can appear when the model learns stylistic cues that correlate with high reward without guaranteeing factuality. I am not asserting a particular reward dataset composition for any given provider; instead I am describing the mechanism: preferences for completeness and confidence can raise the probability of polished but wrong continuations in edge cases.


Context handling is another recurring source of hallucination. The model’s attention is limited to a fixed context window, and long prompts can lead to truncation or reduced attention to earlier details. When a prompt contains conflicting statements, the model will resolve them in ways that maximize internal likelihood rather than consulting an external source. Retrieval-augmented generation mitigates this by fetching documents and putting them into the context. When the retriever surfaces relevant, recent, and authoritative passages, the model can ground its answer by quoting or summarizing from them. However, retrieval itself can fail by returning irrelevant or low-quality passages, and the model can still synthesize beyond the retrieved content. Citations that include exact snippets or links make it easier to detect drift, but they do not enforce truth unless the generation is constrained to the retrieved text.
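

A minimal sketch of the retrieval-augmented pattern follows, assuming hypothetical retrieve and generate callables; real retriever and model-client APIs differ, and as noted above, grounding constrains the answer but does not enforce it.

def answer_with_retrieval(question, retrieve, generate, top_k=4):
    # `retrieve` returns a list of (doc_id, text) pairs and `generate`
    # calls a model; both are hypothetical stand-ins for real clients.
    passages = retrieve(question, top_k=top_k)
    if not passages:
        return "No supporting documents were found for this question."
    context = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    prompt = (
        "Answer using ONLY the passages below and cite the passage id for "
        "each claim. If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    # The model can still synthesize beyond the passages, which is why a
    # separate post-check on the answer remains useful.
    return generate(prompt, temperature=0.0)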


Tool use and function calling reduce hallucinations by delegating parts of the task to systems with stronger guarantees. A calculator, code interpreter, search API, or domain database provides hard answers that the model can incorporate. The failure modes here include incorrect tool selection, mis-specified arguments, silently ignored tool errors, and wrappers that accept free-form text where a strict schema was intended. When tools are reliable and error-checked, they provide anchors that pull the generation toward verifiable outputs.
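

The sketch below shows one way to validate a model-proposed tool call before executing it and to make failures explicit. The tool registry, argument schema, and JSON call format are assumptions for illustration, not any provider's function-calling API.

import json

# Hypothetical tool registry: the name, schema, and stub backend are
# illustrative only.
TOOLS = {
    "lookup_price": {
        "required_args": {"sku": str},
        "fn": lambda sku: {"sku": sku, "price_usd": 19.99},
    }
}

def run_tool_call(raw_call):
    # Expects a JSON string like {"name": "lookup_price", "args": {"sku": "A1"}}.
    try:
        call = json.loads(raw_call)
        spec = TOOLS[call["name"]]
    except (json.JSONDecodeError, KeyError) as exc:
        return {"error": f"unknown or malformed tool call: {exc}"}
    args = call.get("args", {})
    for name, expected_type in spec["required_args"].items():
        if not isinstance(args.get(name), expected_type):
            return {"error": f"argument '{name}' is missing or has the wrong type"}
    try:
        return {"result": spec["fn"](**args)}
    except Exception as exc:
        # Surface tool failures to the model instead of silently dropping them.
        return {"error": f"tool execution failed: {exc}"}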


Prompt design influences behavior in ways that matter for correctness. Prompts that specify the task, the required sources of truth, and the allowed output schema reduce the space of acceptable continuations. Asking the model to either answer with a fact tied to a provided citation or explicitly state that the citation is missing can lower hallucinations because the model is guided to align its output with visible evidence. Encouraging the model to check its own answer against the provided context in a second pass can also help, even if the internal mechanism is not perfect. I avoid relying on private or hidden system prompts in these descriptions. Instead I focus on user-visible patterns that engineers can implement, such as requiring answers to include evidence extracted from a given context, or instructing the model to abstain when context is insufficient. There are claims in the community about specific “magic prompts.” When I am not sure those claims are robust across models, I will say I am not sure, and I will stick to mechanisms whose effects are observable: constraints, schemas, and evidence requirements.
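

One user-visible pattern is an evidence-constrained prompt plus a cheap verbatim check on the reply. The schema and wording below are assumptions, not a prompt known to be robust across models.

ANSWER_SCHEMA = (
    'Respond with JSON only: {"answer": <string or null>, '
    '"evidence": <verbatim quote from the context or null>, '
    '"abstained": <true or false>}'
)

def build_grounded_prompt(question, context):
    # The wording is illustrative; the observable constraints are the
    # schema, the evidence requirement, and the abstention option.
    return (
        "Use only the context below as a source of facts.\n"
        "Every answer must include a verbatim quote from the context as evidence.\n"
        "If the context does not contain the answer, set abstained to true.\n\n"
        f"{ANSWER_SCHEMA}\n\nContext:\n{context}\n\nQuestion: {question}"
    )

def evidence_is_verbatim(reply, context):
    # Cheap post-check: the quoted evidence must actually appear in the context.
    if reply.get("abstained"):
        return True
    evidence = reply.get("evidence") or ""
    return bool(evidence) and evidence in context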


Evaluation and measurement need to reflect the target task. Exact-match metrics are appropriate for closed-form answers like short facts or code outputs. For longer answers, faithfulness metrics compare generated statements to source context to detect unsupported claims. Human review remains important, especially when the stakes are high, because automatic metrics can miss subtle misstatements. Deterministic decoding during evaluation makes defects reproducible. When retrieval is part of the system, evaluating the retriever separately helps isolate whether the hallucination originated from missing or wrong evidence.
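

A small harness for closed-form answers can look like the sketch below, run with deterministic decoding so every failure is reproducible. The case format and the generate client are assumptions; faithfulness scoring for longer answers needs a separate, more involved comparison against the source context.

def exact_match_eval(cases, generate):
    # `cases` is a list of {"prompt": ..., "expected": ...} dicts and
    # `generate` is a hypothetical model client; both are assumptions.
    failures = []
    for case in cases:
        output = generate(case["prompt"], temperature=0.0).strip()
        if output != case["expected"]:
            failures.append({"prompt": case["prompt"],
                             "expected": case["expected"],
                             "got": output})
    score = 1.0 - len(failures) / len(cases)
    return score, failures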


Engineering practices to minimize hallucinations combine several techniques. Grounding the model with retrieval reduces reliance on parametric memory. Constraining decoding reduces drift. Delegating sub-tasks to tools introduces verifiable data into the generation. Post-generation verification checks the output against trusted sources and either corrects it or triggers abstention. For some applications, you can structure the system so that the model proposes a plan, tools execute the plan, and the model narrates the results, making the narrative a reflection of tool outputs rather than speculation. When the application requires citations, design the output to include them and validate that each cited statement aligns with the cited passage. When the application requires names or identifiers, validate them against a registry or API before returning them to the user.
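

The plan, execute, narrate structure can be sketched as below; the three callables are hypothetical stand-ins for a planning call, a tool runner, and a narration call that is restricted to the recorded tool outputs.

def plan_execute_narrate(task, propose_plan, run_step, narrate):
    # `propose_plan` asks the model for a list of steps, `run_step`
    # executes a step with real tools, and `narrate` asks the model to
    # summarize ONLY the execution log; all three are assumptions.
    plan = propose_plan(task)
    log = []
    for step in plan:
        outcome = run_step(step)  # tools supply the verifiable facts
        log.append({"step": step, "outcome": outcome})
    # The narration prompt is limited to the log, so the final text
    # reflects tool outputs rather than speculation.
    return narrate(task, log)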


A useful example of minimizing hallucinations is a code assistant that answers API questions. The setup is that the target users are developers who need exact function signatures and version-specific behavior. The system uses a retriever over official documentation and release notes, collects the top passages relevant to the query, and instructs the model to answer only by quoting or summarizing those passages. The answer must embed citations and abstain if the passages do not contain the requested information. The decoding temperature is set low to keep outputs consistent. In addition, the system runs a post-check that scans the answer for function names and compares them to a symbol index extracted from the docs. If an out-of-index symbol appears, the response is rejected with a request to re-answer or abstain. In practice this arrangement substantially reduces invented functions, because the model’s allowed space is bounded by retrieved facts and the symbol check catches leaks.
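

The symbol post-check can be as small as the sketch below; the regex, index format, and example values are illustrative assumptions about this hypothetical assistant.

import re

def out_of_index_symbols(answer, symbol_index):
    # Collect function-like names (an identifier followed by an opening
    # parenthesis) and return those missing from the documentation index.
    candidates = set(re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*(?=\()", answer))
    return sorted(candidates - symbol_index)

# Toy usage with an invented index and draft answer.
index = {"connect", "fetch_rows", "close"}
draft = "Call fetch_all() after connect() to stream the results."
print(out_of_index_symbols(draft, index))  # ['fetch_all'] -> reject or re-ask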


Failure analysis and debugging start with reproducing the hallucination deterministically by fixing the random seed or using greedy decoding. Once reproduced, you can remove non-essential prompt elements to find the minimal failing case. If retrieval is present, inspect the retrieved passages to see whether the evidence was missing or contradictory. If the evidence was missing, adjust the retriever or the index. If it was present but the model ignored it, consider stronger instructions, tighter output constraints, or constrained generation that must copy spans from sources for key facts. If the model invented a tool result, add explicit tool result validation and make tool failures visible to the model so that it can acknowledge them rather than guessing. Document the failure as a test so future changes can be evaluated against it.
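

Documenting the failure as a test can be as simple as the sketch below, assuming a minimal failing prompt found by trimming and a generate client with deterministic decoding; the prompt and the expected abstention phrasing are invented for illustration.

# Hypothetical minimal failing prompt, found by removing non-essential
# parts of the original prompt until the hallucination still reproduces.
MINIMAL_FAILING_PROMPT = "Describe the retry behavior of fetch_rows()."

def regression_check(generate):
    # Re-run with deterministic decoding and require an abstention,
    # because the docs in context say nothing about retries.
    answer = generate(MINIMAL_FAILING_PROMPT, temperature=0.0).lower()
    return "not documented" in answer or "do not know" in answer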


It is important to be honest about limitations. A general-purpose language model without grounding will sometimes produce fluent but false statements, and even with retrieval and tools, there will be cases where the evidence is ambiguous or the tools are incomplete. In those cases the correct behavior is to say that the answer is not known or that more information is required. If the application domain has regulatory or safety constraints, you should set conservative defaults that prefer abstention over speculation and design the interface to communicate uncertainty clearly to the user.


To close, the practical way to reduce hallucinations is to treat generation as a probabilistic component that needs guardrails and evidence. You improve reliability by supplying relevant context at inference time, constraining how the model can deviate from that context, delegating computations and lookups to verifiable tools, and validating outputs before they are returned. You measure progress with reproducible tests that reflect your task and you debug failures by isolating whether they arise from missing evidence, weak constraints, or mis-specified tools. When uncertainty remains, you say so explicitly and let the system abstain.
