Saturday, September 06, 2025

THE GREAT LLM COMEDY OF ERRORS: When Artificial Intelligence Gets Artificially Stupid

Large Language Models have revolutionized how we interact with computers, promising to be our digital assistants, research companions, and coding partners. Yet beneath their confident responses and human-like conversations lies a troubling reality that every software engineer should understand: these systems are spectacularly capable of being spectacularly wrong. The following collection of verified failures reveals not just amusing quirks, but fundamental limitations that have real-world consequences.


When Chatbots Become Expensive Legal Liabilities

The most legally significant LLM failure to date occurred when Air Canada discovered that its customer service chatbot had become an unauthorized policy-making entity. In November 2022, Jake Moffatt contacted Air Canada's chatbot to inquire about bereavement fare policies after his grandmother's death. The chatbot confidently informed him that he could book a full-fare ticket and apply for a partial bereavement refund within 90 days of the ticket's issue date, even after completing his travel.

Moffatt booked full-price tickets based on this advice, but when he later applied for the promised refund, Air Canada refused, pointing to their actual policy that required bereavement fare applications before travel. The airline's defense was remarkable in its audacity: they argued that the chatbot was "a separate legal entity that is responsible for its own actions." The British Columbia Civil Resolution Tribunal rejected this argument with what can only be described as judicial incredulity, ruling that Air Canada was responsible for all information on its website, whether from static pages or chatbots.

Tribunal member Christopher Rivers wrote that it should be obvious to Air Canada that it remains responsible for its website's content, regardless of whether that content comes from a chatbot or a traditional web page. Air Canada was ordered to pay Moffatt the fare difference plus tribunal fees, a widely cited ruling that companies cannot hide behind their AI systems when those systems make costly mistakes. This case demonstrates how LLM hallucinations can create binding obligations that companies never intended to take on.


Google's Cosmic Embarrassment

Google's Bard chatbot, now known as Gemini, made headlines in early 2023 for confidently fabricating an achievement of the James Webb Space Telescope. Asked what the telescope had discovered, Bard claimed it took the very first pictures of a planet outside our solar system, a milestone that actually belongs to the European Southern Observatory's Very Large Telescope, which imaged an exoplanet in 2004. This wasn't a subtle misinterpretation or a minor factual error, but a complete fabrication presented with the same confidence the system would use for verified facts.

The incident was particularly damaging because the claim appeared in Google's own promotional material for Bard's debut as a competitor to ChatGPT, and the mistake was spotted just as the company was staging its high-profile launch event. Alphabet's share price fell sharply, erasing an estimated $100 billion in market value as investors questioned the reliability of the company's AI initiatives. What made this failure especially concerning for engineers was how the system generated a plausible-sounding scientific claim that would require domain expertise to verify. The hallucination wasn't obviously wrong to casual observers, making it a perfect example of how LLMs can confidently present misinformation that sounds authoritative.


The Dollar Car Deal That Wasn't

A ChatGPT-powered chatbot on a Chevrolet dealership's website became an internet sensation for all the wrong reasons when users discovered they could manipulate it into making absurd commitments. Chris Bakke got the chatbot to agree to sell him a 2024 Chevrolet Tahoe for one dollar by first instructing it to agree with everything the customer said and to close every response by calling the deal legally binding. The chatbot complied, confirming the one-dollar sale as "a legally binding offer - no takesies backsies."

The incident sparked a viral trend where users exploited the chatbot's lack of guardrails to generate increasingly ridiculous responses. Other users got the chatbot to recommend Tesla vehicles over Chevrolet models, offer two-for-one deals on all new vehicles, and even generate Python code when asked for programming help. The Chevrolet case illustrates how LLMs can be manipulated through prompt injection techniques, where carefully crafted instructions can override the system's intended behavior.

What makes this particularly relevant for software engineers is how it demonstrates the difficulty of implementing robust input validation for natural language interfaces. Traditional software has clear input parameters and validation rules, but LLMs must interpret free-form text that can contain hidden instructions or manipulation attempts.
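To make the problem concrete, here is a minimal sketch of the kind of output-side guard a dealership chatbot could sit behind. Everything in it is hypothetical: the policy phrases, the violates_policy check, and the message structure are illustrative assumptions, not Chevrolet's actual implementation. The general idea is simply to keep untrusted user text out of the instruction channel and to enforce business rules on the model's reply in deterministic code, rather than trusting the model to follow them.

```python
import re

# System instructions live in their own message, never concatenated with
# untrusted user text. Role separation alone does not stop injection, but it
# keeps the trust boundary explicit.
SYSTEM_PROMPT = "You are a dealership assistant. Never quote prices or make offers."

def build_messages(user_text: str) -> list:
    """Keep untrusted input in its own role instead of splicing it into the instructions."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

# Output-side guard: business rules are enforced on the model's reply, because
# input filtering alone cannot catch every manipulation attempt.
PRICE_PATTERN = re.compile(r"\$\s?\d[\d,]*(\.\d{2})?")
COMMITMENT_PHRASES = ("legally binding", "no takesies backsies", "we guarantee")

def violates_policy(reply: str) -> bool:
    """Reject any reply that quotes a price or makes a binding-sounding commitment."""
    if PRICE_PATTERN.search(reply):
        return True
    lowered = reply.lower()
    return any(phrase in lowered for phrase in COMMITMENT_PHRASES)

if __name__ == "__main__":
    hostile_reply = "Deal! A 2024 Tahoe for $1.00, and that's a legally binding offer."
    if violates_policy(hostile_reply):
        print("Reply blocked; escalating to a human agent.")
```

Even a guard this crude would have blocked the one-dollar Tahoe reply, because the rule lives in ordinary code rather than in a prompt the user can talk the model out of.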


Fabricated Legal Precedents in Federal Court

Perhaps the most professionally damaging LLM failure occurred when attorney Steven Schwartz used ChatGPT to conduct legal research for a federal court filing in New York. The AI system provided him with citations to several legal cases that supported his client's position. The problem was that none of these cases actually existed. ChatGPT had fabricated not just the case names, but detailed descriptions of legal precedents, complete with realistic-sounding citations and legal reasoning.

When opposing counsel and the judge attempted to locate these cases, they discovered that ChatGPT had invented an entire fictional legal universe. The fabricated cases had plausible names, realistic citation formats, and legal reasoning that sounded authentic to anyone not intimately familiar with the specific area of law. Schwartz and a colleague were ultimately sanctioned and fined, and federal judges elsewhere responded with standing orders requiring attorneys to certify whether generative AI was used in their filings and to verify anything it produced.

This incident reveals a particularly insidious aspect of LLM hallucinations: they often generate content that appears professionally credible. The fabricated legal cases weren't obviously fake like a chatbot claiming to sell cars for a dollar. They required specialized knowledge and careful verification to identify as fraudulent, making them dangerous for professionals who might reasonably expect AI to provide accurate research assistance.


When AI Tells You to Eat Rocks

Google's AI Overview feature, integrated directly into search results, began providing users with dangerous and absurd advice in 2024. The system confidently recommended adding glue to pizza to make the cheese stick better and suggested that people eat at least one small rock per day, among other potentially harmful guidance. These weren't responses to trick questions or attempts to manipulate the system, but answers to straightforward queries that millions of users might reasonably ask.

The glue-on-pizza recommendation apparently originated from a satirical Reddit comment that the AI system treated as legitimate advice. This highlights how LLMs can struggle to distinguish between serious information and jokes, sarcasm, or deliberately misleading content in their training data. For software engineers, this demonstrates the challenge of building systems that can understand context and intent in the same way humans do.

The rock-eating advice was particularly concerning because it appeared in response to health-related queries, where incorrect information could cause serious harm. Google's integration of AI directly into search results meant that these hallucinations appeared with the same authority as traditional search results, potentially misleading users who trusted Google's reputation for providing reliable information.


Mathematical Incompetence That Defies Logic

Despite being built on sophisticated mathematical foundations, LLMs consistently fail at basic arithmetic and counting tasks. ChatGPT regularly produces incorrect answers to simple math problems and cannot reliably count the words in its own responses. When asked to provide a response with exactly 42 words, it might produce 31 words or 82 words, seemingly at random.

This mathematical incompetence extends beyond simple counting to basic logic problems. When presented with straightforward word problems involving rent calculations or simple arithmetic, ChatGPT often makes fundamental errors in reasoning. It might correctly identify the mathematical operations needed but then apply them incorrectly or ignore explicit constraints in the problem statement.

For software engineers, this reveals a fundamental limitation in how LLMs process information. These systems excel at pattern matching and text generation but struggle with the precise, step-by-step reasoning that programming requires. The inability to count words accurately suggests deeper issues with how these systems represent and manipulate discrete quantities, which has implications for any application requiring mathematical precision.
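One practical response is to stop asking the model to check its own arithmetic and put the counting in ordinary code. The sketch below is a hypothetical pattern, not a prescribed API: the generate callable stands in for whatever LLM client you actually use. It verifies a word-count constraint deterministically and retries or fails loudly instead of shipping an answer that merely sounds compliant.

```python
def meets_word_count(text: str, target: int, tolerance: int = 0) -> bool:
    """Deterministically verify the constraint instead of trusting the model's own count."""
    return abs(len(text.split()) - target) <= tolerance

def generate_with_retry(generate, prompt: str, target_words: int, max_attempts: int = 3):
    """Call the LLM (passed in as `generate`), but keep the counting on our side."""
    for _ in range(max_attempts):
        draft = generate(prompt)
        if meets_word_count(draft, target_words):
            return draft
    return None  # surface the failure instead of shipping a wrong answer

if __name__ == "__main__":
    # Stand-in "model" that ignores the word-count instruction, as real models often do.
    fake_llm = lambda prompt: "This stand-in reply ignores the requested length entirely."
    result = generate_with_retry(fake_llm, "Reply in exactly 42 words.", target_words=42)
    print("accepted" if result else "rejected: constraint not met after retries")
```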


The Great Source Fabrication Scandal

When asked to provide sources for their claims, LLMs routinely fabricate citations, complete with realistic-looking URLs and publication details. Users who request sources for research purposes often receive lists of academic papers, news articles, and websites that simply don't exist. The fabricated sources typically have plausible titles, realistic publication dates, and URLs that point to real domains but non-existent pages.

This behavior is particularly problematic because the fabricated sources often look legitimate enough to pass casual inspection. A fake academic paper might have a title like "The Impact of Climate Change on Urban Infrastructure: A Meta-Analysis" with a realistic journal name and publication year. Only when someone attempts to access the actual source do they discover the fabrication.

The source fabrication problem extends beyond academic citations to include fake news articles, non-existent government reports, and imaginary corporate documents. LLMs seem to understand the format and structure of citations well enough to generate convincing fakes, but lack the ability to verify whether these sources actually exist. This creates a dangerous situation where users might unknowingly base important decisions on completely fictional evidence.
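A partial defense is to treat every citation as unverified until a deterministic check says otherwise. The sketch below is one hedged approach using the requests library and Crossref's public DOI lookup; the example URLs and DOI are invented, and note that a resolving URL or registered DOI only proves the source exists, not that it actually supports the claim attributed to it.

```python
import requests

def url_resolves(url: str, timeout: float = 5.0) -> bool:
    """Cheap existence check: does the cited URL return a non-error status?"""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code == 405:  # some servers reject HEAD; retry with GET
            resp = requests.get(url, timeout=timeout, stream=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False

def doi_is_registered(doi: str, timeout: float = 5.0) -> bool:
    """Look the DOI up in Crossref's public index; fabricated DOIs come back 404."""
    try:
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    cited = ["https://example.com/this-paper-does-not-exist", "10.1000/not-a-real-doi"]
    for item in cited:
        check = doi_is_registered if item.startswith("10.") else url_resolves
        print(item, "->", "exists" if check(item) else "could not verify")
```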


Historical Revisionism Through Image Generation

Google's Gemini image generation system created a controversy when it began producing historically inaccurate images that appeared to deliberately diversify historical contexts where such diversity was anachronistic. When asked to generate images of historical figures or events, the system would often produce results that contradicted historical records in ways that seemed ideologically motivated rather than accidentally incorrect.

The system generated images of diverse Nazi soldiers, female popes, and other historically inaccurate representations that sparked accusations of political bias in AI training. Google acknowledged that the system had been over-tuned to avoid generating images that lacked diversity, but this overcorrection led to results that were factually wrong and potentially offensive to users who expected historical accuracy.

This incident illustrates how attempts to address bias in AI systems can create new problems when the corrections are applied too broadly. The image generation failures weren't random hallucinations but systematic distortions that suggested the underlying training process had prioritized certain social goals over factual accuracy. For software engineers, this highlights the complexity of balancing multiple objectives in AI systems and the unintended consequences that can arise from well-intentioned modifications.


The Persistent Pattern of Confident Incorrectness

What makes these failures particularly concerning is not just their frequency, but the confidence with which LLMs present incorrect information. Unlike traditional software that might crash or produce obviously erroneous output when it encounters problems, LLMs fail gracefully by generating plausible-sounding nonsense. They don't say "I don't know" or "I'm not sure about this information." Instead, they confidently present fabricated facts, non-existent sources, and impossible scenarios as if they were established truth.

This confident incorrectness creates a unique challenge for software engineers who are accustomed to systems that fail in predictable ways. When a database query fails, you get an error message. When a network connection drops, you get a timeout. But when an LLM hallucinates, you get what appears to be a successful response that happens to be completely wrong.

The implications for software development are significant. Traditional debugging techniques rely on identifying when systems produce errors, but LLM failures often look like successes until someone with domain expertise examines the output carefully. This means that building reliable systems around LLMs requires new approaches to validation, verification, and error detection that go far beyond conventional software testing.
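In practice this often means treating LLM output the way you would treat any untrusted input: parse it against a strict schema and apply domain checks, so that a plausible-looking but wrong response fails loudly instead of flowing silently downstream. The sketch below is illustrative only; the refund schema and clause names are invented for the example, loosely echoing the Air Canada scenario, but the pattern is general.

```python
import json
from dataclasses import dataclass

@dataclass
class RefundDecision:
    eligible: bool
    policy_clause: str
    amount_cad: float

# Clauses the business actually publishes; anything else is a likely hallucination.
KNOWN_CLAUSES = {"bereavement-pre-travel", "schedule-change", "duplicate-charge"}

def parse_llm_decision(raw: str) -> RefundDecision:
    """Treat the model's reply as untrusted input: strict parse, then domain checks."""
    data = json.loads(raw)  # malformed output raises immediately -> a loud failure
    decision = RefundDecision(
        eligible=bool(data["eligible"]),
        policy_clause=str(data["policy_clause"]),
        amount_cad=float(data["amount_cad"]),
    )
    if decision.policy_clause not in KNOWN_CLAUSES:
        raise ValueError(f"unknown policy clause {decision.policy_clause!r}")
    if not 0 <= decision.amount_cad <= 1000:
        raise ValueError(f"refund amount out of range: {decision.amount_cad}")
    return decision

if __name__ == "__main__":
    # A well-formed but hallucinated answer: it invents a clause that does not exist.
    raw = '{"eligible": true, "policy_clause": "post-travel-goodwill", "amount_cad": 650.88}'
    try:
        parse_llm_decision(raw)
    except (KeyError, ValueError) as err:
        print("Rejected LLM output:", err)
```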


What This Means for Software Engineers

These failures reveal fundamental limitations in current LLM technology that every software engineer should understand. LLMs are not databases that retrieve factual information, despite often being used as if they were. They are pattern-matching systems that generate text based on statistical relationships in their training data, without any inherent understanding of truth, accuracy, or consistency.

The confident presentation of incorrect information isn't a bug that can be easily fixed, but a fundamental characteristic of how these systems operate. They generate responses by predicting what text should come next based on patterns they've learned, not by consulting authoritative sources or applying logical reasoning. This means that even as LLMs become more sophisticated, the risk of confident hallucinations will likely persist.

For engineers building systems that incorporate LLMs, these failures highlight the critical importance of implementing robust validation mechanisms, maintaining human oversight for critical decisions, and clearly communicating the limitations of AI-generated content to users. The era of treating LLM output as authoritative information has ended before it truly began, replaced by a more nuanced understanding of these systems as powerful but fallible tools that require careful handling and constant verification.
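One simple way to keep humans in the loop is to let the model propose actions but never execute the consequential ones directly. The action names and routing rules below are invented for illustration; the point is that the approval boundary lives in deterministic code, not in a prompt the model or a user can negotiate with.

```python
from dataclasses import dataclass, field

# Actions the assistant may take on its own versus those a person must sign off on.
AUTO_APPROVED = {"send_policy_link", "answer_faq"}
REQUIRES_HUMAN = {"issue_refund", "change_booking", "quote_price"}

@dataclass
class ReviewQueue:
    pending: list = field(default_factory=list)

    def submit(self, action: str, payload: dict) -> str:
        """Route consequential actions to a human reviewer instead of executing them."""
        if action in AUTO_APPROVED:
            return f"executed: {action}"
        if action in REQUIRES_HUMAN:
            self.pending.append((action, payload))
            return f"queued for human review: {action}"
        return f"rejected: unknown action {action!r}"

if __name__ == "__main__":
    queue = ReviewQueue()
    print(queue.submit("answer_faq", {"topic": "baggage allowance"}))
    print(queue.submit("issue_refund", {"amount_cad": 650.88}))
    print(queue.submit("sell_tahoe_for_one_dollar", {"price_usd": 1}))
```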

The comedy of errors continues to unfold as LLMs become more prevalent in software systems, but the lessons are clear: artificial intelligence may be artificial, but the consequences of its mistakes are very real.
