INTRODUCTION
There is a peculiar kind of excitement that grips the technology world every few years, a collective fever dream in which the latest invention is declared the final stepping stone to a future that has been promised since the 1950s. In the 1980s it was expert systems. In the 1990s it was neural networks of the first generation. In the 2000s it was symbolic AI hybrids. And today, with a confidence that borders on religious conviction, a significant portion of the artificial intelligence community has declared that Large Language Models, those colossal statistical engines that power ChatGPT, Claude, Gemini, and their kin, are the true and final road to Artificial General Intelligence.
They are wrong. Fascinatingly, demonstrably, and in some ways beautifully wrong.
This article is not an attack on the extraordinary engineering achievement that LLMs represent. They are genuinely remarkable. They can write poetry that moves people to tears, debug complex software, explain quantum mechanics to a ten-year-old, and hold a conversation that feels startlingly human. But remarkable is not the same as general. Impressive is not the same as intelligent. And the gap between what LLMs do and what AGI requires is not a gap that more data, more parameters, or more compute will close. It is a structural, architectural, and philosophical chasm that goes to the very heart of what intelligence actually is.
To understand why, we need to start at the beginning, with the machine itself.
WHAT AN LLM ACTUALLY IS, AND WHY THAT MATTERS
Before we can argue about what LLMs cannot do, we need to be precise about what they actually do. This is not as obvious as it sounds, because the marketing language surrounding these systems has become so inflated that many people, including many researchers, have lost sight of the underlying mechanism.
A Large Language Model is, at its core, a function that takes a sequence of tokens as input and produces a probability distribution over the next token as output. A token is roughly a word or a word fragment. The model is trained on enormous quantities of text, sometimes hundreds of billions of words scraped from the internet, books, scientific papers, and code repositories, and during training it adjusts billions of internal numerical parameters so that it becomes progressively better at predicting what token comes next in any given sequence.
That is it. That is the whole game.
The training objective is called next-token prediction, and it is a beautifully simple idea that has proven extraordinarily powerful. By optimizing for this single objective across a vast and diverse corpus of human-written text, the model is forced to learn an enormous amount about the statistical structure of language, and, indirectly, about the world that language describes. It learns that "Paris is the capital of" is almost always followed by "France." It learns that code written in Python tends to follow certain syntactic patterns. It learns that sentences about grief tend to use certain kinds of vocabulary. It learns millions upon millions of such correlations, and it encodes them in its billions of parameters.
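To make the mechanism concrete, here is a minimal sketch of next-token prediction using a toy bigram counter. It is not how a real LLM is built: a real model conditions on the whole context with a neural network, whereas this illustration looks only at the last token. The corpus and the function names (train_bigram, next_token_distribution) are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each token, how often each other token follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, context):
    """Return P(next token | last token of the context) as a dict."""
    last = context.lower().split()[-1]
    followers = counts.get(last, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the model has never seen this token followed by anything
    return {tok: n / total for tok, n in followers.items()}


# Hypothetical three-sentence "training set".
corpus = [
    "paris is the capital of france",
    "rome is the capital of italy",
    "lyon is a city of france",
]
model = train_bigram(corpus)
dist = next_token_distribution(model, "Paris is the capital of")
print(dist)
```

The output is a probability distribution over continuations, derived purely from co-occurrence counts. Nothing in the table of counts represents what a capital or a country is.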
The result, when you interact with a well-trained LLM, is something that feels uncannily like understanding. The model responds coherently, contextually, and often with apparent insight. It can answer questions you have never asked before. It can combine concepts in novel ways. It can, in some narrow sense, generalize.
But here is the crucial question, the one that the entire debate about AGI hinges on: is this generalization the same kind of generalization that underlies human intelligence? Is the model understanding, or is it doing something else entirely that merely resembles understanding from the outside?
The answer, as we shall see, is that it is doing something else entirely. And that something else, however impressive, has fundamental limitations that cannot be engineered away.
THE STOCHASTIC PARROT IN THE ROOM
In 2021, a group of researchers including Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell published a paper that caused considerable controversy in the AI community. The paper was titled "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" and it introduced a phrase that has since become one of the most debated in the field: the stochastic parrot.
The argument is elegant in its simplicity. A parrot, as any pet owner knows, can produce utterances that sound meaningful. It can say "hello" when someone enters the room, "want a cracker" when it is hungry, and even string together phrases it has heard in ways that seem contextually appropriate. But no one seriously believes the parrot understands what it is saying. It is producing statistically likely outputs given its training environment, which is to say, the sounds it has heard most frequently in certain contexts.
The authors argued that LLMs are doing something structurally similar, just at a vastly greater scale and with vastly more sophisticated statistical machinery. They are producing statistically likely sequences of tokens given their training data. The outputs can be extraordinarily convincing, but the process generating them does not involve understanding in any meaningful sense of the word.
This argument was controversial because it seemed to dismiss the genuine capabilities of these systems. But the controversy largely missed the point. The stochastic parrot critique is not about whether LLMs are useful. They clearly are. It is about whether the mechanism underlying their outputs is the kind of mechanism that could scale to genuine general intelligence. And the answer, the paper argued, is no.
To see why, consider a concrete example. Ask an LLM what happens if you drop a glass on a concrete floor. It will tell you, correctly, that the glass will likely shatter. Now ask it why. It will produce a fluent explanation involving gravity, the brittleness of glass, and the hardness of concrete. This explanation will be accurate. But the model does not arrive at it by way of any model of physics. It knows the answer because it has read thousands of texts in which people describe dropping glasses and the consequences thereof. The knowledge is encoded as a statistical pattern in its parameters, not as a causal model of the physical world.
Now ask it something slightly different. Ask it what would happen if you dropped a glass on a floor made of compressed air. The model will probably produce something plausible-sounding, but it will be doing so by interpolating between patterns it has seen, not by simulating the physics of the situation. It has no way to actually reason about a novel physical scenario from first principles, because it has no first principles. It has only patterns.
This distinction, between pattern-based retrieval and genuine causal reasoning, is not a minor technical detail. It is the heart of the matter.
THE CHINESE ROOM, UPDATED FOR THE TWENTY-FIRST CENTURY
In 1980, the philosopher John Searle published a thought experiment, one I have mentioned twice in previous articles, that has haunted AI research ever since. He called it the Chinese Room, and it goes like this:
Imagine a person who speaks no Chinese whatsoever, locked alone in a room. Through a slot in the door, slips of paper with Chinese symbols are passed in. The person has an enormous rulebook that tells them, for any given sequence of Chinese symbols, what sequence of Chinese symbols to write in response. They follow the rules, write the appropriate symbols on a new slip of paper, and pass it back through the slot. From the outside, the room appears to understand Chinese perfectly. The responses are appropriate, contextually sensitive, and indistinguishable from those of a native speaker. But the person inside understands nothing. They are manipulating symbols according to rules, with no grasp of what those symbols mean.
Searle's point was that syntax, the formal manipulation of symbols according to rules, is not sufficient for semantics, which is the actual meaning of those symbols. A system can be syntactically perfect and semantically empty.
Critics of the Chinese Room argument have long replied that it is the whole system, not just the person, that understands Chinese, an objection usually called the systems reply. The room, the rulebook, and the person together constitute an understanding system. But this objection, while philosophically interesting, does not actually help the case for LLMs, because even if we grant that the whole system understands Chinese in some sense, we are still left with the question of what kind of understanding it is, and whether that kind of understanding can scale to genuine general intelligence.
The LLM version of the Chinese Room is in some ways even more striking than Searle's original. The rulebook is not a static lookup table but a learned function with billions of parameters. The symbols being manipulated are not just Chinese characters but the full richness of human language. And the outputs are not just appropriate responses but creative, nuanced, and sometimes genuinely surprising text. And yet the fundamental situation is the same. The system is manipulating symbols according to learned statistical rules, with no direct access to the meaning of those symbols.
Consider this small demonstration of the gap between syntactic fluency and semantic understanding.
SHOWCASE 1: The Bat and Ball Problem
The following is a famous item from the Cognitive Reflection Test (CRT):
"A bat and a ball together cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?"
The intuitive but wrong answer is $0.10. The correct answer is $0.05.
When this exact problem is presented to a state-of-the-art LLM, it typically gets it right. But when the problem is rephrased in a way that is superficially different but logically identical, for example by changing the objects or the currency or the phrasing, the model's performance drops significantly. It is not solving the problem by applying an algebraic reasoning procedure. It is recognizing a pattern from its training data and producing the associated correct answer. Change the surface form enough, and the pattern match fails.
A human who truly understands the problem can solve any version of it, because they have internalized the underlying logical structure, not just the surface pattern.
This is not a trivial observation. It reveals something deep about the nature of LLM "knowledge." The model's apparent competence is highly sensitive to the surface form of the problem. True understanding, by contrast, is surface-form invariant. A mathematician who understands the concept of simultaneous linear equations can solve them whether they are presented in words, in symbols, in a story about bats and balls, or in a story about camels and dates. The LLM's sensitivity to surface form is a direct symptom of its reliance on pattern matching rather than genuine understanding.
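The underlying structure of the bat-and-ball problem is a single equation: if the cheaper item costs x and the other costs x plus a fixed difference, then x + (x + difference) = total, so x = (total - difference) / 2. The sketch below applies that one procedure to the original problem and to a reworded variant. The function name and the camel-and-date prices are invented for illustration.

```python
from fractions import Fraction


def cheaper_item_price(total, difference):
    """Solve x + (x + difference) = total for x, the cheaper item's price."""
    total, difference = Fraction(total), Fraction(difference)
    return (total - difference) / 2


# Original wording: bat + ball = 1.10, bat costs 1.00 more than the ball.
ball = cheaper_item_price("1.10", "1.00")

# Reworded variant (hypothetical): camel + date = 330 dinars,
# the camel costs 300 dinars more than the date.
date = cheaper_item_price("330", "300")

print(float(ball), date)
```

The same two-line procedure handles both versions because it operates on the structure of the problem, not on the words used to state it. Exact fractions are used so that the result is not blurred by floating-point rounding.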
THE GROUNDING PROBLEM: WORDS WITHOUT WORLDS
There is a deeper issue lurking beneath the surface-form sensitivity, and it has a name that philosophers of mind have been wrestling with for decades: the symbol grounding problem. It was first articulated clearly by the cognitive scientist Stevan Harnad in 1990, and it asks a deceptively simple question: how do symbols get their meaning?
For humans, the answer is relatively clear, at least at a high level. The word "hot" means something to us because we have felt heat. We have touched a stove, stood in the sun, held a cup of coffee. The word is grounded in a rich network of sensory experiences, bodily reactions, and emotional associations. When we hear the word "hot," we do not just retrieve a dictionary definition. We activate a whole constellation of embodied memories and anticipations.
For an LLM, the word "hot" is a token. It is associated, through training, with other tokens: "temperature," "fire," "burn," "summer," "cold" (as its opposite), "spicy," and so on. The model has learned an extraordinarily rich network of such associations, and this network captures a great deal of what we might call the meaning of "hot" in a functional sense. But it is a network of symbols connected to other symbols, not a network of symbols connected to experiences.
This matters enormously for the question of AGI, because genuine general intelligence requires the ability to reason about the world, not just about text. A truly general intelligence must be able to predict what will happen when you pour water on a fire, not because it has read about fire and water, but because it has a causal model of combustion, heat transfer, and fluid dynamics. It must be able to understand that a bridge might collapse under a certain load, not because it has read about bridge collapses, but because it has a model of structural mechanics. It must be able to navigate a room it has never been in before, not because it has read about rooms, but because it has a model of three-dimensional space and the physics of solid objects.
LLMs have none of these things. They have text about these things, which is not the same.
Yann LeCun, one of the founding figures of modern deep learning and Meta's former chief AI scientist, has been perhaps the most prominent and technically sophisticated critic of the idea that LLMs are the path to AGI. LeCun argues that the fundamental limitation of LLMs is their lack of what he calls a world model. A world model, in LeCun's framework, is an internal representation of the physical world that allows an agent to predict the consequences of its actions, plan sequences of actions to achieve goals, and reason about counterfactual scenarios. Humans and animals build world models through direct sensorimotor interaction with the physical environment. A baby learns about gravity not by reading about it but by dropping things, repeatedly, and observing what happens. It learns about object permanence by playing peekaboo. It learns about the properties of materials by touching, squeezing, tasting, and throwing them.
LLMs have no such developmental history. They have text, and only text. And while text is an extraordinarily rich source of information about the world, it is a fundamentally impoverished substitute for direct experience. The philosopher and cognitive scientist Andy Clark has argued, in a tradition known as embodied cognition, that intelligence is not just a property of the brain but of the whole body and its interactions with the environment. Cognition is not just computation happening inside a skull; it is a dynamic process that involves the body, the environment, and the ongoing loop of perception and action. LLMs are, by their very nature, disembodied. They have no body, no sensors, no actuators, no ongoing interaction with the physical world. They are, in a very literal sense, brains in vats, and not even real brains at that.
THE CAUSAL REASONING CATASTROPHE
One of the most technically precise arguments against LLMs as a path to AGI comes from the field of causal inference, and it is worth spending some time on because it is both rigorous and devastating.
The computer scientist and philosopher Judea Pearl, who won the Turing Award in 2011 for his work on probabilistic reasoning and causal inference, has articulated what he calls the "ladder of causation." This ladder has three rungs, and the distinction between them is crucial for understanding what LLMs can and cannot do.
The first rung is association. This is the ability to notice correlations in data: when A happens, B tends to happen. This is what LLMs do, and they do it extraordinarily well. They have learned an enormous number of associations from their training data, and they can deploy these associations with impressive fluency.
The second rung is intervention. This is the ability to reason about what would happen if you actively changed something: if I do A, what will happen to B? This requires a causal model, not just a statistical one. It requires understanding the mechanism by which A influences B, not just the correlation between them.
The third rung is counterfactual reasoning. This is the ability to reason about what would have happened if things had been different: if A had not happened, would B still have occurred? This is the most sophisticated form of causal reasoning, and it is fundamental to human intelligence. It underlies our ability to learn from mistakes, to assign responsibility, to understand narratives, and to plan for the future.
LLMs operate almost entirely at the first rung. They are extraordinarily good at association. They can tell you that smoking is correlated with lung cancer, that rain is correlated with wet streets, that studying is correlated with good grades. But they cannot reliably reason about interventions or counterfactuals, because they have no causal model of the world. They have only a statistical model of text about the world.
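The difference between the first two rungs can be shown with a small simulation. The sketch below assumes a toy causal model in which hot weather drives both ice cream sales and drownings, with no arrow between the two. All coefficients, noise levels, and names are made up for illustration; nothing here is drawn from real data.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n, rng, ban_ice_cream=False):
    """Sample days from the toy model: heat -> sales, heat -> drownings."""
    sales, drownings = [], []
    for _ in range(n):
        heat = rng.uniform(0, 30)
        # Intervention do(sales = 0): cut the heat -> sales arrow only.
        ice = 0.0 if ban_ice_cream else 2.0 * heat + rng.gauss(0, 5)
        drown = 0.1 * heat + rng.gauss(0, 0.5)  # depends on heat, never on sales
        sales.append(ice)
        drownings.append(drown)
    return sales, drownings


rng = random.Random(0)
sales, drownings = simulate(5000, rng)
_, drownings_after_ban = simulate(5000, rng, ban_ice_cream=True)

# Rung one: the observed association is strong.
correlation = pearson(sales, drownings)

# Rung two: forcing sales to zero leaves drownings essentially unchanged.
mean_before = sum(drownings) / len(drownings)
mean_after = sum(drownings_after_ban) / len(drownings_after_ban)
print(correlation, mean_before, mean_after)
```

The correlation can be read off the data alone. The effect of the ban cannot: answering it requires knowing which arrows exist, which is exactly the causal model that a purely associative learner lacks.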
SHOWCASE 2: The Causal Reasoning Gap
Consider the following three questions:
Question A (Association): "Is there a correlation between ice cream sales and drowning rates?"
An LLM will correctly note that both tend to increase in summer, and will likely correctly identify that this is a spurious correlation driven by the confounding variable of warm weather.
Question B (Intervention): "If a city bans ice cream sales, will drowning rates decrease?"
An LLM will likely answer "no" correctly, because it has read texts that discuss this exact example as a classic illustration of confounding.
Question C (Novel Causal Scenario): "A factory produces widgets. When machine A runs, the temperature in room X rises. When the temperature in room X rises, machine B slows down. Machine C runs only when machine B slows down. If we install better cooling in room X, what happens to machine C?"
This is a simple causal chain, but it is presented in a form that is unlikely to match any specific pattern in the training data. LLMs frequently fail on such novel causal chains, especially when the chain has more than two or three steps, or when the problem is embedded in an unfamiliar domain.
A human engineer with a basic understanding of cause and effect can solve this trivially. The LLM must hope that it has seen enough similar patterns to produce the right answer by interpolation.
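The chain in Question C can be written down as an explicit causal model in a few lines. This sketch assumes, for simplicity, that machine A is running and that the better cooling fully prevents the temperature rise. The function and variable names are invented for the example.

```python
def factory_state(machine_a_running, better_cooling=False):
    """Propagate the causal chain from Question C, one mechanism per line."""
    # Machine A heats room X, unless the cooling absorbs the heat
    # (assumption: the new cooling is strong enough to cancel the rise).
    room_x_hot = machine_a_running and not better_cooling
    # Machine B slows down when room X is hot.
    machine_b_slow = room_x_hot
    # Machine C runs only when machine B has slowed down.
    machine_c_running = machine_b_slow
    return {
        "room_x_hot": room_x_hot,
        "machine_b_slow": machine_b_slow,
        "machine_c_running": machine_c_running,
    }


before = factory_state(machine_a_running=True)
after = factory_state(machine_a_running=True, better_cooling=True)
print(before["machine_c_running"], after["machine_c_running"])
```

With the mechanisms stated explicitly, the answer follows by propagation: the room stays cool, machine B does not slow down, and machine C stops running. The length of the chain makes no difference to a system that reasons this way.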
The practical consequences of this causal blindness are not trivial. An AGI would need to reason causally about the world in order to plan, to learn from experience, to understand the consequences of its actions, and to model the intentions and beliefs of other agents. All of these capabilities depend on causal reasoning. An LLM that can only do association is, in Pearl's framework, stuck on the first rung of the ladder, no matter how many parameters it has or how much data it has been trained on.
THE FROZEN KNOWLEDGE PROBLEM
There is another fundamental limitation of LLMs that is sometimes overlooked in the excitement about their capabilities, and it is one that becomes more important the more you think about what AGI would actually need to do. LLMs do not learn. Not in the way that matters.
This statement requires some clarification, because LLMs obviously do learn during training. The process of training a large language model involves adjusting billions of parameters over the course of weeks or months, and the result is a model that has "learned" an enormous amount about language and the world. But once training is complete, the model is frozen. Its parameters do not change. It cannot update its knowledge based on new experiences. It cannot learn from its mistakes in real time. It cannot accumulate new skills through practice.
This is in stark contrast to human intelligence, which is characterized by continuous, lifelong learning. A human expert does not just apply knowledge acquired during education. They continuously refine their understanding through experience, feedback, and reflection. A doctor who misdiagnoses a patient learns from that mistake and adjusts their diagnostic reasoning. A chess player who loses a game analyzes what went wrong and improves their strategy. A scientist who gets a surprising experimental result updates their theoretical model of the world.
LLMs cannot do any of this. When an LLM makes a mistake, it does not learn from it. The next time it encounters a similar situation, it will make the same mistake again, unless it has been explicitly retrained. This is not a minor engineering limitation that will be solved by better software. It is a consequence of the fundamental architecture of these systems.
Techniques like Retrieval-Augmented Generation (RAG) and fine-tuning can partially address this limitation. RAG allows an LLM to access external knowledge bases at inference time, effectively giving it access to information that was not in its training data. Fine-tuning allows an LLM to be updated with new information. But neither of these techniques provides the seamless, continuous, experience-driven learning that characterizes human intelligence. RAG is essentially a sophisticated lookup system, not genuine learning. Fine-tuning is a batch process that requires significant computational resources and careful curation of training data, not the kind of rapid, flexible adaptation that intelligence requires.
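The "lookup" character of RAG is easy to see in a stripped-down sketch. Production systems typically rank documents with learned embeddings and a vector index; the version below swaps in simple word overlap so that it runs with no dependencies. The knowledge base, the bridge example, and the function names are all hypothetical.

```python
def retrieve(query, documents, k=1):
    """Rank documents by how many query words they share (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents):
    """Paste the retrieved text in front of the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical external knowledge base, consulted at inference time.
knowledge_base = [
    "The Elm Street bridge was closed after inspectors found cracked girders",
    "The city library opens at nine on weekdays",
]
prompt = build_prompt("Is the Elm Street bridge safe", knowledge_base)
print(prompt)
```

The retrieved sentence reaches the model only as extra input text. No parameter is updated, so once the prompt is gone the model is exactly what it was before.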
The deeper issue is that LLMs have no mechanism for integrating new experiences into their world model, because they have no world model to integrate them into. A human who learns that a particular bridge is structurally unsound updates their model of that bridge and, more importantly, updates their general model of bridge construction in ways that might affect their reasoning about other bridges. An LLM that is told a bridge is unsound can use that information within the current conversation, but it cannot generalize from it in the way a human would, and it certainly cannot retain it after the conversation ends.
THE ARC-AGI BENCHMARK: A MIRROR HELD UP TO THE EMPEROR'S NEW CLOTHES
One of the most illuminating empirical demonstrations of LLM limitations comes from a benchmark called ARC-AGI, created by François Chollet, the creator of the Keras deep learning library, while he was a researcher at Google. Chollet has been one of the most thoughtful and technically rigorous critics of the idea that current AI systems are approaching AGI, and the ARC-AGI benchmark was designed specifically to test for the kind of general reasoning that LLMs are supposed to be developing.
The benchmark consists of visual pattern recognition tasks that are trivially easy for humans but extremely difficult for LLMs. Each task presents a small number of input-output examples showing a transformation of a grid of colored cells, and the system must infer the rule governing the transformation and apply it to a new input.
SHOWCASE 3: An ARC-AGI Style Task (Simplified Text Representation)
Training Examples:

Example 1:
Input:        Output:
[R][R][R]     [R][R][R]
[ ][ ][ ]     [R][R][R]
[ ][ ][ ]     [R][R][R]

Example 2:
Input:     Output:
[B][B]     [B][B]
[ ][ ]     [B][B]

The rule: The shape is "filled in" to form a solid rectangle.

Test Input:
[G][G][G][G]
[ ][ ][ ][ ]
[ ][ ][ ][ ]
[ ][ ][ ][ ]

Expected Output:
[G][G][G][G]
[G][G][G][G]
[G][G][G][G]
[G][G][G][G]
A human child of five can solve this after seeing two examples. State-of-the-art LLMs, as of 2024, score far below human performance on the full ARC-AGI benchmark, which contains hundreds of such tasks. The ARC Prize 2024 offered a prize pool of more than $1 million, with the grand prize reserved for the first system to achieve 85% accuracy. The benchmark was specifically designed to be resistant to pattern memorization, requiring genuine abstract reasoning.
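For readers who want to see the task as data, here is a minimal sketch of the "fill to a solid rectangle" rule applied to the two-by-two training example and the four-by-four test input from the showcase. The rule is written by hand as a hypothetical fill_down function. That is the easy part: what ARC-AGI actually tests is inventing such a rule from the examples, which this code does not attempt.

```python
E = " "  # empty cell


def fill_down(grid):
    """Hand-written candidate rule: copy the top row into every row."""
    top = grid[0]
    return [list(top) for _ in grid]


def rule_fits(rule, examples):
    """Check a candidate rule against every (input, output) training pair."""
    return all(rule(inp) == out for inp, out in examples)


# Training pair taken from Example 2 of the showcase.
examples = [
    ([["B", "B"], [E, E]], [["B", "B"], ["B", "B"]]),
]

# Test input from the showcase: one green row above three empty rows.
test_input = [["G"] * 4] + [[E] * 4 for _ in range(3)]

if rule_fits(fill_down, examples):
    prediction = fill_down(test_input)
    for row in prediction:
        print("".join(f"[{cell}]" for cell in row))
```

Checking and applying a rule is mechanical. The benchmark is hard because every task needs a different rule, and the solver has to come up with it from two or three examples.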
What makes the ARC-AGI benchmark so revealing is precisely what makes it hard for LLMs. Each task is novel. The rules governing each transformation are not things that can be memorized from a training set, because the benchmark is designed so that each task requires inferring a new rule from just a few examples. This is exactly what Chollet means by "efficient acquisition of new skills," and it is exactly what LLMs cannot do.
Chollet's definition of intelligence, which he has articulated in a 2019 paper titled "On the Measure of Intelligence," is worth quoting in spirit if not verbatim: intelligence is the ability to efficiently acquire new skills and solve novel problems, measured relative to the amount of prior experience and innate knowledge the system brings to bear. By this definition, LLMs are not particularly intelligent at all. They are extraordinarily good at applying skills they have already acquired during training, but they are poor at acquiring genuinely new skills from a small number of examples.
This is a profound point. The apparent generality of LLMs is largely an artifact of the breadth of their training data. Because they have been trained on text covering virtually every domain of human knowledge, they appear to be generally capable. But this apparent generality is not the same as genuine general intelligence. It is more like having a very large library. A person with access to a very large library can answer many questions, but that does not mean they understand the answers they are giving. And a person with a large library is helpless when confronted with a problem that requires knowledge not in any book.
THE CONSCIOUSNESS CONUNDRUM: DOES IT MATTER?
At this point, some readers may be thinking: so what? Perhaps AGI does not require understanding in the philosophical sense. Perhaps it just requires the ability to perform well across a wide range of tasks. If LLMs can do that, does it matter whether they "truly" understand?
This is a fair challenge, and it deserves a serious answer. The question of whether consciousness or genuine understanding is necessary for AGI is genuinely contested, and we should not dismiss it lightly. But there are strong reasons to think that the kind of performance-without-understanding that LLMs exhibit is not sufficient for AGI, even by purely functional criteria.
The first reason is reliability. A system that produces correct outputs by pattern matching will fail in unpredictable ways when it encounters situations that do not match its training patterns. A system that produces correct outputs by genuine understanding will fail gracefully and predictably, because its failures will be traceable to gaps in its knowledge or reasoning, not to arbitrary mismatches between test inputs and training patterns. LLMs are notoriously unreliable in exactly this way. They can produce confident, fluent, and completely wrong answers on topics they appear to know well, simply because the surface form of the question triggered the wrong pattern. This phenomenon, known as hallucination, is not a bug that can be fixed by better engineering. It is a direct consequence of the pattern-matching architecture.
The second reason is goal-directedness. AGI, by any reasonable definition, must be able to pursue goals. It must be able to identify what it wants to achieve, plan a sequence of actions to achieve it, monitor its progress, and adjust its plans when things go wrong. This requires not just the ability to produce appropriate outputs, but the ability to model the world, predict the consequences of actions, and reason about the relationship between means and ends. LLMs have none of this. They have no goals. They have no model of the world. They have no ability to plan. They are, in a very precise sense, reactive systems: they respond to inputs, but they do not act in the world.
The third reason is self-awareness. A genuinely general intelligence must be able to model itself, to know what it knows and what it does not know, to recognize when it is uncertain and when it is confident, and to reason about its own reasoning processes. This is known as metacognition, and it is a fundamental component of human intelligence. LLMs have a very limited and unreliable form of metacognition. They can be prompted to express uncertainty, and they can sometimes correctly identify when they do not know something. But this expressed uncertainty is itself a pattern learned from training data, not a genuine reflection of the model's epistemic state. The model does not actually know what it knows. It knows what kinds of uncertainty expressions tend to follow what kinds of questions.
SHOWCASE 4: The Hallucination Trap
Ask a state-of-the-art LLM the following question:
"What were the main findings of the 2019 paper by Dr. Elena Marchetti on the neurological correlates of creative problem-solving?"
If Dr. Elena Marchetti and this paper do not exist (and for the purposes of this example, they do not), a well-calibrated system should say it does not know. Many LLMs, however, will produce a fluent, confident, and entirely fabricated summary of findings, complete with plausible-sounding methodology and conclusions.
This is not a failure of knowledge retrieval. It is a failure of self-knowledge. The model does not know that it does not know. It cannot distinguish between "I have information about this" and "I can generate plausible-sounding text about this." From the model's perspective, these are the same operation.
A system that cannot reliably distinguish what it knows from what it is making up cannot be trusted with the kind of high-stakes reasoning that AGI would need to perform.
Noam Chomsky, writing in The New York Times in March 2023 with co-authors Ian Roberts and Jeffrey Watumull, made a related point with characteristic precision. He argued that the human mind is not a statistical engine for pattern matching but a system that seeks to create explanations. A child learning language does not just learn to predict the next word in a sequence. The child learns the grammatical rules of their language, and they do so with remarkable efficiency, from far fewer examples than any machine learning system requires. Chomsky's point is that human cognition is characterized by a drive toward explanation and understanding, not just toward prediction. LLMs are optimized for prediction, and prediction alone, however sophisticated, does not give rise to explanation or understanding.
THE OPTIMISTS AND WHY THEY ARE WRONG
It would be intellectually dishonest to present only the skeptical side of this debate without giving serious consideration to the arguments of those who believe that LLMs are, if not already AGI, at least a credible path toward it. These arguments deserve to be taken seriously, because they are made by serious people with serious credentials.
The most prominent optimist is Sam Altman, the CEO of OpenAI, who wrote in 2024 that superintelligence might arrive within a few thousand days, and who has said publicly that the rapid progress in LLM capabilities is evidence that we are on the right track. Altman's argument rests on what might be called the scaling hypothesis: the idea that as LLMs are trained on more data with more compute and more parameters, they will develop increasingly general and powerful capabilities, and that this scaling will eventually produce something that qualifies as AGI.
The scaling hypothesis has genuine empirical support. The performance of LLMs on a wide range of benchmarks has improved dramatically as models have been scaled up, and some of this improvement has come in the form of emergent capabilities, abilities that appear suddenly at certain scales and were not present in smaller models. The ability to do multi-step arithmetic, to understand analogies, to write code, and to engage in something resembling logical reasoning all emerged as models were scaled up, and this emergence was not fully predicted in advance.
But the scaling hypothesis has a fundamental problem, and it is one that the empirical data is increasingly making clear. The improvements from scaling are showing diminishing returns. The jump from GPT-2 to GPT-3 was enormous. The jump from GPT-3 to GPT-4 was significant but smaller. The jumps since then have been progressively smaller still, at least on the kinds of tasks that genuinely test for reasoning and understanding rather than fluency and knowledge retrieval. The scaling laws, first described rigorously in a 2020 paper from OpenAI, show that a model's loss falls as a power law in compute, data, and parameters, which means that each doubling of resources cuts the loss by the same fixed fraction, so the absolute improvement gets smaller and smaller with every doubling.
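The arithmetic of a power law can be made concrete with a short sketch. It assumes a loss curve of the form L(C) = a * C^(-alpha), where C is compute. The constants a = 10 and alpha = 0.05 are placeholder values chosen for illustration, not figures taken from any published scaling-law fit.

```python
def loss(compute, a=10.0, alpha=0.05):
    """Toy power law L(C) = a * C**(-alpha); a and alpha are made-up values."""
    return a * compute ** (-alpha)


# Absolute improvement in loss bought by each successive doubling of compute.
gains = []
compute = 1.0
for _ in range(5):
    gains.append(loss(compute) - loss(2 * compute))
    compute *= 2

# Every doubling multiplies the loss by the same factor, 2**(-alpha).
ratio = loss(2.0) / loss(1.0)

for i, gain in enumerate(gains, start=1):
    print(f"doubling {i}: loss drops by {gain:.4f}")
print(f"loss after each doubling = {ratio:.4f} x loss before")
```

Under these assumptions each doubling costs twice as much as the previous one and buys a smaller absolute reduction in loss. The relative reduction stays constant, which is why the curve never flattens completely but becomes steadily more expensive to descend.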
More importantly, the emergent capabilities that have appeared with scaling are not the capabilities that AGI requires. They are capabilities that are consistent with increasingly sophisticated pattern matching. The ability to do multi-step arithmetic, for example, is something that can be learned by pattern matching on arithmetic problems in the training data. The ability to write code is something that can be learned by pattern matching on code repositories. These are impressive capabilities, but they are not evidence of the kind of general reasoning that AGI requires.
Gary Marcus, professor emeritus of psychology and neural science at New York University and one of the most persistent and technically informed critics of the LLM-to-AGI thesis, has argued that the apparent progress of LLMs is misleading because the benchmarks on which they are evaluated are themselves susceptible to pattern matching. When an LLM achieves a high score on a benchmark, it is often because the benchmark contains patterns that are similar to patterns in the training data, not because the model has developed genuine reasoning capabilities. When the benchmark is modified to remove these patterns, performance drops dramatically. This is exactly what the ARC-AGI benchmark was designed to demonstrate, and the results have been unambiguous.
Marcus has also argued, in his book "Rebooting AI" co-authored with Ernest Davis, that the field of AI has a long history of overpromising and underdelivering, and that the current excitement about LLMs is another instance of this pattern. He points to the fact that LLMs still fail at tasks that any child can perform, such as reliably counting the number of objects in a scene, understanding the physical consequences of simple actions, or maintaining a consistent model of a simple fictional world across a long conversation.
Another prominent optimist is Demis Hassabis, the CEO of Google DeepMind, who has argued that AGI is achievable within the next decade. Hassabis is a more nuanced thinker than Altman on this topic, and he has acknowledged that LLMs alone are not sufficient for AGI. He believes that the path to AGI involves combining LLMs with other AI techniques, including reinforcement learning, symbolic reasoning, and world models. This is a more defensible position than the pure scaling hypothesis, but it is also, in a sense, an admission that LLMs by themselves are not enough. The question then becomes whether the combination of LLMs with these other techniques will produce AGI, and that is a much harder question to answer.
The researchers who argue that emergent capabilities in LLMs are evidence of something like general intelligence face a fundamental methodological problem. Emergence is a seductive concept, but it is not magic. When a new capability appears in a large model that was not present in a smaller model, there are two possible explanations. The first is that the capability is genuinely new, arising from some qualitative change in the nature of the model's representations or computations. The second is that the capability was always latent in the model's architecture, but required a certain scale of training data and parameters to be reliably expressed. The evidence strongly favors the second explanation. The emergent capabilities of LLMs are not qualitatively different from their non-emergent capabilities. They are all, at bottom, sophisticated pattern matching. They just require more patterns to be reliably triggered.
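One way the second explanation can play out is purely arithmetical. The sketch below is a toy model, not a measurement: it assumes a hypothetical per-step accuracy that improves smoothly with scale, and it assumes that the steps of a multi-step task succeed or fail independently. Both assumptions, and the function names, are invented for illustration.

```python
def per_step_accuracy(scale: int) -> float:
    """Hypothetical smooth improvement in single-step accuracy with scale."""
    return 1.0 - 0.5 / scale


def task_success(scale: int, steps: int) -> float:
    """Chance of getting every step of a multi-step task right,
    assuming the steps succeed or fail independently."""
    return per_step_accuracy(scale) ** steps


if __name__ == "__main__":
    for scale in (1, 2, 4, 8, 16, 32, 64):
        print(f"scale {scale:>2}: "
              f"per-step {per_step_accuracy(scale):.3f}, "
              f"20-step task {task_success(scale, 20):.3f}")
```

Under these assumptions, single-step accuracy creeps upward with no discontinuity anywhere, yet the 20-step task goes from almost never succeeding to usually succeeding. Nothing new has appeared in the system. An existing ability has simply become reliable enough to survive twenty steps in a row.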
SHOWCASE 5: The Emergence Illusion
Consider the following analogy. A student is learning to multiply large numbers. With small numbers (up to 10), they can do it by counting on their fingers. With medium numbers (up to 100), they need to use a learned algorithm. With large numbers (up to 1000), they need to use long multiplication.
At each stage, a new "capability" appears that was not present before. But this is not genuine emergence in any deep sense. The student is not developing a new kind of intelligence. They are applying the same basic cognitive machinery to progressively larger problems, using techniques that scale with the size of the problem.
LLM emergent capabilities follow the same pattern. The ability to do chain-of-thought reasoning, for example, appears at a certain scale. But it appears because the model has seen enough examples of chain-of-thought reasoning in its training data to reliably reproduce the pattern. It is not evidence of a new kind of intelligence. It is evidence of a more complete pattern library.
The test of genuine emergence would be the appearance of capabilities that cannot be explained by pattern matching on training data. No such capabilities have been convincingly demonstrated in LLMs.
THE MATHEMATICAL WALL
There is one more argument against LLMs as a path to AGI that deserves careful attention, because it comes not from philosophy or cognitive science but from mathematics, and it is in some ways the most fundamental of all.
LLMs are, at their core, functions from input token sequences to output probability distributions, and once the tokens have been embedded as vectors those functions are continuous. They are implemented as deep neural networks, which are compositions of linear transformations and nonlinear activation functions. The universal approximation theorem tells us that sufficiently large neural networks can approximate any continuous function on a bounded domain to arbitrary precision. This is sometimes cited as evidence that LLMs could, in principle, learn to do anything.
But the universal approximation theorem is a theorem about function approximation, not about intelligence. It is an existence result: it tells us that some network with enough parameters can approximate a given continuous function, not that any particular training procedure will find that network. In particular, it does not tell us that next-token prediction is the right training signal for learning the functions that intelligence requires. And there are strong theoretical reasons to believe that it is not.
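The gap between "a network exists" and "training will find it" can be made concrete. The sketch below hand-picks the weights of a one-hidden-layer ReLU network so that it linearly interpolates a target function. No training of any kind is involved. The function names and the choice of sin on [0, pi] as the target are assumptions made for illustration.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def build_relu_net(f, a: float, b: float, units: int):
    """Hand-pick weights for a one-hidden-layer ReLU network that linearly
    interpolates f at units + 1 evenly spaced knots on [a, b]."""
    h = (b - a) / units
    knots = [a + i * h for i in range(units)]
    slopes = [(f(a + (i + 1) * h) - f(a + i * h)) / h for i in range(units)]
    # The output weight of unit i is the change in slope at knot i.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, units)]
    bias = f(a)

    def net(x: float) -> float:
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return net


def max_error(f, net, a: float, b: float, samples: int = 1000) -> float:
    """Largest absolute gap between f and net on an evenly spaced grid."""
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)


if __name__ == "__main__":
    for units in (4, 16, 64):
        net = build_relu_net(math.sin, 0.0, math.pi, units)
        err = max_error(math.sin, net, 0.0, math.pi)
        print(f"{units:>2} hidden units: max error {err:.6f}")
```

The error shrinks as hidden units are added, which is the content of the theorem in this simple one-dimensional case. But the weights were computed directly from the target function, which is exactly the information a learner does not have. The theorem guarantees that a good approximator exists and is silent on whether a given training signal leads to it.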
The functions that intelligence requires are not just any functions. They are functions that involve causal reasoning, counterfactual reasoning, goal-directed planning, and self-modeling. These functions are not well-captured by the statistical structure of text. A text corpus contains the outputs of intelligent processes, but it does not contain the processes themselves. Training a model to predict text is like training a model to predict the outputs of a calculator without ever showing it the calculator. The model might learn to approximate the outputs for common inputs, but it will not learn the underlying arithmetic.
This is related to a point made by the mathematician and physicist Roger Penrose, who has argued, controversially, that human consciousness involves non-algorithmic processes that cannot be captured by any computational system. Penrose's argument is based on Gödel's incompleteness theorems and is highly contested. But even setting aside the consciousness question, there is a more modest and less controversial version of the same point: the functions that intelligence requires may not be learnable from text prediction alone, regardless of scale.
SHOWCASE 6: The Training Signal Problem
Imagine you want to train a system to be a master chess player.
Approach A: Show the system millions of games of chess, and train it to predict the next move in each game. This is analogous to how LLMs are trained: predict the next token given the preceding context.
Approach B: Have the system play millions of games of chess against itself, receiving a reward signal (win/loss/draw) at the end of each game. This is how AlphaZero was trained.
Approach A will produce a system that can predict what moves human players tend to make in various positions. It will be a good predictor of human chess behavior. But it will not necessarily be a good chess player, because predicting what humans do is not the same as finding the best move.
Approach B will produce a system that actually learns to play chess well, because the training signal is directly aligned with the goal.
LLMs are trained using Approach A, applied to language. They learn to predict what humans say. This makes them good at producing human-like text. But it does not make them good at reasoning, planning, or understanding, because these are not what the training signal rewards.
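The difference between the two approaches can be sketched in a few lines. The toy below replaces chess with a single decision among three moves. WIN_PROB and HUMAN_HABIT are hypothetical numbers invented for illustration, and the reward learner is a plain epsilon-greedy loop, a far simpler stand-in than AlphaZero's actual method.

```python
import random

WIN_PROB = {"a": 0.3, "b": 0.5, "c": 0.8}     # hypothetical true win rates
HUMAN_HABIT = {"a": 0.2, "b": 0.7, "c": 0.1}  # hypothetical human move frequencies


def train_by_imitation(rng: random.Random, games: int = 5000) -> str:
    """Approach A: observe human games and learn the most common move."""
    moves = list(HUMAN_HABIT)
    habit = [HUMAN_HABIT[m] for m in moves]
    counts = {m: 0 for m in moves}
    for _ in range(games):
        played = rng.choices(moves, weights=habit)[0]
        counts[played] += 1
    return max(counts, key=counts.get)


def train_by_reward(rng: random.Random, games: int = 5000,
                    epsilon: float = 0.1) -> str:
    """Approach B: epsilon-greedy trial and error on win/loss outcomes."""
    moves = list(WIN_PROB)
    wins = {m: 0 for m in moves}
    plays = {m: 0 for m in moves}

    def value(move: str) -> float:
        return wins[move] / plays[move] if plays[move] else 0.0

    for _ in range(games):
        if rng.random() < epsilon:
            move = rng.choice(moves)        # explore
        else:
            move = max(moves, key=value)    # exploit the best estimate so far
        plays[move] += 1
        if rng.random() < WIN_PROB[move]:
            wins[move] += 1
    return max(moves, key=value)


if __name__ == "__main__":
    print("imitation learner plays:", train_by_imitation(random.Random(0)))
    print("reward learner plays:   ", train_by_reward(random.Random(0)))
```

The imitation learner converges on the move humans favor, whatever its win rate. The reward learner converges on the move that wins most often, because winning is the only thing its training signal measures.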
The training signal problem is, in some ways, the most fundamental objection to LLMs as a path to AGI. It is not just that LLMs lack certain capabilities. It is that the way they are trained cannot, in principle, produce those capabilities. You cannot learn to reason causally by predicting text, any more than you can learn to swim by reading about swimming. The training signal must be aligned with the capability you want to develop, and next-token prediction is not aligned with the capabilities that AGI requires.
THE MULTIMODAL OBJECTION AND WHY IT DOES NOT SAVE THE DAY
At this point, a sophisticated defender of the LLM-to-AGI thesis might raise an objection: modern LLMs are not just text models. They are multimodal systems that can process images, audio, and even video. GPT-4V, Gemini Ultra, and Claude 3 can all process images as well as text. Does this not address the grounding problem? Does it not give these systems access to the kind of sensory information that grounds human understanding?
It is a fair point, and multimodal models are genuinely more capable than text-only models in many respects. But multimodality does not solve the grounding problem. It extends it.
The issue is not just that LLMs lack access to sensory information. The issue is that they lack the kind of active, exploratory, goal-directed interaction with the world that gives sensory information its meaning. A human child does not just passively observe the world. The child reaches out and touches things, picks them up, drops them, throws them, puts them in their mouth. The child's understanding of the physical world is built up through thousands of hours of active exploration, in which the child is the agent, making choices and observing the consequences.
A multimodal LLM that can process images is not doing anything like this. It is processing static representations of the world, not interacting with the world itself. It is like the difference between a person who has seen thousands of photographs of swimming pools and a person who has actually swum in one. The photographs convey a great deal of information, but they do not convey the feel of the water, the resistance of it, the way it holds you up, the way it gets in your nose if you do not hold your breath. And it is precisely this kind of embodied, action-based knowledge that grounds understanding.
Furthermore, the way multimodal LLMs process images is, at bottom, the same as the way they process text: they convert the image into a sequence of tokens and apply the same next-token prediction machinery. The image tokens are correlated with text tokens in the training data, and the model learns these correlations. This is useful, but it is not grounding in the philosophical sense. The model has not learned what a chair is by sitting in one. It has learned that images of chairs tend to co-occur with text about chairs, and it has learned to produce appropriate text when shown an image of a chair. This is a sophisticated form of pattern matching, not embodied understanding.
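To see what "converting an image into a sequence of tokens" can mean in practice, consider the following minimal sketch. It shows one common scheme, cutting the image into fixed-size patches and reading them off in order. The function name is hypothetical, and the tokenization details of the specific commercial models named above are not assumed here.

```python
def image_to_patches(image, patch):
    """Split a 2D grid of pixel values into non-overlapping patch x patch
    tiles, scanned left to right and top to bottom, each flattened to a list."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


if __name__ == "__main__":
    # A 4x4 "image" whose pixel values are simply 0..15.
    image = [[row * 4 + col for col in range(4)] for row in range(4)]
    for index, tile in enumerate(image_to_patches(image, 2)):
        print(index, tile)
```

In a real system each flattened patch would then be mapped to a vector and placed in the same sequence as the text tokens. From that point on the image is one more run of symbols whose statistical relationships to other symbols are to be learned.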
WHAT WOULD AGI ACTUALLY REQUIRE?
Having spent considerable time on what LLMs cannot do, it is worth pausing to ask what AGI would actually require. This is not just an academic question. It is the question that determines whether the gap between LLMs and AGI is a gap that can be bridged by incremental improvements, or whether it requires a fundamentally different approach.
Most serious researchers agree that AGI would require, at minimum, the following capabilities. It would need to be able to learn continuously from experience, updating its knowledge and skills in real time without forgetting what it already knows. It would need to have a causal model of the world that allows it to predict the consequences of actions, reason about counterfactuals, and plan sequences of actions to achieve goals. It would need to be able to generalize from a small number of examples to novel situations, in the way that humans can learn a new concept from just one or two instances. It would need to have some form of self-model, an understanding of its own capabilities, limitations, and knowledge state. And it would need to be able to pursue goals autonomously, without requiring human guidance for every step of a task.
LLMs fall short on every one of these dimensions. They do not learn continuously. They do not have causal models. They generalize poorly from small numbers of examples. Their self-models are unreliable. And they cannot pursue goals autonomously in any meaningful sense.
The path to AGI, if there is one, is likely to involve architectures that are fundamentally different from current LLMs. Yann LeCun has proposed a hierarchical architecture based on world models, in which the system learns to predict the future state of the world from current observations, and uses this predictive model to plan actions. Ben Goertzel, the creator of the OpenCog framework, has argued for a hybrid approach that combines neural networks with symbolic reasoning and probabilistic logic. Others have argued for embodied AI systems that learn through physical interaction with the world, in the tradition of developmental robotics.
None of these approaches has yet produced anything close to AGI. But they are at least aimed at the right target. LLMs, however impressive, are aimed at a different target: predicting text. And predicting text, however well you do it, is not the same as being generally intelligent.
THE DEEPER PHILOSOPHICAL QUESTION
There is one final dimension to this debate that deserves attention, even though it takes us into territory that is genuinely uncertain and contested. It is the question of whether intelligence, in the full sense that AGI implies, is even possible without something like consciousness.
This is not a question that can be answered definitively with current knowledge. We do not have a scientific theory of consciousness. We do not know what gives rise to subjective experience. We do not know whether consciousness is a necessary component of intelligence or an epiphenomenon that happens to accompany it in biological systems. These are among the hardest open questions in all of science and philosophy.
But the question is relevant to the AGI debate for the following reason. If consciousness, or something like it, is necessary for genuine understanding, goal-directedness, and self-awareness, then any system that lacks consciousness will also lack these properties. And LLMs, as far as we can tell, are not conscious. They have no subjective experience. There is nothing it is like to be an LLM. They process inputs and produce outputs, but there is no inner experience accompanying this processing.
John Searle's Chinese Room argument, discussed earlier, is relevant here. Searle argued that syntax is not sufficient for semantics, and that no amount of symbol manipulation, however sophisticated, will give rise to genuine understanding without the right kind of causal connection to the world. Whether or not you accept Searle's specific argument, the intuition behind it is powerful: there seems to be something missing from a system that can produce all the right outputs without any inner experience of what those outputs mean.
The philosopher David Chalmers has called this the "hard problem of consciousness": explaining why there is subjective experience at all, why there is something it is like to be a conscious being, rather than just information processing happening in the dark. The hard problem has no accepted solution, and it is not clear that it ever will have one. But it casts a long shadow over the AGI debate, because it raises the possibility that genuine intelligence requires something that no purely computational system can have.
This is not to say that AGI is impossible. It is to say that the path to AGI may require solving problems that are far deeper than the engineering challenges of building bigger and better LLMs. It may require a fundamental rethinking of what intelligence is, how it arises, and what kind of physical system can instantiate it.
CONCLUSION: THE MAGNIFICENT DEAD END AND THE ROAD NOT YET TAKEN
Large Language Models are one of the most remarkable technological achievements in human history. They have demonstrated capabilities that would have seemed like science fiction just a decade ago. They are genuinely useful, genuinely impressive, and genuinely transformative for many aspects of human work and life. None of this is in dispute.
But they are not on the road to AGI. They are on a road that leads somewhere else, somewhere interesting and valuable, but not to general intelligence. The reasons for this are not matters of opinion or speculation. They are grounded in the fundamental architecture of these systems, in the nature of the training signal they use, in the absence of causal reasoning and world models, in the frozen nature of their knowledge, in the unreliability of their self-models, and in the deep philosophical problems of grounding and understanding.
The scientists who believe that LLMs are the path to AGI are making a mistake that is understandable but important to correct. They are confusing impressive performance with genuine capability. They are confusing the breadth of a training corpus with the depth of understanding. They are confusing the appearance of reasoning with reasoning itself. And they are confusing the excitement of rapid progress with progress toward the right goal.
The scaling hypothesis, the idea that more data and more compute will eventually produce AGI, is not supported by the evidence. The improvements from scaling are following diminishing returns. The capabilities that are emerging with scale are consistent with increasingly sophisticated pattern matching, not with the development of genuine reasoning or understanding. And the benchmarks that LLMs are failing, like ARC-AGI, are precisely the benchmarks that test for the kind of general reasoning that AGI requires.
The road to AGI, if it exists, runs through territory that LLMs cannot reach: through embodied interaction with the physical world, through causal models that capture the structure of reality rather than the statistics of text, through continuous learning systems that update their knowledge in real time, through architectures that can generalize from small numbers of examples to novel situations, and perhaps through some form of self-awareness that goes beyond the unreliable metacognition of current LLMs.
None of this means that the work on LLMs is wasted. These systems are extraordinarily useful tools, and the techniques developed for training and deploying them will undoubtedly inform the development of future AI systems. But they are tools, not minds. They are mirrors, reflecting the intelligence of the humans who wrote the text they were trained on, not windows into a new kind of intelligence.
The magnificent dead end of LLMs is magnificent precisely because it has taught us so much about what intelligence is not. It has shown us, with unprecedented clarity, that fluency is not understanding, that correlation is not causation, that prediction is not reasoning, and that scale is not the same as depth. These are not small lessons. They are the kind of lessons that, if taken seriously, will point the way toward whatever comes next.
And whatever comes next, it will not look like a very large autocomplete.
SOURCES AND REFERENCES
Bender, E. M., Gebru, T., McMillan-Major, A., and Mitchell, M. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21), pages 610-623. ACM. DOI: https://doi.org/10.1145/3442188.3445922. Note: The fourth author, Margaret Mitchell, appears in the ACM Digital Library under the pseudonym "Shmargaret Shmitchell" due to a dispute with Google at the time of publication. Her real name is used here for clarity and accuracy.
Chollet, F. (2019). "On the Measure of Intelligence." arXiv preprint arXiv:1911.01547 [cs.AI]. Submitted November 5, 2019. Available at: https://arxiv.org/abs/1911.01547. This paper introduces both the formal definition of intelligence used throughout the article and the Abstraction and Reasoning Corpus (ARC) benchmark.
Chomsky, N., Roberts, I., and Watumull, J. (2023). "The False Promise of ChatGPT." The New York Times, March 8, 2023. Available at: https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-false-promise.html.
Harnad, S. (1990). "The Symbol Grounding Problem." Physica D: Nonlinear Phenomena, Volume 42, Issues 1-3, June 1990, pages 335-346. DOI: https://doi.org/10.1016/0167-2789(90)90052-J.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). "Scaling Laws for Neural Language Models." arXiv preprint arXiv:2001.08361 [cs.LG]. Submitted January 23, 2020. Available at: https://arxiv.org/abs/2001.08361.
LeCun, Y. (2022). "A Path Towards Autonomous Machine Intelligence." Version 0.9.2, June 27, 2022. Available on OpenReview at: https://openreview.net/forum?id=BZ5a1r-kVsf and on Meta AI Research at: https://ai.meta.com/research/publications/a-path-towards-autonomous-machine-intelligence/. This position paper sets out LeCun's proposed architecture for autonomous intelligence based on world models and is the primary source for his critique of LLMs as a path to AGI.
Marcus, G., and Davis, E. (2019). "Rebooting AI: Building Artificial Intelligence We Can Trust." Pantheon Books. Published September 10, 2019. ISBN: 978-1524748258.
Marcus, G. (ongoing). "Marcus on AI." Substack newsletter. Available at: https://garymarcus.substack.com/. Gary Marcus's ongoing public commentary on the limitations of LLMs and the challenges of achieving AGI.
Pearl, J., and Mackenzie, D. (2018). "The Book of Why: The New Science of Cause and Effect." Basic Books. Published May 15, 2018. ISBN: 978-0465097609.
Pearl, J. (2011). ACM A.M. Turing Award. Awarded for fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning. Award details available at: https://amturing.acm.org/award_winners/pearl_2658896.cfm.
Searle, J. R. (1980). "Minds, Brains, and Programs." Behavioral and Brain Sciences, Volume 3, Issue 3, September 1980, pages 417-424. DOI: https://doi.org/10.1017/S0140525X00005756.
Altman, S. (2024). "The Intelligence Age." Blog post published on the OpenAI website, September 23, 2024. Available at: https://openai.com/index/the-intelligence-age/. This is the primary source for Altman's statement that superintelligence may arrive "in a few thousand days."
Hassabis, D. (2023). Statements on AGI timelines made in an interview with MIT Technology Review, published July 10, 2023. Available at: https://www.technologyreview.com/2023/07/10/1075699/demis-hassabis-deepmind-agi/. Hassabis stated that AGI could be achieved within a decade and that the path involves combining LLMs with reinforcement learning and symbolic reasoning.
Goertzel, B. OpenCog: A Software Framework for Integrative Artificial General Intelligence. Goertzel is the primary architect of the OpenCog hybrid AI framework, which combines neural networks with symbolic reasoning and probabilistic logic as an alternative path to AGI. Further details available at: https://www.goertzel.org/.
ARC Prize and ARC-AGI-2 Benchmark (2024-2025). Official competition website: https://arcprize.org/. The ARC Prize 2025 offers a prize of $1 million or more for the first system to achieve 85% or higher accuracy on the ARC-AGI-2 benchmark. The benchmark was created by Francois Chollet and is described in detail in his 2019 paper cited above. The announcement of ARC-AGI-2 and the 2025 prize is available at: https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025.