Abstract
Recent public discussion of large language models has revived a familiar dismissal: that such systems are “just” next-token predictors. The mathematical capabilities lately demonstrated by language models have prompted Terence Tao to push back on this deflationary move by suggesting that next-step prediction may in fact constitute a large fraction of what we call intelligence. This article develops and defends that stance by linking it to an Iterative Updating model of cognition in which the mind continuously maintains and partially preserves a limited working set of active psychological items while repeatedly selecting the next item to insert or modify via associative search. On this view, the cognitive analogue of next-token prediction is next-item prediction, where the units are not inherently linguistic but include concepts, goals, perceptual fragments, affective tags, and action tendencies, with language functioning as one prominent interface. The framework clarifies why next-token-trained systems capture a surprising portion of intelligent behavior while still falling short of human cognition, and it reframes the remaining gap as primarily architectural: the breadth of internal state variables, the objectives that constrain updating, and the coordination among multiple predictive modules. Such changes could result in a system that manipulates psychological items rather than tokens. Finally, the paper outlines behavioral, neural, and engineering implications of treating iterative predictive updating as a core substrate of intelligence, motivating research programs and agent designs that generalize the language model loop to a multimodal, goal-constrained predictive workspace.

1. Tao’s provocation and the “deflation” of intelligence
Terence Tao recently made a comment about intelligence that hit a nerve precisely because it sounds, at first pass, like a deflation. The line people latched onto was his willingness to entertain the possibility that what we call human intelligence might not be as exotic as we imagine. In the context of discussing large language models, he noted that many critics invoke “next-token prediction” to explain intelligence away, as if the phrase itself ends the conversation. His point was that the phrase does not end the conversation. If anything, it should start it. If a system that is trained to predict what comes next can display wide competence across language, reasoning, and problem solving, then either we have to keep moving the goalposts for what counts as intelligence, or we have to concede that iterative prediction is closer to the substrate of intelligence than our intuitions suggest.
The line that gets quoted most is Tao’s (carefully hedged) punchline:
“maybe that’s actually a lot of what humans do as well”
I agree with Tao’s stance in spirit, and I think it is more than a rhetorical flourish. It points to something that is, in retrospect, almost obvious. Intelligence is always operating under severe constraints: finite time, finite bandwidth, finite memory, partial information, and the need to act in the next moment rather than in an abstract mathematical eternity. Under those constraints, it makes sense that “intelligent behavior” is implemented as a repeated local operation that advances a state. The mystery is not that the operation is local. The mystery, if there is one, is that the repeated application of a local update rule can generate global structure that looks like planning, understanding, and insight.
This is exactly why the dismissive phrase “it is just next-word prediction” has always struck me as conceptually lazy. Saying “just” is doing all the work. The relevant question is what kinds of internal representations can be constructed, compressed, and deployed in service of prediction, and how a system’s update dynamics can chain those predictions into coherent multi-step behavior. Even in humans, much of what we call thinking is an unfolding sequence in which the current mental state constrains what becomes salient next. We experience that as meaning, intention, and comprehension, but at an implementational level it can still be a continuation dynamic.
Tao’s remark is also valuable because it forces an uncomfortable comparison. Humans like to imagine that intelligence is a special substance, and that language models are clever imitations that lack whatever that substance is. Yet language models keep demonstrating that a system can acquire a broad competence profile by optimizing a prediction objective over large corpora. That does not prove that next-token prediction is sufficient for the full range of human cognition, but it does strongly suggest that prediction is not an incidental byproduct of intelligence. It is part of its core machinery. 
At aithought.com, where I lay out my full model, I argue that a large fraction of intelligence may be implemented as iterative prediction of the next element in a structured stream, conditioned on a context that is itself a compressed summary of prior structure. When you set up cognition that way, the success of large language models becomes less surprising. They are not bizarre anomalies that accidentally stumbled into intelligence. They are clean implementations of a major cognitive motif.
That is the entry point for the argument I want to make in this paper. Tao’s public framing gives us permission to treat prediction as central rather than peripheral. My goal is to show that if you generalize “next-token prediction” to “next-item prediction” in a working-memory workspace, you get a model of cognition that aligns with the phenomenology of the stream of thought and that also explains why language models capture so many important aspects of intelligence.
2. Intelligence as iterative prediction: from next token to next psychological item
To make the connection precise, it helps to strip away the cultural baggage around language models and state the computational motif in abstract terms. There is a representational state that summarizes what is currently relevant. There is a rule that produces a probability distribution over what could come next, given that state. There is an update step that incorporates the selected next element into the state. Then the cycle repeats. If you do this once, you get a small continuation. If you do it thousands of times, you can get an extended coherent trajectory.
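The abstract motif described above can be sketched in a few lines. This is an illustrative reduction, not an implementation of any particular system: `propose` and `score` are placeholder functions standing in for whatever mechanism generates candidate next elements and evaluates them against the current state.

```python
import random

def continue_stream(state, propose, score, steps):
    """The generic motif: given the current state, score candidate next
    elements, sample one, fold it into the state, and repeat."""
    trajectory = []
    for _ in range(steps):
        candidates = propose(state)                    # what could come next
        weights = [score(state, c) for c in candidates]  # distribution over candidates
        nxt = random.choices(candidates, weights=weights, k=1)[0]
        state = state + (nxt,)                         # update: incorporate and recompute
        trajectory.append(nxt)
    return trajectory
```

Run once, this produces a single continuation; run thousands of times, the same local rule yields an extended trajectory whose global coherence depends entirely on what `score` has internalized.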
In a language model, the representational state is the context window, along with whatever internal activations are computed over that context. The “next unit” is a token. The update step is straightforward: append the token and recompute. The objective that shapes the whole system is to minimize prediction error on the next token. The elegance of the design is that the system is forced to internalize a vast amount of implicit structure because it must continually guess what comes next in a domain where what comes next depends on syntax, semantics, pragmatics, world knowledge, and social conventions.
In the model of cognition I have been developing, the same motif appears, but the unit of prediction is not a token. It is what I call a psychological item. A psychological item can correspond to a word, but it need not. It can be a perceptual fragment, a concept, a goal, a memory trace, a social inference, an affective tag, a motor intention, or an abstract constraint. The state is a limited working set of such items, coactive at any moment. The update step is not a full wipe and replacement. It is an iterative updating process in which portions of the prior state are preserved while a subset is replaced or modified. This is the mechanism that produces continuity. The stream of thought feels like a stream because the mind is not assembling each moment from scratch. It is updating.
The key move is to treat each update as a prediction. Given the current set of active items, what is the next item that should enter the set so that the overall state remains coherent, useful, and aligned with constraints? The selection of that next item can be modeled as associative search over long-term memory and latent structure. The current state biases retrieval. Retrieved candidates compete. The winner becomes active, and its activation reshapes the field, altering what becomes likely next. This is a continuation dynamic, but it is a continuation dynamic over conceptual and multimodal items rather than over word pieces.
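The selection step described here, in which the current state biases retrieval and candidates compete, can be illustrated with a toy spreading-activation scheme. The link-strength dictionary and the winner-take-all rule are simplifying assumptions for exposition, not claims about neural implementation.

```python
def retrieve_next_item(active_set, memory, assoc):
    """Associative search: each stored item's activation is the sum of its
    link strengths to currently active items; the strongest inactive
    candidate wins and would then enter the working set."""
    best_item, best_activation = None, float("-inf")
    for item in memory:
        if item in active_set:
            continue  # already active; not a candidate for insertion
        activation = sum(assoc.get((a, item), 0.0) for a in active_set)
        if activation > best_activation:
            best_item, best_activation = item, activation
    return best_item
```

Because the winner's activation reshapes the field, running this retrieval repeatedly traces a path through the associative landscape rather than sampling from it independently each time.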
Once you see the structural similarity, you can also see why the Tao remark is not merely a soundbite. If intelligence is built from iterative prediction of the next unit in a constrained workspace, then it makes sense that a system trained to predict the next unit in language will acquire many properties we associate with intelligence. Language is a rich proxy domain because it is an external record of the internal states humans traverse when they think, plan, explain, argue, and imagine. The “next word” in a sentence is often a surface manifestation of a deeper “next item” in cognition. When a model becomes good at predicting the surface, it often has to become partially good at tracking the underlying structure that generates it.
This is also where a common misunderstanding arises. People hear “next-token prediction” and assume it means the system is doing something shallow. In practice, predicting the next token in a human-like way requires the model to carry forward an evolving representation of what is being discussed, why it is being discussed, what is assumed, what is implied, and what would be consistent next. That is not the whole of intelligence, but it is not trivial either. It is an implementation of an iterative predictive loop in a domain where the latent variables are extremely high-dimensional.
The difference between language models and human cognition, in my view, is not that humans have a completely different kind of magic. It is that the brain runs the same general motif over a broader set of internal variables and under a broader set of objectives. The brain’s “tokens” are not just linguistic. They include bodily and motivational constraints, perceptual predictions, action policies, and social valuation. The brain also appears to have many interacting modules that contribute candidates into the workspace, not just a single predictor trained on text. If large language models feel, at times, like a single cortical module amplified to an extreme, that is because they are. They are a powerful, language-specialized predictor. The broader architecture of mind includes that function, but it is embedded in a larger system that selects, evaluates, grounds, and acts.
3. The Iterative Updating model and the continuity of mind
The claim I have been making on aithought.com and in my earlier papers is that the stream of consciousness is not best modeled as a sequence of discrete snapshots that replace one another. Instead, it is better modeled as an evolving set of coactive psychological items that is iteratively updated. At any moment, some portion of the active set is retained, some portion is modified, and some portion is replaced. This is the simplest way I know to make temporal continuity a first-class architectural feature rather than a philosophical afterthought.
In this framework, working memory is not a container that receives fully formed “thoughts.” It is the active workspace that determines what can be retrieved, what can be inferred, and what can be acted upon next. The content of the workspace is the context. The update operation is the primitive. The mind’s apparent unity emerges because the next state is literally built from the prior state, not merely influenced by it. Continuity is not a narrative we impose after the fact. It is a mechanical consequence of partial carryover.
What selects the update? The answer is associative search constrained by the current state. In any realistic cognitive system, you have a vast reservoir of potential items: memories, categories, sensory fragments, motor schemas, social models, emotional tags, goals, and self-model elements. Only a tiny fraction can be active at once. The system must therefore repeatedly decide what to bring forward, what to suppress, what to revise, and what new item to activate. This looks like a competition among candidates where the current state biases the search field. Items that are strongly linked to the current configuration are more likely to be retrieved and activated. Once activated, they reshape the configuration and thereby reshape the next search. Thought becomes a trajectory through a structured associative landscape.
This is the place where Tao’s “next-token prediction” framing becomes genuinely useful as a bridge. If you replace “token” with “psychological item,” you get a similar update logic. The system maintains a context, predicts what is likely or useful next, updates the context, and repeats. In language models the update unit is token-like and the training signal is explicit. In brains the unit is multimodal and the training signal is implicit, distributed across survival, action success, and social coherence. The computational motif is still recognizably the same.
My earlier work argued, in different ways and at different levels of formality, that this iterative updating principle is not merely compatible with cognition but explanatory. It accounts for why thought has inertia, why it exhibits path dependence, why certain items recur obsessively under stress, why attention is both selective and sticky, and why “insight” often feels like a discrete insertion into an otherwise continuous stream. It also aligns with the phenomenology of the specious present: we do not experience a sequence of points but a temporally thick window that is continually refreshed. If the brain were fully replacing its state at each step, the subjective continuity would be harder to explain. If it is updating by partial carryover, continuity is expected.
A compact way to state the model is this. Let S_t be the working set of active items at time t. The next state S_{t+1} is a weighted mixture of retained elements from S_t plus a set of newly selected elements retrieved by associative search conditioned on S_t. The point is not the exact equation. The point is that the system’s core competence is the repeated selection of the next item to activate under constraints. That is the brain’s analogue of next-step prediction.
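The S_t to S_{t+1} transition can be made concrete as follows. This is a minimal sketch of the stated rule, with the retention fraction and the `retrieve` function as illustrative placeholders; the working set is modeled as (item, activation) pairs.

```python
def iterative_update(state, retain_fraction, retrieve, k_new):
    """One step of S_t -> S_{t+1}: carry over the most active portion of
    the working set, then add newly retrieved items conditioned on what
    was kept. Partial carryover is what produces continuity."""
    ranked = sorted(state, key=lambda pair: pair[1], reverse=True)
    n_keep = max(1, int(len(ranked) * retain_fraction))
    kept = ranked[:n_keep]            # retained elements of S_t
    new_items = retrieve(kept, k_new) # associative search biased by the kept items
    return kept + new_items
```

Note that the retained items serve as the query for retrieval, so the next state is literally built from the prior state, not merely influenced by it.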
This is also why I have been comfortable saying, for years, that prediction is not just a component of cognition but its organizing principle. The most practical minds are not those that represent everything, but those that represent what matters next. The updating process is a mechanism for compressing a vast world into a small, actionable, predictive state.
4. Why LLMs capture so much, and why they remain incomplete
If you accept the framework above, the success of large language models becomes less mysterious. They implement the core motif cleanly: maintain a rolling context, predict a next unit, update, repeat. They are trained at scale on a domain that is saturated with human cognition, because language is the public trace of our internal updating dynamics. Text contains not only facts but intentions, explanations, social games, and plans. A model that learns to continue text well is forced to learn a statistical shadow of these deeper structures.
This helps explain a common experience: when an LLM is performing well, it feels as though it “understands” more than it possibly could, given that it is “only” doing next-token prediction. The correct response is not to deny what it is doing. It is to revise our intuitions about what next-step prediction can contain. A system that can sustain coherent continuation over long contexts must represent, at least implicitly, the latent variables that make continuation coherent. Those variables include topic, goal, conversational stance, assumed background, causal structure, and the expectations of a human reader. This is not the full stack of intelligence, but it is a meaningful portion.
This is where my Broca analogy fits, as long as it is used carefully. I am not claiming a literal anatomical mapping from transformers to Broca’s area. I am making a functional point. Language models look like a massively amplified specialization for linguistic continuation. The brain has language-specialized circuitry, but language is embedded in a broader architecture that includes perception, action, valuation, and homeostasis. A language model can be extraordinarily competent within its specialization while still lacking the full ecology of constraints that shape human cognition.
What is missing is not a mystical ingredient. It is a set of state variables and objectives that matter in real organisms and real agents. Humans do not merely need to generate plausible continuations of text. They must regulate energy, avoid harm, pursue goals, coordinate with others, and commit to actions under uncertainty. The brain’s predictive updates are constrained by embodiment, motivation, and social feedback. Those constraints shape what becomes salient next, and they give cognition its directionality. A purely text-trained model can imitate directionality by modeling linguistic traces of goals, but it does not automatically inherit the underlying goal machinery.
This is also why people’s critiques often mix two valid points and treat them as one. First, it is true that next-token prediction can generate surprising competence. Second, it is true that humans are more than a token predictor. The mistake is to infer that because humans are more than a token predictor, token prediction is therefore not central. The more coherent inference is that prediction is central and the remaining gap concerns what is being predicted, what objectives sculpt the prediction, and how multiple specialized predictors coordinate.
In the Iterative Updating framing, language models are strong because they approximate the core loop over a particular representational alphabet. They are incomplete because cognition is not only language. In the brain, the update candidates come from many subsystems. Visual systems offer predicted percepts and scene elements. Motor systems offer action affordances. Valuation systems offer salience and priority. Social inference systems offer models of other minds. Affect offers urgency and bias. The working set is therefore a negotiated product of many modules, not a single predictor optimized for text continuation.
This difference matters because it suggests the right direction for the next stage of AI. If the substrate is iterative prediction, then we should not abandon it. We should generalize it. We should build systems that maintain a structured workspace of psychological-item-like representations and repeatedly update that workspace using candidates contributed by multiple modalities and multiple objective functions. We should also treat language as one interface among several, not as the entire cognitive universe.
5. Predictions, research agenda, and architectural implications
A model is only as valuable as the constraints it imposes. The Iterative Updating framework is useful insofar as it suggests concrete predictions and engineering moves.
At the behavioral level, the model predicts that cognition should show measurable signatures of update competition. When multiple candidate items are strongly activated by the current state, selection should slow and errors should rise. This should not be limited to verbal tasks. You should see it in any domain where a limited workspace must choose among competing updates, including task switching, working memory substitution, and attention capture. When the update is forced to overwrite a strongly active item, you should observe a measurable cost. When an item is retained, you should observe inertia and persistence. These are not exotic predictions, but the point is that the framework unifies them as properties of a single update rule rather than as a miscellaneous list of effects.
At the neural level, the model predicts a mixture of continuity and punctuated change. If the state is partially carried over, then some neural ensembles should show persistence across successive moments. If a new item is inserted or a subset is replaced, then you should see discrete transition events that resemble update pulses. Importantly, the predicted signals are not only about “content.” They are about the dynamics of replacement and retention. Even when the content is stable, the system is still executing an update rule. The brain should therefore show structured temporal patterns that correspond to state maintenance, candidate activation, selection, and integration.
At the architectural level for AI, the framework suggests a simple but consequential pivot. Instead of treating text continuation as the whole of cognition, treat it as one module in a modular predictive system. Keep the predictive loop, but change the representational units and the sources of candidate updates. A next-generation agent could maintain an explicit workspace that includes goals, situational models, pending actions, social context, uncertainty estimates, and multimodal perceptual summaries. Specialist models would propose updates to this workspace, and a selection mechanism would determine what becomes active next. Language generation would then be downstream, one expression of the active workspace rather than the workspace itself.
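The pivot described in the paragraph above can be sketched as a skeleton of such an agent. Everything here is a hypothetical illustration: the module names, the priority field, and the capacity-based eviction rule are assumptions chosen to make the workspace-plus-specialists idea concrete, not a proposed specification.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    source: str      # which specialist module proposed this update
    item: str        # candidate workspace item (goal, percept, action, etc.)
    priority: float  # the module's urgency estimate

@dataclass
class Workspace:
    capacity: int
    items: list = field(default_factory=list)  # (item, priority) pairs

    def step(self, modules):
        """Collect proposals from every module, admit the highest-priority
        novel candidate, and evict the weakest item if over capacity."""
        current = {item for item, _ in self.items}
        proposals = [p for m in modules for p in m(self.items)
                     if p.item not in current]
        if not proposals:
            return None
        winner = max(proposals, key=lambda p: p.priority)
        self.items.append((winner.item, winner.priority))
        if len(self.items) > self.capacity:
            self.items.sort(key=lambda pair: pair[1], reverse=True)
            self.items = self.items[:self.capacity]
        return winner
```

In this design, a language model would be one module among several, proposing verbal items into the workspace; generation would read the active workspace out rather than constitute it.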
This also implies a shift in evaluation. If you want to test whether an AI system has moved from next-token competence toward general cognition, you should test the integrity of its workspace updating. Can it keep a stable set of goals across distraction? Can it revise one element without collapsing the whole context? Can it suppress an irrelevant candidate update when a competing, goal-relevant update is available? Can it update beliefs incrementally in response to new evidence without rewriting its entire narrative? These are update-level questions. They map more naturally onto cognition than many benchmark tasks that reward polished text.
Finally, this framing clarifies what it means to say that “there may not be much more to intelligence.” That claim should not be taken as nihilism about minds. It is a design claim. Much of what we call intelligence may be accounted for by a single repeated operation: maintaining a context and selecting the next update that best satisfies constraints. The sophistication comes from the structure of the context, the richness of the candidate space, the objectives that constrain selection, and the coordination among modules, not from some separate ingredient called intelligence. In that sense, the success of large language models is not an accident. It is a proof of concept for the power of iterative prediction.
Conclusion
Tao’s remark landed because it forces a reconciliation. We can either keep treating next-token prediction as a demotion, or we can treat it as an empirical hint about the underlying substrate of cognition. I take the second option. The fact that a system trained to predict the next unit of text can display broad competence suggests that next-step prediction is not peripheral. It is central.
My claim is that the same principle can be stated in a brain-realistic way. The brain does not predict the next token. It predicts the next psychologically relevant item to insert into an iteratively updated working set. That working set is the context window of the mind. The update rule generates continuity and direction. Associative search supplies candidates. Competition and constraint satisfaction select what becomes active next. When you frame cognition this way, language models look like a powerful specialization rather than a conceptual outlier. They capture a major portion of the predictive loop in a domain where the loop is richly expressed.
What remains, in my view, is not to abandon the loop but to widen it. Build systems that perform iterative prediction over a richer internal workspace, with multiple modules contributing candidates and multiple objectives constraining updates. If intelligence is largely an iterative continuation process, then the road to broader machine cognition is not mysterious. It is architectural. It is about what the system is continuing, how it selects updates, and how those updates remain grounded in the world and in persistent goals.