
Abstract

This article argues that “continuity of thought” is best understood as the phenomenological signature of a deeper computational requirement: stateful iteration. Any system that executes algorithms across time needs a substrate that preserves intermediate variables long enough to be updated; otherwise it can only recompute from scratch. Using this lens, I propose a simple taxonomy of information-processing substrates: external record substrates that preserve history as a trace, internal curated state substrates that maintain a compact working set updated by deltas, and hybrid substrates that combine both. I then apply this framework to transformer-based large language models, arguing that their effective continuity is dominated by an external record substrate (the token context), with strong iterative updating across depth inside a single forward pass but comparatively weak native time-iteration. I interpret popular prompting practices such as scratchpads, chain-of-thought, running summaries, and tool-based memory as compensatory attempts to manufacture an iterative substrate in text. Finally, I outline a hybrid architecture in which a transformer remains the associative engine and proposal generator while a capacity-limited, overlap-enforced workspace maintains protected referents and incremental updates across time, enabling progressive construction, improved interruption recovery, and measurable continuity dynamics.

Introduction 

When people talk about “continuity of thought,” they often mean something subjective: a stream of experience that feels smooth rather than choppy. But continuity is also a computational issue, and I think it is more useful to start there. Any system that executes an algorithm across time needs a substrate that can hold intermediate variables long enough for the next operation to act on them. If nothing persists, there is no true iteration, only repeated recomputation. That distinction sounds abstract until you notice how often it shows up in engineered systems, and how often it shows up in our current attempts to make large language models behave like stable reasoners or agents.

In my earlier work I argued that mental continuity can be explained by overlap in the set of coactive representations across successive brain states, and by incremental change in that overlap over time. The important part, for the purposes of AI, is not the phenomenology. It is the substrate. Overlap is a minimal recipe for statefulness without rigidity. The system can evolve, but it evolves as an edited continuation of itself rather than as a series of internal reboots. If you take that seriously, you get a more general claim: the overlap regime is not just a correlate of continuity; it is a computational medium that makes iterative processing possible, and iterative processing is what enables the execution of learned algorithms in a progressive, multi-step way.

Once you see it that way, you can compare cognitive substrates across biology and engineering. The pattern that keeps repeating is simple. There is a state, there is an update operator, and the system advances by applying updates to a state that remains recognizable across steps. The persistent state is the work surface. The update rule is the algorithm. Many systems can be described in this language, from caches and process contexts to Kalman filters and iterative solvers. The details differ, but the principle is stable. Computation becomes more than mapping inputs to outputs. It becomes a trajectory of a state that is iteratively refined.
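
The pattern is compact enough to show in a few lines of code. Here is a minimal Python sketch of the state-plus-update-operator pattern, using Newton's method for square roots: the estimate is the persistent state, and each step is a small edit to it rather than a recomputation from scratch.

```python
def newton_sqrt(a, steps=20):
    """Iterative solver as state plus update operator.

    The estimate x is the persistent work surface; each step applies
    the same update rule to the state carried over from the last step.
    """
    x = 1.0                      # initial state
    for _ in range(steps):
        x = 0.5 * (x + a / x)    # update: an edit to the existing state
    return x

print(newton_sqrt(2.0))  # converges to ~1.41421356
```

Strip out the carried state and the loop collapses: there is nothing left for the update rule to act on.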

That lens is also a good way to understand modern transformer models. Transformers are extraordinarily capable systems, but it is not obvious that they implement stateful iteration in the same way biological cognition seems to. They can produce coherent output, they can stay on topic, they can appear to reason, and yet the continuity substrate that makes those behaviors possible is not the one most people imagine. This matters, because the entire ecosystem of prompting tricks, scratchpads, and tool scaffolding can be reinterpreted as a collective attempt to add a missing substrate.

Section 1. A taxonomy of information-processing substrates

If we want to compare biological cognition to engineered systems and to transformers, we need a vocabulary that does not smuggle in conclusions. I find it useful to divide substrates for iteration into three broad categories: external record substrates, internal curated state substrates, and hybrid substrates.

An external record substrate is the simplest conceptually. The system persists its history in a record, and continuity comes from rereading that record. The record can be a log file, a notebook, a database table, or a sequence of tokens in a context window. The state of the system can be reconstructed by consulting the record, and the system can keep behaving consistently because the record remains stable. This is a real substrate for iteration, but the iteration is mediated by recollection and recomputation. The system does not necessarily carry a compact internal working state forward. It carries a trace, and it keeps re-deriving what matters from that trace.

An internal curated state substrate is more like what computer architects and control theorists instinctively mean by “state.” The system has a compact working state that persists across steps and is updated incrementally. CPU registers and flags are the simplest example. Caches are a particularly revealing example because they are curated under a capacity constraint. They do not keep everything. They keep what the system predicts it will need soon, and they evict the rest. The intelligence is not in storage; it is in the survival policy. Operating systems do something similar at a higher level when they preserve process contexts across time slices. A running program continues because its working state is saved and restored, not because the system rereads the original source code each millisecond. Control systems make the same point in mathematical form. A Kalman filter is literally a belief state that is updated by deltas as new evidence arrives. Each update depends on what was carried forward, so the system becomes coherent across time by construction.
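
The Kalman filter case is worth spelling out, since it is the purest example. Here is a minimal one-dimensional sketch, with illustrative noise values: the entire internal state is a (mean, variance) pair, and every measurement arrives as a delta to it.

```python
def kalman_1d(measurements, q=1e-4, r=0.25):
    """1-D Kalman filter: a compact belief state updated by deltas.

    The state is just (mean, var). Each update depends on what was
    carried forward, so coherence across time holds by construction.
    q is process noise, r is measurement noise (illustrative values).
    """
    mean, var = 0.0, 1.0                # prior belief state
    for z in measurements:
        var += q                        # predict: uncertainty grows a little
        k = var / (var + r)             # gain: how much to trust the evidence
        mean += k * (z - mean)          # update: a delta, not a rebuild
        var *= (1 - k)
        yield mean, var

for m, v in kalman_1d([0.9, 1.1, 1.0, 0.95, 1.05]):
    print(f"belief mean={m:.3f} var={v:.4f}")
```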

A hybrid substrate is what you build when you want both capacity and real-time iterative control. The external record gives you breadth and persistence. The internal curated state gives you speed, invariants, and a working surface for ongoing computation. Many high-performance systems look like this because it is how you get robustness and efficiency at the same time. Databases use on-disk storage plus caches and indexes that are maintained incrementally. Compilers keep the original source but also build intermediate representations that are edited through a series of transformations. Robotics stacks keep maps, logs, and sensor streams, but they also maintain a live state estimate that updates iteratively and drives action.
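
The database version of this is easy to caricature in code. A toy sketch of a hybrid store: an append-only log preserves the full record, while a compact index is maintained incrementally on every write and can be re-derived from the record when needed.

```python
class HybridStore:
    """Hybrid substrate: an append-only record plus a curated fast state.

    The log preserves full history (breadth, persistence); the index is
    a compact derived state maintained incrementally on each write, so
    reads never have to re-scan the whole record.
    """
    def __init__(self):
        self.log = []        # external record: everything, forever
        self.latest = {}     # internal curated state: key -> last value

    def put(self, key, value):
        self.log.append((key, value))   # persist the trace
        self.latest[key] = value        # incremental edit to working state

    def get(self, key):
        return self.latest.get(key)     # fast path; no replay needed

    def rebuild(self):
        """Recovery path: re-derive the curated state from the record."""
        self.latest = {}
        for key, value in self.log:
            self.latest[key] = value
```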

This taxonomy matters because it lets us pose a clean question about cognition and AI. Where does the system’s iteration actually live? Is it living in an external record, in an internal curated working set, or in a hybrid of both? If you believe, as I do, that continuity is the phenomenological signature of an underlying iterative substrate, then the architecture of that substrate becomes a central design question for AI.

Section 2. What a transformer is actually using as its substrate

Transformers, as used in large language models, are often described as if they carry an internal “train of thought” forward through time. In practice, their continuity substrate is closer to an external record model. The main thing that persists across time during generation is the growing token sequence itself. The model generates one token, appends it to the context, and then generates the next token by attending over that context. In other words, the model’s access to the past is mediated by the record of the past. That record is the substrate. It is not that the model has no internal dynamics, but the long-horizon continuity is largely implemented by rereading, reweighting, and recomputing over a stable trace.
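
The generation loop makes this concrete. A schematic sketch, with next_token as a hypothetical stub standing in for a full forward pass: the only thing that persists across steps is the growing record itself.

```python
def next_token(context):
    """Hypothetical stub standing in for a full transformer forward pass.

    The real model would attend over every token in `context`; the point
    here is only that `context` is the sole thing carried across steps.
    """
    return f"tok{len(context)}"

def generate(prompt_tokens, n_steps=5):
    context = list(prompt_tokens)       # the external record IS the state
    for _ in range(n_steps):
        tok = next_token(context)       # recompute over the whole trace
        context.append(tok)             # continuity = a longer record
    return context

print(generate(["the", "model"]))
```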

The KV cache that people often mention does not fundamentally change this picture. It is an optimization that makes attention over previous tokens faster by caching internal key and value tensors. It makes the rereading of the record computationally efficient. It does not, by itself, create a compact curated working set with explicit eviction and protected invariants. It is closer to a performance enhancement for the external record substrate than it is to a new stateful substrate category.

There is, however, a real iterative substrate inside a transformer, and it is important to name it correctly. It lives across depth rather than across time. Within a single forward pass, the model maintains a residual stream that is updated layer by layer. Each layer applies a relatively small transformation and adds it back to the existing representation. That is iterative updating. It is a deep sequence of edits to a representational state, and it is one reason transformers are so powerful. But this is not the same thing as a persistent time-iteration substrate. It is depth-iteration that happens within one generative moment. The model can generate a coherent token because it can refine representations through many layers. The question is what happens across successive moments of generation, where the model is effectively re-running that depth-iteration procedure again, conditioned on an expanded record.
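
The residual stream is also easy to sketch. In the toy version below, random matrices stand in for trained blocks; real layers are attention and MLP sublayers with normalization, but the additive structure is the point: each layer contributes an edit that is added back to the running representation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 8
layers = [rng.normal(scale=0.05, size=(d, d)) for _ in range(n_layers)]

x = rng.normal(size=d)        # the residual stream for one token position
for W in layers:
    x = x + np.tanh(W @ x)    # each layer adds a small edit to the stream

# x is now an iteratively refined state, but only across depth: the
# refinement lives inside one forward pass, not across time.
print(x[:4].round(3))
```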

Attention itself provides a kind of soft working set, because some parts of the context are weighted heavily and others are effectively ignored. In that sense, there is a functional foregrounding and backgrounding. But it is soft, distributed, and not explicitly governed by a persistence policy that enforces overlap and controlled turnover. The model is not forced to keep a stable subset of active internal referents alive from moment to moment. It is free to shift its effective focus drastically if the attention dynamics call for it. Sometimes that is good. Sometimes it is exactly what produces the feeling that the model is coherent but not stable, articulate but not anchored.
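
A small sketch shows why this working set is soft rather than curated. Softmax attention assigns every position some weight; nothing is kept or evicted by policy, and the distribution can shift arbitrarily on the next step.

```python
import numpy as np

def attention_weights(query, keys):
    """Softmax attention: a soft, implicit working set.

    Every position gets SOME weight, so foregrounding is a matter of
    degree, and there is no persistence policy enforcing overlap
    between one step's focus and the next.
    """
    scores = keys @ query / np.sqrt(len(query))
    w = np.exp(scores - scores.max())
    return w / w.sum()

rng = np.random.default_rng(1)
keys = rng.normal(size=(6, 8))           # six context positions
print(attention_weights(rng.normal(size=8), keys).round(3))
```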

This is the point where the substrate lens becomes clarifying rather than critical. A transformer can still do impressive multi-step work by repeatedly re-deriving intermediate structure from the external record. It can appear continuous because the trace is continuous. But it is not obviously doing what biological cognition seems to do, which is to preserve a compact active set that carries forward as a curated scaffold, and to update that scaffold incrementally by eviction and replacement. That difference is not a moral judgment. It is a design difference, and it likely explains why so many techniques in the LLM ecosystem look like attempts to manufacture a working substrate in text.

In the next section, I will make that point explicit by treating chain-of-thought, scratchpads, plan lists, running summaries, and tool-based note taking as compensatory workarounds. They are not arbitrary prompting fashions. They are our collective attempt to graft a curated time-iteration substrate onto an architecture whose native substrate is primarily an external record.

Section 3. Why the ecosystem keeps inventing prompt workarounds

If you watch how people actually use large language models when the stakes are higher than casual chat, you start to see a pattern. They do not simply ask the model to answer. They build scaffolding. They ask it to write a plan, maintain a running summary, keep a scratchpad, record assumptions, track open questions, and periodically restate goals. They add tools, retrieval, long-term memory stores, and external note-taking systems. On the surface, this looks like a grab bag of “prompt engineering.” Under the substrate lens, it looks like something much more coherent. It looks like a distributed attempt to create an iterative working medium that the model can carry forward.

Chain-of-thought and scratchpads are the clearest example. When a human solves a multi-step problem, the intermediate variables usually live somewhere. They might live in working memory, in an internal sketch, or on paper. When we prompt an LLM to “show your work,” we are not merely asking for transparency. We are asking the model to externalize intermediate state into text so that those variables can persist from one step to the next. The model is then able to condition on its own intermediate outputs as it continues. In other words, we are manufacturing a stateful iteration substrate by turning the token record into a scratch space for computation.
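
Seen as code, the scratchpad pattern is just a loop that writes intermediate variables back into the record. In the sketch below, llm is a hypothetical stand-in for a model call; the interesting part is that the state lives in the text, not in the model.

```python
def llm(prompt):
    """Hypothetical stand-in for a model call."""
    return "partial result derived from: " + prompt[-30:]

def solve_with_scratchpad(problem, n_steps=3):
    """Chain-of-thought as substrate manufacture: intermediate variables
    are written into the token record so the next step can condition on
    them. The scratchpad text is the iteration substrate."""
    scratchpad = f"Problem: {problem}\n"
    for i in range(1, n_steps + 1):
        step = llm(scratchpad + f"Step {i}:")
        scratchpad += f"Step {i}: {step}\n"   # persist state as text
    return scratchpad

print(solve_with_scratchpad("route the cables"))
```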

Plans, checklists, and running summaries play a similar role, but they aim at stability rather than explicit calculation. A running summary is a compact set of referents that the system can keep reloading into attention. A checklist is a set of constraints that must remain invariant while details change. A “goal restatement” is an attempt to protect a small core of state variables from being washed away by novelty and distraction. Humans do this too. We write notes to ourselves so that our own cognition does not drift. With LLMs, we do it because the model’s native continuity medium is an external record that is not automatically curated into a stable active set. So we curate it manually.

Tool use and retrieval systems extend the same idea. People add vector databases, “memory” modules, and note stores so that the model can re-access prior content. But there is a trap here. Retrieval by itself is still an external record mechanism. It is a way of reading from a larger archive. It becomes a true cognitive substrate only when there is a mechanism that decides what retrieved content becomes active, what persists, and what is allowed to be evicted. In other words, retrieval is not the workspace. It is an input channel. The missing piece is a curated active set that treats some items as referents that survive across cycles.

Self-consistency and multi-sampling methods are also revealing. When people ask a model to sample multiple solutions and vote, they are doing something analogous to iterative convergence, but in a crude parallel form. Instead of an internal state that refines itself step by step, we run multiple independent trajectories and hope that aggregation yields stability. This can improve reliability, but it also highlights what is missing. We are building robustness through external redundancy because the architecture does not naturally implement a stable internal convergence process under controlled turnover.
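
In code, self-consistency is almost embarrassingly simple, which is part of the point. A sketch with a hypothetical stochastic sampler: robustness comes from redundancy across independent runs, not from any state refined within a single run.

```python
from collections import Counter
import random

def sample_answer(question):
    """Hypothetical stand-in for one stochastic model sample."""
    return random.choice(["42", "42", "42", "41"])  # noisy, biased to truth

def self_consistency(question, k=9):
    """Robustness via external redundancy: run k independent trajectories
    and vote, instead of refining one internal state toward convergence."""
    votes = Counter(sample_answer(question) for _ in range(k))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```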

All of this is why I do not dismiss prompt workarounds as tricks. They are diagnostic. They are telling us what the architecture is not giving us natively. They are attempts to give the model intermediate state variables, protected invariants, and a stable scaffold for progressive construction. In short, they are attempts to add a time-iteration substrate.

Section 4. What an explicit overlap substrate would add

An explicit overlap substrate changes the nature of the computation. It takes us from a regime of repeated recomputation over a record to a regime of stateful iterative updating. The key is that the system is forced to carry a compact working set forward, and to update it incrementally. Some elements persist as referents. Some elements are replaced. New content enters in relation to what persisted, not as a fresh start.

This is the real meaning of “keep, drop, add.” It is not just memory management. It is the minimal machinery required for progressive construction. A system with a curated overlap substrate can hold a plan while revising it, keep a theme while exploring variations, maintain a causal model while adding evidence, and build an internal scene or diagram while editing its parts. Each step is an edit, not a reinvention. That yields a computational trajectory that looks like thought in the way we experience it, but more importantly it looks like algorithm execution. Intermediate variables survive long enough to be transformed.
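
A minimal sketch of one keep/drop/add cycle, with illustrative items and priorities: a fixed fraction of the active set is forced to persist, new content fills the remaining slots, and dropping happens by omission.

```python
def update_workspace(active, candidates, capacity=7, overlap=0.6):
    """One keep/drop/add cycle over a capacity-limited active set.

    At least overlap * capacity items are forced to persist, so each
    step is an edit to the previous state, never a fresh start.
    Items are (name, priority) pairs; the priorities are illustrative.
    """
    n_keep = max(1, int(overlap * capacity))
    keep = sorted(active, key=lambda it: it[1], reverse=True)[:n_keep]  # keep
    kept_names = {name for name, _ in keep}
    new = [c for c in sorted(candidates, key=lambda it: it[1], reverse=True)
           if c[0] not in kept_names][:capacity - n_keep]               # add
    return keep + new                       # drop happens by omission

state = [("goal", 1.0), ("constraint", 0.9), ("detail_a", 0.2)]
state = update_workspace(state, [("evidence_1", 0.8), ("detail_b", 0.3)])
print(state)
```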

Once you make overlap explicit, you get a place to store and protect invariants. That is a concept worth emphasizing. In many domains, the important part of state is not a heap of facts. It is a small set of commitments that must remain stable while other things change. When we solve a problem, we keep track of what is fixed, what is assumed, what must be preserved, and what is allowed to vary. In a curated overlap substrate, these invariants can be assigned higher survival pressure. They can be protected by the persistence policy. That gives you a system that is harder to derail and more capable of long-horizon coherence.

You also get a natural mechanism for revision and error correction. If part of the active set persists, then new candidate content has to reconcile itself with what is already there. When there is a mismatch, that mismatch is informative. It can trigger re-evaluation rather than collapse. In a reboot regime, mismatch often produces oscillation and inconsistency because the system is constantly reconstituting its state from scratch. In an overlap regime, mismatch can be treated as a signal that something needs to be repaired. You can preserve the stable core while repairing the conflicting component. That is what robust systems do in many domains. They do not throw everything away when one component becomes suspect.

A final benefit is that continuity becomes a tunable parameter. The overlap ratio, how much of the active set is forced to persist, becomes a dial that trades stability for flexibility. High overlap yields composure and coherence. Lower overlap yields agility and exploration. This is not just a conceptual dial. It is measurable. You can quantify drift in the active set, recovery after interruption, and stability of commitments across time. If continuity is real, you should be able to measure it. The overlap substrate gives you the knob.

Section 5. Engineering tradeoffs, and why transformers did not do this by default

It is important to be honest about why transformer-based language models became the dominant paradigm. They are simple to train, extremely scalable, and they work with a universal interface: text. The external record substrate is powerful precisely because it is generic. A token sequence can represent anything, and attending over it is a flexible mechanism for conditioning. This makes the architecture broadly applicable, and it makes training and deployment straightforward.

The external record substrate also has a kind of transparency. The model’s “state” is visible as text. You can inspect the prompt, inspect the conversation history, and reason about what information the model has access to. In contrast, an internal curated working set introduces a new object that needs to be designed, supervised, and evaluated. You have to decide what the active items are, how they are represented, how they bind, how they are scored, how they persist, and how they are evicted. That adds complexity, and complexity creates new failure modes.

There is also an optimization reality. Transformer inference is already heavy. Adding a recurrent workspace, map modules, and controlled turnover introduces additional computation and additional training signals. The payoff might be large, but the path is not free. And because the existing approach works well enough for many tasks, engineering organizations tend to keep adding patches and scaffolds rather than revisiting the substrate.

But I do not think these tradeoffs are reasons to avoid an overlap substrate. They are reasons the first generation of widely deployed models did not prioritize it. The moment you start asking for robust long-horizon behavior, progressive construction, stable agency, or reliable recovery after interruption, the limitations of an external-record-first substrate become more salient. At that point, the hybrid approach becomes attractive. You keep the transformer’s strength as an associative engine over rich context, but you add a compact curated time-iteration substrate that makes the system’s trajectory genuinely stateful.

In other words, the question is not whether transformers are good. They are. The question is what they are good at, what substrate they are implicitly relying on, and what class of cognition becomes easier once we treat overlap as a first-class computational primitive rather than something we approximate with prompting rituals.

Section 6. The hybrid design, and what it would look like in practice

If I had to summarize the hybrid in one line, it would be this: let the transformer remain the associative engine and proposal generator, but add a compact curated workspace that is explicitly responsible for time-iteration. The transformer is excellent at generating candidates, retrieving relevant context from a long external record, and integrating heterogeneous information. The workspace is excellent at doing what a long record does not automatically do, which is to maintain a stable set of referents, constraints, and intermediate variables that survive across successive cycles.

In a practical system, the transformer consumes the external record, including the conversation history, tool outputs, retrieved notes, and current sensory input if the system is multimodal. It produces a pool of candidate representations: salient entities, inferred goals, constraints, next actions, hypotheses, and proposed updates to the current plan. That candidate pool is not yet cognition. It is a flood of possible content.

The curated workspace is the selection bottleneck. It maintains a capacity-limited active set, optionally with bindings, and it updates that set using a keep, drop, add rule that enforces overlap. Some items are protected because they function as invariants: the goal of the task, the user’s preferences, hard constraints, safety boundaries, and any long-horizon commitments the system should not abandon casually. Other items are more replaceable: momentary details, local observations, or transient subgoals. New items are admitted by pooled associative pressure from what persisted, plus relevance to the task and novelty considerations. The workspace then broadcasts its active set back to the transformer and to any simulation modules, and the cycle repeats.
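
Put together, one cycle of the loop might look like the following sketch, in which all item names and scores are illustrative: protected items survive regardless of score, overlap is enforced on the rest, and proposals from the transformer compete for the remaining slots.

```python
def workspace_cycle(active, protected, proposals, capacity=7, overlap=0.6):
    """One cycle of the hybrid loop: the transformer proposes, the
    workspace selects under capacity, overlap, and protection constraints.

    active and proposals map item -> salience score; protected items
    (goals, hard constraints) survive regardless of score.
    """
    survivors = {k: v for k, v in active.items() if k in protected}
    n_keep = max(len(survivors), int(overlap * capacity))
    for k, v in sorted(active.items(), key=lambda kv: -kv[1]):
        if len(survivors) >= n_keep:
            break                        # overlap quota met
        survivors.setdefault(k, v)       # keep the strongest of the rest
    for k, v in sorted(proposals.items(), key=lambda kv: -kv[1]):
        if len(survivors) >= capacity:
            break                        # capacity reached; rest is dropped
        survivors.setdefault(k, v)       # admit new content in relation
    return survivors                     # broadcast back; repeat the cycle

active = {"goal": 1.0, "constraint": 0.9, "detail_a": 0.2, "detail_b": 0.1}
proposals = {"evidence_1": 0.8, "subgoal_2": 0.5}
print(workspace_cycle(active, {"goal", "constraint"}, proposals, capacity=5))
```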

If you want to push this beyond language, you add map modules. These are progressive scratch spaces that build internal objects, not just descriptions. A visual latent, a spatial scene graph, a causal model, a plan graph, a code structure, a diagram. The point is that the system has an internal object that can be refined rather than regenerated. The workspace keeps a stable scaffold of constraints that guide the map’s refinement, and the map sends back candidate edits that can be admitted into the workspace. This creates a loop that is closer to how humans build things. We keep a theme, we elaborate detail, we notice inconsistencies, we revise, and we stay within an identity of the object we are constructing.

This hybrid also clarifies the role of retrieval. Retrieval remains an external record mechanism, but it becomes much more powerful when the workspace decides what retrieved items become active and remain active. The system is no longer just a model that can read. It is a model that can hold. And holding is what makes progressive multi-step algorithm execution feel like genuine iteration rather than a string of clever recomputations.

Section 7. How to test whether this is real

If the overlap substrate is doing meaningful work, it should change behavior in ways that are both measurable and intuitively recognizable. The goal is not to prove a philosophical point. The goal is to show that a different substrate produces a different cognitive regime.

The first test is interruption and recovery. Insert distractors, topic shifts, or tool calls that produce large irrelevant output, and measure whether the system returns to its prior thread without having to be reminded. A model that relies primarily on the external record can often recover if the record remains clean and the prompt is well-managed. But under real noise, it can drift. A model with a protected overlap substrate should show better composure, because the core referents and goals are explicitly protected as state variables.

The second test is delayed association and accumulation. Present relevant evidence separated by time and noise, and ask for integration. If the system’s cognition is an edited continuation rather than repeated recomputation, it should do better at accumulating related items into a coherent scaffold. This is where you see the difference between access and active maintenance. The model can always re-access a fact in the record, but the question is whether it keeps the right referents alive long enough for later evidence to bind to them.

The third test is progressive construction. Give the system tasks that require iterative refinement, not just final answers. Planning an itinerary with evolving constraints, designing a multi-part argument, building a complex specification, or drafting a diagram-like description that must remain consistent while being elaborated. Then you evaluate not only the final product, but the trajectory. Does the system actually build on what it already built, or does it repeatedly generate new versions that only superficially resemble revisions?

A fourth test is continuity measurement itself. Because the active set is explicit, you can quantify drift. You can define an overlap ratio between successive steps and compute a continuity half-life under different task conditions. You can then correlate those metrics with performance and with subjective impressions of stability. In other words, you can operationalize continuity. If it cannot be measured, it is not yet engineering.
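
Both metrics are a few lines each. A sketch, using sets of item identifiers as the active-set representation and an illustrative trace:

```python
def overlap_ratio(prev, curr):
    """Fraction of the previous active set that survived into this step."""
    prev, curr = set(prev), set(curr)
    return len(prev & curr) / len(prev) if prev else 1.0

def continuity_half_life(trace):
    """Steps until fewer than half of the step-0 items remain active.

    trace is a list of active sets over time; returns None if the
    initial content never decays below half within the trace.
    """
    initial = set(trace[0])
    for t, active in enumerate(trace):
        if len(initial & set(active)) < len(initial) / 2:
            return t
    return None

trace = [{"goal", "a", "b", "c"}, {"goal", "a", "b", "d"},
         {"goal", "a", "e", "f"}, {"goal", "g", "h", "i"}]
print([overlap_ratio(trace[i], trace[i + 1]) for i in range(len(trace) - 1)])
print(continuity_half_life(trace))
```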

Finally, the ablation tests are essential. Turn off overlap enforcement. Turn off bindings. Remove map modules. Sweep the overlap ratio. A real substrate should yield systematic tradeoffs. High overlap should increase stability but reduce flexibility. Low overlap should increase exploration but risk fragmentation. Removing bindings should create a distinctive failure mode where the system retains pieces but loses structure. Removing overlap should increase hard cuts and reduce recovery. These are falsifiable predictions, and they are exactly what makes the proposal more than a metaphor.

Section 8. Why this matters, and where it points

I do not think the next phase of AI progress is only about larger models and larger context windows. Those help, but they mostly strengthen the external record substrate. They make rereading more powerful. They do not necessarily create a compact, curated, time-persistent working state that is updated by controlled turnover. The current ecology of prompting, scratchpads, planning rituals, memory tools, and retrieval systems is already telling us what people want. They want models that can keep a thread, preserve commitments, build objects progressively, and recover from distraction. Those are substrate-level properties.

The deeper point is that continuity is not just what thought feels like. It is what stateful iteration looks like from the inside. A system that can execute learned algorithms across time needs intermediate variables that persist. It needs a work surface. It needs a mechanism that preserves a scaffold while allowing controlled edits. Overlap is a minimal way to get that. It creates a trajectory rather than a series of re-derivations. It turns computation into progressive construction.

Transformers are already a triumph of associative computation. They can retrieve, integrate, and generate at a level that still surprises people. The question is what happens when we stop treating the token history as the only continuity medium and start treating overlap as an explicit computational primitive. My prediction is that you get systems that are not merely coherent in output, but coherent in trajectory. You get models that do not simply answer, but build. And you get a clearer bridge between modern deep learning and the kind of iterative, stateful cognition that humans use when they plan, design, imagine, and reason over long horizons.

That is the research program as I see it. Identify the substrate that makes iteration possible, implement it explicitly, measure it, and then ask what new capabilities become natural when the system’s internal life is a stream of edited continuations rather than a repeated reconstitution from a record.
