Every AI Model Loses Its Memory. Here Is What the Labs Are Doing About It.

Every AI model has an attention budget. Fill it up and the model starts forgetting. not dramatically, not all at once, but measurably, consistently, and in ways that matter if you are running AI at scale.

I write for board-level readers and technology executives. If you want the full mathematical treatment of transformer attention complexity, I have linked the source papers throughout. This is the practical version: what the problem is, what the major labs are doing about it, and what it means for anyone deploying AI agents in production.

The Problem Does Not Care About Your Token Limit

There is a distinction the major AI providers rarely advertise clearly. It separates two different failure modes: context overflow and context rot.

Context overflow is the familiar one. The model hits its declared maximum. 200,000 tokens, one million tokens, whatever the spec says. and the API rejects the request or truncates the input. That is a hard wall. You can see it coming.

Context rot is the one that actually costs you. It occurs well before overflow. The Chroma research team published a study in 2025 testing 18 frontier models, including Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, and GPT-4o. Their finding: every single model showed measurable performance degradation as context length increased. Not some models. Every model.

A model with a 200,000 token context window can exhibit significant degradation at 50,000 tokens. The question is not whether degradation occurs. It is how steeply.

Three Ways Models Go Wrong

The Stanford paper “Lost in the Middle” (Liu et al., TACL 2024) was the first rigorous demonstration of something that practitioners had suspected for a while. Language model performance follows a U-shaped curve over context position. Models attend reliably to content near the start and end of a context. Content placed in the middle gets missed. In multi-document question-answering tasks, performance degraded by 30% or more when the relevant information sat in the middle of the input, even in models specifically designed for long contexts.

Imagine you ask a colleague to read a 200-page briefing before a board meeting. They read the executive summary at the front, they remember the recommendations at the back, and everything in the supporting evidence section in the middle becomes a blur. That is roughly analogous to what is happening, though I am massively oversimplifying the mechanics. The point is that position within the context actively affects retrieval quality.

The second mechanism is attention dilution. As context grows, the model’s attention signal for any individual piece of information weakens relative to the noise from everything else. The relevant content is still there. The model simply cannot weight it correctly against the volume of surrounding material.

The third mechanism is the one the Chroma study identified most sharply: distractor interference. Semantically similar but incorrect content actively degrades retrieval accuracy. The distractors that cause the most damage are not nonsense. they are coherent, topically relevant passages. Exactly the kind of content that appears in real-world documents. The same study found that models actually performed better on shuffled documents than on well-organised ones, because coherent narrative structure creates more plausible interference. Structural coherence, it turns out, backfires.

What the Labs Are Actually Doing

Anthropic: Engineering the Boundaries

Anthropic’s published technical focus is not on novel attention architecture but on practical context management. Two contributions are worth understanding directly.

Prompt caching (released public beta August 2024) lets developers cache portions of a prompt between API calls. When the same context prefix is reused across multiple queries. the same system prompt, the same reference document, the same project background. the model skips recomputing attention for those tokens. Reported performance improvements: up to 90% cost reduction for cached tokens, and latency down from 11.5 seconds to 2.4 seconds for a 100,000-token book context. Those numbers are for the cached portion; the benefit is proportional to how large and stable the prefix is.

Contextual Retrieval (Anthropic, 2024) addresses a specific failure in standard Retrieval-Augmented Generation. When a document is split into chunks for retrieval, each chunk loses its surrounding context. A chunk stating “revenue grew by 3%” has no meaning without knowing which company and which period. Anthropic’s approach: before embedding each chunk, prepend a 50-100 token summary explaining what the chunk is about within the broader document. Claude generates these automatically. Combined with BM25 lexical matching, retrieval failure rate drops by 49% compared to standard RAG. With reranking added, the failure rate reduction reaches 67%.

Claude’s extended thinking mode (released February 2025 with Claude 3.7 Sonnet) functions as a context management strategy for complex multi-step tasks. Rather than holding the full reasoning trace in working memory alongside the document context, the model externalises reasoning steps into an explicit thinking budget before producing an answer. One important caveat: Anthropic’s own research paper “Reasoning Models Don’t Always Say What They Think” (2025) found that chain-of-thought steps frequently omit or obscure the actual reasoning. Extended thinking improves measurable task performance. It cannot be treated as a transparent window into model reasoning.

Google DeepMind: The Architecture Play

The most significant published research contribution to long-context LLMs in the 2024-2025 period came from Google DeepMind’s Gemini 1.5 technical report (arXiv:2403.05530, February 2024).

Gemini 1.5 Pro demonstrated greater than 99.7% recall at one million tokens on standard Needle-in-a-Haystack retrieval tasks. a result that represented a genuine step change from what any prior model had achieved at that scale. The architecture uses a sparse Mixture-of-Experts Transformer, described in the paper as building on a much longer history of MoE research at Google. Specific expert counts and routing configurations are not disclosed.

The practical upside of MoE is not solely quality. it is economics. By activating only a subset of parameters per token, MoE models reduce the compute cost per forward pass, making longer context economically tractable at scale. Gemini 2.5 Pro inherits this architecture. The Chroma context rot study tested Gemini 2.5 Pro and still found context-length-dependent performance degradation, consistent with every other model in the study. Near-perfect NIAH recall at one million tokens and measurable degradation in real-world tasks are not contradictory findings. NIAH is a clean, binary retrieval test. Production tasks are not.

OpenAI: The Million-Token Claim

GPT-4.1 (April 2025) expanded OpenAI’s context window from GPT-4o’s 128,000 tokens to one million. OpenAI claims 100% Needle-in-a-Haystack accuracy at all positions up to one million tokens. They do not publish the architectural specifications behind this improvement.

That claim is worth holding at arm’s length. The RULER benchmark (Hsieh, Sun et al., NVIDIA, arXiv:2404.06654, COLM 2024) tested 17 long-context models and found that despite near-perfect performance on vanilla NIAH, almost all models showed large performance drops as context length and task complexity increased. Only approximately half maintained satisfactory performance at 32,000 tokens, despite all claiming context windows of that size or larger. Declared context size and effective context size are not the same number.

The Responses API, which is replacing the deprecated Assistants API (full shutdown August 2026), gives developers more explicit control over context management rather than relying on the managed thread abstraction that the Assistants API provided. Whether that is progress depends on whether you wanted the abstraction.

Meta: The Long Game on Position Encoding

Meta’s contribution to this problem is architectural and has influenced the wider research community.

The Llama family uses Rotary Position Embeddings (RoPE) for positional encoding. Standard RoPE is trained on a fixed context length and generalises poorly beyond it. Meta and the research community have published several extension techniques: Position Interpolation (arXiv:2306.15595), YaRN (ICLR 2024, arXiv:2309.00071), and Microsoft’s LongRoPE (ICML 2024), which extends context to over two million tokens with near-lossless quality at 128,000 tokens for approximately 10-billion-parameter models.

Llama 4 Scout (April 2025) claims a 10 million token context window using iRoPE. an architecture that interleaves standard RoPE layers with NoPE (No Positional Encoding) layers in roughly a 3:1 ratio, allowing local relative position capture and global dependency capture in the same model. Pre-training was conducted at 256,000 tokens; the 10 million token capability is claimed as a length generalisation result.

Important caveat: independent evaluations of the 10 million token claim are limited as of March 2026. The Chroma context rot study did not include Llama 4. The architectural novelty of iRoPE is genuine. The performance claim at 10 million tokens should be treated as pending independent verification.

What the Benchmarks Say About All of These Claims

Three benchmarks matter for understanding the gap between provider claims and real-world performance.

Needle-in-a-Haystack (NIAH). the original test, open-sourced by Greg Kamradt in 2023 (github.com/gkamradt/LLMTest_NeedleInAHaystack). It inserts a specific factual statement into a large block of unrelated text and asks the model to retrieve it. Models can appear to pass NIAH while still failing badly on realistic retrieval tasks.

RULER (NVIDIA, 2024) extends NIAH with 13 task categories including multi-needle retrieval, multi-hop tracing, and question answering requiring multi-document synthesis. Key finding: despite near-perfect NIAH performance, almost all 17 models tested showed large performance drops as context grew. Half failed to maintain satisfactory performance at 32,000 tokens.

Sequential-NIAH (2025) tested six frontier models on extracting multiple ordered items from contexts of 8,000 to 128,000 tokens. The best-performing model achieved 63.5% accuracy. Sequential information retrieval at scale remains an unsolved problem.

What This Means If You Are Running AI Agents at Scale

I run a large multi-agentic AI system, Crown Intelligence AI, that handles research, content production, quality assurance, SEO, design review, and financial tracking for a portfolio of web properties. Context management is not an abstract concern. It is an operational constraint we hit regularly.

The practical conclusion from the research is this: declared context windows are marketing figures. The models become unreliable at roughly 60-70% of their declared maximum. A 200,000-token model starts degrading before you reach 130,000 tokens. We design workflows to stay within that range, or we trigger compression before hitting it.

Five engineering approaches that work today, regardless of which model you are using:

1. Structured external memory. The most reliable context extension available is not a longer context window. it is a well-maintained memory file that survives context compaction. In Crown Intelligence AI, every agent maintains a MEMORY.md file containing durable, task-relevant facts. These persist between sessions and get injected at the start of each invocation. A well-written memory file is worth far more than an extra 50,000 tokens of context.

2. Task decomposition at agent boundaries. Instead of one agent holding 200,000 tokens for a complex research task, ten agents each working with 20,000 tokens on focused sub-problems produce better results at lower cost. Each agent’s context stays well within the reliable performance range. The coordination overhead is real but manageable.

3. RAG over brute-force context injection. Retrieval-Augmented Generation. chunking documents, embedding them, and retrieving only the relevant portions at query time. consistently outperforms loading entire documents into context. Anthropic’s contextual retrieval approach (contextual summaries prepended to each chunk before embedding) reduces failure rates by 49% compared to standard RAG. For any knowledge base larger than roughly 50,000 tokens, this is the appropriate architecture.

4. Active task tracking at the end of context. The Manus engineering team observed that maintaining a running task file updated throughout a long session, and placing it at the end of context rather than the beginning, combats the lost-in-the-middle effect by keeping current objectives in the attention-favoured end position. Simple. Effective. Zero engineering overhead.

5. Prompt caching for stable prefixes. Any workflow where the same system prompt, agent specification, or reference document is used across many invocations should enable prompt caching. The 90% cost reduction and 85% latency reduction figures from Anthropic are for the cached portion. For high-volume deployments, this is the cheapest optimisation available.

What Requires Waiting for the Labs

The lost-in-the-middle effect, distractor interference, and attention dilution at very long contexts are training-time problems. They cannot be fully engineered around at the application layer. Genuine improvements to retrieval quality at 500,000-token contexts require changes to how models are trained, not just how they are called.

Infini-Attention (Google, arXiv:2404.07143, April 2024) attempted a compressive memory mechanism within the standard attention block. A one-billion-parameter fine-tuned model successfully retrieved a passkey from a one-million-token document. Hugging Face published a post-mortem (“A failed experiment: Infini-Attention, and why we should keep trying?”) documenting that real-world performance gains were more limited than the initial paper suggested. The direction is right. The implementation is not yet ready for production.

The appropriate response to model-level limitations: test at the context lengths you will actually use, not at the advertised maximum. Pick models based on empirical performance at your operating range. Monitor the research. Build your engineering to be model-agnostic so you can switch when something meaningfully better ships.

The Board-Level Summary

If you are a director or executive responsible for AI strategy, here is what this means for your organisation.

Every AI tool your teams use is subject to context degradation. The tools do not advertise this prominently. A model rated highly in benchmark results may still struggle when given a long meeting transcript, a complex document archive, or a multi-day conversation history. The more complex and longer-running the task, the more likely the model is operating in a degraded state.

AI vendor claims about context windows should be verified with empirical tests at the context lengths your use cases actually produce. A million-token context window is a maximum, not a guaranteed operating capacity.

And if you are building AI agent systems. or governing teams that are. the research consensus is clear: external memory, task decomposition, and retrieval architectures outperform raw context scaling as reliability strategies. The engineering solutions exist today. They require design discipline to implement. The labs will keep improving the underlying models. In the meantime, the gap between declared capability and operational reality is wider than most deployment teams account for.

The Board AI Governance Framework covers how boards should evaluate AI capability claims, including the gap between benchmark performance and production reliability. For quantum security context on AI infrastructure risk, visit Quantum Security Defence.

Sources: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” TACL 2024. Hsieh et al., “RULER: What’s the Real Context Size of Your Long-Context Language Models?”, arXiv:2404.06654, COLM 2024. Munkhdalai et al., “Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention,” arXiv:2404.07143. Sarthi et al., “RAPTOR,” ICLR 2024, arXiv:2401.18059. Google DeepMind, Gemini 1.5 technical report, arXiv:2403.05530. Chroma, “Context Rot: How Increasing Input Tokens Impacts LLM Performance,” research.trychroma.com, 2025. Greg Kamradt, Needle-in-a-Haystack, github.com/gkamradt/LLMTest_NeedleInAHaystack.

Steve Vaile

Board technology advisor and QSECDEF co-founder. Writes on AI governance, quantum security, and commercial strategy for boards and deep tech founders. Follow him on LinkedIn.