## TL;DR

“LLMs have no memory” isn’t an oversight; it’s the equilibrium of four compounding constraints: O(n²) attention + KV-cache VRAM + catastrophic forgetting + GDPR compliance. Every “Memory” feature in ChatGPT / Claude / Cursor works the same way: inject structured text back into the system prompt. Weights never change. Prompt Caching is a performance optimization, not memory. The mainstream pattern for the next 1–3 years is a stateless LLM core plus a stateful Agent memory layer.
| Attention complexity | Hardware at 100M-token ctx | Cached-token price | Typical cache TTL |
|---|---|---|---|
| O(n²) | 638× H100 (KV cache alone) | 0.1× of uncached input | 5 min–24 h |
## 1. Why LLMs Are Stateless

Four independent constraints, each manageable on its own, together leave “stateless” as the only viable engineering solution. This conclusion is cross-validated across 67 primary sources.
### Architecture: O(n²) Attention

Self-attention scales at O(n²) in sequence length. A single 4096-token session needs 2 GB of VRAM for its KV cache; 32 concurrent sessions hit 64 GB, more than the model weights themselves. Llama 3.1 at a 100M-token context would require 638 H100 GPUs ($5,400/hour) for the KV cache alone.
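The VRAM figures above can be reproduced from first principles. A hedged back-of-envelope sketch, assuming a Llama-7B-class configuration (32 layers, 32 KV heads, head dim 128, fp16, no grouped-query attention):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed Llama-7B-class config, fp16
per_session = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(per_session / 2**30)        # 2.0 GiB for one 4096-token session
print(32 * per_session / 2**30)   # 64.0 GiB for 32 concurrent sessions
```

Grouped-query attention and quantized caches shrink the constant, but growth stays linear in context length per session, which is what makes the 100M-token case blow up.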
→ Liu et al., “Lost in the Middle” (TACL 2024): long contexts aren’t just slower; recall follows a U-shaped curve over document position, with mid-context facts sometimes retrieved worse than closed-book.
### Training: Catastrophic Forgetting
LLM knowledge is entangled across billions of weights. No isolated “French module” or “user preference register” exists. Every fine-tune reshapes the entire parameter landscape. Even LoRA suffers from catastrophic forgetting in continual learning scenarios (arXiv 2404.16789).
→ Industry standard: offline retraining at weekly/daily cadence. No one does per-request weight updates.
### Compliance: Right to Be Forgotten

GDPR Article 17 and PDPA require data controllers to delete personal data “without undue delay.” Once personal data is baked into billions of weights, the right to be forgotten becomes nearly impossible to execute: you can’t “subtract” a user from the model. Both Anthropic and OpenAI explicitly state that Memory data lives externally, not in weights. This is a legal constraint, not a technical preference.
→ RAG / Memory Layer beats fine-tuning because of compliance, not technical superiority.
### Security: Persistent Memory = Persistent Attack Surface

ChatGPT Memory has been breached via prompt injection through Google Docs, images, and web pages: attackers invoke to=bio to write malicious persistent instructions that affect all future conversations (Embrace The Red, 2024). This is precisely why Cursor added mandatory user approval between 1.0 and 1.2, and why Anthropic tested for sycophancy and harmful conversations before releasing Memory.
## 2. Product Landscape: Cache vs Memory vs True Memory

Thirteen products, zero weight modifications. This section also disentangles three commonly conflated concepts:
- Cache (KV/Prompt Caching): Caches K,V projection tensors; prefix byte-level match → skip prefill. 5min–24h lifetime. Compute optimization, not “remembering.”
- Memory (Product Layer): Text in external databases/vector stores/markdown, injected into system prompt on each call. User-controlled.
- True Model Memory (In-Weights): Changing weights themselves. Hit by catastrophic forgetting + GDPR + interpretability.
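The middle pattern is simple enough to sketch in a few lines. This is an illustrative toy, not any vendor's implementation; the `MemoryStore` name and `<user_memory>` tag are made up:

```python
class MemoryStore:
    """External, user-visible store; the model weights never change."""
    def __init__(self) -> None:
        self.facts: list[str] = []

    def remember(self, fact: str) -> None:
        self.facts.append(fact)

    def forget(self, fact: str) -> None:
        # Deletion is a list operation, not a weight update, which is
        # exactly why this pattern survives GDPR Article 17.
        self.facts.remove(fact)

def build_system_prompt(base: str, store: MemoryStore) -> str:
    """Re-inject stored facts into the system prompt on every call."""
    if not store.facts:
        return base
    block = "\n".join(f"- {f}" for f in store.facts)
    return f"{base}\n\n<user_memory>\n{block}\n</user_memory>"

store = MemoryStore()
store.remember("Prefers TypeScript over JavaScript")
prompt = build_system_prompt("You are a helpful assistant.", store)
```

Every product in the table below is a more elaborate version of this loop: what varies is how facts are written (AI-generated vs human-curated) and how they are retrieved, not where they live.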
### Comparison Table
| Product | Strategy | Type | Weight Δ? |
|---|---|---|---|
| ChatGPT Memory | 4-layer: metadata + bio + ~40 summaries + window | Memory | No |
| OpenAI Prompt Caching | ≥1024 tokens auto KV cache, 5min–24h TTL | Cache | No |
| Anthropic Prompt Caching | Explicit cache_control ≤4 breakpoints, byte-level match | Cache | No |
| Gemini Context Caching | Implicit 90% discount + Explicit 60min TTL | Cache | No |
| Claude.ai Projects | Instructions + files + history, full prompt injection | Memory | No |
| Claude Memory (2025-10) | Project-isolated, 24h synthesis, editable | Memory | No |
| Claude Code | CLAUDE.md + model-written MEMORY.md (200 lines) | Memory | No |
| Cursor Rules / AGENTS.md | Static markdown, 4 trigger modes, Team > Project > User | Memory | No |
| Cursor Memories (1.0+) | AI generates candidates → user approves → writes | Memory | No |
| Cursor Codebase Index | Merkle tree + encryption + Turbopuffer vector DB | RAG | No |
| Windsurf Cascade | global + workspace rules + auto Memories + RAG | Memory | No |
| Devin Knowledge | Human-written + AI suggestions + DeepWiki + VM Snapshots | Memory+RAG | No |
| Replit Checkpoints | VM snapshot = files + DB + chat + Agent memory | Snapshot | No |
The Type column distinguishes Cache / RAG / Snapshot entries from Memory. No product modifies weights.
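To make the Cache rows concrete, here is a hedged sketch of an Anthropic-style request body with an explicit cache breakpoint (shape follows their Prompt Caching docs; the model id is a placeholder, and only payload construction is shown, no network call):

```python
def cached_request(project_rules: str, user_input: str) -> dict:
    """Static prefix gets a cache_control breakpoint; the volatile
    user turn comes after it, so the prefix bytes match across calls."""
    return {
        "model": "claude-example",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": project_rules,                   # long, byte-stable
                "cache_control": {"type": "ephemeral"},  # one of ≤4 breakpoints
            }
        ],
        "messages": [{"role": "user", "content": user_input}],
    }

req = cached_request("Stable project conventions...", "today's question")
```

Note what is cached: the K,V tensors for the marked prefix, not any fact about the user. Change one byte of the prefix and the cache misses entirely, which is why the Memory text injected by the products above is usually placed after the stable prefix.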
## 3. The Four-Layer Future Stack

Bottom-up: the base layer stays stateless forever; the three layers above it are different abstractions for “giving it memory.” L4 is the short-term mainstream; L2 is the highest-value research leap.
### L4 · Agent Memory Layer

Most mature. Treats the LLM as a stateless CPU; memory lives in external databases plus the Agent runtime. Representatives: Letta (MemGPT) · Mem0 · Zep + Graphiti · LangGraph Store · AutoGen Memory.
- ✅ Auditable · Deletable · Model-agnostic
- ⚠️ Retrieval quality ceiling · Write contamination accumulates
- Mem0 scores 26% above OpenAI Memory on LoCoMo; 91% lower p95 latency; 90% fewer tokens
### L3 · Ultra-Long Context

Commercialized. Stuffs memory into ultra-long context windows. Representatives: Gemini 2M (>99% needle recall) · Magic LTM-2-Mini 100M tokens.
- ✅ Best in-session carrier
- ⚠️ Lost-in-the-middle unsolved · 100M ctx single user = 638×H100
L3 and L4 are complementary, not competitive: ultra-long context handles within-session associations; Agent memory layer handles cross-session / cross-year persistence. Combining both is the current engineering optimum.
### L2 · In-Architecture Memory

Highest research value. Embeds “persistent memory” as a differentiable module inside the network, potentially the real paradigm shift. Representatives: Google Titans · Infini-attention · Mamba-2 · RWKV-7 Goose.
- ✅ Constant VRAM · Linear time
- ⚠️ Not yet validated at scale (needs ≥70B params / ≥10T tokens)
### L1 · Bare LLM (frozen weights)

Forever stateless. The GPT / Claude / Gemini / Llama core. Each inference is a fresh process. Continual learning won’t become a per-user memory path in the short term; LoRA is for domain/role specialization, not per-user memory.
## 4. Memory Economics: Why Cache TTL Is a Hidden Pricing Dial
This is the most underappreciated thread in the entire landscape.
In March 2026, Anthropic silently dropped its cache TTL from 1 h to 5 min, causing Claude Code users to pay 17–26% more. No announcement. No SLA commitment. This exposed a brutal truth: cache TTL directly impacts per-user cost yet appears in zero SLAs.
| Metric | Value |
|---|---|
| Cost increase after Anthropic TTL change | 17–26% |
| Cache cost transparency | 0% (fully hidden) |
| 100M ctx hardware cost (single user) | ~$5.4k/hr |
| SLA commitments on cache TTL | 0 |
Extrapolate this logic and future “memory economics” increasingly resembles cloud storage: tiered (5 min / 1 h / 24 h / permanent), pricable (micro-adjusting TTL amounts to reverse-pricing by traffic), and sticky (migration costs lock users in once agent workflows depend on specific cache strategies).
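The TTL-as-pricing-dial claim can be made concrete with a toy cost model (the numbers are illustrative, not Anthropic's): cached input tokens bill at a steep discount, so the effective price per request is a direct function of the cache hit rate, and a shorter TTL means more expirations and a lower hit rate for bursty agent traffic.

```python
def avg_input_cost(tokens: int, price_per_token: float,
                   hit_rate: float, cached_discount: float = 0.1) -> float:
    """Blended input-token cost: cache hits bill at the discounted rate."""
    hit = tokens * hit_rate * price_per_token * cached_discount
    miss = tokens * (1 - hit_rate) * price_per_token
    return hit + miss

# Same workload; only the hit rate moves when TTL drops (hypothetical rates)
long_ttl = avg_input_cost(100_000, 3e-6, hit_rate=0.9)   # e.g. 1 h TTL
short_ttl = avg_input_cost(100_000, 3e-6, hit_rate=0.6)  # e.g. 5 min TTL
```

Even this crude model shows how a silent TTL change, with no visible price change, lands directly on users' bills.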
## 5. Three-Year Paradigm Roadmap
Based on Anthropic, Letta, Karpathy, LeCun sources. 2026 has high confidence; 2027–2028 are inferential with explicit uncertainty.
| Year | Mainstream | Potential Dark Horse |
|---|---|---|
| 2026 | Bare LLM + Agent Memory (Mem0/Zep/Letta) + long-context caching | Titans-style architectures begin small-scale commercial use; Sleep-time Compute becomes agent standard |
| 2027 | Reflection / Sleep-time / TTT enter mainstream Agent framework primitives | A 7B SSM/Hybrid surpasses Transformer on long-context benchmarks |
| 2028 | Top models may integrate in-arch memory (high-risk prediction); otherwise Memory Layer remains standard | LeCun H-JEPA + LLM hybrid prototype (early signal for 5–10 year bet) |
## 6. Nine Practical Takeaways

1. Never conflate Cache and Memory: Cache skips prefill; Memory decides what goes into the prompt. They are orthogonal.
2. Writing memory = writing system prompt: any convention expressible in markdown (Cursor Rules / CLAUDE.md / AGENTS.md) beats “letting the AI remember” because it is diffable, version-controlled, and deterministic.
3. Prefix order, static → dynamic: tool definitions, system prompt, and project rules first; user input last. Top-level advice in the OpenAI, Anthropic, and Google docs.
4. Compaction must be cache-safe: don’t open a new system prompt for summarization, which forces a full uncached recomputation. Claude Code calls this “cache-safe forking.”
5. TTL is a product decision: the Anthropic 1h→5min incident proves it. Expose TTL as user-configurable, or users will find your hidden pricing in their bills.
6. AI writes, human approves is the steadiest auto-Memory: Cursor 1.2’s user approval and Devin’s suggestion-only flow are the post-prompt-injection consensus.
7. Visible, editable, exportable = trust: Anthropic’s natural-language synthesis and ChatGPT’s opaque synthesis are two sides of the same coin.
8. Privacy mode conflicts with Cache: OpenAI’s extended cache loses ZDR; Cursor privacy mode stores no plaintext. Offer “performance vs. privacy” as two explicit modes.
9. The real moat is “context engineering,” not “memory models”: deterministic, version-controlled, human-readable state. Curation cost is paid once; the benefit compounds.
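The “prefix order: static → dynamic” takeaway in code form; a minimal sketch with illustrative segment names, ordered so the byte-stable prefix survives across requests:

```python
def assemble_prompt(tool_defs: str, system: str, project_rules: str,
                    memory: str, user_input: str) -> str:
    """Concatenate prompt segments from most static to most volatile."""
    segments = [
        tool_defs,      # changes only on deploy
        system,
        project_rules,  # e.g. CLAUDE.md / AGENTS.md contents
        memory,         # changes occasionally
        user_input,     # changes every request: keep it last
    ]
    return "\n\n".join(segments)

prompt = assemble_prompt("TOOLS", "SYSTEM", "RULES", "MEMORY", "Q")
```

Because caches match on a byte-level prefix, editing any early segment invalidates everything cached after it; putting the user turn last means only the cheap tail is ever recomputed.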
## 7. Key References
All primary sources from 2024–2026. 30+ curated entries covering vendor docs, arXiv papers, and researcher essays.
### A. Vendor Sources

OpenAI: Prompt Caching guide · Caching 201 cookbook · Manthan Gupta, “Reverse Engineered ChatGPT Memory” · Embrace The Red, “Hacking Memories”
Anthropic: Prompt Caching docs · Lessons from Claude Code · Claude Code Memory · How Claude’s memory works
Google: Gemini Context Caching · Vertex AI caching overview
Cursor / Windsurf / Devin / Replit: Cursor Rules · Codebase Indexing · Cursor 1.0 + 1.2 changelogs · Windsurf Memories · Devin Knowledge · Replit Checkpoints
### B. Key Papers
Architecture: Lost in the Middle · Gemini 1.5 · Magic LTM-2-Mini · Titans · Infini-attention · Mamba-2 · RWKV-7 · KV-Direct
Memory Layer: MemGPT · Mem0 · Zep + Graphiti · A-Mem · Generative Agents · Sleep-time Compute
Continual Learning: CL Survey · TTT (ICML 2025) · Memory Taxonomy
### C. Researchers (Karpathy / LeCun / Raschka)
- Karpathy · Dwarkesh Patel Interview (2025-10) · Intro to LLMs
- LeCun · Path Towards AMI · NVIDIA GTC 2025
- Raschka · Coding the KV Cache
### D. Frameworks
Research method: Three parallel sub-agents (technical principles + product API design + future paradigms), cross-validated across four sources (Exa, Tavily, Context7, WebSearch). 67 primary URLs, 2024-Q1 to 2026-Q2.