
Why LLMs Have No Memory — A Cross-Validated Research Report with 67 Primary Sources

· 1711 words · 9 minutes
Liu ZhuoQi
Personal blog of AI Agent developer Liu ZhuoQi. Sharing practical notes on AI Agent development, tool engineering, and creative programming.

TL;DR

“LLMs have no memory” isn’t an oversight — it’s the equilibrium of four compounding constraints: O(n²) attention + KV cache VRAM + catastrophic forgetting + GDPR compliance. Every “Memory” feature from ChatGPT / Claude / Cursor works the same way: inject structured text back into the system prompt. Weights never change. Prompt Caching is performance optimization, not memory. The mainstream for the next 1–3 years is “stateless LLM core + stateful Agent memory layer”.

| Complexity | 100M ctx Cost | Cache Price | Common TTL |
|---|---|---|---|
| O(n²) | 638× H100 | 0.1× | 5min–24h |

1. Why LLMs Are Stateless

Four independent constraints — individually manageable, together they leave “stateless” as the only viable engineering solution. This conclusion is cross-validated across 67 primary sources.

Architecture: O(n²) Attention

Self-attention scales at O(n²). A single 4096-token sequence needs 2 GB VRAM for KV cache; 32 concurrent sessions hit 64 GB — more than the model weights themselves. Llama 3.1 at 100M context requires 638 H100 GPUs ($5,400/hour) for KV cache alone.
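
To see where these numbers come from, here is a back-of-the-envelope sizing sketch. The hyperparameters (32 layers, 32 full-attention KV heads, 128-dim heads, fp16) are an assumption for a 7B-class model, not figures from the source.

```python
# KV cache sizing sketch; assumes a 7B-class model with full multi-head
# attention and fp16 K/V entries. Adjust the constants for your actual model.

def kv_cache_bytes(tokens: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Each token stores one K and one V vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens

per_session = kv_cache_bytes(4096)   # a single 4096-token session
fleet = 32 * per_session             # 32 concurrent sessions
print(f"{per_session / 2**30:.1f} GiB per session, {fleet / 2**30:.0f} GiB for 32 sessions")
# -> 2.0 GiB per session, 64 GiB for 32 sessions: the cache grows linearly with
#    tokens and with concurrency, while attention compute grows as O(n^2).
```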

→ Liu et al. “Lost in the Middle” (TACL 2024): long contexts aren’t just slower — middle-section recall follows a U-shaped curve, worse than closed-book.

Training: Catastrophic Forgetting

LLM knowledge is entangled across billions of weights. No isolated “French module” or “user preference register” exists. Every fine-tune reshapes the entire parameter landscape. Even LoRA suffers from catastrophic forgetting in continual learning scenarios (arXiv 2404.16789).

→ Industry standard: offline retraining at weekly/daily cadence. No one does per-request weight updates.

Compliance: Right to Be Forgotten

GDPR Article 17 and PDPA require data controllers to delete personal data “without undue delay.” Once baked into billions of weights, the right to be forgotten becomes nearly impossible to execute — you can’t “subtract” a user from the model. Both Anthropic and OpenAI explicitly state Memory data lives externally, not in weights. This is a legal constraint, not a technical preference.

→ RAG / Memory Layer beats fine-tuning because of compliance, not technical superiority.

Security: Persistent Memory = Persistent Attack Surface

ChatGPT Memory has been breached via prompt injection through Google Docs, images, and web pages — attackers invoke to=bio to write malicious persistent instructions affecting all future conversations (Embrace The Red, 2024). This is precisely why Cursor 1.0→1.2 added mandatory user approval, and why Anthropic tested sycophancy/harmful conversation before releasing Memory.

Karpathy’s canonical analogy: Weights = ROM (static, burned in at training); context window = RAM (directly addressable during inference); KV cache = working memory (formed at test-time); external vector / KG store = disk (persistent, requires retrieval). “Knowledge in the weights is a hazy recollection of training-time internet documents; content in the context window is directly accessible” — Andrej Karpathy, Dwarkesh Patel Interview (2025-10).

2. Product Landscape: Cache vs Memory vs True Memory

14 products, zero weight modifications. This section also disentangles three commonly conflated concepts:

  • Cache (KV/Prompt Caching): Caches K,V projection tensors; prefix byte-level match → skip prefill. 5min–24h lifetime. Compute optimization, not “remembering.”
  • Memory (Product Layer): Text in external databases/vector stores/markdown, injected into the system prompt on each call. User-controlled (see the sketch after this list).
  • True Model Memory (In-Weights): Changing weights themselves. Hit by catastrophic forgetting + GDPR + interpretability.
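
As referenced above, the Memory bullet boils down to prompt assembly. A minimal sketch, assuming a generic chat-completions message format; the store and field names are illustrative, not any vendor’s actual schema.

```python
# Product-layer "Memory": plain text stored outside the model, re-injected into
# the system prompt on every call. The store below is a stand-in, not a real API.

MEMORY_STORE = [
    "User prefers concise answers.",
    "User's main project targets Python 3.12.",
]

def build_messages(user_input: str) -> list[dict]:
    system = (
        "You are a helpful assistant.\n\n"
        "# Things you remember about this user\n"
        + "\n".join(f"- {m}" for m in MEMORY_STORE)
    )
    return [
        {"role": "system", "content": system},    # memory travels as prompt text
        {"role": "user", "content": user_input},  # model weights are never touched
    ]

print(build_messages("Draft the release notes.")[0]["content"])
```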

Comparison Table

| Product | Strategy | Type | Weight Δ? |
|---|---|---|---|
| ChatGPT Memory | 4-layer: metadata + bio + ~40 summaries + window | **Memory** | No |
| OpenAI Prompt Caching | ≥1024 tokens auto KV cache, 5min–24h TTL | *Cache* | No |
| Anthropic Prompt Caching | Explicit cache_control, ≤4 breakpoints, byte-level match | *Cache* | No |
| Gemini Context Caching | Implicit 90% discount + Explicit 60min TTL | *Cache* | No |
| Claude.ai Projects | Instructions + files + history, full prompt injection | **Memory** | No |
| Claude Memory (2025-10) | Project-isolated, 24h synthesis, editable | **Memory** | No |
| Claude Code | CLAUDE.md + model-written MEMORY.md (200 lines) | **Memory** | No |
| Cursor Rules / AGENTS.md | Static markdown, 4 trigger modes, Team > Project > User | **Memory** | No |
| Cursor Memories (1.0+) | AI generates candidates → user approves → writes | **Memory** | No |
| Cursor Codebase Index | Merkle tree + encryption + Turbopuffer vector DB | *RAG* | No |
| Windsurf Cascade | global + workspace rules + auto Memories + RAG | **Memory** | No |
| Devin Knowledge | Human-written + AI suggestions + DeepWiki + VM Snapshots | **Memory** + *RAG* | No |
| Replit Checkpoints | VM snapshot = files + DB + chat + Agent memory | *Snapshot* | No |

Italic = Cache/RAG/Snapshot; Bold = Memory. No product modifies weights.
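
To make the Cache rows concrete, the sketch below shows roughly how an explicit cache breakpoint is set with Anthropic’s Python SDK; the model id and rule text are placeholders, and the exact parameters should be checked against the current prompt-caching docs.

```python
# Explicit prompt caching (sketch): a cache_control breakpoint marks the end of
# the stable prefix so later calls with a byte-identical prefix can skip prefill.
import anthropic

LONG_PROJECT_RULES = "..."  # imagine thousands of tokens of stable project rules

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_PROJECT_RULES,              # large, rarely-changing prefix
            "cache_control": {"type": "ephemeral"},  # cache breakpoint (<=4 allowed)
        }
    ],
    messages=[{"role": "user", "content": "Review the diff in src/auth.py"}],
)
```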

Key reverse-engineering evidence: across three experiments, Manthan Gupta confirmed that ChatGPT Memory does not use RAG; ask ChatGPT about a specific topic discussed a year ago and it has no idea. It stores only session metadata, a few dozen bio entries, user-message summaries of the last ~40 chats (not ChatGPT’s own replies), and the current sliding window. Cursor’s official docs put it even more bluntly: “Large language models don’t retain memory between completions. Rules provide persistent, reusable context at the prompt level.”

3. The Four-Layer Future Stack

Read bottom-up: the base layer stays stateless forever, and the three layers above it are different abstractions for giving it memory. L4 is the short-term mainstream; L2 is the highest-value research leap.

L4 · Agent Memory Layer

Most Mature

Treats the LLM as a stateless CPU; memory lives in external databases + Agent runtime. Representatives: Letta (MemGPT) · Mem0 · Zep + Graphiti · LangGraph Store · AutoGen Memory.

  • ✅ Auditable · Deletable · Model-agnostic
  • ⚠️ Retrieval quality ceiling · Write contamination accumulates
  • Mem0 scores 26% above OpenAI Memory on LoCoMo; 91% lower p95 latency; 90% fewer tokens
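
A minimal, self-contained sketch of this loop: retrieve, inject, call, write back. The in-memory store, keyword retrieval, and llm() stub are illustrative stand-ins, not the actual Letta / Mem0 / Zep APIs.

```python
# "Stateless LLM core + stateful Agent memory layer" in ~25 lines.
from collections import defaultdict

MEMORIES: dict[str, list[str]] = defaultdict(list)   # external, per-user, deletable

def retrieve(user_id: str, query: str, top_k: int = 3) -> list[str]:
    # Toy relevance: rank stored facts by word overlap with the query.
    q = set(query.lower().split())
    ranked = sorted(MEMORIES[user_id], key=lambda m: -len(q & set(m.lower().split())))
    return ranked[:top_k]

def llm(system: str, user: str) -> str:
    # Stand-in for a real chat-completion call; the model itself stays stateless.
    return f"[reply conditioned on {len(system)} chars of injected memory]"

def answer(user_id: str, user_input: str) -> str:
    facts = retrieve(user_id, user_input)                       # 1. retrieve
    system = "Known facts about this user:\n" + "\n".join(f"- {f}" for f in facts)
    reply = llm(system, user_input)                             # 2. inject + call
    MEMORIES[user_id].append(user_input)                        # 3. write back (naive)
    return reply

MEMORIES["u1"].append("Prefers TypeScript over Python for services.")
print(answer("u1", "Which language should the new service use?"))
```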

L3 · Ultra-Long Context

Commercialized

Stuffs memory into ultra-long context windows. Representatives: Gemini 2M (>99% needle recall) · Magic LTM-2-Mini 100M tokens.

  • ✅ Best in-session carrier
  • ⚠️ Lost-in-the-middle unsolved · 100M ctx single user = 638×H100

L3 and L4 are complementary, not competitive: ultra-long context handles within-session associations; Agent memory layer handles cross-session / cross-year persistence. Combining both is the current engineering optimum.

L2 · In-Architecture Memory

Highest Research Value

Embeds “persistent memory” as a differentiable module in the network — potentially the real paradigm shift. Representatives: Google Titans · Infini-attention · Mamba-2 · RWKV-7 Goose.

  • ✅ Constant VRAM · Linear time
  • ⚠️ Not yet validated at scale (needs ≥70B params / ≥10T tokens)
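
For intuition, here is a toy version of the underlying idea (in the spirit of linear-attention / Infini-attention style compressive memory): the state is a fixed-size matrix updated once per token, so memory stays constant and time stays linear in sequence length. This is a didactic sketch, not any of the cited papers’ actual layers.

```python
# Constant-size recurrent memory: write with a rank-1 update, read with a
# matrix-vector product. Cost per token is O(d^2), independent of history length.
import numpy as np

d = 64                         # head dimension
M = np.zeros((d, d))           # memory matrix: never grows with sequence length
z = np.zeros(d)                # normalization accumulator

def phi(x):
    return np.maximum(x, 0.0) + 1e-6   # simple positive feature map

def step(q, k, v):
    global M, z
    M += np.outer(phi(k), v)            # write this token into the fixed-size state
    z += phi(k)
    return phi(q) @ M / (phi(q) @ z)    # read: depends only on d, not on tokens seen

rng = np.random.default_rng(0)
for _ in range(10_000):                 # 10k tokens later, the state is still d x d
    q, k, v = rng.normal(size=(3, d))
    out = step(q, k, v)
print(M.shape, out.shape)               # (64, 64) (64,)
```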

L1 · Bare LLM (frozen weights)

Forever Stateless

GPT / Claude / Gemini / Llama core. Each inference is a fresh process. Continual learning won’t become a per-user memory path in the short term; LoRA serves domain/role specialization, not per-user memory.


4. Memory Economics: Why Cache TTL Is a Hidden Pricing Dial

This is the most underappreciated thread in the entire landscape.

In 2026-03, Anthropic silently dropped cache TTL from 1h to 5min, causing Claude Code users to pay 17–26% more. No announcement. No SLA commitment. This exposed a brutal truth: cache TTL directly impacts per-user cost but appears on zero SLAs.

| Metric | Value |
|---|---|
| Cost increase after Anthropic TTL change | 17–26% |
| Cache cost transparency | 0% (fully hidden) |
| 100M ctx hardware cost (single user) | ~$5.4k/hr |
| SLA commitments on cache TTL | 0 |

Extrapolate this logic and “memory economics” starts to resemble cloud storage: tiered (5min/1h/24h/permanent), priceable (micro-adjusting TTL amounts to reverse-pricing by traffic), and sticky (once agent workflows depend on specific cache strategies, migration costs create lock-in).
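
A toy calculation of the mechanism; the hit rates are assumptions chosen for illustration, and the 0.1× cached-read multiplier is the one from the summary table at the top of this post, not a vendor quote.

```python
# Why a shorter TTL silently raises bills: once idle gaps between agent turns
# exceed the TTL, the long static prefix is re-billed at full prefill price
# instead of the ~0.1x cached-read price.

PREFIX_TOKENS = 50_000     # tools + system prompt + project rules (assumed)
CACHE_READ_MULT = 0.1      # cached-prefix price relative to uncached input

def prefix_cost_per_turn(hit_rate: float) -> float:
    # Expected billed input-token units for the static prefix on one turn.
    return PREFIX_TOKENS * (hit_rate * CACHE_READ_MULT + (1 - hit_rate) * 1.0)

before = prefix_cost_per_turn(0.95)   # 1h TTL: nearly every turn hits the cache
after = prefix_cost_per_turn(0.92)    # 5min TTL: a few more idle gaps now miss
print(f"+{(after / before - 1) * 100:.0f}% prefix cost")  # ~ +19%, inside the reported 17-26% band
```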


5. Three-Year Paradigm Roadmap

Based on Anthropic, Letta, Karpathy, LeCun sources. 2026 has high confidence; 2027–2028 are inferential with explicit uncertainty.

| Year | Mainstream | Potential Dark Horse |
|---|---|---|
| 2026 | Bare LLM + Agent Memory (Mem0/Zep/Letta) + long-context caching | Titans-style architectures begin small-scale commercial use; Sleep-time Compute becomes an agent standard |
| 2027 | Reflection / Sleep-time / TTT enter mainstream Agent framework primitives | A 7B SSM/Hybrid surpasses Transformer on long-context benchmarks |
| 2028 | Top models may integrate in-arch memory (high-risk prediction); otherwise the Memory Layer remains standard | LeCun H-JEPA + LLM hybrid prototype (early signal for a 5–10 year bet) |

2028 caveat: In-architecture memory requires ≥70B params and ≥10T token training for validation — currently arXiv-only. The more likely 2028 scenario is coexistence, not replacement.

6. Nine Practical Takeaways

  1. Never conflate Cache and Memory: Cache skips prefill; Memory decides what goes into the prompt. Orthogonal.

  2. Writing memory = writing system prompt: Any convention expressible in markdown (Cursor Rules / CLAUDE.md / AGENTS.md) beats “letting the AI remember” — diffable, version-controlled, deterministic.

  3. Prefix order: static → dynamic: Tool definitions, system prompt, project rules first; user input last. Top-level advice from OpenAI, Anthropic, and Google docs; see the sketch after this list.

  4. Compaction must be cache-safe: Don’t open a new system prompt for summarization; that forces a full uncached recomputation. Claude Code calls this “cache-safe forking.”

  5. TTL is a product decision: The Anthropic 1h→5min incident proves it. Expose TTL as user-configurable, or users will find your hidden pricing in their bills.

  6. AI writes, human approves = steadiest auto-Memory: Cursor 1.2’s user approval + Devin’s suggestion-only flow are the post-prompt-injection consensus.

  7. Visible, editable, exportable = trust: Anthropic’s natural language synthesis vs ChatGPT’s opaque synthesis — two sides of the same coin.

  8. Privacy mode conflicts with Cache: OpenAI Extended cache loses ZDR; Cursor privacy mode stores no plaintext. Offer “performance vs. privacy” as two modes.

  9. The real moat is “context engineering,” not “memory models”: Deterministic, version-controlled, human-readable state. Curation cost is one-time; benefit compounds.
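
A sketch of the prefix ordering in takeaway 3. Message roles and helper names are illustrative; the key property is that the static prefix stays byte-identical across turns so prompt caching can match it, while only the tail changes.

```python
# Cache-friendly prompt assembly: stable content first, volatile content last.

def build_prompt(system_rules: str, tool_defs: str, project_rules: str,
                 history: list[dict], user_input: str) -> list[dict]:
    static_prefix = [{
        "role": "system",
        # Keep ordering and wording stable: any change here invalidates the cached prefix.
        "content": "\n\n".join([system_rules, tool_defs, project_rules]),
    }]
    dynamic_tail = history + [{"role": "user", "content": user_input}]
    return static_prefix + dynamic_tail
```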


7. Key References

All primary sources from 2024–2026. 30+ curated entries covering vendor docs, arXiv papers, and researcher essays.

A. Vendor Sources

OpenAI: Prompt Caching guide · Caching 201 cookbook · Manthan Gupta: Reverse Engineered ChatGPT Memory · Embrace The Red: Hacking Memories

Anthropic: Prompt Caching docs · Lessons from Claude Code · Claude Code Memory · How Claude’s memory works

Google: Gemini Context Caching · Vertex AI caching overview

Cursor / Windsurf / Devin / Replit: Cursor Rules · Codebase Indexing · Cursor 1.0 + 1.2 changelogs · Windsurf Memories · Devin Knowledge · Replit Checkpoints

B. Key Papers

Architecture: Lost in the Middle · Gemini 1.5 · Magic LTM-2-Mini · Titans · Infini-attention · Mamba-2 · RWKV-7 · KV-Direct

Memory Layer: MemGPT · Mem0 · Zep + Graphiti · A-Mem · Generative Agents · Sleep-time Compute

Continual Learning: CL Survey · TTT (ICML 2025) · Memory Taxonomy

C. Researchers (Karpathy / LeCun / Raschka)

D. Frameworks


Research method: Three parallel sub-agents (technical principles + product API design + future paradigms), cross-validated across four sources (Exa, Tavily, Context7, WebSearch). 67 primary URLs, 2024-Q1 to 2026-Q2.
