How agents remember things that don't fit in the context window. The context-window problem and why long-context models don't solve it. MemGPT's OS-inspired virtual memory, memory blocks, sleep-time compute for offline consolidation, and eviction/promotion policies.
An agent's only memory is its prompt. Once the conversation, tool results, and reflections grow past the context window, something has to give — and what gives determines whether the agent feels coherent or feels like a goldfish.
Long-context models (200k, 1M, 2M tokens) help but don't solve the problem. Three reasons: (1) cost scales linearly or worse with context length, (2) attention degrades — models perform worse on facts in the middle of long contexts (the 'lost in the middle' phenomenon), (3) some agent state genuinely outlives any single context window (a personal assistant that you've used for two years).
MemGPT (Packer et al., 2023, now productionized as Letta) reframed this: borrow the operating-systems trick of virtual memory. Treat the LLM context window as 'main memory' (small, fast) and an external store as 'disk' (large, slow). Page items in and out as needed. The model itself decides what to evict and what to recall, using tools.
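To make the paging idea concrete, here is a minimal sketch of a two-tier memory. It is not MemGPT's or Letta's actual API: the class name PagedMemory, the methods remember, evict, and recall, and the item-count budget are all hypothetical, chosen only for illustration. In MemGPT the model itself decides what to evict by calling tools; here a simple oldest-first fallback stands in for that decision.

```python
from dataclasses import dataclass, field


@dataclass
class PagedMemory:
    """Toy two-tier memory: a bounded in-context list plus an unbounded archive."""

    max_core_items: int = 3  # hypothetical budget, counted in items rather than tokens
    core: list[str] = field(default_factory=list)     # "main memory": always in the prompt
    archive: list[str] = field(default_factory=list)  # "disk": reachable only through tools

    def remember(self, item: str) -> None:
        """Add an item to core, paging the oldest items out if core overflows."""
        self.core.append(item)
        while len(self.core) > self.max_core_items:
            self.evict(0)

    def evict(self, index: int) -> None:
        """Tool the model could call: move one core item to the archive."""
        self.archive.append(self.core.pop(index))

    def recall(self, query: str) -> list[str]:
        """Tool the model could call: naive keyword search over the archive."""
        words = query.lower().split()
        return [item for item in self.archive if any(w in item.lower() for w in words)]


memory = PagedMemory(max_core_items=2)
for fact in ["user likes tea", "meeting moved to Friday", "project codename is Falcon"]:
    memory.remember(fact)

print(memory.core)           # the two most recent facts stay in context
print(memory.recall("tea"))  # the evicted fact is still reachable by search
```

A real system would count tokens instead of items and would typically search the archive with embeddings rather than keywords, but the shape is the same: a small store that is always visible and a large one that is visible only on request.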
This chapter covers:
Long context buys headroom but doesn't replace memory. Cost, lost-in-the-middle, and persistence all argue for external storage.
How items move (paging) and how they're shaped (blocks)
OS-inspired virtual memory for LLMs — main context + archival, with the model itself paging items in and out.
Typed segments inside core context: system, persona, human, working — each with its own size and update policy.
Offline consolidation when the agent is idle. Memory summarization, reflection generation, embedding updates.
What stays in core, what falls back to archival. The heart of every memory architecture.
Every LLM has a finite context window. As an agent runs, the context fills with system prompt + history + tool results + reflections. When it hits the limit, the agent's options are: (a) fail, (b) drop old content (and lose memory), or (c) pay for a 'memory' subsystem to manage what stays in.
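Options (a) and (b) are simple enough to sketch directly. The example below assumes a crude word-count tokenizer and a hypothetical helper named fit_to_window; neither comes from any particular library. It keeps the newest messages that fit and silently drops the rest, which is exactly the memory loss that option (c) exists to avoid.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(system_prompt: str, history: list[str], window: int) -> list[str]:
    """Option (b): drop the oldest history entries until everything fits."""
    budget = window - count_tokens(system_prompt)
    if budget < 0:
        # Option (a): nothing can be done, so fail loudly.
        raise ValueError("system prompt alone exceeds the context window")
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so the most recent turns survive.
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["my name is Ada", "I live in Paris", "what is the weather today"]
trimmed = fit_to_window("you are a helpful agent", history, window=15)
print(trimmed)  # the oldest message, containing the user's name, no longer fits
```

With a 15-token window, the agent keeps the two most recent messages and forgets the user's name. Nothing in the truncation logic knows that the name mattered more than the weather question.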
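The naive answer to memory is 'just use a 1M-token model.' That fails for three reasons: cost scales linearly or worse with context length, recall degrades for facts in the middle of a long context, and some agent state outlives any single context window.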
Long context buys you headroom. It doesn't replace memory architecture.
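The cost argument can be made concrete with a little arithmetic. The sketch below assumes the agent resends its full history on every turn, that each turn adds a fixed number of tokens, and that there is no prompt caching; the price constant is hypothetical and not taken from any vendor's price list. Under those assumptions the per-turn cost grows linearly, so the cumulative cost grows quadratically with the number of turns.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when every turn resends the full history."""
    # Turn k sends k * tokens_per_turn tokens, so the total is an arithmetic series.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


PRICE_PER_MILLION = 3.00  # hypothetical price, in dollars per million input tokens

for turns in (10, 100, 1000):
    total = cumulative_input_tokens(turns, tokens_per_turn=1_000)
    dollars = total / 1e6 * PRICE_PER_MILLION
    print(f"{turns:>5} turns: {total:>12,} input tokens, ${dollars:,.2f}")
```

At 1,000 tokens per turn, the final turn of the 1,000-turn run fills a 1M-token window exactly, and the run as a whole has billed about 500 million input tokens. Ten times more turns costs roughly a hundred times more, which is why a larger window alone is an expensive substitute for deciding what to keep.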
Three distinct flavors of agent memory are often conflated: working memory, semantic memory, and episodic memory.
MemGPT's architecture maps neatly: working memory is core context; semantic memory lives in a user block always loaded in-context; episodic memory lives in external storage and is retrieved by tool calls.
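The mapping can be sketched as a small data structure. This is an illustration under stated assumptions, not Letta's real schema: the class AgentMemory, its field names, the search_episodes tool, and the sample data are all hypothetical. The point it shows is which stores are rendered into every prompt and which are reachable only through a tool call.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Working memory: recent turns held in core context.
    working: list[str] = field(default_factory=list)
    # Semantic memory: stable facts in a user block that is always loaded in-context.
    user_block: dict[str, str] = field(default_factory=dict)
    # Episodic memory: past events kept in external storage.
    episodes: list[str] = field(default_factory=list)

    def search_episodes(self, query: str) -> list[str]:
        """Hypothetical retrieval tool: naive substring match over past events."""
        return [e for e in self.episodes if query.lower() in e.lower()]

    def build_prompt(self) -> str:
        """Render only the user block and working memory; episodes stay external."""
        facts = "\n".join(f"{key}: {value}" for key, value in self.user_block.items())
        turns = "\n".join(self.working)
        return f"[user block]\n{facts}\n[conversation]\n{turns}"


agent = AgentMemory(
    working=["user: any travel coming up?"],
    user_block={"timezone": "UTC+1"},
    episodes=["booked a flight to Lisbon", "asked for a pasta recipe"],
)

prompt = agent.build_prompt()
print(prompt)                           # contains the timezone, not the flight
print(agent.search_episodes("lisbon"))  # the flight appears only after a tool call
```

The design choice worth noticing is the asymmetry: semantic facts cost tokens on every turn but are never missed, while episodic records cost nothing until the model decides to look for them, and are missed whenever it does not.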
A user tells the agent: 'My daughter Marie is allergic to peanuts.' Three months later, the user asks: 'What snacks should I send to Marie's school?' How does each memory type contribute?