Week 5-6

Chapter 5: Memory I: Virtual Context & MemGPT

How agents remember things that don't fit in the context window. The context-window problem and why long-context models don't solve it. MemGPT's OS-inspired virtual memory, memory blocks, sleep-time compute for offline consolidation, and eviction/promotion policies.

Chapter Overview

An agent's only memory is its prompt. Once the conversation, tool results, and reflections grow past the context window, something has to give — and what gives determines whether the agent feels coherent or feels like a goldfish.

Long-context models (200k, 1M, 2M tokens) help but don't solve the problem. Three reasons: (1) cost scales linearly or worse with context length, (2) attention degrades — models perform worse on facts in the middle of long contexts (the 'lost in the middle' phenomenon), (3) some agent state genuinely outlives any single context window (a personal assistant that you've used for two years).

MemGPT (Packer et al., 2023, now productionized as Letta) reframed this: borrow the operating-systems trick of virtual memory. Treat the LLM context window as 'main memory' (small, fast) and an external store as 'disk' (large, slow). Page items in and out as needed. The model itself decides what to evict and what to recall, using tools.
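A minimal sketch of the paging idea, assuming a toy setting where capacity is counted in items rather than tokens and retrieval is a plain substring match. The names (`VirtualContext`, `append`, `recall`) are hypothetical and are not Letta's API, and the oldest-first eviction here is only a stand-in for the model-driven decisions described above.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualContext:
    """Toy two-tier memory: a small 'main' list and an unbounded 'archive' list."""

    main_capacity: int  # assumption: capacity counted in items, not tokens
    main: list[str] = field(default_factory=list)
    archive: list[str] = field(default_factory=list)

    def append(self, item: str) -> None:
        """Add an item to main context; page the oldest items out if over capacity."""
        self.main.append(item)
        while len(self.main) > self.main_capacity:
            self.archive.append(self.main.pop(0))

    def recall(self, keyword: str) -> list[str]:
        """Hypothetical tool call: page matching archived items back into main context."""
        hits = [item for item in self.archive if keyword.lower() in item.lower()]
        for item in hits:
            self.archive.remove(item)
            self.append(item)  # may in turn evict something older
        return hits


ctx = VirtualContext(main_capacity=2)
for message in ["user likes tea", "meeting moved to Friday", "invoice sent"]:
    ctx.append(message)

paged_out = list(ctx.archive)   # the oldest item was evicted to the archive
recalled = ctx.recall("tea")    # paging it back in evicts the next-oldest item
```

Nothing is lost when the main list overflows: an evicted item stays retrievable, at the price of an explicit recall step.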

This chapter covers:

The context-window problem — why long context isn't a complete fix
MemGPT: virtual context — paging items between context and external store
Memory blocks architecture — typed segments (system, persona, human, working) with explicit policies
Sleep-time compute — offline consolidation when the agent is idle
Evicting & promoting memories — what stays in core, what falls back to disk

Chapter Roadmap

Context Window Problem

Long context buys headroom but doesn't replace memory. Cost, lost-in-the-middle, and persistence all argue for external storage.

Why Long Context Isn't Enough · What 'Memory' Actually Means for Agents

Two halves of the in-context architecture

How items move (paging) and how they're shaped (blocks)

MemGPT: Virtual Context

OS-inspired virtual memory for LLMs — main context + archival, with the model itself paging items in and out.

The Two-Tier Architecture · The 'Function Call Heartbeat' Loop

Memory Blocks

Typed segments inside core context: system, persona, human, working — each with its own size and update policy.

The Standard Block Types · Block Size Tuning

What happens between user turns

Sleep-Time Compute

Offline consolidation when the agent is idle. Memory summarization, reflection generation, embedding updates.

What to Do During Sleep · The Cost-Quality Frontier of Sleep-Time

The policies that make it all work

Eviction & Promotion

What stays in core, what falls back to archival. The heart of every memory architecture.

Eviction Policies · Promotion Policies

Every LLM has a finite context window. As an agent runs, the context fills with system prompt + history + tool results + reflections. When it hits the limit, the agent's options are: (a) fail, (b) drop old content (and lose memory), or (c) pay for a 'memory' subsystem to manage what stays in.
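To make options (a) and (b) concrete, here is a small sketch under loud assumptions: `count_tokens` is a crude whitespace word count standing in for a real tokenizer, and `fit_to_window` is a hypothetical helper, not part of any library. It keeps the newest history entries that fit and silently drops the rest.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_window(system_prompt: str, history: list[str], window: int) -> list[str]:
    """Option (b): keep the newest history entries that fit, dropping older ones.

    Raises ValueError (option (a), failing) if the system prompt alone is too large.
    """
    budget = window - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the context window")

    kept: list[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first entry that does not fit.
    for entry in reversed(history):
        cost = count_tokens(entry)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["my name is Ada", "I live in Lyon", "what is the weather today"]
kept = fit_to_window("You are a helpful assistant", history, window=14)
```

With a 14-word window the oldest entry, the user's name, no longer fits and is gone for good. That loss is what option (c) is meant to avoid.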

In this topic

1. Why Long Context Isn't Enough

2. What 'Memory' Actually Means for Agents

Why Long Context Isn't Enough

The naive answer to memory is 'just use a 1M-token model.' Three reasons that fails:

Cost. Each prompt's cost scales roughly linearly with context length. A 1M-token prompt at $5 per million input tokens costs $5 per turn. At 100 turns, that is $500 per session, which is untenable for any consumer-facing product.
Quality degradation. The 'lost in the middle' paper (Liu et al., 2023) showed that even strong long-context models score worst on facts buried in the middle of the prompt. Recall is U-shaped across positions: high at the start and end, much lower in the middle.
Persistence. A personal assistant that remembers facts about you across years cannot fit two years of conversation in any context window. The fundamental need for external storage is unavoidable.
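The cost arithmetic above can be checked in a few lines. This is a sketch assuming the full 1M-token prompt is resent on every turn and that the $5 per million tokens price is illustrative rather than any provider's actual rate; the function names are hypothetical.

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of one prompt, assuming price scales linearly with prompt length."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


def session_cost_usd(prompt_tokens: int, usd_per_million_tokens: float, turns: int) -> float:
    """Cost of a session where every turn resends a prompt of the same size."""
    return turns * prompt_cost_usd(prompt_tokens, usd_per_million_tokens)


per_turn = prompt_cost_usd(1_000_000, 5.0)        # 5.0 dollars per turn
session = session_cost_usd(1_000_000, 5.0, 100)   # 500.0 dollars per 100-turn session
```

Under the same assumptions, a 20K-token prompt would cost $0.10 per turn, which is the gap a memory subsystem tries to exploit.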

Long context buys you headroom. It doesn't replace memory architecture.

What 'Memory' Actually Means for Agents

Three distinct flavors of agent memory, often conflated:

Working memory. The active context window: what the agent is reasoning about now. Roughly 10K to 200K tokens on most models, and 1M or more on long-context ones.
Episodic memory. Records of past sessions/conversations — 'last week we discussed X.' Often months or years deep. Not in working memory by default; pulled in on demand.
Semantic memory. Facts the agent knows about the user/world — 'this user prefers French; their daughter's name is Marie.' Smaller than episodic, included in working memory if relevant.

MemGPT's architecture maps neatly: working memory is core context; semantic memory lives in a user block always loaded in-context; episodic memory lives in external storage and is retrieved by tool calls.

Example:

A user tells the agent: 'My daughter Marie is allergic to peanuts.' Three months later, the user asks: 'What snacks should I send to Marie's school?' How does each memory type contribute?
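One way to picture the example is the sketch below. It assumes a toy `AgentMemory` class (a hypothetical name, not MemGPT's or Letta's API) in which the user block is a plain dictionary, the episodic store is a list searched by substring, and working memory is the prompt assembled for the current turn.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Toy model of the three memory types; all names here are illustrative."""

    user_block: dict[str, str] = field(default_factory=dict)  # semantic: always in context
    episodes: list[str] = field(default_factory=list)         # episodic: external store

    def search_episodes(self, keyword: str) -> list[str]:
        """Hypothetical retrieval tool: case-insensitive substring match."""
        return [e for e in self.episodes if keyword.lower() in e.lower()]

    def build_context(self, question: str, retrieved: list[str]) -> str:
        """Working memory: the prompt assembled for this turn."""
        facts = "\n".join(f"- {key}: {value}" for key, value in self.user_block.items())
        recalled = "\n".join(f"- {episode}" for episode in retrieved)
        return (
            f"Known facts:\n{facts}\n"
            f"Recalled episodes:\n{recalled}\n"
            f"Question: {question}"
        )


memory = AgentMemory()
# Three months earlier: the statement is logged as an episode and distilled into a fact.
memory.episodes.append("User said: My daughter Marie is allergic to peanuts.")
memory.user_block["daughter"] = "Marie (allergic to peanuts)"

question = "What snacks should I send to Marie's school?"
prompt = memory.build_context(question, memory.search_episodes("Marie"))
```

In this reading, the semantic fact in the user block is enough to rule out peanut snacks even if no episode is retrieved. The episodic record adds provenance (when and how the user said it), and working memory is simply the assembled prompt the model reasons over.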

Papers

Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)

MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023)

Blogs

The Long-Context RAG Question — LlamaIndex