Context Compaction: Managing Long Conversations Without Losing the Thread
Why conversation history grows until it breaks. Sliding window, summarisation-based, and hierarchical compaction strategies for production agents.
Every LLM conversation is a race against the context window. As a conversation grows — messages, tool results, retrieved documents, agent steps — it consumes tokens. Eventually, either the window fills and older content is dropped, or costs explode because every request re-sends an ever-growing history. Context compaction is the set of techniques for managing this.
Why context management matters more than you think
A 200K-token window comfortably fits a 100-turn conversation with tool use. But cost scales with input tokens on every request: a 100K-token context can cost on the order of $1 per request at frontier-model prices, and because the full history is re-sent each turn, the bill grows with every exchange. Once the history exceeds the window, something has to give: depending on the stack, the request is rejected or older content is truncated. What gets lost first is usually the middle of the conversation: the resolutions to earlier confusions, key decisions, agreed constraints.
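To make the cost claim concrete, here is a back-of-the-envelope sketch. It assumes every turn adds a fixed number of tokens and that the full history is re-sent on each request. The tokens-per-turn figure and the price per million input tokens are illustrative placeholders, not figures from any provider's price list.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when the full history is re-sent every turn."""
    # Request k carries k turns of history, so the total is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-side cost of sending `tokens` tokens at a given price."""
    return tokens * usd_per_million / 1_000_000


# Hypothetical numbers for illustration only.
TOKENS_PER_TURN = 1_000
USD_PER_MILLION_INPUT = 10.0

final_request = 100 * TOKENS_PER_TURN                  # size of the 100th request
total = cumulative_input_tokens(100, TOKENS_PER_TURN)  # everything sent over 100 turns

print(input_cost_usd(final_request, USD_PER_MILLION_INPUT))  # 1.0
print(input_cost_usd(total, USD_PER_MILLION_INPUT))          # 50.5
```

Under these assumptions the last request alone costs about a dollar, but the conversation as a whole costs roughly fifty times that, because the total grows quadratically with the number of turns.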
Context overflow is rarely a loud failure. When the context fills and older content is dropped, you may not notice immediately: the model keeps responding coherently while quietly losing the earlier constraints, corrections, and background that shaped the conversation.
Technique 1: Rolling window
Keep only the most recent turns that fit a fixed budget (a turn count or a token count), dropping the oldest messages when the window fills. Simple, fast, and free of extra model calls. Failure mode: the model loses information from early in the conversation that is still relevant, such as user preferences, established constraints, and critical facts stated early.
def get_context_window(messages, max_tokens=100_000, model="claude-opus-4-6"):
    # count_tokens(messages) is an assumed helper that returns the token count
    # of a list of messages; `model` is unused here.
    # Always keep the system message(s)
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]

    # Walk backwards from the newest message until the budget is spent
    kept = []
    running_total = count_tokens(system)
    for message in reversed(conversation):
        msg_tokens = count_tokens([message])
        if running_total + msg_tokens > max_tokens:
            break
        kept.append(message)
        running_total += msg_tokens

    kept.reverse()  # restore chronological order
    return system + kept
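The function above calls a `count_tokens` helper that the article never defines. One rough stand-in is sketched below, assuming about four characters per token for English text (a common rule of thumb, not an exact tokenizer). The per-message overhead constant is likewise a guess; for billing-accurate numbers you would swap in your provider's own token counting.

```python
import math

CHARS_PER_TOKEN = 4        # rule-of-thumb assumption for English prose
PER_MESSAGE_OVERHEAD = 4   # hypothetical allowance for role/formatting tokens


def count_tokens(messages: list[dict]) -> int:
    """Approximate the token count of a list of {"role", "content"} messages."""
    total = 0
    for message in messages:
        content = message.get("content", "")
        if not isinstance(content, str):
            # Structured content (tool calls, images) is stringified for a rough size.
            content = str(content)
        total += math.ceil(len(content) / CHARS_PER_TOKEN) + PER_MESSAGE_OVERHEAD
    return total


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please always answer in British English."},
]
print(count_tokens(history))
```

An estimate like this tends to be good enough for deciding when to trim, provided the budget leaves some headroom below the real limit.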
Technique 2: Conversation summarisation
When the conversation exceeds a threshold, summarise the oldest N turns into a compact summary, replace those turns with the summary, and continue. The summary preserves key facts, decisions, and context in far fewer tokens than the raw conversation.
COMPACTION_PROMPT = """Summarise the following conversation segment in 200-300 words.
Preserve: key decisions made, facts established, user preferences, unresolved questions.
Discard: small talk, repetitive exchanges, clarifications that were resolved.
Conversation:
{messages}"""
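def compact_context(messages, compaction_threshold=50_000):
    # count_tokens, format_messages and llm are assumed helpers defined elsewhere.
    current_tokens = count_tokens(messages)
    if current_tokens < compaction_threshold:
        return messages

    # Keep the original system prompt out of the summarised segment; earlier
    # summaries stay in the conversation so they can be re-summarised later.
    marker = "[Earlier conversation summary]:"
    def is_prompt(m):
        return m["role"] == "system" and not str(m["content"]).startswith(marker)
    system = [m for m in messages if is_prompt(m)]
    conversation = [m for m in messages if not is_prompt(m)]

    # Summarise the oldest third
    split = len(conversation) // 3
    if split == 0:
        return messages  # too few messages to compact
    to_summarise = conversation[:split]
    to_keep = conversation[split:]

    summary_text = llm(COMPACTION_PROMPT.format(
        messages=format_messages(to_summarise)
    ))
    summary_message = {
        "role": "system",
        "content": f"{marker} {summary_text}"
    }
    return system + [summary_message] + to_keep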
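To see the shape of what comes back, the sketch below runs the same oldest-third strategy against a stubbed summariser, so it executes without an API key. `fake_llm`, `compact_once`, and the character-based `count_tokens` and `format_messages` are all stand-ins invented for this illustration; in real use the summariser would be a model call.

```python
def count_tokens(messages):
    # Rough stand-in: about 4 characters per token (assumption, not a real tokenizer).
    return sum(len(m["content"]) // 4 + 1 for m in messages)


def format_messages(messages):
    # Render messages as "role: content" lines for the summarisation prompt.
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)


def fake_llm(prompt):
    # Stub summariser: a real implementation would call a model here.
    line_count = len(prompt.splitlines())
    return f"(summary of {line_count} messages)"


def compact_once(messages, threshold, summarise=fake_llm):
    """Replace the oldest third of `messages` with a single summary message."""
    if count_tokens(messages) < threshold:
        return messages
    split = len(messages) // 3
    if split == 0:
        return messages  # too few messages to compact
    summary = summarise(format_messages(messages[:split]))
    summary_message = {
        "role": "system",
        "content": f"[Earlier conversation summary]: {summary}",
    }
    return [summary_message] + messages[split:]


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i} " * 20}
    for i in range(9)
]
compacted = compact_once(history, threshold=100)

print(len(history), "->", len(compacted))  # 9 -> 7
print(compacted[0]["content"])
```

One thing a count-based split does not guard against is landing between a tool call and its result; in an agent setting it may be worth nudging the split point to a turn boundary.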
Technique 3: Memory extraction
At regular intervals, extract persistent facts from the conversation into a structured memory store — user preferences, established facts, key decisions. These facts are retrieved and re-injected into future context as needed, rather than keeping the full conversation history.
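import json

# Literal braces in the JSON example are doubled so str.format() leaves them alone.
MEMORY_EXTRACTION_PROMPT = """Review this conversation and extract:
1. User preferences (how they like things done)
2. Key facts established (names, IDs, decisions made)
3. Active constraints (things I must or must not do)
Return JSON: {{"preferences": [], "facts": [], "constraints": []}}
Conversation: {messages}"""

def extract_and_store_memories(messages, memory_store):
    # llm and format_messages are assumed helpers; memory_store is any object
    # exposing upsert(text, category=...).
    extracted = json.loads(llm(MEMORY_EXTRACTION_PROMPT.format(
        messages=format_messages(messages[-20:])  # Last 20 messages
    )))
    for fact in extracted.get("facts", []):
        memory_store.upsert(fact, category="fact")
    for pref in extracted.get("preferences", []):
        memory_store.upsert(pref, category="preference")
    # Constraints are requested in the prompt, so store them as well
    for constraint in extracted.get("constraints", []):
        memory_store.upsert(constraint, category="constraint")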
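The extraction step above covers only the write path. A minimal sketch of the other half, retrieving stored memories and re-injecting them into the prompt, might look like the following. `InMemoryStore` and `build_system_prompt` are hypothetical names for this illustration; the `upsert(text, category=...)` signature mirrors the call used above, and a production store would more likely be a database or vector index with relevance-based retrieval.

```python
from collections import defaultdict


class InMemoryStore:
    """Toy memory store: de-duplicated strings grouped by category."""

    def __init__(self):
        # dicts preserve insertion order, so they double as ordered sets here
        self._items = defaultdict(dict)

    def upsert(self, text, category):
        # Normalise whitespace and case for the key so near-identical entries merge.
        key = " ".join(text.lower().split())
        self._items[category][key] = text

    def get(self, category):
        return list(self._items[category].values())


def build_system_prompt(base_prompt, store):
    """Append stored memories to the base system prompt."""
    sections = [base_prompt]
    for category, title in [
        ("preference", "User preferences"),
        ("fact", "Known facts"),
        ("constraint", "Active constraints"),
    ]:
        items = store.get(category)
        if items:
            bullets = "\n".join(f"- {item}" for item in items)
            sections.append(f"{title}:\n{bullets}")
    return "\n\n".join(sections)


store = InMemoryStore()
store.upsert("prefers  metric units", category="preference")
store.upsert("Prefers metric units", category="preference")  # replaces the entry above
store.upsert("Order ID is A-1042", category="fact")

print(build_system_prompt("You are a helpful assistant.", store))
```

Injecting every stored memory works for small stores; once the store grows, selecting only the entries relevant to the current request keeps the memory block from becoming its own context problem.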
Technique 4: Anthropic's built-in compaction
Claude's API supports automatic context compaction — when the context window approaches capacity, Claude automatically summarises the oldest portions of the conversation to free up space. This is opt-in and configurable. For most production applications, native compaction is the easiest solution and works well for conversational use cases.
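import anthropic

client = anthropic.Anthropic()
# NOTE: the beta flag and edit type below are assumptions based on the
# compaction beta as understood at the time of writing; confirm both against
# the current Anthropic API reference before relying on them.
response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    # Enable automatic compaction
    betas=["compact-2026-01-12"],
    system="You are a helpful assistant.",
    messages=conversation_history,
    # Compaction behaviour is set through context management edits
    context_management={"edits": [{"type": "compact_20260112"}]},
)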
Choosing the right approach
| Use case | Recommended approach |
|---|---|
| Short task-focused sessions (<20 turns) | No compaction needed |
| Long conversations, stateless tasks | Rolling window — simple, cheap |
| Long conversations, stateful (user has prefs, facts) | Summarisation + memory extraction |
| Agents with many tool calls | Checkpoint + re-summarise every 10 steps |
| Consumer product, many users, long sessions | Native API compaction — lowest ops overhead |
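The agent row in the table ("checkpoint + re-summarise every 10 steps") is not spelled out elsewhere in this article, so the following is one possible reading rather than a reference implementation: run the agent loop and, every N steps, fold everything except the original task and the most recent few messages into a single checkpoint summary. `run_step` and `summarise` are hypothetical callables supplied by the caller, and the `done` key on a step result is an invented convention for this sketch.

```python
import itertools


def run_agent(task, run_step, summarise, max_steps=30, checkpoint_every=10, keep_recent=4):
    """Agent loop that periodically folds older steps into a checkpoint summary."""
    messages = [{"role": "user", "content": task}]
    for step in range(1, max_steps + 1):
        result = run_step(messages)  # e.g. a model call plus tool execution
        messages.append(result)
        if result.get("done"):
            break
        if step % checkpoint_every == 0 and len(messages) > keep_recent + 1:
            older, recent = messages[1:-keep_recent], messages[-keep_recent:]
            checkpoint = {
                "role": "user",
                "content": f"[Checkpoint after step {step}]: {summarise(older)}",
            }
            # Keep the original task, one checkpoint, and the latest messages.
            messages = [messages[0], checkpoint] + recent
    return messages


# Stubs so the sketch runs without a model: the fake agent finishes on step 25.
counter = itertools.count(1)


def fake_step(messages):
    n = next(counter)
    return {"role": "assistant", "content": f"step {n} output", "done": n == 25}


def fake_summarise(messages):
    return f"{len(messages)} earlier messages condensed"


final = run_agent("Audit the repo", fake_step, fake_summarise)
print(len(final))  # 11
print(final[1]["content"])
```

Because the previous checkpoint sits among the older messages at the next checkpoint, each summary absorbs the one before it, so the context stays bounded no matter how long the agent runs.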