
Context Compaction: Managing Long Conversations Without Losing the Thread

Why conversation history grows until it breaks. Sliding window, summarisation-based, and hierarchical compaction strategies for production agents.

Every LLM conversation is a race against the context window. As a conversation grows — messages, tool results, retrieved documents, agent steps — it consumes tokens. Eventually, either the window fills and older content is dropped, or costs explode because every request re-sends an ever-growing history. Context compaction is the set of techniques for managing this.

Why context management matters more than you think

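A 200K-token window comfortably fits a 100-turn conversation with tool use. But cost scales with input tokens on every request: a 100K-token context can mean $1 or more per request on frontier models, and because each request re-sends the whole history, the cumulative bill grows much faster than the conversation itself. Once the history exceeds the window, something has to go: depending on the stack, the request is rejected or the application truncates older content. Even inside the window, models tend to attend least reliably to the middle of a long context, which is often where the resolutions to earlier confusions, key decisions, and agreed constraints live.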
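To make the cost argument concrete, here is a back-of-the-envelope sketch. The function names and the price of $10 per million input tokens are illustrative assumptions, not figures from any provider's price list; substitute your own numbers.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when every request re-sends the full history."""
    # Request k carries k turns of history, so the total is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


def input_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Input cost at a given (hypothetical) price per million tokens."""
    return input_tokens / 1_000_000 * price_per_million_usd


# Hypothetical session: 100 turns of 1,000 tokens each, $10 per million input tokens.
total = cumulative_input_tokens(turns=100, tokens_per_turn=1_000)
print(total)                                  # 5050000
print(round(input_cost_usd(total, 10.0), 2))  # 50.5
```

The final history is only 100K tokens, but the session as a whole bills about 5M input tokens, because the total grows quadratically with the number of turns.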

Context failures are silent. When the context fills and older content falls out, nothing errors and you may not notice immediately. The model continues to respond coherently; it just gradually loses the earlier constraints, corrections, and background that shaped the conversation.

Technique 1: Rolling window

Keep only the most recent turns that fit within a token budget, dropping the oldest messages first. Simple, fast, and free of extra model calls. Failure mode: the model loses information from early in the conversation that is still relevant, such as user preferences, established constraints, and critical facts stated early.

def get_context_window(messages, max_tokens=100_000):
    # count_tokens is assumed to be a helper that returns the token count
    # for a list of messages (for example, a wrapper around a tokenizer).

    # Always keep system messages
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]

    # Walk backwards from the newest message until the budget is spent
    kept = []
    running_total = count_tokens(system)

    for message in reversed(conversation):
        msg_tokens = count_tokens([message])
        if running_total + msg_tokens > max_tokens:
            break
        kept.append(message)
        running_total += msg_tokens

    kept.reverse()  # restore chronological order
    return system + kept
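The function above relies on a count_tokens helper that the article does not define. A minimal sketch follows, assuming message content is a plain string and using a rough estimate of about four characters per token (a heuristic, not a real tokenizer). It also shows one possible mitigation for the failure mode described: pinning the first few turns so early constraints survive. The names estimate_tokens and pinned_window are hypothetical.

```python
def estimate_tokens(messages):
    """Rough token estimate: about 4 characters per token (heuristic only)."""
    return sum(len(m["content"]) // 4 + 1 for m in messages)


def pinned_window(messages, max_tokens, pin_first=2, count=estimate_tokens):
    """Keep system messages, the first `pin_first` turns, and the newest turns that fit."""
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]

    pinned = conversation[:pin_first]
    rest = conversation[pin_first:]
    budget = max_tokens - count(system) - count(pinned)

    kept = []
    for message in reversed(rest):
        cost = count([message])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order

    return system + pinned + kept


history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user", "content": f"message {i} " * 10} for i in range(50)]
trimmed = pinned_window(history, max_tokens=200)
print(len(trimmed))  # 8: the system message, 2 pinned turns, 5 recent turns
```

Pinning trades a little budget for stability: the opening turns stay put while everything between them and the recent tail is dropped.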

Technique 2: Conversation summarisation

When the conversation exceeds a token threshold, summarise the oldest turns into a compact summary, replace those turns with the summary, and continue. The summary preserves key facts, decisions, and context in far fewer tokens than the raw conversation, at the price of one extra model call per compaction.

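COMPACTION_PROMPT = """Summarise the following conversation segment in 200-300 words.
Preserve: key decisions made, facts established, user preferences, unresolved questions.
Discard: small talk, repetitive exchanges, clarifications that were resolved.

Conversation:
{messages}"""

SUMMARY_PREFIX = "[Earlier conversation summary]: "

def compact_context(messages, compaction_threshold=50_000):
    # count_tokens, llm and format_messages are assumed helpers.
    current_tokens = count_tokens(messages)
    if current_tokens < compaction_threshold:
        return messages

    # Keep the original system prompt out of the summarised span.
    # Earlier summaries are also system messages; they get folded into the new one.
    def is_prompt(m):
        return m["role"] == "system" and not m["content"].startswith(SUMMARY_PREFIX)

    system = [m for m in messages if is_prompt(m)]
    conversation = [m for m in messages if not is_prompt(m)]

    # Summarise the oldest third
    split = len(conversation) // 3
    if split == 0:
        return messages
    to_summarise = conversation[:split]
    to_keep = conversation[split:]

    summary_text = llm(COMPACTION_PROMPT.format(
        messages=format_messages(to_summarise)
    ))

    summary_message = {
        "role": "system",
        "content": SUMMARY_PREFIX + summary_text
    }

    return system + [summary_message] + to_keep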
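To see how repeated compaction behaves over a long session, the simulation below may help. It is a sketch under stated assumptions: stub_summarise stands in for a real model call, rough_tokens is a four-characters-per-token heuristic, and the first message is assumed to be the system prompt. All names are hypothetical.

```python
SUMMARY_PREFIX = "[Earlier conversation summary]: "


def rough_tokens(messages):
    """Heuristic token count: about 4 characters per token."""
    return sum(len(m["content"]) // 4 + 1 for m in messages)


def stub_summarise(text):
    """Stand-in for a model call; returns a short placeholder summary."""
    return f"(summary of {len(text)} characters)"


def compact(messages, threshold, summarise=stub_summarise):
    """Fold the oldest third of the conversation into a single summary message."""
    if rough_tokens(messages) < threshold:
        return messages
    head, tail = messages[:1], messages[1:]  # head is the system prompt
    split = max(1, len(tail) // 3)
    old, recent = tail[:split], tail[split:]
    text = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = {"role": "system", "content": SUMMARY_PREFIX + summarise(text)}
    return head + [summary] + recent


history = [{"role": "system", "content": "You are a helpful assistant."}]
peak = 0
for turn in range(200):
    history.append({"role": "user", "content": f"question {turn} " + "x" * 200})
    history.append({"role": "assistant", "content": f"answer {turn} " + "y" * 200})
    history = compact(history, threshold=2_000)
    peak = max(peak, rough_tokens(history))

print(peak < 2_500)  # True: the context stays bounded over 200 turns
```

Because the previous summary sits at the front of the conversation, each compaction folds it into the next one. That keeps the context bounded, but it also means details are re-summarised repeatedly and can degrade over a long session.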

Technique 3: Memory extraction

At regular intervals, extract persistent facts from the conversation into a structured memory store — user preferences, established facts, key decisions. These facts are retrieved and re-injected into future context as needed, rather than keeping the full conversation history.

import json

MEMORY_EXTRACTION_PROMPT = """Review this conversation and extract:
1. User preferences (how they like things done)
2. Key facts established (names, IDs, decisions made)
3. Active constraints (things I must or must not do)

Return JSON: {{"preferences": [], "facts": [], "constraints": []}}

Conversation: {messages}"""

def extract_and_store_memories(messages, memory_store):
    # llm, format_messages and memory_store are assumed helpers.
    # The doubled braces in the prompt stop str.format from treating
    # the JSON example as a placeholder.
    extracted = json.loads(llm(MEMORY_EXTRACTION_PROMPT.format(
        messages=format_messages(messages[-20:])  # Last 20 messages
    )))

    for fact in extracted.get("facts", []):
        memory_store.upsert(fact, category="fact")
    for pref in extracted.get("preferences", []):
        memory_store.upsert(pref, category="preference")
    for constraint in extracted.get("constraints", []):
        memory_store.upsert(constraint, category="constraint")
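The extraction step covers writing to the store; the retrieval and re-injection half could look something like the sketch below. InMemoryStore and build_memory_block are hypothetical names, and a production store would more likely be a database or vector index than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Toy memory store keyed by (category, normalised text)."""
    items: dict = field(default_factory=dict)

    def upsert(self, text: str, category: str) -> None:
        # Keying on normalised text makes repeated extractions idempotent.
        self.items[(category, text.strip().lower())] = text.strip()

    def by_category(self, category: str) -> list:
        return [value for (cat, _), value in self.items.items() if cat == category]


def build_memory_block(store: InMemoryStore) -> str:
    """Render stored memories as text that can be prepended to the system prompt."""
    sections = []
    for category, title in [
        ("constraint", "Constraints"),
        ("preference", "Preferences"),
        ("fact", "Facts"),
    ]:
        entries = store.by_category(category)
        if entries:
            sections.append(title + ":\n" + "\n".join(f"- {e}" for e in entries))
    return "\n\n".join(sections)


store = InMemoryStore()
store.upsert("Prefers metric units", category="preference")
store.upsert("prefers metric units ", category="preference")  # same key, overwrites
store.upsert("Order ID is A-1042", category="fact")
print(build_memory_block(store))
```

Rendering constraints first is a design choice, not a requirement: it puts the instructions the model must not violate closest to the top of the prompt.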

Technique 4: Anthropic's built-in compaction

Claude's API supports automatic context compaction — when the context window approaches capacity, Claude automatically summarises the oldest portions of the conversation to free up space. This is opt-in and configurable. For most production applications, native compaction is the easiest solution and works well for conversational use cases.

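import anthropic

client = anthropic.Anthropic()

# Compaction is a beta feature. The beta flag and edit type below follow
# Anthropic's compaction documentation at the time of writing and may change,
# so check the current API reference before relying on them.
response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    betas=["compact-2026-01-12"],
    system="You are a helpful assistant.",
    messages=conversation_history,
    # Enable server-side compaction of older context
    context_management={"edits": [{"type": "compact_20260112"}]},
)

# Append the full response content so any compaction block carries into the next turn
conversation_history.append({"role": "assistant", "content": response.content})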

Choosing the right approach

Use case	Recommended approach
Short task-focused sessions (<20 turns)	No compaction needed
Long conversations, stateless tasks	Rolling window — simple, cheap
Long conversations, stateful (user has prefs, facts)	Summarisation + memory extraction
Agents with many tool calls	Checkpoint + re-summarise every 10 steps
Consumer product, many users, long sessions	Native API compaction — lowest ops overhead
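The "checkpoint + re-summarise" row is not covered by the techniques above. One possible reading is sketched here, with hypothetical names throughout: the agent keeps raw tool results only since the last checkpoint and folds them into a running summary every N steps. stub_summarise stands in for a model call.

```python
def run_agent(steps, summarise, checkpoint_every=10):
    """Run tool steps, folding raw results into a running summary every N steps."""
    summary = ""   # compact record of everything checkpointed so far
    pending = []   # raw tool results since the last checkpoint
    for i, step in enumerate(steps, start=1):
        pending.append(step())
        if i % checkpoint_every == 0:
            summary = summarise(summary, pending)
            pending = []
    return summary, pending


def stub_summarise(previous, results):
    """Stand-in for a model call: records how many results were folded in."""
    return (previous + f"[{len(results)} tool results condensed]").strip()


# 25 fake tool calls that each return a short string.
steps = [lambda n=n: f"result {n}" for n in range(25)]
summary, pending = run_agent(steps, stub_summarise)
print(summary)       # [10 tool results condensed][10 tool results condensed]
print(len(pending))  # 5
```

The context sent to the model at any step would then be the summary plus the pending results, so it never holds more than one checkpoint interval of raw tool output.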

