The Context Window: What Goes In, What Gets Dropped, and Why It Matters
How LLMs process long inputs, what context overflow looks like in production, and strategies to stay within the window without losing what matters.
**No prerequisites — standalone read.** After this post you'll understand what the context window is, why it fills up, what gets silently dropped and in what order, and how production systems manage the tradeoff between context length and cost.
The context window is the LLM's working memory. Everything the model knows about your conversation, your documents, your instructions — all of it is in there. When it runs out, something gets dropped. The model doesn't warn you.
Understanding the context window is essential for building reliable RAG systems, agents, and chatbots. It's not just a limit to stay under — it's a resource to manage actively.
What is the context window?
A transformer processes all tokens in a sequence simultaneously via self-attention. The context window is the maximum number of tokens it can process in a single forward pass — the ceiling on sequence length.
| Model | Context window | Rough equivalent |
|---|---|---|
| GPT-3.5 | 16K tokens | ~12,000 words |
| GPT-4o | 128K tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |
| Llama 3 70B | 128K tokens | ~96,000 words |
Larger context window ≠ better performance. Models trained with shorter contexts often perform worse in the middle of long inputs. A 200K window doesn't mean the model uses all 200K equally well.
What competes for context space
In a typical production system, the context window is shared among multiple components. Each one competes for the same finite space:
- System prompt — instructions, persona, rules, few-shot examples. Can easily be 1,000–5,000 tokens.
- Conversation history — grows with every turn. Uncapped, it eventually fills the window.
- Retrieved chunks (RAG) — each chunk is typically 200–500 tokens. Top-5 retrieval = 1,000–2,500 tokens.
- Tool results (agents) — API responses, function outputs. Can be large and unpredictable.
- The user's current query — usually small, but multi-turn queries accumulate.
- The model's output — consumes tokens from your max_tokens budget, not input context, but still relevant to total cost.
In a RAG system with a 5K system prompt, 10-turn conversation history (3K), and top-5 chunks (2K), you've already spent 10K tokens before generating a single word of response.
The lost-in-the-middle problem
Research (Liu et al., 2023) demonstrated that LLMs attend disproportionately to the beginning and end of their context. Information buried in the middle of a long context is frequently missed — even when it's explicitly relevant to the query.
Practical implications:
- Put critical instructions at the start of the system prompt, not in the middle
- If using RAG, consider placing the most relevant chunk first and last — not sandwiched between lower-relevance ones
- Don't rely on the model "seeing" something just because it's in context — especially in long contexts
What happens when you overflow
When your total input exceeds the context limit, the API will either reject the request (with a context length error) or silently truncate. The truncation behaviour depends on the system — most drop from the oldest conversation turns.
In RAG systems, overflow means fewer chunks get included — reducing recall without any visible signal. In agents, it means tool results or earlier reasoning steps disappear — the agent can lose track of what it was doing.
Strategies for context management
- Sliding window — drop oldest messages when total tokens exceed a threshold. Simple, fast. Loses early context.
- Summarisation — periodically compress conversation history into a summary, replacing the raw turns. Preserves semantic content at lower token cost.
- Selective retrieval — instead of injecting all history, retrieve relevant past turns using embedding similarity. More complex but highly effective.
- Context compaction (Claude) — automatic compression of long conversations while preserving critical information. Happens transparently during agentic runs.
- Hierarchical memory — separate short-term (full history) from long-term (compressed summaries) storage. Used in multi-session agents.
A common mistake: treating the context window as "memory". It's not. The model has no memory between API calls. What feels like memory is the conversation history being re-injected into the context on every turn. This means costs scale with conversation length — plan accordingly.
Cost and latency implications
Every token in your input context is processed at inference time. Larger context = higher cost + higher time-to-first-token (TTFT). For Claude and GPT-4, input tokens are priced — a 100K token context costs roughly 3–10× more than a 10K context, depending on the model.
Prompt caching can dramatically reduce costs on repeated large system prompts: if the first 90K tokens are identical across requests, only the final 10K needs to be freshly computed. Cache hit rates above 80% are achievable in well-designed systems.
Test Context Window & Cost →: See how context fills up across a conversation, and what gets dropped when you overflow.
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
- Context Compaction — Anthropic Research
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →