
The Context Window: What Goes In, What Gets Dropped, and Why It Matters

How LLMs process long inputs, what context overflow looks like in production, and strategies to stay within the window without losing what matters.

**No prerequisites — standalone read.** After this post you'll understand what the context window is, why it fills up, what gets silently dropped and in what order, and how production systems manage the tradeoff between context length and cost.

The context window is the LLM's working memory. Everything the model knows about your conversation, your documents, your instructions — all of it is in there. When it runs out, something gets dropped. The model doesn't warn you.

Understanding the context window is essential for building reliable RAG systems, agents, and chatbots. It's not just a limit to stay under — it's a resource to manage actively.

What is the context window?

A transformer processes all tokens in a sequence simultaneously via self-attention. The context window is the maximum number of tokens it can process in a single forward pass — the ceiling on sequence length.

| Model | Context window | Rough equivalent |
|---|---|---|
| GPT-3.5 | 16K tokens | ~12,000 words |
| GPT-4o | 128K tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |
| Llama 3.1 70B | 128K tokens | ~96,000 words |
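The "rough equivalent" column follows from a common rule of thumb of about 0.75 English words per token. The sketch below reproduces that arithmetic; the ratio is an assumption that varies by tokenizer and language, and `approx_words` is a hypothetical helper, not part of any SDK.

```python
# Rule of thumb (an assumption, not a real tokenizer): ~0.75 English words per token.
WORDS_PER_TOKEN = 0.75


def approx_words(context_tokens: int) -> int:
    """Rough word-count equivalent of a context window measured in tokens."""
    return round(context_tokens * WORDS_PER_TOKEN)


# Window sizes taken from the table above (K = 1,000 tokens).
windows = {
    "GPT-3.5": 16_000,
    "GPT-4o": 128_000,
    "Claude 3.5 Sonnet": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
}

for model, tokens in windows.items():
    print(f"{model}: ~{approx_words(tokens):,} words")
```

For real budgeting, count tokens with the tokenizer that matches your model rather than estimating from words.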

A larger context window does not guarantee better performance. Models trained mostly on shorter sequences often handle content in the middle of a long input worse than content near the edges. A 200K window doesn't mean the model uses all 200K tokens equally well.

What competes for context space

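In a typical production system, the context window is shared among several components: the system prompt, the conversation history, retrieved documents, the current user message, and the space reserved for the model's response. Each one competes for the same finite space.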

In a RAG system with a 5K system prompt, 10-turn conversation history (3K), and top-5 chunks (2K), you've already spent 10K tokens before generating a single word of response.
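The arithmetic above is easy to turn into an explicit budget check. This is a minimal sketch: the component names, the 128K limit, and the 4K output reservation are illustrative assumptions, and `remaining_budget` is a hypothetical helper.

```python
def remaining_budget(
    context_limit: int,
    components: dict[str, int],
    reserved_for_output: int = 0,
) -> int:
    """Tokens still free after fixed components and the output reservation."""
    used = sum(components.values())
    return context_limit - used - reserved_for_output


# Token counts from the RAG example in the text.
components = {
    "system_prompt": 5_000,
    "conversation_history": 3_000,
    "retrieved_chunks": 2_000,
}

used = sum(components.values())
# Assumed values: a 128K window and 4K tokens held back for the response.
left = remaining_budget(128_000, components, reserved_for_output=4_000)

print(f"spent before generation: {used:,} tokens")
print(f"free for more context:   {left:,} tokens")
```

Running a check like this before every request makes it obvious which component is growing, instead of discovering it through an overflow.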

The lost-in-the-middle problem

Research (Liu et al., 2023) demonstrated that LLMs attend disproportionately to the beginning and end of their context. Information buried in the middle of a long context is frequently missed — even when it's explicitly relevant to the query.

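Practical implication: where a fact sits in the prompt affects whether the model uses it, so put instructions and the highest-ranked material near the start or end rather than in the middle.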
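One way to act on this finding is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt and the weakest land in the middle. The sketch below assumes the chunks arrive sorted by relevance, best first; `reorder_for_edges` is a hypothetical helper, and this is one possible mitigation rather than a method prescribed by the paper.

```python
def reorder_for_edges(ranked_chunks: list[str]) -> list[str]:
    """Place top-ranked chunks at the start and end, weakest in the middle.

    Assumes `ranked_chunks` is sorted by relevance, most relevant first.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(ranked_chunks):
        # Alternate: ranks 1, 3, 5... go to the front; ranks 2, 4... to the back.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(reorder_for_edges(ranked))
```

With five chunks the best one comes first, the second-best comes last, and the lowest-ranked chunk sits in the middle, where a miss costs least.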

What happens when you overflow

When your total input exceeds the context limit, the request is either rejected with a context-length error or silently truncated. Truncation behaviour depends on the system; most drop the oldest conversation turns first.

In RAG systems, overflow means fewer chunks get included — reducing recall without any visible signal. In agents, it means tool results or earlier reasoning steps disappear — the agent can lose track of what it was doing.
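Oldest-first truncation can be sketched in a few lines, and returning the number of dropped turns gives you the visible signal that silent truncation lacks. The `count_tokens` function here is a hypothetical stand-in that assumes about four characters per token; swap in your model's real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Hypothetical stand-in: assumes ~4 characters per token.
    return max(1, len(text) // 4)


def trim_history(turns: list[str], budget: int) -> tuple[list[str], int]:
    """Keep the most recent turns that fit in `budget` tokens.

    Returns (kept_turns_in_original_order, number_of_dropped_turns).
    """
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so the oldest turns are dropped first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    return kept, len(turns) - len(kept)


history = ["a" * 40, "b" * 40, "c" * 40]  # 10 tokens each under the assumption above
kept, dropped = trim_history(history, budget=25)
if dropped:
    print(f"warning: dropped {dropped} oldest turn(s) to fit the context budget")
```

Logging or counting `dropped` in production lets you see how often truncation happens and correlate it with quality regressions.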

Strategies for context management

A common mistake: treating the context window as "memory". It's not. The model has no memory between API calls. What feels like memory is the conversation history being re-injected into the context on every turn. This means costs scale with conversation length — plan accordingly.
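Because the whole history is resent on every call, the total number of input tokens billed over a conversation grows faster than the number of turns. The sketch below makes that concrete under a simplifying assumption: every turn adds a fixed number of tokens to the history. `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(
    num_turns: int,
    tokens_per_turn: int,
    system_tokens: int = 0,
) -> int:
    """Total input tokens sent across a conversation that resends full history.

    Simplifying assumption: each turn adds `tokens_per_turn` to the history.
    """
    total = 0
    history = 0
    for _ in range(num_turns):
        history += tokens_per_turn
        # Each request carries the system prompt plus everything said so far.
        total += system_tokens + history
    return total


ten = cumulative_input_tokens(10, 300)
twenty = cumulative_input_tokens(20, 300)
print(f"10 turns: {ten:,} input tokens, 20 turns: {twenty:,} input tokens")
```

Doubling the conversation from 10 to 20 turns nearly quadruples the total input tokens in this model, which is why summarising or trimming old turns pays off in long sessions.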

Cost and latency implications

Every token in your input context is processed at inference time. Larger context = higher cost + higher time-to-first-token (TTFT). For models such as Claude and GPT-4, input tokens are billed per token, so a 100K-token context costs roughly 3–10× more per request than a 10K context, depending on the model and on how much of the bill comes from output tokens.
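The input side of the bill is simple to estimate. This sketch assumes flat per-token pricing, and the price of $3.00 per million input tokens is a hypothetical figure chosen for illustration; check your provider's current price sheet.

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token charge for one request under flat per-token pricing."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


PRICE = 3.00  # hypothetical USD per million input tokens

small = input_cost_usd(10_000, PRICE)
large = input_cost_usd(100_000, PRICE)

print(f"10K context:  ${small:.2f} per request")
print(f"100K context: ${large:.2f} per request")
print(f"input-cost ratio: {large / small:.1f}x")
```

Under flat pricing the input charge scales linearly with context size; the ratio for the full request is smaller once output tokens, which are billed separately, are included.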

Prompt caching can dramatically reduce costs on repeated large system prompts: if the first 90K tokens are identical across requests, only the final 10K needs to be freshly computed. Cache hit rates above 80% are achievable in well-designed systems.
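To see what caching does to the bill, the sketch below prices the 90K/10K split from the example. It assumes cached tokens are billed at a fraction of the normal input price; both the $3.00 base price and the 0.1 discount factor are hypothetical, since providers differ on cache pricing and on charges for writing to the cache.

```python
def request_cost_usd(
    cached_tokens: int,
    fresh_tokens: int,
    usd_per_million_tokens: float,
    cached_price_factor: float,
) -> float:
    """Input cost of one request when part of the prompt is served from cache.

    `cached_price_factor` is the fraction of the base price charged for
    cached tokens (a hypothetical parameter, e.g. 0.1 for a 90% discount).
    """
    cached = cached_tokens / 1_000_000 * usd_per_million_tokens * cached_price_factor
    fresh = fresh_tokens / 1_000_000 * usd_per_million_tokens
    return cached + fresh


PRICE = 3.00   # hypothetical USD per million input tokens
FACTOR = 0.1   # hypothetical: cached tokens cost 10% of the base price

uncached = request_cost_usd(0, 100_000, PRICE, FACTOR)
cache_hit = request_cost_usd(90_000, 10_000, PRICE, FACTOR)

print(f"no cache:  ${uncached:.3f}")
print(f"cache hit: ${cache_hit:.3f}")
```

The saving only materialises when the shared prefix is byte-for-byte identical across requests, so keep stable content (system prompt, tool definitions) at the front and variable content at the end.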
