GenAI Systems Lab
Production & LLMOps 11 min read

How Cursor Built AI Code Completion: Context, Retrieval, and Sub-Second Latency

Cursor's architecture for codebase understanding — tree-sitter AST parsing, embedding-based file retrieval, fill-in-the-middle for completions, and how they handle the dual latency problem.

Code completion has a unique set of constraints that make it one of the hardest production AI problems. The latency budget for inline suggestions is 200ms: any slower, and the completion appears after the developer has already moved on. The context problem is just as severe: a production codebase can contain hundreds of thousands of files, but the LLM context window holds only a small fraction of them.

Cursor solves both. Their editor integrates AI at two levels: sub-200ms ghost text completion (powered by smaller, faster models) and codebase-aware chat (powered by frontier models with intelligent retrieval). The architecture behind each is fundamentally different.

Fill-in-the-middle (FIM): how completions work

Standard LLMs predict what comes next given a prefix. Code completion is different: the developer has a cursor in the middle of a file. There's context before the cursor (prefix) and context after (suffix). The model needs to fill in the gap.

Fill-in-the-middle (FIM) is the training objective that enables this. During training, a segment is removed from the middle of code, and the model learns to predict the removed segment given both prefix and suffix. OpenAI's Codex, StarCoder, and DeepSeek-Coder all support FIM.

<PRE> def calculate_total(items):
    prices = [item.price for item in items]
    <SUF>
    return total
<MID>

The model fills in what goes between prefix and suffix. Cursor formats context precisely to maximize cache hits on the prefix tokens — a prompt caching optimization that reduces latency significantly on repeated completions in the same file.
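As a minimal sketch of the prompt layout shown above, the following hypothetical helpers split a file at the cursor and format a FIM prompt. The `<PRE>`, `<SUF>`, and `<MID>` strings are copied from the example and stand in for whatever special tokens a given model defines. `shared_prefix_len` is only an illustrative proxy for how much of a prompt a prefix cache could reuse between keystrokes; it is an assumption, not Cursor's actual caching logic.

```python
import os

# Sentinel strings copied from the example above; real models define their
# own special tokens, so treat these as placeholders.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Return (prefix, suffix): the text before and after the cursor offset."""
    return source[:cursor], source[cursor:]


def format_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out the prompt as prefix, then suffix, then the slot to fill."""
    return f"{PRE} {prefix}{SUF}{suffix}{MID}"


def shared_prefix_len(prompt_a: str, prompt_b: str) -> int:
    """Count leading characters two prompts share (a rough proxy for cache reuse)."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))


source = (
    "def calculate_total(items):\n"
    "    prices = [item.price for item in items]\n"
    "    \n"
    "    return total\n"
)
cursor = source.index("    \n") + 4  # cursor sits at the end of the blank indented line
prefix, suffix = split_at_cursor(source, cursor)
prompt = format_fim_prompt(prefix, suffix)

# Simulate the developer typing "total" at the cursor: the new prompt differs
# only after the old prefix, so everything before that point is unchanged.
next_prompt = format_fim_prompt(prefix + "total", suffix)
reused = shared_prefix_len(prompt, next_prompt)
```

In this sketch, `reused` covers the sentinel plus the entire old prefix, which is why keeping the prefix at the front of the prompt and formatting it identically between requests helps a cache.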

Context assembly: what goes into the completion prompt

The completion model sees more than just the current file. Cursor assembles context from multiple sources and ranks it by relevance.

The context budget is limited. Cursor allocates it greedily — most-relevant content goes in first, least-relevant gets dropped. The relevance ranking is what makes completions feel context-aware rather than generic.
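The greedy allocation described here can be sketched as follows. This is an illustration under stated assumptions, not Cursor's implementation: the `Snippet` type, the source labels, the relevance scores, and the whitespace-based `count_tokens` are all hypothetical, and a real system would use the model's tokenizer and its own ranking signal.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str       # hypothetical label for where the text came from
    text: str
    relevance: float  # higher means more relevant; scoring is assumed to be given


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real system would use the model's tokenizer."""
    return len(text.split())


def pack_context(snippets: list[Snippet], budget: int) -> list[Snippet]:
    """Greedily keep the most relevant snippets that still fit in the budget."""
    chosen: list[Snippet] = []
    used = 0
    for snip in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = count_tokens(snip.text)
        if used + cost <= budget:
            chosen.append(snip)
            used += cost
    return chosen


candidates = [
    Snippet("current_file", "def total(items): return sum(i.price for i in items)", 0.9),
    Snippet("imported_module", "class Item: price: float", 0.7),
    Snippet("unrelated_file", "def render_banner(): print('hello world from the banner')", 0.1),
]
packed = pack_context(candidates, budget=15)
```

With a budget of 15 tokens, the two most relevant snippets fit and the low-relevance one is dropped.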

Context selection is the main lever on completion quality. More context isn't always better — irrelevant context is noise. Cursor's retrieval step is the system's real intelligence, not the completion model itself.

Codebase indexing for chat

Cursor's 'Chat' and 'Composer' (multi-file editing) features require understanding the whole codebase. This is a different problem from completion.

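Cursor indexes the repository using a combination of tree-sitter AST parsing, which splits code at semantic boundaries such as functions, and embedding-based retrieval over the resulting units.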

This is meaningfully better than embedding raw file contents. A function has a clear semantic boundary. Embedding file chunks with arbitrary boundaries loses this structure.
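To illustrate the difference between semantic and arbitrary chunk boundaries, here is a small sketch. It uses Python's standard `ast` module as a stand-in for tree-sitter (a swapped-in technique chosen so the example runs without extra dependencies), and the function names `chunk_by_definition` and `chunk_by_lines` are hypothetical. The embedding step is left out.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


def chunk_by_lines(source: str, size: int) -> list[str]:
    """Naive baseline: fixed-size line windows that ignore code structure."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


example = '''\
def load(path):
    with open(path) as f:
        return f.read()

def parse(text):
    return text.split(",")
'''

semantic = chunk_by_definition(example)  # one chunk each for load and parse
naive = chunk_by_lines(example, size=2)  # first window cuts load() in half
```

Each semantic chunk is a complete function that can be embedded as a unit, while the first line-based window contains only the opening two lines of `load`.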

The dual latency model

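Cursor runs two parallel inference paths at different latency budgets: a fast path for inline ghost-text completions (sub-200ms, served by smaller, faster models) and a slow path for chat (served by frontier models with retrieval).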

The routing between paths is based on the type of interaction. Typing → fast path. Explicit chat message → slow path. This is model routing at the UI layer, not the backend.
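A routing rule of this kind is simple enough to sketch directly. The names below (`Interaction`, `InferencePath`, the model labels) are hypothetical placeholders rather than Cursor's real identifiers. The 200ms figure comes from the article, and the slow path's budget is left unset because the article does not state one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Interaction(Enum):
    TYPING = "typing"
    CHAT_MESSAGE = "chat_message"


@dataclass(frozen=True)
class InferencePath:
    name: str
    model: str                         # placeholder label, not a real model ID
    latency_budget_ms: Optional[int]   # None means no budget is stated in the article


FAST_PATH = InferencePath("fast", "small-completion-model", 200)
SLOW_PATH = InferencePath("slow", "frontier-chat-model", None)


def route(interaction: Interaction) -> InferencePath:
    """Pick the inference path from the kind of UI interaction alone."""
    if interaction is Interaction.TYPING:
        return FAST_PATH
    return SLOW_PATH


typing_path = route(Interaction.TYPING)
chat_path = route(Interaction.CHAT_MESSAGE)
```

Because the decision depends only on the interaction type, it can be made in the editor before any request is sent.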

Measuring quality: accepted completions

Cursor's primary quality metric is acceptance rate: the percentage of shown completions that the developer actually accepts rather than dismisses or types past. A high-quality completion system has >30% acceptance on good context.

But acceptance rate is noisy. A developer might accept a completion that's technically correct but not what they wanted. Persistence rate — whether the completion is still in the file 30 days later — is a better signal and closer to what GitHub Copilot uses.
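Both metrics are simple ratios over logged completion events. The sketch below assumes a hypothetical `CompletionEvent` record with a `still_present_later` flag. How and when that flag is measured is an assumption left outside the code.

```python
from dataclasses import dataclass


@dataclass
class CompletionEvent:
    shown: bool
    accepted: bool
    still_present_later: bool  # measured at some later check; the window is assumed


def acceptance_rate(events: list[CompletionEvent]) -> float:
    """Share of shown completions that were accepted."""
    shown = [e for e in events if e.shown]
    if not shown:
        return 0.0
    return sum(e.accepted for e in shown) / len(shown)


def persistence_rate(events: list[CompletionEvent]) -> float:
    """Share of accepted completions that were still in the file at the later check."""
    accepted = [e for e in events if e.shown and e.accepted]
    if not accepted:
        return 0.0
    return sum(e.still_present_later for e in accepted) / len(accepted)


events = [
    CompletionEvent(shown=True, accepted=True, still_present_later=True),
    CompletionEvent(shown=True, accepted=True, still_present_later=False),
    CompletionEvent(shown=True, accepted=False, still_present_later=False),
    CompletionEvent(shown=True, accepted=False, still_present_later=False),
]
acceptance = acceptance_rate(events)
persistence = persistence_rate(events)
```

Note that the two rates use different denominators: acceptance is computed over shown completions, persistence only over accepted ones.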
