How Cursor Built AI Code Completion: Context, Retrieval, and Sub-Second Latency

Cursor's architecture for codebase understanding — tree-sitter AST parsing, embedding-based file retrieval, fill-in-the-middle for completions, and how they handle the dual latency problem.

Code completion has a unique set of constraints that make it one of the hardest production AI problems. The latency budget for inline suggestions is roughly 200ms: any slower, and the completion appears after the developer has already moved on. The context problem is just as severe: a large production codebase can span hundreds of thousands of files, but the LLM context window holds only a small fraction of them.

Cursor solves both. Their editor integrates AI at two levels: sub-200ms ghost text completion (powered by smaller, faster models) and codebase-aware chat (powered by frontier models with intelligent retrieval). The architecture behind each is fundamentally different.

Fill-in-the-middle (FIM): how completions work

Standard LLMs predict what comes next given a prefix. Code completion is different: the developer has a cursor in the middle of a file. There's context before the cursor (prefix) and context after (suffix). The model needs to fill in the gap.

Fill-in-the-middle (FIM) is the training objective that enables this. During training, a segment is removed from the middle of code, and the model learns to predict the removed segment given both prefix and suffix. OpenAI's Codex, StarCoder, and DeepSeek-Coder all support FIM.

<PRE> def calculate_total(items):
    prices = [item.price for item in items]
    <SUF>
    return total
<MID>

The model fills in what goes between prefix and suffix. Cursor formats context precisely to maximize cache hits on the prefix tokens — a prompt caching optimization that reduces latency significantly on repeated completions in the same file.
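A minimal sketch of how such a prompt could be assembled from the editor buffer and the cursor offset. The sentinel strings are copied from the example above; real FIM-trained models define their own special tokens, and `build_fim_prompt` is a hypothetical helper, not Cursor's code.

```python
# Sentinel strings follow the example above; real models define their own
# special tokens, so treat these as placeholders.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Split `source` at `cursor` and arrange it in prefix-suffix-middle order."""
    prefix, suffix = source[:cursor], source[cursor:]
    # The model is expected to generate the missing middle after the MID sentinel.
    return f"{PRE} {prefix}{SUF}{suffix}{MID}"


source = (
    "def calculate_total(items):\n"
    "    prices = [item.price for item in items]\n"
    "    \n"
    "    return total\n"
)
# Place the cursor at the end of the empty, indented third line.
cursor = source.index("    \n") + 4
prompt = build_fim_prompt(source, cursor)
print(prompt)
```

Because the prefix comes first, each new keystroke mostly extends the same leading text. That is the part a prefix cache could plausibly reuse between consecutive requests.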

Context assembly: what goes into the completion prompt

The completion model sees more than just the current file. Cursor assembles context from multiple sources, ranked by relevance:

Current file: the entire file being edited, truncated to fit the context window
Recently edited files: files modified in the last 30 minutes, included with recency weighting
Imported files: files imported by the current file — often directly relevant
Related files: BM25 + semantic search over the repository for files similar to the current editing context
Git diff: what changed in the current branch — helps the model understand what problem is being solved

The context budget is limited. Cursor allocates it greedily — most-relevant content goes in first, least-relevant gets dropped. The relevance ranking is what makes completions feel context-aware rather than generic.
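The greedy allocation described above can be sketched roughly as follows. The `Candidate` type, the source labels, the relevance scores, and the four-characters-per-token estimate are all assumptions made for illustration; how Cursor actually scores and counts is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str       # illustrative label, e.g. "current_file" or "git_diff"
    text: str
    relevance: float  # higher means more relevant; scoring is out of scope here


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def pack_context(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Greedily keep the most relevant candidates that still fit the budget."""
    chosen, used = [], 0
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(cand.text)
        if used + cost <= budget:
            chosen.append(cand)
            used += cost
    return chosen


candidates = [
    Candidate("related_file", "z" * 400, 0.5),
    Candidate("current_file", "x" * 400, 1.0),
    Candidate("imported_file", "w" * 80, 0.4),
    Candidate("git_diff", "y" * 200, 0.8),
]
packed = pack_context(candidates, budget=180)
print([c.source for c in packed])
```

This version skips a candidate that does not fit and keeps trying smaller, lower-ranked ones. Stopping at the first item that overflows is an equally reasonable variant.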

Context selection is the main lever on completion quality. More context isn't always better — irrelevant context is noise. Cursor's retrieval step is the system's real intelligence, not the completion model itself.

Codebase indexing for chat

Cursor's 'Chat' and 'Composer' (multi-file editing) features require understanding the whole codebase. This is a different problem from completion.

Cursor indexes the repository using a combination of:

Tree-sitter AST parsing: each file is parsed into an abstract syntax tree. Functions, classes, methods, and their docstrings are extracted as distinct chunks — not arbitrary text chunks, but semantically meaningful code units.
Embedding-based search: code chunks are embedded using a code-specialized embedding model. When you ask a question, your question is embedded and matched against the chunk index.
Symbol graph: import relationships between files are tracked. If you ask about a class, the system can retrieve not just the class definition but its callers, its dependencies, and its tests.
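To make the symbol-graph idea concrete, here is a small sketch that records which files import which modules, so that a question about one module can pull in its dependents. It uses Python's standard `ast` module on Python sources only, and the file names and contents are invented for the example. A real index would also need to resolve symbols across languages.

```python
import ast
from collections import defaultdict


def imported_modules(source: str) -> set[str]:
    """Return the module names a Python source file imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


def build_import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each module name to the set of files that import it."""
    importers = defaultdict(set)
    for path, source in files.items():
        for module in imported_modules(source):
            importers[module].add(path)
    return dict(importers)


# Hypothetical three-file repository.
files = {
    "billing.py": "from models import Invoice\nimport utils\n",
    "test_billing.py": "import billing\nfrom models import Invoice\n",
    "models.py": "class Invoice:\n    pass\n",
}
graph = build_import_graph(files)
print(sorted(graph["models"]))  # files that depend on models.py
```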

This is meaningfully better than embedding raw file contents. A function has a clear semantic boundary. Embedding file chunks with arbitrary boundaries loses this structure.
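The difference is easy to see in a small sketch. The code below uses Python's built-in `ast` module as a stand-in for tree-sitter, so it only handles Python files, and it emits one chunk per top-level function or class. The dictionary fields are an assumed shape, not a documented Cursor format.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "docstring": ast.get_docstring(node),
                # Exact source text of the definition, with no neighbors mixed in.
                "text": ast.get_source_segment(source, node),
            })
    return chunks


source = '''
def add(a, b):
    """Return a + b."""
    return a + b


class Cart:
    def total(self):
        return 0
'''
chunks = chunk_by_definition(source)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Each chunk starts and ends at a definition boundary, so an embedding of it describes one unit of behavior. A fixed-size text window could instead cut `add` in half or merge it with `Cart`.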

The dual latency model

Cursor runs two parallel inference paths at different latency budgets:

Fast path (<200ms): small, specialized models (likely DeepSeek-Coder or similar) for ghost text completion. No round-trip to a frontier model. Cached prefix tokens reduce latency further.
Slow path (1-10s): frontier models (Claude, GPT-4o) for codebase chat, multi-file edits, and complex refactors. Users accept latency here because they've issued a deliberate command.

The routing between paths is based on the type of interaction. Typing → fast path. Explicit chat message → slow path. This is model routing at the UI layer, not the backend.
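As a rough illustration, routing on interaction type can be as simple as a lookup keyed on the UI event. The latency budgets mirror the figures above. The model labels are placeholders and the `Route` structure is an assumption for the sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Interaction(Enum):
    TYPING = auto()
    CHAT_MESSAGE = auto()


@dataclass(frozen=True)
class Route:
    path: str
    model: str      # placeholder label, not a real model identifier
    budget_ms: int


ROUTES = {
    Interaction.TYPING: Route("fast", "small-completion-model", 200),
    Interaction.CHAT_MESSAGE: Route("slow", "frontier-chat-model", 10_000),
}


def route(interaction: Interaction) -> Route:
    """Pick the inference path from the kind of UI event, before any backend call."""
    return ROUTES[interaction]


print(route(Interaction.TYPING).path)        # fast
print(route(Interaction.CHAT_MESSAGE).path)  # slow
```

No classifier or learned router is involved in this version: the decision is fully determined by what the user did.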

Measuring quality: accepted completions

Cursor's primary quality metric is acceptance rate: the percentage of shown completions the developer actually accepts (for example, by pressing Tab) rather than dismissing or typing past them. As a rough benchmark, a high-quality completion system sees acceptance above 30% when the context is good.

But acceptance rate is noisy. A developer might accept a completion that's technically correct but not what they wanted. Persistence rate, meaning whether an accepted completion is still in the file after a set interval such as 30 days, is arguably a better signal and closer to the retention-style measurements GitHub has described for Copilot.
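Both metrics are simple ratios once completion events are logged. The sketch below assumes a hypothetical `CompletionEvent` record. The field names and the sample data are invented, and the interval at which `still_present` is checked is a policy choice left outside the code.

```python
from dataclasses import dataclass


@dataclass
class CompletionEvent:
    shown: bool
    accepted: bool
    still_present: bool  # checked at some later time chosen by the team


def acceptance_rate(events: list[CompletionEvent]) -> float:
    """Accepted completions divided by shown completions."""
    shown = [e for e in events if e.shown]
    return sum(e.accepted for e in shown) / len(shown) if shown else 0.0


def persistence_rate(events: list[CompletionEvent]) -> float:
    """Accepted completions that survived, divided by all accepted completions."""
    accepted = [e for e in events if e.shown and e.accepted]
    return sum(e.still_present for e in accepted) / len(accepted) if accepted else 0.0


# Invented sample: 10 shown, 4 accepted, 3 of those still in the file later.
events = (
    [CompletionEvent(True, True, True)] * 3
    + [CompletionEvent(True, True, False)]
    + [CompletionEvent(True, False, False)] * 6
)
print(acceptance_rate(events))   # 0.4
print(persistence_rate(events))  # 0.75
```

Note that the denominators differ. Persistence is measured only over accepted completions, so the two numbers can move independently.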
