How Cursor Built AI Code Completion: Context, Retrieval, and Sub-Second Latency
Cursor's architecture for codebase understanding — tree-sitter AST parsing, embedding-based file retrieval, fill-in-the-middle for completions, and how they handle the dual latency problem.
Code completion has a unique set of constraints that make it one of the hardest production AI problems. The latency budget for inline suggestions is about 200ms: any slower, and the completion appears after the developer has already moved on. The context problem is just as severe: a large production codebase can contain hundreds of thousands of files, but the LLM context window holds only a small fraction of them.
Cursor solves both. Their editor integrates AI at two levels: sub-200ms ghost text completion (powered by smaller, faster models) and codebase-aware chat (powered by frontier models with intelligent retrieval). The architecture behind each is fundamentally different.
Fill-in-the-middle (FIM): how completions work
Standard LLMs predict what comes next given a prefix. Code completion is different: the developer has a cursor in the middle of a file. There's context before the cursor (prefix) and context after (suffix). The model needs to fill in the gap.
Fill-in-the-middle (FIM) is the training objective that enables this. During training, a segment is removed from the middle of code, and the model learns to predict the removed segment given both prefix and suffix. OpenAI's Codex, StarCoder, and DeepSeek-Coder all support FIM.
<PRE> def calculate_total(items):
    prices = [item.price for item in items]
<SUF>
    return total
<MID>
The model fills in what goes between prefix and suffix. Cursor formats context precisely to maximize cache hits on the prefix tokens — a prompt caching optimization that reduces latency significantly on repeated completions in the same file.
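A minimal sketch of how such a prompt could be assembled from a file and a cursor offset. The sentinel strings are copied from the example above; real FIM models each define their own special tokens, and the function names `build_fim_prompt` and `make_training_example` are hypothetical, not Cursor's API.

```python
# Sentinel strings taken from the example above; each real model defines its own.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Split `source` at character offset `cursor` and lay it out prefix-first."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the document")
    prefix, suffix = source[:cursor], source[cursor:]
    # The prefix comes first, so while the user keeps typing at the cursor the
    # earlier part of the prompt is unchanged (the property prefix caching needs).
    return f"{PRE} {prefix}{SUF}{suffix}{MID}"


def make_training_example(code: str, start: int, end: int) -> tuple[str, str]:
    """Cut code[start:end] out of the middle; return (prompt, target)."""
    prompt = f"{PRE} {code[:start]}{SUF}{code[end:]}{MID}"
    return prompt, code[start:end]


source = (
    "def calculate_total(items):\n"
    "    prices = [item.price for item in items]\n"
    "    \n"
    "    return total\n"
)
# Place the cursor right after the indentation of the empty third line.
cursor = source.index("    \n") + 4
prompt = build_fim_prompt(source, cursor)
print(prompt)
```

The same split is used in both directions: at training time the removed span is the target, at inference time the model's output is inserted at the cursor.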
Context assembly: what goes into the completion prompt
The completion model sees more than just the current file. Cursor assembles context from multiple sources, ranked by relevance:
- Current file: the entire file being edited, truncated to fit the context window
- Recently edited files: files modified in the last 30 minutes, included with recency weighting
- Imported files: files imported by the current file — often directly relevant
- Related files: BM25 + semantic search over the repository for files similar to the current editing context
- Git diff: what changed in the current branch — helps the model understand what problem is being solved
The context budget is limited. Cursor allocates it greedily — most-relevant content goes in first, least-relevant gets dropped. The relevance ranking is what makes completions feel context-aware rather than generic.
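The greedy allocation described above can be sketched as follows. This is an illustration under stated assumptions: the `Candidate` type, the relevance scores, and the four-characters-per-token estimate are all hypothetical stand-ins, and how Cursor actually scores and tokenizes is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str       # e.g. "current_file", "imported", "git_diff"
    text: str
    relevance: float  # higher is more relevant; the scoring itself is assumed


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def pack_context(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Greedily keep the most relevant candidates that still fit the budget."""
    chosen = []
    remaining = budget
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(cand.text)
        if cost <= remaining:  # anything that no longer fits is dropped
            chosen.append(cand)
            remaining -= cost
    return chosen


candidates = [
    Candidate("current_file", "x" * 400, 1.0),  # ~100 tokens
    Candidate("git_diff", "y" * 200, 0.6),      # ~50 tokens
    Candidate("related", "z" * 400, 0.4),       # ~100 tokens
    Candidate("imported", "w" * 80, 0.8),       # ~20 tokens
]
chosen = pack_context(candidates, budget=180)
print([c.source for c in chosen])
```

This variant skips an item that does not fit and keeps trying smaller ones; stopping at the first miss, or truncating the item instead, are equally plausible designs.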
Context selection is the main lever on completion quality. More context isn't always better — irrelevant context is noise. Cursor's retrieval step is the system's real intelligence, not the completion model itself.
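To make the retrieval step concrete, here is a small sketch of the lexical half of "BM25 + semantic search": a plain Okapi BM25 scorer over file contents. The tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the sample files are assumptions for illustration; the semantic (embedding) half is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercased identifier-like tokens; a real system might also split camelCase.
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def bm25_scores(query: str, docs: dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Score every document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return {}
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many files each term appears.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


docs = {
    "billing.py": "def calculate_total(items): return sum(item.price for item in items)",
    "auth.py": "def login(user, password): return check(user, password)",
    "cart.py": "def add_item(cart, item): cart.items.append(item)",
}
scores = bm25_scores("calculate_total price", docs)
best = max(scores, key=scores.get)
```

Scores like these could feed the relevance ranking directly or be blended with embedding similarity; the source does not say how Cursor combines the two.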
Codebase indexing for chat
Cursor's Chat and Composer (multi-file editing) features require understanding the whole codebase. This is a different problem from completion.
Cursor indexes the repository using a combination of:
- Tree-sitter AST parsing: each file is parsed into an abstract syntax tree. Functions, classes, methods, and their docstrings are extracted as distinct chunks — not arbitrary text chunks, but semantically meaningful code units.
- Embedding-based search: code chunks are embedded using a code-specialized embedding model. When you ask a question, your question is embedded and matched against the chunk index.
- Symbol graph: import relationships between files are tracked. If you ask about a class, the system can retrieve not just the class definition but its callers, its dependencies, and its tests.
This is meaningfully better than embedding raw file contents. A function has a clear semantic boundary. Embedding file chunks with arbitrary boundaries loses this structure.
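The idea of chunking along syntactic boundaries can be sketched with Python's standard-library `ast` module. This is a swapped-in stand-in for tree-sitter (which handles many languages and tolerates partially broken code), and the `Chunk` type and `chunk_python_source` name are hypothetical.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str        # "function" or "class"
    name: str
    docstring: str
    source: str
    start_line: int


def chunk_python_source(source: str) -> list[Chunk]:
    """Return one chunk per function, method, or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        chunks.append(Chunk(
            kind=kind,
            name=node.name,
            docstring=ast.get_docstring(node) or "",
            # Exact text of the definition, cut at its syntactic boundary.
            source=ast.get_source_segment(source, node) or "",
            start_line=node.lineno,
        ))
    return sorted(chunks, key=lambda c: c.start_line)


sample = '''
class Cart:
    """A shopping cart."""

    def total(self):
        """Sum item prices."""
        return sum(i.price for i in self.items)


def helper():
    return 1
'''
chunks = chunk_python_source(sample)
for c in chunks:
    print(c.kind, c.name, c.start_line)
```

Each chunk would then be embedded and stored with its file path and line number, so a hit in the index can be mapped back to a precise location in the repository.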
The dual latency model
Cursor runs two parallel inference paths at different latency budgets:
- Fast path (<200ms): small, specialized models (likely DeepSeek-Coder or similar) for ghost text completion. No round-trip to a frontier model. Cached prefix tokens reduce latency further.
- Slow path (1-10s): frontier models (Claude, GPT-4o) for codebase chat, multi-file edits, and complex refactors. Users accept latency here because they've issued a deliberate command.
The routing between paths is based on the type of interaction. Typing → fast path. Explicit chat message → slow path. This is model routing at the UI layer, not the backend.
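As a rough sketch, routing on interaction type needs nothing more than a lookup keyed by the UI event. The enum members, model names, and `Route` type below are placeholders, not real identifiers; only the latency budgets come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Interaction(Enum):
    TYPING = "typing"
    CHAT_MESSAGE = "chat_message"
    MULTI_FILE_EDIT = "multi_file_edit"


@dataclass(frozen=True)
class Route:
    path: str
    model: str              # placeholder name, not a real model identifier
    latency_budget_ms: int


FAST = Route("fast", "small-completion-model", 200)
SLOW = Route("slow", "frontier-chat-model", 10_000)


def route(interaction: Interaction) -> Route:
    # The decision uses only the kind of UI event, never the request content.
    return FAST if interaction is Interaction.TYPING else SLOW


print(route(Interaction.TYPING))
print(route(Interaction.CHAT_MESSAGE))
```

Because the choice depends only on the event type, it can be made in the editor before any request is sent.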
Measuring quality: accepted completions
Cursor's primary quality metric is acceptance rate: the percentage of shown completions the developer actually accepts (for example by pressing Tab) rather than dismissing or typing past them. A high-quality completion system has >30% acceptance on good context.
But acceptance rate is noisy. A developer might accept a completion that's technically correct but not what they wanted. Persistence rate — whether the completion is still in the file 30 days later — is a better signal and closer to what GitHub Copilot uses.
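Both metrics are simple ratios over logged completion events. The sketch below assumes a hypothetical `CompletionEvent` record and treats "still in the file" as an exact substring match, which is a simplification; a real measurement would likely tolerate small edits.

```python
from dataclasses import dataclass


@dataclass
class CompletionEvent:
    shown_text: str
    accepted: bool
    file_text_later: str = ""  # file contents at the later checkpoint


def acceptance_rate(events: list[CompletionEvent]) -> float:
    """Accepted completions divided by shown completions."""
    if not events:
        return 0.0
    return sum(e.accepted for e in events) / len(events)


def persistence_rate(events: list[CompletionEvent]) -> float:
    """Of the accepted completions, the share still present at the checkpoint."""
    accepted = [e for e in events if e.accepted]
    if not accepted:
        return 0.0
    kept = sum(e.shown_text in e.file_text_later for e in accepted)
    return kept / len(accepted)


events = [
    CompletionEvent("total = sum(prices)", True, "    total = sum(prices)\n"),
    CompletionEvent("total = 0", True, "    total = sum(prices)\n"),
    CompletionEvent("return None", False),
    CompletionEvent("pass", False),
]
print(acceptance_rate(events), persistence_rate(events))
```

The second event shows why the two numbers diverge: the completion was accepted but later rewritten, so it counts toward acceptance and against persistence.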