
GitHub Copilot at Scale: How a Hundred Million Suggestions Per Day Actually Works

Copilot's retrieval architecture, how it scores and ranks context from open files and imports, the dual-path latency model, and how quality is measured across 100M+ daily completions.

GitHub Copilot launched in 2021 as the first mass-market AI code completion tool. Three years later, it processes hundreds of millions of suggestions per day across millions of developers. Understanding how it works at this scale reveals production decisions that apply far beyond code completion.

The context problem at scale

A code completion request arrives every time a developer pauses. At Copilot's scale, that's thousands of requests per second. Each request needs relevant context assembled from the developer's open files in under 50ms (the budget before the network round-trip even begins).

Copilot's context assembly uses what GitHub calls a 'prompt crafting' pipeline. Given a cursor position, it assembles:

Path marker: the current file's language and path — gives the model domain context
Similar files: files with similar content to the current file, retrieved using BM25 over file names and a lightweight semantic match over import lists
Recently opened files: files the developer viewed recently, with a recency decay (files from 2 minutes ago count more than files from 2 hours ago)
Current file prefix: everything in the current file above the cursor
Current file suffix: a window of text below the cursor (for fill-in-the-middle, or FIM, models)
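The pieces listed above can be combined roughly as follows. This is a minimal sketch, not Copilot's actual pipeline: the `Snippet` type, the `assemble_prompt` function, the character-based budget (a real system would count tokens), the 60% prefix share, and the `<fim_*>` sentinel strings are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Placeholder sentinels; real FIM models each define their own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


@dataclass
class Snippet:
    path: str
    text: str
    score: float  # relevance from some upstream ranker (hypothetical)


def assemble_prompt(path, language, prefix, suffix, snippets,
                    budget=2000, suffix_window=200, prefix_share=0.6):
    """Build a FIM-style prompt whose content fits a character budget.

    The budget covers the header, snippets, prefix and suffix; the
    sentinel strings themselves are not counted.
    """
    header = f"# Path: {path} ({language})\n"
    suffix_part = suffix[:suffix_window]
    remaining = max(0, budget - len(header) - len(suffix_part))

    # Keep the END of the prefix: text nearest the cursor matters most.
    prefix_cap = int(remaining * prefix_share)
    prefix_part = prefix[-prefix_cap:] if prefix_cap > 0 else ""
    remaining -= len(prefix_part)

    # Spend what is left on the highest-scoring snippets that still fit.
    context = []
    for s in sorted(snippets, key=lambda s: s.score, reverse=True):
        block = f"# From {s.path}:\n{s.text}\n"
        if len(block) > remaining:
            continue
        context.append(block)
        remaining -= len(block)

    return (header + "".join(context)
            + FIM_PREFIX + prefix_part
            + FIM_SUFFIX + suffix_part
            + FIM_MIDDLE)
```

The design choice worth noting is the ordering of budget claims: the path marker and the code around the cursor are reserved first, and retrieved snippets only receive whatever space remains.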

Copilot doesn't embed the full codebase per request. Retrieval relies on fast heuristics: BM25 keyword matching and recency signals. This keeps context assembly under 50ms even for large repositories.
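To make the heuristic concrete, here is a small sketch that scores candidate files with BM25 over file-name tokens and blends in an exponential recency decay. It is an illustration under stated assumptions, not Copilot's implementation: the tokenizer, the 10-minute half-life, the 50/50 blend in `recency_mix`, and every function name are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(name):
    # Split a path into lowercase alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", name.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each doc against the query (standard formula)."""
    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in doc_tokens) / n if n else 0.0
    df = Counter(term for toks in doc_tokens for term in set(toks))
    q_terms = set(tokenize(query))
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def recency_weight(age_seconds, half_life_seconds=600.0):
    # The weight halves every half_life_seconds (assumed value).
    return 0.5 ** (age_seconds / half_life_seconds)


def rank_candidates(current_path, candidates, now, top_k=3, recency_mix=0.5):
    """candidates: list of (path, last_viewed_timestamp_in_seconds)."""
    paths = [p for p, _ in candidates]
    lexical = bm25_scores(current_path, paths)
    max_lex = max(lexical, default=0.0) or 1.0  # avoid dividing by zero
    ranked = []
    for (path, last_viewed), lex in zip(candidates, lexical):
        rec = recency_weight(now - last_viewed)
        ranked.append((path, (1 - recency_mix) * lex / max_lex + recency_mix * rec))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

Everything here is plain arithmetic over short token lists, which is why this style of retrieval fits comfortably inside a tight latency budget where a per-request embedding pass would not.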

The two-path architecture

Copilot runs completions on two latency tracks:

Inline completions (<300ms): a single-line or short multi-line suggestion that appears as ghost text while you type. Uses a smaller, faster model optimized for low latency. Fires automatically after a short typing pause.
Multi-line completions (300ms–1s): triggered by a longer pause or explicit request. Uses a larger model. Can suggest entire function implementations.
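The split between the two tracks can be sketched as a small routing function. The pause thresholds (`short_pause_ms`, `long_pause_ms`), the model names, and the `max_lines` caps below are hypothetical values chosen for illustration; only the 300ms and 1s deadlines come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompletionPath:
    name: str
    model: str        # placeholder model identifier
    deadline_ms: int  # latency budget for this track
    max_lines: int    # assumed cap on suggestion length


INLINE = CompletionPath("inline", "small-fast-model", deadline_ms=300, max_lines=3)
MULTILINE = CompletionPath("multiline", "large-model", deadline_ms=1000, max_lines=40)


def choose_path(pause_ms: int, explicit_request: bool = False,
                short_pause_ms: int = 75,
                long_pause_ms: int = 750) -> Optional[CompletionPath]:
    """Pick a latency track from typing behavior.

    Returns None when the developer is still typing, so no request is sent.
    """
    if explicit_request or pause_ms >= long_pause_ms:
        return MULTILINE
    if pause_ms >= short_pause_ms:
        return INLINE
    return None
```

For example, `choose_path(100)` selects the inline track, while `choose_path(900)` or any explicit request selects the multi-line track.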

The model itself has changed significantly from the original Codex (2021) to current versions. Codex was a fine-tuned GPT-3 variant. Modern Copilot uses models specifically trained for code with FIM support and stronger context utilization.

Measuring quality at 100M+ daily completions

You cannot manually review quality at this scale. Copilot uses three automated quality signals:

Acceptance rate: the fraction of shown completions that the developer accepts (by pressing Tab). Copilot targets >35% on meaningful completions; a suggestion that is shown for under 1 second and then ignored is excluded from the count.
Persistence rate: is the accepted completion still present in the file 30 days later, roughly unchanged? This measures whether completions are actually useful vs. accepted then immediately deleted.
Code-as-written rate: does the committed code contain Copilot suggestions without modification? Copilot can recognize its own suggestions in git diffs and measure how many reach production code.

These metrics are collected passively — no user surveys, no manual annotation. The system monitors developer behavior at scale to infer quality.
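As a rough sketch of how two of these signals could be computed from passively collected events: the `CompletionEvent` schema, the reading of the 1-second rule as "exclude briefly shown, unaccepted suggestions from the denominator", and the 0.8 similarity threshold for "roughly unchanged" are all assumptions, not documented Copilot behavior.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class CompletionEvent:
    shown_ms: int                         # how long the suggestion was visible
    accepted: bool                        # developer pressed Tab
    survived_30d: Optional[bool] = None   # None means the outcome is not known yet


def acceptance_rate(events, min_shown_ms=1000):
    # Unaccepted suggestions shown for less than min_shown_ms are not counted.
    counted = [e for e in events if e.accepted or e.shown_ms >= min_shown_ms]
    if not counted:
        return 0.0
    return sum(e.accepted for e in counted) / len(counted)


def persistence_rate(events):
    # Only accepted completions with a known 30-day outcome are considered.
    known = [e for e in events if e.accepted and e.survived_30d is not None]
    if not known:
        return 0.0
    return sum(e.survived_30d for e in known) / len(known)


def roughly_unchanged(original, current, threshold=0.8):
    # Character-level similarity between accepted text and what is in the file now.
    return SequenceMatcher(None, original, current).ratio() >= threshold
```

A job that revisits accepted completions after 30 days could call `roughly_unchanged` on the stored suggestion and the current file region to fill in `survived_30d`.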

Behavioral signals (acceptance, persistence, usage) are far more reliable quality measures than explicit ratings for developer tools. Developers don't rate completions — they just use them or don't. Build your evals around what developers actually do.

Enterprise: privacy and dedicated infrastructure

Enterprise Copilot is architecturally different from the personal tier. Key differences:

Code is routed to Azure OpenAI deployments in the customer's region, never to shared infrastructure
Snippets are not retained after the request completes — no training on enterprise customer code
Enterprise customers can connect private codebase indexes (Copilot Enterprise) — indexing internal repositories to improve retrieval quality
The model weights are the same; the infrastructure isolation is what's different
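One way to picture isolation enforced by infrastructure rather than by policy text is a router that fails closed. The sketch below is purely illustrative: `TenantPolicy`, `REGION_ENDPOINTS`, and the `example.invalid` hostnames are invented for this example and do not describe GitHub's or Azure's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    tenant_id: str
    region: str
    retain_snippets: bool = False  # enterprise default: nothing kept after the request
    allow_training: bool = False   # enterprise default: never used for training


# Hypothetical per-region dedicated deployments.
REGION_ENDPOINTS = {
    "westeurope": "https://eu.completions.example.invalid",
    "eastus": "https://us.completions.example.invalid",
}


def resolve_endpoint(policy: TenantPolicy) -> str:
    """Return the dedicated endpoint for a tenant, or raise.

    There is deliberately no fallback to a shared endpoint.
    """
    if policy.retain_snippets or policy.allow_training:
        raise ValueError("enterprise tenants must not retain or train on snippets")
    if policy.region not in REGION_ENDPOINTS:
        raise LookupError(f"no dedicated deployment for region {policy.region!r}")
    return REGION_ENDPOINTS[policy.region]
```

The point of the sketch is the missing fallback: a request for an unsupported region fails instead of being quietly served from shared infrastructure.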

What Copilot teaches about production AI at scale

Fast retrieval beats perfect retrieval for latency-sensitive systems. BM25 plus recency signals assembled in 50ms beat neural retrieval assembled in 500ms for real-time completion.
Behavioral metrics outperform explicit quality ratings. Acceptance rate and persistence rate are ground truth. User ratings are noise.
Two-path architecture separates latency concerns cleanly. Sub-300ms path for the common case; slower path for complexity. Design your latency tiers before designing your models.
Privacy-preserving design requires infrastructure isolation, not just policy. Enterprise AI products live or die on the trust their data handling creates.
