GenAI Systems Lab
AI Engineering 12 min read

How I'd Build an AI Coding Assistant

Repository-aware context, diff-aware completion, retrieval over codebases, and the eval harness that actually catches regressions.

The gap between a coding assistant demo and one engineers actually use daily is huge. Here's what separates them: repository awareness, diff-aware context, and evals that catch regressions before users do.

The Context Problem

Naive coding assistants stuff the current file into the context window. Good ones retrieve relevant context from across the repository: related files, function definitions, type signatures, and test files for the function being edited. Retrieval for code differs from text RAG: you want structural relevance (imports, call graphs), not just semantic similarity.
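To make "structural relevance" concrete, here is a minimal sketch of an import graph for a Python repository, built with the standard-library `ast` module. The `build_import_graph` helper, the in-memory `repo` dict, and the path-to-module naming convention are assumptions for illustration; a real indexer would also need to handle relative imports, packages, and other languages.

```python
import ast
from collections import defaultdict


def imported_modules(source: str) -> set[str]:
    """Return the dotted module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped here.
            modules.add(node.module)
    return modules


def build_import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each file path to the in-repo files it imports.

    `files` maps a path such as "app/db.py" to its source text.
    """
    # Assumed convention: module name = path without ".py", with "/" replaced by ".".
    module_to_path = {path[:-3].replace("/", "."): path for path in files}
    graph = defaultdict(set)
    for path, source in files.items():
        for module in imported_modules(source):
            if module in module_to_path:  # ignore stdlib and third-party imports
                graph[path].add(module_to_path[module])
    return dict(graph)


# Hypothetical three-file repository.
repo = {
    "app/main.py": "import app.db\nfrom app.models import User\nimport os\n",
    "app/db.py": "from app.models import User\n",
    "app/models.py": "class User: ...\n",
}
graph = build_import_graph(repo)
```

When the user edits `app/main.py`, the neighbors in `graph` are candidates for the context window even if their text looks nothing like the code under the cursor, which is the case embedding search tends to miss.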

What to Index

Diff-Aware Completion

When the user is editing code, the context should reflect the current diff — not just the current file state. Include: the git diff of modified files, the surrounding unchanged code for context, and any type errors or linter warnings in the current state. This lets the model generate completions that fix the error, not just continue from the last character.
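One way to assemble that context is sketched below. The `Diagnostic` type, the `build_diff_context` function, and the section headings in the prompt are all hypothetical; in practice the diff text would typically come from running `git diff` on the file and the diagnostics from your type checker or linter, but they are passed in as plain values here so the sketch has no tooling dependencies.

```python
from dataclasses import dataclass


@dataclass
class Diagnostic:
    line: int      # 1-based line number in the current file
    message: str   # type error or linter warning text


def build_diff_context(path, diff, file_text, diagnostics, cursor_line, window=3):
    """Assemble prompt context: the diff, then diagnostics, then code near the cursor."""
    lines = file_text.splitlines()
    start = max(0, cursor_line - 1 - window)
    end = min(len(lines), cursor_line + window)
    # Number the surrounding lines so diagnostics can be matched to code.
    surrounding = "\n".join(f"{i + 1}: {lines[i]}" for i in range(start, end))
    diag_text = "\n".join(
        f"{path}:{d.line}: {d.message}" for d in diagnostics
    ) or "(none)"
    return (
        f"### Uncommitted diff for {path}\n{diff}\n\n"
        f"### Diagnostics\n{diag_text}\n\n"
        f"### Code near line {cursor_line}\n{surrounding}\n"
    )


# Hypothetical edit that introduced an undefined name.
file_text = "def add(a, b):\n    return a + c\n"
diff = "@@ -1,2 +1,2 @@\n def add(a, b):\n-    return a + b\n+    return a + c\n"
diagnostics = [Diagnostic(line=2, message="undefined name 'c'")]
context = build_diff_context("calc.py", diff, file_text, diagnostics, cursor_line=2)
```

With the diff and the diagnostic both in view, the model can see that `c` replaced `b` and that the change broke the function, rather than guessing from the final file state alone.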

The Retrieval Stack

Context Type              | Retrieval Method               | Priority
--------------------------|--------------------------------|---------
Current file              | Direct (always included)       | Highest
Called functions          | Symbol index lookup            | High
Imported modules          | Import graph traversal         | High
Semantically similar code | Embedding search over codebase | Medium
Recent git changes        | git log --diff-filter=M        | Medium
Test files                | Symbol index (test_ prefix)    | Medium
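The priorities in the table only matter once the context window fills up. The sketch below shows one possible packing policy: sort retrieved snippets by priority and add them greedily until a token budget is spent. The `Snippet` type, the `PRIORITY` mapping, the four-characters-per-token estimate, and the `pack_context` function are assumptions for illustration, not a prescribed design.

```python
from dataclasses import dataclass

# Priorities mirror the table above; a lower number is packed first.
PRIORITY = {
    "current_file": 0,
    "called_function": 1,
    "import": 1,
    "semantic": 2,
    "git_change": 2,
    "test_file": 2,
}


@dataclass
class Snippet:
    kind: str     # one of the PRIORITY keys
    source: str   # where the snippet came from, e.g. a file path
    text: str


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def pack_context(snippets, budget_tokens):
    """Pack snippets in priority order, skipping any that would exceed the budget."""
    chosen, used = [], 0
    # sorted() is stable, so snippets of equal priority keep their retrieval order.
    for snippet in sorted(snippets, key=lambda s: PRIORITY[s.kind]):
        cost = estimate_tokens(snippet.text)
        # The current file is always included, even if it overshoots the budget.
        if snippet.kind == "current_file" or used + cost <= budget_tokens:
            chosen.append(snippet)
            used += cost
    return chosen, used


# Hypothetical retrieval results, in the order the retrievers returned them.
snippets = [
    Snippet("semantic", "utils/similar.py", "x" * 400),
    Snippet("current_file", "app/main.py", "y" * 200),
    Snippet("called_function", "app/db.py", "z" * 120),
    Snippet("test_file", "tests/test_main.py", "t" * 80),
]
chosen, used = pack_context(snippets, budget_tokens=110)
```

Skipping an oversized snippet and continuing, rather than stopping at the first one that does not fit, lets smaller lower-priority items such as a test file still make it in.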

Eval Harness

The eval that actually matters: generate code with the assistant, run the existing test suite, and measure the pass rate. It's simple, but most teams don't do it. Then add a regression test (does today's model score lower than last week's?), a latency benchmark (P50/P99 completion time), and a user acceptance test (do engineers accept or reject the suggestion?). Track acceptance rate in production; it's your most honest signal.

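# Coding assistant eval: generate one completion per case, run its tests, report Pass@1.
# `assistant`, `eval_dataset`, and `run_tests` are supplied by your own harness.
results = []
for test_case in eval_dataset:
    completion = assistant.complete(test_case.prefix, test_case.repo_context)
    full_code = test_case.prefix + completion
    # run_tests is expected to return a bool: True only if the tests pass.
    passed = run_tests(full_code, test_case.tests)
    results.append({"id": test_case.id, "passed": passed, "completion": completion})

# Guard against an empty dataset so the division cannot fail.
pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
print(f"Pass@1: {pass_rate:.1%}")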
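
The loop above covers pass rate only. The regression check and the latency benchmark could be layered on top roughly as follows. This is a sketch under assumptions: the function names, the 2-point `tolerance`, and the shape of the `results` records (dicts with a `"passed"` key, as in the loop above) are illustrative choices, and the percentiles use `statistics.quantiles` from the Python standard library.

```python
import statistics


def pass_rate(results):
    """Fraction of eval cases whose tests passed (0.0 for an empty run)."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0


def latency_percentiles(latencies_ms):
    """Return (P50, P99) completion latency in milliseconds.

    Needs at least two samples; n=100 yields 99 cut points, one per percentile.
    """
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]


def is_regression(current, baseline, tolerance=0.02):
    """True when the current pass rate is more than `tolerance` below the baseline."""
    return pass_rate(current) < pass_rate(baseline) - tolerance


# Hypothetical runs: last week's model passed 8 of 10 cases, today's passes 7.
baseline = [{"passed": i < 8} for i in range(10)]
current = [{"passed": i < 7} for i in range(10)]
regressed = is_regression(current, baseline)

# Hypothetical latency samples of 1..101 ms.
p50, p99 = latency_percentiles(list(range(1, 102)))
```

A small tolerance keeps the gate from firing on run-to-run noise; how wide to make it depends on the size of your eval set.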

Model Selection

For code completion: Claude Sonnet 4 and GPT-4o lead on complex multi-file tasks. For autocomplete (< 50ms latency requirement): small code-specialized models like StarCoder2-3B or Qwen2.5-Coder-1.5B running locally, or Codestral if you have the hardware for a larger model. Don't use a frontier model for single-line completions; the latency kills the UX.
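In practice this split usually ends up as a routing decision per request. A minimal sketch, assuming hypothetical request labels (`"autocomplete"`, `"task"`) and tier names (`"local-small"`, `"frontier"`) that you would map to whatever models you actually deploy:

```python
from dataclasses import dataclass


@dataclass
class CompletionRequest:
    kind: str                     # "autocomplete" or "task" (hypothetical labels)
    files_touched: int = 1        # how many files the change spans
    latency_budget_ms: int = 50   # how long the editor is willing to wait


def choose_tier(req: CompletionRequest) -> str:
    """Pick a model tier: 'local-small' for tight latency budgets, else 'frontier'."""
    # Inline autocomplete with a tight budget always stays on the local model.
    if req.kind == "autocomplete" and req.latency_budget_ms <= 50:
        return "local-small"
    # Multi-file or explicitly requested tasks can afford a slower, stronger model.
    if req.kind == "task" or req.files_touched > 1:
        return "frontier"
    return "local-small"


inline = choose_tier(CompletionRequest(kind="autocomplete"))
refactor = choose_tier(
    CompletionRequest(kind="task", files_touched=4, latency_budget_ms=5000)
)
```

Keeping the rule this explicit makes it easy to log which tier served each suggestion and compare acceptance rates per tier.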

