How I'd Build an AI Coding Assistant
Repository-aware context, diff-aware completion, retrieval over codebases, and the eval harness that actually catches regressions.
The gap between a coding assistant demo and one engineers actually use daily is huge. Here's what separates them: repository awareness, diff-aware context, and evals that catch regressions before users do.
The Context Problem
Naive coding assistants stuff the current file into the context window. Good ones retrieve relevant context from across the repository: related files, function definitions, type signatures, and test files for the function being edited. Retrieval for code differs from text RAG: you want structural relevance (imports, call graphs), not just semantic similarity.
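To make "structural, not just semantic" concrete, here is a minimal sketch that blends import-graph distance with embedding similarity. Everything in it is an assumption for illustration: the function names, the toy graph, the two-dimensional "embeddings", and the 0.6 weight are all hypothetical and would need tuning against a real codebase.

```python
import math
from collections import deque

def hop_distances(graph: dict[str, set[str]], start: str) -> dict[str, int]:
    """Breadth-first search over an undirected view of the import graph."""
    # Importers and importees both count as structurally related.
    neighbors: dict[str, set[str]] = {}
    for src, targets in graph.items():
        for dst in targets:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_files(current, graph, embeddings, structural_weight=0.6):
    """Rank other files by a weighted mix of graph proximity and similarity."""
    dist = hop_distances(graph, current)
    scores = {}
    for path, vec in embeddings.items():
        if path == current:
            continue
        # Unreachable files get no structural credit.
        structural = 1.0 / (1 + dist[path]) if path in dist else 0.0
        semantic = cosine(embeddings[current], vec)
        scores[path] = structural_weight * structural + (1 - structural_weight) * semantic
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: legacy_api.py looks similar to api.py but is not connected to it.
graph = {"api.py": {"models.py"}, "models.py": {"db.py"}, "legacy_api.py": set()}
embeddings = {
    "api.py": [1.0, 0.0],
    "models.py": [0.6, 0.8],
    "db.py": [0.0, 1.0],
    "legacy_api.py": [0.9, 0.1],
}
ranking = rank_files("api.py", graph, embeddings)
print(ranking)
```

With these toy numbers, the directly imported `models.py` outranks the textually similar but unconnected `legacy_api.py`, which is the behavior pure embedding search would miss.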
What to Index
- Symbol index: function/class definitions with their signatures, docstrings, and file location. Use Tree-sitter for language-aware parsing.
- Import graph: which files import which. Critical for understanding dependencies and finding related code.
- Recent git history: files changed together often belong together semantically.
- Test files: always retrieve tests alongside the implementation. This helps prevent generating code that breaks existing tests.
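As a rough illustration of the first two index types, the sketch below extracts symbols and imports from a single file. Note the swap: it uses Python's built-in `ast` module as a stand-in for Tree-sitter so that it runs with no dependencies, which means it only handles Python source. The `index_source` function and the record fields are hypothetical names, not a fixed schema.

```python
import ast

def index_source(path: str, source: str) -> tuple[list[dict], set[str]]:
    """Return (symbol records, imported module names) for one Python file."""
    tree = ast.parse(source)
    symbols: list[dict] = []
    imports: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Simplified signature: plain positional parameters only.
            args = ", ".join(a.arg for a in node.args.args)
            symbols.append({
                "name": node.name,
                "kind": "function",
                "signature": f"{node.name}({args})",
                "docstring": ast.get_docstring(node),
                "file": path,
                "line": node.lineno,
            })
        elif isinstance(node, ast.ClassDef):
            symbols.append({
                "name": node.name,
                "kind": "class",
                "signature": f"class {node.name}",
                "docstring": ast.get_docstring(node),
                "file": path,
                "line": node.lineno,
            })
        elif isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
    # ast.walk is breadth-first, so sort to get source order.
    symbols.sort(key=lambda s: s["line"])
    return symbols, imports

SAMPLE = '''
import os
from billing.models import Invoice

def total(invoice, tax_rate):
    """Return the invoice total including tax."""
    return invoice.subtotal * (1 + tax_rate)

class InvoiceService:
    """Creates and totals invoices."""
    def create(self, customer):
        return Invoice(customer)
'''

symbols, imports = index_source("billing/service.py", SAMPLE)
for s in symbols:
    print(s["file"], s["line"], s["signature"])
print(sorted(imports))
```

Running `index_source` over every file gives the symbol index directly, and the per-file import sets are the edges of the import graph. A Tree-sitter version would follow the same shape but work across languages.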
Diff-Aware Completion
When the user is editing code, the context should reflect the current diff, not just the current file state. Include the git diff of modified files, the surrounding unchanged code, and any type errors or linter warnings in the current state. This lets the model generate completions that fix the error rather than just continue from the last character.
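One possible way to assemble that context is sketched below. It assumes the editor can supply the committed text, the current buffer, and a list of diagnostics; the `build_diff_context` function, the diagnostic dict shape, and the section labels are all hypothetical. The diff is computed with the standard library's `difflib` rather than by shelling out to git.

```python
import difflib

def build_diff_context(path, committed, current, diagnostics, cursor_line, window=3):
    """Combine the uncommitted diff, diagnostics, and code near the cursor."""
    diff = "\n".join(difflib.unified_diff(
        committed.splitlines(),
        current.splitlines(),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",
    ))
    lines = current.splitlines()
    # cursor_line is 1-based; take `window` lines on each side of it.
    lo = max(0, cursor_line - 1 - window)
    hi = min(len(lines), cursor_line + window)
    surrounding = "\n".join(lines[lo:hi])
    diag_text = "\n".join(
        f"{path}:{d['line']}: {d['message']}" for d in diagnostics
    ) or "(none)"
    return (
        f"[Uncommitted diff]\n{diff}\n\n"
        f"[Diagnostics]\n{diag_text}\n\n"
        f"[Code near cursor]\n{surrounding}\n"
    )

committed = "def area(w, h):\n    return w * h\n"
current = "def area(w: int, h: int) -> int:\n    return w * h / 2\n"
diagnostics = [{"line": 2, "message": "returns float, expected int"}]

context = build_diff_context("geometry.py", committed, current, diagnostics, cursor_line=2)
print(context)
```

Because the diagnostic sits next to the diff that caused it, the model sees both what changed and what broke, which is what makes a fix-oriented completion possible.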
The Retrieval Stack
| Context Type | Retrieval Method | Priority |
|---|---|---|
| Current file | Direct (always included) | Highest |
| Called functions | Symbol index lookup | High |
| Imported modules | Import graph traversal | High |
| Semantically similar code | Embedding search over codebase | Medium |
| Recent git changes | `git log --diff-filter=M` | Medium |
| Test files | Symbol index (`test_` prefix) | Medium |
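The priorities in the table matter most when the context window is too small for everything. A minimal sketch of budgeted assembly follows, assuming each retriever has already produced its text. The `ContextItem` type, the priority labels, and the four-characters-per-token estimate are hypothetical placeholders; a real system would use the model's tokenizer.

```python
from dataclasses import dataclass

PRIORITY_ORDER = {"highest": 0, "high": 1, "medium": 2}

@dataclass
class ContextItem:
    source: str    # e.g. "current_file", "called_function"
    priority: str  # "highest", "high", or "medium"
    text: str

def estimate_tokens(text: str) -> int:
    # Crude heuristic; replace with a real tokenizer in practice.
    return max(1, len(text) // 4)

def assemble_context(items: list[ContextItem], budget_tokens: int) -> list[ContextItem]:
    """Greedily pack items in priority order until the budget is spent."""
    chosen, used = [], 0
    # sorted() is stable, so items keep their order within a priority level.
    for item in sorted(items, key=lambda i: PRIORITY_ORDER[i.priority]):
        cost = estimate_tokens(item.text)
        # The current file is always included, even if it exceeds the budget.
        if item.priority == "highest" or used + cost <= budget_tokens:
            chosen.append(item)
            used += cost
    return chosen

items = [
    ContextItem("similar_code", "medium", "x" * 80),
    ContextItem("imported_module", "high", "x" * 400),
    ContextItem("called_function", "high", "x" * 200),
    ContextItem("current_file", "highest", "x" * 400),
]
selected = assemble_context(items, budget_tokens=180)
print([item.source for item in selected])
```

In this toy run the large imported module is skipped because it would blow the budget, while the smaller medium-priority snippet still fits. Whether to skip and continue or to stop at the first miss is a design choice worth testing in your own evals.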
Eval Harness
The eval that actually matters: generate code with the assistant, run the existing test suite, and measure the pass rate. Simple, but most teams don't do it. Then add three more checks: a regression test (does today's model score lower than last week's?), a latency benchmark (P50/P99 completion time), and a user acceptance test (do engineers accept or reject the suggestion?). Track acceptance rate in production; it's your most honest signal.
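```python
# Coding assistant eval: Pass@1 over a dataset of test cases.
# `assistant`, `eval_dataset`, and `run_tests` are assumed to be defined elsewhere.
results = []
for test_case in eval_dataset:
    # One completion per case, which is why the metric below is Pass@1.
    completion = assistant.complete(test_case.prefix, test_case.repo_context)
    full_code = test_case.prefix + completion
    # run_tests is expected to return True only if every test passes.
    passed = run_tests(full_code, test_case.tests)
    results.append({"id": test_case.id, "passed": passed, "completion": completion})

# Guard against an empty dataset to avoid dividing by zero.
pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
print(f"Pass@1: {pass_rate:.1%}")
```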
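The three additional checks (regression, latency, acceptance) can be sketched with the standard library alone. The function names, the one-point regression tolerance, and the event format are assumptions for illustration; pick thresholds that match the noise level of your own eval set.

```python
import statistics

def check_regression(current_pass_rate: float, baseline_pass_rate: float,
                     tolerance: float = 0.01) -> bool:
    """Return True if the score dropped by more than `tolerance` vs. the baseline."""
    return baseline_pass_rate - current_pass_rate > tolerance

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """P50 and P99 completion latency; needs at least two samples."""
    # n=100 yields 99 cut points: index 49 is the median, index 98 is P99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}

def acceptance_rate(events: list[dict]) -> float:
    """Share of shown suggestions that engineers accepted."""
    if not events:
        return 0.0
    return sum(1 for e in events if e["accepted"]) / len(events)

# Hypothetical numbers, purely to exercise the functions.
regressed = check_regression(current_pass_rate=0.62, baseline_pass_rate=0.70)
percentiles = latency_percentiles([float(ms) for ms in range(1, 101)])
rate = acceptance_rate([
    {"accepted": True}, {"accepted": False}, {"accepted": True}, {"accepted": True},
])
print(regressed, percentiles, f"{rate:.0%}")
```

Wiring `check_regression` into CI so that a drop fails the build is what turns the eval from a dashboard into a gate.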
Model Selection
For complex multi-file tasks, frontier models such as Claude Sonnet 4 and GPT-4o are the strongest options. For autocomplete (a latency budget under 50 ms), use small code-specialized models such as Codestral, StarCoder2-3B, or Qwen2.5-Coder-1.5B, ideally running locally. Don't use a frontier model for single-line completions: the latency kills the UX.
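In practice this split usually ends up as a small routing function in front of the models. The sketch below is one hypothetical shape for it: the request kinds and tier labels are made up for illustration, and only the 50 ms figure comes from the text above.

```python
AUTOCOMPLETE_BUDGET_MS = 50  # latency requirement for autocomplete, per the text

def pick_model_tier(request_kind: str, latency_budget_ms: float) -> str:
    """Route latency-sensitive requests to a local model, the rest to a frontier model."""
    # Tier labels are placeholders; map them to concrete models in config.
    if request_kind == "autocomplete" or latency_budget_ms <= AUTOCOMPLETE_BUDGET_MS:
        return "local-small"
    return "frontier"

print(pick_model_tier("autocomplete", 50))
print(pick_model_tier("multi_file_edit", 30_000))
```

Keeping the decision in one place makes it easy to evaluate each tier separately and to change the threshold without touching the callers.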
- Tree-sitter — language parsing
- Continue.dev — open source coding assistant
- SWE-bench — coding eval benchmark