How I'd Build an AI Coding Assistant

Repository-aware context, diff-aware completion, retrieval over codebases, and the eval harness that actually catches regressions.

The gap between a coding assistant demo and one engineers actually use daily is huge. Here's what separates them: repository awareness, diff-aware context, and evals that catch regressions before users do.

The Context Problem

Naive coding assistants stuff the current file into the context window. Good ones retrieve relevant context from across the repository: related files, function definitions, type signatures, test files for the function being edited. The retrieval problem for code is different from text RAG — you want structural relevance (imports, call graphs) not just semantic similarity.

What to Index

Symbol index: function/class definitions with their signatures, docstrings, and file location. Use Tree-sitter for language-aware parsing.
Import graph: what files import what. Critical for understanding dependencies and finding related code.
Recent git history: files changed together often belong together semantically.
Test files: always retrieve tests alongside implementation. Prevents generating code that breaks existing tests.
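
As a rough illustration of the first two index types, here is a minimal sketch that extracts symbols and imports from a single Python file. It uses the standard-library `ast` module as a stand-in for Tree-sitter (which is what you would want for multi-language repositories). The `Symbol` record and `index_source` function are hypothetical names chosen for this example.

```python
import ast
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str
    kind: str        # "function" or "class"
    signature: str
    docstring: str
    path: str
    line: int


def index_source(path: str, source: str) -> tuple[list[Symbol], set[str]]:
    """Return (symbols defined in the file, modules the file imports)."""
    tree = ast.parse(source)
    symbols: list[Symbol] = []
    imports: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Positional parameter names only; enough for a lookup index.
            args = ", ".join(a.arg for a in node.args.args)
            symbols.append(Symbol(node.name, "function", f"{node.name}({args})",
                                  ast.get_docstring(node) or "", path, node.lineno))
        elif isinstance(node, ast.ClassDef):
            symbols.append(Symbol(node.name, "class", node.name,
                                  ast.get_docstring(node) or "", path, node.lineno))
        elif isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
    return symbols, imports


# Hypothetical file contents, used only to exercise the indexer.
example = '''
import os
from billing.tax import rate

def total(price, qty):
    """Return the line total."""
    return price * qty * rate()
'''
symbols, imports = index_source("billing/cart.py", example)
```

Running `index_source` over every file gives you both a symbol table keyed by name and the edges of the import graph. Resolving module names such as `billing.tax` back to file paths is left out here and depends on your project layout.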

Diff-Aware Completion

When the user is editing code, the context should reflect the current diff — not just the current file state. Include: the git diff of modified files, the surrounding unchanged code for context, and any type errors or linter warnings in the current state. This lets the model generate completions that fix the error, not just continue from the last character.
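
A minimal sketch of assembling that context, assuming the editor can hand you the saved and current contents of a file plus a list of diagnostic strings. It uses `difflib` from the standard library to produce a unified diff in place of shelling out to `git diff`; `build_diff_context` and the section headings are hypothetical choices for this example.

```python
import difflib


def build_diff_context(path: str, saved: str, current: str,
                       diagnostics: list[str], context_lines: int = 3) -> str:
    """Assemble a prompt section from the in-progress edit and current diagnostics."""
    diff = "".join(difflib.unified_diff(
        saved.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=f"a/{path}", tofile=f"b/{path}",
        n=context_lines,  # unchanged lines kept around each hunk
    ))
    parts = [f"### Diff for {path}", diff or "(no changes)\n"]
    if diagnostics:
        parts.append("### Diagnostics")
        parts.extend(f"- {d}" for d in diagnostics)
    return "\n".join(parts)


# Hypothetical edit: the user added a parameter but has not used it yet.
saved = "def area(w, h):\n    return w * h\n"
current = "def area(w, h, unit):\n    return w * h\n"
prompt = build_diff_context("geometry.py", saved, current,
                            ["geometry.py:1: unused argument 'unit'"])
```

The `context_lines` parameter controls how much surrounding unchanged code accompanies each hunk, which is the knob to turn when the context window is tight.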

The Retrieval Stack

Context Type	Retrieval Method	Priority
Current file	Direct (always included)	Highest
Called functions	Symbol index lookup	High
Imported modules	Import graph traversal	High
Semantically similar code	Embedding search over codebase	Medium
Recent git changes	git log --name-only --diff-filter=M	Medium
Test files	Symbol index (test_ prefix)	Medium
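
One way to turn the priorities in this table into a prompt is to pack snippets greedily in priority order until a token budget is spent. The sketch below assumes a crude four-characters-per-token estimate rather than a real tokenizer. The `Snippet` type, `pack_context` function, and source labels are hypothetical.

```python
from dataclasses import dataclass

# Lower number means packed earlier.
PRIORITY = {"highest": 0, "high": 1, "medium": 2}


@dataclass
class Snippet:
    source: str      # e.g. "current_file", "symbol_index", "embedding_search"
    priority: str    # "highest", "high", or "medium"
    text: str


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); an assumption, not a tokenizer.
    return max(1, len(text) // 4)


def pack_context(snippets: list[Snippet], budget: int) -> list[Snippet]:
    """Greedily keep snippets in priority order while they fit the token budget."""
    chosen: list[Snippet] = []
    used = 0
    # sorted() is stable, so snippets of equal priority keep their input order.
    for s in sorted(snippets, key=lambda s: PRIORITY[s.priority]):
        cost = estimate_tokens(s.text)
        if used + cost <= budget:
            chosen.append(s)
            used += cost
    return chosen


candidates = [
    Snippet("embedding_search", "medium", "x" * 400),
    Snippet("current_file", "highest", "y" * 200),
    Snippet("symbol_index", "high", "z" * 200),
]
packed = pack_context(candidates, budget=100)
```

With a budget of 100 estimated tokens, the current file and the symbol lookup fit and the embedding-search hit is dropped. A real system would likely truncate large snippets instead of skipping them outright.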

Eval Harness

The eval that actually matters: generate code with the assistant, run the existing test suite, measure pass rate. Simple, but most teams don't do it. Add: a regression test (does today's model score lower than last week's?), a latency benchmark (P50/P99 completion time), and a user acceptance test (do engineers accept or reject the suggestion?). Track acceptance rate in production. It's your most honest signal.
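
The three additional checks can be sketched with a few small helpers. This is illustrative only: the function names are hypothetical, and the 2-point regression tolerance is an assumption you would tune against the noise in your own eval runs.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """P50 and P99 of completion latency, in milliseconds (needs >= 2 samples)."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, index 98 the 99th.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a run whose pass rate fell more than `tolerance` below the baseline."""
    return current < baseline - tolerance


def acceptance_rate(accepted: int, shown: int) -> float:
    """Share of shown suggestions that engineers accepted."""
    return accepted / shown if shown else 0.0


# Synthetic latencies of 1..101 ms, used only to exercise the helper.
stats = latency_percentiles([float(ms) for ms in range(1, 102)])
```

Storing the baseline pass rate alongside the model version makes the regression check a one-line gate in CI.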

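# Coding assistant eval: generate one completion per case, run its tests, report pass@1.
# `assistant`, `eval_dataset`, and `run_tests` are supplied by your own harness.
results = []
for test_case in eval_dataset:
    # Complete from the prefix plus the retrieved repository context.
    completion = assistant.complete(test_case.prefix, test_case.repo_context)
    full_code = test_case.prefix + completion
    # run_tests is expected to return True only if every test for this case passes.
    passed = run_tests(full_code, test_case.tests)
    results.append({"id": test_case.id, "passed": passed, "completion": completion})

# Guard against an empty dataset to avoid a ZeroDivisionError.
pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
print(f"Pass@1: {pass_rate:.1%}")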
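
The loop above leaves `run_tests` undefined. One possible shape for it, assuming `tests` is a string of plain `assert` statements that can be appended to the generated code, is to run both in a fresh interpreter and treat a zero exit code as a pass. This sketch has no sandboxing, so real model output should only be executed inside an isolated container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(full_code: str, tests: str, timeout_s: float = 10.0) -> bool:
    """Run generated code plus its tests in a subprocess; True if it exits with 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(full_code + "\n\n" + tests)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,   # keep test output out of the eval log
                timeout=timeout_s,     # treat hangs and infinite loops as failures
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0


# Hypothetical cases: a correct completion and a buggy one.
ok = run_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")
bad = run_tests("def add(a, b):\n    return a - b", "assert add(2, 3) == 5")
```

If your eval cases carry a full test suite instead of inline assertions, the same structure applies with the subprocess command swapped for your test runner.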

Model Selection

For code completion: at the time of writing, Claude Sonnet 4 and GPT-4o lead on complex multi-file tasks. For autocomplete (< 50ms latency requirement): smaller code-specialized models like Codestral, StarCoder2-3B, or Qwen2.5-Coder-1.5B running locally. Don't use a frontier model for single-line completions; the latency kills the UX.
