GenAI Systems Lab
AI Engineering 11 min read

How I'd Build a Code Review Bot: AST Parsing, Diff Retrieval, and False Positive Tuning

Building an AI code reviewer that developers actually trust. AST-aware chunking, diff-scoped retrieval, explanation generation, calibrating false positive rate, and the feedback loop that makes it smarter over time.

A code review bot that developers ignore is worse than no bot — it trains people to click 'dismiss' without reading. The bar is not 'leaves a comment.' The bar is 'leaves a comment I would have left.' Here's the architecture that gets there.

The Core Problem With Naive Implementations

Most teams start here: take the entire diff, stuff it into the context window, ask GPT-4o to review it. This works at demo scale. In production, diffs can exceed context limits, the model loses track of cross-file dependencies, and it reviews things that don't need reviewing (whitespace, auto-generated files, test fixtures). The noise-to-signal ratio destroys trust in 2-3 weeks.

AST-Aware Diff Parsing

Don't pass the raw diff. Parse it. Use tree-sitter or language-specific AST parsers to identify which functions, classes, and methods changed. This gives you the unit of review (a function) instead of a sea of unified diff output. For each changed function: extract the function body before and after, its callers, and its callees.
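Here is a minimal sketch of that step for Python sources. It uses the standard-library `ast` module as a stand-in for tree-sitter, and assumes a single-file unified diff plus the post-change file contents. The names `added_line_numbers` and `changed_functions` are hypothetical helpers for illustration, not part of any library.

```python
import ast
import re

# Matches a unified-diff hunk header and captures the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_line_numbers(diff_text: str) -> set[int]:
    """Collect new-file line numbers of added lines in a single-file unified diff."""
    added: set[int] = set()
    new_lineno = None  # stays None until the first hunk header is seen
    for line in diff_text.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            new_lineno = int(header.group(1))
        elif new_lineno is None:
            continue  # file headers before the first hunk
        elif line.startswith("+"):
            added.add(new_lineno)
            new_lineno += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removed lines and "\ No newline" markers do not advance
        else:
            new_lineno += 1  # context line
    return added


def changed_functions(source: str, changed_lines: set[int]) -> list[str]:
    """Return dotted names of functions/methods whose span contains a changed line."""
    hits: list[str] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                span = range(child.lineno, child.end_lineno + 1)
                if any(n in changed_lines for n in span):
                    hits.append(prefix + child.name)
                visit(child, prefix + child.name + ".")
            elif isinstance(child, ast.ClassDef):
                visit(child, prefix + child.name + ".")
            else:
                visit(child, prefix)

    visit(ast.parse(source), "")
    return hits


if __name__ == "__main__":
    new_source = "\n".join([
        "def a():", "    return 1", "",
        "class C:", "    def m(self):", "        return 2", "",
        "def b():", "    return 3",
    ])
    diff = "\n".join([
        "--- a/mod.py", "+++ b/mod.py", "@@ -4,3 +4,3 @@",
        " class C:", "     def m(self):", "-        return 1", "+        return 2",
    ])
    print(changed_functions(new_source, added_line_numbers(diff)))  # ['C.m']
```

Two limits of this sketch: a hunk that only deletes lines produces no added line numbers, so a real version would also need to map removed lines onto the pre-change file, and a change inside a nested function reports both the inner and the outer function. Callers and callees would need a second pass over `ast.Call` nodes or a proper call graph.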

Rule: only review changed functions and their direct callers. Exclude changed test files from this pass (review them separately with a different prompt). Skip auto-generated files (detect them via path pattern). This alone cuts the false-positive rate by 60%.
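The path-based part of this rule could look roughly like the sketch below. The glob patterns are illustrative assumptions, not a canonical list, so tune them to your repository. `classify_path` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Assumed patterns for illustration; every repo has its own conventions.
GENERATED_PATTERNS = ["*_pb2.py", "*.min.js", "*/generated/*", "*.lock"]
TEST_PATTERNS = ["tests/*", "*/tests/*", "test_*.py", "*/test_*.py", "*_test.go"]


def classify_path(path: str) -> str:
    """Route a changed file: 'skip' (generated), 'test' (separate prompt), or 'review'."""
    if any(fnmatchcase(path, pattern) for pattern in GENERATED_PATTERNS):
        return "skip"
    if any(fnmatchcase(path, pattern) for pattern in TEST_PATTERNS):
        return "test"
    return "review"


if __name__ == "__main__":
    for changed in ["api/user_pb2.py", "tests/test_billing.py", "src/billing.py"]:
        print(changed, "->", classify_path(changed))
```

Generated patterns are checked first so that a generated file living under a test directory is still skipped rather than sent to the test prompt.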

Repository-Scoped Retrieval

A code review bot without codebase context misses the most important class of bugs: 'this function is being called in 3 other places with assumptions about its return type that your change just broke.' Build a code index: embed the codebase at function/class granularity, not file granularity. At review time, retrieve the 5 functions most similar to each changed function. These are the likely impacted callsites.
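A minimal sketch of the retrieval step, assuming the function-level embeddings already exist as plain lists of floats keyed by function name. The embedding model and storage are out of scope here, and the toy vectors and the `top_k_similar` helper are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    changed_name: str, index: dict[str, list[float]], k: int = 5
) -> list[str]:
    """Return the k function names most similar to changed_name, excluding itself."""
    query = index[changed_name]
    scored = [
        (cosine(query, vector), name)
        for name, vector in index.items()
        if name != changed_name
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real embeddings.
    toy_index = {
        "parse_invoice": [1.0, 0.0, 0.0],
        "parse_receipt": [0.9, 0.1, 0.0],
        "send_email": [0.0, 1.0, 0.0],
        "render_pdf": [0.0, 0.2, 1.0],
    }
    print(top_k_similar("parse_invoice", toy_index, k=1))  # ['parse_receipt']
```

A linear scan like this is fine for a small repository. At larger scale you would likely swap in an approximate nearest-neighbor index, and it may be worth merging these results with callers found via the AST, since embedding similarity is only a proxy for "calls this function."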

The Review Prompt Architecture

Three-part prompt that outperforms a single monolithic prompt:

False Positive Tuning

Track every comment the bot leaves and how developers react to it: dismissed, acted on, or disputed. Build a classifier that predicts 'will be acted on' from comment features (comment type, code context, developer seniority). Use it to gate comments before posting: only post a comment when its predicted probability of being acted on is above 60%. This is the feedback loop that makes the bot smarter over time without retraining the underlying model.
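To make the gating idea concrete, here is a deliberately simplified sketch. Instead of a trained classifier over several features, it estimates the acted-on rate per comment type from logged reactions, with a smoothing prior. The class name, reaction labels, and prior values are assumptions for illustration. A real version would use a proper model over the features listed above.

```python
from collections import defaultdict

REACTIONS = {"acted_on", "dismissed", "disputed"}


class ActedOnEstimator:
    """Smoothed per-comment-type acted-on rate; a stand-in for a real classifier."""

    def __init__(self, prior_rate: float = 0.5, prior_weight: float = 2.0) -> None:
        self.prior_rate = prior_rate
        self.prior_weight = prior_weight
        self.acted: dict[str, int] = defaultdict(int)
        self.total: dict[str, int] = defaultdict(int)

    def record(self, comment_type: str, reaction: str) -> None:
        """Log one developer reaction to a posted comment."""
        if reaction not in REACTIONS:
            raise ValueError(f"unknown reaction: {reaction}")
        self.total[comment_type] += 1
        if reaction == "acted_on":
            self.acted[comment_type] += 1

    def predict(self, comment_type: str) -> float:
        """Estimated probability that a comment of this type gets acted on."""
        acted = self.acted.get(comment_type, 0)
        total = self.total.get(comment_type, 0)
        return (acted + self.prior_rate * self.prior_weight) / (
            total + self.prior_weight
        )

    def should_post(self, comment_type: str, threshold: float = 0.6) -> bool:
        """Gate: post only when the estimate is strictly above the threshold."""
        return self.predict(comment_type) > threshold


if __name__ == "__main__":
    estimator = ActedOnEstimator()
    for _ in range(8):
        estimator.record("null_check", "acted_on")
    for _ in range(2):
        estimator.record("null_check", "dismissed")
    print(estimator.predict("null_check"))      # 0.75
    print(estimator.should_post("null_check"))  # True
```

Note the cold-start behavior: an unseen comment type scores the prior (0.5 here) and is therefore not posted at a 0.6 threshold. Whether new comment types should start above or below the gate is a design choice you would need to make explicitly.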

Never post more than 3 comments on a PR automatically. Developer attention is finite. A bot that leaves 15 comments per PR will be ignored after day 5. Rank comments by predicted impact and post the top 3.
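The cap itself is a few lines once each candidate carries a score. This sketch assumes a `predicted_impact` value between 0 and 1 has already been computed upstream. The `Candidate` type and `select_comments` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    body: str
    predicted_impact: float  # assumed to come from an upstream scoring step


def select_comments(
    candidates: list[Candidate], max_comments: int = 3
) -> list[Candidate]:
    """Rank candidates by predicted impact and keep at most max_comments."""
    ranked = sorted(candidates, key=lambda c: c.predicted_impact, reverse=True)
    return ranked[:max_comments]


if __name__ == "__main__":
    pending = [
        Candidate("Possible None dereference in load_user", 0.91),
        Candidate("Return type changed; callers expect a list", 0.84),
        Candidate("Consider renaming tmp", 0.22),
        Candidate("Unclosed file handle on error path", 0.77),
        Candidate("Docstring is out of date", 0.35),
    ]
    for comment in select_comments(pending):
        print(f"{comment.predicted_impact:.2f}  {comment.body}")
```

Because `sorted` is stable, candidates with equal scores keep their original order, which makes the selection deterministic across runs.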

Measuring Success
