How I'd Build a Code Review Bot: AST Parsing, Diff Retrieval, and False Positive Tuning
Building an AI code reviewer that developers actually trust. AST-aware chunking, diff-scoped retrieval, explanation generation, calibrating false positive rate, and the feedback loop that makes it smarter over time.
A code review bot that developers ignore is worse than no bot — it trains people to click 'dismiss' without reading. The bar is not 'leaves a comment.' The bar is 'leaves a comment I would have left.' Here's the architecture that gets there.
The Core Problem With Naive Implementations
Most teams start here: take the entire diff, stuff it into the context window, ask GPT-4o to review it. This works at demo scale. In production, diffs can exceed context limits, the model loses track of cross-file dependencies, and it reviews things that don't need reviewing (whitespace, auto-generated files, test fixtures). The noise-to-signal ratio destroys trust in 2-3 weeks.
AST-Aware Diff Parsing
Don't pass the raw diff. Parse it. Use tree-sitter or language-specific AST parsers to identify which functions, classes, and methods changed. This gives you the unit of review (a function) instead of a sea of unified diff output. For each changed function: extract the function body before and after, its callers, and its callees.
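As a rough illustration of mapping changed lines to review units, here is a minimal sketch that uses Python's built-in `ast` module as the language-specific parser (tree-sitter would play the same role for other languages). The function name `changed_functions`, the returned dictionary shape, and the `SAMPLE` source are assumptions made for this example. Callee detection is limited to plain-name calls such as `parse(...)`, and caller lookup across the repository is left out.

```python
import ast

def changed_functions(source: str, changed_lines: set[int]) -> list[dict]:
    """Return every function whose line span overlaps a changed line."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # lineno/end_lineno are 1-indexed and inclusive (Python 3.8+).
        if not any(node.lineno <= line <= node.end_lineno for line in changed_lines):
            continue
        # Collect direct callees that are called by plain name, e.g. parse(x).
        callees = sorted({
            call.func.id
            for call in ast.walk(node)
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
        })
        hits.append({
            "name": node.name,
            "start": node.lineno,
            "end": node.end_lineno,
            "body": ast.get_source_segment(source, node),
            "callees": callees,
        })
    return sorted(hits, key=lambda h: h["start"])

# Hypothetical file contents; line 5 is the line touched by the diff.
SAMPLE = '''def parse(raw):
    return raw.strip()

def load(path):
    data = open(path).read()
    return parse(data)
'''

result = changed_functions(SAMPLE, {5})
for fn in result:
    print(fn["name"], fn["start"], fn["end"], fn["callees"])
```

Running the same function on the pre-change and post-change versions of a file gives the before and after bodies for each changed function.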
Rule: only review changed functions and their direct callers. Route changed test files to a separate review with a different prompt. Skip auto-generated files (detect them via path pattern). This alone cuts the false-positive rate by 60%.
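The routing rule can be sketched with glob patterns. The pattern lists and the `route_file` name below are illustrative assumptions, not a complete catalogue; a real repository would need its own list of generated-file and test-file paths.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; adjust to the repository's conventions.
GENERATED_PATTERNS = ["*_pb2.py", "*.pb.go", "*.min.js", "*/generated/*", "*.lock"]
TEST_PATTERNS = ["tests/*", "*/tests/*", "test_*.py", "*/test_*.py", "*_test.go"]

def route_file(path: str) -> str:
    """Decide which review pipeline, if any, a changed file goes to."""
    # Generated files are checked first so a generated test file is still skipped.
    if any(fnmatchcase(path, pattern) for pattern in GENERATED_PATTERNS):
        return "skip"
    if any(fnmatchcase(path, pattern) for pattern in TEST_PATTERNS):
        return "test_review"
    return "main_review"

for changed in ["src/app.py", "api/user_pb2.py", "tests/test_app.py", "pkg/handler_test.go"]:
    print(changed, "->", route_file(changed))
```

Note that `*` in `fnmatch` patterns also matches path separators, so these patterns are deliberately broad.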
Repository-Scoped Retrieval
A code review bot without codebase context misses the most important class of bugs: 'this function is called in 3 other places with assumptions about its return type that your change just broke.' Build a code index: embed every function and class individually (function-level granularity, not file-level). At review time, retrieve the top 5 functions most similar to each changed function. Together with the direct callers from the AST pass, these are the likely impacted call sites.
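To make the index-and-retrieve step concrete, here is a small sketch. The `embed` function is a toy stand-in (a bag of identifier tokens) so the example runs without dependencies; a real system would call a code embedding model and a vector store instead. `FunctionIndex`, `top_k`, and the sample functions are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter

def embed(code: str) -> Counter:
    """Toy embedding: counts of identifier tokens. Stand-in for a real model."""
    return Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class FunctionIndex:
    """One entry per function, not per file."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    def add(self, name: str, code: str) -> None:
        self.entries.append((name, embed(code)))

    def top_k(self, code: str, k: int = 5) -> list[tuple[str, float]]:
        query = embed(code)
        scored = [(cosine(query, vector), name) for name, vector in self.entries]
        # Highest similarity first; ties broken by name for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(name, score) for score, name in scored[:k]]

index = FunctionIndex()
index.add("billing.charge", "def charge(user, amount): return gateway.pay(user.id, amount)")
index.add("billing.refund", "def refund(user, amount): return gateway.refund(user.id, amount)")
index.add("utils.slugify", "def slugify(text): return text.lower().replace(' ', '-')")

changed = "def charge(user, amount, currency): return gateway.pay(user.id, amount, currency)"
top = index.top_k(changed, k=2)
print([name for name, _ in top])
```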
The Review Prompt Architecture
A three-part prompt outperforms a single monolithic prompt:
- Part 1 — Diff context: changed function before/after, language, framework
- Part 2 — Codebase context: 3-5 retrieved similar/calling functions with their signatures
- Part 3 — Review criteria: ordered by priority (correctness > security > performance > style). Critically: include a 'not worth commenting on' list (variable naming, whitespace, test coverage when not changed)
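One possible way to assemble the three parts is sketched below. The `ReviewInput` container, the section headings, and the sample Django snippet are assumptions for illustration. The priority order and the skip list come from the criteria above, and the context cap of 5 follows the 3-5 range.

```python
from dataclasses import dataclass, field

# Priority order and skip list mirror the review criteria described above.
REVIEW_PRIORITIES = ["correctness", "security", "performance", "style"]
NOT_WORTH_COMMENTING = ["variable naming", "whitespace", "test coverage when not changed"]

@dataclass
class ReviewInput:  # hypothetical container for one changed function
    language: str
    framework: str
    before: str
    after: str
    context_signatures: list[str] = field(default_factory=list)

def build_prompt(item: ReviewInput, max_context: int = 5) -> str:
    diff_part = (
        "## Part 1: Diff context\n"
        f"Language: {item.language}\nFramework: {item.framework}\n"
        f"Before:\n{item.before}\n\nAfter:\n{item.after}"
    )
    # Cap the retrieved context so the prompt stays focused.
    context_lines = "\n".join(f"- {sig}" for sig in item.context_signatures[:max_context])
    context_part = "## Part 2: Codebase context\n" + (context_lines or "- (none retrieved)")
    criteria_part = (
        "## Part 3: Review criteria\n"
        "Priority: " + " > ".join(REVIEW_PRIORITIES) + "\n"
        "Do not comment on: " + ", ".join(NOT_WORTH_COMMENTING)
    )
    return "\n\n".join([diff_part, context_part, criteria_part])

example = ReviewInput(
    language="Python",
    framework="Django",
    before="def total(items):\n    return sum(i.price for i in items)",
    after="def total(items):\n    return sum(i.price for i in items) or None",
    context_signatures=[
        "def render_invoice(order) -> str",
        "def apply_discount(total: float) -> float",
    ],
)
prompt = build_prompt(example)
print(prompt)
```

Keeping the parts as separate strings makes it easy to swap the criteria section for a different one when reviewing test files.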
False Positive Tuning
Track every comment the bot leaves and the developer's reaction to it: dismissed, acted on, or disputed. Build a classifier that predicts 'will be acted on' from comment features (comment type, code context, developer seniority). Use it to gate comments before posting: only post a comment when the predicted probability of it being acted on is above 60%. This is the feedback loop that makes the bot smarter over time without retraining the underlying model.
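The gating loop can be sketched as follows. To keep the example dependency-free, it swaps the trained classifier for a much simpler stand-in: a smoothed historical action rate per comment type. The class name `ActionRateGate`, the reaction labels, and the prior settings are assumptions. A real version would use the richer features listed above (code context, developer seniority).

```python
from collections import defaultdict

class ActionRateGate:
    """Stand-in for the classifier: smoothed per-type action rate."""

    def __init__(self, threshold: float = 0.6, prior_rate: float = 0.5,
                 prior_weight: float = 2.0) -> None:
        self.threshold = threshold
        self.prior_rate = prior_rate      # assumed rate before any data exists
        self.prior_weight = prior_weight  # how many pseudo-observations the prior counts for
        self.counts = defaultdict(lambda: [0, 0])  # type -> [acted_on, total]

    def record(self, comment_type: str, reaction: str) -> None:
        """Log one developer reaction: 'acted_on', 'dismissed', or 'disputed'."""
        stats = self.counts[comment_type]
        if reaction == "acted_on":
            stats[0] += 1
        stats[1] += 1

    def predict(self, comment_type: str) -> float:
        acted, total = self.counts.get(comment_type, (0, 0))
        return (acted + self.prior_rate * self.prior_weight) / (total + self.prior_weight)

    def should_post(self, comment_type: str) -> bool:
        return self.predict(comment_type) > self.threshold

gate = ActionRateGate()
for reaction in ["acted_on", "acted_on", "dismissed", "acted_on", "acted_on"]:
    gate.record("null_check", reaction)
for reaction in ["dismissed", "dismissed", "acted_on", "dismissed", "disputed"]:
    gate.record("style", reaction)

for kind in ["null_check", "style", "never_seen"]:
    print(kind, round(gate.predict(kind), 3), gate.should_post(kind))
```

With these defaults, a comment type with no history scores 0.5 and is not posted, so a cold-start policy (for example, a higher prior or a trial period) is a separate decision.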
Never post more than 3 comments on a PR automatically. Developer attention is finite. A bot that leaves 15 comments per PR will be ignored after day 5. Rank comments by predicted impact and post the top 3.
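The cap itself is a few lines of code. In this sketch, `Candidate` and its `predicted_impact` field are hypothetical; the score would come from whatever ranking signal the bot already computes.

```python
from dataclasses import dataclass

@dataclass
class Candidate:  # hypothetical shape of a comment waiting to be posted
    path: str
    line: int
    body: str
    predicted_impact: float

def select_comments(candidates: list[Candidate], max_comments: int = 3) -> list[Candidate]:
    """Keep only the highest-impact comments, at most max_comments per PR."""
    ranked = sorted(candidates, key=lambda c: c.predicted_impact, reverse=True)
    return ranked[:max_comments]

candidates = [
    Candidate("api/views.py", 42, "Possible None dereference", 0.91),
    Candidate("api/views.py", 77, "Unbounded query in loop", 0.74),
    Candidate("core/auth.py", 12, "Token compared with ==", 0.88),
    Candidate("core/util.py", 5, "Could use a comprehension", 0.21),
    Candidate("core/util.py", 30, "Redundant cast", 0.35),
]
posted = select_comments(candidates)
print([c.body for c in posted])
```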
Measuring Success
- Comment action rate: % of bot comments that lead the developer to change code in response (target: >40%)
- Bugs caught pre-merge: requires manual labeling of post-merge bugs (track with issue tags)
- Review time delta: does the bot help humans review faster or generate more back-and-forth?
- Developer NPS on bot: quarterly survey. If score is below 30, the bot is hurting you.
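Two of these metrics reduce to simple arithmetic. The sketch below assumes reactions are logged as strings and that the survey uses the standard 0-10 NPS scale (promoters score 9-10, detractors 0-6). The function names and sample data are made up for illustration.

```python
def comment_action_rate(reactions: list[str]) -> float:
    """Share of bot comments that led to a code change."""
    if not reactions:
        return 0.0
    return sum(r == "acted_on" for r in reactions) / len(reactions)

def nps(scores: list[int]) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        return 0.0
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100.0 * (promoters - detractors) / len(scores)

# Hypothetical data for one quarter.
reactions = ["acted_on", "dismissed", "acted_on", "disputed", "dismissed"]
survey = [10, 9, 8, 7, 6, 3, 10, 9]

rate = comment_action_rate(reactions)
score = nps(survey)
print(f"action rate: {rate:.0%}, NPS: {score:.0f}")
```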