Review-Bot Quality Comparison
Codex · Copilot · CodeRabbit · Qodo — benchmarked on a real-world Scala + Angular codebase
About the codebase under review
The bots were benchmarked on a real, actively-developed media-transcription web app — it transcribes audio/video, generates AI summaries, chapters and cheatsheets, and offers chat over the resulting transcripts. It is a non-trivial full-stack codebase with real concurrency, payment/quota, and third-party-API surface, which is why the four reviewers' strengths diverge so sharply by area.
Backend (.scala / .sql / .conf)
- Scala 3 on the Play Framework, with Akka actors & streaming for background jobs and live updates
- PostgreSQL via a hand-rolled DAO layer + Liquibase migrations
- Audio pipelines (ffmpeg transcode/segment), S3 storage, PDF text extraction
- OpenAI integrations — transcription (Whisper), LLM chat/summaries, image generation
- JWT auth, usage quotas / cost circuit-breakers; deployed on a Heroku-style PaaS
Bugs here are mostly correctness, concurrency, reliability and security issues: races, resource leaks, auth/quota bypass, SQL edge cases.
Frontend (.ts / .html / .scss)
- Angular + TypeScript single-page app
- Angular Material components; SCSS design tokens & theming
- RxJS-driven state, component lifecycle & routing
- Cypress end-to-end + unit/spec tests
- Large in-flight UI redesign (hence many docs/plan PRs in this window)
Issues here skew toward component state/lifecycle, browser compatibility, accessibility and UI consistency.
Leaderboard
Composite (0–100, relative to the leader) = 35% precision (high-value rate) · 25% acceptance · 20% low-noise · 20% low-overhead, each normalized to the best bot. Transparent weights — see methodology. Ranking favors real-bug ROI over raw volume.
The numbers
Inline findings (volume)
Outcome — maintainer's own verdict
Legend: accepted/fixed · rejected · discussed · no response
Value distribution (1=noise · 5=important real bug)
Real catches vs noise (absolute)
Expertise zone (findings by area)
Redundancy — solo vs shared locations
Legend: solo (unique) · shared with another bot
Summary/walkthrough overhead (issue comments — token cost)
bars = number of auto-summary comments
Location coverage: how many bots flag each spot
Per-bot verdict
Bottom line & recommendation
Methodology
The core design choice: score the outcome, not the packaging
This report measures whether each comment turned out to be a real, useful issue — judged primarily by the maintainer's own reply in the thread (did they fix it, agree, or dismiss it). It deliberately does not score a comment for looking like a good review — i.e. for containing words like "bug"/"security", code blocks, or a self-assigned severity badge. That distinction matters because some bots stamp every comment "Action required / 🐞 Bug" regardless of substance: a form/keyword scorer rewards that labeling, while an outcome scorer catches that most of it was noise the maintainer ignored. This is the main reason rankings here can differ from a fast keyword-heuristic scorer — see the table at the end.
Pipeline (per bot, identical for all four):
- Pull every review comment + reply thread from the repository via the GitHub API (inline review comments and issue-level summary comments kept separately).
- Reconstruct each thread: bot's finding → all replies, tagging the maintainer's replies.
- An LLM reads every thread and applies one identical rubric, classifying outcome, category, zone, and a 1–5 value score.
- Aggregate deterministically (counts, rates, overlap) and render.
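The thread-reconstruction step can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: the function name `reconstruct_threads`, the flattened comment shape (`id`, `in_reply_to_id`, `author`, `body`), and the example logins are assumptions. It also assumes each reply's `in_reply_to_id` points at the root comment of its thread.

```python
from collections import defaultdict


def reconstruct_threads(comments, bot_logins, maintainer_login):
    """Group inline review comments into bot-finding threads.

    `comments` is a list of dicts in a simplified, hypothetical shape:
    {"id", "in_reply_to_id", "author", "body"}.
    """
    # Index replies by the root comment they answer.
    replies = defaultdict(list)
    for c in comments:
        parent = c.get("in_reply_to_id")
        if parent is not None:
            replies[parent].append(c)

    threads = []
    for c in comments:
        # A thread starts with a top-level comment written by a bot.
        if c.get("in_reply_to_id") is None and c["author"] in bot_logins:
            threads.append({
                "bot": c["author"],
                "finding": c["body"],
                "replies": [
                    {
                        "author": r["author"],
                        "body": r["body"],
                        # Tag maintainer replies: they are the ground-truth signal.
                        "is_maintainer": r["author"] == maintainer_login,
                    }
                    for r in replies[c["id"]]
                ],
            })
    return threads
```

Each resulting thread is what the rubric step would then read; keeping the maintainer tag on every reply lets the later aggregation stay purely mechanical.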
Definitions used in the tables/charts:
- Scope: the 90 most recently merged PRs; bot comment totals = inline review comments + issue-level summaries.
- Ground truth = maintainer replies: "good catch / fixed / will fix" → accepted; "not a bug / already handled / by design / stale" → rejected; mixed → discussed; none → no-response. GitHub reactions were all zero in this repo, so replies are the signal.
- Value 1–5 (LLM, anchored to the maintainer reply + technical substance): 5 = important real bug/security/reliability with correct reasoning · 3 = legit minor improvement · 1 = noise, wrong, or trivial nitpick.
- Real catches = findings rated ≥4. Noise = findings rated ≤2. Precision = real-catches ÷ findings.
- Acceptance rate = accepted ÷ (accepted + rejected), excluding no-response threads.
- Zone = file path: backend (.scala/.sql/.conf/.xml), frontend (.ts/.html/.scss/.js), other (.md docs/plans).
- Overhead = bot-authored issue-level comments (PR summaries, walkthroughs, compliance reports). Useful context, but a real token/notification cost and not a review finding — so counted separately, never folded into quality.
- Redundancy = share of a bot's flagged (PR, file) locations also flagged by another bot — a proxy for how much unique signal it adds.
- Composite (0–100, relative to leader): each sub-metric normalized to the best bot, then 35% precision + 25% acceptance + 20% low-noise (1−noise rate) + 20% low-overhead. Weights are transparent and editable; the ranking is robust to reasonable reweighting because Codex leads on most axes simultaneously.
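The definitions above can be turned into a small aggregation sketch. It is illustrative only: the names (`Finding`, `rates`, `redundancy`, `composite`) and the example bots are hypothetical. The raw low-overhead score `1 / (1 + overhead_comments)` is an assumption, since the report does not state how overhead is converted into a "higher is better" number before normalization. The sketch also assumes the composite is not rescaled again after weighting.

```python
from dataclasses import dataclass

# Weights taken from the composite definition above.
WEIGHTS = {"precision": 0.35, "acceptance": 0.25, "low_noise": 0.20, "low_overhead": 0.20}


@dataclass
class Finding:
    value: int    # 1-5 rubric score
    outcome: str  # "accepted", "rejected", "discussed" or "no_response"


def rates(findings, overhead_comments):
    """Per-bot sub-metrics, each in [0, 1] with higher meaning better."""
    n = len(findings)
    real = sum(f.value >= 4 for f in findings)    # real catches
    noise = sum(f.value <= 2 for f in findings)   # noise
    accepted = sum(f.outcome == "accepted" for f in findings)
    rejected = sum(f.outcome == "rejected" for f in findings)
    decided = accepted + rejected                 # no-response threads excluded
    return {
        "precision": real / n if n else 0.0,
        "acceptance": accepted / decided if decided else 0.0,
        "low_noise": 1 - noise / n if n else 0.0,
        # ASSUMPTION: this mapping from overhead count to a score is illustrative.
        "low_overhead": 1 / (1 + overhead_comments),
    }


def redundancy(locations_by_bot, bot):
    """Share of a bot's (PR, file) locations also flagged by another bot."""
    mine = locations_by_bot[bot]
    others = set().union(*(locs for b, locs in locations_by_bot.items() if b != bot))
    return len(mine & others) / len(mine) if mine else 0.0


def composite(per_bot):
    """Normalize each sub-metric to the best bot, then apply the weights (0-100)."""
    best = {k: max(m[k] for m in per_bot.values()) for k in WEIGHTS}
    return {
        bot: round(100 * sum(w * (m[k] / best[k] if best[k] else 0.0)
                             for k, w in WEIGHTS.items()), 1)
        for bot, m in per_bot.items()
    }
```

Under these assumptions a bot that leads every axis scores exactly 100, and every other bot's score reads as a weighted fraction of the leader's performance. Changing `WEIGHTS` is enough to test how sensitive a ranking is to reweighting.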
Caveats (read these):
- Value scoring is LLM judgment, not exact measurement: treat scores as estimates rather than ground truth to two decimals.
- "No response" is ambiguous: could mean "too minor to bother replying" or "maintainer missed it." It is excluded from acceptance rate for that reason.
- Maintainer acceptance reflects what this maintainer valued; another team's priorities could shift the weights.
- Bots reviewed overlapping but not byte-identical commit ranges; docs-heavy redesign PRs in this window inflate docs/plan comments for all bots.
Outcome-based (this report) vs keyword-heuristic scoring
| Aspect | This report (outcome) | Keyword/heuristic scorer |
| --- | --- | --- |
| What "quality" means | Did the maintainer accept it & is it technically real | Does the text contain bug/security keywords, code blocks, severity labels |
| Reads maintainer replies | Yes — primary signal | No |
| Gameable by self-labeling | No — labels ignored | Yes — "Action required / 🐞 Bug" inflates score |
| Counts summary/walkthrough cost | Yes (overhead column) | Usually excluded |
| Cost / reproducibility | LLM judgment — costlier to re-run | Deterministic, cheap, scriptable |
Both are valid for different goals. A keyword scorer is a great cheap recurring dashboard; this outcome-based read is better for a keep/drop decision, because it can tell a labeled bug from a real one.
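To make the "gameable by self-labeling" row concrete, here is a toy keyword scorer. It is a hypothetical sketch and not any real tool's algorithm: the keyword list, the one-point-per-signal scoring, and the two example comments are all invented for illustration.

```python
import re

# Hypothetical signal list; a real heuristic scorer would have its own.
KEYWORDS = ("bug", "security", "vulnerability", "race condition")
CODE_FENCE = "`" * 3  # marker for a fenced code block in a comment body


def keyword_score(body):
    """Score a comment by its form only: keywords, code blocks, severity labels."""
    text = body.lower()
    score = sum(1 for k in KEYWORDS if k in text)
    if CODE_FENCE in body:
        score += 1
    if re.search(r"action required|severity", text):
        score += 1
    return score


# A trivial nitpick stamped with a severity badge...
labeled = "Action required / 🐞 Bug: consider renaming this variable."
# ...versus an unlabeled comment describing a real concurrency problem.
unlabeled = ("The quota check reads the counter before the update commits, "
             "so two concurrent requests can both pass.")

print(keyword_score(labeled), keyword_score(unlabeled))
```

With these inputs the badge-stamped nitpick outscores the substantive comment, which is the failure mode an outcome-based read is meant to avoid. The same function is still cheap and deterministic enough to run on every PR as a dashboard signal.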
Generated from GitHub PR review data · anonymized for public sharing