Review-Bot Quality Comparison

Codex · Copilot · CodeRabbit · Qodo — benchmarked on a real-world Scala + Angular codebase

About the codebase under review

The bots were benchmarked on a real, actively developed media-transcription web app: it transcribes audio and video, generates AI summaries, chapters and cheatsheets, and offers chat over the resulting transcripts. It is a non-trivial full-stack codebase with real concurrency, payment/quota logic and a third-party-API surface, which is why the four reviewers' strengths diverge so sharply by area.

Backend (.scala / .sql / .conf)

Scala 3 on the Play Framework, with Akka actors & streaming for background jobs and live updates
PostgreSQL via a hand-rolled DAO layer + Liquibase migrations
Audio pipelines (ffmpeg transcode/segment), S3 storage, PDF text extraction
OpenAI integrations — transcription (Whisper), LLM chat/summaries, image generation
JWT auth, usage quotas / cost circuit-breakers; deployed on a Heroku-style PaaS

Bugs here are mostly about correctness, concurrency, reliability and security: races, resource leaks, auth/quota bypass, SQL edge cases.

Frontend (.ts / .html / .scss)

Angular + TypeScript single-page app
Angular Material components; SCSS design tokens & theming
RxJS-driven state, component lifecycle & routing
Cypress end-to-end + unit/spec tests
Large in-flight UI redesign (hence many docs/plan PRs in this window)

Issues here skew toward component state/lifecycle, browser compatibility, accessibility and UI consistency.

Leaderboard

Composite (0–100, relative to the leader) = 35% precision (high-value rate) + 25% acceptance + 20% low-noise + 20% low-overhead, each normalized to the best bot. The weights are transparent; see Methodology. The ranking favors real-bug ROI over raw volume.
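The weighting can be sketched in a few lines of Python. This is a minimal illustration, not the report's actual script: the function name, the input layout and the two example bots are hypothetical, and the numbers are made up rather than taken from the benchmark. It assumes every sub-metric has already been oriented so that higher is better (low-noise as 1 − noise rate, per the methodology; how overhead is turned into a higher-is-better value is not specified in the report, so it is passed in as-is).

```python
# Weights as stated in the report.
WEIGHTS = {
    "precision": 0.35,
    "acceptance": 0.25,
    "low_noise": 0.20,
    "low_overhead": 0.20,
}

def composite_scores(metrics):
    """Return {bot: composite 0-100}.

    metrics: {bot: {sub_metric: value}}, every value oriented so that
    higher is better (assumption of this sketch).
    """
    # Best value per sub-metric across all bots, used for normalization.
    best = {k: max(m[k] for m in metrics.values()) for k in WEIGHTS}
    scores = {}
    for bot, m in metrics.items():
        total = sum(
            w * (m[k] / best[k] if best[k] > 0 else 0.0)
            for k, w in WEIGHTS.items()
        )
        scores[bot] = round(100 * total, 1)
    return scores

# Hypothetical inputs for illustration only (not the benchmark's data).
example = {
    "bot_a": {"precision": 0.50, "acceptance": 0.80, "low_noise": 0.90, "low_overhead": 1.00},
    "bot_b": {"precision": 0.25, "acceptance": 0.40, "low_noise": 0.45, "low_overhead": 0.50},
}
print(composite_scores(example))  # {'bot_a': 100.0, 'bot_b': 50.0}
```

Under this reading a bot scores 100 only if it is the best on every axis at once; a leader that trails on one axis would land slightly below 100.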

The numbers

Inline findings (volume)

Outcome — maintainer's own verdict

Legend: accepted/fixed · rejected · discussed · no response

Value distribution (1=noise · 5=important real bug)

Real catches vs noise (absolute)

Expertise zone (findings by area)

Redundancy — solo vs shared locations

Legend: solo (unique) · shared with another bot

Summary/walkthrough overhead (issue comments — token cost)

bars = number of auto-summary comments

Location coverage: how many bots flag each spot

Per-bot verdict

Bottom line & recommendation

Methodology

The core design choice: score the outcome, not the packaging

This report measures whether each comment turned out to be a real, useful issue — judged primarily by the maintainer's own reply in the thread (did they fix it, agree, or dismiss it). It deliberately does not score a comment for looking like a good review — i.e. for containing words like "bug"/"security", code blocks, or a self-assigned severity badge. That distinction matters because some bots stamp every comment "Action required / 🐞 Bug" regardless of substance: a form/keyword scorer rewards that labeling, while an outcome scorer catches that most of it was noise the maintainer ignored. This is the main reason rankings here can differ from a fast keyword-heuristic scorer — see the table at the end.
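A toy contrast makes the difference concrete. The sketch below is illustrative only: the keyword list, both comments and both scoring functions are hypothetical, and the binary outcome score is a simplification of the report's rubric (which also assigns a 1–5 value).

```python
# Hypothetical keyword list of the kind a form-based scorer might use.
LABEL_KEYWORDS = ("bug", "security", "action required")

def keyword_score(body):
    """Form-based: counts keyword occurrences in the comment text."""
    text = body.lower()
    return sum(text.count(kw) for kw in LABEL_KEYWORDS)

def outcome_score(outcome):
    """Outcome-based: only the maintainer's verdict counts; the text is ignored."""
    return 1 if outcome == "accepted" else 0

# Two made-up comments: one heavily labeled but rejected, one plain but accepted.
comments = [
    {"body": "Action required / Bug: possible security bug in naming.",
     "outcome": "rejected"},
    {"body": "This future is never awaited, so the upload can be lost.",
     "outcome": "accepted"},
]

for c in comments:
    print(keyword_score(c["body"]), outcome_score(c["outcome"]))
# prints:
# 4 0
# 0 1
```

The labeled comment wins under the keyword scorer and loses under the outcome scorer, which is the ranking flip described above.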

Pipeline (per bot, identical for all four):

Pull every review comment and reply thread from the repository via the GitHub API, keeping inline review comments and issue-level summary comments separate.
Reconstruct each thread: the bot's finding → all replies, tagging the maintainer's replies.
An LLM reads every thread and applies one identical rubric, classifying outcome, category, zone, and a 1–5 value score.
Aggregate deterministically (counts, rates, overlap) and render.
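The thread-reconstruction step might look roughly like the sketch below. It assumes the comments have already been fetched as dicts shaped like GitHub's pull request review comment objects (`id`, `in_reply_to_id`, `user.login`, `path`, `body`); the function name, the logins and the sample data are hypothetical, and the report's real pipeline may differ.

```python
from collections import defaultdict

def build_threads(comments, bot_logins, maintainer_login):
    """Group inline review comments into threads started by a bot."""
    by_root = defaultdict(list)
    # Sorting by id puts each root comment before its replies.
    for c in sorted(comments, key=lambda c: c["id"]):
        root_id = c.get("in_reply_to_id") or c["id"]
        by_root[root_id].append(c)

    threads = []
    for root_id, items in by_root.items():
        root = items[0]
        # Skip threads whose root is missing or was not written by a bot.
        if root["id"] != root_id or root["user"]["login"] not in bot_logins:
            continue
        threads.append({
            "bot": root["user"]["login"],
            "path": root.get("path"),
            "finding": root["body"],
            "maintainer_replies": [
                r["body"] for r in items[1:]
                if r["user"]["login"] == maintainer_login
            ],
        })
    return threads

# Made-up sample data for illustration.
sample = [
    {"id": 1, "user": {"login": "review-bot[bot]"}, "path": "app/Job.scala",
     "body": "Possible race on the job map."},
    {"id": 2, "in_reply_to_id": 1, "user": {"login": "maintainer"},
     "body": "Good catch, fixed."},
    {"id": 3, "user": {"login": "maintainer"}, "path": "README.md",
     "body": "Note to self."},
]
threads = build_threads(sample, {"review-bot[bot]"}, "maintainer")
print(len(threads), threads[0]["maintainer_replies"])
```

Each resulting thread is what the LLM rubric step would then read; a thread with an empty `maintainer_replies` list corresponds to the "no response" outcome.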

Definitions used in the tables/charts:

Scope: the 90 most recently merged PRs. A bot's comment total = inline review comments + issue-level summaries.
Ground truth = maintainer replies: "good catch / fixed / will fix" → accepted; "not a bug / already handled / by design / stale" → rejected; mixed → discussed; none → no-response. GitHub reactions were all zero in this repo, so replies are the signal.
Value 1–5 (LLM, anchored to the maintainer reply + technical substance): 5 = important real bug/security/reliability with correct reasoning · 3 = legit minor improvement · 1 = noise, wrong, or trivial nitpick.
Real catches = findings rated ≥4. Noise = findings rated ≤2. Precision = real-catches ÷ findings.
Acceptance rate = accepted ÷ (accepted + rejected); discussed and no-response threads are left out of the denominator.
Zone = file path: backend (.scala/.sql/.conf/.xml), frontend (.ts/.html/.scss/.js), other (.md docs/plans).
Overhead = bot-authored issue-level comments (PR summaries, walkthroughs, compliance reports). Useful context, but a real token/notification cost and not a review finding — so counted separately, never folded into quality.
Redundancy = share of a bot's flagged (PR, file) locations also flagged by another bot — a proxy for how much unique signal it adds.
Composite (0–100, relative to leader): each sub-metric normalized to the best bot, then 35% precision + 25% acceptance + 20% low-noise (1−noise rate) + 20% low-overhead. Weights are transparent and editable; the ranking is robust to reasonable reweighting because Codex leads on most axes simultaneously.
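The deterministic aggregation behind these definitions can be sketched as follows. The function names, the record layout and the sample values are hypothetical; the extension lists and thresholds come from the definitions above, and treating every other extension as "other" is an assumption of this sketch.

```python
import os

# Extension lists as given in the Zone definition.
ZONES = {
    "backend": {".scala", ".sql", ".conf", ".xml"},
    "frontend": {".ts", ".html", ".scss", ".js"},
}

def zone_of(path):
    """Map a file path to backend / frontend / other by its extension."""
    ext = os.path.splitext(path)[1].lower()
    for zone, exts in ZONES.items():
        if ext in exts:
            return zone
    return "other"  # assumption: anything unlisted (e.g. .md) counts as other

def bot_metrics(findings):
    """findings: one bot's list of {'value': 1-5, 'outcome': str} records."""
    n = len(findings)
    real = sum(1 for f in findings if f["value"] >= 4)   # real catches
    noise = sum(1 for f in findings if f["value"] <= 2)  # noise
    acc = sum(1 for f in findings if f["outcome"] == "accepted")
    rej = sum(1 for f in findings if f["outcome"] == "rejected")
    return {
        "precision": real / n if n else 0.0,
        "noise_rate": noise / n if n else 0.0,
        "acceptance": acc / (acc + rej) if acc + rej else 0.0,
    }

def redundancy(locations_by_bot, bot):
    """Share of a bot's (PR, file) locations also flagged by any other bot."""
    mine = locations_by_bot[bot]
    others = set().union(
        *(locs for b, locs in locations_by_bot.items() if b != bot)
    )
    return len(mine & others) / len(mine) if mine else 0.0

# Made-up sample data for illustration.
findings = [
    {"value": 5, "outcome": "accepted"},
    {"value": 4, "outcome": "accepted"},
    {"value": 3, "outcome": "rejected"},
    {"value": 1, "outcome": "no_response"},
]
locations = {
    "bot_a": {(1, "app/Job.scala"), (2, "src/app.component.ts")},
    "bot_b": {(1, "app/Job.scala")},
}
print(bot_metrics(findings))
print(redundancy(locations, "bot_a"), redundancy(locations, "bot_b"))
print(zone_of("app/Job.scala"), zone_of("docs/plan.md"))
```

With these sample records precision is 0.5, noise rate 0.25 and acceptance 2/3, because the no-response finding counts toward precision and noise but not toward acceptance.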

Caveats (read these):

Value scoring is LLM judgment, not exact measurement: treat scores as estimates, not ground truth to two decimals.
"No response" is ambiguous: could mean "too minor to bother replying" or "maintainer missed it." It is excluded from acceptance rate for that reason.
Maintainer acceptance reflects what this maintainer valued; another team's priorities could shift the weights.
Bots reviewed overlapping but not byte-identical commit ranges; docs-heavy redesign PRs in this window inflate docs/plan comments for all bots.

Outcome-based (this report) vs keyword-heuristic scoring

Aspect	This report (outcome)	Keyword/heuristic scorer
What "quality" means	Did the maintainer accept it & is it technically real	Does the text contain bug/security keywords, code blocks, severity labels
Reads maintainer replies	Yes — primary signal	No
Gameable by self-labeling	No — labels ignored	Yes — "Action required / 🐞 Bug" inflates score
Counts summary/walkthrough cost	Yes (overhead column)	Usually excluded
Cost / reproducibility	LLM judgment — costlier to re-run	Deterministic, cheap, scriptable

Both are valid for different goals. A keyword scorer is a great cheap recurring dashboard; this outcome-based read is better for a keep/drop decision, because it can tell a labeled bug from a real one.

Generated from GitHub PR review data · anonymized for public sharing