Review-Bot Quality Comparison

Codex · Copilot · CodeRabbit · Qodo — benchmarked on a real-world Scala + Angular codebase

About the codebase under review

The bots were benchmarked on a real, actively developed media-transcription web app: it transcribes audio/video, generates AI summaries, chapters and cheatsheets, and offers chat over the resulting transcripts. It is a non-trivial full-stack codebase with real concurrency, payment/quota, and third-party-API surface, which is why the four reviewers' strengths diverge so sharply by area.

Backend  (.scala / .sql / .conf)

  • Scala 3 on the Play Framework, with Akka actors & streaming for background jobs and live updates
  • PostgreSQL via a hand-rolled DAO layer + Liquibase migrations
  • Audio pipelines (ffmpeg transcode/segment), S3 storage, PDF text extraction
  • OpenAI integrations — transcription (Whisper), LLM chat/summaries, image generation
  • JWT auth, usage quotas / cost circuit-breakers; deployed on a Heroku-style PaaS

Bugs here are mostly about correctness, concurrency, reliability & security: races, resource leaks, auth/quota bypass, SQL edge cases.

Frontend  (.ts / .html / .scss)

  • Angular + TypeScript single-page app
  • Angular Material components; SCSS design tokens & theming
  • RxJS-driven state, component lifecycle & routing
  • Cypress end-to-end + unit/spec tests
  • Large in-flight UI redesign (hence many docs/plan PRs in this window)

Issues here skew toward component state/lifecycle, browser-compat, accessibility and UI consistency.

Leaderboard

Composite (0–100, relative to the leader) = 35% precision (high-value rate) · 25% acceptance · 20% low-noise · 20% low-overhead, each normalized to the best bot. Transparent weights — see methodology. Ranking favors real-bug ROI over raw volume.
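As a rough illustration of the weighting above, the composite could be computed along these lines. This is a minimal sketch, not the report's actual script: the weights come from the text, but the function name, the bot names, and the input values are made up, and it assumes every component has already been oriented so that higher is better (for example, "low-noise" as 1 minus the noise rate).

```python
# Weights as stated in the report; everything else here is illustrative.
WEIGHTS = {
    "precision": 0.35,
    "acceptance": 0.25,
    "low_noise": 0.20,
    "low_overhead": 0.20,
}


def composite_scores(metrics):
    """metrics: {bot: {component: value}}, each component oriented higher-is-better."""
    # Best value per component across all bots, used as the normalization base.
    best = {c: max(m[c] for m in metrics.values()) for c in WEIGHTS}
    scores = {}
    for bot, m in metrics.items():
        total = sum(
            w * (m[c] / best[c] if best[c] > 0 else 0.0)
            for c, w in WEIGHTS.items()
        )
        scores[bot] = round(100 * total, 1)
    return scores


# Made-up inputs for two hypothetical bots, not figures from this report.
example = {
    "bot_a": {"precision": 0.60, "acceptance": 0.50, "low_noise": 0.90, "low_overhead": 0.80},
    "bot_b": {"precision": 0.30, "acceptance": 0.50, "low_noise": 0.45, "low_overhead": 0.40},
}
print(composite_scores(example))
```

A bot that leads on every component scores 100 under this scheme; a bot that leads on none can still score well if it is close to the best on the heavily weighted precision term.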

The numbers

Inline findings (volume)

Outcome — maintainer's own verdict

accepted/fixed · rejected · discussed · no response

Value distribution (1=noise · 5=important real bug)

Real catches vs noise (absolute)

Expertise zone (findings by area)

Redundancy — solo vs shared locations

solo (unique) · shared with another bot

Summary/walkthrough overhead (issue comments — token cost)

bars = number of auto-summary comments

Location coverage: how many bots flag each spot

Per-bot verdict

Bottom line & recommendation

Methodology

The core design choice: score the outcome, not the packaging

This report measures whether each comment turned out to be a real, useful issue — judged primarily by the maintainer's own reply in the thread (did they fix it, agree, or dismiss it). It deliberately does not score a comment for looking like a good review — i.e. for containing words like "bug"/"security", code blocks, or a self-assigned severity badge. That distinction matters because some bots stamp every comment "Action required / 🐞 Bug" regardless of substance: a form/keyword scorer rewards that labeling, while an outcome scorer catches that most of it was noise the maintainer ignored. This is the main reason rankings here can differ from a fast keyword-heuristic scorer — see the table at the end.
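The difference between the two scoring styles can be sketched in a few lines. The keyword list and both scoring functions below are hypothetical, written only to illustrate the point; the outcome labels are the ones used in this report's charts, and the two sample comments are invented.

```python
import re

# Hypothetical keyword list, chosen only for this illustration.
KEYWORDS = re.compile(r"\b(bug|security|action required)\b", re.IGNORECASE)
ACCEPTED = {"accepted/fixed"}


def keyword_score(comment_body):
    """Form-based: counts keyword hits in the bot's own text."""
    return len(KEYWORDS.findall(comment_body))


def outcome_score(maintainer_outcome):
    """Outcome-based: 1 if the maintainer accepted or fixed the finding, else 0."""
    return 1 if maintainer_outcome in ACCEPTED else 0


# Invented comments: one stamped with labels but ignored, one plain but fixed.
stamped = {
    "body": "Action required / Bug: consider renaming this variable",
    "outcome": "no response",
}
plain = {
    "body": "This Future is never awaited, so the quota check can be skipped",
    "outcome": "accepted/fixed",
}

for name, c in (("stamped", stamped), ("plain", plain)):
    print(name, keyword_score(c["body"]), outcome_score(c["outcome"]))
```

The stamped comment wins under the keyword scorer and loses under the outcome scorer, which is the ranking inversion this section describes.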

Pipeline (per bot, identical for all four):

  1. Pull every review comment + reply thread via the GitHub API (inline review comments and issue-level summary comments kept separately).
  2. Reconstruct each thread: bot's finding → all replies, tagging the maintainer's replies.
  3. An LLM reads every thread and applies one identical rubric, classifying outcome, category, zone, and a 1–5 value score.
  4. Aggregate deterministically (counts, rates, overlap) and render.
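Steps 2 and 4 of the pipeline could look roughly like the sketch below. It assumes the comments are already fetched as dictionaries shaped like GitHub's pull request review comment objects (`id`, `in_reply_to_id`, `user.login`, `body`, `created_at`), and that replies point at the thread's first comment. The maintainer login, function names, and sample data are hypothetical; the LLM classification in step 3 is not shown.

```python
from collections import Counter, defaultdict

MAINTAINER = "maintainer-login"  # hypothetical login; the report is anonymized


def build_threads(comments):
    """Step 2: group review comments into threads (root finding plus replies)."""
    roots = {c["id"]: c for c in comments if c.get("in_reply_to_id") is None}
    replies = defaultdict(list)
    for c in comments:
        parent = c.get("in_reply_to_id")
        if parent in roots:
            replies[parent].append(c)
    threads = []
    for root_id, root in roots.items():
        # ISO 8601 timestamps sort correctly as plain strings.
        ordered = sorted(replies[root_id], key=lambda c: c["created_at"])
        threads.append({
            "bot": root["user"]["login"],
            "finding": root["body"],
            "replies": ordered,
            "maintainer_replies": [
                r["body"] for r in ordered if r["user"]["login"] == MAINTAINER
            ],
        })
    return threads


def outcome_counts(classified):
    """Step 4: deterministic tally of outcomes per bot from (bot, outcome) pairs."""
    counts = defaultdict(Counter)
    for bot, outcome in classified:
        counts[bot][outcome] += 1
    return counts


# Invented sample thread, not data from this report.
sample = [
    {"id": 1, "user": {"login": "some-bot"}, "body": "Possible race in job actor",
     "created_at": "2024-01-01T10:00:00Z"},
    {"id": 2, "in_reply_to_id": 1, "user": {"login": MAINTAINER},
     "body": "Good catch, fixed", "created_at": "2024-01-01T11:00:00Z"},
]
threads = build_threads(sample)
print(threads[0]["bot"], threads[0]["maintainer_replies"])
```

Keeping the aggregation a pure function over (bot, outcome) pairs means only the classification step depends on LLM judgment; the counts and rates can be recomputed cheaply from stored labels.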

Definitions used in the tables/charts:

Caveats (read these):

Outcome-based (this report) vs keyword-heuristic scoring

Aspect | This report (outcome) | Keyword/heuristic scorer
What "quality" means | Did the maintainer accept it & is it technically real | Does the text contain bug/security keywords, code blocks, severity labels
Reads maintainer replies | Yes, primary signal | No
Gameable by self-labeling | No, labels ignored | Yes, "Action required / 🐞 Bug" inflates score
Counts summary/walkthrough cost | Yes (overhead column) | Usually excluded
Cost / reproducibility | LLM judgment, costlier to re-run | Deterministic, cheap, scriptable

Both are valid for different goals. A keyword scorer is a great cheap recurring dashboard; this outcome-based read is better for a keep/drop decision, because it can tell a labeled bug from a real one.

Generated from GitHub PR review data · anonymized for public sharing