Scout: Validating Agent-Authored Code in CI When There Is No Oracle

Summary

Continuous integration was designed for human-authored code against deterministic test oracles: the same input produces the same output, the test either matches or doesn’t, and a failure is a real defect. Agent-authored code violates every assumption in that chain. The same prompt produces different code on different runs; the test that passed yesterday may fail today because the agent took a different path to the same answer; the test that fails today may be flagging a real regression or an irrelevant trajectory difference. Standard CI either tolerates the resulting false alarms and gets disabled within a week, or is loosened until it stops raising them and ships defects.

May 2026 produced the first widely circulated public vendor account of this problem and four parallel attempts to solve it. GitHub published its argument that “correct” no longer fits a deterministic reference output and proposed dominator-analysis-based structural validation as a substitute. Stripe described the Minions pipeline shipping a thousand-plus pull requests per week, anchored on selective CI, automatic linting in under five seconds, and a hard two-attempt cap on agent-driven test fixes. Anthropic published a longer methodology piece on demystifying agent evals (layered grading, isolated trial environments, LLM-as-judge calibration against human labels) and a separate engineering write-up on running sixteen parallel Claudes to build a C compiler, where the test-harness quality was named as the load-bearing component. Two academic frameworks landed in the same month: HiL-Bench, measuring agent judgment about when to ask for help, and AgentPulse, measuring agents continuously across eighteen real-world signals rather than at a single point in time.

The four sources disagree on the metaphor (graph dominance, blueprint orchestration, layered rubric, continuous signal), but they converge on a small number of mechanics that work in production: deterministic checks first, semantic graders second, isolated trial environments, calibration against human labels, and explicit retry caps. The teams reporting working pipelines run these together, not separately.

This briefing pulls those mechanics apart for platform engineers and CI/CD leads at teams shipping agent-authored code at scale.

Key Findings

Standard CI’s deterministic-oracle assumption is the failure mode

GitHub’s framing on the eval side states the problem directly: in the agent-authored workload “there is no ground-truth ‘correctness’” against which a run can be scored, per its companion blog post on token efficiency in agentic workflows. The corresponding post on validating agentic behavior names the resulting CI failure shape: “But your CI pipeline still flagged the run as a failure—not because the task failed, but because the execution path no longer matched the recorded script or assertion timing.” The same task may have many correct executions, the test that recorded one execution path won’t match the next one, and CI built around recorded scripts or assertion timing produces a steady stream of false flags.

That is the structural source of flaky-test fatigue. The pipeline isn’t reporting noise that can be tuned out by re-running; it’s reporting a category error in what the test is measuring. Each “‘false negative’ that halts production,” to use GitHub’s phrasing for the case where the agent has succeeded but the test still fails, trains the team that CI failures are not signal. Within a week of sustained false-negative volume, the rational practitioner response is to disable the test, weaken the assertion, or add a retry wrapper that hides the problem. The pipeline still runs, but it has stopped doing what CI exists to do.

The four published responses share a single architectural move: validate the structure of what the agent did, not the literal sequence. They disagree on how to do that.

Pattern A — Dominator analysis: learn essential states from a few successful traces

GitHub’s pattern, described in the same blog post and the underlying paper (Sharma et al., “Learning Correct Behavior from Examples”), models each agent execution as a graph where nodes are observable states (screenshots for UI agents, code snapshots for development agents) and edges are the transitions between them. Collect two to ten known-good executions of the task, build a Prefix Tree Acceptor across the traces, and apply dominator analysis from compiler theory to identify which states are mathematically essential: the ones a trace cannot reach the success state without passing through.

Validation of a new run reduces to: did the trace hit the essential states in the right topological order? The paper claims accuracy, precision, recall, and F1 all at 100% against a learned reference, versus an F1 of roughly 70% for self-assessment by the agent itself. The most interesting reported number is on distinguishing agent errors from product regressions — what GitHub frames as the “not a bug” case where the agent failed because the product changed. Self-assessment scores zero on this; structural validation reaches an F1 of 52%. That is the gap between “the agent thinks it succeeded” and “the trace actually reached the states a successful trace has to reach.”

The mechanic that matters for practitioners: dominator validation handles UI noise (transient loading screens, cosmetic re-renders) without flagging it as failure, because intermediate non-essential states fall out of the model. It is also robust to the agent finding a different valid path — as long as the essential states are hit, the trace is acceptable.
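The essential-state idea can be illustrated with a minimal sketch. It is not GitHub's implementation: it assumes states are plain string labels, merges the known-good traces into one graph keyed by label (a simplification of the Prefix Tree Acceptor step), and uses the textbook iterative dominator computation. The names `essential_states` and `validate` and the state labels are hypothetical.

```python
from collections import defaultdict

START, SUCCESS = "<start>", "<success>"  # virtual entry and exit nodes (assumed)

def build_graph(traces):
    """Merge known-good traces into one successor map keyed by state label."""
    succ = defaultdict(set)
    for trace in traces:
        path = [START, *trace, SUCCESS]
        for a, b in zip(path, path[1:]):
            succ[a].add(b)
    return succ

def dominators(succ):
    """Iterative dominator sets: dom[n] = {n} | intersection of dom[p] over preds p."""
    nodes = set(succ) | {n for targets in succ.values() for n in targets}
    pred = defaultdict(set)
    for a, targets in succ.items():
        for b in targets:
            pred[b].add(a)
    dom = {n: set(nodes) for n in nodes}
    dom[START] = {START}
    changed = True
    while changed:
        changed = False
        for n in nodes - {START}:
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def essential_states(traces):
    """States every learned path to SUCCESS must cross, earliest first."""
    dom = dominators(build_graph(traces))
    essential = dom[SUCCESS] - {START, SUCCESS}
    # Dominators of one node form a chain, so set size gives their order.
    return sorted(essential, key=lambda s: len(dom[s]))

def validate(trace, essential):
    """True if the trace visits the essential states as an ordered subsequence."""
    remaining = iter(trace)
    return all(state in remaining for state in essential)

good_runs = [
    ["login", "loading", "cart", "checkout"],
    ["login", "cart", "promo", "checkout"],
    ["login", "cart", "checkout"],
]
essential = essential_states(good_runs)  # ["login", "cart", "checkout"]
```

In this toy data, `loading` and `promo` drop out because at least one good run skips them, so a new run containing an unseen `spinner` state still validates, while a run that jumps from `login` straight to `checkout` does not.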

GitHub’s writeup is careful about scope. The framework currently learns from a fixed seed of two to ten traces and does not yet refine the dominator model from new successful runs. Online learning (recomputing dominators as the system observes more successful executions and the codebase shifts underneath them) is named as future work, not current capability. That gap matters for calibration over months (see “Keeping the signal calibrated over months, not just at adoption” below).

Pattern B — Stripe’s blueprints: deterministic walls around the agent loop

Stripe’s Minions agents merged “over a thousand pull requests per week” as of March 2026 per Cameron Bernhardt, engineering manager at Stripe, in InfoQ’s reporting: “Minions have progressed from concept to generating over a thousand pull requests per week. All code is human-reviewed, but the agents are increasingly producing changes end-to-end.” The reported architecture, drawn from Stripe’s own Minions Part 2 engineering writeup and paraphrased in ByteByteGo’s coverage and Anup Jadhav’s read of the same source, centers on three pieces:

Local linters that close the feedback loop in under five seconds. A background daemon precomputes applicable lint rules so the agent sees typos and trivial defects fast enough to fix them cheaply before any CI cost is incurred.
Selective CI on a battery reported at over three million tests. Only the tests touching files the agent modified are executed. Autofixes apply for known failure patterns, so a recognised fixable failure does not cost an agent turn.
A hard cap of two attempts. Per ByteByteGo’s read of Stripe’s writeup, the cap is deliberate because “LLMs show diminishing returns when retrying the same problem repeatedly.” If two passes through CI haven’t produced a green build, the run terminates as a pull request and a human takes the partial work.

The orchestration framing is what Stripe calls blueprints: sequences that alternate deterministic code nodes (lint, push, branch creation, PR template population) with agentic loops (feature implementation, CI failure resolution). The agent has agency inside the loops; it does not have agency at the gates between them. ByteByteGo summarises the principle: “The primary reason the Minions work has almost nothing to do with the AI model powering them” — the wins compound from the deterministic walls around the loop, not from the model’s improvement.
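A blueprint of this shape can be sketched in a few lines. This is an illustration of the alternation described above, not Stripe's code: `run_blueprint`, `Outcome`, and the callables passed in are hypothetical names, and the cap of two passes is taken from the reported pattern.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]  # (name, deterministic check)

@dataclass
class Outcome:
    status: str                 # "green" or "needs_human"
    passes_used: int
    failed_gate: Optional[str]

def run_blueprint(
    implement: Callable[[], None],   # agentic loop: write the change
    gates: List[Gate],               # deterministic nodes: lint, selective CI, ...
    fix: Callable[[str], None],      # agentic loop: respond to one failed gate
    max_passes: int = 2,             # hard cap owned by the pipeline
) -> Outcome:
    implement()
    failed: Optional[str] = None
    for attempt in range(1, max_passes + 1):
        # The pipeline runs the gates in order; the agent cannot skip them.
        failed = next((name for name, check in gates if not check()), None)
        if failed is None:
            return Outcome("green", attempt, None)
        if attempt < max_passes:
            fix(failed)  # the agent gets one repair turn per remaining pass
    # Cap reached: hand the partial work to a human instead of retrying.
    return Outcome("needs_human", max_passes, failed)
```

The design choice worth noticing is that `max_passes` is an argument of the pipeline function, so the agent-side callables have no way to extend their own budget.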

The structural lesson for teams building their own CI pipeline is that none of these mechanics involve novel ML. Sub-five-second linting, test-impact analysis, autofix-on-known-patterns, and a retry cap are pre-agent engineering practices. The Minions pattern is the assembly of those practices into a pipeline where the agent gets fast feedback, expensive feedback, and explicit termination — in that order, with the right primitives gating each step.
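Test-impact analysis, one of the pre-agent practices named above, reduces to a set intersection once a dependency map exists. The sketch below assumes such a map is already available as `test_deps` (test name to the source files it exercises); how Stripe derives its map is not described in the public record, and `select_tests` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Set, Tuple

def select_tests(
    changed_files: Iterable[str],
    test_deps: Dict[str, Set[str]],
) -> Tuple[List[str], List[str]]:
    """Return (tests to run, changed files that no test is known to cover)."""
    changed = set(changed_files)
    # Run only the tests whose dependency set overlaps the change.
    selected = sorted(t for t, deps in test_deps.items() if deps & changed)
    covered = set().union(*test_deps.values())
    # Surface uncovered files so the pipeline can decide on a fallback.
    uncovered = sorted(changed - covered)
    return selected, uncovered

deps = {
    "test_pay.py": {"pay.py", "util.py"},
    "test_user.py": {"user.py", "util.py"},
    "test_docs.py": {"docs.py"},
}
to_run, unknown = select_tests(["util.py", "new.py"], deps)
# to_run == ["test_pay.py", "test_user.py"], unknown == ["new.py"]
```

Returning the uncovered files separately is an assumption of this sketch: a change the map knows nothing about is a reason to widen the run, not to report green on zero tests.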

A practitioner caveat worth stating directly: the published account of the Minions pipeline is Stripe’s own engineering blog and the third-party paraphrases of it. Internal failure rates, per-blueprint pass rates, and the cost-per-merged-PR are not in the public record. Public reporting on the long-term steady-state economics of the Minions architecture remains thin, and any team copying the pattern is copying the shape, not the calibrated numbers. Stripe’s separate published benchmark for AI-built Stripe integrations gives one adjacent signal — its emphasis is structural realism (full codebases, browser-based verification, deterministic graders validating actual Stripe API objects created at runtime) rather than abstract semantic scoring — and matches the architectural posture of the Minions pipeline, but it does not substitute for the cost-and-failure detail that remains internal.

Pattern C — Layered evals: deterministic checks first, LLM rubric second, human last

Anthropic’s Demystifying evals for AI agents is the longest published vendor-side methodology piece on the topic. The core ordering is the inverse of how teams instinctively start: deterministic graders first (shell scripts, regex assertions, schema validation, static analysis), LLM-as-judge second on the dimensions the deterministic graders can’t reach (style, tone, semantic correctness), human review last and only on a sample.
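The ordering can be sketched as a short-circuiting grader. This is a minimal illustration rather than Anthropic's harness: `judge` stands in for any LLM-as-judge call and is passed in as a plain function, and the `judge_threshold` and `human_sample_rate` defaults are placeholder values chosen for the example.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Check = Tuple[str, Callable[[str], bool]]  # (name, deterministic grader)

def grade(
    output: str,
    checks: List[Check],
    judge: Callable[[str], float],       # hypothetical judge returning 0..1
    judge_threshold: float = 0.8,        # assumed cut-off
    human_sample_rate: float = 0.1,      # assumed sampling fraction
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    rng = rng or random.Random()
    # Layer 1: cheap deterministic graders, stopping at the first failure.
    for name, check in checks:
        if not check(output):
            return {"passed": False, "stage": name,
                    "judge_score": None, "human_review": False}
    # Layer 2: the expensive judge only sees outputs that survived layer 1.
    score = judge(output)
    return {
        "passed": score >= judge_threshold,
        "stage": "judge",
        "judge_score": score,
        # Layer 3: flag a random sample for human review.
        "human_review": rng.random() < human_sample_rate,
    }
```

Because the function returns before calling `judge` when a deterministic check fails, judge cost scales with the number of outputs that are at least structurally valid.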

The non-obvious points in Anthropic’s guidance:

Trial isolation is load-bearing. “Each trial should be ‘isolated’ by starting from a clean environment. Unnecessary shared state between runs (leftover files, cached data, resource exhaustion) can cause correlated failures due to infrastructure flakiness rather than agent performance.” Failures that look like agent regression are frequently infrastructure flakiness measured against a shared workspace. A team triaging “is the agent worse” before “is the test environment dirty” is investigating the wrong question.
A 0% pass rate is more often a broken task than an incapable agent. Specifically: “with a 0% pass rate across many trials (i.e 0% pass@100)” — Anthropic flags this as “most often a signal of a broken task, not an incapable agent.” The pattern repeats at the eval-design layer: when the test is wrong, the agent looks worse than it is. The Anthropic-internal example: Opus 4.5 initially “scored 42%” on CORE-Bench “until an Anthropic researcher found multiple issues,” after which “Opus 4.5’s score jumped to 95%.”
Pass^k is the real consistency metric. A 75% per-trial pass rate looks reassuring until it’s compounded: (0.75)³ ≈ 42% probability of passing three trials in a row. Anthropic’s published practice on SWE-bench includes averaging across many trials — reportedly 25 per task — in the Opus 4.6 system card; teams running fewer than three trials per task are over-fitting their CI gates to single-run variance. Cekura’s production-evals writeup makes the same point operationally — its working framing is that measuring both pass@k (success at least once across k runs) and pass^k (success on every run) is the right pre-deployment discipline, because a 75% pass rate yielding only 42% probability of three consecutive passes is the gap between “the agent can do it” and “the agent reliably does it.”
LLM-judge calibration is a numeric discipline. The published guidance: “LLM-as-judge graders should be closely calibrated with human experts to gain confidence that there is little divergence between the human grading and model grading.” The implementation: collect human labels on a representative slice, score the judge against those labels, target 75–90% agreement, recalibrate as the task surface shifts.
Grade output, not trajectory, by default. “It’s often better to grade what the agent produced, not the path it took.” The exception is when the trajectory is the artifact under test — for example, when the agent is supposed to use a specific tool or follow a particular sequence. Trajectory grading is brittle by construction because there are many valid paths; output grading is robust as long as the output is comparable.
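The pass^k arithmetic in the list above is easy to reproduce. The sketch assumes trials are independent with the same per-trial success probability `p`, which is the assumption behind the (0.75)³ figure; the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1.0 - (1.0 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k

at_least_once = pass_at_k(0.75, 3)   # 0.984375
every_time = pass_hat_k(0.75, 3)     # 0.421875
```

The same 75% agent looks near-perfect under pass@3 and worse than a coin flip under pass^3, which is why the choice of metric has to be made before the gate threshold is.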

Anthropic’s parallel-Claudes C-compiler write-up is the same pattern at a larger scale: sixteen Opus 4.6 instances in Docker containers, GCC as a “known-good compiler oracle” against the agent-written compiler, GCC torture tests as deterministic graders, real-world projects (Linux kernel x86/ARM/RISC-V, QEMU, FFmpeg, SQLite, Postgres, Redis) as integration tests. The author’s first recommendation for orchestrating parallel Claude instances: “Write extremely high-quality tests.” When the test harness is good, the agent solves the problem you set; when it’s not, the agent solves a wrong problem accurately.

The C-compiler pipeline also shipped a deterministic-sampling mode — a --fast flag running “1% or 10% random sample” that is “deterministic per-agent but random across VMs” — so parallel agents could explore complementary slices of the test surface without redundancy. That is a useful pattern for any team running multiple agents whose CI cost would otherwise multiply.
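One way to get sampling that is stable for a given agent but different between agents is to hash the agent identifier together with each test identifier. This is a swapped-in technique for illustration, not the mechanism of the write-up's --fast flag, whose internals are not described here; `sample_tests` and the identifiers are hypothetical.

```python
import hashlib
from typing import Iterable, List

def sample_tests(test_ids: Iterable[str], agent_id: str, fraction: float) -> List[str]:
    """Keep a test when hash(agent_id, test_id) lands below `fraction`."""
    selected = []
    for test_id in test_ids:
        digest = hashlib.sha256(f"{agent_id}:{test_id}".encode()).digest()
        # Map the first 8 bytes of the digest to a float in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        if bucket < fraction:
            selected.append(test_id)
    return selected

all_tests = [f"torture_{i:04d}" for i in range(1000)]
slice_a = sample_tests(all_tests, "agent-a", 0.10)
slice_b = sample_tests(all_tests, "agent-b", 0.10)
```

Re-running `sample_tests` for `agent-a` returns the same slice every time, so a failure is reproducible, while `agent-b` draws a different slice, so the fleet as a whole covers more of the suite than any one agent pays for.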

Pattern D — Rubrics as contextual verifiers, when tests can’t reach the question

Raghavendra et al.’s Agentic Rubrics as Contextual Verifiers for SWE Agents makes the case for rubric-based scoring where tests genuinely can’t capture the failure mode: an expert agent explores the repository and produces a structured, context-grounded checklist; candidate patches are scored against the checklist without executing the tests. The reported headline: rubric scores are “consistent with ground-truth tests while also flagging issues that tests do not capture.”

The rubric structure they recommend has four axes: File Change (are the edits minimal, correctly scoped, sufficient), Spec Alignment (does the patch satisfy the requirements), Integrity (no hardcoding, no shortcut implementations), and Runtime considerations. The composition is the point — none of these are individually novel, but the assembled rubric catches failures that test-execution alone misses, including the dangerous category of code that compiles, passes every test, and is wrong.

The rubric model intersects with Galileo’s framework writeup, which proposes a hierarchical taxonomy of 7 primary dimensions / 25 sub-dimensions / 130 fine-grained items and recommends a calibration target of “minimum 0.80 Spearman correlation with human evaluators for production deployment.” That is the same calibration discipline Anthropic recommends, expressed numerically: if your rubric scorer doesn’t reach 0.80 correlation with a human-labelled validation slice, the rubric is noisy enough that its CI signal isn’t trustworthy.
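A calibration gate of this kind can be sketched with the standard library alone. The sketch computes Spearman correlation as Pearson correlation over average ranks and compares it with the 0.80 figure quoted above; `judge_is_calibrated` is a hypothetical helper, and the score lists are made-up illustration data.

```python
from statistics import mean
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rho; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def judge_is_calibrated(judge_scores, human_scores, minimum: float = 0.80) -> bool:
    return spearman(judge_scores, human_scores) >= minimum

judge = [1, 2, 3, 4, 5]   # illustrative rubric scores from the judge
human = [1, 2, 3, 5, 4]   # illustrative scores from human labellers
rho = spearman(judge, human)  # 0.9 up to float rounding
```

Running this against a held-out labelled slice on a schedule turns "the judge seems fine" into a number that can fail a build.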

Two practitioner-relevant Promptfoo recommendations that bridge the rubric approach into CI mechanics (Promptfoo guide on evaluating coding agents): “JavaScript assertions check structure. For semantic quality… use model grading.” And, on flaky-test debugging: “If a prompt fails 50% of the time, the prompt is ambiguous. Fix the instructions rather than running more retries.”

The complement on the pipeline side comes from Augment Code’s CI/CD for AI Agents guide, which names the missing primitive that turns a rubric into a CI gate. Its argument is that “verification only changes outcomes when it runs as a mandatory gate at a defined point in the workflow,” and recommends a dual-gate pattern: the agent runs its own verifier internally before creating a PR, and CI re-runs the same verifier as a hard gate the agent cannot bypass. The framing — verifier as first-class CI primitive, not as agent-internal heuristic — is the structural answer to why agent self-assessment scored zero F1 on the “not a bug” detection that dominator analysis solved.

Pattern E — Judgment over capability: HiL-Bench, AgentPulse, and the over-month signal

The benchmark literature added two pieces in April–May 2026 that bear directly on CI calibration.

HiL-Bench (Trinh, Elfeki, and ten co-authors) measures whether coding agents know when they don’t know enough and should ask for help. The headline gap is dramatic: on SQL tasks, frontier models hit 86–91% pass rate with complete information, but only 5–39% when asked to recognise blockers and call ask_human(). On SWE tasks the drop is steeper, 67–85% down to 1–9%. The judgment gap ranges 51–81 percentage points across domains. The benchmark’s Ask-F1 metric, the harmonic mean of question precision (relevant questions / total questions asked) and blocker recall (resolved blockers / total blockers), deliberately penalises both over-asking and under-asking. Reported per-model failure signatures are distinct: GPT models under-ask severely (14–27% recall on blocker resolution), Claude shows high precision but low recall, and Gemini over-asks broadly on SQL (6.6 questions per task at 36% precision).
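The Ask-F1 definition given above translates directly into code. This sketch follows the stated formula only; the handling of zero denominators (scoring 0.0 when no questions were asked or no blockers exist) is an assumption of the example, not something the benchmark description specifies.

```python
def ask_f1(
    relevant_questions: int,
    total_questions: int,
    resolved_blockers: int,
    total_blockers: int,
) -> float:
    """Harmonic mean of question precision and blocker recall."""
    precision = relevant_questions / total_questions if total_questions else 0.0
    recall = resolved_blockers / total_blockers if total_blockers else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = ask_f1(3, 4, 1, 2)      # precision 0.75, recall 0.5
over_asker = ask_f1(2, 10, 2, 2)   # precision 0.2, recall 1.0
silent = ask_f1(0, 0, 0, 3)        # never asks: 0.0
```

The harmonic mean is what makes both failure signatures expensive: the over-asker resolves every blocker and still scores about 0.33, and the silent agent scores zero.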

The practitioner implication for CI: capability and judgment are independent properties. A frontier-tier agent scoring in the high-80s on SWE-bench Verified can still be the worst possible CI citizen if its judgment about when to ask is uncalibrated: silently guessing through a blocked task is exactly the trajectory that produces a PR that compiles, passes tests, and is wrong. CI rubrics that score correctness without scoring help-seeking will systematically miss this failure mode.

AgentPulse — Gao, Wang, Yu — measures 50 agents across 10 workload categories using four composite factors aggregated from 18 real-time signals: Benchmark Performance, Adoption Signals, Community Sentiment, Ecosystem Health. The four factors are largely complementary, with maximum pairwise correlation ρ=0.61 across 18 signals — meaning a one-dimensional benchmark snapshot misses most of what the others surface. Among 11 agents with SWE-bench scores in the AgentPulse sample, benchmark and composite rankings showed Spearman correlation of only 0.25; 9 of 11 agents shifted by at least two ranks when the composite was used.

The CI-discipline implication is that point-in-time benchmark numbers don’t predict month-over-month adoption or ecosystem health. A team selecting an agent based on a snapshot leaderboard score is making a stale-data decision the first week, and a noticeably stale-data decision by month three. The AgentPulse-style approach — sustained signal across heterogeneous metrics — is the over-month version of the within-run calibration discipline Anthropic and Galileo recommend.

Why builds get disabled — and how the published patterns prevent it

The flaky-test-fatigue collapse is mechanically the same in every team that hits it: false-negative volume (CI fails on executions that actually succeeded) drives developers to weaken or disable the test, which removes the protection against real defects that CI was supposed to provide, which means real defects ship, which means trust in CI degrades further. The patterns above share three properties that interrupt that cycle:

Validate the structure, not the trace. Dominator analysis (GitHub), output grading rather than trajectory grading (Anthropic), rubric scoring (Raghavendra et al.), and oracle comparison (the C-compiler pipeline) all reject the “did the agent take the recorded path” question and replace it with “did the agent reach the essential states.” The trace-equality failure mode disappears.
Layer deterministic checks before semantic ones. Stripe’s lint-then-test-then-stop pattern, OpenAI’s deterministic-then-rubric staging (OpenAI Developers — Testing Agent Skills), Promptfoo’s JS-assertion-then-LLM-rubric pattern, and Anthropic’s deterministic-then-LLM-judge-then-human pattern are the same shape. The expensive grader runs second and only when the cheap grader has already passed. Cost is contained; signal is layered.
Bound retries explicitly. Stripe’s two-attempt cap, the Anthropic guidance on isolating trial environments to prevent correlated failures, and Promptfoo’s posture that a 50%-failure prompt is an ambiguity bug to fix rather than a retry rate to crank up all push the same direction: agentic retry budgets are a system parameter, not an emergent agent behaviour. The pipeline decides when the agent has had enough; the agent does not get to decide for itself.

The teams that report working CI on agent-authored code at volume — Stripe at 1,000+/week per Bernhardt, Anthropic with the parallel-Claudes compiler, GitHub processing “over 60 million reviews, growing 10x in less than a year” per their PR-review blog post — all assemble these properties together. None of them rely on a single eval primitive to do the work.

Keeping the signal calibrated over months, not just at adoption

The teams that survive the first three months are the ones that treat the rubric, the dominator model, and the LLM-judge as artifacts that themselves need maintenance. Several specific practices appear across the sources:

Schedule rubric and judge re-calibration on a fixed cadence. Galileo’s framework recommends commit-based, scheduled (daily/weekly), and event-driven triggers running together — the scheduled cadence is where slow drift gets caught. Cekura frames the same point as treating the eval suite as a lifecycle rather than a snapshot, with the corollary that when an eval saturates above 80% the right move is to make the eval harder rather than to declare victory.
Channel production failures into regression tests. The pattern recurs across Anthropic’s evals guide, the GitHub PR-review blog post, and Galileo’s framework: every real failure that escapes CI becomes a regression test that protects against the same failure in the next run. Capability evals (low pass rate, improvement target) and regression evals (near-100% pass rate, protection target) are separated, because mixing them produces wrong prioritisation.
Track AI-codebase-drift signals weekly. Propel’s writeup defines drift as “the slow spread of low-quality patterns in agent-generated code, documentation, and review artifacts” and proposes weekly tracking of duplicate-branch rate, follow-up-fix rate, and missing-artifact rate as the operational signals. None of these are CI-pass-rate metrics; they’re the metrics CI-pass-rate hides.
Periodically re-derive essential states. GitHub’s dominator pattern, as currently published, learns from a fixed seed of traces. Practitioners adopting it should expect to re-derive the dominator model on a cadence matching how fast their product is changing. The framework’s online-learning extension is named as future work, not current capability — anyone treating the dominator model as set-and-forget will discover the same drift problem CI already had, one abstraction layer up.
Watch the judgment metric, not just the capability metric. HiL-Bench’s central finding — that frontier models with high-capability pass rates can collapse to single-digit judgment performance on the SWE-side when help-seeking is required — implies that CI rubrics scoring only correctness will systematically miss the failures that come from miscalibrated help-seeking. Ask-F1-style metrics, or their pipeline-side equivalents (escalation rates, human-handoff rates, ratio of unresolved-blocker PRs to total PRs), should sit alongside pass-rate.
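The separation of capability evals from regression evals in the practices above can be made concrete with a small gate. This is a sketch under stated assumptions: `ci_decision` is a hypothetical name, the 0.99 default is a placeholder for a near-100% regression bar, and an empty regression suite is treated conservatively as failing.

```python
from typing import Dict, List

def pass_rate(results: List[bool]) -> float:
    """Fraction of passing results; an empty list counts as 0.0."""
    return sum(results) / len(results) if results else 0.0

def ci_decision(
    regression_results: List[bool],
    capability_results: List[bool],
    regression_bar: float = 0.99,   # assumed protection threshold
) -> Dict[str, object]:
    regression = pass_rate(regression_results)
    return {
        "regression_rate": regression,
        # Capability is reported as a trend to improve, never as a gate.
        "capability_rate": pass_rate(capability_results),
        "block_merge": regression < regression_bar,
    }

report = ci_decision(
    regression_results=[True] * 200,
    capability_results=[True, False, False, True],
)
```

Keeping the two rates in separate fields means a low capability score cannot block a merge, and a high one cannot mask a regression.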

The framing that ties all this together: CI signal calibration is itself an ongoing task, not a one-time setup. The eval suite, the rubric, the dominator model, and the LLM-judge are all artifacts that need their own maintenance pipeline. When teams report that CI “worked for a quarter and then drifted,” the artifact that drifted was almost always the eval, not the agent.

Practical Implications

For platform engineers and CI/CD leads at teams shipping agent-authored code at meaningful volume, the working recipe assembled from the published patterns:

Run deterministic graders first; LLM-as-judge second; human review last and only on a sample. Lint and static analysis catch the cheap failures in under five seconds. Test execution catches the next layer. Rubric scoring (or dominator-based structural validation) catches the semantic and trajectory failures tests can’t reach. Human review samples 10% to keep the rubric calibrated. The order matters because expensive graders waste cost when run on inputs the cheap graders would have rejected.
Cap agent retries explicitly. Two attempts is what Stripe’s reported pattern uses; three is the upper end of what the published guidance supports. Past three, the marginal probability of success on the same problem with the same context is too low to justify the cost. Make the cap a hard pipeline parameter, not an agent-controlled budget.
Isolate every trial environment. Workspace state bleeding between runs is the most common source of correlated failures that look like agent regression but are infrastructure flakiness. Anthropic’s published warning is direct on this; the cost of getting it wrong is debugging a non-problem for a week.
Calibrate every LLM-judge against human labels before it goes into production. Target 75–90% agreement on a held-out validation slice, or 0.80+ Spearman correlation if you’re using a scoring rubric. Recalibrate quarterly, after every model upgrade, and after every significant change to the task surface.
Validate structure, not trajectory. If your pipeline asserts that the agent took the same path as the recorded reference, it will produce false-negative volume until the team disables it. Either grade output against the spec, or use dominator-style essential-state validation against a learned reference. Trajectory equality is a brittle proxy.
Run trials at pass^k, not pass@k. A 75% pass rate on a single trial gives only a 42% probability of passing three trials in a row. CI gates set against pass@1 will overstate reliability; gates set against pass^3 or pass^5 surface variance that pass@1 hides. Decide which trade-off you want before the variance shows up in production.
Channel every escaped failure into a regression test. Every defect that made it through CI to a human reviewer is the cheapest possible regression test you’ll ever capture. Keep the capability evals and regression evals separate; the capability bar is allowed to be 50%, the regression bar must be 99%+.
Instrument judgment, not just capability. HiL-Bench-style escalation metrics — ratio of PRs the agent flagged for human help versus PRs it silently completed — are the early-warning signal for the failure mode that won’t show up in pass-rate. Track help-seeking calibration as a first-class CI metric.
Treat the eval suite as a system that itself drifts. Schedule re-calibration, re-derive dominator references, refresh rubric ground-truth labels on a fixed cadence. The eval that worked in month one and silently rotted by month four is the structural failure mode of every CI pipeline that hits this problem.
Read the over-month signal, not just the within-run signal. AgentPulse-style sustained metrics across heterogeneous signals — adoption, sentiment, ecosystem health, benchmark — predict adoption sustainability that point-in-time pass-rate misses. Snapshot leaderboards are stale-data decisions by month three.
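Of the items in this recipe, trial isolation is the one most easily shown in code. The sketch below isolates only the filesystem, by copying a clean fixture into a fresh temporary directory for each trial; caches, background processes, and network state would need container-level isolation that this example does not attempt. `run_isolated_trial` and the `trial` callable are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

def run_isolated_trial(fixture_dir: Path, trial: Callable[[Path], T]) -> T:
    """Run one trial against a throwaway copy of the fixture workspace."""
    with tempfile.TemporaryDirectory(prefix="trial-") as tmp:
        workspace = Path(tmp) / "workspace"
        # Every trial starts from the same clean copy of the fixture.
        shutil.copytree(fixture_dir, workspace)
        # The copy is deleted when the block exits, even if the trial raises.
        return trial(workspace)
```

Because the fixture itself is never written to, leftover files from one trial cannot become the cause of the next trial's failure.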

The unglamorous summary: the wins in 2026 are not coming from a new eval primitive. They’re coming from assembling the primitives that already exist — lint, test impact analysis, LLM-as-judge with calibration, retry caps, isolated environments, regression-test promotion — into pipelines where the agent gets fast feedback, expensive feedback, and explicit termination, in that order, with the right gates between them. The teams that ship a thousand agent PRs a week are the ones that built the pipeline first and let the model improve into it.

Open Questions

How does dominator analysis behave on long-horizon, multi-day agent runs? The published paper validates against runs that complete in minutes; the failure mode of essential-state derivation across executions that span days, multiple checkpoints, and partial human takeovers is not yet established.
What does steady-state cost look like for the layered-grader stack? Anthropic, Promptfoo, and OpenAI all recommend deterministic-then-LLM-judge layering, but the cost-per-merged-PR of the LLM-judge layer in production, at the volume Stripe runs, is not in the public record. Public reporting on the long-term economics of layered grading at agent-CI volume remains thin.
When does rubric grading overfit to the rubric? Galileo’s hierarchical taxonomy (7 primary dimensions, 25 sub-dimensions, 130 fine-grained items) and Raghavendra et al.’s four-axis rubric both work in benchmark conditions; the failure mode of an agent learning to score well on the rubric without actually improving the underlying behaviour is the well-known specification-gaming risk, and the published guidance is not yet rich on how to detect it in CI.
How portable is the Stripe pattern to teams without a three-million-test repository? Stripe’s selective-CI relies on a test surface dense enough that test-impact analysis returns a usefully-small set per change. Smaller monorepos or sparser test suites may not produce the same fast-feedback dynamic; the patterns the literature would adapt for that case are not yet published in detail.
How does HiL-Bench’s judgment metric integrate into CI gates? The benchmark establishes that the judgment-versus-capability gap exists; the operational question of how to convert Ask-F1 into a build-pass/fail signal — and whether the right metric is per-PR or per-agent-version — is the natural next research direction, and the published work to date is more diagnostic than prescriptive.
What is the right cadence for re-deriving the dominator reference? GitHub’s framework currently learns from a fixed seed of 2–10 traces. The cadence question — re-derive weekly, on every product release, on every agent-model upgrade — is downstream of how fast the product surface moves, and no published guidance has tried to characterise the trade-off.