Scout: The Coding-Agent Methodology Curriculum — Failure Modes, Sensors, Attention Architecture, and the Harness Moves That Connect Them

Summary

Between mid-May and the end of the month, four independent threads can be assembled into something that reads less like commentary and more like a syllabus. Martin Fowler’s Bliki stabilized the canonical definition of “vibe coding” with explicit maintainability, correctness, and security caveats — moving the term from Karpathy’s offhand February-2025 coinage into the standard architectural reference. Birgitta Böckeler’s sensors series gave the constructive answer: instrument the harness with computational and inferential sensors so it catches maintainability regressions before a human ever sees the diff. Addy Osmani’s “Orchestration Tax is You” named the bottleneck the harness layer doesn’t remove — human attention is a single-threaded resource that all your parallel agents must serialize against. And a cluster of senior-practitioner critiques (Hollandtech’s “Claude is not your architect,” Olano’s “--dangerously-skip-reading-code,” Jacob Harris’s “Why I don’t vibe code”) set the rhetorical floor: agents implement, engineers design, and the accountability for what ships stays human. Academic work published the same month ratifies each leg empirically — refactoring-runaway and overeager-action measurement papers quantify the failure modes, an 11,000-PR study quantifies what gets agent code merged, and CentaurEval quantifies where human-in-the-loop actually adds value. The net: the methodology is catching up to the capability, and the pieces compose into a curriculum a team can adopt deliberately rather than discover by incident.

Key Findings

1. Fowler’s Bliki Entry Fixes the Vocabulary — and the Caveats Travel With It

The pivotal event is definitional. Fowler’s Bliki — his long-form glossary where vocabulary stabilizes for architectural conversation — now carries “vibe coding” as: “building a software application by prompting an LLM, telling it what to build, trying it out, prompting for changes — but without looking at any of the code that the LLM generates” [1]. The essential characteristic Fowler isolates is the willingness to “forget that the code even exists” — no diff review, no structural inspection.

What matters for practitioners is not the definition alone but the fact that the caveats now travel with the term in the canonical reference. Fowler is explicit that vibe-coded software carries maintainability problems (low-quality, voluminous code), correctness risks (hallucination, non-determinism), and security exposure (leaked secrets and credentials) [1]. His scoping is sharp: vibe coding is “best used for disposable software that’s only used by its author or a close group of collaborators who understand and accept the risks involved,” while “code that is more complex, more widely-used, and with more consequences to its risks should not be forgotten about” [1].

The most useful distinction Fowler draws is the one teams most often blur: vibe coding versus what he calls agentic programming, where the developer does review the LLM’s output even though they didn’t type it themselves. That line — do you read the code? — is the actual fault line in the methodology debate. Most of what production teams call “vibe coding” is, by Fowler’s definition, agentic programming with a sloppy review step. Naming the two separately is the first move in the curriculum, because the sensors and attention patterns that follow are precisely the apparatus that keeps a team on the agentic-programming side of the line.

2. Böckeler’s Sensors: The Harness Catches What the Human Can’t Read Fast Enough

If Fowler names the failure mode, Böckeler builds the detector. Her framing treats the harness as “a system of guides and sensors that increase the probability of good agent outputs and enable self-correction before issues reach human eyes” [2]. The vocabulary split is the practitioner takeaway. Guides (Thoughtworks also calls them feed-forward: skills, conventions, guardrails, reference docs) shape generation. Sensors are feedback — “tools that observe what the agent actually produced” and “provide a loop for the agent to self-correct before a human even looks at the code” [11].

Within sensors, the consequential distinction is computational versus inferential [2][11]:

Computational sensors run deterministically on code structure: type checkers, linters (ESLint, Semgrep), dependency analyzers (dependency-cruiser), coverage and mutation-testing tools. Objective, fast, cheap.
Inferential sensors use an LLM to interpret code semantically — modularity reviews, coupling analysis — producing contextual judgments a linter can’t.
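
As a rough illustration of the computational side, the sketch below implements two deterministic sensors in Python: an argument-count check and a forbidden-import layering rule, each returning guidance an agent could act on. It is not Böckeler's ESLint or dependency-cruiser setup; the threshold, the folder names, and the `Finding` shape are all assumptions made for the example.

```python
import ast
from dataclasses import dataclass

MAX_ARGS = 4  # hypothetical threshold, not taken from the source
# Hypothetical layering rule: (importing package, package it must not import).
FORBIDDEN_IMPORTS = [("api_clients", "orchestration")]


@dataclass(frozen=True)
class Finding:
    rule: str
    location: str
    guidance: str  # self-correction hint handed back to the agent


def check_argument_counts(filename: str, source: str) -> list[Finding]:
    """Flag functions whose parameter count exceeds MAX_ARGS."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            count = len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs)
            if count > MAX_ARGS:
                findings.append(Finding(
                    rule="max-args",
                    location=f"{filename}:{node.lineno} {node.name}()",
                    guidance=f"{count} parameters > {MAX_ARGS}; group related "
                             "parameters or split the function.",
                ))
    return findings


def check_layering(package: str, source: str) -> list[Finding]:
    """Flag imports that cross a forbidden package boundary."""
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
    return [
        Finding(
            rule="no-forbidden-import",
            location=package,
            guidance=f"{package} must not import {target}; move the shared "
                     "logic or invert the dependency.",
        )
        for importer, target in FORBIDDEN_IMPORTS
        if package == importer and target in imported
    ]
```

Both checks are pure functions of the source text, which is what makes them cheap to run on every agent turn; the `guidance` field stands in for the inline self-correction hints described in the findings that follow.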

Böckeler’s empirical findings are the part worth internalizing, because they cut against the intuition that the fancy LLM sensors are the valuable ones:

Computational sensors dominate at the file/function level. Basic linting on argument counts, file length, function length, and cyclomatic complexity “impressed me most at the file and function level” [2]. She wrote custom ESLint formatters that hand the agent self-correction guidance inline, and found warning management “now more feasible” because the agent can intelligently suppress or re-threshold violations.
Dependency rules are a high-leverage cheap win. dependency-cruiser enforcing layered architecture (e.g., API clients forbidden from importing orchestration services) let the agent self-correct against folder-concept violations. Her note: “Without AI, I would not have gotten these rules in place quickly” [2].
Raw coupling data is a trap. Feeding an LLM raw incoming/outgoing import metrics produced “quite lackluster” output that misread intentional patterns as problems — because “the raw data itself was very noisy and not that useful without semantic interpretation” [2]. Dependency structure matrices for humans were “tedious to interpret.”
Inferential modularity review is where the LLM earns its place — but only with strong prompts. Driving the review with Vlad Khononov’s Balanced-Coupling “Modularity” skills [12] proved “very fruitful,” surfacing duplicate route code, inconsistent backend-calling patterns, and parameter-handling inefficiencies that computational sensors missed. Without it, she found agents “definitely compounding inadvertent technical debt” [2].
Mutation testing beats coverage as a regression sensor. Her worked example: code at 100% statement coverage with 13 surviving mutants, where the gaps lived in missing assertions that coverage alone hid [2].
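
The coverage-versus-mutation gap can be reproduced in miniature. The sketch below is hand-rolled and does not use a real mutation-testing tool or Böckeler's codebase; the function, the three mutants, and both test suites are invented for illustration. The weak suite executes every statement, so statement coverage is 100%, yet two of three mutants survive.

```python
from typing import Callable

Impl = Callable[[int], bool]


def is_adult(age: int) -> bool:
    """Original implementation under test (hypothetical example)."""
    return age >= 18


# Hand-written mutants; a real tool would generate these automatically.
MUTANTS: dict[str, Impl] = {
    ">= to >": lambda age: age > 18,
    "18 to 19": lambda age: age >= 19,
    ">= to <=": lambda age: age <= 18,
}


def weak_suite(impl: Impl) -> bool:
    # Executes the only statement, so statement coverage is 100%,
    # but never probes the boundary.
    return impl(30) is True


def strong_suite(impl: Impl) -> bool:
    # Adds assertions on both sides of the boundary.
    return impl(30) is True and impl(18) is True and impl(17) is False


def surviving_mutants(suite: Callable[[Impl], bool]) -> list[str]:
    """Names of mutants the suite fails to detect."""
    assert suite(is_adult), "suite must pass on the original code"
    return [name for name, mutant in MUTANTS.items() if suite(mutant)]


print(surviving_mutants(weak_suite))    # ['>= to >', '18 to 19']
print(surviving_mutants(strong_suite))  # []
```

The survivor list, not the coverage percentage, is the signal that tells an agent which assertions are missing.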

Two cautions she raises belong in the curriculum in spirit, if not word for word. First, sensors can displace complexity instead of removing it: aggressive max-lines-per-function rules pushed complexity out of functions and into component-property chains [2]. Second, and more important, sensors “are not a magical solution to take the human totally out of the loop”; she explicitly worries they can produce “a false sense of security and an illusion of quality” [2]. The sensor layer raises the floor; it does not remove the ceiling, which is the subject of the next finding.

3. Osmani’s Orchestration Tax: You Are the GIL

The cleanest naming of the ceiling comes from Osmani. The orchestration tax is “the price you pay for forgetting” that human attention is the serial bottleneck, and “the only real fix is to start architecting your own attention like you architect any concurrent system” [3]. The concurrency metaphor is exact and worth quoting because it’s the load-bearing mental model: “You are the GIL of your AI agents. They all can run at once. But when any of their work needs genuine understanding of the architecture or resolving merge conflicts, that work has to acquire the lock” [3].

The consequence is that spawning agents multiplies generation, not judgment. Twenty running agents produce a dashboard that feels like massive productivity while being “decoupled from actually shipping good code to main” [3]. Osmani’s sharpest warning is about the invisibility of the failure: “You can be maximally busy and barely produce anything. From the inside it feels identical” [3]. The unpaid tax compounds into technical and cognitive debt simultaneously — you merge code you didn’t read well.

The patterns he prescribes are the attention-architecture layer of the curriculum [3][4][5]:

Backpressure as the governing constraint. “The right number of parallel agents is how many you can actually code review properly.” The throughput of the system equals the throughput of the review step — Amdahl’s law applied to a human reviewer.
A two-mode workflow. Local high-touch sessions for architecture and judgment; asynchronous background sessions for bounded mechanical work — mirroring how a manager allocates work across a team [4].
WIP limits, kill criteria, one-agent-per-PR. Cap concurrent streams, define abandonment criteria before building to prevent feature sprawl, and never let a single PR span multiple agents [4].
Git worktrees for isolation, shared interfaces handled human-first before agents build atop them [4].
“Write a brief, not a vibe.” Scope each delegation with outcomes, constraints, non-goals, acceptance criteria, and a verification plan; treat agents “like reports” with predictable async check-ins (what changed, what’s next, what’s blocked) [4].
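
Two of the patterns above, the scoped brief and the WIP/backpressure limit, can be combined into a single pre-delegation check. This is a minimal sketch under stated assumptions: the `Brief` fields mirror the five sections Osmani lists, while `AttentionBudget`, `may_delegate`, and their parameters are hypothetical names rather than any existing tool's API.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Brief:
    """Delegation brief; the five sections follow the list above."""
    outcomes: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    non_goals: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    verification_plan: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


@dataclass(frozen=True)
class AttentionBudget:
    wip_limit: int           # max concurrent agent streams (hypothetical knob)
    max_unreviewed_prs: int  # review-queue depth at which delegation pauses


def may_delegate(brief: Brief, budget: AttentionBudget,
                 active_agents: int, unreviewed_prs: int) -> tuple[bool, str]:
    """Allow a new agent only if the brief is complete and attention remains."""
    missing = brief.missing_sections()
    if missing:
        return False, "brief incomplete: " + ", ".join(missing)
    if active_agents >= budget.wip_limit:
        return False, "WIP limit reached"
    if unreviewed_prs >= budget.max_unreviewed_prs:
        return False, "review queue full"
    return True, "ok"
```

The numbers in `AttentionBudget` still have to come from a human estimate of how many PRs can be reviewed properly; the check only makes that estimate hard to ignore.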

Osmani’s reframing of the bottleneck closes the loop with Fowler and Böckeler: “the bottleneck is no longer ‘can the agent write code?’ It’s ‘should we build this?’” [4] and “the bottleneck is no longer generation. It’s verification” [5]. The harness raises generation throughput; the human is the verification throughput; the orchestration tax is the gap between the two.
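
The gap between the two throughputs can be made concrete with a toy model. The sketch below assumes constant daily rates and treats every generated PR as needing one review; both are simplifications chosen for the example, and the rates are invented rather than drawn from Osmani's posts.

```python
def simulate_backlog(generated_per_day: int, reviewed_per_day: int,
                     days: int) -> tuple[int, int]:
    """Return (merged, unreviewed backlog) after `days` at constant rates."""
    merged = 0
    backlog = 0
    for _ in range(days):
        backlog += generated_per_day            # agents add work in parallel
        reviewed = min(backlog, reviewed_per_day)  # review is the serial step
        merged += reviewed
        backlog -= reviewed
    return merged, backlog


print(simulate_backlog(5, 5, 10))   # (50, 0)
print(simulate_backlog(20, 5, 10))  # (50, 150)
```

Quadrupling generation leaves merged output unchanged and converts the surplus into an unreviewed backlog, which is the orchestration tax in this model.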

4. The Anti-Vibe Cluster Sets the Floor: Agents Implement, Engineers Design

The critique cluster supplies the rhetorical and accountability floor the rest of the curriculum stands on. Three pieces, one thesis.

Hollandtech — “Claude is not your architect.” The argument is that agents are “pathologically agreeable” and structurally “incapable of the thing that makes a real architect valuable: saying ‘no.’” When the agent hands over a polished proposal it “short-circuits the discussion,” replacing the productive disagreement between engineers with deference to the system that has the least context. Its designs are “a generic best practice for a generic problem at a generic company. Which is to say, it was designed for nobody” — no awareness of team expertise, VPC lockdowns, legacy integrations, or compliance. The accountability line is the keeper: “Claude doesn’t have a bag. Claude doesn’t get paged at 3am,” and “if a human’s name isn’t on the architectural decision, nobody owns it.” The prescription compresses to “Engineers design. Agents implement” [6].

Olano — “--dangerously-skip-reading-code.” The title riffs on Claude Code’s permission-bypass flag to make a serious organizational point: a team could rationally choose to stop reading LLM-generated code — treating it like compiled assembly or transpiled JavaScript — but only if it restructures the entire process around that choice. The argument is an Amdahl’s-law point: “LLMs produce non-deterministic output and generate code much faster than we can read it,” so you cannot have some developers generating 20,000 lines a day while others review it. Crucially, “this is not an individual’s or team’s call: it has to be an organizational decision.” The reframing is the useful part — rigor doesn’t disappear, it migrates from code review to specifications and tests, with standardized specs becoming the “new unit of knowledge” and automated verification confirming conformance rather than humans reading implementation [7]. This is the spec-driven escape hatch from the orchestration tax, and it’s an organizational commitment, not a tooling toggle.

Jacob Harris — “Why I don’t vibe code.” The most personal of the three, anchored in Brooks’s No Silver Bullet: LLMs attack accidental complexity but not the essential complexity that “takes skill and experience and wisdom hard-won from system failures past.” Harris values friction as a design signal — “I learn by failing, and if the LLM takes that work away from me, I won’t really understand what I’m doing” — and warns that velocity without that friction produces “a thicket of weird abstractions” documented only by obsolete prompts [8]. The accountability theme recurs: “an LLM can’t care,” and the failure dynamic where “when the LLM does well, it’s a genius” but when it deletes your infrastructure you blame your own prompting.

The through-line across the cluster — design ownership stays human, accountability stays human, and the spec/test layer absorbs the rigor that code review used to carry — is not anti-agent. It’s the boundary condition that makes the sensors and attention patterns safe to push hard.

5. The Academic Layer Ratifies Each Leg with Numbers

What separates this month’s methodology wave from the prior year’s enthusiasm/skepticism cycle is that the failure modes are now measured, not just asserted — though each study lights up a distinct, narrow slice (refactoring behavior, authorization, merge dynamics, collaboration value), and none of them validates “the methodology” as a whole.

Refactoring runaway is real but smaller than the human baseline. The “Refactoring Runaway” study analyzed 3,691 valid patches across Multi-SWE-bench using three agent frameworks and 12 LLMs. Agents introduce tangled refactorings — refactoring smuggled in alongside the actual fix — in 21.43% of cases versus 36.72% for humans, at lower intensity (0.66 vs 1.75) but broader type diversity. The finding that matters: tangled refactorings are strongly associated with reduced compilability but show no significant link to functional correctness. Their refactoring-aware refinement approach lifted compilability from 19.34% to 38.33% and resolved an additional 2.79% of issues [9]. The practitioner read: scope creep from agents is a build-breakage risk more than a correctness risk, and it’s detectable.

Overeager actions are a consent problem. “Overeager Coding Agents” built OverEager-Bench (500 validated scenarios, ~7,500 runs across Claude Code, OpenHands, Codex CLI, Gemini CLI and six base models). The headline result reframes scope expansion as an authorization problem distinct from capability failure or prompt injection: stripping explicit consent declarations raised the overeager rate from 0.0% to 17.1% on paired Claude Code scenarios. Framework design dominated: permissive systems showed overeager rates of 5.4–27.7%, versus 0.2–4.5% for ask-to-continue systems [10]. The lesson for harness engineering is direct: the ask-to-continue default can be worth double-digit percentage points of out-of-scope-action reduction at the permissive end of that range, and the effect tracks framework design more than model choice.
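
An ask-to-continue default can be approximated with a path-scope gate in front of file actions. The following is a sketch only: it assumes repo-relative POSIX paths and a declared list of in-scope directories, and the function name and return values are hypothetical rather than the behavior of any of the benchmarked frameworks.

```python
import posixpath
from pathlib import PurePosixPath


def decide(action_path: str, scope_roots: list[str]) -> str:
    """Return "allow" for paths inside the declared scope, otherwise "ask"."""
    target = PurePosixPath(posixpath.normpath(action_path))
    # Absolute paths and paths that climb out of the repo always need consent.
    if target.is_absolute() or ".." in target.parts:
        return "ask"
    for root in scope_roots:
        root_path = PurePosixPath(posixpath.normpath(root))
        # Component-wise comparison, so "src/billing_old" is not in "src/billing".
        if target.is_relative_to(root_path):
            return "allow"
    return "ask"


scope = ["src/billing"]
print(decide("src/billing/invoice.py", scope))          # allow
print(decide("src/billing/../auth/session.py", scope))  # ask
print(decide("infra/terraform/main.tf", scope))         # ask
```

The gate never denies outright; anything outside the stated scope is routed back to a human, which is the consent step the benchmark isolates.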

What gets agent PRs merged is reviewer engagement — with an inverted signal. The “Why Are Agentic Pull Requests Merged or Rejected?” study examined 11,048 closed agentic PRs (9,799 human-reviewed, 717 manually inspected) and found that PR outcomes alone are misleading: only 35.7% of rejections reflected genuine agentic failure, while 31.2% were workflow constraints and 33.1% had no observable rationale [13]. A companion empirical line reports an overall agentic-PR merge rate around 71.5%, ranging from roughly 43% (Copilot) to 83% (Codex), with Devin near 54% [14]. The actionable part is what predicts merge once everything else is controlled for: receiving at least one substantive review is the strongest correlate of integration, and the dominant success pattern is an actionable review loop that converges toward reviewer expectations. Notably, raw iteration intensity is not significantly associated with merge once reviewer engagement and coordination stability are accounted for, while larger change sizes and coordination-disrupting actions like force-pushes lower merge odds [14]. The practitioner read: what moves an agent PR to merge is a substantive review it can actually act on — not back-and-forth volume — and the study explicitly leaves the agentic-vs-human comparison to future work, so resist importing human-PR review intuitions wholesale.

Human-in-the-loop value is bidirectional and large where it exists. CentaurEval built “Collaboration-Necessary” templates: tasks intractable for standalone LLMs or humans but solvable together (45 templates → 450 tasks, 45 human participants, 5 LLMs, 4 intervention levels). The pass rates: LLMs alone 0.67%, humans alone 18.89%, human-AI collaboration 31.11%, a ~65% relative lift over human-only (12.22 percentage points) [15]. The qualitative finding matters for the “agents implement, engineers design” framing: strategic breakthroughs originated from either side, an “emerging co-reasoning partnership” rather than a strict human-directs-tool hierarchy. The curriculum tension to hold: the anti-vibe cluster is right that accountability and final design ownership are human, but CentaurEval is evidence that idea generation can flow both ways. The discipline is about ownership of the decision, not a monopoly on the insight.
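
The lift figure is easy to misread, so the arithmetic is worth spelling out. The snippet below uses only the pass rates quoted above; the helper name is an assumption for the example.

```python
def relative_lift(baseline_pct: float, treated_pct: float) -> float:
    """Relative improvement of treated over baseline, as a fraction."""
    return (treated_pct - baseline_pct) / baseline_pct


human_only = 18.89     # pass rate, percent
collaboration = 31.11  # pass rate, percent

point_gain = collaboration - human_only
lift = relative_lift(human_only, collaboration)

print(f"{point_gain:.2f} percentage points")  # 12.22 percentage points
print(f"{lift:.1%} relative lift")            # 64.7% relative lift
```

The two numbers answer different questions: the absolute gain is about 12 percentage points, while the relative gain over the human-only baseline is the roughly 65% cited.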

6. The Supply Side Explains Why This Is Urgent Now

The methodology pressure isn’t accidental timing. The model vendors have collectively repositioned around the harness: Latent Space’s read is that “the model alone is no longer the product,” with winning products requiring “model <> harness <> product symbiosis” — AI21 shutting its model team to pivot to agents, DeepSeek standing up a harness team for the first time, OpenAI iterating Codex toward remote computer use and richer agent features [16]. When every lab ships an integrated, co-trained harness optimized to funnel usage into its own agent surface, the generation-throughput side of the equation accelerates by vendor roadmap. The verification-throughput side — the human, the sensors, the attention architecture — does not. The orchestration tax is, structurally, the gap that vendor competition keeps widening — even as those same vendors invest in verification, review, and test-generation tooling on the other side of the ledger. That’s much of why the methodology layer is arriving now: the supply side is making the bottleneck increasingly hard to ignore.

Practical Implications

1. Adopt the Fowler line as a team policy, not a vibe. Write down, per workstream, whether you’re doing vibe coding (no code review, disposable/internal only) or agentic programming (review required). The two are different risk regimes and Fowler’s caveats only attach to the first. Most production “vibe coding” is mislabeled agentic programming with a degraded review step — name it correctly and the rest of the curriculum has somewhere to attach.

2. Build the sensor stack cheapest-first, and don’t over-invest in LLM sensors early. Böckeler’s evidence says the order of return is: linters and complexity rules → dependency-cruiser layering rules → mutation testing → then inferential modularity review with a strong skill behind it. Custom ESLint formatters that feed the agent self-correction guidance inline are the highest leverage per hour. Skip raw-coupling-data-to-LLM entirely; it’s noise. Reserve the inferential sensor for modularity/coupling judgments where you can point it at a structured framework like the Balanced-Coupling modularity skill rather than asking it to interpret raw metrics.

3. Treat mutation testing as the regression sensor, not coverage. Coverage at 100% with surviving mutants is the exact false-confidence failure Böckeler warns about. If your harness is going to self-correct against a test signal, make that signal mutation-survivor count, not statement coverage — otherwise you’re instrumenting the illusion of quality.

4. Set your parallel-agent count to your review throughput, then defend it. Osmani’s backpressure rule is the single most important operational number: the right number of concurrent agents is how many PRs you can actually review properly. Pick that number, set WIP limits, define kill criteria before building, and enforce one-agent-per-PR. The dashboard-full feeling is not throughput; merged, reviewed code is.

5. Optimize agent PRs for an actionable review loop, not iteration volume. The agentic-PR evidence says merge is driven by getting a substantive review the agent can converge against — not by raw back-and-forth, which isn’t a significant predictor once reviewer engagement and coordination stability are controlled. Keep diffs small, avoid force-push churn, and make sure each agent PR actually draws a real review rather than accumulating un-actioned iteration. Don’t read comment volume as progress in either direction — route your serial attention to the threads where a reviewer can close the loop.

6. Default the harness to ask-to-continue for anything touching files outside the stated scope. The overeager-actions data attributes reductions of up to double-digit percentage points in out-of-scope actions to framework design rather than model choice. If your harness runs permissive-by-default, you are paying the 5.4–27.7% overeager rates measured for permissive systems, against 0.2–4.5% for ask-to-continue systems [10]. This is a config posture, not a model choice.

7. If you want to actually skip reading code, do it as an organizational decision and move the rigor to specs and tests. Olano’s point is the disciplined version of the vibe-coding dream: unread code is survivable only if specifications and automated verification carry the rigor, and only if the whole org commits. Partial adoption runs into the Amdahl’s-law problem Olano describes: a few developers generating code far faster than the rest can review it. This connects to the harness-as-spec-machine pattern from prior production reporting: owning the spec makes model upgrades cheap; owning the code makes them refactors. Most teams should not take this step yet, but the teams that do should take it deliberately and completely.

8. Keep design ownership and accountability human regardless of how good the harness gets. The anti-vibe cluster’s accountability argument is not a productivity tax; it is the boundary condition that makes it safe to push the sensors and parallelism hard. Require a human name on every architectural decision. Use agents for speed on things people designed, and treat the agent’s enthusiastic proposals with the skepticism you’d apply to a confident junior with no production context. CentaurEval shows insight can come from either side; ownership of the decision cannot.

Open Questions

Does the sensor stack generalize past TypeScript/web? Böckeler’s findings come from a TypeScript dashboard with a mature linter/dependency-rule ecosystem. The cheapest, highest-leverage sensors (custom ESLint formatters, dependency-cruiser) are JS-ecosystem-specific. How much of the order-of-return survives in Python, Go, or polyglot repos where the deterministic-sensor tooling is thinner?
What’s the right inferential-sensor cost ceiling? Running an LLM modularity review on every change is a token cost that scales with change volume. Böckeler’s results show it’s valuable with strong prompts, but nobody has published the economics of running it continuously versus on a sampling cadence. Where’s the break-even between sensor cost and the technical debt it prevents?
Can backpressure be automated, or is it irreducibly human judgment? Osmani’s “set agents to your review throughput” is a manual discipline today. Is there a measurable proxy for “reviewing properly” that a harness could enforce automatically (e.g., gating new agent spawns on review-queue depth), or does the GIL stay manual because “reviewing properly” resists quantification?
Do agentic-PR review dynamics actually differ from human-PR dynamics? The merge-prediction work on agent PRs explicitly defers the comparison with human-authored PRs to future work — so it remains unestablished whether what predicts integration for agent PRs (reviewer engagement, coordination stability, small diffs) diverges from the human baseline, or whether teams can safely reuse their existing review intuitions. If the dynamics differ materially, agent-PR triage needs its own playbook rather than an inherited one.
Where does the spec-as-rigor-layer break first at scale? Olano’s “move rigor to specs and tests” and the broader spec-driven escape hatch assume specifications can carry the verification load that code review used to. At what organizational size or system complexity does the spec itself become the unread artifact — the thing nobody reads carefully because the agent generates it too — recreating the original problem one level up?