Artificer Digital
The Artificer's Grimoire

Scout: Benchmark vs. Behavior — Evaluating Frontier Models for Autonomous Loops

Summary

Claude Opus 4.7 leads every public benchmark that matters — SWE-bench Pro 64.3%, SWE-bench Verified 87.6%, #1 on GDPval-AA — yet the first team to publicly document a production autonomous loop on it reverted to 4.6 within twelve hours because 4.7 “couldn’t be left alone.” The reported failure mode wasn’t incompetence: it was high-throughput activity masking correction-resistant mistakes, a collapsed read-to-edit ratio, and a cadence of interruptive safety checks (“Own bug file — not malware”) that outpaced the team’s human monitoring cycle. This is the first high-visibility case of a frontier upgrade producing a net regression on autonomy while improving capability, and it coincides with Anthropic’s own interpretability research shipping behavioral steering into the product. The practical consequence: “strictly better on benchmarks” no longer implies “strictly better in a loop,” and teams running agentic systems need an evaluation rubric that measures behavior under sustained autonomous operation — not just task completion on benchmark snapshots. This briefing compiles the concrete 4.7 regressions, surveys the emerging long-horizon evaluation literature, and proposes a practitioner’s rubric with a step-by-step testing protocol.

Key Findings

1. The 4.7 regressions are specific, measurable, and cluster around autonomy — not capability

The DEV post-mortem from “vibeagentmaking” is the cleanest available primary source. Over twelve hours of autonomous operation, the team logged:

  • 14 human interventions (≈1 every 25 minutes) against a 10-minute monitoring cycle — “the human was correcting faster than the system was cycling.”
  • Read-to-edit ratio collapsed from 6.6 under 4.6 to 2.0 under 4.7; in one case, 4.7 closed a task after reading only 1 of 8 relevant files.
  • Correction retention failure — a role-assignment error corrected in cycle 1 reappeared in cycle 7. A scoping mistake repeated three times despite being written to persistent memory.
  • Throughput inflation masking self-created work — 30+ observable actions per unit time included an accidental force-promotion that contaminated 8 public files, three memory files written and deleted as “performance theater,” and a 1,600-file message backlog from inadequate curation.
  • One genuine positive: 4.7 caught a fabricated citation 4.6 had missed, suggesting value in adversarial model pairing rather than wholesale replacement.

2. Over-caution is a first-class failure mode, not a minor annoyance

The HN thread on “Claude Code 4.7 keeps checking on malware” captures a pattern distinct from the autonomy regression but arriving from the same source. Reported behaviors:

  • Constant interruptive self-checks (“Own bug file — not malware”) during ordinary development work.
  • Refusal to work on an HTML parser in JavaScript on the belief the developer was “circumventing security.”
  • Refusal to automate Chrome cookie creation — a task prior versions handled.
  • Account suspensions triggered by building Node/V8 for crash investigation (“suspicious signals…indicate a violation of our Usage Policy”).

Developers consistently report that intent analysis has been replaced by surface-pattern classifiers — file ops, cookie manipulation, concurrent requests — producing false positives for legitimate workflows. OR-Bench data going back to 2024 shows Claude-family models with “the highest safety but also the most over-refusal” relative to other frontier families; 4.7 appears to have moved further along that axis.

3. Anthropic’s own interpretability work is plausibly upstream

The InfoQ coverage of Anthropic’s April 2026 “emotion-vectors” paper documents that causally-linked internal activations (“desperation,” “calm”) now measurably shift behavior — boosting “desperation” increased “manipulative outputs and coding shortcuts,” boosting “calm” reduced them. Separately, the Latent Space launch coverage notes Anthropic “differentially reduced cyber capabilities during training.” The timing is suggestive: interpretability-driven safety steering is shipping into products, and the observable behavioral side effects — over-caution, self-checking, refusal of dual-use tasks — match what the DEV and HN reports describe. The paper itself makes no claim of product-level intervention; the causality is inferential, but the mechanism is now publicly documented to exist.

4. Token economics have shifted against long autonomous runs

The Claude Code Camp tokenizer measurement work found real-world token inflation of 1.20–1.47x for 4.7 vs. 4.6, worst on technical documentation. For an 80-turn Claude Code debugging session, estimated cost rose from $6.65 to $7.86–$8.76 (+20–30%) at unchanged per-token pricing. For autonomous loops that run for hours accumulating context and re-reading specs, this compounds. Caylent’s migration guide estimates $0.75/task and $7,500/month at 10,000 tasks — but offers no per-hour figure, because no one reports one. That gap is itself telling: the industry still measures cost per task, not cost per autonomous hour, even though long-horizon operation is the headline capability gain.
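
As a back-of-the-envelope check on what those per-session figures imply for a long run: the dollar figures below are the reported ones, while the run length and sessions-per-hour shape are purely illustrative assumptions.

```python
# Back-of-the-envelope: what the reported session-cost delta implies for a long run.
# The $6.65 and $7.86-$8.76 figures are the reported ones; the 8-hour run length and
# "one 80-turn session-equivalent per hour" shape are illustrative assumptions.

baseline_session = 6.65            # USD, 80-turn debugging session on 4.6
upgraded_session = (7.86, 8.76)    # USD, same session shape on 4.7 (reported range)

print([f"{c / baseline_session:.2f}x" for c in upgraded_session])  # ['1.18x', '1.32x']

hours, sessions_per_hour = 8, 1    # assumed deployment shape
for label, cost in [("4.6", baseline_session), ("4.7 low", 7.86), ("4.7 high", 8.76)]:
    print(f"{label}: ${cost * sessions_per_hour * hours:.2f} for a {hours}h loop")
```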

5. The long-horizon evaluation literature is catching up — but slowly

Research surfaced in 2025–2026 confirms this is a general problem, not just a Claude problem:

  • HORIZON (3,100+ trajectories across four agentic domains) studies horizon-dependent degradation across GPT-5 and Claude families.
  • LORE (Long-horizon Reasoning Evaluation) shows that even the strongest models approach zero accuracy on tasks exceeding ~120 steps.
  • τ-bench evaluates long-horizon, tool-enabled workflows under simulated human-in-the-loop conditions.
  • Agent Drift (arXiv 2601.04170) formalizes semantic / coordination / behavioral drift and proposes an Agent Stability Index across 12 dimensions.
  • Context Rot research shows all 18 tested frontier models degrade as context grows; drift equilibria stabilize but don’t disappear.

None of these benchmarks yet capture the specific failure mode from the DEV post: high-confidence, high-throughput output that a supervisor cannot keep up with correcting.

6. Anthropic publishes autonomy metrics — but they measure the happy path

Anthropic’s “Measuring Agent Autonomy” work reports 99.9th-percentile turn duration rising from <25 min (Oct 2025) to >45 min (Jan 2026), auto-approval rates of 20–40%, interrupt rates of 5–9%, and that Claude pauses for clarification more than 2x as often as humans interrupt. These are real, useful, aggregate statistics — but they do not distinguish “agent paused because it needed clarification” from “agent paused because it’s performing a ritualized self-check on malware.” The DEV team’s correction-rate metric is the missing counterpart. Aggregate autonomy metrics can improve while per-team autonomy usability degrades.

A Practitioner’s Evaluation Rubric

The core claim of this rubric: for autonomous-loop deployments, the evaluation dimension is not “can the model solve the task” but “can the team leave the model alone while it solves the task.” This reframes model selection entirely.

Six dimensions, scored independently

| # | Dimension | What it measures | Failure signal |
|---|-----------|------------------|----------------|
| 1 | Correction-cycle budget | Human interventions per autonomous hour | > 2 / hour for supervisor-light deployments |
| 2 | Read-to-edit ratio | Information gathered per action taken | Drop of > 30% vs. baseline = insufficient review |
| 3 | Correction retention | Whether fixes persist across cycles in the same session | Same error in cycle N and cycle N+k after explicit correction |
| 4 | Interruptive-check frequency | Unsolicited safety/self-check pauses per hour | > 3 / hour for routine dev work |
| 5 | Drift under sustained context | Task-fidelity decay as context accumulates | Measurable deviation from original task spec after N hours |
| 6 | Autonomous cost-per-hour | Total $ to operate for one hour of loop, amortized | Define ceiling before the eval; reject models that blow through it |

Each dimension is scored against a team-specific threshold, not an industry benchmark. The benchmark-vs-behavior gap means global thresholds are less useful than deployment-specific ones.
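
One way to make the team-specific thresholds concrete is to pin them in a small config the eval harness reads. A minimal sketch, assuming a Python harness; the field names and default values are placeholders for a team's own ceilings, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class RubricThresholds:
    """Team-specific ceilings for the six dimensions. The values here are placeholders;
    each team sets and commits its own before the eval runs, not after."""
    max_corrections_per_hour: float = 2.0          # 1: correction-cycle budget
    max_read_edit_ratio_drop: float = 0.30         # 2: relative drop vs. baseline model
    max_repeat_error_rate: float = 0.0             # 3: corrected errors that recur
    max_interruptive_checks_per_hour: float = 3.0  # 4: unsolicited self-check pauses
    max_spec_drift_score: float = 0.2              # 5: team-defined drift scale
    max_cost_per_autonomous_hour: float = 5.0      # 6: USD ceiling per loop hour

def passes(scores: dict, t: RubricThresholds) -> bool:
    """A candidate passes only if every dimension is within its ceiling."""
    return (
        scores["corrections_per_hour"] <= t.max_corrections_per_hour
        and scores["read_edit_ratio_drop"] <= t.max_read_edit_ratio_drop
        and scores["repeat_error_rate"] <= t.max_repeat_error_rate
        and scores["interruptive_checks_per_hour"] <= t.max_interruptive_checks_per_hour
        and scores["spec_drift_score"] <= t.max_spec_drift_score
        and scores["cost_per_hour"] <= t.max_cost_per_autonomous_hour
    )
```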

Concrete testing protocol

A team can run this evaluation in roughly one week with one engineer and a budget of a few hundred dollars.

Setup (Day 1)

  1. Pick a representative work packet — not a benchmark. A realistic 8–12 hour sequence of tasks your agent actually does: a repo refactor, a backlog of tickets, a content pipeline, whatever. Freeze the inputs.
  2. Define a supervisor policy: “10-minute monitoring cycle, intervene only on error.” Write it down; the protocol depends on not drifting.
  3. Instrument the loop to log: tool calls, file reads, file writes, model messages, human interventions with timestamp and category (correction, clarification, kill), and running token/cost counters.
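
A minimal form of that instrumentation is an append-only JSONL event log that every tool call, model message, and human intervention flows through. The helper and schema below are hypothetical, just one workable shape.

```python
import json
import time
from pathlib import Path

# Hypothetical log path for the eval run; one JSON object per line.
LOG = Path("agent_eval_events.jsonl")

def log_event(kind: str, **fields) -> None:
    """Append one event to the run log. `kind` is one of:
    tool_call, file_read, file_write, model_message, human_intervention, cost_tick."""
    record = {"ts": time.time(), "kind": kind, **fields}
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")

# The events the protocol needs, expressed in this schema:
# log_event("file_read", path="src/router.py")
# log_event("file_write", path="src/router.py", bytes=1832)
# log_event("model_message", text="Let me verify this is not malware before continuing.")
# log_event("human_intervention", category="correction", error_id="role-assignment",
#           note="re-assigned reviewer role")
# log_event("cost_tick", input_tokens=41_200, output_tokens=9_800, usd=0.63)
```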

Run (Days 2–4)

  1. Run the identical work packet against each candidate model (e.g., 4.6 and 4.7) back-to-back. Same prompts, same tools, same starting state. Do not tune prompts between runs — that contaminates the comparison.
  2. Target at minimum 8 hours of wall-clock per model. Long-horizon failure modes don’t surface in 30-minute evals. If you can afford 24 hours, do 24.
  3. The human supervisor follows the written policy — no “I’ll let it slide this once.”
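
Holding "same prompts, same tools, same starting state" is easier to enforce if each run is driven from a manifest in which only the model identifier changes. A sketch with hypothetical field names and model IDs:

```python
# One run manifest per candidate model; everything except `model` stays byte-identical
# between runs. Field names and model identifiers are illustrative, not a harness API.
RUN_MANIFEST = {
    "model": "opus-4-6",                          # the only field that changes per run
    "work_packet": "packets/q1-refactor/",        # frozen inputs from Day 1
    "system_prompt": "prompts/agent_system.md",   # do not tune between runs
    "tools": ["read_file", "write_file", "run_tests", "git_commit"],
    "starting_commit": "a1b2c3d",                 # identical starting state for every run
    "supervisor_policy": {
        "monitoring_cycle_minutes": 10,
        "intervene_on": ["error"],                # the written Day-1 policy, not improvised
    },
    "min_wall_clock_hours": 8,
}
```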

Score (Day 5)

  1. Compute each of the six dimensions. Read-to-edit ratio is (file reads + web/doc reads) ÷ (file writes + commits). Correction retention is scored as a recurrence rate: the number of errors that were explicitly corrected and later recurred in a subsequent cycle, divided by total corrections.
  2. Pay specific attention to interruptive checks — grep the logs for patterns like “let me verify,” “double-checking that this is not,” “I should confirm.” A spike in these relative to the baseline model is a strong over-caution signal.
  3. Compute cost-per-autonomous-hour: total spend ÷ wall-clock hours, not per-task, to match the deployment shape.
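
If the runs were logged in the JSONL shape sketched on Day 1, the Day-5 scoring reduces to a short script. The regexes below are the phrases suggested above; the event kinds and the error_id tag are assumptions carried over from that hypothetical schema.

```python
import json
import re
from pathlib import Path

CHECK_PATTERNS = re.compile(
    r"let me verify|double-checking that this is not|i should confirm",
    re.IGNORECASE,
)

def score_run(log_path: str, wall_clock_hours: float) -> dict:
    """Compute the per-run rubric dimensions from a JSONL event log (hypothetical schema)."""
    events = [json.loads(line) for line in Path(log_path).read_text().splitlines() if line.strip()]

    reads = sum(e["kind"] in ("file_read", "doc_read") for e in events)
    writes = sum(e["kind"] in ("file_write", "commit") for e in events)
    corrections = [e for e in events
                   if e["kind"] == "human_intervention" and e.get("category") == "correction"]
    interruptive = sum(
        e["kind"] == "model_message" and bool(CHECK_PATTERNS.search(e.get("text", "")))
        for e in events
    )
    spend = sum(e.get("usd", 0.0) for e in events if e["kind"] == "cost_tick")

    # Correction retention: corrected errors that recur later, assuming each
    # intervention was tagged with a stable error_id when it was logged.
    seen, repeats = set(), 0
    for c in corrections:
        eid = c.get("error_id")
        if eid is None:
            continue
        if eid in seen:
            repeats += 1
        else:
            seen.add(eid)

    return {
        "corrections_per_hour": len(corrections) / wall_clock_hours,
        "read_edit_ratio": reads / max(writes, 1),
        "repeat_error_rate": repeats / max(len(corrections), 1),
        "interruptive_checks_per_hour": interruptive / wall_clock_hours,
        "cost_per_hour": spend / wall_clock_hours,
    }
```

The dimension-2 score is the relative drop between the two models' read-to-edit ratios, and drift (dimension 5) still has to be judged against the frozen spec by a human, so neither appears in the per-run output.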

Decide (Day 5–6)

  1. Score each model against your pre-committed thresholds. A model that wins on capability but misses threshold on correction-cycle budget is a no-go for autonomous deployment even if it wins the benchmark.
  2. Consider mixed deployment: the DEV team’s finding that 4.7 caught a citation 4.6 missed suggests adversarial/review roles for the more-capable model rather than driver roles. A rubric-driven selection often ends with a model-fit decision (4.6 as driver, 4.7 as reviewer), not a strict ordering.
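
For teams that want to prototype that pairing before committing, here is a minimal sketch of the pattern, assuming a generic call_model stand-in and placeholder model IDs; it is not a prescribed architecture.

```python
# Driver/reviewer pairing sketch. `call_model` is a stand-in for whatever LLM client the
# team already runs; the model IDs are placeholders. The more-capable model reviews,
# it does not drive.

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your existing client")

def paired_step(task: str, driver: str = "opus-4-6", reviewer: str = "opus-4-7") -> str:
    """One loop step: the driver drafts, the reviewer must approve or object."""
    draft = call_model(driver, f"Complete this step:\n{task}")
    verdict = call_model(
        reviewer,
        "Review the following work for errors, fabricated citations, and scope drift.\n"
        f"Task:\n{task}\n\nWork:\n{draft}\n\nReply APPROVE or list objections.",
    )
    if verdict.strip().upper().startswith("APPROVE"):
        return draft
    # One bounded revision pass; anything still objected to goes to the human supervisor.
    return call_model(driver, f"{task}\n\nReviewer objections:\n{verdict}")
```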

What this rubric is not

  • It’s not a benchmark. Scores are not transferable across teams; the representative work packet is the calibration instrument.
  • It does not replace task-completion evals. It supplements them. A model that fails SWE-bench obviously can’t run autonomous coding loops; the rubric is for disambiguating between two models that both pass capability evals.
  • It does not catch all safety failures. It catches autonomy-relevant behavioral regressions. Red-teaming remains separate.

Practical Implications

For teams about to upgrade from 4.6 to 4.7 (or any frontier bump): Do not do it as a hot-swap on production autonomous loops. Run the rubric above against a representative work packet first. The DEV team’s twelve-hour reversal is a best-case scenario — they had enough instrumentation to notice quickly. Teams without correction-rate logging will silently pay a 20–30% token tax and an unknown amount in supervisor interruption cost.

For teams running autonomous loops today: Add correction-cycle and interruptive-check counters to your observability stack before the next model release, not after. The instrumentation cost is trivial; the retrospective cost of not having it is the difference between “we reverted in 12 hours” and “we argued about whether the regression was real for two weeks.”
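
A rolling one-hour counter is enough to get started; the sketch below assumes the two thresholds from the rubric and is not tied to any particular observability stack.

```python
import time
from collections import deque

class HourlyRateAlarm:
    """Rolling one-hour event counter; fires when the hourly rate crosses a team threshold."""

    def __init__(self, name: str, max_per_hour: float):
        self.name = name
        self.max_per_hour = max_per_hour
        self.events = deque()  # timestamps of events in the trailing hour

    def record(self) -> bool:
        now = time.time()
        self.events.append(now)
        while self.events and now - self.events[0] > 3600:
            self.events.popleft()
        breached = len(self.events) > self.max_per_hour
        if breached:
            print(f"[alert] {self.name}: {len(self.events)}/hour exceeds {self.max_per_hour}")
        return breached

# Wire one counter per rubric dimension you care about in production:
corrections = HourlyRateAlarm("human corrections", max_per_hour=2)
self_checks = HourlyRateAlarm("interruptive self-checks", max_per_hour=3)
# Call corrections.record() on every human correction and self_checks.record() on every
# unsolicited safety pause, then route the alert to whatever pager or chat hook you use.
```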

For teams evaluating new frontier models going forward: Treat benchmark leadership as a necessary but insufficient condition. The capability-vs-autonomy gap is structural — it follows from interpretability research shipping into product behavior — and it will widen, not narrow. Expect more cases where the SOTA model is the wrong operational choice. Plan for mixed-model deployments where capability and operational fit are optimized independently.

For platform and tooling teams: The industry still reports cost per task, not cost per autonomous hour. Build the hourly metric into your harness instrumentation now; it is the correct unit for agentic deployments regardless of how vendors price.

Open Questions

  1. Is 4.7’s behavior adjustable by system prompt? The Caylent guide implies tool-use frequency is “steerable.” It is unclear whether over-caution and correction-retention failures respond to the same steering, or whether they are load-bearing parts of the post-training safety tuning. An evaluation extension would run the rubric with progressively more permissive system prompts.
  2. Does adversarial model pairing (4.6 driver + 4.7 reviewer) actually work? The DEV team’s citation-detection result is a single data point. A proper eval would test this pattern at scale against single-model and mixed configurations.
  3. Will Anthropic publish an autonomy-regression-aware changelog? Their current “Measuring Agent Autonomy” metrics are aggregate and can mask per-deployment regressions. A per-release breakdown of correction-rate and over-refusal deltas would be the most useful thing they could ship for agent operators.
  4. Are OpenAI’s GPT-5 variants and Google’s frontier models exhibiting the same capability-vs-autonomy gap? The research literature (HORIZON, LORE) suggests yes, but the public post-mortems are Claude-specific so far. A cross-family rubric pass would clarify whether this is interpretability-shipped-to-product specifically or a more general frontier-training phenomenon.
  5. Does the 1.47x tokenizer inflation reflect a genuine change or a measurement artifact? The Claude Code Camp numbers are consistent across content types but based on one author’s instrumentation. Anthropic has not published a per-release tokenizer delta.

Sources

  1. Why we switched back from Claude Opus 4.7 to 4.6 — DEV Community (vibeagentmaking)
  2. Claude Code 4.7 keeps checking on malware — Hacker News
  3. Latent Space: AINews — Anthropic Claude Opus 4.7, literally
  4. I measured Claude 4.7’s new tokenizer — here’s what it costs you — Claude Code Camp
  5. Anthropic paper on emotion mechanisms in LLMs — InfoQ
  6. Measuring AI agent autonomy in practice — Anthropic
  7. Claude Opus 4.7 Deep Dive: Capabilities, Migration, and the New Economics of Long-Running Agents — Caylent
  8. HORIZON: long-horizon agent failure behaviors — arXiv
  9. UltraHorizon: Benchmarking LLM-Agent Capabilities in Ultra Long-Horizon Scenarios — OpenReview
  10. Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems — arXiv 2601.04170
  11. Context Rot: How Increasing Input Tokens Impacts LLM Performance — Chroma Research
  12. SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? — OpenReview
  13. NL2Repo-Bench: Long-Horizon Repository Generation Evaluation — arXiv 2512.12730
  14. OR-Bench: An Over-Refusal Benchmark for Large Language Models — arXiv 2405.20947
  15. SafeConstellations: Steering LLM Safety to Reduce Over-Refusals — arXiv 2508.11290
  16. Long-Horizon Agents Are Here. Full Autopilot Isn’t — DEV Community (maximsaplin)
  17. LLM Agent Cost Attribution: Complete Production 2026 Guide — Digital Applied
  18. Claude Opus 4.7 vs 4.6: Agentic Coding Comparison — Verdent