Scout Reports | The Artificer's Grimoire

July 25, 2026 29 sources

Scout: The Agentic Software Engineering Corpus — The Foundational Reading and a Measurement Layer Under Repair

The orientation corpus for agentic software engineering — the two field surveys, the SWE-bench/SWE-agent origin pair, the field-telemetry studies, and the METR productivity RCT — paired with the long-horizon benchmark wave (SWE-Bench Pro, FeatureBench, SWE-EVO, ProjDevBench, NL2Repo-Bench, RepoReason, SWE-Explore) and the validity critiques that determine how much any of those numbers are worth

Every team building agent infrastructure inherits an evaluation vocabulary it did not choose: resolve rate on a public benchmark. In 2026 that vocabulary came apart in public — the field's flagship benchmark was retired by its most prominent user, its designated successor was audited and found roughly a third broken, and a position paper made the structural case that a coding-agent score was never a model score at all. The foundational literature explains how the field got here; the benchmark wave shows where it is going; together they set what a practitioner should and should not conclude from a leaderboard.

July 25, 2026 22 sources

Scout: The Coding-Agent Control Plane Productizes — What the Claude Apps Gateway Standardizes

The Claude apps gateway as the first self-hosted, first-party operational control plane for a coding agent, with identity, per-group policy, usage telemetry, upstream routing, and spend caps consolidated into one customer-run tier, and how it compares to GitHub's vendor-hosted agent control plane, the hand-rolled proxies, the MCP server-access gateways, and the agent-identity layer it sits beside

Teams running Claude Code past a pilot have been assembling this tier by hand: an IdP integration, a forwarding proxy, an OTel collector, a budget dashboard. It now ships in the CLI binary with no license fee, which turns a build project into an adopt-or-not decision — and makes the enforcement boundary (what the gateway decides server-side vs. what it merely delivers to a laptop) the thing worth reading closely before the security review.

July 25, 2026 24 sources

Scout: Spend Governance for Credentialed Agents — Enforcing Budgets in Front of the Action

Where a dollar ceiling can actually bind when an autonomous agent holds real cloud credentials — the structural lag in cloud billing, the permission-boundary and blast-radius controls that do enforce pre-action, the narrow surfaces where dollar-denominated pre-authorization already ships (AgentCore Payments, inference gateways, Anthropic's Spend Limits API), and the gap between them

Cloud billing settles hours to a day behind an agent that can commit thousands of dollars in a single API call, which makes every dollar-denominated control built on billing data a detective control by construction. Teams handing agents cloud credentials need to know which controls bind before the action (service control policies, service quotas, session-scoped roles, lease-bounded accounts) and which only tell you afterward (budgets, anomaly detection, cost dashboards), because the two look identical on a governance slide and behave nothing alike at 3 a.m.

July 25, 2026 16 sources

Scout: The Drift-Control School — What the 2026 Spec-Driven Development Papers Actually Establish

Five 2026 arXiv papers on spec-driven development and long-horizon spec/architecture drift — Code-to-Contract, Spec Kit Agents, Citation Discipline, the Spec Growth Engine, and the Kitchen Loop — read together as a forming literature, with the enforcement mechanics separated from the evidence that supports them

Spec-anchored workflows and self-evolving codebases are the practitioner bet on keeping intent intact across long autonomous runs. An academic corpus is now forming around exactly that bet — and the useful part is the enforcement machinery, not the methodology framing. Teams can lift the drift gates, orphan-citation checks, and phase hooks without adopting any of the five frameworks whole.

July 16, 2026 19 sources

Scout: The Enforcement Point Moves Out of the Agent's Reach — the July 2026 Agent-Authorization Research Wave

A cluster of July 2026 research moving the authorization decision for agent tool calls outside the agent's own context: an off-host signing gateway (aiAuthZ), a kernel-resident governance layer (Governed MCP), execution-integrity manifests (CXI), and an action-graded severity scale. Read as one practitioner map, alongside MCP Enterprise-Managed Authorization reaching stable status

Teams running agents against real tools and data need to decide where the authorization decision for a consequential action lives. The July research converges on one answer — outside the agent's context, where attacker-writable input can't reach it — and the products are starting to ship it. Knowing which of these to adopt now (out-of-agent tool-call authorization, default-deny gateways, severity-graded evaluation) versus watch (cryptographic per-message receipts, kernel-resident governance) is the useful work.

July 16, 2026 13 sources

Scout: Context Compaction for Long-Horizon Agents — Techniques, Tradeoffs, and When Summaries Lie

Compaction — summarizing prior trajectory to continue a long-horizon agent under a compressed context — as a first-class technique: the design space (reversible reduction, lossy summarization, learned/RL-trained compression, structured eviction), the failure modes (silent context loss and belief drift), and practitioner guidance on when and how to compact

Any agent that runs long enough will exceed its context window mid-task, and the mechanism teams reach for — compress the history and keep going — is now shipping as an API primitive, an RL training objective inside a frontier model's production pipeline, and a measurable failure surface. Whether a compaction pass silently drops the one observation that mattered, or quietly rewrites the agent's model of the task while the benchmark still shows green, is a reliability decision teams are making today, mostly by default. Knowing which reduction strategies are reversible, where the information actually goes, and what to instrument is the difference between a durable long-horizon agent and one that confidently finishes the wrong job.

July 16, 2026 10 sources

Scout: When the Harness Regresses: Attributing Coding-Agent Quality to Scaffolding, Not the Model

How to attribute and diagnose coding-agent quality regressions — distinguishing a harness (scaffolding) change from a model change, and the diagnostic instruments (trajectory taxonomies, automated failure diagnosis) that make a regression visible when a pass/fail resolve rate cannot

When a coding agent starts shipping worse patches, the reflex is to blame the model: a silent downgrade, a bad checkpoint, a quantization change. But the scaffolding layer changes far more often than the weights, and controlled 2026 evidence shows how much of a quality swing it can drive on its own. A team that can't tell a harness regression from a model regression debugs the wrong layer for weeks; the fix is trajectory-level instrumentation, because resolve rate can't localize what broke.

July 10, 2026 12 sources

Scout: When the Agent's Input Reaches the OS — DuneSlide, Langflow, and the Coding-Agent Trust Boundary

Two 2026 disclosures — Cato Networks' 'DuneSlide' zero-click prompt-injection-to-OS-RCE flaws in Cursor, and Sysdig's JADEPUFFER agentic ransomware operation run through an internet-exposed Langflow instance — read against a persistent-state distributed-attack paper and the lethal-trifecta literature, to map the trust-boundary gap they share: an agent's decisions reaching the host without a boundary enforced outside its reach. What the boundary looks like across mainstream agentic IDEs, and concrete containment guidance for sandbox posture, tool-input treatment, and irreversible-action gating.

Both disclosures concern already-patched vulnerabilities, so neither is a live zero-day to scramble against. But both make concrete a failure that every team running an agentic IDE or agent framework now owns: the coding agent is a general-purpose interpreter that treats attacker-controlled content (an MCP result, a web page, a workspace file) as instructions, and its tool surface reaches the shell. DuneSlide shows the agent as the conduit: untrusted input riding a native tool through the sandbox to OS-level code execution. JADEPUFFER shows the agent as the operator: an LLM running a multi-stage intrusion largely on its own, though how much a human configured and steered it behind the scenes remains unsettled. The containment question is the same in both directions, and the default vendor posture doesn't answer it.

July 10, 2026 16 sources

Scout: Grading Jailbreaks — Anthropic's CJS Scale and the Dual-Use Classifier Taxonomy as a Governance Template

Anthropic's Cyber Jailbreak Severity (CJS) scale and four-band dual-use request classifier taxonomy — how the framework works, how it compares to CVSS-era severity scoring, and which pieces transfer to teams governing dual-use agent capabilities

Every team running agents with dual-use capability — code execution, security tooling, browser automation — faces the same three problems Anthropic just published its answers to: how to sort requests by risk, how to size the false-positive margin, and how to triage guardrail bypasses when they inevitably arrive. The framework is copyable; knowing which parts are template and which parts are Anthropic-specific is the useful work.

July 10, 2026 15 sources

Scout: From Prompt to Loop — The Loop Specification and the Software Factory

The loop specification as the new unit of agent engineering — what the artifact concretely contains, how it differs from a prompt template and from an autoresearch recipe, and how much of the org-scale 'software factory' framing is substantiated versus vendor-narrated

Teams running coding agents are being told to stop writing prompts and start writing loops, and vendors are selling the org-scale version as a 'software factory.' Whether to invest in loop specifications as versioned engineering artifacts — and whether to buy the factory story — are near-term architecture and procurement decisions. Knowing exactly what the artifact contains, and where the evidence stops and the marketing starts, is the difference between adopting a durable practice and buying a metaphor.

June 29, 2026 10 sources

Scout: The Harness Is a Cost Lever: Evaluating and Choosing Agent Harnesses

A practitioner's framework for evaluating and selecting a coding-agent harness independent of model choice — what to benchmark, how to measure token efficiency honestly, where to discount vendor parity claims, and when a third-party or meta-harness beats a vendor one

Agent economics has stopped being a model-selection problem and become a model-and-harness problem. With the model held fixed, the orchestration layer can swing both task-resolution rate and token cost by margins that dwarf the gain from a new model generation, so the team that can't evaluate a harness independently of the model is leaving a major cost and capability lever unpulled.

June 29, 2026 10 sources

Scout: Do Context Files Actually Help? AGENTS.md Under Empirical Scrutiny

Empirical evidence on whether repository-level context files (AGENTS.md / CLAUDE.md) improve coding-agent outcomes, what common misconfigurations look like, and how to keep instruction files from rotting

Teams writing AGENTS.md/CLAUDE.md on faith now have measurements. The evidence reshapes what belongs in a context file, how long it should be, and what to stop putting in it — directly affecting per-task cost and agent reliability on production repos.