Summary
Harness engineering crystallized as a named discipline in early April 2026 through three independent publications: Birgitta Bockeler’s practitioner framework on martinfowler.com, Anthropic’s three-agent harness design for long-running applications, and Benjamin Rombaut’s source-code taxonomy of 13 open-source coding agents. The convergence is not coincidental — it reflects the industry arriving at a shared conclusion that the scaffolding code surrounding a language model matters more than the model itself for production reliability. The core formula is now explicit: Agent = Model + Harness, where the harness encompasses control loops, tool orchestration, permission gates, context management, state persistence, and feedback mechanisms. The Claude Code source leak, which exposed 512,000 lines of harness-level TypeScript, served as an unintentional case study confirming that even the most capable models require deeply engineered runtime infrastructure. Teams that treat harness design as an afterthought are building on sand; those investing in structured control loops, layered permission models, and deliberate context strategies are seeing 2-5x reliability gains.
Key Findings
1. The Mental Model: Prompt, Context, Harness — A Nesting Hierarchy
The relationship between the three “engineering” disciplines is now well-defined and hierarchical. Prompt engineering optimizes a single model call. Context engineering optimizes what the model sees across an entire task — system prompts, retrieved documents, tool results, conversation history. Harness engineering encompasses both and adds the runtime infrastructure: tool orchestration, state management, error recovery, permission enforcement, observability, and multi-session coordination [1][5][14].
Bockeler’s framework introduces a particularly useful vocabulary: guides (feedforward controls that steer the agent before it acts — coding conventions, bootstrap instructions, architecture decision records) and sensors (feedback controls that catch problems after the agent acts — linters, type checkers, test suites). She further distinguishes between computational sensors (deterministic, fast, cheap — run on every change) and inferential sensors (LLM-based, slower, more expensive, non-deterministic — used selectively for semantic analysis) [1].
This guides-and-sensors model reframes harness engineering from “build a control system” to “design a feedback loop.” The practical question becomes: which guides reduce the agent’s error surface, and which sensors catch what the guides miss?
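To make the feedback-loop framing concrete, here is a minimal TypeScript sketch of a sensor pipeline in the spirit of Bockeler's distinction. All names (`Sensor`, `runSensors`, the example commands) are illustrative rather than drawn from any cited tool: computational sensors run on every change, and an inferential sensor is invoked only once the cheap, deterministic checks pass.

```typescript
import { execSync } from "node:child_process";

// Hypothetical sensor abstraction: computational sensors are deterministic
// shell commands; inferential sensors call a model and are used selectively.
interface SensorResult {
  sensor: string;
  ok: boolean;
  feedback: string; // what gets fed back into the agent's context
}

type Sensor = {
  name: string;
  kind: "computational" | "inferential";
  run: () => SensorResult;
};

const shellSensor = (name: string, cmd: string): Sensor => ({
  name,
  kind: "computational",
  run: () => {
    try {
      execSync(cmd, { stdio: "pipe" });
      return { sensor: name, ok: true, feedback: `${name}: passed` };
    } catch (err: any) {
      return { sensor: name, ok: false, feedback: String(err.stdout ?? err.message) };
    }
  },
});

// Run cheap, deterministic sensors on every change; only escalate to the
// expensive LLM-based review when everything deterministic already passes.
function runSensors(computational: Sensor[], inferential: Sensor[]): SensorResult[] {
  const results = computational.map((s) => s.run());
  if (results.every((r) => r.ok)) {
    results.push(...inferential.map((s) => s.run()));
  }
  return results;
}

// Example wiring; the commands are placeholders for a real project's checks.
const results = runSensors(
  [shellSensor("typecheck", "npx tsc --noEmit"), shellSensor("tests", "npm test --silent")],
  [{ name: "semantic-review", kind: "inferential", run: () => ({ sensor: "semantic-review", ok: true, feedback: "stub: would call an LLM reviewer here" }) }],
);
console.log(results.map((r) => r.feedback).join("\n"));
```

The ordering is the design choice worth noting: guides reduce the error surface before the agent acts, and the sensor pipeline escalates from cheap to expensive checks only when needed.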
2. Five Foundational Control Loop Primitives
Rombaut’s taxonomy paper analyzed 13 open-source coding agents across 12 architectural dimensions and identified five composable loop primitives that serve as building blocks for all observed control architectures [3]:
ReAct (Reason-Act-Observe). The standard cycle where the model reasons about the current state, selects a tool action, executes it, and observes the result. Used as the primary loop by 7 of 13 agents. Sequential, no backtracking.
Generate-Test-Repair. The agent generates code, runs tests, and uses failure output to drive regeneration. Creates a tight feedback loop — particularly effective when combined with existing test suites. Aider’s inner loop exemplifies this pattern.
Plan-Execute. A distinct planning phase produces a structured plan, followed by an execution phase that works through it step by step. AutoCodeRover’s search-then-patch separation is the canonical example. The scaffold, not the model, controls phase transitions.
Multi-Attempt Retry. Failed attempts trigger subsequent retries with accumulated history. The key design decision is what state carries between attempts — full history, reflections only, or structured error summaries.
Tree Search. Agents explore multiple solution branches rather than committing to a single path. Ranges from flat sampling (Agentless) through depth-first search (DARS-Agent) to full Monte Carlo Tree Search with reward backpropagation (Moatless Tools).
The critical finding: 11 of 13 agents compose multiple primitives rather than relying on a single control structure. AutoCodeRover embeds ReAct loops within its plan-execute phases. Aider layers generate-test-repair over a user-driven ReAct loop. The primitives are Lego bricks, not blueprints [3].
For teams designing new harnesses, this means choosing a primary loop and then asking: which secondary primitives improve reliability for our specific failure modes?
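As a concrete illustration of that composition, the sketch below layers a generate-test-repair inner loop inside a ReAct-style outer loop. It is a simplified skeleton under assumed interfaces: `callModel`, `applyTool`, and `runTests` are placeholders injected by the caller, not any surveyed agent's API. The point is where the composition lives, not how any of the 13 agents implement it.

```typescript
// Skeleton of a composed control loop: the concrete model, tool dispatcher,
// and test runner are injected so the loop itself stays model-agnostic.
interface ToolCall { tool: string; args: Record<string, unknown> }
interface ModelStep { reasoning: string; action: ToolCall | { done: true } }

interface HarnessDeps {
  callModel(history: string[]): Promise<ModelStep>;      // one LLM turn
  applyTool(call: ToolCall): Promise<string>;             // returns an observation
  runTests(): Promise<{ passed: boolean; log: string }>;  // computational sensor
}

// Outer loop: ReAct (reason, act, observe), bounded by a step budget.
// Inner loop: generate-test-repair, entered after every edit action.
export async function composedLoop(task: string, deps: HarnessDeps, maxSteps = 25): Promise<string[]> {
  const history: string[] = [`TASK: ${task}`];

  for (let step = 0; step < maxSteps; step++) {
    const { reasoning, action } = await deps.callModel(history);   // reason
    history.push(`THOUGHT: ${reasoning}`);
    if ("done" in action) break;

    history.push(`OBSERVATION: ${await deps.applyTool(action)}`);  // act + observe

    if (action.tool === "edit_file") {
      // Generate-test-repair layered inside the outer loop: run the suite and
      // feed failures back so the next model call repairs its own change.
      for (let attempt = 0; attempt < 3; attempt++) {
        const { passed, log } = await deps.runTests();
        if (passed) break;
        history.push(`TEST FAILURES (attempt ${attempt + 1}):\n${log}`);
        const repair = await deps.callModel(history);
        if ("done" in repair.action) break;
        history.push(`OBSERVATION: ${await deps.applyTool(repair.action)}`);
      }
    }
  }
  return history;
}
```

The structural takeaway is that the test-repair loop lives in the harness, not in the prompt, so its retry budget and the shape of the feedback it injects are explicit engineering decisions.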
3. Anthropic’s Three-Agent Harness: Separation of Concerns at the Agent Level
Anthropic’s engineering team published a detailed case study of a planner-generator-evaluator architecture for long-running application development (multi-hour sessions producing full-stack apps) [2][6]. The architecture makes three non-obvious design decisions:
File-based communication, not API calls. All three agents communicate through a single shared file on disk. Each agent appends structured sections and reads sections written by others. The file serves as the complete, append-only record: the plan, work log, sprint contracts, evaluation results, and feedback. This is intentionally low-tech — files survive crashes, are inspectable by humans, and impose no serialization overhead [2].
Sprint contracts before code. Before the generator writes any code, it negotiates a “sprint contract” with the evaluator — an explicit agreement on what “done” looks like. This addresses the problem of underscoping: without contracts, models start building before fully thinking through requirements and produce thin, under-featured output [6].
GAN-inspired adversarial evaluation. The evaluator uses Playwright MCP to interact with the running application the way a user would — clicking through pages, testing API endpoints, verifying database state. The evaluator is calibrated with few-shot examples and scoring criteria to prevent the self-evaluation bias that plagues single-agent systems. Separating evaluation from generation mirrors the generator-discriminator dynamic in GANs [2][6].
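A minimal sketch of what the first two decisions could look like in practice (not Anthropic's actual implementation): agents append named sections to one shared file, and the sprint contract is simply another section that the generator and evaluator both read. The file path and section format here are invented.

```typescript
import { appendFileSync, readFileSync } from "node:fs";

// Hypothetical shared-file protocol: each agent appends a tagged section and
// reads sections written by the others. The file is the append-only record.
const SHARED_FILE = "session.md"; // illustrative path

type AgentRole = "planner" | "generator" | "evaluator";

function appendSection(role: AgentRole, heading: string, body: string): void {
  appendFileSync(SHARED_FILE, `\n## [${role}] ${heading}\n${body}\n`);
}

function readSections(role: AgentRole): string[] {
  const text = readFileSync(SHARED_FILE, "utf8");
  // Split on section headings and keep the ones written by the given role.
  return text.split(/\n(?=## \[)/).filter((s) => s.startsWith(`## [${role}]`));
}

// A sprint contract is just a structured section agreed before any code is
// written: the evaluator later checks the running app against these items.
appendSection("generator", "Sprint contract (proposed)", [
  "- Users can create, edit, and delete notes",
  "- Notes persist across reloads (database-backed)",
  "- Evaluator verifies via UI click-through and API checks",
].join("\n"));

console.log(readSections("generator").join("\n---\n"));
```

Because the record is append-only plain text, a crashed agent can be restarted and re-read the file to recover its place, which is exactly the crash-survival property the low-tech choice buys.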
The cost tradeoff is stark: a solo agent built a 2D retro game maker in 20 minutes for $9, but the core features did not work. The three-agent harness took 6 hours and cost $200 but produced a substantially complete, functional product. The lesson is not “spend more” — it is that reliability at scale requires architectural investment, and the harness is where that investment lives [2].
A notable evolution: with the release of Opus 4.6, Anthropic dropped context resets from the harness entirely. Larger context windows and better long-range coherence meant each agent could maintain useful state without periodic clearing. This illustrates that harness design is model-dependent — improvements in the model can simplify the harness, and practitioners should revisit harness decisions as models improve [2].
4. The Claude Code Leak: A Production Harness Under the Microscope
On March 31, 2026, an npm packaging error shipped a 59.8 MB source map file with Claude Code version 2.1.88, exposing 512,000 lines of TypeScript across 1,906 files [4][7]. The leak provided an unprecedented look at harness-level decisions in a production coding agent:
Extended ReAct loop. Claude Code’s core loop runs four phases per turn: (1) automatic context compaction when context utilization nears 98%, (2) an optional thinking phase for pre-action reasoning at configurable depth, (3) an optional self-critique phase, and (4) the standard reason-act-observe cycle [7][8].
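A hedged sketch of what such a per-turn phase sequence might look like; the names, thresholds, and prompts below are illustrative, not taken from the leaked source.

```typescript
// Illustrative per-turn pipeline in the spirit of the extended loop described
// above: compaction check, optional thinking, optional self-critique, then the
// standard reason-act-observe cycle. All identifiers here are hypothetical.
interface TurnConfig {
  contextWindow: number;        // max tokens
  compactionThreshold: number;  // e.g. 0.98 -> compact near exhaustion
  thinkingDepth: "off" | "brief" | "extended";
  selfCritique: boolean;
}

interface TurnContext {
  tokensUsed: number;
  messages: string[];
}

async function runTurn(ctx: TurnContext, cfg: TurnConfig, model: (prompt: string[]) => Promise<string>) {
  // Phase 1: compact automatically when utilization nears the threshold.
  if (ctx.tokensUsed / cfg.contextWindow >= cfg.compactionThreshold) {
    const summary = await model(["Summarize the conversation so far, preserving open tasks.", ...ctx.messages]);
    ctx.messages = [summary];
  }
  // Phase 2: optional pre-action thinking at a configurable depth.
  if (cfg.thinkingDepth !== "off") {
    ctx.messages.push(await model([`Think (${cfg.thinkingDepth}) about the next step.`, ...ctx.messages]));
  }
  // Phase 3: optional self-critique before acting.
  if (cfg.selfCritique) {
    ctx.messages.push(await model(["Critique the plan above for gaps.", ...ctx.messages]));
  }
  // Phase 4: the usual reason-act-observe step (tool dispatch elided).
  return model(ctx.messages);
}
```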
Layered permission model. The system exposes roughly 19 permission-gated tools across file reads/edits, shell execution, git operations, web fetching, notebook editing, and MCP tool calls. Each tool has its own independent permission gate checked against a rule pipeline — it is not “the agent has filesystem access” but rather “the Read tool checks its permission gate before every invocation.” Six permission modes (default, acceptEdits, plan, auto, dontAsk, bypassPermissions) provide graduated trust levels [7][9].
Background permission classifier. In auto mode, a background classifier running on Sonnet 4.6 evaluates whether tool calls can proceed without user confirmation. This classifier is deliberately designed to prevent the model from persuading itself past the gate — the classification model is separate from the reasoning model [7][9].
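The per-tool gates plus a separate classifier can be sketched as a rule pipeline. This is an assumed structure to illustrate the idea, not code from the leak; only the mode names mirror those reported above.

```typescript
// Illustrative permission pipeline: each tool invocation passes through its
// own gate, and only "auto" mode consults a separate classifier model.
type PermissionMode = "default" | "acceptEdits" | "plan" | "auto" | "dontAsk" | "bypassPermissions";
type Decision = "allow" | "ask" | "deny";

interface ToolRequest { tool: string; args: Record<string, unknown> }

// A rule pipeline: the first rule that returns a decision wins.
type Rule = (req: ToolRequest, mode: PermissionMode) => Decision | undefined;

const rules: Rule[] = [
  (_req, mode) => (mode === "bypassPermissions" ? "allow" : undefined),
  (req) => (req.tool === "Read" ? "allow" : undefined),                           // reads: low risk
  (req, mode) => (req.tool === "Edit" && mode === "acceptEdits" ? "allow" : undefined),
  (req, mode) => (mode === "plan" && req.tool !== "Read" ? "deny" : undefined),   // plan mode: read-only
];

// In auto mode, a *separate* model classifies the call; the reasoning agent
// never gets to argue its own gate open.
async function classify(req: ToolRequest): Promise<Decision> {
  void req; // placeholder for a call to a small, independent classifier model
  return "ask";
}

async function checkGate(req: ToolRequest, mode: PermissionMode): Promise<Decision> {
  for (const rule of rules) {
    const decision = rule(req, mode);
    if (decision) return decision;
  }
  return mode === "auto" ? classify(req) : "ask"; // default: ask the user
}

// Example: an edit in default mode falls through the rules and asks the user.
checkGate({ tool: "Edit", args: { path: "src/app.ts" } }, "default").then(console.log);
```

Keeping each gate local to its tool means a new tool cannot silently inherit broad permissions; it must be added to the pipeline explicitly.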
Prompt cache optimization. promptCacheBreakDetection.ts tracks 14 cache-break vectors, and “sticky latches” prevent mode toggles from busting the cache. This is a harness-level performance optimization invisible to the user but critical for cost and latency at scale [4].
Multi-agent coordination via prompts, not code. The orchestration algorithm in coordinatorMode.ts is implemented as a prompt, not as imperative code. It manages worker agents through system prompt instructions like “Do not rubber-stamp weak work” and “You must understand findings before directing follow-up work.” This is a philosophical choice: use the model’s reasoning for coordination rather than encoding fixed workflows [4].
Anti-distillation mechanisms. The ANTI_DISTILLATION_CC flag sends anti_distillation: ['fake_tools'] in API requests. A separate undercover.ts module instructs the agent to never mention internal codenames. These are harness-level decisions about competitive protection — a dimension rarely discussed in academic treatments [4].
5. Seven Context Compaction Strategies — No Consensus
Rombaut’s taxonomy identifies seven distinct approaches to context compaction across the 13 agents studied, with no convergence toward a standard [3]:
- Hard truncation — naive conversation history truncation (simplest, loses information indiscriminately)
- Sliding window — fixed-size window of recent messages
- LLM-generated summarization — compress history into summaries (Aider)
- Selective tool result dropping — remove verbose tool outputs while preserving message structure
- Polling parameter — configurable output verbosity (SWE-agent)
- Verification probe — monitor context size and trigger compaction checks proactively (Gemini CLI)
- LLM-initiated compaction — the agent itself requests context reduction via a dedicated tool (Cline)
Claude Code uses a hybrid: automatic summarization at 98% context utilization, image/PDF stripping, and tool output offloading where outputs exceeding a threshold are persisted to disk with only head/tail tokens kept in context [7][9]. MCP tool outputs are capped at 25,000 tokens by default, with large results (up to 500,000 characters) automatically persisted to disk rather than held in context.
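A sketch of the offloading idea, with invented names and thresholds: oversized tool output is written to disk and replaced in context by its head, its tail, and a pointer to the full file the agent can re-read selectively.

```typescript
import { writeFileSync } from "node:fs";
import { randomUUID } from "node:crypto";

// Rough token estimate (about 4 characters per token); thresholds are invented.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);
const MAX_IN_CONTEXT_TOKENS = 25_000;
const KEEP_CHARS = 2_000; // characters of head and tail to keep in context

// If a tool result is too large, persist the full output to disk and keep only
// a head/tail excerpt plus the file path the agent can read back if needed.
function offloadIfLarge(toolName: string, output: string): string {
  if (estimateTokens(output) <= MAX_IN_CONTEXT_TOKENS) return output;

  const path = `/tmp/tool-output-${toolName}-${randomUUID()}.txt`;
  writeFileSync(path, output);

  const head = output.slice(0, KEEP_CHARS);
  const tail = output.slice(-KEEP_CHARS);
  return [
    `[${toolName} output: ${output.length} chars, truncated in context]`,
    head,
    `... (full output saved to ${path}; use the file tools to read more) ...`,
    tail,
  ].join("\n");
}

// Example: a verbose test run gets offloaded, a short one passes through untouched.
console.log(offloadIfLarge("bash", "x".repeat(400_000)).slice(0, 200));
```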
The lack of convergence here is itself a finding. Context compaction is an open design problem with no external constraint driving standardization — unlike tool interfaces or edit formats, where practical requirements force convergence [3].
6. The Convergence-Divergence Map
Rombaut’s most actionable finding is that scaffolding architectures converge where external constraints dominate and diverge in open design areas [3]:
| Dimension | Status | Driver |
|---|---|---|
| Tool capabilities (read, search, edit, execute) | Converged | Functional requirements |
| Edit format (string replacement) | Converging | Reliability of exact-match editing |
| Execution isolation (Docker for benchmarks) | Converging | Security requirements |
| Context compaction | Divergent | No external constraint |
| State management | Divergent | No external constraint |
| Multi-model routing | Divergent | Rapid model evolution |
| Persistent memory | Divergent | Nascent capability |
For teams evaluating or building harnesses, the converged dimensions represent safe bets — adopt the consensus approach. The divergent dimensions represent genuine architectural decisions where you need to match the strategy to your specific constraints.
7. The Outer Harness: What Users Build
Bockeler makes a distinction that is easy to overlook: part of the harness is built into the agent by its developers (the “inner harness” — system prompts, tool definitions, the control loop), but coding agents also provide features for users to build an “outer harness” for their specific use case [1]. The outer harness includes:
- CLAUDE.md / AGENTS.md / .cursorrules — project-level instructions re-read on every turn
- Git hooks and CI gates — computational sensors that run automatically
- Architecture decision records and schema validators — machine-readable constraints
- Test suites tuned for LLM output — feedback mechanisms the agent can observe
- Skills / plugins — reusable capability bundles with progressive disclosure
Bockeler’s core insight is that a harness-friendly codebase is itself a form of harness engineering. Strongly-typed languages turn the type checker into a sensor. Well-defined module boundaries provide architectural constraints. Clear naming conventions reduce the context the agent needs to retrieve. The implication: teams should evaluate their codebase through the lens of “how well can an agent navigate and verify work in this repository?” [1].
Practical Implications
1. Start with the loop, not the prompt. The most consequential harness decision is the control loop primitive. Choose your primary loop (usually ReAct for interactive agents, plan-execute for batch agents) and then layer secondary primitives for your specific failure modes. Generate-test-repair is high-value whenever you have an existing test suite [3].
2. Separate evaluation from generation. Anthropic’s three-agent harness demonstrates that self-evaluation is unreliable at scale. If you are building long-running agent workflows, invest in a separate evaluator — whether that is a distinct agent, a CI pipeline, or a human review gate. The sprint contract pattern (agree on “done” criteria before starting work) is adoptable immediately [2].
3. Design your permission model as a pipeline, not a boolean. Claude Code’s approach — independent permission gates per tool, graduated trust levels, a separate classifier model for auto-approval — is the most sophisticated production example available. At minimum, distinguish between read operations (low risk, auto-approve) and write operations (higher risk, require confirmation or classification) [7][9].
4. Invest in context compaction early. Context limits are the most common failure mode for long-running agents. Choose a compaction strategy deliberately rather than hitting the wall and adding truncation as a patch. LLM summarization plus tool output offloading is the current best practice, but monitor the space — no approach has won yet [3][7].
5. Make your codebase harness-friendly. This is the highest-leverage, lowest-cost intervention. Add CLAUDE.md or AGENTS.md files with project conventions. Ensure your test suite runs cleanly and quickly. Use strongly-typed languages where possible. Keep module boundaries clean. These changes improve agent reliability without requiring any changes to the agent itself [1].
6. Treat harness decisions as model-dependent. Anthropic dropped context resets when Opus 4.6 shipped. The Ralph Loop pattern (reinjecting prompts across context boundaries) may become unnecessary with larger windows. Revisit your harness architecture when you upgrade models — complexity that was necessary with one model may be dead weight with the next [2][3].
7. Budget for harness iteration. Manus went through five rewrites in six months. LangChain rebuilt four times in a year. Harness engineering is not a one-time design exercise — it is an ongoing optimization loop. Plan for it in your roadmap [13].
Open Questions
- Will the inner harness commoditize? If control loops, permission models, and context strategies converge, does the harness become a standard runtime layer rather than a differentiating architectural decision? Or do domain-specific constraints keep harnesses diverse?
- What is the right granularity for multi-agent decomposition? Anthropic uses three agents; Claude Code’s coordinator mode uses N workers. Rombaut found most agents are monolithic. Under what conditions does multi-agent decomposition actually improve reliability versus adding coordination overhead?
- How should harness performance be measured? There is no standard benchmark for harness quality independent of model quality. SWE-bench conflates both. Teams need metrics for harness-specific properties: context utilization efficiency, permission gate accuracy, compaction information loss, recovery rate from tool failures.
- What happens to harness engineering when context windows become effectively unlimited? Several compaction strategies and context reset patterns exist solely because of context limits. If 10M+ token windows become standard, does the harness simplify dramatically, or do new problems (attention degradation, cost) create new harness requirements?
- Will harness security become a formal discipline? The Claude Code leak exposed anti-distillation mechanisms, undercover modes, and permission classifiers — all security-relevant harness components. As agents gain more autonomy, will harness security auditing become as routine as application security testing?
Sources
- Bockeler, B. “Harness engineering for coding agent users.” martinfowler.com, April 2, 2026
- Anthropic Engineering. “Harness design for long-running application development.” anthropic.com, March 2026
- Rombaut, B. “Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures.” arXiv:2604.03515v1, April 3, 2026
- Kim, A. “The Claude Code Source Leak: fake tools, frustration regexes, undercover mode, and more.” alex000kim.com, March 31, 2026
- OpenAI. “Harness engineering: leveraging Codex in an agent-first world.” openai.com, February 2026
- InfoQ. “Anthropic Designs Three-Agent Harness Supports Long-Running Full-Stack AI Development.” infoq.com, April 2026
- VentureBeat. “Claude Code’s source code appears to have leaked: here’s what we know.” venturebeat.com, March 31, 2026
- Paddo.dev. “The Claude Code Leak: What the Harness Actually Looks Like.” paddo.dev, April 2026
- WaveSpeedAI. “Claude Code Agent Harness: Architecture Breakdown.” wavespeed.ai, April 2026
- Epsilla Blog. “The GAN-Style Agent Loop: Deconstructing Anthropic’s Harness Architecture.” epsilla.com, April 2026
- DEV Community. “Harness Engineering: 5 Companies, 5 Definitions — Why Everyone Disagrees on What It Means.” dev.to, April 2026
- Latent Space. “AINews: The Claude Code Source Leak.” latent.space, March 31, 2026
- Shulex VOC Blog. “Harness Engineering: Why Your AI Agents Keep Failing in Production.” blog.voc.ai, 2026
- Medium. “Prompt Engineering vs Context Engineering vs Harness Engineering: What’s the Difference in 2026?” medium.com, March 2026
- Anthropic. “Effective harnesses for long-running agents.” anthropic.com, 2026
- Working-Ref. “Anthropic’s Harness Design Philosophy — From Multi-Agent to Single-Agent, a Record of Simplification.” working-ref.com, 2026
- MorphLLM. “Agent Engineering: Harness Patterns, IMPACT Framework & Coding Agent Architecture (2026).” morphllm.com, 2026
- arXiv. “Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned.” arXiv:2603.05344, March 2026