Summary
The April 6 harness-engineering scout mapped the discipline’s mental models — Böckeler’s guides/sensors and inner/outer harness, Rombaut’s control-loop primitives and convergence-divergence map, Anthropic’s three-agent planner/generator/evaluator, and the Claude Code source leak as an unplanned case study. This briefing is the production chapter the prior scout could not yet include. Four load-bearing developments reshaped the picture between April 7 and April 11: (1) Ryan Lopopolo’s Latent Space interview exposing OpenAI Frontier’s “Symphony” operation — five months, 1M LOC, 1B tokens/day, 1,500+ PRs, zero human-written code, zero pre-merge human review — as the upper bound of what harness-mediated autonomy currently delivers; (2) Cognition’s parallel production datapoint — Devin’s PR merge rate climbed from 34% to 67% in 18 months while its best-week throughput grew from 154 to 659 merged PRs; (3) LangChain’s three-part argument that the harness is now a lock-in vector because memory and harness are structurally inseparable — captured in the aphorism “if you don’t own your harness, you don’t own your memory”; and (4) Deep Agents Deploy shipping as the first explicitly-positioned open alternative to Claude Managed Agents, collapsing the open/managed gap from “framework vs product” to “MIT-licensed product vs proprietary product.” The net effect: harness choice is now an architectural commitment roughly equivalent to choosing a database, and the decision has competing shipped products on both sides of the lock-in line.
Key Findings
1. The Frontier Datapoint: What “Fully Delegated” Actually Looks Like
Ryan Lopopolo leads Product Exploration within OpenAI Frontier, the enterprise-agent platform. His team’s five-month experiment is the most aggressive public production harness deployment on record [1][2]. The constraint he imposed on himself — write zero lines of code personally — forced the harness to absorb every capability gap. The operational footprint:
- 1M+ lines of code across roughly 1,500 PRs over five months
- 1 billion tokens per day consumed (~$2–3k/day at market rates with prompt caching)
- 0% human-written code, 0% pre-merge human review — all review happens post-merge
- 3-person core team producing all 1,500 PRs
- ~500 NPM packages in the repository (extreme decomposition to maximize multi-agent parallelism)
- 6 skills total encoding reusable business-logic primitives with built-in tracing and metrics
The harness stack that made this work — named Symphony — is an Elixir-based orchestration layer that spawns daemon processes on the Erlang BEAM runtime, one per task. When a PR fails, Symphony’s “rework state” discards the entire attempt and restarts from scratch rather than attempting incremental repair. The team organizes the system through a six-layer hierarchy (policy, configuration, coordination, execution, integration, observability) with coordination implemented in Elixir mapping agent decisions to runtime primitives [1].
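The rework pattern is easy to mis-implement as incremental repair, so a minimal sketch may help. Everything here is an illustrative assumption, not Symphony internals: `fresh_workspace`, `generate_and_test`, and the pass probability are stand-ins for the real agent and CI steps, and the sketch is in Python rather than Elixir.

```python
import random

def fresh_workspace(spec: str) -> dict:
    """Clean checkout seeded only from the task spec; no residue from prior tries."""
    return {"spec": spec, "files": {}}

def generate_and_test(workspace: dict, rng: random.Random) -> dict:
    """Stand-in for the agent-writes-code-then-CI-runs step; passes probabilistically."""
    return {"passed": rng.random() < 0.5, "pr": f"PR for {workspace['spec']}"}

def run_task(spec: str, max_reworks: int = 5, seed: int = 0) -> str:
    """Discard-and-restart loop: a failed attempt is thrown away wholesale
    and the task restarts from the spec. It is never patched in place."""
    rng = random.Random(seed)
    for _ in range(max_reworks):
        ws = fresh_workspace(spec)          # rework = start over, not repair
        result = generate_and_test(ws, rng)
        if result["passed"]:
            return result["pr"]
        # Failed attempt is discarded entirely: the "rework state".
    raise RuntimeError(f"gave up after {max_reworks} full reworks")
```

The design choice worth copying is that `fresh_workspace` takes only the spec, so no partially broken state can leak between attempts.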
The origin story is itself the argument for harness investment. In December 2025 the team operated at 3.5 PRs/engineer/day. Claude 5.2’s January 2026 release — with no other changes — pushed throughput to 5–10 PRs/engineer/day, and the humans became the bottleneck. Lopopolo: “The only fundamentally scarce thing is the synchronous human attention of my team.” Symphony exists to remove humans from the PR lifecycle so they can scale with model capability rather than against it [1].
Two details matter disproportionately for practitioners. First, Lopopolo calls the distribution pattern “ghost libraries” — software shipped as specifications rather than source, with disconnected agents re-implementing from the spec and review agents iteratively refining the spec until it reproduces the system with high fidelity. This inverts the normal source-of-truth relationship: the spec is the artifact; the code is regenerable. Second, the team’s one-minute inner loop is a hard architectural constraint. The build system evolved Make → Bazel → Turbo → NX specifically to keep the inner feedback loop under sixty seconds. Everything that degraded that budget got replaced [1].
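A toy rendering of the ghost-library loop makes the inversion concrete. All names here (`regenerate`, `fidelity`, the string-append "refinement") are illustrative stand-ins for the spec-refinement cycle, not Frontier's tooling:

```python
def regenerate(spec: str) -> str:
    """Stand-in for a disconnected agent re-implementing from the spec alone."""
    return f"impl({spec})"          # toy: implementation is a pure function of the spec

def fidelity(candidate: str, reference: str) -> float:
    """Stand-in for the review agent comparing a regeneration to the live system."""
    return 1.0 if candidate == reference else 0.0

def refine_spec(spec: str, reference_impl: str, max_rounds: int = 5) -> str:
    """Ghost-library loop: iterate on the *spec* until a fresh regeneration
    reproduces the reference system, then ship the spec, not the source."""
    for _ in range(max_rounds):
        if fidelity(regenerate(spec), reference_impl) >= 1.0:
            return spec                     # the spec is the shippable artifact
        spec += " (clarified)"              # review agent tightens the spec
    raise RuntimeError("spec did not converge to a faithful regeneration")
```

Note what the loop optimizes: the spec, never the code. The code is a throwaway output of each round.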
What agents handle: product code, tests, CI configuration, release tooling, documentation, dashboard definitions, repository management scripts, PR authorship, post-merge review, CI flake fixing, merge conflict resolution, and the final merge to main. What humans still do: approve native app releases (blessed human still required), smoke-test before distribution, steer zero-to-one product ideation [1][2].
2. The Cognition Datapoint: The Other Production Curve
Cognition’s 2025 Devin performance review, read alongside Frontier, gives us a second production curve — one that answers to external customers rather than internal product exploration [3]. The numbers:
- PR merge rate: 34% → 67% in 18 months
- Best internal week: 154 → 659 merged PRs/week between 2025 and early 2026
- 4x faster at problem-solving, 2x more efficient on resource consumption
- Hundreds of thousands of merged PRs across the Devin install base (Goldman Sachs, Santander, Nubank, among others)
Tian Pan’s analysis [11] surfaces a datapoint that is harness-specific: the same underlying model scored 69% standalone on a benchmark, but 81% with “a sophisticated agent harness that retries failures and explores files iteratively.” Twelve points of absolute improvement from scaffold alone — comparable to a full model generation. This is the cleanest empirical evidence yet for the “harness matters more than the model” claim that the prior scout sourced primarily from Böckeler and Rombaut.
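The scaffold Pan describes, retry on failure plus iterative file exploration, reduces to a short control loop. This sketch uses hypothetical stand-ins for the model and repo access (`make_model`, `explore`); the real harness is not published:

```python
def explore(task: str, depth: int) -> list[str]:
    """Stand-in for iterative repo exploration: deeper passes surface more files."""
    return [f"{task}/file_{depth}"]

def make_model(needed_files: int):
    """Stand-in model: only succeeds once enough context has been gathered."""
    def model_attempt(task: str, context: list[str]):
        ok = len(context) >= needed_files
        return f"patch for {task}", ok
    return model_attempt

def solve_with_harness(task, model_attempt, explore_fn, max_retries=3):
    """Retry loop that widens the file context on every failed attempt."""
    context: list[str] = []
    answer = None
    for attempt in range(max_retries):
        context += explore_fn(task, depth=attempt)  # explore more files each retry
        answer, ok = model_attempt(task, context)
        if ok:
            return answer, attempt + 1
    return answer, max_retries                      # best effort after retries
```

The point of the toy is that the model never changes between attempts; only the harness-supplied context does, which is where the 12-point gap lives.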
Taken together, Frontier and Devin bracket the production space. Frontier is the internal-team upper bound (one harness, one codebase, three experts, full autonomy). Devin is the external-customer curve (many harnesses, many codebases, variable-expertise operators, graduated autonomy). Both doubled their effective throughput in roughly a year, and both explicitly attribute the improvement to harness work rather than model work. For the first time, the industry has production denominators against which to evaluate harness investment.
3. LangChain’s Lock-In Argument: Memory Is Harness-Tied
Harrison Chase’s “Your Harness, Your Memory” post is the sharpest architectural critique in the chapter [4]. The core claim, quoting Letta’s Sarah Wooders: “Managing context, and therefore memory, is a core capability and responsibility of the agent harness.” Chase’s aphorism: “If you don’t own your harness, you don’t own your memory.”
The argument is structural, not ideological. Chase identifies three tiers of ownership loss:
- Stateful APIs (mildly problematic) — OpenAI’s Responses API, Anthropic’s server-side compaction. State lives remotely. Threads survive but you cannot switch models without losing them.
- Closed harnesses (bad) — Claude Agent SDK’s proprietary harness. Artifacts exist client-side but “the shape of those artifacts and how a harness should use them is unknown.”
- API-locked everything (worst) — Claude Managed Agents. “Literally everything is behind an API, locked into their platform.” No visibility into long-term memory.
The lock-in mechanism is compounding. Memory is a proprietary dataset of user interactions and preferences that accumulates over time. Model switching used to be frictionless because APIs were similar. Stateful harnesses end that symmetry: once the memory accumulates inside a proprietary harness, switching models means losing the accumulated state. Chase’s personal example — an email assistant deleted by accident, requiring full re-teaching of preferences from templates — is the consumer-scale version of what becomes a multi-quarter migration problem for an enterprise.
Medium’s independent teardown of Claude Managed Agents [7] makes the mechanism concrete. Anthropic’s service provides sandboxed execution, session persistence, credential management, scoped permissions, tool execution infrastructure, and Claude Console observability. The pricing is standard Claude API token rates plus $0.08 per session-hour of active runtime. A 24-agent fleet running 8 hours/day costs ~$15.36/day in session fees alone, before inference. The proprietary components: the harness itself (model-harness co-optimized), session-state management, credential vault isolation, and permission enforcement at the infrastructure level. Crucially, “Managed Agents is Claude-only. There’s no way to run GPT-5, Gemini, Kimi K2, Deepseek, or any other model inside the harness.” Migration to any other model requires rebuilding the orchestration layer. This is the architectural commitment Chase is pointing at [4][7].
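The session-fee arithmetic is worth pinning down, since it scales linearly with fleet size and active hours. This reproduces the teardown's back-of-envelope figure (rate per the teardown; inference tokens are billed separately):

```python
def session_fees_per_day(agents: int, hours_per_day: float,
                         rate_per_session_hour: float = 0.08) -> float:
    """Daily Managed Agents session fees, excluding token inference costs."""
    return agents * hours_per_day * rate_per_session_hour

# 24-agent fleet, 8 active hours/day -> $15.36/day before inference
fleet_cost = session_fees_per_day(agents=24, hours_per_day=8)
```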
The Kai Waehner enterprise analysis [8] generalizes the concern across four lock-in vectors: API dependency (architecture bends around vendor design), agent framework capture (proprietary orchestration compounds switching cost), data gravity (the more institutional knowledge you invest, the harder exit becomes), and ecosystem entanglement (the AI decision becomes inseparable from a cloud/productivity/data commitment). Harness choice touches all four simultaneously.
4. Deep Agents Deploy: The Shipped Open Alternative
LangChain’s Deep Agents Deploy is the first explicitly-positioned open competitor to Claude Managed Agents [5]. The framing is not “framework vs product” — that was the old LangGraph vs Claude SDK axis — but “MIT-licensed product vs proprietary product.” Architecturally it bundles a custom Deep Agent with a LangSmith Deployment server providing 30+ endpoints: MCP (agent-as-tool), A2A (multi-agent), Agent Protocol (UI), human-in-the-loop guardrails, and short/long-term memory endpoints.
Operator-controlled parameters:
- Model selection (multi-provider)
- Agent instructions via AGENTS.md (open standard, portable)
- Skills (the Agent Skills format — also portable)
- MCP tools
- Sandbox provider (Daytona, Runloop, Modal, or LangSmith)
The memory-sovereignty pitch is explicit: “By choosing an open harness you are choosing to own your memory, and not have it be locked into a proprietary harness or tied to a single model.” Memory persists in a standard format, queryable directly via API, and when self-hosted “remains in your databases only.”
The architectural shape is worth noting: Deep Agents Deploy replicates the high-level shape of Claude Managed Agents (harness + server + sandbox) almost component-for-component. The differentiation is not “do less” — it is “same capability surface, ownership reversed.” For teams that want managed-agent ergonomics without the tier-3 lock-in from Chase’s taxonomy, it collapses the previous trade-off.
5. Evals as the Harness Hill-Climbing Signal
The third LangChain piece, “Better Harness,” provides the methodology layer [6]: how do you know your harness is improving? The argument borrows ML’s training-data discipline and applies it to harness iteration. “Harness hill-climbing” is autonomous harness optimization driven by eval signal, with the harness (not the model) as the object of optimization.

The recipe is four stages — sourcing → experiment design → optimization → review — implemented as a six-step loop:
- Source evals from hand-curated examples, production traces, and external datasets; tag every eval with behavioral categories (tool selection, multi-step reasoning, followup quality, etc.)
- Split into optimization and holdout sets to prevent overfitting
- Establish baseline performance before making harness changes
- Diagnose failures from trace analysis, segmented by behavioral category
- Validate proposed harness changes, checking for regressions but accepting net gains
- Apply human review to catch token-wasteful instructions and edge cases
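The six steps above can be sketched as one loop. The eval format, the `propose_change` hook, and the acceptance rule here are illustrative assumptions, not LangSmith's API:

```python
import random

def score(harness, evals, run_harness):
    """Fraction of evals the current harness passes."""
    return sum(run_harness(harness, e) for e in evals) / max(len(evals), 1)

def hill_climb(evals, run_harness, propose_change,
               holdout_frac=0.3, max_iters=10, seed=0):
    """Evals as training data for the harness: optimize on one split,
    report generalization on a holdout the loop never touches."""
    rng = random.Random(seed)
    pool = evals[:]
    rng.shuffle(pool)
    cut = int(len(pool) * holdout_frac)
    holdout, opt_set = pool[:cut], pool[cut:]       # split to prevent overfitting

    harness = []                                    # e.g. instruction rules
    baseline = score(harness, opt_set, run_harness) # baseline before any change
    for _ in range(max_iters):
        failures = [e for e in opt_set if not run_harness(harness, e)]
        if not failures:
            break
        candidate = propose_change(harness, failures)  # diagnosed harness edit
        new = score(candidate, opt_set, run_harness)
        if new <= baseline:
            break                                   # reject regressions and no-ops
        harness, baseline = candidate, new
    return harness, baseline, score(harness, holdout, run_harness)
```

Human review would sit after this loop, inspecting the accepted `harness` edits for token-wasteful instructions before they ship.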
LangChain reports concrete lifts on Sonnet 4.6 and GLM-5: tool selection moved from 0/2 → 2/2 on the optimization set while holding at 7/8 on holdout; followup quality went 0/3 → 3/3 on optimization and generalized to 6/6 on holdout. The harness changes that surfaced from the loop are instructive — not model-side fixes, but instruction refinements like “Use reasonable defaults when requests imply them,” planning heuristics like “Ask domain-defining questions before implementation questions,” and constraint clarifications like “Do not ask for details already supplied.” These are precisely the kinds of changes that live in the outer harness (CLAUDE.md, AGENTS.md, system prompts) that the prior scout covered at the pattern level [6].
The methodological innovation is subtle: evals here are training data for the harness, not the model. This means harness engineering now has the same overfitting/generalization discipline that ML training does. Holdout sets, behavioral tagging, human review as a second signal — all of it. The prior scout noted the absence of a harness-quality benchmark independent of model quality; this post is LangChain’s proposal for how teams should construct one internally.
6. Böckeler’s QCon Keynote: The Executive Case
Böckeler’s Martin Fowler essay was the practitioner foundation that anchored the prior scout. Her QCon keynote [9] is the leadership pitch for the same material, with two notably new threads. The first is the cost-curve reality check: economics have moved from “$0.12 per 100 lines” to “$200+ monthly flat rates,” which now approach developer salary fractions. Initial efficiency gains evaporate through multi-turn iteration, testing loops, and review cycles — the savings are real but less dramatic than the first-order numbers suggest.
The second is the Amazon response to AI-related outages: adding senior-engineer review gates, “negating the speed advantage entirely.” This is the executive version of the security argument. Faster delivery and safety-net investment are not adversarial — the harness is the safety net, and insufficient investment in it produces outages that cost back the speed gains. Her framing: “You have to be this tall to ride the roller coaster” — experienced developers must assess risk before reducing supervision.
The practical upshot for leaders: harness engineering is not a developer-productivity investment masquerading as an architectural one; it is explicitly an architectural investment whose business case is sustainable velocity rather than peak velocity. Böckeler cites the OpenAI team (Lopopolo’s) as exemplar — continuous harness refinement as permanent operational burden, not one-time build-out.
7. The Rombaut Paper: Academic Ratification
Rombaut’s blog taxonomy anchored the prior scout; his paper “Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures” [10] is the peer-reviewable version and arrived April 3. The paper’s specific methodological contribution is grounding claims in file paths and line numbers — making the taxonomy falsifiable rather than descriptive. Existing LLM-agent surveys, Rombaut argues, classify systems by abstract capabilities (tool use, planning, reflection) that cannot distinguish a Monte Carlo Tree Search agent from a while-loop-with-retry agent, even though the two have fundamentally different cost, reliability, and failure characteristics.
For this briefing the paper matters less for new findings than for locking in the prior scout’s five-primitive model and convergence-divergence map as published work. When the managed-vs-open debate hardens over the next six months, “Which primitives does your harness compose?” becomes an interview-grade architecture question rather than an informal one.
Practical Implications
1. Treat harness choice as a database-class architectural commitment. Chase’s lock-in taxonomy is the new framework for evaluating managed-agent offerings. A managed harness with proprietary memory surfaces (Tier 3 — Claude Managed Agents today) is not comparable to a managed harness with open memory surfaces (Tier 2 — Deep Agents Deploy self-hosted). Before adopting any managed agent platform, write down where the memory lives, in what format, under whose control, and what migration would require. If you cannot answer those questions, you are making a database-class commitment without database-class due diligence.
2. Budget the “Lopopolo ratio” for autonomy-heavy workloads. Frontier’s operational economics — $2–3k/day per 3-person team at 1B tokens/day — is the current upper bound. If your roadmap includes any workload approaching Frontier-style delegation (post-merge review only, agents owning CI and release), model the cost linearly: ~$1k/day/engineer of delegated capacity before your own agents are even custom-tuned. This is below a fully-loaded US senior engineer’s salary, but above many offshore rates. The economic case closes, but not trivially.
3. Instrument evals as harness training data, not model training data. The LangChain hill-climbing methodology is immediately adoptable without adopting LangChain’s stack. The pattern — tag evals by behavioral category, split optimization from holdout, diagnose by category, validate with human review — is a LangSmith-free discipline. Teams that ship Claude Code or Cursor without eval instrumentation are leaving the harness-improvement loop unclosed. The 12-point gap between standalone-model and model-in-harness performance [11] is the yield on this work.
4. Shorten your inner loop before you tune your outer loop. Frontier’s one-minute build constraint is the easiest lesson to steal. Every build-system change that took the inner loop past 60 seconds got replaced — not deprioritized, replaced. Before investing in skills, permission policies, or context compaction, verify your inner loop. Agents iterating against a 10-minute test suite cost ten times more per turn than agents iterating against a 1-minute test suite, and model retry loops compound the overhead.
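Enforcing the budget can be as simple as a timing gate around the build-and-test step. The helper below is hypothetical, not Frontier's build system; it just times an arbitrary build callable against the 60-second budget:

```python
import time

def within_inner_loop_budget(run_build, budget_s: float = 60.0) -> bool:
    """Time the build+test callable; report whether it fits the inner-loop budget.
    Wire the False case to fail CI so budget regressions surface immediately."""
    start = time.monotonic()
    run_build()
    return time.monotonic() - start <= budget_s
```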
5. If you adopt Claude Managed Agents, adopt it deliberately. The point of Chase’s critique is not that managed agents are wrong — the point is that they are expensive in a new dimension. Claude Managed Agents are a reasonable choice when (a) the workload is Claude-shaped for the foreseeable future, (b) the memory that accumulates is domain-generic rather than business-critical, and (c) Anthropic’s infrastructure operations genuinely exceed what your team could build. When any of those assumptions flip, the lock-in cost compounds daily. Deep Agents Deploy now provides a credible off-ramp, but migrating memory across architectures is not free even between open systems.
6. Design for ghost-library distribution, not source distribution. Frontier’s inversion — spec-as-source-of-truth, code as regenerable — is a radical suggestion for internal platform teams. It is not immediately practical for most organizations (you cannot ship ghost libraries to customers without substantial regeneration infrastructure), but the internal case is real. If your team uses agents to maintain internal tooling, the question worth asking is: would we rather own the code, or own the spec? Owning the spec makes model upgrades cheap; owning the code means every model upgrade carries its own refactoring cost.
7. Expect the harness to become a procurement category. Enterprise procurement currently has slots for cloud providers, database vendors, and SaaS platforms. Within twelve months — assuming the lock-in debate continues to harden — it will have a slot for agent-harness platforms, with the associated RFP discipline (portability requirements, memory format specifications, exit clauses). Teams that get ahead of this now will be in a stronger negotiating position than teams that discover it post-adoption.
Open Questions
- Will the managed-vs-open distinction collapse or harden? Deep Agents Deploy is the first credible open alternative to Claude Managed Agents, but it ships from LangChain — a company whose revenue model depends on LangSmith adoption. Is the managed/open line the real fault line, or will it re-form as LangSmith-dependent vs truly portable within six months?
- Does ghost-library distribution generalize? Frontier’s model works because the team has one codebase, three experts, full observability, and no external customers. Every one of those assumptions fails at some point in a larger organization. Can a 500-person engineering organization operate on ghost libraries for internal tooling, or does the pattern only survive at small team scale?
- How much of Symphony’s architecture is Elixir/BEAM-specific? The Frontier team chose Elixir for daemon-per-task concurrency and supervisor-tree fault tolerance. Teams trying to replicate the pattern in Python or Node.js will face harder fault-isolation problems. Is Symphony reproducible in mainstream stacks, or does it genuinely require actor-model runtime primitives?
- What does a harness-quality benchmark look like? LangChain’s internal evals methodology solves the problem within one team’s harness. It does not solve the problem across harnesses — there is still no SWE-bench equivalent that isolates harness quality from model quality across systems. Will an independent benchmark emerge, or will harness quality remain a qualitative judgment indefinitely?
- When the lock-in migration happens at scale, what breaks first? Every managed-agent customer today is accumulating memory and state that will need to move. The first large-scale migration (Claude Managed Agents → Deep Agents Deploy, or either → something else) will reveal which parts of the lock-in are real architectural obstacles and which are just friction. Until then, the critique is theoretical.
Sources
1. Alessio Fanelli & swyx. “Extreme Harness Engineering for Token Billionaires: 1M LOC, 1B toks/day, 0% human code, 0% human review — Ryan Lopopolo, OpenAI Frontier & Symphony.” Latent Space, April 7, 2026.
2. ZenML LLMOps Database. “OpenAI: Extreme Harness Engineering: Building Production Software with Zero Human-Written Code.” zenml.io, April 2026.
3. Cognition. “Devin’s 2025 Performance Review: Learnings From 18 Months of Agents At Work.” cognition.ai, 2026.
4. Chase, H. “Your Harness, Your Memory.” blog.langchain.com, April 2026.
5. LangChain. “Deep Agents Deploy: An Open Alternative to Claude Managed Agents.” blog.langchain.com, April 2026.
6. LangChain. “Better Harness: A Recipe for Harness Hill-Climbing with Evals.” blog.langchain.com, April 2026.
7. Unicodeveloper. “Claude Managed Agents: What It Actually Offers, the Honest Pros and Cons, and How to Run Agents Yourself.” Medium, April 2026.
8. Waehner, K. “Enterprise Agentic AI Landscape 2026: Trust, Flexibility, and Vendor Lock-in.” kai-waehner.de, April 6, 2026.
9. Böckeler, B. “AI Coding Assistants: From Prompt Engineering to Harness Engineering.” InfoQ / QCon presentation, April 2026.
10. Rombaut, B. “Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures.” arXiv:2604.03515v1, April 3, 2026.
11. Pan, T. “Agentic Coding in Production: What SWE-bench Scores Don’t Tell You.” tianpan.co, April 9, 2026.
12. Latent Space (X). “Extreme Harness Engineering: 1M LOC, 1B toks/day, 0% human code, 0% human review.” x.com/latentspacepod, April 2026.
13. MCP Server with LangGraph. “MCP Server with LangGraph vs Claude Agent SDK.” mcp-server-langgraph.mintlify.app, April 2026.
14. Hightower, R. “The Agent Framework Landscape: LangChain Deep Agents vs. Claude Agent SDK.” Medium, March 2026.
15. Prior Scout: “Harness Engineering Patterns: From Mental Model to Implementation.” archive/scouts/2026-04-06-harness-engineering-patterns.md, April 6, 2026.