Scout: Subagent Swarms vs Async Agents — Two Orchestration Models Diverge

Summary

Within the same week of late May 2026, two distinct answers to “how do you orchestrate more than one coding agent at once” came into sharp relief. Anthropic shipped Dynamic Workflows for Claude Code (research preview, alongside Opus 4.8 on May 28), where the model writes its own JavaScript orchestration script and fans a single prompt out across many parallel subagents in one session. Trade press reported runtime limits of 16 concurrent subagents and 1,000 total per run (MarkTechPost). Cognition closed a $1B raise at a $26B valuation the day before, backing the opposite model: Devin as an async background agent on a full virtual machine, running spec-to-PR with curated memory, where the human first touches the work at PR review (Cognition, TechCrunch). These are not two flavors of “multi-agent.” They are two different bets on where the human sits relative to the agents: operating a swarm within a single session, or reviewing a contributor’s PR after the fact. They also fail in different places. The synchronous swarm runs into the limit that human attention does not parallelize. The async model runs into review-trust: a large-scale study of real coding-agent sessions found that 91.49% of visible resolutions still required explicit user correction (arXiv:2605.29442). This briefing is a decision framework for choosing between them, and for the increasingly common case of running both.

Key Findings

1. The two models put the human in structurally different places

The clearest way to separate these is not by vendor, parallelism count, or autonomy level. It is by asking when, and against what artifact, the human re-enters the loop.

In the synchronous-swarm model (Dynamic Workflows is the reference implementation), Claude “plans the work and then runs hundreds of parallel subagents in a single session” (Anthropic). The model “dynamically writes orchestration scripts that run tens to hundreds of parallel subagents in a single session, checking its work before anything reaches you” (Anthropic blog). The run stays attached to a session: subagents fan out, “other agents try to refute what they found, and the run keeps iterating until the answers converge” before results fold back to the operator. The human is the synchronous operator of one fan-out: they frame the run up front and judge the coordinated result it returns, rather than steering each subagent in real time. Their unit of work is a task, and their bar is the existing test suite. Opus 4.8’s headline claim is codebase-scale migrations “from kickoff to merge, with the existing test suite as its bar” (Anthropic).

In the async-agent model — Devin is the reference implementation — the agent runs as a background process on its own full virtual machine, not a container, having “separate[d] the brain from the machine” (Latent Space). It reads the repo, plans, executes, runs its own tests, and “return[s] a pull request for human review”; the engineer “re-enters at checkpoints, not at every keystroke” (Augment Code). The human is an asynchronous reviewer whose unit of work is a PR, and whose bar is code review. Cognition’s own internal discipline is explicit about this: “review the PR, not the logs” (Cognition).

That single difference — synchronous-operator-of-a-swarm versus async-reviewer-of-a-PR — propagates into every downstream decision: the harness you need, the memory you need, the review discipline, and the cost curve. Much of what follows is a consequence of it.

2. The swarm’s failure mode: human attention does not parallelize

The synchronous swarm is built to remove the execution bottleneck. It does — convincingly. Anthropic’s flagship example — which the company notes was “not yet in production” — is a port of “750,000 lines of Rust” with “99.8% of the existing test suite passing” in “eleven days from first commit to merge” (Anthropic blog). A practitioner who ran a 200-agent swarm reported that the coordination overhead he braced for simply wasn’t there: “I’d braced for a pile of glue code to manage 200 concurrent results and there was just none” (Patzelt).

But removing the execution bottleneck relocates it rather than eliminating it. The same practitioner found his time had simply moved upstream: “I spent almost no time running experiments and almost all of it deciding which experiments were worth a thesis: framing the problem, carving it into slices, briefing the agents, judging what came back” (Patzelt). Decomposition and judgment became the bottleneck because those are the parts that don’t fan out. You can run 1,000 subagents; you cannot run 1,000 parallel copies of your own attention to brief and adjudicate them.

This is the structural tax of the swarm model, and the broader ecosystem framing reinforces it. A Pathmode summary of Anthropic’s 2026 agentic-coding trends report frames a “delegation gap” — developers use AI in a majority of their work but report being able to fully delegate only a small fraction of tasks — as the central problem of “the orchestration era” (Pathmode). The swarm is the most acute expression of that gap: the more agents you fan out, the more the binding constraint becomes the quality of the spec you fed them and the bandwidth of the human who has to make sense of what comes back.

Two concrete swarm hazards deserve naming because they are quiet rather than loud. First, the planning is not steerable mid-flight in the research preview: the model plans internally and begins spending tokens and touching files before the operator can intervene. Second, there is a silent-degradation risk: dynamic workflows “will, if you let them, quietly spawn Haiku subagents instead of Opus,” and because “nothing errors out,” the run produces weaker results with no signal (Patzelt). Anthropic itself cautions that the feature “can consume substantially more tokens than a typical Claude Code session” and recommends “starting with smaller, well-scoped tasks” (InfoQ). The swarm’s bill is paid in tokens and upfront specification discipline, not in glue code.

3. The async model’s failure mode: merge-and-review trust

The async model removes a different bottleneck: it lets a human start a task, go offline, and come back to a finished PR. Cognition’s self-reported traction suggests real enterprise demand for the model, though the figures are company-disclosed and not independently audited: Devin reached $492M run-rate revenue with enterprise usage up “10x” year-to-date (Cognition), and on Cognition’s own repos, “89% of code committed by our engineers is committed by Devin” (Cognition). An earlier Cognition datapoint shows the same curve climbing: Devin’s share of commits on its repos rose from “16% in January” to “80% in March” 2026 (Latent Space).

The cost is that the trust burden moves entirely onto the review surface. When the human only sees the PR, the PR has to be trustworthy — and the empirical record says agent output frequently isn’t, even when it looks finished. A large-scale analysis of 20,574 real-world coding-agent sessions — across IDE and CLI workflows, not async PR agents specifically — found seven recurring forms of developer-agent misalignment, “spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress.” Its headline numbers cut at the async model’s premise even though they aren’t Devin-specific: “90.50% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49% of visible resolutions still require explicit user correction” (arXiv:2605.29442). Most failures are not catastrophic — they are friction and misalignment that a human still has to catch and fix. The transfer to the async model is mechanical, not numerical: if your operating model is “review the PR, not the logs,” then whatever correction load agent output carries is concentrated almost entirely on the review surface, so that review has to be good enough to catch it.

This is also why the async model invests so heavily in making the review surface legible. Cognition built tooling to turn “large, complex GitHub PRs into intuitively organized diffs” and pairs PRs with video-based testing verification before merge (Cognition, Latent Space). The async model is only as good as the review it enables — which is the inverse of the swarm’s constraint, where the model is only as good as the spec it was briefed with.

4. The harness differences are not interchangeable

Each model demands a different harness, and a team that picks the operating model without picking the matching harness will feel the mismatch fast.

The swarm harness is in-session and ephemeral. The orchestration script is generated per-run by the model itself (“a JavaScript script that orchestrates subagents at scale,” with “a runtime [that] then executes it in the background” while “your session stays responsive”) (MarkTechPost). It needs a fast inner loop (the test suite is the convergence oracle), strict subagent isolation to avoid write conflicts, and concurrency governance — the runtime caps at “up to 16 concurrent agents” and “1,000 agents total per run” (MarkTechPost), and “progress is saved as the run proceeds” so an interrupted job resumes within the same session. The state that matters is the run.
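
The concurrency governance described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's runtime: the two caps are the figures MarkTechPost reported, while `run_subagent`, `fan_out`, and the slice ids are hypothetical names introduced here, and the sketch is written in Python rather than the JavaScript the model generates.

```python
import asyncio

MAX_CONCURRENT = 16   # concurrent-agent cap as reported by MarkTechPost
MAX_TOTAL = 1000      # total-agents-per-run cap as reported by MarkTechPost


async def run_subagent(slice_id: int) -> str:
    """Hypothetical stand-in for one subagent working on one slice."""
    await asyncio.sleep(0)  # placeholder for the real work
    return f"slice-{slice_id}: done"


async def fan_out(slice_ids: list[int]) -> list[str]:
    """Run one subagent per slice without exceeding either cap."""
    if len(slice_ids) > MAX_TOTAL:
        raise ValueError(f"{len(slice_ids)} slices exceeds the {MAX_TOTAL}-agent cap")
    gate = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(slice_id: int) -> str:
        async with gate:  # at most MAX_CONCURRENT subagents hold the gate at once
            return await run_subagent(slice_id)

    # gather returns results in the same order as slice_ids
    return await asyncio.gather(*(guarded(s) for s in slice_ids))


results = asyncio.run(fan_out(list(range(40))))
```

The semaphore models the concurrent cap and the up-front length check models the per-run total. A real harness would also need the subagent isolation and resumable progress mentioned above, which this sketch leaves out.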

The async harness is durable and per-task. Devin runs each task on a full VM — “a fresh Linux environment equipped with a browser, shell, and code editor” — with session-management primitives the swarm doesn’t need: fork and rollback, machine snapshots, and async handoffs (Augment Code). The state that matters is the agent’s persistent workspace and the PR it produces. A swarm subagent is disposable; an async agent’s environment is a long-lived asset you snapshot and resume.

5. The memory differences follow from where the human sits

Memory is where the two models diverge most sharply, and it follows directly from the human’s position.

The swarm, being single-session, largely doesn’t accumulate cross-run memory as a first-class concern — its subagents are spun up and torn down within a run, and the durable artifact is the merged code plus whatever lives in the repo’s CLAUDE.md/AGENTS.md instruction files. Continuity across runs is the operator’s job, expressed as better specs.

The async model treats memory as core infrastructure, because an async contributor that doesn’t remember your conventions re-litigates them every PR. Devin accumulates curated “Knowledge” — teams “add Knowledge to teach Devin your codebase conventions” (Cognition) — and the entries are human-approved as they emerge rather than silently absorbed (Latent Space). Cognition’s own engineers describe memory as “a pretty unsolved problem” on the retrieval side (Latent Space), which is worth holding onto: the async model’s defining feature is also its least-finished one. This maps onto a pattern the harness-engineering literature has been circling for two months — that an agent’s memory is structurally tied to its harness, so the harness you commit to is also the memory architecture you commit to (prior scout).

6. The review discipline each requires is different work

These models require different review skills, not just different review timing.

The swarm requires upfront review — adversarial spec authoring. The leverage is in how you spec each subagent: “spec them like you’d brief a team of contractors, not like you’d toss a prompt at a chatbot,” with sharp role definition dramatically outperforming vague prompts (Patzelt). The model runs its own internal adversarial pass (agents “try to refute what they found”) (Anthropic blog), but that pass is only as good as the decomposition it’s checking. The human’s review work happens before the run, in framing and slicing.

The async model requires downstream review — PR adjudication at volume. The skill is reading agent-authored diffs critically and catching — before merge — the trust and effort costs that agent output broadly carries (arXiv:2605.29442). This is why Cognition’s “review the PR, not the logs” discipline and its diff-organizing tooling matter: the async model’s review surface has to scale with PR volume, and at 89% agent-authored commits (Cognition), the review queue is the human team’s primary interface with the codebase.

7. The cost models compare on different axes

The two models are not priced on the same primitive, which makes naive head-to-head cost comparison misleading.

The swarm’s cost is token burn within a run. Opus 4.8 is “$5 per million input tokens and $25 per million output tokens,” with a faster mode at “$10 per million input tokens and $50 per million output tokens” (Anthropic). Fan out to hundreds of subagents and the per-run token cost scales with the swarm size — Anthropic’s explicit caution is that workflows “can consume substantially more tokens than a typical Claude Code session” (InfoQ). The practitioner verdict is that this is worth it when the answer beats the meter, but that lighter plans “hit the ceiling before lunch” (Patzelt). The cost is spiky and concentrated in the moment of fan-out.
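
To make the scaling concrete, here is a rough worked estimate using the list prices quoted above. The workload numbers (200 subagents, 400k input and 40k output tokens each) are invented for illustration, and the sketch assumes every subagent runs on Opus 4.8 at list price with no caching, discounts, or orchestrator overhead.

```python
# Opus 4.8 list prices quoted above, in USD per million tokens.
STANDARD = {"input": 5.00, "output": 25.00}
FAST = {"input": 10.00, "output": 50.00}


def run_cost(subagents: int, input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Estimated USD cost of one run, assuming identical token use per subagent."""
    per_agent = (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
    return subagents * per_agent


# Hypothetical workload: 200 subagents, each reading 400k tokens and writing 40k.
standard = run_cost(200, 400_000, 40_000, STANDARD)
fast = run_cost(200, 400_000, 40_000, FAST)
print(f"standard: ${standard:,.2f}  fast: ${fast:,.2f}")
# prints "standard: $600.00  fast: $1,200.00"
```

Under these assumptions the cost is linear in swarm size, so doubling the fan-out doubles the bill, and the faster mode doubles it again.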

The async model’s cost is closer to a per-engineer budget for a standing contributor. Cognition’s framing for Devin deployments runs from roughly “$1,000 an engineer up to $5,000 an engineer” (Latent Space) — a budgeted, ongoing line item per human the agent works alongside, not a per-run spike. The cost is smooth and amortized across a stream of PRs. The practical implication: the swarm is a capital-expense-shaped burst (big, occasional, migration-scale), while the async agent is an operating-expense-shaped subscription (steady, per-seat, continuous-delivery-scale). A team comparing them on cost alone is comparing a one-time renovation to a salaried hire.

Practical Implications

A decision framework: which model for which workload

Start with the shape of the work, not the vendor.

One large, bounded, test-gated transformation — a migration, a framework swap, a codebase-wide audit. This is the synchronous swarm’s home turf. The work decomposes into many independent slices, the test suite is a credible convergence oracle, and you want it done in days not weeks. Dynamic Workflows targets exactly this: “framework swaps, API deprecations, language ports that span thousands of files,” plus “codebase-wide bug hunts” and “security audits” (Anthropic blog). Budget for a token spike, invest your effort in the decomposition and the spec, and verify your test suite actually fails on the regressions you care about before you trust 99.x% pass rates. Confirm subagents aren’t silently downgrading model tier mid-run.
A continuous stream of independent, reviewable units of work — ticket-shaped tasks, bug fixes, dependency bumps, vuln remediation. This is the async agent’s home turf. The work arrives as a queue, each item fits in a PR, and the human team has review bandwidth to spend. Devin-style deployments fit here, and the enterprise traction is in this shape (e.g. one large bank reportedly “fixes 70% of security vulnerabilities automatically”) (Cognition). Budget per-engineer-seat, invest in memory curation and PR-review tooling, and treat your review surface as the load-bearing safety control — the broad evidence on agent misalignment says it is (arXiv:2605.29442).
Exploratory or high-ambiguity work requiring frequent back-and-forth. Neither swarm nor async contributor is the right default — a single focused agent session beats both. Fanning out amplifies ambiguity rather than resolving it, and the async round-trip latency makes tight iteration painful.
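
The three workload shapes above can be written down as a small triage function. This is one possible encoding, not a prescribed rule set: the `WorkItem` fields, the precedence order, and the fallback to a single session when nothing matches are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Model(Enum):
    SWARM = "synchronous swarm"
    ASYNC_AGENT = "async agent"
    SINGLE_SESSION = "single focused session"


@dataclass
class WorkItem:
    name: str
    bounded: bool         # one finite transformation, not a standing queue
    test_gated: bool      # an existing test suite can serve as the convergence bar
    pr_sized: bool        # each unit of work fits in one reviewable PR
    high_ambiguity: bool  # intent needs frequent back-and-forth to pin down


def choose_model(item: WorkItem) -> Model:
    """Map a work item to an orchestration model by the shape of the work."""
    if item.high_ambiguity:
        return Model.SINGLE_SESSION  # fanning out would amplify the ambiguity
    if item.bounded and item.test_gated:
        return Model.SWARM
    if item.pr_sized:
        return Model.ASYNC_AGENT
    return Model.SINGLE_SESSION      # assumed fallback when no shape matches


backlog = [
    WorkItem("port service to new framework", bounded=True, test_gated=True,
             pr_sized=False, high_ambiguity=False),
    WorkItem("dependency bumps", bounded=False, test_gated=True,
             pr_sized=True, high_ambiguity=False),
    WorkItem("prototype new onboarding flow", bounded=False, test_gated=False,
             pr_sized=False, high_ambiguity=True),
]
for item in backlog:
    print(f"{item.name}: {choose_model(item).value}")
```

Ambiguity is checked first on purpose: a bounded, test-gated task whose intent is still unclear is treated as exploratory until the spec firms up.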

The combine case is real, and the seam is the handoff. These models are complementary more than competitive. A natural division of labor: the async agent owns the steady stream of ticket-shaped delivery and accumulates the durable memory; the swarm is invoked for the occasional bounded burst (the big migration, the audit) that would clog a PR queue. The integration risk is that they don’t share a memory or review surface — the swarm’s run-state and the async agent’s curated Knowledge are different stores, and a convention the async agent learned won’t automatically reach the swarm’s subagents. If you run both, decide deliberately where shared conventions live (most practically, in the repo’s AGENTS.md/CLAUDE.md, which both can read) so the two models don’t drift apart on house style.

Match the review skill to the model. Staffing a swarm workload with people who are great at PR review but weak at adversarial decomposition wastes the swarm; staffing an async-agent workload with people who write crisp specs but won’t scrutinize diffs wastes the async model. The review work is real in both cases — it just happens at opposite ends of the run.

What to do this quarter

Classify your agent backlog by shape first. Sort pending agent work into “bounded test-gated transformations” vs. “continuous reviewable units” vs. “exploratory.” That sort, not a vendor preference, tells you which model each piece belongs to.
If you pilot the swarm: start with a small, well-scoped task per Anthropic’s own guidance, instrument token spend per run, and pin the subagent model tier so a swarm can’t silently fall back to a cheaper model.
If you pilot the async model: stand up memory curation and PR-review tooling before you scale PR volume, not after. The correction load lands on review, and a review surface that doesn’t scale becomes the bottleneck the model was supposed to remove.
If you run both: put shared conventions in a single repo-level instruction file both can consume, and accept that cross-model memory federation is not solved for you yet.
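
For the swarm pilot, a post-run audit is one way to catch a silent tier downgrade and total up token spend. The sketch below assumes you can obtain one record per subagent with its model tier and token counts. The sources cited here do not say how the research preview exposes that information, so the record shape and every field name are hypothetical.

```python
from collections import Counter


def audit_model_tier(records: list[dict], expected_tier: str) -> dict:
    """Summarize tiers used, token spend, and subagents that left the expected tier."""
    tiers_used = Counter(r.get("model_tier", "unknown") for r in records)
    off_tier = [r["subagent_id"] for r in records if r.get("model_tier") != expected_tier]
    total_tokens = sum(r.get("input_tokens", 0) + r.get("output_tokens", 0) for r in records)
    return {
        "tiers_used": dict(tiers_used),
        "off_tier": off_tier,
        "total_tokens": total_tokens,
        "ok": not off_tier,
    }


# Hypothetical per-subagent records; the field names are illustrative only.
run_records = [
    {"subagent_id": "a1", "model_tier": "opus", "input_tokens": 120_000, "output_tokens": 9_000},
    {"subagent_id": "a2", "model_tier": "haiku", "input_tokens": 80_000, "output_tokens": 4_000},
    {"subagent_id": "a3", "model_tier": "opus", "input_tokens": 150_000, "output_tokens": 12_000},
]
report = audit_model_tier(run_records, expected_tier="opus")
print(report["ok"], report["off_tier"], report["total_tokens"])
# prints "False ['a2'] 375000"
```

An audit like this complements pinning rather than replacing it: it only tells you after the fact that a run drifted, which is still better than the no-signal failure described above.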

Open Questions

Does the swarm’s planning become steerable? Today the model plans internally and starts spending before the operator can intervene. A mid-flight steering surface — pause, inspect the plan, adjust the decomposition, resume — would materially change the swarm’s risk profile for high-stakes runs. There’s no public signal yet on whether the research preview gets one.
Where does the agent-correction load go as models improve? The 20,574-session study is a mid-2026 snapshot (arXiv:2605.29442). If frontier models keep closing the alignment gap, the async model’s review burden eases and its economics improve; if the gap is sticky, the review surface stays the permanent bottleneck. Which way this trends is the single biggest variable in the async model’s long-run cost.
Can the two models share memory and review surfaces? The combine case is attractive precisely because the models are complementary, but they don’t currently federate — different state stores, different review timing. Whether a shared memory/review layer emerges (or whether teams keep them as separate silos joined only by AGENTS.md) is an open ecosystem gap and a plausible product opportunity.
Is “review the PR, not the logs” robust as agents take on riskier work? The async model’s review discipline assumes the PR is a faithful summary of what the agent did. The misalignment study’s “report progress” failure category (arXiv:2605.29442) is exactly the case where that assumption breaks. As async agents touch more sensitive systems, does PR-level review stay sufficient, or does the log come back?
Where does human-AI collaboration actually beat either alone? CentaurEval’s benchmark of collaboration-necessary tasks found human-AI collaboration reaching a 31.11% pass rate where standalone LLMs hit 0.67% and humans alone 18.89% (arXiv:2512.04111) — a large lift, but still a minority of tasks solved. Both orchestration models are bets on collaboration scaling; how far the collaboration premium extends as task difficulty rises is unresolved, and it bounds how much delegation either model can ultimately absorb.

Sources

Introducing Claude Opus 4.8 — Anthropic (Opus 4.8 pricing, fast mode, Dynamic Workflows “hundreds of parallel subagents in a single session,” codebase-scale migrations with the test suite as the bar)
Introducing dynamic workflows in Claude Code — Anthropic/Claude blog (orchestration-script mechanic, plan/fan-out/refute/converge flow, 750k-line Rust port at 99.8% test pass in eleven days, target use cases)
Anthropic Ships Claude Opus 4.8 Alongside Dynamic Workflows and Cheaper Fast Mode, With Workflows Capped at 1,000 Subagents — MarkTechPost (16 concurrent / 1,000 total runtime caps, JavaScript orchestration script, background runtime, session resumes, version/plan availability)
Claude Code Adds Dynamic Workflows for Parallel Agent Coordination — InfoQ (independent reporting; plan/distribute/verify/converge model, ultracode, token-consumption caution, “start with smaller, well-scoped tasks”)
Claude Opus 4.8 /ultracode: I Ran a 200-Agent Swarm — Marco Patzelt (hands-on 200-agent run; decomposition-as-bottleneck, silent Haiku downgrade, token economics, “brief a team of contractors”)
More Devins in More Places — Cognition (Series D: $1B raise at $26B valuation, $492M run-rate revenue, >10x enterprise growth, 89% of Cognition’s own committed code from Devin, enterprise outcomes)
AI coding startup Cognition raises $1B at $25B pre-money valuation — TechCrunch (raise confirmation, valuation, async background-agent architecture, parallel subagents, chat-based mid-stream correction)
The Age of Async Agents — Cognition’s Walden Yan & OpenInspect’s Cole Murray — Latent Space (brain/machine separation, full-VM execution, spec-to-PR, agent “Knowledge” memory and approval, per-engineer agent budgets, “16% in January to 80% in March” commit curve)
[AINews] Cognition raises $1B in $26B Series D — Latent Space / AINews (raise figures, $492M run-rate, >10x enterprise growth, “>$1B ARR by EOY” framing, “largest remaining independent agent lab”)
How Cognition Uses Devin to Build Devin — Cognition (“review the PR, not the logs,” Devin Review diff organization, “add Knowledge to teach Devin your codebase conventions,” 659-vs-154 PR throughput)
Devin vs Codex Desktop App (2026): Cloud Agent or Local-Hybrid Planner? — Augment Code (full-VM per-session sandbox, fork/rollback/snapshot/async-handoff primitives, PR-for-review flow, “re-enters at checkpoints, not at every keystroke”)
How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions — Tang et al., arXiv (seven recurring misalignment forms; 90.50% of episodes impose effort/trust costs; 91.49% of visible resolutions require explicit user correction)
CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding — Luo et al., arXiv (collaboration-necessary templates; pass rates of 0.67% LLM-alone, 18.89% human-alone, 31.11% human-AI collaboration)
The Orchestration Era Needs Intent — Pathmode (the “delegation gap” — constant AI use, limited delegation — and the “orchestration era” framing; implementer-to-orchestrator shift)
Scout: Harness Engineering in Production — Frontier Evidence and the Lock-In Critique — prior Grimoire scout (continuity; harness-and-memory inseparability, production autonomy datapoints)