Artificer’s Grimoire — Edition 13 · May 24, 2026
Three loosely related arcs ran in parallel this week, each touching a different layer of the agent stack. The vendor agent-platform GA wave (anticipated in Edition 11, productised by Anthropic in Edition 12) added several major releases: AWS MCP Server reached GA with IAM-grade governance, Cloudflare named the six-layer agent platform and shipped a Browser Run rebuild, and Google I/O 2026 launched Gemini 3.5 Flash, Spark (background agents), and Antigravity 2.0. The package-and-tooling foundation every one of those platforms depends on was under sustained attack the same week: GitHub disclosed a breach of internal repositories via a poisoned VS Code extension, Grafana lost source code via the TanStack npm compromise, and Sonatype flagged the return of Shai-Hulud, with maintainer accounts still the target. Anthropic ran a four-move week through the middle of it: the Stainless acquisition (the SDK and MCP-server generator widely used across AI labs), MCP Tunnels plus self-hosted sandboxes for Managed Agents, Project Glasswing’s disclosure of 10,000+ high- or critical-severity vulnerabilities with Cloudflare and Mozilla among the partners, and a quietly shipped Claude Code sandbox patch with no CVE. In the same week, Microsoft began canceling internal Claude Code licenses and pushing GitHub Copilot CLI. The arcs aren’t coordinated and shouldn’t be read as one unified shift. But the rough pattern across the week is worth naming: vendor agent platforms keep arriving, the developer-tool foundation they depend on keeps getting attacked, and one of the largest enterprise customers of multiple AI labs is forcing internal standardisation on its own product even when its engineers reportedly prefer the alternative.
Must Read
Google I/O 2026 — Gemini 3.5 Flash, Spark background agents, Antigravity 2.0
Google’s I/O keynote (2026-05-19) shipped three agent-relevant pieces and a positioning bet on Search. Gemini 3.5 Flash launched as the first model in Google’s “frontier intelligence with action” series — generally available via Antigravity, the Gemini API in AI Studio, and Android Studio. Google’s own post describes the model as outperforming Gemini 3.1 Pro on coding and agentic benchmarks, citing 76.2% on Terminal-Bench 2.1, 1656 Elo on GDPval-AA, and 83.6% on MCP Atlas. As with most frontier-model benchmark releases, real-world transferability to production agent workflows remains uncertain and the methodology behind each benchmark matters heavily for how those headline numbers should be read. Simon Willison’s read is that Flash 3.5 is meaningfully more expensive than prior Flash generations but Google plans to route it broadly, including into Search. Gemini Spark is a 24/7 personal agent built on Gemini 3.5 and the Antigravity platform — Google’s productised answer to the same scheduled / background-agent pattern Anthropic shipped as Routines in Edition 12. Antigravity 2.0 is the updated agent platform — the deployment surface Google is positioning Gemini 3.5 Flash and Spark to run on for both consumer and enterprise customers; the Latent Space brief notes Google’s reported overall Gemini-processing volume of roughly 3.2 quadrillion tokens per month as of the May 20 publication, a ~7× year-over-year jump. Ars Technica’s framing of the Search story is that Google “is set to remake Search with agentic AI in 2026” — taking the public-facing search product into the same agent-driven territory that OpenAI’s ChatGPT Search and Perplexity have been pulling at for two years.
Why it matters: The Edition 11 thesis was that the agent runtime would consolidate into vendor primitives; Edition 12 was the productisation week on the Anthropic side. This is the Google side — same shape, different vendor. Spark is the Routines analog, Antigravity 2.0 is the Claude-Code-meets-platform analog, Gemini 3.5 Flash on agentic benchmarks is the model-tier reset that pairs with it. Two practitioner takeaways. First, the “frontier coding model + integrated agent platform + GA-grade pricing tier” stack is increasingly an expected offering from major platform vendors (Anthropic, Google, OpenAI), even if most practitioner teams haven’t yet adopted these as operational defaults. The differentiation is starting to shift toward harness ergonomics, governance, and pricing model rather than benchmark raw scores. Second, Spark’s positioning as a persistent 24/7 personal agent on the consumer-facing surface is the consumer-side preview of the same “agents that run while you don’t” workflow that’s been the enterprise story since the Routines / Symphony / Copilot Coding Agent wave Edition 12 catalogued. The combined market is now both ends — consumer always-on agents and enterprise scheduled agents — running on the same platform primitives, with the harness layer increasingly the visible product surface.
Supply-chain compounding — GitHub breach, Grafana via TanStack, Shai-Hulud back
Three incidents in one week, each at a different layer of the developer-tool substrate every agent harness depends on. GitHub disclosed on 2026-05-20 that on 2026-05-18 it had detected and contained a compromise of an employee device involving a poisoned VS Code extension published by a third party. GitHub removed the malicious extension version, isolated the endpoint, and began incident response. GitHub’s current assessment is that the activity involved exfiltration of GitHub-internal repositories only; threat actor “TeamPCP” claims roughly 3,800 repos, which GitHub describes as “directionally consistent with our investigation so far,” and the published GitHub post states no evidence of impact to customer information stored outside internal repositories. The extortion attempt reportedly demands offers in excess of $50,000 on underground forums. TanStack and Grafana — on 2026-05-11, attackers published 84 malicious versions across 42 @tanstack/* npm packages using the pull_request_target “Pwn Request” pattern combined with GitHub Actions cache poisoning and runtime memory extraction of an OIDC token from the Actions runner process. Grafana detected the activity the same day, rotated a significant number of workflow tokens, but per Grafana’s own write-up a missed token gave attackers GitHub access; on 2026-05-16 Grafana confirmed source-code exfiltration and refused a ransom demand, citing FBI guidance. Grafana states no evidence of customer production-system impact. Shai-Hulud — Sonatype’s writeup names the npm-maintainer account as the persistent soft target and reports the worm-style npm compromise pattern is back, with a separate Sonatype post on 2026-05-21 describing a hijacked npm package distributing a PolinRider-linked RAT. 
Help Net Security and other trade press separately report that the GitHub and Grafana breaches share a common root cause in the nx-console TanStack compromise; the GitHub blog itself attributes the breach narrowly to a poisoned VS Code extension without naming the package, so the cross-incident link is reported by trade press, not confirmed by GitHub.
Why it matters: Edition 12’s harness-attack-surface Must Read framed agent-harness security as a problem of agents reasoning around their own sandboxes. Edition 13’s supply-chain cluster is the dual problem: every harness layer is built on top of npm, VS Code Marketplace, and GitHub Actions, and the same week three first-party incidents named the substrate as the failure point. The agent-economics framing matters here too. Coding agents are accelerating the rate at which developers add dependencies, install editor extensions, and rely on CI workflows that pull from public package registries — and the same dynamics that make Claude Code productive on a fresh codebase make agent-driven CI pipelines hungry consumers of fresh npm packages. The Grafana incident is the worst-case demonstration: a sophisticated team detected the TanStack compromise the same day it shipped, rotated tokens, and still lost source code to one missed token. For practitioner teams, three actions land in the next month. First, audit which VS Code extensions employees can install — the GitHub-employee-device vector is now the named entry point of a major incident, and the question “which extensions does our coding-agent harness require” becomes a security inventory problem. Second, treat OIDC token rotation after any GitHub Actions workflow that touched a compromised package as the assumed-breach default; Grafana’s experience says one missed token is enough. Third, take Sonatype’s “maintainer accounts are still the soft target” framing as a planning input — the agent-driven dependency growth curve is climbing into a maintainer-account threat model that already has named worms. The substrate every agent platform depends on is being attacked in the same weeks the platforms are reaching GA.
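A first step behind the second action above is simply knowing whether a checkout ever resolved a flagged version. The sketch below scans the "packages" map of an npm package-lock.json (lockfile v2/v3 layout) against an advisory list. The advisory entries here are hypothetical placeholders, not the real TanStack list, and the function names are illustrative rather than part of any existing tool.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical advisory data: package name -> known-bad versions.
# Placeholder values only; substitute the versions from a real advisory.
COMPROMISED: Dict[str, Set[str]] = {
    "@example/router": {"9.9.1", "9.9.2"},
}


def package_name_from_path(path: str) -> str:
    """Derive a package name from a lockfile 'packages' key.

    Keys look like 'node_modules/a/node_modules/@scope/b'; the name is
    whatever follows the last 'node_modules/' segment.
    """
    marker = "node_modules/"
    idx = path.rfind(marker)
    return path[idx + len(marker):] if idx != -1 else path


def find_compromised(
    lock: dict, bad: Dict[str, Set[str]]
) -> List[Tuple[str, str, str]]:
    """Return (lockfile path, name, version) for every known-bad install."""
    hits = []
    for path, meta in lock.get("packages", {}).items():
        if not path:  # the empty key describes the root project itself
            continue
        # Aliased installs carry an explicit "name"; otherwise use the path.
        name = meta.get("name") or package_name_from_path(path)
        version = meta.get("version", "")
        if version in bad.get(name, set()):
            hits.append((path, name, version))
    return sorted(hits)
```

In practice the lockfile would be loaded with json.load from each repository and CI cache. Any hit would then feed the token-rotation decision rather than replace it, since a clean lockfile today says nothing about what a workflow installed last week.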
Anthropic’s four moves — Stainless acquisition, MCP Tunnels, Project Glasswing, the silent sandbox patch
Four overlapping Anthropic releases in five days. Stainless acquisition (2026-05-18): Anthropic acquired the SDK and MCP-server generator startup whose software powers every official Anthropic SDK and is widely used by rival AI labs including OpenAI and Google. The Information’s reporting cited in TechCrunch describes the deal as exceeding $300M. Per the Anthropic post, the company will wind down all hosted Stainless products; customers retain rights to SDKs they’ve already generated and can continue to modify them. MCP Tunnels + self-hosted sandboxes (2026-05-19, research preview): a lightweight gateway on the customer side makes a single outbound encrypted connection to Anthropic infrastructure, giving Managed Agents a path to a customer-managed MCP server. No inbound firewall rules, no public endpoints. Positioned as the enterprise answer for teams that want Managed Agents but cannot expose internal databases, APIs, or ticketing systems to the public internet. Access is currently invite-only. Project Glasswing initial update (2026-05-22): Anthropic’s collaborative initiative with approximately 50 partners reports that Claude Mythos Preview has discovered over 10,000 high- or critical-severity vulnerabilities across partner organizations in roughly one month. Anthropic’s post cites Cloudflare finding 2,000 bugs (400 critical/high) with what they describe as “better than human testers” accuracy; Mozilla discovering 271 vulnerabilities in Firefox 150, “over ten times more” than in previous versions; and open-source scanning identifying an estimated 6,202 high/critical vulnerabilities with a 90.6% validation rate. Concrete deliverables: Claude Security public beta (already patching 2,100+ vulnerabilities), a Cyber Verification Program, and an Alpha-Omega/OpenSSF partnership. The post explicitly names the bottleneck as having shifted from finding vulnerabilities to patching them.
Worth reading these figures alongside their methodological caveats: the “critical/high-severity” classifications and validation rates are reported by Anthropic and partner organizations rather than independently audited, and ecosystem-wide reporting on AI-generated bug-report noise (see the Ars Technica “bug bounty slop” item in Worth Scanning) sets the counter-context for any “10,000+ vulnerabilities” headline. The silent sandbox patch (disclosed by Aonan Guan, covered by SecurityWeek 2026-05-20): a SOCKS5 hostname null-byte injection bypass in the Claude Code network sandbox allowed allowlist circumvention — a host like attacker-host.com\x00.google.com would pass the filter. Per SecurityWeek’s coverage and Guan’s blog post, the vulnerability was present from the sandbox’s October 20, 2025 GA through Claude Code 2.1.88 on 2026-03-31; the fix shipped without a CVE assignment and without mention in the release notes. Guan’s complaint is that teams running vulnerable versions in production had no signal to rotate credentials or analyze egress traffic. This is the second silent sandbox patch from Anthropic in recent months.
Why it matters: The four moves cohere as one strategic arc, and it lands in the same week as Microsoft pulling internal Claude Code licenses. Stainless is Anthropic acquiring a developer-experience layer (SDKs, MCP-server scaffolding) that, per the Anthropic and TechCrunch posts, is also relied on by rival labs including OpenAI and Google — those teams now have a vendor-relationship decision to make as the hosted Stainless products wind down. MCP Tunnels is Anthropic offering an enterprise-perimeter answer to the public-internet-exposure objection that’s been blocking Managed Agents adoption inside regulated industries. Project Glasswing is the vulnerability-discovery-as-product play — Claude Security in beta is Anthropic monetising the same capability that found 10,000+ partner-organization bugs in a month. And the silent sandbox patch is the inverse signal: at the same time Anthropic is shipping vulnerability discovery as a product and self-hosted sandboxes as enterprise primitives, an independent researcher’s allowlist-bypass report ships fixed-without-CVE-or-release-note. The contrast is the editorial point — running the harness layer in production now requires treating “vendor’s own sandbox” the way regulated workloads have always treated supply-chain components: as something with its own CVE-and-patch posture, whose advisory channels you have to audit because the vendor’s silent-fix culture means absence of CVE is not absence of vulnerability. Guan’s framing of the structural complaint: “A team running that config in production from October 20 through November 26 had no way to know the sandbox was effectively off, and no notice afterwards that it had ever been off.” Practitioner teams running Claude Code or other agent harnesses in production should treat Aonan Guan’s blog as a primary advisory channel for the rest of 2026 until vendor disclosure norms catch up.
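The bug class is easy to reproduce in miniature. The sketch below is not Claude Code’s sandbox code. It models a suffix-match allowlist applied to the raw hostname string, and it assumes (as is typical for this class of bug) that a downstream consumer treats the hostname as a C-style string and stops at the first NUL byte. The function names and the allowlist contents are hypothetical.

```python
import re
from typing import Iterable

ALLOWED_SUFFIXES = (".google.com",)  # hypothetical allowlist

# One DNS label: letters, digits, inner hyphens, 1 to 63 characters.
_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def naive_is_allowed(hostname: str,
                     suffixes: Iterable[str] = ALLOWED_SUFFIXES) -> bool:
    """Suffix check on the raw string: the pattern the bypass defeats."""
    return any(hostname.endswith(s) for s in suffixes)


def host_seen_by_c_consumer(hostname: str) -> str:
    """Model a downstream component that truncates at the first NUL byte."""
    return hostname.split("\x00", 1)[0]


def strict_is_allowed(hostname: str,
                      suffixes: Iterable[str] = ALLOWED_SUFFIXES) -> bool:
    """Validate the hostname's shape first, then match the allowlist."""
    host = hostname.lower()
    if not host or len(host) > 253:
        return False
    # Every dot-separated label must be a plain DNS label; this rejects
    # NUL bytes, other control characters, empty labels, and non-ASCII.
    if not all(_LABEL.fullmatch(label) for label in host.split(".")):
        return False
    return any(host == s.lstrip(".") or host.endswith(s) for s in suffixes)


evil = "attacker-host.com\x00.google.com"
print(naive_is_allowed(evil))                # filter passes the raw string
print(repr(host_seen_by_c_consumer(evil)))   # what actually gets resolved
print(strict_is_allowed(evil))               # shape check rejects it
```

The design point is that the filter and the resolver must agree on what the hostname is. Rejecting anything that is not a well-formed hostname before matching removes the gap between the two views.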
Microsoft begins canceling internal Claude Code licenses — Experiences + Devices org loses access by June 30
Per Windows Central’s coverage, Microsoft began rolling back its internal Claude Code experiment on 2026-05-14. The Experiences + Devices group — the org responsible for Windows, Microsoft 365, Outlook, Teams, and Surface — will lose most Claude Code licenses by June 30, which is also the last day of Microsoft’s fiscal year. The internal memo (cited by Windows Central as authored by Rajesh Jha, EVP) frames the move as standardising on a single AI coding tool. Multiple trade-press reports characterise the unofficial driver as cost, though the published memo language emphasises standardisation. Microsoft opened Claude Code access to employees in December 2025, and the reporting describes broad internal adoption among developers, project managers, and designers before this week’s rollback. Some coverage characterises the move as a strategic contradiction: Microsoft sells GitHub Copilot as the default AI layer for software development while its own engineers reportedly found significant value in Anthropic’s Claude Code.
Why it matters: Read in sequence with Edition 12’s agent-meter Must Read — Anthropic moving programmatic Claude usage onto a dedicated credit pool billed at API-style rates effective 2026-06-15, OpenAI offering two free months of Codex for enterprise migrants — Microsoft’s move adds a third data point to the same conversation: the largest enterprise customer of multiple AI labs is now publicly forcing internal standardisation on its own coding-agent product even when its engineers reportedly prefer the alternative. Three practitioner takeaways. First, the cost-of-Claude-Code-at-enterprise-scale is now a story big enough to drive procurement-level decisions at Microsoft-scale buyers, which is itself a load-bearing signal about the per-developer-month total cost of ownership for premium agent harnesses. Second, vendor-customer overlap inside the AI-coding-agent market is now visibly producing competitive pressure. Microsoft is both a major Anthropic customer — Claude is currently available in Microsoft Foundry and powers the Researcher agent in M365 Copilot — and OpenAI’s largest investor, and now ships a coding-agent product that competes directly with both labs. The structural conflict is unresolved. Third, the migration window itself is the practitioner-relevant detail — June 30 is the last day of Microsoft’s fiscal year, and the standardisation timing pulls forward decisions any enterprise team running multi-vendor coding agents was probably going to have to make anyway. Expect more enterprise standardisation decisions to land before June 15 as teams choose how to consume the Anthropic agent-meter change and the OpenAI Codex window.
AWS MCP Server reaches GA + Cloudflare completes the six-layer agent platform
Two vendor platform stories that complete a pattern Edition 11 named (the harness shipped as platform primitive) and Edition 12 extended (Claude Platform on AWS GA, Routines). AWS MCP Server GA: AWS’s actual GA announcement landed earlier in May; InfoQ’s deep coverage published 2026-05-24 surfaces the full feature set. The managed server gives AI coding agents access to AWS services through MCP with IAM-grade governance: IAM context keys (no separate IAM permission required), CloudTrail logging, CloudWatch metrics, and two new IAM global condition context keys, aws:ViaAWSMCPService (set to true when the request passes through an AWS managed MCP server) and aws:CalledViaAWSMCP (containing the service principal of the AWS managed MCP server). Agents can call any AWS API through a single tool, including operations that require file uploads or long-running execution; sandboxed Python execution lets agents run code against AWS services without local filesystem or shell access. Cloudflare’s six-layer agent platform: InfoQ’s coverage (2026-05-22) summarises the now-completed stack, naming among its layers compute (Dynamic Workers + Sandboxes), orchestration (Dynamic Workflows), memory (Agent Memory), browsing (Browser Run), and commerce (Stripe Projects). Browser Run itself was rebuilt on Cloudflare Containers; per Cloudflare’s blog, the rebuild delivers 4× higher concurrency (120 simultaneous browsers, up from 30) and 50% faster response times for quick actions, and adds WebGL and WebMCP support. The migration required no client-side changes.
Why it matters: This is the build-vs-buy line moving hard against custom platform work, two weeks in a row. The AWS MCP Server’s IAM context keys deserve specific attention: they make it possible to author IAM policies that treat “request through a managed AWS MCP server” as a first-class principal — the same kind of differentiated identity that distinguishes “human authenticated via SSO” from “service authenticated via assume-role” in regulated AWS environments. Teams that have been building bespoke proxies to add authentication and audit logging around MCP servers — a common pattern surfaced in the Edition 11 MCP enterprise gateway scout — now have a managed alternative with IAM-native governance. Cloudflare’s six-layer naming is the editorial-flag worth holding: when a platform vendor publicly names its layer stack, the layers are being marketed at architects making 18-month bets. The combination this week: AWS plus Cloudflare now offer end-to-end agent runtimes (compute, sandbox, orchestration, MCP-mediated tool access, browser, commerce) with vendor-grade governance baked in. The Edition 11 thesis — that the harness layer would consolidate into vendor primitives — is becoming increasingly visible across multiple major cloud and platform vendors, though whether the pattern stabilises into long-term market structure remains to be seen. For many practitioner teams, the question is increasingly shifting from “should we build our own agent runtime” toward “which vendor’s runtime do we standardise on and how do we keep optionality” — though the build-bias still makes sense for teams with specific governance, perimeter, or cost constraints that the vendor primitives don’t yet meet.
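To make the “first-class principal” point concrete, here is a minimal sketch of a policy that denies a few destructive actions whenever a request arrives through a managed MCP server. The condition key name comes from the coverage above. Using the Bool operator against it is an assumption based on the key being described as “set to true”, and the action list and Sid are illustrative choices, so check the AWS documentation before relying on this shape.

```python
import json
from typing import Iterable


def deny_via_mcp_policy(actions: Iterable[str]) -> dict:
    """Build an IAM policy document denying `actions` for MCP-mediated calls.

    Assumption: aws:ViaAWSMCPService is a boolean-valued global condition
    key, so the standard "Bool" condition operator applies to it.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDestructiveCallsViaManagedMcp",  # illustrative
                "Effect": "Deny",
                "Action": sorted(actions),
                "Resource": "*",
                "Condition": {"Bool": {"aws:ViaAWSMCPService": "true"}},
            }
        ],
    }


policy = deny_via_mcp_policy(
    ["s3:DeleteBucket", "iam:DeleteRole", "ec2:TerminateInstances"]
)
print(json.dumps(policy, indent=2))
```

An explicit Deny scoped by the condition key leaves the same role free to perform those actions when a human or pipeline calls AWS directly, which is the differentiated-identity behaviour the paragraph describes.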
The methodology critique wave hardens — Vibe Coding bliki, the Orchestration Tax, anti-vibe cluster, sensors series
Four threads in the same week with the same underlying message — agent-driven coding workflows have a methodology problem that is no longer adequately described by enthusiasm or skepticism alone. Fowler’s Bliki entry for “Vibe Coding” stabilises the canonical definition: “building a software application by prompting an LLM, telling it what to build, trying it out, prompting for changes — but without looking at any of the code that the LLM generates.” Coined by Karpathy in February 2025; per Fowler, best for disposable software, with maintainability/correctness/security problems when used broadly. Böckeler’s sensors series advances the harness-engineering framework Fowler introduced earlier — “Maintainability sensors for coding agents” (May 19) and “Three more static code analysis sensors” (May 20). Böckeler’s reported finding: a computational sensor for coupling data was lackluster; prompting an inferential sensor to review modularity was more effective. Osmani’s “The Orchestration Tax is You” names a constraint that’s been latent in the orchestration discourse: starting more agents is now easy, but the judgement to steer them and merge their output still routes through one serial processor — the human — and per Osmani, “the only real fix is to start architecting your own attention like you architect any concurrent system.” Hacker News / blog critique cluster runs in parallel — “Claude is not your architect. Stop letting it pretend” (May 24), Olano’s “--dangerously-skip-reading-code” (May 23), Jacob Harris’s “Why I don’t vibe code” (May 20). Latent Space’s “All Model Labs are now Agent Labs” (May 23) reframes the supply side of the same shift.
Why it matters: The Edition 12 Long View flagged that the agent threat model is becoming asymmetric and the cost model is catching up. Edition 13 is the week the methodology model also starts catching up — and it’s doing so from multiple independent directions that converge on the same operational realities. Fowler stabilising the Bliki definition matters because the Bliki is the long-form reference where Fowler’s vocabulary stabilises for architectural conversation — once “Vibe Coding” lands there, the term and the included acknowledgement of the maintainability/correctness/security tradeoffs are in the standard reference rather than only in the critique. Böckeler’s sensors series is the constructive side: how to instrument a coding-agent harness so the harness itself catches the maintainability regressions vibe coding produces. Osmani’s “Orchestration Tax” names what the harness layer doesn’t solve: the human-attention bottleneck that scales N×slower than the agent-spawn rate. The HN/blog cluster is the third leg — increasingly senior practitioner voices pushing back on the “agent as architect” default. Together they form the working methodology of mid-2026: agents can write code faster than humans can review it; the harness can catch some classes of regressions before they reach humans; the orchestration tax is real and architecting your attention is the practitioner-side response; and the Bliki now has a canonical term to anchor the conversation. For teams building or operating coding-agent workflows, this is the curriculum to assign — Fowler defines the failure mode, Böckeler shows how to detect it, Osmani names the cost, and the critique cluster sets the rhetorical floor.
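For readers new to the sensor vocabulary, a “computational” sensor is just a deterministic measurement a harness can run after each agent edit. The sketch below is an illustrative fan-out (import count) check built on Python’s ast module. It is not Böckeler’s implementation, and the threshold and function names are hypothetical. Her reported finding was that this kind of coupling metric was lackluster compared with an inferential review, which is worth keeping in mind when reading it.

```python
import ast
from typing import Dict, List, Set, Tuple


def imported_modules(source: str) -> Set[str]:
    """Collect the top-level modules a Python source file imports."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.level:  # relative import: keep the dots as a marker
                found.add("." * node.level + (node.module or ""))
            elif node.module:
                found.add(node.module.split(".")[0])
    return found


def fan_out_report(sources: Dict[str, str],
                   threshold: int) -> List[Tuple[str, int]]:
    """Return (filename, import count) for files above the threshold."""
    counts = {name: len(imported_modules(src)) for name, src in sources.items()}
    return sorted((n, c) for n, c in counts.items() if c > threshold)
```

A harness could run fan_out_report over the files an agent touched and feed any result back into the next prompt. The number says a file imports a lot; it cannot say whether the module boundaries make sense, which is the gap the inferential sensor is meant to cover.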
Worth Scanning
- Anthropic — Widening the conversation on frontier AI (Anthropic, 2026-05-19) — Anthropic policy/communications post that pairs with The Anthropic Institute’s earlier research agenda (Economic diffusion, Threats and resilience, AI systems in the wild, AI-driven R&D, published 2026-05-07).
- KPMG integrates Claude across its core business and workforce of more than 276,000 in strategic alliance (Anthropic, 2026-05-19) — Enterprise-distribution data point; read alongside the GitHub Gartner Magic Quadrant item directly below.
- GitHub recognized as a Leader in the Gartner® Magic Quadrant™ for Enterprise AI Coding Agents for the third year in a row (GitHub Blog, 2026-05-22) — Take with appropriate analyst-validation skepticism, but the third-consecutive-year framing is a market-incumbency signal.
- Designing a Multi-Agent System for Engineering Support at Scale: a Case Study from Grab (InfoQ, 2026-05-20) — First-party production deployment write-up; the rubric weights case studies heavily because they include architecture detail rather than vendor framing.
- xAI Releases Grok Skills and Updates Tool Calling Responses API (InfoQ, 2026-05-22) — Skills semantics now shipped by Anthropic, xAI, and (in the broader sense) Google. The vendor convergence on Skills as a capability unit continues.
- Google Introduces Middleware Architecture for Genkit Applications (InfoQ, 2026-05-24) — Genkit moves toward agent-framework feature parity with LangGraph / CrewAI / Mastra by adding middleware.
- Datasette Agent (Simon Willison, 2026-05-21) — First-party constrained agent harness over Datasette + SQLite, with companion plugins for charts, sprites, and cost accounting.
- Latent Space — Giving Agents Computers — Ivan Burazin, Daytona (Latent Space, 2026-05-21) — Sandbox-per-task as primitive; pairs editorially with the Cloudflare + AWS GA wave.
- Latent Space — Railway: The Agent-Native Cloud (Latent Space, 2026-05-20) — Railway’s framing of the agent-native cloud thesis; same shape, smaller vendor.
- Latent Space — New AI Infra unicorns: Exa, Modal, TurboPuffer (Latent Space, 2026-05-22) — Three agent-infrastructure unicorns named in the same brief; the runtime layer Edition 11 catalogued is now unicorn-class market.
- Ocean Emerges From Stealth With $28M for Agentic Email Security Platform (SecurityWeek, 2026-05-21) — Another agents-applied-to-security entrant alongside Project Glasswing.
- ‘Underminr’ Vulnerability Lets Attackers Hide Malicious Connections Behind Trusted Domains (SecurityWeek, 2026-05-23) — Domain-fronting-class vulnerability relevant to agent-egress allowlist designs in the same architectural class as the Claude Code SOCKS5 null-byte bypass.
- Building Trusted AI Development With Kiro and Sonatype (Sonatype, 2026-05-18) — Worked example of SDD methodology meeting supply-chain-governance practice.
- Take your local GitHub sessions anywhere (GitHub Blog, 2026-05-18) — Session-portability for local Copilot context; another “portable agent state” vendor primitive.
- Bug bounty businesses bombarded with AI slop (Ars Technica, 2026-05-18) — Useful counterweight to Project Glasswing’s 10,000-vulnerability framing; there is a quality distribution behind any such number that doesn’t survive transfer to third-party security workflows.
New Tools & Repos
- DeepSeek-Reasonix — Terminal coding agent (5.5k★) engineered around DeepSeek’s byte-stable prefix-cache mechanic; the project documents one user processing 435M input tokens in a day with a 99.82% cache hit rate at ~$12 (versus ~$61 uncached on v4-flash). Vendor-specific harnesses optimised for vendor-specific cache mechanics are now a practitioner cost-reduction lever.
- Datasette Agent + plugins — Python · Simon Willison’s constrained agent over Datasette + SQLite, with companion plugins datasette-agent-charts and datasette-agent-sprites, plus datasette-llm-accountant for cost tracking.
- Anthropic MCP Tunnels (research preview) — Outbound-only gateway for connecting Managed Agents to private MCP servers; access by request.
- AWS MCP Server (GA) — Managed MCP server with IAM-based governance; ships with aws:ViaAWSMCPService and aws:CalledViaAWSMCP IAM global condition context keys.
- Cloudflare Browser Run (Containers rebuild) — Per Cloudflare’s blog, 4× concurrency, 50% faster quick actions, WebGL + WebMCP support; no client-side migration required.
- Agent Skills for Context Engineering v2.3.0 — Adds a Measured Router Benchmark and corpus-wide skill hardening.
- GitHub Spec Kit 0.8.13 — Patch release of GitHub’s official Spec Kit (0.8.12 also shipped 2026-05-20).
- LangGraph 1.2.1 + sdk 0.3.15 + checkpoint 4.1.1 — Patch wave for LangGraph + SDK + checkpoint.
- CrewAI 1.14.5 — Stable patch release.
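The DeepSeek-Reasonix figures in the list above can be sanity-checked with a blended-price calculation. The sketch assumes the quoted ~$12 and ~$61 cover input tokens only and that every token is billed at either a cache-hit or a cache-miss rate. The per-million prices it derives are implied by those rounded figures, not taken from DeepSeek’s published pricing.

```python
def blended_input_cost(tokens_m: float, hit_rate: float,
                       price_hit: float, price_miss: float) -> float:
    """Cost in dollars for `tokens_m` million input tokens."""
    return tokens_m * (hit_rate * price_hit + (1.0 - hit_rate) * price_miss)


def implied_hit_price(tokens_m: float, hit_rate: float,
                      total_cost: float, price_miss: float) -> float:
    """Solve the blended-cost equation for the cache-hit price."""
    return (total_cost / tokens_m - (1.0 - hit_rate) * price_miss) / hit_rate


TOKENS_M = 435.0        # million input tokens, from the project's example
HIT_RATE = 0.9982       # cache hit rate, from the project's example
COST_CACHED = 12.0      # dollars, approximate
COST_UNCACHED = 61.0    # dollars, approximate

price_miss = COST_UNCACHED / TOKENS_M  # implied $/M tokens on a miss
price_hit = implied_hit_price(TOKENS_M, HIT_RATE, COST_CACHED, price_miss)
saving = 1.0 - COST_CACHED / COST_UNCACHED  # roughly 80%

print(f"implied miss price: ${price_miss:.4f}/M tokens")
print(f"implied hit price:  ${price_hit:.4f}/M tokens")
print(f"saving vs uncached: {saving:.1%}")
```

The useful part for budgeting is the sensitivity: rerunning blended_input_cost with a lower hit rate shows how quickly the saving erodes when a harness breaks prefix stability.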
Papers
- “Refactoring Runaway”: Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution — Names a specific coding-agent failure mode (refactoring runaway / tangled refactorings) where the agent’s issue-resolution scope expands beyond the original change. Pairs with the methodology-critique cluster.
- Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study — Empirical study of merge / reject patterns for agent-generated PRs; directly relevant to teams running Copilot Coding Agent, Claude Code, Codex, or Cursor at scale.
- CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding — Benchmark designed to quantify the value humans add when reviewing and steering agent output — the empirical counterpart to Osmani’s Orchestration Tax thesis.
- Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses — Names the harness layer and describes observability-driven auto-evolution of agent harnesses; research counterpart to the harness-as-platform-primitive thesis from Editions 11–12.
- Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks — Measurement paper on when coding agents take actions outside the stated task scope, even on benign prompts; pairs with the refactoring-runaway failure-mode characterization.
- SLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent Monitors — Benchmark for evasion attacks against agent observation/monitoring infrastructure; the adversarial counterpart to the harness-engineering observability work.
- Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents — Argues for interface/harness adaptation as the lever for deterministic agent behaviour rather than model fine-tuning; methodological match for the “lean harness” framing from Edition 12.
- Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents — Governance framework around the “Skills” abstraction; contract-based framing for enterprise agent deployments.
- DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback — Millisecond-level checkpoint/rollback for agent sandboxes; the kind of primitive that enables speculative-then-commit patterns in agent runtimes.
- Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents — Framework for turning legacy systems into operational specifications consumable by AI agents; direct match for the SDD pattern stack.
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios — Long-horizon benchmark for coding agents under realistic software-evolution scenarios.
- Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems — AI-generated formally verified systems combining inductive and deductive synthesis; the kind of work that matters when the critique of vibe coding centres on correctness.
Ecosystem Watch
- Anthropic acquires Stainless (Anthropic, 2026-05-18) — Acquisition of the SDK and MCP-server generator used by OpenAI, Google, and Cloudflare; per TechCrunch citing The Information, the deal is reported above $300M. Anthropic will wind down hosted Stainless products; customers retain SDK rights.
- Project Glasswing: An initial update (Anthropic, 2026-05-22) — Claude Mythos Preview reportedly found 10,000+ high/critical vulnerabilities across ~50 partner organisations in roughly one month; Claude Security shipped in public beta.
- Anthropic MCP Tunnels + self-hosted sandboxes (InfoQ, 2026-05-19) — Research preview of an outbound-only gateway for connecting Managed Agents to private MCP servers.
- AWS MCP Server GA (AWS) — Managed MCP server with IAM-based governance; aws:ViaAWSMCPService and aws:CalledViaAWSMCP global condition keys are the new IAM primitive.
- Cloudflare Agents Week roll-up + Browser Run Containers rebuild (Cloudflare) — Six-layer agent platform named; Browser Run rebuilt on Containers with 4× concurrency per Cloudflare’s blog.
- Google I/O 2026 — Gemini 3.5 Flash + Spark + Antigravity 2.0 (Google, 2026-05-19) — Gemini 3.5 Flash GA via Antigravity; Spark is the 24/7 personal-agent surface; Antigravity 2.0 is the platform.
- xAI Grok Skills + Tool Calling Responses API (xAI, via InfoQ 2026-05-22) — xAI ships Skills semantics, along with OpenAI-compatible-shaped updates to its tool-calling responses.
- Microsoft begins canceling internal Claude Code licenses (Microsoft, reported 2026-05-22) — Experiences + Devices org loses Claude Code by June 30; standardisation framed as cost-driven by trade press.
- Ocean — $28M for agentic email security (Ocean, via SecurityWeek 2026-05-21) — Agentic security funding round.
- InfoQ launches Online AI Engineering Cohort and Certification (InfoQ, 2026-05-22) — Senior-practitioner certification program; reads as the editorial layer trying to formalise practitioner methodology.
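For readers wondering what the AWS condition keys in the list above look like in practice, here is a minimal sketch of a guardrail policy built around one of them. It rests on assumptions: that aws:ViaAWSMCPService is a boolean-valued key that is true when a request arrives through the managed MCP server, and that a Deny statement is the right shape for your account. The helper name, the Sid, and the chosen actions are illustrative only. Check the AWS documentation for the key's actual type and semantics before relying on anything like this.

```python
import json


def deny_via_mcp_policy(actions, resources="*"):
    """Build an IAM policy document that denies the given actions when the
    request arrived through the managed AWS MCP server.

    Hypothetical helper for illustration; not part of any AWS SDK.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDestructiveActionsViaMcp",
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": resources,
                "Condition": {
                    # Assumption: the key is boolean-valued, so the standard
                    # "Bool" condition operator applies.
                    "Bool": {"aws:ViaAWSMCPService": "true"}
                },
            }
        ],
    }


# Example: block two destructive calls for any agent routed through MCP.
policy = deny_via_mcp_policy(["s3:DeleteBucket", "ec2:TerminateInstances"])
print(json.dumps(policy, indent=2))
```

The design choice worth noting is the explicit Deny: it overrides any Allow attached elsewhere, so an agent's MCP-routed requests stay constrained even when the underlying role is broadly permissioned.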
The Long View
Edition 11 named the harness shipping as a platform primitive. Edition 12 catalogued the productisation week on one vendor (Anthropic). Edition 13 is the week the substrate shows up as the load-bearing variable — and the substrate is in active failure mode.
Two parallel arcs ran through this week’s stories, and framing them as one shift would overstate the underlying coordination: vendor product roadmaps, supply-chain incidents, and procurement decisions happen on independent timelines. But naming the rough co-occurrence is still worth doing. On the platform side, the GA wave is real and accelerating: AWS MCP Server with IAM-grade governance, Cloudflare’s six-layer agent stack, Google’s Antigravity 2.0, Anthropic’s Stainless acquisition and MCP Tunnels. Vendor primitives are increasingly available, and the build-vs-buy line for agent runtimes is shifting against custom platform work, except for teams with specific perimeter or governance constraints that the vendor stacks don’t yet meet. On the substrate side, the foundations every one of those platforms depends on were under sustained attack in the same five days: GitHub disclosed an internal-repository breach via a poisoned VS Code extension, Grafana lost source code via the TanStack npm compromise after rotating tokens but missing one, and Sonatype flagged both the return of Shai-Hulud and an unrelated hijacked npm package distributing a RAT. Trade press has begun connecting the GitHub and Grafana incidents to a common nx-console root cause; GitHub itself hasn’t confirmed the connection.
The structural point: the layers most agent practitioners assume are “the boring foundation” — package registries, editor extensions, CI workflow integrations — are turning into named single points of failure for production agent systems, in the same weeks that platform vendors are marketing the layers above them as managed and IAM-governed. Anthropic’s silent Claude Code sandbox patch is the same shape one level up: the harness vendor has its own CVE-and-patch posture that practitioners have to audit, because absence of CVE no longer reliably means absence of vulnerability.
The methodology-critique wave fits the same frame. Fowler stabilising “Vibe Coding” in the Bliki, Böckeler shipping sensors, Osmani naming the orchestration tax — these are practitioners building the methodology layer that the harness layer doesn’t ship with. Vendor primitives keep arriving faster, but the methodology and substrate work — what to instrument, what to detect, which dependencies to audit, how to architect human attention as a concurrent resource — is the practitioner’s. The agent platforms get better every quarter. The substrate gets attacked every quarter. The methodology has to keep pace with both.
The Artificer’s Grimoire — weekly intelligence on harness engineering and autonomous agents — for practitioners, by Tim Schiller (Artificer Digital).