Deep-dive research briefings on specific topics, newest first.
Vendor comparison of scheduled / background coding-agent platforms for 2026 H2
Four vendors shipped or productised background-agent capabilities inside one fortnight. Teams choosing where to host scheduled agent workflows in 2026 H2 need a map of orchestration, governance, pricing, and ecosystem that distinguishes the four offers without taking vendor framing at face value.
Patterns for continuous integration on agent-authored code when the reference output is non-deterministic — what to measure, how to weight semantic versus structural validators, how to fail builds without producing flaky-test fatigue, and how to keep the eval signal calibrated as both the codebase and the agent drift.
Agent-authored pull requests are now the majority of CI traffic at several large vendors. Standard CI was built around deterministic graders against fixed reference outputs; the workload it now has to gate doesn't fit that shape. Teams shipping agent-authored code at meaningful volume need a working eval rubric before the pipeline either rubber-stamps everything or gets disabled out of fatigue.
LayerX's ClaudeBleed disclosure against Anthropic's Claude Chrome extension — the execution-origin-vs-execution-context failure mode, the partial patch, and what an execution-context-aware authorisation pattern looks like for agent products shipping into shared host environments
ClaudeBleed is the first widely-reported takeover-class vulnerability in an Anthropic-shipped consumer agent surface. The failure mode — trusting where code runs rather than who is running it — is the trust-boundary mistake every team shipping agent UI into browsers, IDEs, terminals, or OS shells inherits whether they realise it or not. The Chrome-extension specifics matter less than the authorisation pattern they reveal.
The operational pipeline behind Mozilla's 271-vulnerability Firefox 150 release — agentic harness, sanitizer-driven validation, ephemeral-VM parallelism, deduplication and triage integration, and the remediation-pipeline capacity question for security teams trying to replicate it
Capability is no longer the bottleneck — operations is. Mozilla's pipeline turns a frontier model into a working defensive primitive, and the operational shape it requires (sanitizer-build success signal, ephemeral-VM parallelism, second-stage grader model, deduplicated bug lifecycle integration, two-engineer-per-patch remediation discipline) is the actual blueprint other security teams will be asked to reproduce in 2026-Q3 and beyond.
Three sandbox-per-task primitives shipped the same week at three different layers — Google's GKE Agent Sandbox (kernel-isolated pods), Cloudflare's Dynamic Workflows (per-tenant durable code in V8 isolates), and Anthropic's Claude Code Auto Mode (per-action permission classifier). Comparing what each actually isolates, where the failure modes live, and which primitive to reach for under which constraints.
Sandbox-per-task is now the dominant production pattern for running untrusted agent code, and three of the four credible vendor implementations landed in seven days. The build-vs-buy calculus and the layering decision (which of these primitives stack together?) are the load-bearing platform-architecture choices for any team running agents at scale in mid-2026.
Governance patterns for agent-skills marketplaces — signing, intent disclosure, sandboxed execution, consent UX — and how to evaluate the supply-chain hygiene of any skills marketplace your agents pull from
Skills are the unit of packaged agent know-how converging across vendors, and they execute inside the user's agent context with the user's permissions. Intent-misaligned skills are a distinct threat from malware-laden skills, and the marketplace governance for either is barely past prototype.
Practitioner configuration, audit, and graduation patterns for Auto Mode-style classifier gates and Codex /goal-style budget gates, in light of new independent empirical evidence
Auto Mode and /goal are now the default agency primitives in the two largest coding-agent product lines, and the first independent stress-tests, the first practitioner-visible production wipe, and the first coordinated regulator guidance all landed in the same week.
How spec-driven development practices change when multi-engineer teams (and their AI agents) need to coordinate around shared specifications, comparing the enterprise-consultancy formalisations (ThoughtWorks SPDD, Spec Kit) with the community-practitioner adaptations (Specsmaxxing's feature.yaml, Gherkin/BDD revivals)
The cost of getting team-scale SDD wrong is review fatigue, spec drift, and the SpecFall antipattern that turns shared specifications into stale documentation. Teams adopting AI coding agents now have to pick between several mature-enough but architecturally different methodologies, and the choice determines who reviews what, where the spec lives, and what fails first when the team grows past four engineers.
The post-Mythos design space for agent containment — sandboxes, egress proxies, capability-scoped credentials, probe classifiers, and what each vendor's managed runtime ships by default
An agent that can compose a multi-step exploit to escape its sandbox is now an empirical event, not a thought experiment. Every team running long-horizon agents with network access has to design containment with adversaries in mind, and the primitives split cleanly enough that practitioners can compose their own stack rather than buying one bundled.
Token economics of agentic coding in April 2026 — where the spend actually goes, what the major-vendor pricing experiments imply, and the architectural levers that bend the cost curve at production scale
Agentic coding tools are being repriced publicly in real time: Copilot moves to usage-based billing on June 1, Anthropic A/B-tested removing Claude Code from the $20 tier and then reverted, and the per-developer monthly bill is now load-bearing for any team's 2026 stack-budget assumptions.
Comparative analysis of the four managed agent runtimes shipping in Q2 2026 — model-coupled (Anthropic), neutral (Cloudflare), and hyperscaler-bundled (AWS / Microsoft / Google) — across the dimensions that determine lock-in
The runtime layer is where session sandboxing, identity, MCP policy, observability, and billing converge; picking one is the load-bearing decision for any 2026-Q3 production agent stack.
Comparative analysis of the four major agent-registry offerings and the decision framework for picking one
Registries are the layer where agent-protocol fragmentation gets absorbed and where enterprise governance lives, so choosing one is a lock-in decision for the decade.