Artificer’s Grimoire — Edition 5 · March 29, 2026
Both Anthropic and OpenAI ship autonomous agent infrastructure in the same week — while a supply chain attack on LiteLLM makes the case that guardrails aren’t optional. Meanwhile, hard data arrives: over-privileged agents cause 4.5x more security incidents, and a single skills file closes a 70-point performance gap.
Must Read
Anthropic’s Biggest Week: Auto Mode, Computer Use, and the Permission Problem
Anthropic shipped two major features on consecutive days. Auto mode uses a Sonnet 4.6 classifier to make permission decisions on behalf of the developer, blocking actions that escalate beyond task scope or appear driven by prompt injection. Computer use gives Claude the ability to see, navigate, and control a user’s desktop. Together with Dispatch (persistent mobile-to-desktop agent threads), this represents a shift from per-action approval to autonomous-with-guardrails.
Why it matters: This is Anthropic’s answer to the agent permission problem. Every coding agent user knows the friction — approving every file write, every shell command. Auto mode replaces the binary allow/deny with a classifier that understands task scope. The architecture is worth studying: a cheaper, faster model making permission decisions for a more capable model. If it works, it’s the template for how autonomous agents earn trust incrementally. If it doesn’t, Anthropic just handed agents the keys to the desktop. Latent Space called it “the biggest Claude launch of all time” — the ambition matches the risk.
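The architecture worth studying, a cheap classifier gating a stronger executor, can be sketched roughly as follows. Everything here is illustrative: the decision labels, the heuristics, and the stubbed classifier are assumptions for the sketch, not Anthropic's actual implementation.

```python
# Sketch of a classifier-gated permission loop: a small, fast "judge"
# decides whether a proposed agent action stays within task scope
# before it runs. The classify() body is a stub; a real system would
# prompt a cheap model with the task scope and proposed action and
# parse a structured verdict. All names here are hypothetical.

from dataclasses import dataclass

ALLOW, DENY, ESCALATE = "allow", "deny", "escalate"

@dataclass
class Action:
    tool: str          # e.g. "shell", "file_write"
    argument: str      # command or path
    task_scope: str    # natural-language description of the approved task

def classify(action: Action) -> str:
    """Stand-in for the permission classifier model."""
    if "curl" in action.argument and "| sh" in action.argument:
        return DENY                  # injection-style download-and-run
    if action.tool == "shell" and "rm -rf" in action.argument:
        return ESCALATE              # destructive: ask the human
    return ALLOW

def run_with_guardrail(action: Action) -> str:
    verdict = classify(action)
    if verdict == ALLOW:
        return f"executed {action.tool}: {action.argument}"
    if verdict == ESCALATE:
        return "paused: awaiting human approval"
    return "blocked: outside task scope or injection-like"
```

The design question is where the middle tier lives: a binary allow/deny loses exactly the graduated trust that makes autonomy workable.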
LiteLLM Supply Chain Attack: CI/CD Poisoning, Credential Theft at Scale
LiteLLM versions 1.82.7 and 1.82.8 were compromised by threat actor TeamPCP via a poisoned Trivy security scanner in the project’s CI/CD pipeline. The malware exfiltrated SSH keys, cloud credentials, Kubernetes secrets, and Docker configs — all triggered on install, without importing the library. The compromised versions were live for approximately two hours on a package averaging three million daily downloads. TeamPCP also compromised Checkmarx’s GitHub Actions in the same campaign.
Why it matters: This is the largest supply chain attack targeting AI infrastructure tooling to date, and the attack vector — CI/CD pipeline poisoning — is particularly concerning. LiteLLM is the de facto LLM proxy layer; if it’s in your stack, you were potentially exposed. The attack triggered on install, not import — meaning CI/CD pipelines that pull fresh dependencies were compromised silently. Simon Willison’s coverage of package manager cooldowns — pnpm, Yarn, Bun, Deno, and uv all now support minimum release age settings — is the practical defensive response. If you’re running agent infrastructure in production, audit your dependency pinning this week.
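The cooldown defense is simple to reason about: refuse any release younger than some minimum age. A minimal sketch of the core check, as a pure function with no particular package manager's API assumed:

```python
# Minimal sketch of a "minimum release age" check, the cooldown
# defense that pnpm, uv, and others now support natively. A release
# younger than the window is refused, so a poisoned version that is
# published and yanked within hours never reaches a fresh CI resolve.

from datetime import datetime, timedelta, timezone

def passes_cooldown(released_at: datetime,
                    now: datetime,
                    min_age_days: int = 7) -> bool:
    """True if the release is old enough to install."""
    return now - released_at >= timedelta(days=min_age_days)
```

The compromised LiteLLM builds were live for roughly two hours; any nonzero cooldown window would have excluded them from a fresh dependency resolve.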
OpenAI Extends Responses API into an Agent Platform
OpenAI extended the Responses API with a shell tool, built-in agent execution loop, hosted Debian 12 container workspace, server-side context compaction for multi-day sessions, and reusable SKILL.md manifests. Both OpenAI and Anthropic have now converged on the same skills standard: YAML-frontmatter markdown files that encode agent capabilities.
Why it matters: The gap between “model API” and “agent platform” closed this week. OpenAI is now offering sandboxed execution, persistent state, and a skills ecosystem — the same architectural components Anthropic ships with Claude Code. The skills convergence is the signal to watch: SKILL.md, CLAUDE.md, AGENTS.md, and LangSmith Fleet skills are all variations on the same pattern. We’re watching a de facto standard emerge in real time. For practitioners, the implication is clear — invest in your skills layer. It’s becoming the portable unit of agent capability across platforms.
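The converging format is concrete enough to show. A hypothetical SKILL.md along the lines both vendors describe: YAML frontmatter declaring metadata, with a markdown body carrying the instructions the agent loads. The field names and content here are illustrative, not a published schema.

```markdown
---
name: deploy-helper
description: Guidance for deploying this service safely
version: 0.1.0
---

# Deploy Helper

When deploying, always:

1. Run the test suite before building the image.
2. Deploy to staging first; never deploy to prod directly.
3. Consult the rollback runbook in `docs/rollback.md` on failure.
```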
Agent Governance Gets Hard Data: 4.5x More Incidents, and a Pattern for Fixing It
Two pieces that define the governance problem and its solution. Teleport’s “2026 State of AI in Enterprise Infrastructure Security” report finds that enterprises granting excessive permissions to AI systems experience 4.5 times as many security incidents as those that don’t. Separately, InfoQ published “Architectural Governance at AI Speed,” proposing Declarative Architecture — transforming ADRs and event models into automated guardrails where the conformant path is the path of least resistance.
Why it matters: The Teleport data is the number this conversation needed. Not “governance is good practice” but “over-permissioned agents cause 4.5x more incidents.” That’s a procurement conversation, a board-level metric. Pair it with Declarative Architecture — which offers the operational pattern — and you get the full picture: least-privilege isn’t just about restricting access, it’s about encoding architectural constraints so agents can’t drift. The ALARA paper from arXiv (2603.20380) arrives with the same thesis from a different angle: context exposure has real costs and should be minimized to what’s “reasonably achievable.” Three independent sources converging on the same principle in one week.
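Least privilege applies to context the same way it applies to credentials: grant each agent only the slice it needs, deny by default. A toy sketch of that posture; the categories, tasks, and store are invented for illustration and do not reproduce the ALARA paper's method or any vendor's product.

```python
# Toy sketch of least-privilege context assembly: each task declares
# which context categories it needs, and everything undeclared is
# withheld by default. An injected or drifting agent cannot leak what
# it was never handed. All categories and tasks here are invented.

CONTEXT_STORE = {
    "source_code":  "def handler(): ...",
    "prod_secrets": "DB_PASSWORD=...",
    "customer_pii": "alice@example.com",
    "style_guide":  "Use 4-space indents.",
}

TASK_NEEDS = {
    "fix_lint_errors": {"source_code", "style_guide"},
    "rotate_db_creds": {"prod_secrets"},
}

def build_context(task: str) -> dict:
    """Return only the context categories the task declares.
    Unknown tasks get nothing (deny by default)."""
    needed = TASK_NEEDS.get(task, set())
    return {k: v for k, v in CONTEXT_STORE.items() if k in needed}
```

The interesting engineering is in the declaration layer: the closer TASK_NEEDS lives to the ADRs and event models Declarative Architecture describes, the harder it is for scope to drift silently.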
Context Engineering Validated: A Skills File Closes a 70-Point Performance Gap
Google DeepMind developed a “Gemini API developer skill” providing agents with live documentation and SDK guidance. The gemini-3.1-pro-preview model jumped from a 28.2% to a 96.6% success rate when equipped with the skill — a nearly 70-point improvement from context alone, no retraining.
Why it matters: This is the most compelling data point for context engineering we’ve seen. 28% to 97% from a single skills file. No fine-tuning, no model change — just giving the agent the right context at the right time. If you’re still debating whether CLAUDE.md files, AGENTS.md, or skill manifests are worth maintaining, this number settles it. The skill layer isn’t a nice-to-have — it’s where the majority of agent capability actually lives. Every team running coding agents should be measuring their equivalent of this gap.
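Measuring “your equivalent of this gap” is a paired A/B eval: run the same task suite with and without the skill file in context and compare pass rates. A minimal harness sketch, with the agent runner stubbed out; in practice `run_agent` would invoke your coding agent with the skill text prepended to its context.

```python
# Minimal A/B harness sketch for measuring a "skill gap": run the
# same tasks with and without a skills file in context and report
# the pass-rate delta in percentage points. `run_agent` is supplied
# by the caller and returns True if the task succeeded.

from typing import Callable, Optional, Sequence

def skill_gap(tasks: Sequence[str],
              run_agent: Callable[[str, Optional[str]], bool],
              skill: str) -> float:
    """pass-rate(with skill) minus pass-rate(without), in points."""
    with_skill = sum(run_agent(t, skill) for t in tasks)
    without = sum(run_agent(t, None) for t in tasks)
    return 100.0 * (with_skill - without) / len(tasks)
```

Run it on a fixed task suite each time the skill file changes; a shrinking gap means the base model has internalized the guidance, a large one means the skill layer is doing real work.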
Worth Scanning
- Addy Osmani: The Code Agent Orchestra (Google Chrome) — Patterns for multi-agent coding: subagents, agent teams, and the shift from conductor to orchestrator. Practical reference from an influential voice.
- LangChain’s Agent Stack This Week (LangChain Blog) — Four posts in one week: agent authorization (Assistants vs. Claws identity model), eval readiness checklist, agent middleware, and skills in LangSmith Fleet. The skills piece confirms the convergence pattern.
- Kensho (S&P Global): Multi-Agent Finance with LangGraph (LangChain Blog) — Unified agentic access layer solving fragmented financial data retrieval at enterprise scale.
- Martin Fowler: ADR Bliki Update (Martin Fowler) — Updated ADR guidance emphasizing inverted pyramid style. ADRs as a forcing function for clarity — and a form of context anchoring (Edition 4).
- AWS: Architecting for Agentic AI Development (AWS) — Codebase patterns that help AI agents understand, modify, and validate applications. Reference architecture for Bedrock-based agent stacks.
- GitHub Actions 2026 Security Roadmap (GitHub) — Secure defaults, policy controls, and CI/CD observability. Directly relevant in the wake of the LiteLLM pipeline attack.
- GitHub Copilot Data Policy Change (GitHub) — From April 24, Copilot Free/Pro/Pro+ interaction data will be used for model training unless users opt out. Affects millions of developers.
- Google AppFunctions: Android as Agent-First OS (InfoQ) — Apps provide functional building blocks agents leverage through AI assistants. Early beta.
- Coding Agents Could Make Free Software Matter Again (Hacker News, 135 pts) — Thesis that source code access becomes more valuable when agents can act on it. High community engagement.
- Vibe Porting: JSONata in Go in 7 Hours for $400 (Simon Willison) — Pattern: comprehensive test suite + coding agent = credible language port. Also: vibe coding SwiftUI apps with Claude Opus 4.6.
- Nicole Forsgren at QCon: The AI Productivity Paradox (InfoQ) — Generating code faster often makes deployment bottlenecks more expensive. DORA data meets AI hype.
New Tools & Repos
- Lat.md — Markdown · 83 HN pts — Agent Lattice: a knowledge graph for your codebase, designed for AI coding agents.
- Miasma — 304 HN pts — Tool to trap AI web scrapers in an endless poison pit. Adversarial response to AI crawlers.
- Google Developer Knowledge API — Public preview — MCP server for accessing Google’s documentation corpus in real time.
- Google Data Commons MCP — Hosted MCP service for public data queries on GCP.
- FunctionGemma — 270M params — On-device function calling model for Android and iOS.
Papers
- Learning to Commit: Online Repository Memory — Mo Li et al. — Agent PRs get rejected for lack of “organicity” — violating project conventions, not functional bugs. Supervised contrastive reflection on historical commits teaches agents the project’s change patterns.
- Ask or Assume: Uncertainty-Aware Clarification — N. Edwards, S. Schuster — Multi-agent scaffold decouples underspecification detection from code execution. OpenHands + Claude Sonnet 4.5 achieves 69.4% vs 61.2% single-agent on underspecified SWE-bench.
- ALARA for Agents: Least-Privilege Context Engineering — C. Agostino, N. D’Souza — Radiation safety’s ALARA principle applied to agent context: minimize exposure to what’s “reasonably achievable.” Portable, composable multi-agent teams.
- ManagerWorker: Can AI Models Direct Each Other? — Rui Liu — Strong manager + weak worker (62%) matches strong single agent (60%) at a fraction of cost. Weak manager + weak worker (42%) performs worse than weak alone (44%).
- RACE-bench: Reasoning-Augmented Code Agent Evaluation — S. Liu et al. — 528 real-world feature additions with structured reasoning ground truth. Measures both patch correctness and reasoning quality.
- Context Engineering via Digital-Twin MDP — X. Yang et al. (IBM Research) — RL-guided context engineering using digital twin MDPs for enterprise AI agents.
- OPENDEV: Terminal-Native Coding Agent — N. Bui — Dual-agent architecture (planning/execution), lazy tool discovery, adaptive context compaction, automated memory. Open-source, written in Rust.
- Multi-Agent Orchestration Benchmarking — S. Kulkarni, Y. Kulkarni — Compares sequential pipeline, parallel fan-out, hierarchical supervisor-worker, and reflexive self-correcting architectures across five frontier LLMs on 10K SEC filings.
- The Controllability Trap — ICLR 2026 Workshop — Challenges binary “human-in-the-loop or not” with a Control Quality Score (CQS) — a continuous metric for graduated governance.
Ecosystem Watch
- Anthropic: Claude Auto Mode + Computer Use + Dispatch — Three launches in two days: auto mode (autonomous permissions), computer use (desktop control), and Dispatch (persistent mobile-to-desktop threads). The largest Claude Code feature drop to date.
- OpenAI Responses API Expansion — Shell tool, agent loop, hosted container workspace, context compaction, SKILL.md manifests. The Responses API is now an agent platform.
- Dreamer Acqui-Hired by Meta Superintelligence Labs — Hugo Barra’s Agent OS team hired roughly two months after launch. Meta continues assembling an AI agent ecosystem following the Manus acquisition.
- GitHub Copilot Data Policy — Copilot Free/Pro/Pro+ interaction data used for training from April 24 unless users opt out.
- CrewAI 1.12.x — Agent skills, hierarchical memory isolation, RBAC permissions matrix, Qdrant Edge storage.
- Google ADK Integrations Ecosystem — GitHub, Notion, Hugging Face and more third-party integrations added.
The Long View
The Case for Slowing Down
Mario Zechner, creator of the Pi agent framework used by OpenClaw, wrote something this week that cuts against the prevailing narrative: “A human cannot shit out 20,000 lines of code in a few hours. With an orchestrated army of agents, there is no bottleneck, no human pain. These tiny little harmless booboos suddenly compound at a rate that’s unsustainable.”
He’s not alone. Nicole Forsgren presented DORA data at QCon showing that generating code faster with AI often makes deployment bottlenecks more expensive — the AI productivity paradox. A practitioner blog post cataloguing “uncomfortable truths” about coding agents drew 101 comments on Hacker News. And the “Learning to Commit” paper found that agents get rejected not for writing broken code, but for ignoring how the project actually works.
There’s a pattern worth naming: velocity without alignment is negative productivity. Agents that generate code faster than teams can review it, that ignore project conventions, that compound small errors across thousands of lines — these agents aren’t accelerating development, they’re creating a review debt that someone has to pay.
The Controllability Trap paper from ICLR 2026 offers a useful reframe: governance isn’t binary. It’s not “human in the loop or not” — it’s a continuous variable that degrades in real time as systems become harder to steer. The paper’s Control Quality Score measures that degradation, enabling graduated responses before control is lost entirely.
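Mechanically, graduated governance is a continuous score mapped to escalating interventions rather than a single human-in-the-loop bit. A sketch of that shape; the thresholds and tier names are invented for illustration, and the paper's actual CQS definition is not reproduced here.

```python
# Sketch of graduated governance keyed to a continuous control score.
# Thresholds and tiers are invented for illustration; the point is
# the shape: responses escalate as control quality degrades, instead
# of flipping one binary human-in-the-loop switch.

def intervention(cqs: float) -> str:
    """Map a control-quality score in [0, 1] to a response tier."""
    if cqs >= 0.8:
        return "autonomous"          # full speed, log only
    if cqs >= 0.5:
        return "review_sampling"     # human spot-checks a fraction
    if cqs >= 0.2:
        return "approval_required"   # every action gated
    return "halt"                    # stop and hand control back
```

The score would be recomputed continuously as the agent runs, which is what enables a graduated response before control is lost entirely.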
This isn’t an anti-agent argument. It’s an argument for what Anthropic is attempting with auto mode’s scope classifier, what Declarative Architecture encodes in machine-readable constraints, and what the ALARA paper formalizes as least-privilege context. The fastest path isn’t always the straightest one — and the agents that win will be the ones that know when to pause.
The Artificer’s Grimoire is a curated intelligence feed from Artificer Digital. Built by practitioners, for practitioners.