The Artificer's Grimoire

Scout: Enterprise Coding Agent Architectures

Summary

QCon London 2026 marked a turning point: three major enterprises — Stripe, Spotify, and HubSpot — independently presented production coding agent systems, each solving different problems (general-purpose code generation, large-scale code migration, and automated code review) yet converging on strikingly similar architectural patterns. The convergence is not coincidental. All three systems use a “harness” architecture that wraps LLMs in deterministic scaffolding — blueprints or workflow engines that alternate between unconstrained agent reasoning and rigid, codified steps. All three enforce human review as an architectural boundary, not a cultural practice. And all three discovered that the system surrounding the model matters more than the model itself — Stripe explicitly calls this out: “the walls matter more than the model.” For practitioners building internal coding agents, these three systems provide the clearest available blueprint for what works at enterprise scale.

Key Findings

1. The Harness Pattern: Deterministic Scaffolding Around Agentic Loops

The most significant architectural convergence across all three systems is what the ecosystem now calls the “agent harness” — a structured workflow engine that wraps LLM reasoning in deterministic scaffolding.

Stripe’s Blueprints are the most explicit implementation. A blueprint is a sequence of nodes where some run deterministic code and others run agentic loops. The “implement the feature” step gets the full agentic loop with tools and freedom. The “run linters” step is hardcoded. The “push the branch” step is hardcoded. This separation saves tokens, reduces errors, and guarantees that critical steps execute every time. When a deterministic node returns a failure — test failures after code generation, for example — the blueprint feeds that failure back into an agentic node for interpretation and retry, creating bounded retry loops (capped at two CI rounds).
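
Stripe has not published Blueprint code, so the sketch below is only an illustration of the control flow described above; every name in it is hypothetical. The point it captures: agentic and deterministic nodes alternate, and the retry loop is bounded by the harness, not the model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    ok: bool
    output: str

MAX_CI_ROUNDS = 2  # the cap Stripe describes: two CI rounds, then escalate

def run_blueprint(
    task: str,
    implement: Callable[[str], None],   # agentic node: full tool access
    run_ci: Callable[[], StepResult],   # deterministic node: always runs
    push_branch: Callable[[], None],    # deterministic node: hardcoded
) -> None:
    implement(task)
    for attempt in range(1, MAX_CI_ROUNDS + 1):
        ci = run_ci()
        if ci.ok:
            push_branch()  # guaranteed to execute on every success path
            return
        if attempt < MAX_CI_ROUNDS:
            # Failure output feeds back into an agentic node for
            # interpretation and one more bounded attempt.
            implement(f"{task}\n\nCI failed:\n{ci.output}\nDiagnose and fix.")
    raise RuntimeError("CI still failing after bounded retries; escalate to a human")
```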

Spotify’s Honk follows the same pattern through its Fleet Management framework, which has existed since 2022 for applying code transformations at scale. The framework still handles the deterministic scaffolding — targeting repositories, opening pull requests, getting reviews, merging into production — but the code transformation step itself is now an agent that takes instructions from a prompt. Verification loops guide the agent toward desired results, with the LLM-as-judge component catching approximately 25% of problematic changes before they reach human reviewers.

HubSpot’s Sidekick implements the pattern through their Aviator framework — an internal Java-based agent framework that provides structured tool abstractions via RPC. The judge agent acts as a mandatory quality gate between the review agent’s output and what actually gets posted to GitHub, evaluating every comment against three criteria: succinctness, accuracy, and actionability.

The pattern is clear: none of these companies let the LLM run the system. The system runs the LLM. The harness decides when the agent reasons, what tools it can access, and what happens when it fails.

2. Constrained Tooling via MCP and Internal Abstractions

All three systems provide carefully curated tool access rather than giving agents broad capabilities.

Stripe built “Toolshed,” a centralized MCP server containing approximately 500 custom tools for fetching internal documentation, ticket details, build statuses, and code search results. Each Minion pulls from a curated slice of this toolset — not the full 500 tools. This deliberate scoping is an architectural choice: agents get exactly the tools they need for their task class, no more.
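
In code, the scoping amounts to a lookup from task class to an allowed subset of the catalog. The sketch below is a hypothetical illustration; tool names and task classes are invented, and Stripe has not published Toolshed's interface.

```python
# Hypothetical stand-in for a large central tool catalog.
TOOLSHED = {
    "search_code": ...,
    "fetch_ticket": ...,
    "get_build_status": ...,
    "read_internal_docs": ...,
    # ... roughly 500 tools in the full catalog
}

# Each task class gets a curated slice, never the whole catalog.
TASK_CLASS_TOOLS = {
    "dependency_upgrade": ["search_code", "get_build_status"],
    "config_change": ["search_code", "read_internal_docs"],
}

def tools_for(task_class: str) -> dict:
    """Hand the Minion only the curated slice for its task class."""
    return {name: TOOLSHED[name] for name in TASK_CLASS_TOOLS[task_class]}
```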

Spotify built a small internal CLI that delegates prompt execution to an agent, runs custom formatting and linting via local MCP, evaluates diffs using LLM-as-judge, uploads logs to GCP, and captures traces in MLflow. This CLI abstraction allows them to seamlessly switch between agents and LLMs — they experimented with Goose and Aider before settling on Claude Code via the Claude Agent SDK.
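
The load-bearing abstraction here is a single agent interface that the rest of the pipeline codes against. A minimal sketch of that idea, with hypothetical names:

```python
from typing import Protocol

class CodingAgent(Protocol):
    """The one interface the harness depends on; name is hypothetical."""

    def run(self, prompt: str, workspace: str) -> str:
        """Apply the prompted transformation in `workspace`; return the diff."""
        ...

def execute_migration(agent: CodingAgent, prompt: str, workspace: str) -> str:
    diff = agent.run(prompt, workspace)
    # Downstream steps (local-MCP lint, LLM-as-judge, log upload, tracing)
    # consume only the diff, so Goose, Aider, or Claude Code can be swapped
    # in behind this interface without touching anything else.
    return diff
```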

HubSpot uses RPC-based tool abstractions native to their Java stack. Aviator agents retrieve repository context — configuration settings, coding conventions, file structure — through structured tool calls rather than raw file access. This gives HubSpot control over what context the review agent sees and how it accesses the codebase.

The convergence on MCP (or MCP-like abstractions) is significant. These companies aren’t giving agents shell access and hoping for the best. They’re building curated tool catalogs that encode institutional knowledge about how their codebases work.

3. Isolated Execution Environments

All three systems run agents in sandboxed environments, but with different isolation strategies matched to their risk profiles.

Stripe runs Minions on pre-warmed “devboxes” — development environments identical to what human engineers use, but walled off from production and the internet. These spin up in 10 seconds, providing fast iteration without security exposure. The choice to use the same environment as human developers is deliberate: it means agent-generated code encounters the same toolchain, linters, and test infrastructure that human code does.

Spotify leverages its existing Fleet Management infrastructure, which already handles isolated execution across thousands of repositories. The agent operates within the same sandboxing boundaries that deterministic code transformations used before.

HubSpot embeds Sidekick directly into services via Aviator, running within the existing HubSpot infrastructure rather than spinning up separate execution environments. Because Sidekick is a review agent (read-only with respect to production code), the isolation requirements are less stringent than for code-generation agents.

4. Multi-Layered Verification Before Human Review

None of these systems submit raw LLM output for human review. Each implements multiple verification layers.

Stripe’s three-layer verification:

  1. Deterministic checks — linters, formatters, type checkers (hardcoded in blueprints)
  2. CI/CD pipeline — automated tests, static analysis (bounded to two retry rounds)
  3. Human code review — engineers review the PR like any human-written code

Spotify’s verification stack:

  1. Custom formatting and linting via local MCP tools
  2. LLM-as-judge evaluating the diff against the original prompt (vetoes ~25% of changes)
  3. Agent self-correction (successful in ~50% of vetoed cases)
  4. Human review via standard Fleet Management merge workflow

HubSpot’s two-stage review:

  1. Primary review agent analyzes the PR and generates comments
  2. Judge agent evaluates each comment for succinctness, accuracy, and actionability
  3. Only comments passing the judge gate are posted to GitHub

The LLM-as-judge pattern appears in both Spotify and HubSpot, applied differently but serving the same function: filtering agent output before it reaches humans, reducing noise, and improving signal quality.
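
A minimal sketch of such a judge gate, assuming a generic `call_llm` completion function and an invented rubric prompt (neither company has published theirs):

```python
import json

# The rubric mirrors HubSpot's three criteria; the prompt wording and the
# `call_llm` callable are placeholders, not either company's implementation.
JUDGE_PROMPT = """Evaluate the code-review comment below against the diff.
Reply with JSON only:
{{"succinct": true/false, "accurate": true/false, "actionable": true/false}}

Diff:
{diff}

Comment:
{comment}
"""

def passes_judge(call_llm, diff: str, comment: str) -> bool:
    verdict = json.loads(call_llm(JUDGE_PROMPT.format(diff=diff, comment=comment)))
    return all(verdict[key] for key in ("succinct", "accurate", "actionable"))

def gate_comments(call_llm, diff: str, comments: list[str]) -> list[str]:
    # Only comments that clear the gate are ever posted for humans to read.
    return [c for c in comments if passes_judge(call_llm, diff, c)]
```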

5. Task Scoping: Well-Bounded Problems, Not Open-Ended Coding

All three companies constrain what their agents do, and their constraint strategies reveal how each team thinks about scaling trust.

Stripe focuses Minions on well-defined tasks: configuration adjustments, dependency upgrades, minor refactoring, and feature implementations triggered via Slack. The “one-shot” framing is key — each Minion handles a single, bounded task from instruction to pull request. Engineers noted that Minions perform best on tasks with clear specifications and deterministic success criteria.

Spotify scopes Honk exclusively to code migrations — replacing deprecated APIs, updating library versions, applying codemod-style transformations across thousands of repositories. This is a deliberately narrow task class where the desired end state can be precisely specified in a prompt and verified via compilation and tests.

HubSpot constrains Sidekick to code review — an inherently bounded task where the input (a PR diff) and output (review comments) are well-defined. The judge agent further constrains output to comments that are actionable and accurate.

The lesson: all three companies started with constrained, well-bounded task classes rather than trying to build a general-purpose coding agent. The task boundaries are part of the architecture.

6. Human Review as Architectural Boundary, Not Cultural Practice

A critical distinction in all three systems: human review is not a recommended practice — it is an architectural constraint enforced by the system.

Stripe’s Minions have “submission authority” but not “merge authority.” The system can create pull requests but cannot merge them. This is a system-level policy, not a team convention.
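
One way to enforce this in a harness is to make the merge operation simply absent from the surface the agent can reach. A hypothetical sketch:

```python
class AgentGitHubSurface:
    """Hypothetical: the only GitHub operations exposed to a coding agent."""

    def __init__(self, scoped_token: str):
        # Token scoped to branch pushes and PR creation only.
        self._token = scoped_token

    def push_branch(self, repo: str, branch: str) -> None:
        ...

    def open_pull_request(self, repo: str, branch: str, title: str) -> str:
        """Create the PR and return its URL; intentionally the last action."""
        ...

    # Deliberately absent: merge_pull_request(), approve_review(),
    # dismiss_review(). Merging happens only through the normal human path.
```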

Spotify’s Honk operates within Fleet Management’s existing merge workflow, which already requires human approval. The agent produces a PR; the merge path is unchanged from human-authored code.

HubSpot’s Sidekick produces review comments, not code changes. The human engineer retains full authority over whether to act on the feedback.

This is perhaps the most important pattern for practitioners to internalize: these enterprises didn’t build trust by hoping engineers would review agent output. They built systems in which agents cannot bypass human review, by construction.

7. Multi-Model Strategies Are Emerging

The three systems take different approaches to model selection, reflecting different maturity levels.

Stripe uses LLMs through their blueprint architecture but doesn’t publicly specify which models power Minions. The architectural insight is that their blueprint system is model-agnostic — the deterministic scaffolding works regardless of which LLM sits in the agentic nodes.

Spotify explicitly identifies Claude Code (via the Claude Agent SDK) as their top-performing agent after experimenting with Goose, Aider, and a custom agentic loop. Their CLI abstraction allows model swapping, but they’ve converged on a single primary model.

HubSpot has gone furthest on multi-model support. Aviator provides first-class support for Claude, GPT, and Gemini, allowing the team to “experiment more freely and quickly fail over in the case of provider downtime.” This is multi-model as resilience strategy, not just capability optimization.
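
Failover of this kind reduces to trying providers in preference order until one succeeds. A minimal sketch, with the provider names from the article but everything else hypothetical:

```python
# Client objects and their `complete` method are placeholders standing in
# for each provider's SDK call.
PREFERENCE_ORDER = ["claude", "gpt", "gemini"]

def complete_with_failover(clients: dict, prompt: str) -> str:
    last_error: Exception | None = None
    for name in PREFERENCE_ORDER:
        try:
            return clients[name].complete(prompt)
        except Exception as err:  # provider downtime, rate limiting, etc.
            last_error = err      # fall through to the next provider
    raise RuntimeError("all model providers failed") from last_error
```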

8. The “Harness Engineering” Discipline Is Crystallizing

The convergence across Stripe, Spotify, and HubSpot — along with similar patterns at Ramp (Inspect), Coinbase (Cloudbot), Shopify, and Airbnb — has given rise to what practitioners now call “harness engineering.” LangChain’s Open SWE framework, released the same week as QCon London, explicitly codifies these patterns: isolated sandboxes, curated tools (~15 in Open SWE vs. Stripe’s 500), Slack-first invocation, multi-agent architecture (Manager, Planner, Programmer, Reviewer), and pluggable sandbox providers.

The emergence of Open SWE as an open-source reference implementation is significant because it means the architectural patterns validated by Stripe, Spotify, and HubSpot are now accessible to teams without the resources to build from scratch. The core components — sandbox isolation, tool curation, workflow orchestration, verification loops, and human review gates — are becoming standardized.
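
For orientation, the four-role handoff reduces to a linear pipeline with a human gate after the reviewer. The sketch below illustrates the shape only; it is not Open SWE's actual API.

```python
from typing import Callable

def run_agent_pipeline(
    request: str,
    manager: Callable[[str], str],     # triages and scopes the incoming request
    planner: Callable[[str], str],     # produces an explicit step-by-step plan
    programmer: Callable[[str], str],  # implements the plan in an isolated sandbox
    reviewer: Callable[[str], str],    # checks the diff before the human gate
) -> str:
    task = manager(request)
    plan = planner(task)
    diff = programmer(plan)
    return reviewer(diff)  # output still lands in a PR for human review
```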

Practical Implications

For Teams Starting to Build Internal Coding Agents

  1. Start with the harness, not the model. The single clearest lesson from all three systems: invest in the scaffolding (workflow engine, sandbox, tool catalog, verification pipeline) before optimizing model selection. Stripe’s insight — “the walls matter more than the model” — is validated by independent convergence across all three companies.

  2. Pick a bounded task class first. Do not attempt to build a general-purpose coding agent. Stripe started with configuration changes and dependency upgrades. Spotify started with code migrations. HubSpot started with code review. Each chose a task class with clear specifications, deterministic success criteria, and existing verification infrastructure (tests, linters, CI).

  3. Build on existing infrastructure. Spotify built Honk on top of Fleet Management (a 2022-era codemod framework). HubSpot built Sidekick on Aviator (their internal Java agent framework). Stripe built Minions on their existing devbox infrastructure. None started from scratch — they bolted agent capabilities onto proven systems.

  4. Evaluate Open SWE as a starting point. LangChain’s Open SWE provides the core architectural components (sandbox, tools, workflow, review) in an open-source package. For teams without Stripe-scale infrastructure, this is the fastest path to a working internal coding agent — but expect to customize the tool catalog and verification pipeline for your codebase.

For Teams Already Running Coding Agents

  1. Add an LLM-as-judge verification layer. Both Spotify and HubSpot independently converged on using a secondary LLM to evaluate agent output before human review. Spotify’s judge catches 25% of problematic changes; HubSpot’s filters low-value review comments. This is cheap insurance that meaningfully reduces noise for human reviewers.

  2. Implement bounded retry with hard caps. Stripe caps CI retries at two. If the agent can’t fix a failure in two attempts, a third attempt won’t help — escalate to a human. This prevents runaway token consumption and infinite loops. Adopt a similar cap for your agent workflows.

  3. Invest in multi-model failover. HubSpot’s Aviator framework treats multi-model support as a reliability feature, not a capability feature. Provider outages are inevitable. Design your harness to swap models without architectural changes.

What to Avoid

  1. Don’t give agents merge authority. None of these three enterprise systems allow agents to merge their own PRs. Submission authority (creating PRs) is the architectural boundary. If your team is debating auto-merge for agent PRs, the signal from three independent enterprise systems is clear: don’t.

  2. Don’t give agents broad tool access. Stripe curates a specific tool slice per task class from 500 available tools. Giving an agent access to everything is not a feature — it’s a liability. Scope tool access to the minimum required for each task type.

  3. Don’t underinvest in context engineering. Spotify’s three-part blog series on Honk dedicates an entire post to context engineering. Their key findings: describe the end state rather than prescribing steps, provide concrete code examples, clearly state when not to take action, and give the agent a verifiable goal. The prompt is infrastructure, not an afterthought.
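
Spotify has not published Honk's prompts, but their four findings translate directly into a prompt skeleton. The example below is entirely hypothetical (invented class names, invented test command), shown only to make the structure concrete:

```python
# Hypothetical migration prompt skeleton reflecting Spotify's published
# guidance; the specifics are invented, not taken from Honk.
MIGRATION_PROMPT = """\
End state (describe the goal, not the steps):
  Every use of LegacyHttpClient is replaced by HttpClientV2 and the
  module builds and passes its tests.

Concrete example of the desired change:
  before: client = LegacyHttpClient(timeout=30)
  after:  client = HttpClientV2(timeout_seconds=30)

When NOT to act:
  Skip files that already use HttpClientV2 and skip generated code.
  If neither client appears in the repository, make no changes at all.

Verifiable goal:
  `./gradlew test` must pass in every module you touch.
"""
```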

Open Questions

  1. How far can task boundaries expand? All three systems started with bounded task classes. Stripe hints at broader capabilities, but the published evidence covers configuration, migrations, and review. Can the harness pattern scale to open-ended feature development, or does it fundamentally require well-specified tasks?

  2. What is the right organizational model for harness engineering? Is harness engineering a platform team function, a developer experience function, or something new? The three companies don’t publicly describe their team structures for maintaining these systems, but the skill set — workflow design, tool curation, verification engineering, prompt engineering — doesn’t map cleanly to existing roles.

  3. How do these systems handle cross-cutting changes? Spotify’s Honk handles migrations across thousands of repositories, but each PR targets a single repo. What about changes that require coordinated modifications across multiple services? The harness pattern as described doesn’t obviously address multi-repo coordination beyond Spotify’s Fleet Management approach.

  4. What happens when agent-generated code density reaches critical mass? Stripe merges 1,300+ agent-written PRs per week. At what point does agent-generated code become the majority of the codebase, and what implications does that have for maintainability, debugging, and institutional knowledge?

  5. Will the harness pattern standardize or fragment? Open SWE codifies one version of the pattern. But Stripe’s blueprint architecture, Spotify’s Fleet Management integration, and HubSpot’s Aviator framework are deeply specific to their stacks. Is the harness pattern genuinely portable, or will every enterprise build a bespoke version?

Sources

  1. Stripe Engineers Deploy Minions, Autonomous Agents Producing Thousands of Pull Requests Weekly — InfoQ / QCon London 2026
  2. QCon London 2026: Rewriting All of Spotify’s Code Base, All the Time — InfoQ / QCon London 2026
  3. HubSpot’s Sidekick: Multi-Model AI Code Review with 90% Faster Feedback and 80% Engineer Approval — InfoQ / QCon London 2026
  4. Minions: Stripe’s one-shot, end-to-end coding agents — Stripe Engineering Blog
  5. Minions: Stripe’s one-shot, end-to-end coding agents — Part 2 — Stripe Engineering Blog
  6. 1,500+ PRs Later: Spotify’s Journey with Our Background Coding Agent (Honk, Part 1) — Spotify Engineering
  7. Background Coding Agents: Context Engineering (Honk, Part 2) — Spotify Engineering
  8. Background Coding Agents: Predictable Results Through Strong Feedback Loops (Honk, Part 3) — Spotify Engineering
  9. Automated Code Review: The 6-Month Evolution — HubSpot Product Blog
  10. How Stripe’s Minions Ship 1,300 PRs a Week — ByteByteGo
  11. What Is Stripe Minions’ Blueprint Architecture? — MindStudio
  12. Stripe’s coding agents: the walls matter more than the model — Anup Jadhav
  13. Deconstructing Stripe’s ‘Minions’: One-Shot Agents at Scale — SitePoint
  14. Spotify says its best developers haven’t written a line of code since December — TechCrunch
  15. Spotify cuts migration time by 90% with Claude Agent SDK — Anthropic
  16. Agentic Coding: Spotify’s Lessons Learned — Dawn Liphardt
  17. What Is an AI Coding Agent Harness? How Stripe, Shopify, and Airbnb Build Reliable AI Workflows — MindStudio
  18. The Anatomy of an Agent Harness — LangChain
  19. Open SWE: An Open-Source Framework for Internal Coding Agents — LangChain
  20. Skill Issue: Harness Engineering for Coding Agents — HumanLayer
  21. The Emerging “Harness Engineering” Playbook — Ignorance.ai
  22. QCon London AI Coding State of the Game — InfoQ
  23. QCon London 2026: Context Engineering: Building the Knowledge Engine AI Agents Need — QCon
  24. Google’s Eight Essential Multi-Agent Design Patterns — InfoQ
  25. 2026 Agentic Coding Trends Report — Anthropic
  26. Choosing the Right Multi-Agent Architecture — LangChain
  27. Harness Engineering: The Complete Guide — NxCode