Summary
As coding agents move from prototype to production, a new discipline is crystallizing around the work that remains distinctly human: directing agents, evaluating their output, and correcting what they get wrong. Annie Vella’s research, including a survey of 165 professional engineers, names this “supervisory engineering work” and locates it in a “middle loop” between the traditional inner loop (write/test/debug) and outer loop (commit/review/deploy). Kief Morris frames the same shift as moving “on the loop” — designing the specifications, tests, and feedback mechanisms that govern agent behavior rather than reviewing every artifact directly. The 2025 DORA report provides the empirical counterweight: AI adoption correlates with higher throughput but also higher instability, and the difference between amplification and chaos comes down to seven organizational capabilities that are all team-level, not individual. The convergence is clear — the engineering profession is not being replaced, but it is being restructured around supervision, specification, and verification.
Key Findings
1. The Middle Loop: A New Category of Engineering Work
Annie Vella’s research, conducted as part of her Master’s in Software Engineering at the University of Auckland, identifies a structural gap in how we describe software engineering work. The traditional model has two loops: the inner loop (write code, build, run, test, debug) and the outer loop (commit, code review, CI/CD, deploy, monitor). AI is automating the inner loop at speed. But someone still has to direct that work, evaluate the output, and correct what’s wrong.
Vella names this the middle loop — a new layer where engineers supervise AI doing what they used to do by hand. The work breaks into three activities:
- Directing: specifying intent, crafting prompts, managing context, codifying standards into reusable agent instructions and skills, and iterating when output misses the mark.
- Evaluating: reading AI-generated output and deciding what to accept, modify, or reject.
- Correcting: fixing errors, integrating output into existing code, and maintaining consistency across the codebase.
The amount of supervisory work an engineer does depends on two variables: trust in the AI’s output and criticality of the work. Vella draws on the concept of “calibrated trust” — trusting a system exactly as much as it deserves. Over-trusting leads to accepting hallucinations as truth; under-trusting leads to disuse. Finding the balance is a skill built through repeated interaction, not through policy alone.
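To make the trade-off concrete, here is a minimal sketch of how trust and criticality might jointly determine review depth. The thresholds and categories are illustrative assumptions, not values from Vella’s research.

```python
# A minimal sketch of the two-variable supervision trade-off.
# Thresholds and labels are illustrative assumptions, not research values.

def review_depth(trust: float, criticality: float) -> str:
    """Map calibrated trust (0..1) and criticality (0..1) to a review level."""
    if criticality > 0.8:
        return "line-by-line review"      # high stakes: inspect everything
    if trust < 0.4:
        return "full evaluation"          # low trust: read all output closely
    if trust < 0.7 or criticality > 0.5:
        return "spot-check plus tests"    # partial trust: sample and verify
    return "automated gates only"         # calibrated trust, earned over time

print(review_depth(trust=0.9, criticality=0.3))  # automated gates only
print(review_depth(trust=0.9, criticality=0.9))  # line-by-line review
```

The point of the sketch is that supervision effort is a function, not a constant: as either variable moves, the appropriate middle-loop work changes with it.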
Her survey of 165 engineers across 28 countries found that 78% feel they are spending less time writing code. The time hasn’t disappeared — it has migrated to the middle loop.
Sources: Annie Vella, “The Middle Loop” · Martin Fowler, Fragments: March 16, 2026
2. On the Loop, Not Out of the Loop
Kief Morris (Thoughtworks) proposes three models for how humans interact with AI systems in software engineering:
- In the loop: the developer reviews each AI output before it takes effect. This is how most people use Copilot today — suggest, accept/reject, continue.
- Out of the loop: the system operates autonomously. Humans intervene only on failure.
- On the loop: humans design and maintain the mechanisms that guide and validate the system’s behavior — specifications, tests, review criteria, governance policies — but don’t inspect every generated artifact.
Morris argues that software engineering is particularly well-suited to the “on the loop” model because it already has verification infrastructure: type systems, test suites, linters, CI pipelines. The engineering work shifts from writing code to designing the feedback loop itself — what tests exist, what specifications guide the agent, what review criteria catch the things tests miss.
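A minimal sketch of what that looks like in practice: the engineer’s durable artifact is the gate definition itself, and a human is consulted only when the gates fail. The specific tools (mypy, pytest, ruff) are common choices assumed here for illustration, not prescribed by Morris.

```python
# "On the loop" as code: the engineer designs the gates; no one reviews
# every artifact. Tool choices below are assumptions for illustration.
import subprocess

GATES = [
    ["mypy", "src"],            # type system catches interface drift
    ["pytest", "-q"],           # test suite encodes behavioral intent
    ["ruff", "check", "src"],   # linter enforces codified standards
]

def run_gates() -> list[str]:
    """Run every gate; return the commands that failed."""
    failures = []
    for cmd in GATES:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(" ".join(cmd))
    return failures

if __name__ == "__main__":
    failed = run_gates()
    # A human is pulled in only when the designed checks cannot decide.
    print("escalate to human review" if failed else "auto-accept", failed)
```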
This framing is complementary to Vella’s middle loop. Directing, evaluating, and correcting are the activities; on-the-loop is the operational stance. The engineer’s value lies not in inspecting every line but in designing the system that makes inspection unnecessary for most changes.
Sources: Kief Morris, “Humans and Agents in Software Engineering Loops” · InfoQ coverage
3. The DORA Data: AI Amplifies, It Doesn’t Fix
The 2025 DORA State of AI-Assisted Software Development report — based on survey data from thousands of technology professionals — delivers the most rigorous empirical picture of AI’s impact on delivery performance. The headline findings:
- 90% of developers now report using some form of AI assistance.
- AI adoption positively correlates with throughput — teams ship faster.
- AI adoption negatively correlates with stability — more change failures, more rework, longer cycle times to resolve issues.
- Only 24% of respondents trust AI “a great deal” or “a lot”; 30% trust it “a little” or “not at all.”
The central insight: AI is an amplifier, not a fix. Strong teams with mature practices use AI to become faster and more consistent. Struggling teams with fragmented processes find that AI magnifies their dysfunction — more code ships, but more of it breaks.
The report introduces the DORA AI Capabilities Model, identifying seven capabilities that amplify AI’s positive impact:
- Clear AI stance (organizational policy on where and how AI is used)
- Healthy data ecosystems
- AI-accessible internal data
- Strong version control practices
- Working in small batches
- User-centric focus
- Quality internal platforms
Every one of these is a team-level or organization-level capability. None of them are individual developer skills. This is the critical insight for team design: supervisory engineering isn’t just about upskilling individuals — it’s about building organizational infrastructure that makes AI-assisted work stable.
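As a lightweight way to operationalize this, a team could score itself against the seven capabilities and focus on the weakest one. The 1-to-5 scale and the weakest-link summary below are illustrative assumptions, not part of the DORA model itself.

```python
# A toy self-assessment against the seven DORA AI capabilities.
# The scoring scheme is an illustrative assumption, not DORA's method.

CAPABILITIES = [
    "Clear AI stance",
    "Healthy data ecosystems",
    "AI-accessible internal data",
    "Strong version control practices",
    "Working in small batches",
    "User-centric focus",
    "Quality internal platforms",
]

def health_check(scores: dict[str, int]) -> str:
    """Report the weakest capability on a 1-5 self-assessment."""
    weakest = min(CAPABILITIES, key=lambda c: scores.get(c, 1))
    return f"weakest capability: {weakest} ({scores.get(weakest, 1)}/5)"

# Example: a team that ships fast but has no stated policy on AI use.
print(health_check({c: 4 for c in CAPABILITIES} | {"Clear AI stance": 1}))
```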
Sources: DORA 2025 Report · InfoQ coverage · IT Revolution analysis · Google Cloud Blog announcement
4. The Skills That Define Supervisory Engineering
Synthesizing across all sources, the skills required for effective supervisory engineering cluster into four domains:
Specification craft. The ability to write clear, complete, machine-interpretable specifications. As the SDLC shifts pressure from code generation to everything around it — design, requirements, testing — the bottleneck moves to how well intent is captured. Vella notes that “requirements need to be clearer, design decisions and intent need to be explicit and recorded.” This is context engineering applied to team process.
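A minimal sketch of what “machine-interpretable” might mean in practice: intent, constraints, and acceptance criteria captured as structured data an agent can execute against. The schema is hypothetical, not a published standard.

```python
# A hypothetical spec schema: intent, recorded design decisions, and
# verifiable acceptance criteria made explicit for an agent to work from.
from dataclasses import dataclass, field

@dataclass
class Spec:
    intent: str                                           # what and why, in one sentence
    constraints: list[str] = field(default_factory=list)  # explicit design decisions
    acceptance: list[str] = field(default_factory=list)   # how "done" is verified

spec = Spec(
    intent="Add rate limiting to the public search endpoint",
    constraints=[
        "Reuse the existing Redis connection pool",
        "No new external dependencies",
    ],
    acceptance=[
        "429 returned after 100 requests/minute per API key",
        "Existing search tests still pass",
    ],
)
```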
Verification design. Building the feedback loops that make on-the-loop supervision possible: test suites, type systems, linters, CI gates, review criteria, and governance policies. Addy Osmani describes this as a “virtuous cycle where the AI writes code, automated tools catch issues, the AI fixes them, with the human overseeing high-level direction.” The 10x skill for agent-era engineers is not prompt engineering — it is designing the verification harness.
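Here is a sketch of that virtuous cycle under stated assumptions: the `Agent` interface and the `run_gates` stub are hypothetical stand-ins for a real coding agent and a real verification harness.

```python
# The virtuous cycle as a bounded loop: the agent generates, automated
# checks catch issues, the agent repairs, and a human is consulted only
# when the loop stalls. Agent and run_gates are hypothetical stand-ins.
from typing import Protocol

class Agent(Protocol):
    def generate(self, task: str) -> str: ...
    def fix(self, code: str, failures: list[str]) -> str: ...

def run_gates(code: str) -> list[str]:
    """Stand-in for the real harness: type checks, tests, lint."""
    return []  # pretend all gates pass

class NeedsHumanReview(Exception):
    """Raised when automated verification cannot converge on its own."""

def supervised_generate(agent: Agent, task: str, max_attempts: int = 3) -> str:
    code = agent.generate(task)
    for _ in range(max_attempts):
        failures = run_gates(code)
        if not failures:
            return code                   # gates pass: no human in this loop
        code = agent.fix(code, failures)  # feed failures back to the agent
    raise NeedsHumanReview(task)          # bounded retries, then escalate
```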
Calibrated judgment. Reading code you didn’t write, evaluating architectural decisions against implicit constraints, catching regressions that pass tests but violate intent. These are traditionally senior engineering skills. As Vella observes, the irony is that AI automates the work junior engineers do to develop these skills — creating a potential skill-development gap.
System-level thinking. Understanding how components interact, where failure modes hide, and what the blast radius of a change looks like. Osmani’s “conductors to orchestrators” framing captures this: the engineer evolves from implementing to orchestrating, with the value lying in architecture, interfaces, and integration rather than line-by-line coding.
Sources: Annie Vella, “The SDLC Strikes Back” · Addy Osmani, “Conductors to Orchestrators” · Addy Osmani, “My LLM Coding Workflow Going Into 2026”
5. Team Design for Agent-Assisted Workflows
Several patterns are emerging for how teams restructure around AI agents:
The “Delegate, Review, Own” model. The engineer specifies intent (via spec, prompt, or task description), the agent generates output, automated verification catches obvious issues, and the engineer reviews what passes. Effort is front-loaded on specification and back-loaded on review, with minimal intervention in between. This is Osmani’s orchestrator pattern in practice: one engineer can manage more work in parallel than would be possible when writing everything by hand.
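A sketch of the pattern at the orchestration level, assuming a hypothetical `run_agent_task` dispatch function: several specs run in parallel, automated gates filter the results, and only passing output reaches the engineer’s review queue.

```python
# Delegate, Review, Own at the orchestration level. run_agent_task is a
# hypothetical stand-in for dispatching a spec to a coding agent.
from concurrent.futures import ThreadPoolExecutor

def run_agent_task(spec: str) -> tuple[str, bool]:
    """Hypothetical: send a spec to an agent, return (output, gates_passed)."""
    return f"diff for: {spec}", True

specs = [
    "Add pagination to /orders",
    "Migrate logging to structured JSON",
    "Backfill tests for the billing module",
]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_agent_task, specs))

review_queue = [out for out, passed in results if passed]      # human reviews these
rework_queue = [out for out, passed in results if not passed]  # redirect the agent
```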
The “Agent Supervisor” role. Deloitte’s research describes humans who “enter workflows at intentionally designed points to handle exceptions requiring their judgment.” This isn’t about checking every output — it’s about strategic handoffs at critical decision points. The analogy is a factory supervisor, not a line worker.
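One way to encode such handoff points, as a sketch: changes are routed to human judgment only when they cross a designed risk boundary. The trigger list is an illustrative assumption; a real team would derive its own from its threat model and customer-facing surface.

```python
# Intentionally designed handoff points: human judgment is invoked only
# when a change crosses a risk boundary. Triggers are illustrative.

HUMAN_REVIEW_TRIGGERS = (
    "auth", "payment", "migration",  # security- and data-sensitive paths
    "public_api",                    # customer-facing contracts
)

def needs_human(change_paths: list[str]) -> bool:
    """Route to a human if any touched path crosses a designed boundary."""
    return any(t in p for p in change_paths for t in HUMAN_REVIEW_TRIGGERS)

print(needs_human(["src/search/cache.py"]))  # False: agent-owned lane
print(needs_human(["src/auth/session.py"]))  # True: strategic handoff
```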
The junior engineer gap. Multiple sources flag this as the most concerning structural issue. Microsoft executives have publicly worried about AI eating entry-level coding jobs. Research shows a 17-point comprehension gap when juniors learn with AI assistance (50% code understanding vs. 67% without). Some organizations are rebranding the junior role as “AI Reliability Engineer” — responsible for specification writing and hallucination checks. But the deeper problem remains: if the inner loop is how engineers develop calibrated judgment, and AI automates the inner loop, how does the next generation of senior engineers emerge?
Blended team composition. Gartner projects that by 2028, 38% of organizations will have AI agents as formal team members. The practical implication: team sizing, sprint planning, and capacity models need to account for agent throughput alongside human throughput. A team of 4 engineers with coding agents has fundamentally different capacity characteristics than the same 4 engineers without them — but also different failure modes.
Sources: Deloitte, “The Agentic Reality Check” · Optimum Partners, “Engineering Management 2026” · The Register, Microsoft on entry-level jobs
6. The Identity Question
Vella names what many practitioners feel but few articulate: engineers became engineers because they found identity in building things. That identity is being challenged. The shift from creator to supervisor — from builder to overseer — feels, to many, “suspiciously like management.”
This isn’t a trivial concern. Identity shapes motivation, retention, and career satisfaction. If supervisory engineering work is experienced as a loss rather than an evolution, organizations will face retention problems among their most capable engineers — precisely the people whose judgment and system-level thinking are most needed for effective supervision.
Vella’s reframe is useful: she points to the emergence of terms like “product engineer” (engineers who keep the customer front and center) and “product builder” (someone who builds products in an AI-first world with agents, context, and orchestration). The new identity isn’t “manager of AI” — it’s “builder who uses AI as material.”
Sources: Annie Vella, “The Software Engineering Identity Crisis” · Annie Vella, “Finding Comfort in the Uncertainty” · Aviator podcast
Practical Implications
For teams building agent infrastructure:
- Invest in verification infrastructure before scaling agent usage. The DORA data is unambiguous: AI without strong feedback loops increases instability. Before deploying more agents, ensure your test suites, CI gates, and review criteria are robust enough to catch what agents get wrong. The verification harness is the product.
- Treat specification writing as a first-class engineering skill. Spec-driven development is no longer a methodology debate — it’s an operational requirement. Engineers who can write clear, complete specifications that agents can execute against are more valuable than engineers who can write code faster. Invest in training and practice.
- Design explicit handoff points, not blanket review. Don’t require human review of everything — that defeats the purpose of agents. Instead, identify the critical decision points where human judgment adds value (architectural decisions, security-sensitive changes, customer-facing behavior) and design the workflow to surface those specifically.
- Address the junior engineer pipeline now. The skill-development gap is real and compounding. Consider structured apprenticeship models where juniors work alongside agents but with deliberate inner-loop exercises — debugging agent-generated code, writing tests before agents generate implementations, conducting architectural reviews. The goal is to build calibrated judgment through practice, not just exposure.
- Adopt the DORA AI Capabilities Model as a team health check. Score your team against the seven capabilities. If you lack a clear AI stance, healthy data practices, or quality internal platforms, adding more AI will amplify dysfunction, not resolve it.
- Name the new work. Vella’s “middle loop” vocabulary gives teams a way to talk about supervisory work as legitimate engineering — not as overhead or management. Adopting this language helps engineers see their evolving role as growth rather than loss.
Open Questions
- How do you measure supervisory engineering performance? Traditional metrics (lines of code, commits, velocity points) don’t capture the value of specification quality, review thoroughness, or verification design. New metrics are needed but not yet established.
- What does a career ladder look like for supervisory engineers? If the inner loop was the training ground for calibrated judgment, what replaces it? No organization has published a mature career framework that accounts for agent-assisted workflows.
- Is the middle loop a permanent fixture or a transitional state? As agents improve, does the middle loop shrink — requiring less directing, evaluating, and correcting — or does it expand as agents take on more complex work that demands more sophisticated supervision?
- How do you build calibrated trust at the team level? Vella describes calibrated trust as an individual skill built through repeated interaction. But teams need shared calibration — agreed-upon standards for when to trust agent output and when to intervene. No established frameworks exist for this.
- What happens to organizations that skip the supervisory layer? The DORA data suggests they ship faster but break more. How bad does the instability problem get before it forces a correction, and what does that correction cost?
Sources
- Annie Vella, “The Middle Loop”
- Annie Vella, “The Software Engineering Identity Crisis”
- Annie Vella, “The SDLC Strikes Back: Adapting to AI-Driven Development”
- Annie Vella, “Finding Comfort in the Uncertainty”
- Aviator podcast, “How AI Is Redefining Software Engineering with Annie Vella”
- Martin Fowler, Fragments: March 16, 2026
- Kief Morris, “Humans and Agents in Software Engineering Loops” (Martin Fowler blog)
- InfoQ, “Where Do Humans Fit in AI-Assisted Software Development?”
- DORA 2025 State of AI-Assisted Software Development Report
- InfoQ, “AI Is Amplifying Software Engineering Performance, Says the 2025 DORA Report”
- IT Revolution, “AI’s Mirror Effect: How the 2025 DORA Report Reveals Your Organization’s True Capabilities”
- Google Cloud Blog, “Announcing the 2025 DORA Report”
- Addy Osmani, “Conductors to Orchestrators: The Future of Agentic Coding”
- Addy Osmani, “My LLM Coding Workflow Going Into 2026”
- Deloitte, “The Agentic Reality Check: Preparing for a Silicon-Based Workforce”
- Optimum Partners, “Engineering Management 2026: Structuring an AI-Native Team”
- The Register, “Microsoft execs worry AI will eat entry level coding jobs”
- Tech Lead Journal, “#223 - The Software Engineer Identity Crisis in the AI-Driven Future”
- Spotify Engineering, “Background Coding Agents: Predictable Results Through Strong Feedback Loops”