Summary
Prompt regression testing for AI agents has rapidly evolved from ad-hoc manual checks to a multi-layered discipline spanning three distinct approaches: evaluation-framework testing (promptfoo, DeepEval, LangSmith), specification-compiled testing (TDAD), and multi-model interference detection (Arbiter). TDAD and Arbiter, both published in March 2026, represent a step change — TDAD treats prompts as compiled artifacts with hidden test splits and semantic mutation testing, while Arbiter uses multi-model LLM analysis to find interference patterns that no single model can detect. The existing evaluation frameworks remain the most production-ready option, but they test outputs rather than prompt structure, leaving an entire class of architectural regressions invisible. Teams that want prompt change safety need all three layers: output-level regression suites in CI/CD, specification-compiled behavioral contracts, and periodic structural analysis of prompt architecture.
Key Findings
The Three Layers of Prompt Regression Testing
The landscape divides cleanly into three approaches, each catching a different class of regression:
Layer 1: Output-Level Evaluation (Established)
Tools like promptfoo, DeepEval, LangSmith, and LangWatch test what the agent produces. The pattern is straightforward: maintain a golden dataset, run each prompt version against it, score outputs via LLM-as-a-Judge or heuristic metrics, and fail the build if quality drops. This is the most mature layer — promptfoo alone is used by 25%+ of Fortune 500 companies and was acquired by OpenAI in March 2026 [1]. DeepEval offers 30+ evaluation metrics with pytest-style syntax [2]. LangWatch’s Scenario framework adds multi-turn agent simulation with domain-driven TDD, testing complete user journeys rather than single-turn outputs [3].
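To make the pattern concrete, here is a minimal sketch of an output-level regression test following DeepEval's documented pytest-style API [2]; the golden case, the threshold, and the `run_agent` helper are illustrative placeholders, not part of any cited tool.

```python
# Minimal output-level regression test in DeepEval's pytest style.
# The golden case, threshold, and run_agent() are illustrative placeholders.
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# Golden dataset sourced from production traffic (shortened to one case here).
GOLDEN_CASES = [
    {
        "input": "Cancel my subscription and tell me about refunds.",
        "expected": "Confirms cancellation and quotes the refund policy accurately.",
    },
]

correctness = GEval(
    name="Correctness",
    criteria="Does the actual output satisfy the expected behavior?",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,  # below this score the test (and the build) fails
)

def run_agent(user_input: str) -> str:
    """Placeholder: call your agent with the candidate prompt version."""
    raise NotImplementedError

@pytest.mark.parametrize("case", GOLDEN_CASES)
def test_prompt_regression(case):
    assert_test(
        LLMTestCase(
            input=case["input"],
            actual_output=run_agent(case["input"]),
            expected_output=case["expected"],
        ),
        [correctness],
    )
```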
The limitation: these tools test the behavioral surface, not the structural integrity of the prompt itself. A prompt can produce correct outputs on your test dataset while containing internal contradictions that manifest only under novel inputs.
Layer 2: Specification-Compiled Testing (Emerging — TDAD)
TDAD (Test-Driven AI Agent Definition) inverts the relationship between prompts and tests [4]. Instead of writing prompts and then testing them, engineers write behavioral specifications in YAML, and a coding agent (TestSmith) compiles those specs into executable test suites. A second agent (PromptSmith) then iteratively refines the prompt until tests pass. Three mechanisms combat specification gaming:
- Hidden test splits (30–60% of tests): Evaluation tests are withheld during compilation, measuring true generalization rather than overfitting to known test cases.
- Semantic mutation testing: A post-compilation agent (MutationSmith) generates plausible faulty prompt variants (e.g., SKIP_AUTH_GATE) and verifies the test suite detects them. Mutation scores of 86–100% across the benchmark confirm the test suites have real discriminative power.
- Spec evolution scenarios: When requirements change (v1→v2), the framework measures backward compatibility via SURS (Spec Update Regression Score), which hit 97% in trials — meaning prompt updates rarely break previously working behaviors.
Quantitative results across 24 trials on the SpecSuite-Core benchmark: 92% v1 compilation success, 97% mean hidden pass rate, $2.32 average cost per spec. The reference implementation uses pytest, Claude Code in Docker, and the Claude Agent SDK [4].
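To illustrate the hidden-split mechanic, the sketch below withholds a fraction of the test suite from the compilation loop and reports visible and hidden pass rates separately. This is not TDAD's implementation; `compile_prompt` and `run_test` are assumed hooks into your own tooling.

```python
# Illustrative hidden-split harness; not TDAD's implementation.
# compile_prompt() and run_test() are assumed hooks into your own tooling.
import random
from typing import Callable

def split_tests(tests: list, hidden_fraction: float = 0.4, seed: int = 0):
    """Withhold a fraction of tests from the prompt-compilation loop."""
    rng = random.Random(seed)
    shuffled = tests[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - hidden_fraction))
    return shuffled[:cut], shuffled[cut:]  # (visible, hidden)

def evaluate_compiled_prompt(spec: dict, tests: list,
                             compile_prompt: Callable, run_test: Callable) -> dict:
    visible, hidden = split_tests(tests)
    # The refinement loop (PromptSmith's role) only ever sees the visible split.
    prompt = compile_prompt(spec, visible)
    # Hidden pass rate measures generalization rather than overfitting to known tests.
    return {
        "visible_pass": sum(run_test(prompt, t) for t in visible) / len(visible),
        "hidden_pass": sum(run_test(prompt, t) for t in hidden) / len(hidden),
    }
```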
Layer 3: Structural Interference Detection (Emerging — Arbiter)
Arbiter operates at a different level entirely — it analyzes prompt architecture rather than prompt outputs [5]. The framework combines formal rule-based evaluation (checking for mandate-prohibition conflicts, scope overlaps, priority ambiguities) with undirected multi-model scouring. The key insight: different models find different classes of problems.
- Claude Opus 4.6 focuses on structural contradictions and security surfaces
- Kimi K2.5 identifies economic exploitation and resource exhaustion vectors
- Grok 4.1 examines permission schemas and state management
- GLM 4.7 highlights data integrity and temporal paradoxes
Applied to Claude Code, Codex CLI, and Gemini CLI, Arbiter found 152 scourer findings and 21 hand-labeled interference patterns. Critical findings included direct contradictions in Claude Code (TodoWrite “ALWAYS use” vs. workflow “NEVER use TodoWrite”) and a structural data loss bug in Gemini CLI’s memory system where the compression schema lacks fields for saved user preferences — guaranteeing deletion during context truncation [5].
The entire cross-vendor analysis cost $0.27 in total, roughly $0.002 per finding. That makes continuous structural analysis economically trivial.
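The multi-model scouring idea can be approximated without the full framework. The sketch below is a simplified stand-in for Arbiter's pipeline: it assumes an OpenAI-compatible chat client, uses illustrative model names and file paths, and replaces the formal rule-based checks with a free-form audit prompt.

```python
# Simplified multi-model interference scan; not Arbiter's actual pipeline.
# Assumes an OpenAI-compatible client; model names and file path are illustrative.
from openai import OpenAI

client = OpenAI()

AUDIT_INSTRUCTIONS = (
    "You are auditing an agent system prompt. List every internal contradiction, "
    "mandate-prohibition conflict, scope overlap, or priority ambiguity you find. "
    "Return one finding per line, or the single word NONE."
)

def scour(system_prompt: str, models: list[str]) -> dict[str, list[str]]:
    """Ask several model families to independently flag interference patterns."""
    findings: dict[str, list[str]] = {}
    for model in models:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": AUDIT_INSTRUCTIONS},
                {"role": "user", "content": system_prompt},
            ],
        )
        text = (resp.choices[0].message.content or "").strip()
        findings[model] = [] if text == "NONE" else text.splitlines()
    return findings

if __name__ == "__main__":
    # The union of findings across model families matters more than any single list.
    prompt_text = open("system_prompt.md").read()
    for model, issues in scour(prompt_text, ["gpt-4o", "o3-mini"]).items():
        print(model, f"{len(issues)} potential interference findings")
```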
Prompt Architecture Determines Failure Class, Not Severity
Arbiter’s most architecturally significant finding: prompt structure (monolithic, flat, modular) strongly correlates with the type of failure but not its severity [5].
- Monolithic prompts (Claude Code, 1,490 lines) exhibit growth-level bugs at subsystem boundaries — the kind that emerge when independent teams add capabilities without integration testing.
- Flat prompts (Codex CLI, 298 lines) trade capability for consistency; fewer features mean fewer contradiction opportunities.
- Modular prompts (Gemini CLI, 245 lines) show design-level bugs at composition seams — each module works in isolation but inter-module contracts are underspecified.
This means there is no “safe” prompt architecture. The choice of structure determines where regressions will appear, not whether they will.
The Silent Regression Problem Is Real and Measured
The practitioner evidence converges: silent prompt regressions are not hypothetical. Model provider API updates cause behavioral drift without code changes [6]. The failures that hurt most — an agent stops collecting required information, calls a tool with wrong arguments, forgets policy constraints — surface days later in metrics or user complaints [7]. TDAD's hidden test splits directly address this: 97% of hidden tests pass on first compilation, meaning the compiled prompts generalize beyond the test cases visible during compilation [4].
Semantic Mutation Testing Adapts Classical SE to Prompts
TDAD’s semantic mutation testing is perhaps its most novel contribution. Traditional mutation testing mutates code to check if tests catch the change. TDAD mutates prompts — generating plausible faulty variants (e.g., removing an authorization gate, weakening a policy constraint) and verifying the test suite detects the regression [4]. Invalid mutants that don’t actually change behavior are filtered via activation probes, analogous to equivalent mutant detection in classical mutation testing.
This approach directly measures test suite strength — whether your tests would catch a real regression. Mutation scores of 86–100% demonstrate that TDAD-generated test suites have genuine discriminative power, not just coverage.
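A minimal sketch of the mutation-scoring loop follows; `mutate_prompt` and `run_suite` are hypothetical helpers, and MutationSmith's LLM-based generation of faulty variants is abstracted away.

```python
# Illustrative prompt mutation scoring; not MutationSmith's actual code.
# mutate_prompt() and run_suite() are assumed helpers; MutationSmith uses an
# LLM to generate the faulty variants, which is abstracted away here.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Mutant:
    intent: str   # e.g. "SKIP_AUTH_GATE": drop the authorization gate
    prompt: str   # the mutated prompt text

def mutation_score(baseline_prompt: str, test_suite,
                   mutate_prompt: Callable, run_suite: Callable) -> float:
    """Fraction of plausible faulty prompt variants that the test suite detects."""
    assert run_suite(baseline_prompt, test_suite).all_passed, "baseline must be green"
    mutants: list[Mutant] = mutate_prompt(baseline_prompt)
    killed = 0
    for m in mutants:
        if not run_suite(m.prompt, test_suite).all_passed:
            killed += 1   # the suite caught the injected fault
        # Surviving mutants expose behaviors the tests never actually pin down.
    return killed / len(mutants) if mutants else 1.0
```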
Practical Implications
Start with Layer 1 — It’s Table Stakes
If you don’t have output-level regression testing in CI/CD, start there. promptfoo provides the fastest path: declarative YAML configs, GitHub Actions integration, and LLM-as-a-Judge scoring [1]. DeepEval is the better choice if your team prefers pytest-style test authoring [2]. Either tool can block merges when prompt changes cause quality regressions.
Key practices for production regression suites:
- Source test cases from production traffic, not synthetic examples
- Refresh golden datasets regularly — they age quickly
- Test multi-turn agent interactions, not just single-turn prompts
- Include tool-use validation (correct tool selection, argument schemas)
- Set quality thresholds that fail the build, not just report
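On the last point, the gate itself can be trivial: aggregate the scores and exit non-zero below a threshold so CI blocks the merge. A hedged sketch, assuming an `evaluate_case` wrapper around whichever scorer (LLM-as-a-Judge or heuristic) you use:

```python
# Minimal CI quality gate: fail the build, don't just report.
# evaluate_case() is an assumed wrapper around whichever scorer you use
# (LLM-as-a-Judge or a heuristic metric); the threshold is illustrative.
import json
import sys

THRESHOLD = 0.85  # minimum mean score required to allow the merge

def evaluate_case(case: dict) -> float:
    """Placeholder: score one golden case in [0, 1] with your evaluator."""
    raise NotImplementedError

def main() -> int:
    cases = json.load(open("golden_dataset.json"))  # sourced from production traffic
    scores = [evaluate_case(c) for c in cases]
    mean = sum(scores) / len(scores)
    print(f"prompt regression suite: mean score {mean:.3f} over {len(cases)} cases")
    if mean < THRESHOLD:
        print(f"FAIL: below threshold {THRESHOLD}")
        return 1  # non-zero exit blocks the merge in CI
    return 0

if __name__ == "__main__":
    sys.exit(main())
```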
Evaluate TDAD for High-Stakes Agent Prompts
TDAD is most valuable for agents with complex behavioral requirements — many tools, policy constraints, conditional logic. The spec → compile → test → deploy workflow fits naturally into existing CI/CD pipelines. At $2–3 per spec compilation and 30–60 minutes per run, it’s practical for weekly or pre-release validation but too slow for every PR [4].
Current limitations to weigh:
- Only validated on Claude Sonnet 4.5; multi-model support is unproven
- TestSmith’s safety training prevents generating adversarial test inputs
- Scaling beyond 15-node agent specs is unquantified
- V2 compilation success rate drops to 58% — spec evolution is harder than initial compilation
Run Arbiter Periodically on Your System Prompts
At $0.27 for a full cross-vendor analysis, there is no economic reason not to run Arbiter quarterly or after any significant prompt refactor [5]. The multi-model approach catches contradiction classes that internal review and single-model testing miss entirely. Even without the full Arbiter framework, the core insight is immediately actionable: test your system prompts with multiple model families, not just the one you deploy on.
Combine All Three Layers for Defense in Depth
The ideal prompt regression testing stack:
| Layer | Tool | Frequency | Catches |
|---|---|---|---|
| Output regression | promptfoo / DeepEval | Every PR | Behavioral regressions on known scenarios |
| Behavioral compilation | TDAD | Pre-release | Specification gaming, hidden test failures, mutation gaps |
| Structural analysis | Arbiter | Quarterly / post-refactor | Architectural contradictions, interference patterns |
No single layer is sufficient. Output testing misses structural contradictions. Specification compilation misses interference patterns. Structural analysis misses runtime behavioral drift. Together, they cover the space.
Open Questions
- Multi-model TDAD: TDAD's benchmark uses only Claude Sonnet 4.5. How well does spec compilation transfer to GPT, Gemini, or open-weight models? Cross-model portability is untested.
- Scaling TDAD: The benchmark specs have 10–14 nodes. Production agent prompts like Claude Code's 1,490-line system prompt are orders of magnitude more complex. Does the compilation approach scale, or does it require decomposition into sub-specs?
- Arbiter automation: The paper demonstrates Arbiter as a one-shot analysis tool. Can it be integrated into CI/CD to run on every prompt change, and what's the false positive rate at that cadence?
- Adversarial robustness: TDAD's TestSmith avoids generating hostile inputs due to LLM safety training. How do teams test prompt resilience against adversarial user inputs and prompt injection?
- Model-provider drift: Neither TDAD nor Arbiter addresses regressions caused by upstream model updates (same prompt, different model version). The output-evaluation tools handle this, but only if you run them continuously against production, not just on prompt changes.
- Combining TDAD + Arbiter: Could Arbiter's interference detection inform TDAD's mutation catalog? A structural contradiction found by Arbiter could become a TDAD mutation intent, closing the loop between architectural analysis and behavioral testing.
Sources
1. promptfoo — GitHub — CLI and library for prompt/agent/RAG evaluation with CI/CD integration
2. DeepEval — GitHub — Open-source LLM evaluation framework with pytest-style syntax
3. LangWatch Scenario — Agent Testing Framework — Domain-driven TDD for AI agents with multi-turn simulations
4. TDAD: Test-Driven AI Agent Definition (arXiv:2603.08806) — Tzafrir Rehan, March 2026
5. Arbiter: Detecting Interference in LLM Agent System Prompts (arXiv:2603.08993) — Tony Mason, March 2026
6. Automated Prompt Regression Testing with LLM-as-a-Judge and CI/CD — Traceloop
7. Prompt Regression Testing: Preventing Quality Decay — Statsig
8. LLM Testing in 2026: Top Methods and Strategies — Confident AI
9. CI/CD Integration for LLM Eval and Security — Promptfoo
10. From Scenario to Finished: How to Test AI Agents with Domain-Driven TDD — LangWatch
11. LLMs Are the Key to Mutation Testing and Better Compliance — Meta Engineering
12. The Best LLM Evaluation Tools of 2026 — Medium
13. Top 5 Prompt Testing & Optimization Tools in 2026 — Maxim AI
14. AI Agents, Meet Test Driven Development — Latent Space
15. OpenAI Acquires Promptfoo — AI Agent DevSecOps Era Begins
16. LangSmith — LLM & AI Agent Evals Platform
17. 5 Best AI Evaluation Tools for AI Systems in Production — Braintrust
18. The Complete Guide to LLM & AI Agent Evaluation in 2026 — Adaline