
Scout: The Autoresearch Pattern

Summary

Autoresearch is an autonomous experiment loop pattern where an AI agent iteratively modifies code, measures a scalar metric, keeps improvements, reverts failures, and repeats indefinitely. Originated by Andrej Karpathy as a 630-line Python script for ML training optimization, the pattern generalized within eight days to any domain with a measurable outcome — test coverage, bundle size, latency, SEO scores, security posture. The core insight is deceptively simple: constrain scope + define success numerically + automate verification + loop = compounding gains. Shopify CEO Tobias Lütke’s application to Liquid (53% faster parsing from ~120 experiments) demonstrated production viability. Multiple implementations now exist: a Claude Code skill, a Pi extension, and standalone scripts. The pattern is best understood not as a framework but as an infrastructure primitive — a building block for any system that tolerates unsupervised overnight optimization.

Key Findings

1. The Canonical Loop

Every implementation follows the same five-step cycle:

Review state (code + history + results log)
  → Pick next change (based on what worked/failed/untried)
    → Make ONE focused change + git commit
      → Run mechanical verification (benchmark/test/score)
        → Keep if improved, revert if worse
          → Log result → Repeat

The critical design decisions that make this work:

  • Single mutable target: Karpathy’s original restricts modification to one file (train.py). Generalized versions allow glob patterns but still enforce explicit scope boundaries. The constraint prevents the agent from wandering into unrelated code.
  • Fixed time budget: Karpathy uses exactly 5 minutes per experiment, making results directly comparable regardless of what changed. This yields ~12 experiments/hour, ~100 overnight.
  • Git as experimental memory: Every change is committed before verification. Failures revert to the last good commit. The git history becomes a searchable log of what was tried and what worked.
  • Vocabulary-independent metric: The original uses val_bpb (validation bits per byte) rather than loss, making architectural changes fairly comparable. Generalized versions accept any numeric scalar whose direction of improvement is unambiguous.
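
In code, the loop is small. The sketch below is in the spirit of the pattern rather than a copy of any implementation: the target file name, benchmark command, and metric parsing are assumptions, and the agent call is a placeholder.

```python
import subprocess

TARGET = "train.py"              # single mutable file: the explicit scope boundary
BENCH = ["python", "bench.py"]   # hypothetical verify command; prints the metric last
BUDGET_S = 300                   # fixed 5-minute budget keeps experiments comparable

def git(*args: str) -> str:
    return subprocess.run(("git", *args), capture_output=True, text=True, check=True).stdout

def measure() -> float:
    out = subprocess.run(BENCH, capture_output=True, text=True, timeout=BUDGET_S, check=True)
    return float(out.stdout.strip().splitlines()[-1])   # assumes last line is the scalar

def agent_edit(path: str, history: str) -> None:
    """Placeholder for the LLM step: make ONE focused change to `path`."""
    raise NotImplementedError

best = measure()                                # baseline on the current commit
while True:                                     # "NEVER STOP": only a human interrupts
    agent_edit(TARGET, git("log", "--oneline"))
    git("commit", "-am", "experiment")          # commit BEFORE verification
    score = measure()
    if score < best:                            # lower-is-better metric, e.g. val_bpb
        best = score                            # keep: the commit stays as-is
    else:
        git("revert", "--no-edit", "HEAD")      # revert, but the attempt stays in history
    print(f"score={score:.4f}\tbest={best:.4f}")  # minimal results log
```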

2. Three Implementations Worth Studying

Karpathy’s Original (karpathy/autoresearch) — 630 lines of Python, MIT license. 32.8K stars. ML-specific but defines the canonical pattern. Three files: prepare.py (immutable data/eval), train.py (agent modifies), program.md (human-written strategy). The program.md file encodes a “simplicity criterion” (the agent weighs code elegance against metric gains) and a “NEVER STOP” directive.

Claude Code Skill (uditgoenka/autoresearch) — Domain-agnostic generalization as a Claude Code skill. Adds:

  • /autoresearch:plan wizard that analyzes the codebase, suggests metrics, constructs verify commands, and dry-runs before launch
  • Guard parameter (v1.0.4): a safety command (e.g., npm test) that must always pass — if the metric improves but Guard fails, Claude reworks the optimization (up to 2 attempts) without modifying test files
  • /autoresearch:security variant for autonomous vulnerability discovery using STRIDE threat modeling
  • TSV result logging with periodic summaries every 10 iterations
  • Integration with Claude Code’s /loop N for bounded iteration counts
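
The keep/rework/revert decision that the Guard parameter introduces reduces to a few lines. A sketch, with the guard command and attempt limit taken from the description above and everything else assumed:

```python
import subprocess

def guard_passes(cmd: tuple[str, ...] = ("npm", "test")) -> bool:
    """The Guard: a safety command that must always pass, whatever the metric says."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def decide(metric_improved: bool, guard_ok: bool, rework_attempts: int) -> str:
    """Keep a change only when the metric improved AND the Guard passed."""
    if not metric_improved:
        return "revert"                     # metric regressed: back to last good commit
    if guard_ok:
        return "keep"
    # Metric improved but the Guard failed: rework the optimization (up to 2
    # attempts) without touching test files; otherwise give the change up.
    return "rework" if rework_attempts < 2 else "revert"
```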

Pi-autoresearch (davebcn87/pi-autoresearch) — 1,377 stars. Extension for Pi (the coding agent Shopify’s Lütke used for the Liquid PR). Key innovation: session continuity via two state files, plus an optional checks script:

  • autoresearch.md — living document with objective, attempted ideas, dead ends, and wins. A fresh agent with no memory can read this file and continue where the previous session left off.
  • autoresearch.jsonl — append-only structured log. One JSON entry per run with metric values, status, commit hash, and description.
  • autoresearch.checks.sh — optional correctness gates (tests, types, lint) running after successful benchmarks.
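
A sketch of the structured half of that state follows; the field names are illustrative assumptions, not the extension’s documented schema:

```python
import json, subprocess, time

def log_run(path: str, metric: float, status: str, description: str) -> None:
    """Append one entry per run to the JSONL log (append-only, never rewritten)."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metric": metric,
        "status": status,          # e.g. "kept" or "reverted"
        "commit": subprocess.run(["git", "rev-parse", "HEAD"],
                                 capture_output=True, text=True).stdout.strip(),
        "description": description,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def recent_history(path: str, last_n: int = 20) -> list[dict]:
    """What a fresh, memoryless session reloads before reading autoresearch.md."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f][-last_n:]
```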

3. The Shopify Liquid PR: Anatomy of a Production Win

PR #2056 is the most detailed public example of autoresearch applied to production code optimization.

Results:

  Metric               Before     After      Change
  Parse+render time    7,469 µs   3,534 µs   -53%
  Parse time alone     6,031 µs   2,353 µs   -61%
  Object allocations   62,620     24,530     -61%

What the agent discovered (93 commits from ~120 experiments):

  • Replaced StringScanner tokenizer with String#byteindex — single-byte search is ~40% faster than regex-based skip_until. This alone reduced parse time by ~12%.
  • Pure-byte parse_tag_token eliminated costly StringScanner#string= resets.
  • Splat-free filter dispatch via invoke_single/invoke_two methods.
  • Primitive type fast paths skipping to_liquid conversion.
  • Pre-computed frozen strings for integers 0-999.
  • While loops replacing .each for short arrays where YJIT optimizes better.

What failed: Split-based tokenizers (couldn’t handle nesting), tag name interning (collision overhead), shared expression caches (state leakage).

All 974 unit tests pass with zero regressions. The PR modified 14 core files across tokenization, parsing, rendering, and filtering.

4. Generalization Conditions

The pattern works when a problem satisfies three conditions:

  1. Scriptable asset: Something the agent can modify (code, config, prompts, templates)
  2. Measurable scalar outcome: A number that goes up or down (latency, score, coverage, size)
  3. Time-boxed evaluation: The verification loop completes in a bounded time

Documented applications beyond ML training:

  • Performance optimization: Bundle size, build times, query latency, Lighthouse scores
  • Test quality: Coverage percentage, mutation testing score
  • Security hardening: STRIDE/OWASP coverage via autonomous red-teaming
  • Infrastructure: Terraform compliance scores, resource utilization
  • GPU kernel optimization: ~40 experiments/hour discovering faster Triton/CUDA implementations
  • Agent-on-agent optimization: One agent iteratively improves another’s code using evaluation scores
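
Condition 2 is usually the only creative step: write a command that reduces the outcome to one number the loop can parse. A hypothetical bundle-size verifier, where the build command and artifact path are assumptions:

```python
import gzip, pathlib, subprocess

def bundle_size_bytes(build_cmd=("npm", "run", "build"),
                      artifact="dist/app.js") -> int:
    """Hypothetical verify command: build, then report gzipped size (lower is better)."""
    subprocess.run(build_cmd, check=True)     # time-boxing is the caller's responsibility
    return len(gzip.compress(pathlib.Path(artifact).read_bytes()))

if __name__ == "__main__":
    print(bundle_size_bytes())   # the loop parses this single printed number
```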

5. Failure Modes and Limitations

Goodhart’s Law is the primary risk. Optimizing the metric doesn’t guarantee meaningful outcomes. In Karpathy’s domain, this manifests as “throughput gaming” — the agent finds ways to run more training steps in the 5-minute window rather than genuinely improving the model. And once productive strategies are exhausted, the agent resorts to random modifications.

Compound noise accumulation. GPU non-determinism, timing variance, and measurement noise can cause the agent to keep changes that aren’t actually improvements. Without repeated runs or uncertainty modeling, a “keep” decision can be partly noise-driven.
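
One mitigation, assumed here rather than taken from any of the implementations above, is to repeat the measurement and require the median to clear a noise margin before keeping a change:

```python
import statistics

def improved(measure, best: float, runs: int = 5, margin: float = 0.01) -> tuple[bool, float]:
    """`measure` is any zero-argument callable returning the (lower-is-better) metric.
    Keep a change only if the median of several runs beats `best` by a relative margin."""
    median = statistics.median(measure() for _ in range(runs))
    return median < best * (1 - margin), median
```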

Metric proxy mismatch. A 5-minute training run may not predict long-training behavior; short-horizon metrics can mislead about long-horizon outcomes. Karpathy’s claim that improvements found on a depth-12 model transferred to depth-24 is documented but not independently reproduced.

Evaluator integrity. Without strict read-only protection on the evaluation code, the agent could rewrite the scoring function to report improvement without actually improving. The prepare.py file in Karpathy’s version is immutable by convention, not by enforcement. The Guard parameter in the Claude Code skill is a partial mitigation.
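
A cheap, partial safeguard — an assumption here, not something the implementations document — is to reject any experiment whose commit touches the evaluation code, before even looking at the metric. prepare.py comes from Karpathy’s layout; the other protected paths below are placeholders.

```python
import subprocess

PROTECTED = ("prepare.py", "bench.py", "tests/")   # evaluator paths the agent must not edit

def touches_protected() -> bool:
    """True if the latest experiment commit modified any protected path."""
    changed = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return any(path.startswith(p) for path in changed for p in PROTECTED)

# In the loop: if touches_protected(), revert immediately, whatever the metric says.
```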

Log injection. program.md instructs the agent to read run logs after execution. A malicious training script could output text that influences the agent’s subsequent decisions. Deploy with strict sandboxing.

Agent reliability variance. Some models (Codex) ignore persistence instructions and stop the loop. Claude is reported as more reliable for long-running sessions. Agent choice affects pattern viability.

Diminishing returns. After exhausting the low-hanging optimizations, the agent enters a long tail of marginal or zero-value experiments. There’s no built-in stopping criterion besides human interruption. The Claude Code skill’s /loop N bounded iteration count is a pragmatic mitigation.
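
A plateau heuristic, offered only as a sketch since none of the implementations ship a stopping rule: stop once a fixed number of consecutive experiments yields no kept improvement.

```python
def should_stop(outcomes: list[str], patience: int = 25) -> bool:
    """`outcomes` holds one entry per experiment, e.g. "kept" or "reverted".
    Stop once the last `patience` experiments produced no kept improvement."""
    recent = outcomes[-patience:]
    return len(recent) == patience and "kept" not in recent
```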

Practical Implications

Immediate Opportunities

  1. Performance optimization tasks. Teams could offer autoresearch-style optimization as a service: point the agent at a codebase with a benchmark suite, let it run overnight, deliver a PR with measurable improvements. The Liquid PR is a compelling proof-of-concept for this as a product capability.

  2. Automated security hardening. The /autoresearch:security variant demonstrates autonomous STRIDE threat modeling and vulnerability discovery. Sandboxed execution environments are ideally suited for this — agents can probe for vulnerabilities in isolation without risking production systems.

  3. Quality gates with compounding improvement. Instead of pass/fail CI, an autoresearch loop could continuously improve test coverage, reduce technical debt metrics, or optimize query performance against staging data. The Guard parameter pattern (metric must improve AND safety command must pass) is the right constraint model.

Architecture Considerations

  1. Session continuity is the hard problem. The pi-autoresearch dual-file pattern (autoresearch.md + autoresearch.jsonl) addresses the core difficulty: context-window-limited agents have to resume work across sessions with no memory of earlier ones. A persistent state store could implement this natively — persisting experiment history in structured storage and injecting a summary into each agent session’s context.

  2. Evaluator isolation is a security requirement. The metric evaluation must be immutable and sandboxed. If an agent can modify its own scoring function, the loop produces meaningless results. Container isolation should enforce read-only mounts for benchmark/evaluation code.

  3. The pattern composes with phased development workflows. An investigation phase could identify optimization targets. An autoresearch loop could execute the optimization. A review gate could validate results before merging. This is a natural extension of any spec-driven development workflow.
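
For the evaluator-isolation requirement in consideration 2, one option — sketched under the assumption of a Docker-based sandbox, with hypothetical image and paths — is to mount the evaluation code read-only so the scored process cannot rewrite it:

```python
import pathlib, subprocess

def run_benchmark_sandboxed(image: str, eval_dir: str, work_dir: str) -> str:
    """Score an experiment in a container where the evaluator cannot be rewritten."""
    eval_abs = str(pathlib.Path(eval_dir).resolve())
    work_abs = str(pathlib.Path(work_dir).resolve())
    result = subprocess.run(
        ["docker", "run", "--rm", "--network", "none",  # no network during scoring
         "-v", f"{eval_abs}:/eval:ro",                   # evaluation code: read-only
         "-v", f"{work_abs}:/work",                      # mutable target: read-write
         image, "python", "/eval/bench.py", "/work"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```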

What NOT to Do

  1. Don’t over-generalize. The pattern works for optimization problems with tight feedback loops. It does not work for design problems, architectural decisions, or anything where “better” is subjective. Resist the temptation to apply it to tasks without a clear scalar metric.

  2. Don’t trust the metric blindly. Always pair the optimization metric with a correctness guard (the Guard pattern). And periodically have a human review the accumulated changes — the Liquid PR worked because Lütke understood the codebase well enough to evaluate the agent’s optimizations.

Open Questions

  1. Stopping criteria. When should the loop stop? Fixed iteration count? Plateau detection? Cost budget? No implementation has a principled answer yet.

  2. Multi-metric optimization. Real-world optimization involves tradeoffs (speed vs memory, coverage vs execution time). The current pattern handles one metric plus a guard, but not Pareto optimization across multiple objectives.

  3. Transfer validation. Karpathy claims improvements transfer from small to large models. The Liquid PR optimizations were validated by the existing test suite. But in general, how do you validate that micro-optimizations compose well and don’t introduce subtle behavioral changes?

  4. Cost economics. ~120 experiments × (LLM inference cost per experiment + compute cost per experiment) = total optimization cost. At what point does the cost of running the loop exceed the value of the optimizations discovered? No published analysis addresses this.

  5. Multi-file coordination. Karpathy restricts to one file. The Liquid PR touched 14 files. The Claude Code skill allows glob patterns. As scope expands, the search space explodes and the agent’s ability to reason about cross-file interactions degrades. Where’s the practical scope boundary?

Sources

  1. karpathy/autoresearch — GitHub — Original 630-line implementation, MIT license
  2. uditgoenka/autoresearch — Claude Code Skill — Domain-agnostic generalization with Guard, plan wizard, and security variant
  3. davebcn87/pi-autoresearch — Pi Extension — Session continuity via dual-file state, used by Shopify
  4. Shopify/liquid PR #2056 — 53% faster parsing, 93 commits from ~120 experiments
  5. Simon Willison: Shopify/liquid Performance PR — Analysis of the Liquid autoresearch run
  6. Andrej Karpathy’s 630-line script ran 50 experiments overnight — The New Stack — Technical deep dive on the original
  7. Karpathy open-sources autoresearch — VentureBeat — Broader context and implications
  8. Autoresearch Became a Primitive — paddo.dev — Ecosystem analysis, generalization conditions, limitations
  9. Karpathy’s Minimal Agent Loop — Kingy AI — Detailed lifecycle analysis, compound error risks, implementation guidance
  10. Autoresearch Explained: 100 Experiments Overnight — Data Science Dojo — Accessible technical overview
  11. Exploring Karpathy’s Autoresearch — Ken Huang Substack — Agent design principles analysis
  12. autoexp Gist — Generalized Autonomous Experimentation — Standalone generalization
  13. Latent Space: Autoresearch — Sparks of Recursive Self-Improvement — AI engineering community perspective
  14. MarkTechPost: Build Autonomous ML Research Loop in Colab — Step-by-step implementation guide