Artificer Digital The Artificer's Grimoire

Scout: Operationalising AI-Assisted Vulnerability Discovery — What Mozilla's Mythos Pipeline Actually Requires

Summary

Mozilla’s Behind the Scenes Hardening Firefox with Claude Mythos Preview post is the first detailed operational case study of a production AI-assisted vulnerability-discovery pipeline at the scale of a tier-1 codebase. It is paired with Mozilla’s main-blog framing piece (“The defects are finite, and we are entering a world where we can finally find them all”), which situates the engineering story inside Mozilla’s broader security posture. The headline numbers are the marketing surface: 271 Mythos-attributed vulnerabilities in Firefox 150’s release, 180 of them rated sec-high, and 423 total security fixes shipped in April 2026 against a prior-year April baseline of 31. The operational surface is where the practitioner story lives: an agentic harness built atop existing fuzzing infrastructure, parallelised across ephemeral VMs each assigned a single target file, with sanitizer-build crash/no-crash as the deterministic success signal, a second LLM grading reports before human review, full integration into Mozilla’s existing security-bug lifecycle (deduplication, triage, patch tracking, release management), and a two-engineer-per-patch remediation discipline that explicitly resists automation. The pipeline reports fewer than 15 false positives across the entire 271-bug discovery campaign. The operational question this answers is what it takes to integrate frontier-model vulnerability discovery into a working security organisation. The answer is more existing infrastructure (sanitizer builds, bug tracker, fuzzing harness, 100+ contributors who can patch) than capability, with the model itself being a smaller part of the load-bearing system than the trade-press framing suggests.

Key Findings

1. The Success Signal Is the Whole Trick

Mozilla Distinguished Engineer Brian Grinstead’s framing of the architecture is the cleanest practitioner-shaped explanation of why this pipeline works where earlier AI bug-finding deployments failed. Per Help Net Security’s reporting, Grinstead’s verbatim statement: “This pipeline is extremely reliable at filtering out false positives, so long as you have a clearly defined success condition to validate against. Memory corruption issues are especially easy to validate: either you trigger Address Sanitizer or you don’t (in which case you tell the agent to keep working until it does).”

This is the load-bearing claim of the whole operational story. The agentic harness is permitted to keep iterating — modifying its hypothesis, generating new test cases, patching Firefox source to construct exploit conditions — until the sanitizer build crashes. A crash is the discriminator between speculative hypothesis and real vulnerability; a non-crash discards the candidate without ever entering the triage queue. The harness does not need to be right on first attempt; it needs to keep iterating until the deterministic signal flips. The reason false-positive rates can credibly approach zero — fewer than 15 across 271 confirmed vulnerabilities, per Grinstead’s reporting in Help Net Security — is that the gate is mechanical, not judgement-based.
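The iterate-until-the-signal-flips loop described above can be sketched in a few lines. This is a minimal illustration, not Mozilla’s harness: `propose`, `run`, and `RunResult` are hypothetical names, and the only detail taken from real tooling is that AddressSanitizer prints a line containing "ERROR: AddressSanitizer" when it reports a fault.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# ASan crash reports contain this marker on stderr.
ASAN_MARKER = "ERROR: AddressSanitizer"


@dataclass
class RunResult:
    exit_code: int
    stderr: str


def is_sanitizer_crash(result: RunResult) -> bool:
    """Mechanical gate: a run counts only if the sanitizer actually reported."""
    return result.exit_code != 0 and ASAN_MARKER in result.stderr


def iterate_until_crash(
    propose: Callable[[int, Optional[RunResult]], str],  # hypothetical agent step
    run: Callable[[str], RunResult],                     # hypothetical build runner
    max_attempts: int,
) -> Optional[str]:
    """Return the first test case that trips the sanitizer, or None."""
    last: Optional[RunResult] = None
    for attempt in range(max_attempts):
        testcase = propose(attempt, last)  # agent revises its hypothesis
        last = run(testcase)               # execute against the sanitizer build
        if is_sanitizer_crash(last):
            return testcase                # confirmed: eligible for triage
    return None                            # discarded: never reaches triage
```

In a real deployment `run` would presumably wrap a subprocess call against the sanitizer build. The point of the sketch is that nothing in the gate depends on the model’s own judgement of whether the bug is real.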

The practitioner consequence: this pipeline architecture is only available to security teams whose codebase has a working sanitizer build (AddressSanitizer for memory-safety bugs, ThreadSanitizer for concurrency, UndefinedBehaviorSanitizer for undefined behaviour) that can be exercised reliably in CI. Mozilla has that because Firefox has been instrumented for sanitizer-driven fuzzing for years. Teams whose security posture is “we run a static-analysis tool nightly” do not have the deterministic success signal that makes the iterative-until-it-crashes loop work. The corollary for greenfield deployments: budget the sanitizer-build effort first, not the model integration.

The 15-false-positives detail is itself instructive on what does break. Per Grinstead via Help Net Security: “We did see a small handful of false positives, primarily caused by changing some precondition in order to trigger an issue that would otherwise be valid (e.g., enabling a testing preference or using a private API). We’ve seen fewer than 15 of these total, and when we see them we update the harness to prevent similar issues in the future.” The failure mode is the agent constructing exploit preconditions that real attackers couldn’t reach — toggling a debug flag, calling a private API. The fix is harness-level guardrails on what the agent is permitted to modify, evolved each time a false positive is observed. This is harness engineering at the source-code-modification level, not at the prompt level.
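One way to picture a harness-level guardrail of this kind is a deny-list check applied to each crashing candidate before it is accepted. The sketch below is an assumption about shape only: the `Candidate` fields and the prefix lists are hypothetical placeholders, not Mozilla’s rules or real Firefox preference names.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical deny-lists; a real harness would extend these each time
# a false positive is traced back to an unreachable precondition.
FORBIDDEN_PREF_PREFIXES = ("testing.", "debug.")
PRIVATE_API_PREFIXES = ("private_api.", "test_only.")


@dataclass
class Candidate:
    prefs_changed: Dict[str, object] = field(default_factory=dict)
    apis_called: List[str] = field(default_factory=list)


def guardrail_violations(c: Candidate) -> List[str]:
    """List the reasons a crash should not count as attacker-reachable."""
    reasons = []
    for pref in c.prefs_changed:
        if pref.startswith(FORBIDDEN_PREF_PREFIXES):
            reasons.append(f"changed non-default preference: {pref}")
    for api in c.apis_called:
        if api.startswith(PRIVATE_API_PREFIXES):
            reasons.append(f"used private API: {api}")
    return reasons
```

A candidate with a non-empty violation list would be sent back to the agent rather than filed, which keeps the sanitizer crash necessary but no longer sufficient.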

2. The Architecture Is Ephemeral-VM-Per-Target-File, With Hard Containment Boundaries

The Mozilla Hacks post and Help Net Security’s reporting together establish the pipeline shape: jobs are parallelised across multiple ephemeral virtual machines, each VM assigned to hunt for vulnerabilities within a specific target file, with findings written to a bucket and the VM destroyed after analysis completes. Grinstead’s verbatim containment description, per Help Net Security: “Any source code changes made to craft sandbox escapes are only used to generate bug reports. They never land in the upstream Firefox source code, nor are they published anywhere outside of the bug tracker. Scanning happens entirely within isolated VMs that have a local copy of Firefox’s open source codebase, with no means to publish their changes. After completing the analysis, any findings are written into an internal database and the VM is destroyed.”

The decomposition is one VM per target file, not one VM per agent or one VM per run. This is the same shape as the harness primitives the platform layer shipped this week — sandbox per task, per-tenant durable code, isolated execution — applied to the target axis rather than the tenant axis. It is also the natural concurrency boundary for a discovery campaign: the deduplication and triage layer downstream is doing the work of combining findings across files into a coherent bug-tracker view.

Two operational details matter for replicability. First, the VMs need read access to the full Firefox source tree (the agent has to be able to reason about cross-file interactions) but no write access that escapes the VM. The model is permitted to patch the local Firefox tree to construct sandbox-escape proofs — that’s how it demonstrates exploitability of latent flaws — but those patches die with the VM and never reach upstream. Second, Stiennon’s writeup notes that Mythos repeatedly attempted “prototype pollution escape paths” that Firefox’s architectural hardening had already mitigated, and the harness gracefully recorded the failed-attempt evidence rather than reporting them as exploitable. That’s the success signal doing its job in reverse: defensive measures that work register as “agent kept trying, never crashed the sanitizer,” which is correctly classified as “not a finding.”
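The per-target decomposition and the "patches die with the VM" property can be illustrated at small scale. In the sketch below a temporary directory stands in for an ephemeral VM, which is a deliberate simplification; `scan` is a hypothetical callable representing the agent loop, and nothing here reflects Mozilla’s actual orchestration.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def scan_in_ephemeral_workspace(
    source_tree: Path, target_file: str, scan: Callable[[Path, str], List[str]]
) -> List[str]:
    """Copy the tree, scan one target file, return findings, destroy the copy."""
    workspace = Path(tempfile.mkdtemp(prefix="scan-"))
    try:
        local_tree = workspace / "tree"
        shutil.copytree(source_tree, local_tree)  # full read access, private copy
        return scan(local_tree, target_file)      # may patch local_tree freely
    finally:
        shutil.rmtree(workspace, ignore_errors=True)  # local patches die here


def run_campaign(
    source_tree: Path,
    targets: List[str],
    scan: Callable[[Path, str], List[str]],
    workers: int = 4,
) -> Dict[str, List[str]]:
    """One isolated workspace per target file; only findings survive."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            t: pool.submit(scan_in_ephemeral_workspace, source_tree, t, scan)
            for t in targets
        }
        return {t: f.result() for t, f in futures.items()}
```

The only thing that crosses the workspace boundary is the return value of `scan`, which mirrors the containment property Grinstead describes: findings leave, source modifications do not.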

The infrastructure budget implication is non-trivial. The Firefox codebase is roughly 30 million lines across tens of thousands of files; one-VM-per-file means the discovery campaign provisions tens of thousands of ephemeral VMs across the run, each with enough memory to load a sanitizer-build copy of Firefox plus run an agent loop. Public reporting does not include Mozilla’s exact compute spend on the campaign. AISLE’s published numbers on adjacent kernel scans — under $20,000 for a full OpenBSD analysis with 1,000 scaffold runs, under $2,000 for a one-day Linux kernel root-exploit discovery — suggest the Firefox campaign sits comfortably in five-figure-dollar inference-cost territory for the discovery side, before VM compute. For a defensive program already running fuzzing at Mozilla’s scale, that’s noise in the budget; for a smaller security team starting from a no-sanitizer-build, no-fuzzing-harness baseline, the absolute inference cost is the cheapest line item.

3. A Second LLM Grades Findings Before Human Review

The piece of the pipeline most under-covered in trade-press summaries but most operationally consequential is the secondary-LLM grading step. Per TechCrunch’s interview with Grinstead, the harness “used another model to grade reports before engineers handled patches.” This is the operational equivalent of the chain-of-thought-on-flagged-actions stage in Anthropic’s Constitutional Classifiers architecture — the discovery model produces structured findings with sanitizer-crash evidence, a separate grading model reviews each finding for completeness, severity classification, and reproducibility, and only graded reports reach human engineers.

Mozilla’s public reporting does not name the grading model, the prompt structure, or the grading rubric, and other public sources add little on the grading-stage mechanics. What the Mozilla Hacks post and The Decoder writeup do establish is that the pipeline integrates “deduplicating against known issues, tracking bugs, triaging them, and getting fixes shipped,” and that the inputs to the human triage queue are already pre-filtered, severity-classified, and deduplicated against the existing bug tracker.
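Mozilla has not published how its deduplication works. A common technique in crash triage generally, offered here only as an assumption about what such a step could look like, is to fingerprint each crash by its sanitizer bug type plus the top few stack frames and drop anything whose fingerprint is already in the tracker. All names below are hypothetical.

```python
import hashlib
from typing import Iterable, List, Set


def crash_signature(bug_type: str, frames: Iterable[str], top_n: int = 3) -> str:
    """Fingerprint a crash by bug type and its top N stack frames."""
    key = bug_type + "|" + "|".join(list(frames)[:top_n])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def drop_known(findings: List[dict], known_signatures: Set[str]) -> List[dict]:
    """Keep findings that are neither in the tracker nor duplicates of each other."""
    seen = set(known_signatures)
    fresh = []
    for f in findings:
        sig = crash_signature(f["bug_type"], f["frames"])
        if sig not in seen:
            seen.add(sig)
            fresh.append(f)
    return fresh
```

The choice of `top_n` trades precision against recall: too few frames merges distinct bugs that share an allocator path, too many splits one bug into several reports.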

For practitioner teams building or specifying their own pipeline, the secondary-grader pattern is the right place to put the precision-vs-recall lever. The first-stage harness can be aggressive (high recall, modest false-positive rate before the sanitizer gate, near-zero false-positive rate after) because the grader filters the surviving findings down to actionable reports. Without the grader, the human triage workflow inherits the variance of LLM-generated bug summaries directly; with the grader in place, what reaches human eyes is a triage-ready summary already filtered through a separate model trained for grading reproducibility and severity. This is the same pattern that’s emerged in agent-CI/CD discussions (GitHub’s defense-in-depth model) for non-deterministic output validation, ported to the security domain.
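The grader-as-filter pattern can be sketched as follows. The rubric fields are an assumption drawn from the three properties named earlier (completeness, severity, reproducibility); `grade` is a hypothetical callable standing in for a call to the grading model, and the severity labels follow Mozilla’s public sec-* naming.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Mozilla's public severity labels, lowest to highest.
SEVERITY_ORDER = ["sec-low", "sec-moderate", "sec-high", "sec-critical"]


@dataclass
class Grade:
    reproducible: bool  # grader could follow the report to the same crash
    complete: bool      # report includes steps, crash log, affected file
    severity: str       # one of SEVERITY_ORDER


def triage_ready(
    reports: List[str], grade: Callable[[str], Grade]
) -> List[Tuple[str, str]]:
    """Drop incomplete or unreproducible reports; order the rest by severity."""
    graded = [(r, grade(r)) for r in reports]
    kept = [(r, g) for r, g in graded if g.reproducible and g.complete]
    kept.sort(key=lambda pair: SEVERITY_ORDER.index(pair[1].severity), reverse=True)
    return [(r, g.severity) for r, g in kept]
```

Because the grader returns structured fields rather than free text, the filter itself stays deterministic even though the model producing the grade is not.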

4. The Remediation Pipeline Is the Bottleneck, and It Resists Automation

The discovery side scales with compute. The remediation side does not. Grinstead’s verbatim framing, per TechCrunch’s coverage: “For the bugs we’re talking about in this post, every single one is one engineer writing a patch and one engineer reviewing it. We have not found it to be automatable.”

Per the Mozilla Hacks post, over 100 people contributed code to the Firefox 150 remediation effort. The arithmetic is straightforward: 271 bugs, two engineers per bug (writer + reviewer), one bug per writer-reviewer pair at any moment, and the operational reality that some bugs require deeper investigation and longer fix windows than others. Mozilla absorbed the April 2026 spike — 423 total fixes versus 31 in April 2025 per TechCrunch and 76 fixes in March 2026 per The Register’s coverage — by reprioritising everything else and rotating contributors through patch-writer and patch-reviewer roles. Per the Mozilla Hacks post, the team “reprioritized everything else” during the discovery campaign.
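The arithmetic can be made explicit with a small calculator. The bug count and the writer-plus-reviewer pairing come from the reporting above; the days-per-bug figure is not published anywhere and is purely an illustrative assumption, as is the simplification that every bug takes the same time.

```python
import math


def weeks_to_clear(
    bugs: int, engineers: int, days_per_bug: float, days_per_week: int = 5
) -> int:
    """Weeks needed when each bug occupies one writer and one reviewer."""
    pairs = engineers // 2  # one writer + one reviewer per bug at a time
    if pairs == 0:
        raise ValueError("need at least two engineers to form a pair")
    pair_days = bugs * days_per_bug
    return math.ceil(pair_days / (pairs * days_per_week))


# Assumed 3 working days per bug, which is NOT a Mozilla figure.
print(weeks_to_clear(271, 100, 3))  # 4
print(weeks_to_clear(271, 20, 3))   # 17
```

Under that assumption, 100 contributors clear the campaign in about a month while a 20-person team needs roughly four months, which is the sense in which remediation is a headcount question rather than a compute question.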

This is the operational fact most likely to surprise teams planning their own Mythos-class deployment. The discovery side of an AI-assisted vulnerability program is a budget question — provision the VMs, pay the inference bill, instrument the sanitizers. The remediation side is a headcount and team-discipline question, and the multiplier on existing security-engineer capacity is roughly 1× per simultaneously-discoverable bug, not the order-of-magnitude that the discovery throughput suggests. The Register’s coverage frames this as a feature of Mozilla’s approach rather than a limitation — “the agentic harness – the middleware mediating between AI and the end user” is doing the load-bearing work of converting frontier-model capability into shippable patches, and the middleware is human as much as it is machine.

The Hacker News piece Mythos Changed the Math on Vulnerability Discovery generalises the practitioner observation: “finding a vulnerability and fixing it are two entirely different workflows, and the gap between them is where most security programs quietly bleed out.” For procurement teams: any Mythos-class capability deployment has to be costed across both sides of the workflow. Discovery is cheap; integrating-into-existing-bug-tracker-and-doubling-patch-throughput-without-quality-loss is not.

5. The 20-Year and 15-Year Bug Specifics Are the Capability Signal

The trade-press headline numbers point to capability, but the technical specifics of the two long-standing bugs are the load-bearing evidence for why this isn’t just a fuzzing-throughput story.

The 20-year-old XSLT vulnerability (Mozilla bug 2025977), per the Mozilla Hacks post: “reentrant key() calls cause a hash table rehash that frees its backing store while a raw entry pointer is still in use.” This is a use-after-free that requires reasoning about (a) the XSLT spec’s key-function semantics permitting reentrant calls, (b) the implementation’s hash-table-with-incremental-rehash strategy, and (c) the lifecycle of a raw pointer held across the rehash boundary. The detection path is not “fuzz inputs at the XSLT parser and watch for crashes” — twenty years of fuzzing failed to surface it — but “reason about which API patterns can trigger the reentrancy path and construct a reproducer that exercises the pointer-aliasing window.”

The 15-year-old <legend> element bug (Mozilla bug 2024437), per the same source, was “triggered by meticulous orchestration of edge cases across distant parts of the browser, including recursion stack depth limits, expando properties, and cycle collection.” The reasoning span here is across three subsystems that wouldn’t normally co-occur in a single fuzz target.

Digit.in’s analysis frames the distinguishing pattern as “combinatorial reasoning” — bugs where multiple individually-innocent behaviours combine to create a vulnerability, and where the combination space is too large for random fuzzing to navigate efficiently. Simon Willison’s read of the Mozilla pipeline post is the cleanest practitioner-language framing of the capability shift: AI-generated security reports have “gone from slop to rather more tasty” — the dramatic recall improvement happens to land in exactly the bug categories that conventional tooling has historically failed to surface. The practitioner takeaway is that Mythos’s capability surplus over earlier fuzzing-and-static-analysis tools concentrates exactly in the bug categories that have historically been the residual cost of mature security programs — the bugs that survive because they require reasoning about non-local interactions. For security teams running long-running fuzz campaigns against mature codebases, the question is no longer whether the residual bug-fixing backlog is reachable, but whether they can absorb the throughput when the residual starts being surfaced systematically.

6. Replicability Is About Existing Infrastructure, Not Model Access

Mozilla’s deployment is reproducible by other security teams in principle — Mythos itself is gated under Project Glasswing’s roughly fifty-organisation allowlist plus the additional 40+ organisations Anthropic has extended access to, per Anthropic’s Glasswing page. SecurityWeek’s earlier coverage of the 271-bug result names the launch partner list — AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks — and Firefox CTO Bobby Holley’s framing of the capability cliff, “Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher.” The frontier-capability ceiling is broadly reachable from a second vendor (the AISI evaluation of GPT-5.5 found capability symmetry against Mythos on the same test ranges). What is not broadly reachable, on the evidence in the Mozilla pipeline post, is the operational substrate that turns the model into useful security work.

Stiennon’s analysis is direct on the point: software vendors with mature codebases can replicate Mozilla’s approach immediately using available frontier models, but the gating factor is existing infrastructure — sanitizer builds, fuzzing harness, bug tracker integration, dedicated security-engineering capacity to absorb the patch flow. The Implicator essay makes the strongest version of the argument: even after Mythos goes broadly available, smaller software vendors will lack the codebase-specific wrapper that turns frontier capability into shippable security work, because they lack the substrate Mozilla spent years building.

AISLE’s nano-analyzer work offers the strongest counter-evidence to the “you need Mythos” framing. Their published numbers: under $100 in API cost for full FreeBSD and OpenBSD kernel scans plus benchmarking, using GPT-5.4-nano (roughly 100× cheaper than speculated Mythos pricing) and GPT-OSS-120B (open-weights, roughly 600× cheaper). The scans produced two confirmed bugs in FreeBSD NFS code, including a 26-year-old memory-corruption flaw that is in responsible disclosure and under vendor investigation. AISLE’s thesis is that throughput-via-parallelism on adequate models can match capability-via-depth on frontier models for the long-tail-of-bugs use case: “a thousand adequate eyes looking everywhere should find things that one brilliant eye looking selectively misses, even if each individual eye is less perceptive.” The implication for practitioner teams is structurally identical to Mozilla’s: the harness, not the model, is the load-bearing primitive. nano-analyzer’s three-stage pipeline (context generation → vulnerability scan → skeptical triage with grep validation, all parallelised at the file level) is the same operational shape as Mozilla’s harness, with the model as the independent variable being swapped.
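The three-stage shape can be sketched without reference to any particular model. The code below is not taken from nano-analyzer; it is an assumed reduction in which `make_context` and `scan` are hypothetical model-backed callables and the skeptical-triage stage is simplified to its grep check alone: a finding survives only if every identifier it cites literally appears in the file.

```python
import re
from typing import Callable, List


def grep_validates(source: str, identifiers: List[str]) -> bool:
    """Every identifier a finding cites must literally appear in the file."""
    return bool(identifiers) and all(
        re.search(r"\b" + re.escape(name) + r"\b", source) for name in identifiers
    )


def analyze_file(
    source: str,
    make_context: Callable[[str], str],
    scan: Callable[[str, str], List[dict]],
) -> List[dict]:
    """Run context generation, scan, and grep-based triage on one file."""
    context = make_context(source)      # stage 1: context generation
    candidates = scan(source, context)  # stage 2: vulnerability scan
    return [                            # stage 3: skeptical triage (grep only)
        c for c in candidates if grep_validates(source, c["identifiers"])
    ]
```

Since `analyze_file` takes the model calls as parameters, swapping a frontier model for a cheaper one changes nothing in the harness, which is the structural claim being made.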

7. The CI Integration Story Is the Next Phase

The Decoder’s coverage reports Mozilla’s stated plan to integrate the pipeline “directly into its development process so that every new piece of code is automatically checked before it gets committed.” Per the Mozilla Hacks post, future-phase work includes “integrating analysis into continuous integration for patch-based scanning as changes land.” The shift from periodic discovery campaigns to per-commit scanning changes the operational shape materially.

The economics of per-commit scanning are different from the discovery-campaign mode. The campaign mode is a parallelisable burst: scan tens of thousands of files in a few weeks, surface the existing residual bugs, then remediate over months. The per-commit mode is a streaming workload: every PR triggers a scoped scan against the changed-file set and their reachable dependencies, with a latency budget short enough to gate the CI pipeline. The agent’s iterative loop — “keep trying until the sanitizer crashes” — has to terminate inside that budget when the scan finds nothing, which means a different harness profile: shallower exploration depth, fewer iterations per file, tighter early-exit conditions when no candidate hypothesis crashes after some configurable number of attempts.
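The difference between the two modes can be expressed as two profiles driving the same bounded loop. The specific attempt counts and deadlines below are invented for illustration, since no source reports Mozilla’s values, and `attempt_once` is a hypothetical callable that returns True when the sanitizer crashes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HarnessProfile:
    max_attempts: int
    deadline_seconds: float


# Illustrative values only; not taken from any published configuration.
CAMPAIGN = HarnessProfile(max_attempts=200, deadline_seconds=6 * 3600)
PER_COMMIT = HarnessProfile(max_attempts=10, deadline_seconds=15 * 60)


def bounded_scan(
    attempt_once: Callable[[int], bool],
    profile: HarnessProfile,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Return the attempt index that crashed the sanitizer, or None on early exit."""
    start = clock()
    for i in range(profile.max_attempts):
        if clock() - start > profile.deadline_seconds:
            return None  # latency budget exhausted: let the CI gate proceed
        if attempt_once(i):
            return i
    return None          # attempt budget exhausted with no crash
```

The injectable `clock` is there so the deadline path can be tested without waiting; in CI the default monotonic clock would be used.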

For teams planning their own pipeline trajectory, this is the budget reality check. Campaign-mode scans against a static codebase set the floor for what’s findable. Per-commit scans set the floor for what’s preventable — bugs that would have been written but are caught before merge. The two modes are complementary, not substitutable: the campaign surfaces existing latent bugs, the per-commit gate prevents new ones. Public reporting on Mozilla’s per-commit cost model and latency budget remains thin.

8. The Mozilla Pipeline Is the Inflection Point for the “Remediation Capacity” Question

The practitioner question facing security teams in 2026-Q3 is no longer whether to adopt AI-assisted vulnerability discovery. It is whether the rest of the security organisation can absorb the throughput. The Rapid7 analysis on Project Glasswing names the gap explicitly: 48,185 CVEs in 2025, 40% high or critical severity, and triage-ownership-remediation workflows that “already lag behind discovery rates”, all before the AI-throughput multiplier compounds.

The Hacker News piece on the post-Mythos math makes the practitioner reframing concrete: “the scarce resource is no longer the ability to find vulnerabilities. It is the organizational capacity to assess, prioritize, validate, and remediate them faster than adversaries can weaponize them.” CSO Online’s coverage carries Holley’s blunt version of the capability-side framing — “Computers were completely incapable of doing this a few months ago, and now they excel at it” — alongside Ensar Seker’s note that defenders are “realizing the attack surface is larger, and more rapidly discoverable than previously assumed.” Mozilla’s pipeline absorbs the discovery throughput because Mozilla had the substrate to absorb it — 100+ contributors, established sanitizer infrastructure, mature bug-tracker integration, a security-engineering culture organised around fast turnaround. Most enterprise security organisations operating in 2026 do not have that substrate at that scale.

The infrastructure investment list for a security organisation planning a Mozilla-equivalent deployment, derived from the cross-referenced public reporting, breaks into four budget categories:

Category | What you need | What it costs
Discovery substrate | Sanitizer build of the target codebase (AddressSanitizer minimum, ThreadSanitizer and UBSan for broader coverage); existing fuzzing harness to integrate with; ephemeral-VM orchestration that can spin up and tear down at file-target granularity | Months of engineering work if not already in place; ongoing operational maintenance
Pipeline software | Agentic harness with iterate-until-success-signal loop; secondary-LLM grader with structured output; deduplication against existing bug tracker; severity classification | Public reference implementations are emerging (AISLE’s nano-analyzer is the closest thing to a portable open-source example); custom integration with your bug tracker is bespoke
Inference budget | LLM API spend across discovery (high-volume, high-iteration) and grading (lower-volume per finding, higher-rigour) | Five-figure-dollar territory for a tier-1-codebase campaign; per-commit mode is a different cost curve
Remediation capacity | Engineering headcount with security context to write patches and engineering headcount to review them; release-management discipline to ship the surge without quality regression | The hardest line item; not buyable with capex

Note on collective integrity: this matrix omits the trust model — the question of how much agency the agent gets within the sandbox, what credentials it has access to, and how its proposed patches are validated before reaching the upstream tree. That’s covered by the agent-containment scout series and is structurally separate from the discovery-pipeline question.

Practical Implications

What to Build

Sanitizer-driven success signals are the first investment. If your codebase doesn’t have an AddressSanitizer build that runs in CI, the iterate-until-it-crashes loop has nothing to gate on, and the false-positive rate of any agentic harness you bolt on top will inherit whatever variance the LLM provides. Mozilla’s “either you trigger Address Sanitizer or you don’t” test is portable to any C/C++/Rust codebase; for managed-runtime targets, the equivalent is a runtime fault detector (segfault handler, assertion-as-crash, Valgrind-style memory-tracker) that produces a binary signal. Without a deterministic gate, agentic discovery degrades to LLM-generated bug-report slop.

The harness, not the model, is the practitioner-facing artifact. Mozilla’s harness is not open-sourced — the Mozilla Hacks post describes the pattern but not the implementation. AISLE’s nano-analyzer is a portable reference for the file-parallelised, three-stage (context → scan → skeptical-triage) shape, tuned for memory-safety bugs. The Project Zero / Big Sleep documentation and Google’s Cloud CISO Big Sleep coverage describe a related but distinct pattern — code-comprehension agent with sandboxed Python script execution for fuzz-input generation, applied to SQLite, Chrome, and other widely-deployed open-source targets. OpenAI’s Aardvark launch (October 2025) puts a third vendor on the same architectural shape, with GPT-5 as the discovery model and per-commit CI integration as the deployment mode. The three production examples converge on the same primitives: deterministic verification, per-file parallelism, secondary grading, human-reviewed patch loop.

Budget headcount before model access. The first dollar of value from an AI-assisted vulnerability program comes from the remediation side. A security organisation that can write and review 50 patches per week with current headcount (roughly 200 a month) cannot absorb a discovery campaign that surfaces 271 bugs in a month without building a backlog or displacing other work. The practitioner pre-deployment audit: how long from existing-pipeline discovery to verified fix? How many high-severity findings sit in “being worked on” states for more than a week? Can you re-test post-remediation, or do you just close tickets? These questions, raised in The Hacker News piece, don’t require Mythos access to answer meaningfully. They establish whether your organisation has the substrate.
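The first two audit questions reduce to simple queries over ticket data most trackers can already export. The sketch below assumes a hypothetical `Ticket` record with an opened date and a verified-fix date; the field names and severity labels are placeholders to be mapped onto whatever the tracker actually stores.

```python
from datetime import date
from statistics import median
from typing import List, NamedTuple, Optional


class Ticket(NamedTuple):
    severity: str                   # hypothetical labels: "high", "moderate", "low"
    opened: date
    verified_fixed: Optional[date]  # None while the ticket is still open


def median_days_to_verified_fix(tickets: List[Ticket]) -> Optional[float]:
    """Median days from discovery to verified fix, over closed tickets only."""
    days = [(t.verified_fixed - t.opened).days for t in tickets if t.verified_fixed]
    return median(days) if days else None


def stale_high_severity(
    tickets: List[Ticket], today: date, max_age_days: int = 7
) -> List[Ticket]:
    """Open high-severity tickets that have been in progress longer than the limit."""
    return [
        t
        for t in tickets
        if t.verified_fixed is None
        and t.severity == "high"
        and (today - t.opened).days > max_age_days
    ]
```

If the tracker has no notion of a verified fix distinct from a closed ticket, that gap is itself the answer to the third audit question.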

What to Buy

Model access is the cheapest line item. Mythos is gated, but the AISI capability-symmetry finding establishes that frontier-equivalent capability is available from a second vendor. AISLE’s published benchmarks establish that nano-class and open-weights models can match frontier discovery on a meaningful subset of memory-safety bugs at 100×–600× lower per-token cost. The model is not the procurement bottleneck.

Existing managed-vulnerability-discovery services are increasingly relevant. GitHub’s Copilot Autofix ships an end-to-end CodeQL-detection-plus-LLM-remediation loop with operational scale numbers — 460,000+ alerts remediated in 2025, median time-to-fix dropping from 1.5 hours to 28 minutes per GitHub’s reporting. The architecture is structurally different from Mozilla’s — CodeQL static-analysis as the discovery primitive rather than agentic-with-sanitizer — and covers a different bug class (alert-shaped findings rather than memory-corruption-class exploitable vulnerabilities), but for teams running on GitHub-hosted code the operational baseline is already there. The right framing: Mythos-class deployments are the new ceiling for discovery throughput; CodeQL-plus-Autofix is the new floor.

The grader-model layer is buyable through standard vendor APIs. The secondary-grading step that filters discovery output before human review doesn’t require frontier capability — it requires a model trained to produce structured output and follow severity-classification rubrics. This is well within the operating range of cheaper models (Claude Haiku, GPT-4o-mini, Gemini Flash) and adds a small amount of cost per discovered candidate in exchange for a substantial improvement in human-triage signal-to-noise. Practitioner teams building their own pipeline should plan to use the cheapest competent model for grading and the most capable model only for discovery.

What to Avoid

Per-commit scanning without a campaign mode first. The pipeline shape Mozilla used — discovery campaign first, then plan for CI integration — is the right sequencing. A team that turns on per-commit scanning against a codebase with years of accumulated latent bugs will find that the gate fires on every PR because the agent is surfacing pre-existing flaws, not regressions introduced by the PR. The campaign mode flushes the residual bugs in a controlled burst; per-commit mode then catches regressions against a clean baseline. Reversing the order produces a CI gate that nobody can deploy past.
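The sequencing argument amounts to a set difference: the campaign produces a baseline of known finding signatures, and the per-commit gate fails only on signatures outside it. The function names and the idea of representing findings as signature strings are assumptions made for this sketch.

```python
from typing import Set


def regressions(pr_findings: Set[str], baseline: Set[str]) -> Set[str]:
    """Findings surfaced on this PR that the campaign baseline does not contain."""
    return pr_findings - baseline


def may_merge(pr_findings: Set[str], baseline: Set[str]) -> bool:
    """Gate passes when the PR introduces nothing beyond the known baseline."""
    return not regressions(pr_findings, baseline)


# With an empty baseline, every pre-existing flaw blocks the PR.
legacy = {"uaf:xslt-key", "oob:legend-layout"}
print(may_merge(legacy, baseline=set()))   # False
print(may_merge(legacy, baseline=legacy))  # True
```

Running per-commit mode first is equivalent to running this gate with an empty baseline, which is why it fires on every PR.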

Treating the discovery pipeline as a security-product procurement. Mythos-class deployments are not turnkey products. The ArmorCode playbook and related vendor materials are written in a procurement frame — buy the unified-exposure-management platform, integrate with your existing 350+ security tools, deploy “risk-based prioritization.” That framing is honest about the symptom (vulnerability data needs management at scale) but misleading about the cause (the harness work, the sanitizer substrate, and the remediation-pipeline staffing are not buyable). The procurement decision exists; it is downstream of the build-and-staff decision.

Automating the patch-writing loop ahead of the evidence. Grinstead’s framing — “every single one is one engineer writing a patch and one engineer reviewing it. We have not found it to be automatable” — is the load-bearing operational fact from the Mozilla deployment. Other vendors will ship marketing claims about end-to-end discovery-to-patch automation; the published evidence from the largest production deployment to date is that the patch-writing loop is the human-in-the-loop floor, not the next-quarter automation target. Teams that staff toward the automation premise rather than the human-in-the-loop reality will under-resource their remediation side.

A Note on the Anthropic Gating Posture

Mythos’s gating posture — Project Glasswing’s allowlist, $100M in committed model-usage credits, explicit non-availability for general use — is a vendor choice that’s load-bearing on this deployment story. Mozilla is a Glasswing partner; without that, the Firefox campaign as documented could not have happened. For the broader practitioner population not currently on the partner list, the operational architecture remains reproducible against a second vendor’s frontier model, but the specific Mythos-named numbers don’t generalise. The capability frontier is industry-shared per the AISI evaluation; the vendor relationship that includes access plus engineering collaboration is not. The procurement decision for security teams is whether to lobby for Glasswing-equivalent access, build against a generally-available frontier model, or invest in the throughput-via-cheaper-models approach AISLE demonstrates.

Open Questions

  • The compute-cost curve at production scale. Mozilla’s published reporting does not disclose the inference spend on the Firefox 150 discovery campaign, nor the VM-compute spend across the parallelised scan. AISLE’s adjacent kernel-scan numbers (under $20,000 for OpenBSD, under $2,000 for a Linux kernel exploit) suggest the discovery side is comfortably in five-figure-dollar territory, but reporting on Mozilla’s specific compute spend, the per-finding cost curve, and how the campaign-mode budget translates to per-commit-mode budget remains thin.

  • The grader-model implementation. Mozilla’s public reporting establishes that a secondary LLM grades findings before human review, but does not name the grading model, the prompt structure, the severity-classification rubric, or the rejection-rate distribution. For practitioners designing their own equivalent layer, the published architecture is suggestive rather than reproducible. Reporting on the grading-stage mechanics remains limited to high-level descriptions.

  • Per-commit operational metrics. The CI-integration phase Mozilla has flagged as the next step is the operational mode most practitioner teams will actually deploy against, but no production reporting on per-commit Mythos-class scanning at tier-1-codebase scale exists yet. The latency budget, the per-commit hit rate, the false-positive rate against deltas (versus against full files), and the cost-per-PR-scanned are all open empirical questions.

  • The threshold codebase size for replicability. The Mozilla pipeline is documented on a 30-million-line, multi-decade-old codebase with mature fuzzing infrastructure. The smallest codebase against which the same architectural pattern produces meaningful capability uplift over conventional static-analysis-plus-fuzz is not established in public reporting. AISLE’s nano-analyzer demonstrates the pattern works on tens-of-thousands-of-lines kernel modules; the curve between “kernel module” and “Firefox” is uninvestigated.

  • The harness as portable artifact. Mozilla has not published its harness code, and there is no public indication of whether it will. For the moment, practitioners replicating the architecture are building from the structural description plus AISLE’s reference implementation. Whether a community-maintained open-source harness emerges, and which licence and governance model it adopts, is the next open ecosystem question.

Sources

  1. Mozilla Hacks — Behind the Scenes Hardening Firefox with Claude Mythos Preview
  2. The Mozilla Blog — The zero-days are numbered
  3. Simon Willison — Behind the Scenes Hardening Firefox with Claude Mythos Preview
  4. Help Net Security — What Mozilla learned running an AI security bug hunting pipeline on Firefox
  5. TechCrunch — How Anthropic’s Mythos has rewritten Firefox’s approach to cybersecurity
  6. The Register — Mozilla says AI helped squash 423 Firefox security bugs
  7. SecurityWeek — Claude Mythos Finds 271 Firefox Vulnerabilities
  8. The Decoder — Mozilla’s agentic AI pipeline turns Claude Mythos Preview loose and finds 271 unknown Firefox vulnerabilities
  9. Implicator — Firefox Shows Mythos Needs Mozilla’s Harness
  10. Richard Stiennon (Substack) — More Mythos and Mozilla
  11. Digit.in — Claude Mythos found decade old Firefox bugs that years of fuzzing missed
  12. CSO Online — Claude Mythos signals a new era in AI-driven security
  13. AISLE — System Over Model: Zero-Day Discovery at the Jagged Frontier
  14. AISLE — AI Cybersecurity After Mythos: The Jagged Frontier
  15. The Hacker News — Mythos Changed the Math on Vulnerability Discovery. Most Teams Aren’t Ready for the Remediation Side
  16. Rapid7 Blog — Project Glasswing and the Next Challenge for Defenders
  17. ArmorCode — The Claude Mythos Security Playbook: Operationalizing AI-Scale Vulnerability Discovery
  18. Anthropic — Project Glasswing: Securing critical software for the AI era
  19. Google Project Zero — From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code
  20. Google Cloud Blog — Cloud CISO Perspectives: Our Big Sleep agent makes a big leap
  21. The Hacker News — OpenAI Unveils Aardvark: GPT-5 Agent That Finds and Fixes Code Flaws Automatically
  22. GitHub Blog — Found means fixed: Introducing code scanning autofix, powered by GitHub Copilot and CodeQL
  23. GitHub Blog — Secure code more than three times faster with Copilot Autofix
  24. GitHub Blog — Validating agentic behavior when “correct” isn’t deterministic
  25. AISI — Our evaluation of OpenAI’s GPT-5.5 cyber capabilities
  26. AISLE — nano-analyzer (GitHub)