Summary
The agent skills ecosystem has undergone a rapid, largely uncoordinated convergence toward a single architectural pattern: a directory containing a SKILL.md file with YAML frontmatter metadata and a markdown body of instructions, optionally accompanied by scripts, references, and assets. Anthropic published this as an open standard in December 2025; within four months, OpenAI Codex, Google Gemini CLI, GitHub Copilot, Cursor, JetBrains Junie, CrewAI, LangSmith Fleet, and over 30 other agent products adopted structurally identical or compatible formats. The AGENTS.md specification, a complementary standard stewarded by the Agentic AI Foundation under the Linux Foundation, addresses project-level agent configuration rather than reusable capability bundles, and coexists rather than competes. The performance data is compelling: Google measured a nearly 70-point accuracy improvement (28.2% → 96.6%) from a single skills file; LangChain reported a similar jump (29% → 95%) for their open-source ecosystem skills. This is not an incremental gain: skills are now the primary lever for agent capability, outperforming fine-tuning for domain adaptation.
Key Findings
1. The Format Has Converged: SKILL.md Is the De Facto Standard
The core format across all major implementations is remarkably consistent:
```markdown
---
name: skill-name
description: What this skill does and when to use it
---

# Skill Name

## Instructions

[Step-by-step guidance for the agent]

## Examples

[Concrete usage examples]
```
Required fields are name and description in YAML frontmatter. The markdown body contains procedural instructions. Optional subdirectories hold scripts (scripts/), reference materials (references/), and assets (assets/).
Anthropic’s specification defines the canonical field constraints: name is max 64 characters, lowercase alphanumeric with hyphens; description is max 1024 characters [1][2]. OpenAI adopted structurally identical naming conventions and metadata format in both Codex and the Responses API [3][4]. GitHub Copilot reads skills from .github/skills/ directories using the same SKILL.md frontmatter format [5]. CrewAI, Cursor, JetBrains Junie, and Vercel’s AI SDK all support the standard [6][7][8].
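A minimal validator for these field constraints might look like the following sketch (the function name and error strings are illustrative; the limits are the ones from the specification):

```python
import re

# Lowercase alphanumeric segments separated by hyphens, per the spec.
NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def validate_skill_frontmatter(meta: dict) -> list[str]:
    """Check the two required SKILL.md fields against the published limits."""
    errors = []
    name = meta.get("name")
    if not name:
        errors.append("missing required field: name")
    elif len(name) > 64 or not NAME_RE.match(name):
        errors.append("name must be <=64 chars, lowercase alphanumeric with hyphens")
    description = meta.get("description")
    if not description:
        errors.append("missing required field: description")
    elif len(description) > 1024:
        errors.append("description must be <=1024 chars")
    return errors
```

A conforming skill such as `{"name": "pdf-tools", "description": "Extract text from PDFs"}` passes with no errors; a name with uppercase letters or spaces is rejected.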
The convergence was not coordinated by committee. Anthropic published the open standard first; competitors adopted it because the format solved a real problem and was simple enough not to need proprietary extensions. OpenAI’s adoption was particularly telling: their Codex CLI uses identical file naming conventions, metadata format, and directory organization [9].
2. Progressive Disclosure Is the Key Architectural Innovation
What distinguishes the SKILL.md pattern from earlier approaches (system prompts, AGENTS.md, custom instructions) is progressive disclosure — a three-level loading architecture that decouples skill discovery from skill execution:
| Level | When Loaded | Token Cost | Content |
|---|---|---|---|
| Level 1: Metadata | Always (at startup) | ~100 tokens per skill | name and description from YAML frontmatter |
| Level 2: Instructions | When skill is triggered | Under 5,000 tokens | SKILL.md body with instructions and guidance |
| Level 3+: Resources | As needed during execution | Effectively unlimited | Bundled files, scripts, references |
This means an agent can have dozens of installed skills with negligible context cost. The agent loads skill names and descriptions at startup, reads the full SKILL.md only when it determines relevance, and accesses bundled resources only when specific instructions reference them [1][2].
The practical implication is that the amount of context bundled into a skill is effectively unbounded. Scripts execute via bash and return only their output — the code itself never enters the context window. Reference documents are read on demand. This is why skills outperform monolithic system prompts: they match the right context to the right task at the right time.
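The three levels can be sketched as a loader that indexes only frontmatter at startup and defers everything else. This is an illustrative sketch, not any vendor’s implementation: the class name is invented and the frontmatter parsing is deliberately naive (simple `key: value` lines only).

```python
from pathlib import Path

class SkillIndex:
    """Level 1: hold only name/description for every installed skill."""

    def __init__(self, skills_root: str):
        self.metadata = {}  # name -> description (~100 tokens per skill)
        self.paths = {}     # name -> skill directory
        for skill_md in Path(skills_root).glob("*/SKILL.md"):
            meta = self._read_frontmatter(skill_md)
            self.metadata[meta["name"]] = meta["description"]
            self.paths[meta["name"]] = skill_md.parent

    @staticmethod
    def _read_frontmatter(path: Path) -> dict:
        # Naive parse: "key: value" lines between the two --- fences.
        block = path.read_text().split("---")[1]
        pairs = (line.split(":", 1) for line in block.strip().splitlines() if ":" in line)
        return {k.strip(): v.strip() for k, v in pairs}

    def load_instructions(self, name: str) -> str:
        """Level 2: read the full SKILL.md body only when the skill triggers."""
        text = (self.paths[name] / "SKILL.md").read_text()
        return text.split("---", 2)[2]  # body after the closing fence

    def resource(self, name: str, relative: str) -> Path:
        """Level 3: resolve a bundled file only when instructions reference it."""
        return self.paths[name] / relative
```

The key property is that constructing the index touches only frontmatter; the instruction body and bundled resources stay on disk until a specific skill is triggered.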
Google’s Gemini implementation adds a functional twist: skills provide an activate_skill tool and a fetch_url tool for retrieving live documentation, enabling skills to pull current information rather than relying solely on bundled content [10].
3. AGENTS.md and SKILL.md Are Complementary, Not Competing
A common source of confusion is the relationship between AGENTS.md and SKILL.md. They serve different purposes:
| Dimension | AGENTS.md | SKILL.md |
|---|---|---|
| Scope | Project-level configuration | Reusable capability bundle |
| Purpose | "How to work in this codebase" | "How to perform this task" |
| Discovery | Nearest file in directory tree | Skill directories in known paths |
| Portability | Project-specific | Cross-project, cross-platform |
| Governance | Linux Foundation (Agentic AI Foundation) | Anthropic (open standard) |
| Adoption | 60,000+ GitHub repositories [11] | 30+ agent products [9] |
AGENTS.md tells an agent about a specific project’s conventions, coding standards, and constraints — analogous to a README for agents. SKILL.md teaches an agent how to perform a generalizable task — analogous to a training manual. A well-configured agent stack uses both: AGENTS.md for project context, skills for domain capabilities.
The AGENTS.md specification uses standard markdown with no required frontmatter. The filename must be uppercase. Files are discovered by walking the directory tree upward, with the nearest file taking precedence — enabling per-directory overrides [11][12].
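The nearest-file rule amounts to a simple upward directory walk. A minimal sketch (the function name is illustrative, not from the specification):

```python
from pathlib import Path
from typing import Optional

def find_agents_md(start: str) -> Optional[Path]:
    """Return the nearest AGENTS.md at or above `start`, or None.

    Nearest-file-wins is what enables per-directory overrides: a
    subproject's AGENTS.md shadows the repository root's.
    """
    current = Path(start).resolve()
    for directory in [current, *current.parents]:
        candidate = directory / "AGENTS.md"  # filename must be uppercase
        if candidate.is_file():
            return candidate
    return None
```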
4. The Performance Data Is Unambiguous
Measurements from two independent vendors confirm that skills dramatically improve agent capability without model changes:
| Vendor | Baseline | With Skill | Improvement | Model |
|---|---|---|---|---|
| Google DeepMind [10] | 28.2% | 96.6% | +68.4 pts | Gemini 3.1 Pro |
| LangChain [13] | 29% | 95% | +66 pts | Claude (via Claude Code) |
| Google DeepMind [10] | 6.8% | 87% | +80.2 pts | Gemini 3.0 Flash |
Google’s evaluation used 117 prompts across agentic coding, chatbots, document processing, and streaming tasks. LangChain tested coding agent performance on LangChain/LangGraph tasks specifically. Both demonstrate that the gap between “general-purpose model” and “domain-specialist agent” is primarily a context gap, not a capability gap.
Google additionally reported 63% fewer tokens per correct answer when combining skills with MCP documentation access — skills not only improve accuracy, they improve efficiency by giving the agent a direct path rather than exploratory search [10].
5. The Ecosystem Is Stratifying Into Three Tiers
The skills ecosystem is developing a clear hierarchy:
Tier 1: Platform-native skills. Anthropic ships pre-built skills for document processing (PowerPoint, Excel, Word, PDF) [2]. OpenAI bundles skills into the Responses API’s hosted container workspace [3]. These are vendor-maintained and optimized for their specific runtime environments.
Tier 2: Verified partner skills. Anthropic’s skills repository (62,000+ GitHub stars within four months) includes partner-built skills from Atlassian, Figma, Canva, Stripe, and Notion [9]. These represent official vendor integrations maintained by the tool providers themselves.
Tier 3: Community skills. Thousands of community-contributed skills are compatible with the universal format. VoltAgent’s awesome-agent-skills repository catalogs skills across multiple platforms [14]. The Serenities AI guide documents compatibility across 16+ tools [6]. Quality and maintenance vary widely.
LangSmith Fleet occupies a unique position — it provides a managed skills platform where skills are created, versioned, and shared within a workspace, with automatic sync and upcoming version pinning [15]. Skills can be downloaded and used in Claude Code, Cursor, or Codex via CLI, providing a bridge between managed and filesystem-based approaches.
6. Security Is the Unresolved Risk
Anthropic’s documentation includes explicit warnings: “Skills provide Claude with new capabilities through instructions and code, and while this makes them powerful, it also means a malicious Skill can direct Claude to invoke tools or execute code in ways that don’t match the Skill’s stated purpose” [2].
The threat model is real. Skills can:
- Execute arbitrary code via bundled scripts
- Make network requests (in Claude Code, with full local network access)
- Reference external URLs whose content can change post-audit
- Invoke tools in harmful ways while appearing benign in their SKILL.md
There is currently no code signing, no integrity verification, and no sandboxing at the skill level. The mitigation is entirely trust-based: only use skills from trusted sources. This is reminiscent of early package manager ecosystems before supply chain security became a priority — and given the LiteLLM attack from this same week (Edition 5), the parallel is uncomfortable.
Runtime environment constraints vary by surface: Claude API skills have no network access and cannot install packages; Claude Code skills have full network access and local package installation [2]. This inconsistency means a skill that’s safe in one environment may be dangerous in another.
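Until signing and sandboxing exist, the best available defense is manual review, but a crude triage pass can at least surface the risk surfaces listed above. This is a sketch, not a real security tool: the patterns are illustrative and a malicious skill can trivially evade them.

```python
import re
from pathlib import Path

# Illustrative red-flag patterns; neither exhaustive nor evasion-resistant.
SUSPECT_PATTERNS = {
    "network call": re.compile(r"curl|wget|requests\.|urllib|fetch\(|https?://"),
    "shell execution": re.compile(r"subprocess|os\.system|eval\(|exec\("),
}

def triage_skill(skill_dir: str) -> dict[str, list[str]]:
    """Flag files in a skill bundle that touch the network or spawn processes."""
    findings: dict[str, list[str]] = {}
    for path in Path(skill_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        hits = [label for label, pat in SUSPECT_PATTERNS.items() if pat.search(text)]
        if hits:
            findings[str(path.relative_to(skill_dir))] = hits
    return findings
```

A non-empty report is a prompt for human review, not proof of malice; legitimate skills also make network calls and run scripts.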
Practical Implications
1. Invest in skills now — the format is stable. The SKILL.md format with YAML frontmatter is the clear winner. Build skills in this format and they’ll work across Claude Code, OpenAI Codex, GitHub Copilot, Gemini CLI, Cursor, and more. The portability story is real, not aspirational.
2. Use AGENTS.md and SKILL.md together. AGENTS.md defines your project’s rules of engagement. Skills define reusable capabilities. Don’t choose between them — layer them. Put project conventions in AGENTS.md; put generalizable workflows in skills.
3. Measure your context gap. Google’s 28.2% → 96.6% result is a wake-up call. If your agents are underperforming, the most likely fix isn’t a better model — it’s better context. Write a skill for your most common agent tasks and measure the before/after accuracy. The ROI will likely exceed any model upgrade.
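A before/after measurement needs nothing elaborate: run the same graded task set with and without the skill installed and compare pass rates. A sketch, where `run_agent` is a placeholder for however you actually invoke your agent:

```python
from typing import Callable, Iterable

def measure_skill_lift(
    tasks: Iterable[tuple[str, Callable[[str], bool]]],
    run_agent: Callable[[str, bool], str],
) -> tuple[float, float]:
    """Return (baseline_accuracy, with_skill_accuracy) over the same tasks.

    Each task is (prompt, grader); run_agent(prompt, use_skill) stands in
    for your agent invocation with skills disabled or enabled.
    """
    tasks = list(tasks)

    def accuracy(use_skill: bool) -> float:
        passed = sum(grader(run_agent(prompt, use_skill)) for prompt, grader in tasks)
        return passed / len(tasks)

    return accuracy(False), accuracy(True)
```

Holding the task set and graders fixed across both runs is what makes the delta attributable to the skill rather than to prompt drift.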
4. Treat skill security like dependency security. Audit third-party skills before installation. Review bundled scripts, check for network calls, and be wary of skills that fetch external content. The ecosystem has no integrity verification infrastructure yet — treat every community skill as untrusted code.
5. Design skills for progressive disclosure. Put the minimum viable instruction in SKILL.md. Put detailed references in separate files. Put deterministic operations in scripts. This isn’t just good architecture — it directly reduces token costs and improves accuracy by keeping context focused.
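In practice that means a layout along these lines (the file names are illustrative):

```
my-skill/
├── SKILL.md          # minimal instructions, under ~5,000 tokens
├── scripts/
│   └── convert.py    # deterministic steps; only output enters context
├── references/
│   └── api-notes.md  # detailed documentation, read on demand
└── assets/
    └── template.docx # files the task needs
```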
6. Watch the versioning problem. LangSmith Fleet is adding version pinning; the filesystem-based approach has no versioning at all. If you’re sharing skills across a team, you need a distribution strategy — git submodules, a skills registry, or a managed platform like Fleet.
Open Questions
- Will skill signing emerge? The current trust model won’t scale to a public ecosystem. Will Anthropic or another vendor introduce cryptographic verification for skills, similar to package signing?
- How will skill conflicts be resolved? When multiple skills match a request, which takes precedence? The specification is silent on conflict resolution and priority ordering across skills from different sources.
- Will AGENTS.md and SKILL.md formally merge? Both use markdown. Both configure agent behavior. The Agentic AI Foundation and Anthropic’s open standard exist in parallel — will governance consolidate, or will the two standards continue to coexist at different layers?
- What about non-coding agent skills? The current ecosystem is overwhelmingly focused on coding agents. As agents expand into customer support, data analysis, and enterprise workflows, will the SKILL.md format prove sufficient, or will domain-specific extensions emerge?
- How will skill quality be measured? Beyond star counts, there’s no standard way to evaluate whether a skill actually improves agent performance. Will evaluation benchmarks for skills emerge as a category?
Sources
1. Anthropic Engineering: Equipping Agents for the Real World with Agent Skills
2. Anthropic: Agent Skills Overview — Claude API Docs
3. OpenAI: Using Skills to Accelerate OSS Maintenance — Developers Blog
4. OpenAI: Agent Skills — Codex Docs
5. Microsoft: Use Agent Skills in VS Code — Copilot Docs
6. Serenities AI: Agent Skills Guide 2026 — Build Skills for 16+ AI Tools
7. CrewAI: Skills Concept Documentation
8. Vercel AI SDK: Guides — Add Skills to Your Agent
9. VentureBeat: Anthropic Launches Enterprise Agent Skills and Opens the Standard
10. Google Developers Blog: Closing the Knowledge Gap with Agent Skills
11. AGENTS.md Official Site
12. GitHub: agentsmd/agents.md Repository
13. LangChain Blog: Skills in LangSmith Fleet
14. GitHub: VoltAgent/awesome-agent-skills
15. LangChain Blog: Introducing LangSmith Fleet
16. OpenAI: Skills in the API — Cookbook
17. InfoQ: OpenAI Extends Responses API for Agents
18. GitHub: anthropics/skills Repository
19. GitHub: google-gemini/gemini-skills Repository
20. Medium: The SKILL.md Pattern — How to Write AI Agent Skills That Actually Work
21. Medium: Beyond Prompt Engineering — Using Agent Skills in Gemini CLI
22. arXiv: Agent Skills for Large Language Models — Architecture, Acquisition, Security