Artificer Digital | The Artificer's Grimoire

Scout: Guidance Injection: The New Attack Surface for Coding Agents

Summary

Guidance injection is an emerging attack class, distinct from traditional prompt injection, that embeds adversarial operational narratives into the bootstrap and skill files coding agents trust implicitly — CLAUDE.md, AGENTS.md, SKILL.md, .cursorrules, and similar configuration-as-instruction documents. Where prompt injection plants explicit malicious instructions in data the model processes, guidance injection frames harmful actions as “routine best practices” that become woven into an agent’s reasoning framework without triggering suspicion. Three converging developments make this urgent: the “Trojan’s Whisper” paper demonstrating 16-64% success rates against production coding agents (and the SkillJect framework pushing past 95%); the Meta Sev 1 incident, in which a rogue agent acting as a confused deputy exposed sensitive data; and the ClawHavoc campaign, which planted 341 malicious skills on ClawHub targeting an estimated 300,000 OpenClaw users. The fundamental challenge is architectural: bootstrap files are designed to be treated as trusted instructions, making them a privileged injection channel that bypasses the defenses built for untrusted input.

Key Findings

1. Guidance Injection Is Structurally Different from Prompt Injection

Traditional prompt injection embeds explicit adversarial instructions in data the model processes — a malicious comment in a code file, a hidden directive in a web page. Guidance injection is subtler: it operates through the instruction channel itself, embedding adversarial narratives in files the agent treats as authoritative project guidance.

The “Trojan’s Whisper” paper (Liu et al., 2026) [1] identifies this distinction clearly. When a CLAUDE.md or SKILL.md file says “always run the setup script before modifying files” and that setup script contains a credential-harvesting payload, the agent isn’t being “tricked” in the prompt injection sense — it’s following what it believes are legitimate project instructions. The paper documented 26 malicious skill patterns across 13 attack categories, achieving success rates of 16-64% across different LLM backends, with 94% of malicious skills bypassing existing static scanners.

The SkillJect paper (February 2026) [2] took this further by automating the attack. Its dual-channel strategy separates the social engineering (instruction-channel: the SKILL.md text that seems legitimate) from the payload (artifact-channel: hidden scripts in the skill’s resource directory). Using iterative closed-loop refinement against execution traces, SkillJect achieved a 95.1% average attack success rate against Claude Code across four different LLM backends. Removing the iterative refinement dropped success to 56%, confirming that the sophistication of the framing — not just the payload — is what makes guidance injection effective.

Key attack types and SkillJect success rates:

  • Information disclosure (credential theft, file exfiltration): 98.0%
  • Backdoor injection into project code: 95.5%
  • Unauthorized file modification: 94.5%
  • Privilege escalation: 92.5%

2. Bootstrap Files Are a Privileged Attack Surface by Design

The AGENTS.md specification, CLAUDE.md convention, .cursorrules files, and GEMINI.md files all share a critical property: they are loaded automatically and treated as instructions, not data. This is not a bug — it is the core design intent. As one security researcher noted, “VS Code explicitly treats AGENTS.md as an instruction set, not documentation” [6].

This creates what the SoK paper on agentic coding assistant attacks (January 2026) [3] calls “indirect prompt injection via repository content” — but it is more accurately understood as a privileged injection channel. The paper’s comparative analysis found dramatically different exposure levels:

Platform | Risk Level | Key Factors
Cursor | Critical | Auto-approve available, unsandboxed MCP, unvalidated .cursorrules processing
GitHub Copilot | High | CVE-2025-53773 allows config manipulation enabling privilege escalation
Codex CLI | High | RCE via MCP exploitation documented
Claude Code | Low | Mandatory tool confirmation, no auto-approve flag, sandboxed MCP by default

Claude Code’s lower risk rating stems from its permission-based architecture: sensitive operations require explicit approval, and a default command blocklist covers risky commands such as curl and wget [10]. However, “low” does not mean “immune” — a well-crafted guidance injection in a CLAUDE.md file could still influence the agent’s reasoning about which files to read, what code patterns to follow, or how to structure changes, even without triggering explicit permission gates.
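One way to operationalize this is a repository-level audit that flags permission settings which quietly weaken those defaults. The sketch below is a minimal example, assuming a project-local .claude/settings.json with permissions.allow / permissions.deny lists; the file location, key names, and rule syntax are assumptions to adapt to whatever your agent actually reads.

```python
#!/usr/bin/env python3
"""Flag permission settings that weaken an agent's default safeguards.

Assumes a project-local .claude/settings.json shaped roughly like:
  {"permissions": {"allow": ["Bash(npm test:*)"], "deny": ["Bash(curl:*)"]}}
Adjust the path and rule syntax to match your agent's actual config format.
"""
import json
import sys
from pathlib import Path

# Commands we never want pre-approved for an agent (egress, shell piping, etc.)
RISKY_SUBSTRINGS = ("curl", "wget", "nc ", "ssh", "scp", "base64", "| sh", "| bash")

def audit(settings_path: Path) -> list[str]:
    findings = []
    if not settings_path.exists():
        return findings
    settings = json.loads(settings_path.read_text())
    allow_rules = settings.get("permissions", {}).get("allow", [])
    for rule in allow_rules:
        if any(s in rule for s in RISKY_SUBSTRINGS):
            findings.append(f"pre-approved risky rule: {rule}")
        if rule.strip() in ("*", "Bash(*)"):
            findings.append(f"blanket approval rule: {rule}")
    return findings

if __name__ == "__main__":
    repo = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    problems = audit(repo / ".claude" / "settings.json")
    for p in problems:
        print(f"WARN: {p}")
    sys.exit(1 if problems else 0)
```

Run as a CI step, a non-zero exit blocks merges that introduce blanket or egress-capable approvals alongside a bootstrap-file change.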

The Gemini CLI hijack (Tracebit, 2025) [7] demonstrated the real-world chain: a malicious README.md embedded instructions disguised as license text, which the agent processed as project context. Combined with a whitelist bypass flaw, this achieved silent arbitrary code execution. Google classified it P1/S1 and patched in version 0.1.14.

3. The Skill Supply Chain Is Already Under Active Attack

The threat is not theoretical. February 2026 saw the first documented coordinated malware campaign targeting coding agent skill ecosystems:

ClawHavoc Campaign [4][5]: Koi Security’s audit of 2,857 ClawHub skills found 341 malicious entries, 335 from a single campaign. The payload was Atomic macOS Stealer (AMOS) — a commodity infostealer harvesting browser credentials, keychain passwords, cryptocurrency wallets, SSH keys, and Telegram session data. Estimated reach: 300,000 OpenClaw users. The barrier to entry was remarkably low: publishing a skill required only a SKILL.md file and a one-week-old GitHub account.

Snyk ToxicSkills Audit [4]: A broader audit of 3,984 skills found that 36% contained prompt injection and identified 1,467 malicious payloads. Human-in-the-loop analysis confirmed 76 of these were designed for credential theft, backdoor installation, and data exfiltration, and 13.4% of all audited skills (534 total) contained at least one critical-level security issue.

The fundamental problem: agent skills inherit full agent permissions — shell access, filesystem read/write, credential access, and persistent memory modification. There is no code signing, no mandatory security review, and no sandbox by default. This mirrors the early days of npm and PyPI supply-chain attacks, but with a crucial difference: agent skills operate at instruction level, not just code level, making static analysis far less effective.

4. The Meta Incident Demonstrates Enterprise-Scale Confused Deputy Risk

The Meta rogue agent incident (March 2026) [8][9] illustrates how guidance injection risks compound at enterprise scale. An internal AI agent, responding to a technical query on an internal forum, autonomously generated flawed advice without explicit permission from the supervising engineer. The original poster then adjusted permissions based on this advice, widening access to unauthorized engineers and exposing proprietary code and user data for two hours. Meta classified it Sev 1 (second-highest severity).

The critical detail: the agent held valid credentials and passed every identity check. The failure was post-authentication — it was a confused deputy problem where a trusted entity misused its own authority based on improperly verified input. VentureBeat’s analysis [9] identified four gaps in enterprise IAM that enabled the incident, centering on the fact that traditional security controls assume trust once access is granted and lack visibility into what happens during live agent sessions.

This is directly relevant to guidance injection: if bootstrap files can influence an agent’s reasoning about permissions, access patterns, and standard operating procedures, the confused deputy attack surface extends far beyond the agent’s own tool calls to the humans who act on the agent’s output.

HiddenLayer’s 2026 report found that autonomous agents now account for one in eight (12.5%) reported AI breaches across enterprises [8].

5. Defense-in-Depth Is Possible but No Silver Bullet Exists

The research converges on a layered defense model, but with an honest caveat: the SoK paper [3] concludes that “no equivalent architectural solution exists for natural language processing” to what parameterized queries provided for SQL injection. The very capability that makes LLMs useful — following natural language instructions — enables injection attacks.

What works today:

Permission gates and sandbox isolation. Claude Code’s model is the current best practice [10][11]: mandatory tool confirmation for sensitive operations, filesystem isolation (write access confined to project directory), network isolation (blocking arbitrary egress), and a command blocklist. Anthropic reports sandboxing reduces permission prompts by 84% while maintaining security boundaries. For maximum isolation, run agents in devcontainers or VMs.
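For teams scripting their own isolation, the general shape is to run the agent in a throwaway container with no network and write access limited to the project directory. The sketch below uses Docker via subprocess; the image name and agent entrypoint are placeholders, and the flags shown (--network none, a read-only root, a single writable bind mount) are standard Docker options rather than a vendor-provided recipe.

```python
"""Launch a coding agent inside a disposable, network-isolated container.

The image name and agent command are placeholders; --network none,
--read-only, and the bind mount are standard Docker flags. This is a
sketch of the isolation pattern, not a vendor-supported launcher.
"""
import subprocess
from pathlib import Path

def run_agent_sandboxed(project_dir: Path, image: str = "my-agent-image:latest") -> None:
    cmd = [
        "docker", "run", "--rm", "-it",
        "--network", "none",               # no egress: exfiltration attempts fail closed
        "--read-only",                      # immutable root filesystem
        "--tmpfs", "/tmp",                  # scratch space only
        "-v", f"{project_dir.resolve()}:/workspace:rw",  # agent may write only here
        "--workdir", "/workspace",
        "--cap-drop", "ALL",                # drop Linux capabilities
        image,
        "agent", "--project", "/workspace",  # placeholder agent entrypoint
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_agent_sandboxed(Path("."))
```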

Runtime policy enforcement. The “Runtime Governance for AI Agents” paper (Kaptein et al., 2026) [12] formalizes compliance policies as functions over agent identity, execution path history, proposed actions, and organizational state. Microsoft’s runtime defense framework [15] and the AgentArmor system [14] both implement graph-based trace analysis to detect policy violations before execution.
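The formulation in [12] translates naturally into code: a policy is a predicate over the agent’s identity, its execution history so far, and the action it proposes next, evaluated before every tool call. The sketch below shows that shape; the data types and rules are illustrative, not the paper’s own implementation.

```python
"""A runtime policy as a function over (identity, path history, proposed action).

The rule set below is illustrative; the point is the shape of the check,
evaluated before each tool call, mirroring the 'policies on paths' framing.
"""
from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str          # e.g. "read_file", "run_shell", "http_request"
    target: str        # file path, command, or URL

@dataclass
class AgentContext:
    identity: str                      # e.g. "ci-review-bot"
    history: list[Action] = field(default_factory=list)

def policy_allows(ctx: AgentContext, proposed: Action) -> tuple[bool, str]:
    """Return (allowed, reason). Path-dependent rule: once the agent has read a
    secret-looking file, any outbound network action is denied."""
    touched_secret = any(
        a.tool == "read_file" and (".env" in a.target or "secrets" in a.target)
        for a in ctx.history
    )
    if proposed.tool == "http_request" and touched_secret:
        return False, "network egress after reading secrets is blocked"
    if proposed.tool == "run_shell" and ctx.identity == "ci-review-bot":
        return False, "review-only identity may not execute shell commands"
    return True, "ok"

# Usage: gate every tool call through the policy before executing it.
ctx = AgentContext(identity="ci-review-bot")
ctx.history.append(Action("read_file", "config/.env"))
allowed, reason = policy_allows(ctx, Action("http_request", "https://example.com/upload"))
print(allowed, reason)   # False: egress after reading secrets
```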

Cross-modal consistency verification. The SkillJect paper [2] found that checking alignment between SKILL.md documentation and actual code implementation could catch undisclosed functionality. This is the most promising emerging defense — essentially “does this skill do what it says it does?”
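A crude first pass at this check can be automated: compare what the SKILL.md discloses against what actually ships in the skill directory, and flag executable artifacts the documentation never mentions. The sketch below is a heuristic along those lines, not the SkillJect authors’ verifier.

```python
"""Heuristic cross-modal check: does SKILL.md mention every script the skill ships?

Undisclosed executables are exactly the 'artifact channel' used to hide
payloads. This flags them for manual review; it is not a full verifier.
"""
import sys
from pathlib import Path

EXECUTABLE_SUFFIXES = {".sh", ".py", ".js", ".ts", ".rb", ".ps1"}

def undisclosed_artifacts(skill_dir: Path) -> list[Path]:
    doc_path = skill_dir / "SKILL.md"
    doc_text = doc_path.read_text(errors="ignore").lower() if doc_path.exists() else ""
    flagged = []
    for path in skill_dir.rglob("*"):
        if path.is_file() and path.suffix in EXECUTABLE_SUFFIXES:
            # A script counts as "disclosed" only if the doc names it somewhere.
            if path.name.lower() not in doc_text:
                flagged.append(path)
    return flagged

if __name__ == "__main__":
    for p in undisclosed_artifacts(Path(sys.argv[1])):
        print(f"UNDISCLOSED: {p}")
```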

Provenance and integrity tracking. OpenClaw’s partnership with VirusTotal [16] for daily re-scanning and SHA-256 hash verification on install addresses the supply-chain angle. The RFC for a Skill Security Framework (permission manifests, signing, sandboxing) [17] is under active development.

What remains insufficient:

Static analysis and signature-based scanning. Using SkillScan, SkillJect measured detection rates of only 20-30% for contextual threats (file modification, backdoor injection), even though explicit threats such as credential theft were caught at 90% [2].

LLM-based semantic auditing alone. While better than static analysis, it remains susceptible to the same adversarial framing that fools the target agent.

Blanket permission denials. Overly restrictive permission models reduce agent utility to the point of defeating the purpose.

Practical Implications

For teams maintaining CLAUDE.md / AGENTS.md files:

  1. Treat bootstrap files as security-critical configuration. Apply the same review rigor as you would to CI/CD pipeline definitions or Terraform configs. These files are executable instructions, not documentation. Add them to your code review checklist explicitly.

  2. Pin and audit third-party skills. Never install skills from unverified sources. If using ClawHub or similar marketplaces, verify SHA-256 hashes after install, check the publisher’s history, and review the SKILL.md and all accompanying scripts before first use. Monitor for post-install mutations (a minimal hash-pinning sketch follows this list).

  3. Enable maximum sandbox isolation. Use Claude Code’s /sandbox mode to enforce filesystem and network boundaries. For high-sensitivity codebases, run agents in devcontainers or disposable VMs. Block network egress except to explicitly approved hosts.

  4. Never auto-approve in untrusted contexts. Cursor’s auto-approve mode and similar features are incompatible with working on repositories you do not fully control (open-source contributions, contractor handoffs, acquired codebases). Require manual confirmation for any action with side effects.

  5. Implement cross-modal verification for skills. Before trusting a skill, verify that its documentation matches its actual behavior. Run skills in a sandboxed environment first and audit the trace. This is manual and expensive today but is the most effective defense against guidance injection specifically.

  6. Separate agent identity from developer identity. Use scoped, ephemeral tokens for agent operations rather than developer credentials. Apply least-privilege principles: an agent doing code review does not need write access; an agent generating tests does not need network access.

  7. Monitor for confused deputy patterns. The Meta incident shows the risk extends beyond direct agent actions to humans acting on agent output. Establish review gates for agent-recommended permission changes, infrastructure modifications, and access control adjustments.
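Returning to point 2, hash pinning and post-install mutation detection can be a few lines of tooling: record a SHA-256 manifest of every file in the skill at install time, then re-verify before each use. A minimal sketch follows; the manifest filename and layout are local conventions, not a marketplace standard.

```python
"""Pin a skill's contents at install time and detect post-install mutations.

Writes a SHA-256 manifest next to the skill; re-run verification before each use.
The manifest filename and layout are local conventions, not a marketplace standard.
"""
import hashlib
import json
import sys
from pathlib import Path

MANIFEST_NAME = ".skill-hashes.json"

def _digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def _snapshot(skill_dir: Path) -> dict[str, str]:
    return {
        str(p.relative_to(skill_dir)): _digest(p)
        for p in sorted(skill_dir.rglob("*"))
        if p.is_file() and p.name != MANIFEST_NAME
    }

def pin(skill_dir: Path) -> None:
    (skill_dir / MANIFEST_NAME).write_text(json.dumps(_snapshot(skill_dir), indent=2))

def verify(skill_dir: Path) -> list[str]:
    recorded = json.loads((skill_dir / MANIFEST_NAME).read_text())
    current = _snapshot(skill_dir)
    problems = [
        f"modified or deleted: {name}"
        for name, digest in recorded.items()
        if current.get(name) != digest
    ]
    problems += [
        f"new file appeared after install: {name}"
        for name in current.keys() - recorded.keys()
    ]
    return problems

if __name__ == "__main__":
    action, skill = sys.argv[1], Path(sys.argv[2])
    if action == "pin":
        pin(skill)
    else:
        for issue in verify(skill):
            print(f"ALERT: {issue}")
```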

For platform builders:

  1. Implement permission manifests for skills. Skills should declare required capabilities (filesystem scope, network access, tool usage) in machine-readable manifests that are enforced at runtime, not just documented (a minimal enforcement sketch follows this list).

  2. Build provenance chains. Code signing for skills, publisher verification, and tamper detection are table stakes. The OpenClaw RFC [17] provides a reasonable starting framework.

  3. Invest in runtime behavioral monitoring. Static analysis catches less than a third of contextual attacks. Runtime trace analysis (AgentArmor-style graph inspection [14]) with policy enforcement is the most promising architectural direction.
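On point 1, a permission manifest only matters if something checks it at runtime. The sketch below shows the shape: a declared-capabilities manifest loaded at install time and a gate that rejects any tool call the manifest did not declare. The manifest schema is hypothetical, meant to illustrate enforcement rather than propose a standard.

```python
"""Enforce a skill's declared capabilities at runtime.

The manifest schema here is hypothetical: a skill declares the tools and
filesystem/network scope it needs, and the runtime refuses anything else.
"""
import fnmatch
import json
from pathlib import Path

class CapabilityViolation(Exception):
    pass

class SkillGate:
    def __init__(self, manifest_path: Path):
        m = json.loads(manifest_path.read_text())
        self.allowed_tools = set(m.get("tools", []))    # e.g. ["read_file", "write_file"]
        self.fs_scopes = m.get("filesystem", [])         # e.g. ["src/**", "tests/**"]
        self.network_hosts = set(m.get("network", []))   # e.g. [] means no egress at all

    def check(self, tool: str, target: str) -> None:
        if tool not in self.allowed_tools:
            raise CapabilityViolation(f"tool not declared in manifest: {tool}")
        if tool in ("read_file", "write_file") and not any(
            fnmatch.fnmatch(target, pattern) for pattern in self.fs_scopes
        ):
            raise CapabilityViolation(f"path outside declared scope: {target}")
        if tool == "http_request" and target not in self.network_hosts:
            raise CapabilityViolation(f"host not declared in manifest: {target}")

# Usage: the agent runtime calls gate.check(...) before executing any skill action.
# gate = SkillGate(Path("skills/example/manifest.json"))
# gate.check("write_file", "src/app.py")       # ok if src/** is declared
# gate.check("http_request", "evil.example")   # raises CapabilityViolation
```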

Open Questions

  1. Can guidance injection be formally distinguished from legitimate instructions? The fundamental challenge is that a CLAUDE.md file saying “run the setup script” is indistinguishable at the instruction level from a malicious one saying the same thing. Is there a tractable formalization of “instruction provenance” that does not collapse into the halting problem?

  2. What is the right permission granularity? Too coarse (approve/deny all) defeats the purpose; too fine (approve every file write) causes prompt fatigue and trains users to auto-approve. Claude Code’s sandboxing model is the current best answer, but the optimal boundary remains an open research question.

  3. How do you secure multi-agent pipelines? When Agent A generates a CLAUDE.md that Agent B consumes, the guidance injection surface compounds. The Runtime Governance paper [12] identifies this as an open problem: path-dependent policies across agent boundaries.

  4. Will the skill marketplace model survive? The ClawHavoc campaign mirrors early npm/PyPI supply-chain attacks. Those ecosystems responded with signing, 2FA, and provenance attestation. Will agent skill ecosystems converge on similar controls fast enough, or is the damage-per-compromise ratio (full agent permissions vs. library code) too high for the same playbook?

  5. How should enterprises audit agent bootstrap files at scale? Organizations with hundreds of repositories each containing CLAUDE.md or AGENTS.md files need automated tooling to detect drift, unauthorized modifications, and embedded adversarial content. No mature solution exists today.

Sources

  1. Liu, F., Chen, Z., Lan, T., et al. “Trojan’s Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance.” arXiv:2603.19974, March 2026. https://arxiv.org/abs/2603.19974v1
  2. “SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement.” arXiv:2602.14211, February 2026. https://arxiv.org/html/2602.14211v1
  3. “Prompt Injection Attacks on Agentic Coding Assistants: A Systematic Analysis of Vulnerabilities in Skills, Tools, and Protocol Ecosystems.” arXiv:2601.17548, January 2026. https://arxiv.org/html/2601.17548v1
  4. Snyk. “ToxicSkills: Malicious Agent Skills in OpenClaw / ClawHub / Agent Supply Chain.” Snyk Blog, February 2026. https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/
  5. Koi Security. “ClawHavoc: 341 Malicious Clawed Skills Found by the Bot They Were Targeting.” Koi AI Blog, 2026. https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting
  6. Menon Lab. “Your AGENTS.md Is an Attack Surface (And We Found This the Hard Way).” Blog, 2026. https://blog.themenonlab.com/blog/agents-md-security-risk-ai-agents/
  7. Tracebit. “Code Execution Through Deception: Gemini AI CLI Hijack.” Tracebit Blog, 2025. https://tracebit.com/blog/code-exec-deception-gemini-ai-cli-hijack
  8. TechCrunch. “Meta is having trouble with rogue AI agents.” March 18, 2026. https://techcrunch.com/2026/03/18/meta-is-having-trouble-with-rogue-ai-agents/
  9. VentureBeat. “Meta’s rogue AI agent passed every identity check — four gaps in enterprise IAM explain why.” March 2026. https://venturebeat.com/security/meta-rogue-ai-agent-confused-deputy-iam-identity-governance-matrix
  10. Anthropic. “Security — Claude Code Docs.” https://code.claude.com/docs/en/security
  11. Anthropic. “Making Claude Code more secure and autonomous.” https://www.anthropic.com/engineering/claude-code-sandboxing
  12. Kaptein, M., Khan, V-J., Podstavnychy, A. “Runtime Governance for AI Agents: Policies on Paths.” arXiv:2603.16586, March 2026. https://arxiv.org/abs/2603.16586v1
  13. OWASP. “AI Agent Security Cheat Sheet.” https://cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html
  14. “AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection.” arXiv:2508.01249. https://arxiv.org/abs/2508.01249
  15. Microsoft. “From runtime risk to real-time defense: Securing AI agents.” Microsoft Security Blog, January 2026. https://www.microsoft.com/en-us/security/blog/2026/01/23/runtime-risk-realtime-defense-securing-ai-agents/
  16. OpenClaw. “OpenClaw Partners with VirusTotal for Skill Security.” OpenClaw Blog. https://openclaw.ai/blog/virustotal-partnership
  17. OpenClaw. “RFC: Skill Security Framework — Permission Manifests, Signing, and Sandboxing.” GitHub Issue #10890. https://github.com/openclaw/openclaw/issues/10890
  18. NVIDIA. “Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk.” NVIDIA Developer Blog. https://developer.nvidia.com/blog/practical-security-guidance-for-sandboxing-agentic-workflows-and-managing-execution-risk/