LLM-Driven Vulnerability Research

Published: April 10, 2026 · Filed: Concept · Domain: LLM Architecture · Tags: Cybersecurity, LLM Capabilities, Vulnerability Research, Exploit Development · Reading: 13 min · Source: AI-synthesised

Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and Anthropic's Project Glasswing response

Summary#

Frontier LLMs have crossed a threshold where they can autonomously discover zero-day vulnerabilities in production software and, in the case of Claude Mythos Preview, chain them into working exploits — capabilities that previously required expert human researchers working for days to weeks per bug. These capabilities emerged from general improvements in code reasoning and autonomy, not security-specific training, which implies future models will continue improving along this axis.

Details#

Capability Ladder#

The progression across model generations is steep:

Opus 4.6: strong at identifying and fixing vulnerabilities; near-0% success at autonomous exploit development. Found high/critical-severity bugs in OSS-Fuzz, webapps, crypto libraries, and the Linux kernel, but couldn't reliably turn them into working exploits.
Mythos Preview: discovers zero-days in every major OS and browser, and autonomously develops working exploits. On Firefox 147 JS engine bugs, Opus 4.6 developed shell exploits 2 times out of hundreds of attempts; Mythos Preview succeeded 181 times (+ 29 with register control). On ~7000 OSS-Fuzz entry points, Mythos Preview achieved full control flow hijack (tier 5) on 10 targets vs. 0 for prior models.

The Scaffold#

All findings used the same simple agentic scaffold:

Launch a container (isolated from internet) with the project under test and source code
Invoke Claude Code with a paragraph-level prompt: "find a security vulnerability in this program"
The agent reads code, hypothesizes vulnerabilities, runs the project to confirm/reject, adds debug logic or uses debuggers as needed
Output: either "no bug" or a bug report with PoC and reproduction steps

To increase diversity, each parallel agent instance focuses on a different file. Files are pre-ranked 1–5 by likelihood of containing interesting bugs (constants = 1, internet-facing parsers = 5). A final validation agent confirms bug severity and filters minor issues.
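The ranking pre-pass, per-file fan-out, and final validation step can be sketched as a small pipeline. This is a minimal illustration, not Anthropic's scaffold: `run_campaign`, `Report`, and the stub agent and validator are hypothetical names, and the real agents are Claude Code sessions in isolated containers rather than Python callables.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Report:
    path: str
    summary: str
    severity: str  # e.g. "low", "high", "critical"


def run_campaign(
    ranked_files: dict[str, int],
    find_bug: Callable[[str], Optional[Report]],
    validate: Callable[[Report], bool],
    workers: int = 4,
) -> list[Report]:
    """Run one agent per file, highest rank (5) first, then validate findings."""
    # Sort by descending rank; ties are broken by path for determinism.
    targets = sorted(ranked_files, key=lambda p: (-ranked_files[p], p))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(find_bug, targets))
    # The validation pass drops "no bug" results and minor issues.
    return [r for r in results if r is not None and validate(r)]


# Stub agent and validator standing in for the model calls (assumptions).
def fake_agent(path: str) -> Optional[Report]:
    if "parser" in path:
        return Report(path, "out-of-bounds write in length field", "critical")
    if "util" in path:
        return Report(path, "unchecked return value", "low")
    return None  # the "no bug" outcome


def fake_validator(report: Report) -> bool:
    return report.severity in {"high", "critical"}


ranks = {"src/constants.h": 1, "src/net/parser.c": 5, "src/util.c": 3}
reports = run_campaign(ranks, fake_agent, fake_validator)
```

Keeping the agent and validator as injected callables means the same pipeline can be exercised with stubs before any model budget is spent.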

Non-experts (Anthropic engineers with no formal security training) used this scaffold overnight and found working RCEs by morning.

Notable Zero-Day Findings#

OpenBSD TCP SACK (27 years old)#

A two-bug chain in OpenBSD's SACK implementation: (1) a missing lower-bound check on the SACK range start, and (2) a NULL-pointer write when a single SACK block simultaneously deletes the only hole and triggers the append path. That precondition looks impossible under ordinary arithmetic, but signed integer overflow satisfies it when the attacker places the SACK start ~2^31 away from the real window. The result is a remote DoS against any TCP-responding OpenBSD host. Cost: <$50 for the specific run, within ~$20K total for 1000 runs (about $20 per run on average) that yielded dozens of findings.
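Why a distance of about 2^31 matters can be seen from how wraparound sequence comparisons behave. The sketch below models the conventional BSD-style `SEQ_LT(a, b)` idiom, `(int)(a - b) < 0`, in Python. It assumes this idiom is the relevant comparison and does not reproduce OpenBSD's SACK code; `to_int32` and `seq_lt` are illustrative helpers.

```python
def to_int32(x: int) -> int:
    """Reinterpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def seq_lt(a: int, b: int) -> bool:
    """Wraparound 'a is before b': true when (int32)(a - b) is negative."""
    return to_int32(a - b) < 0


# Nearby sequence numbers compare the way one expects.
ordinary = (seq_lt(1000, 2000), seq_lt(2000, 1000))

# Exactly 2^31 apart, each value appears to be "before" the other.
window = 0x10000000
far_start = (window + (1 << 31)) & 0xFFFFFFFF
both_before = (seq_lt(far_start, window), seq_lt(window, far_start))

# The ordering is also not transitive: a < b and b < c, yet c < a.
a, b, c = 0x00000000, 0x60000000, 0xC0000000
cycle = (seq_lt(a, b), seq_lt(b, c), seq_lt(c, a))
```

Code that assumes these comparisons form a consistent ordering can therefore reach states its author considered unreachable, which is the general shape of the precondition described above.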

FFmpeg H.264 (16 years old)#

Mismatch between 16-bit slice table entries and a 32-bit slice counter. memset(..., -1,...) initializes every entry to 65535, which serves as the sentinel; an attacker crafts a frame with exactly 65536 slices so that a legitimate slice value collides with the sentinel. The deblocking filter then writes out of bounds. The underlying bug dates to 2003; it became a vulnerability in a 2010 refactor. Missed by every fuzzer and human reviewer since.
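The width mismatch can be modelled in a few lines. This is a simplified sketch, not FFmpeg's code: it assumes a zero-based slice counter truncated into a 16-bit entry, and `store_slice_id` and `looks_unassigned` are hypothetical helpers.

```python
# memset(..., -1, ...) fills every byte with 0xFF, so a 16-bit entry reads 65535.
SENTINEL = int.from_bytes(b"\xff\xff", "little")


def store_slice_id(slice_counter: int) -> int:
    """Truncate a 32-bit slice counter into a 16-bit table entry."""
    return slice_counter & 0xFFFF


def looks_unassigned(entry: int) -> bool:
    """The check a consumer would make against the sentinel value."""
    return entry == SENTINEL


def has_collision(n_slices: int) -> bool:
    """True if any slice id in a frame of n_slices matches the sentinel."""
    return any(looks_unassigned(store_slice_id(i)) for i in range(n_slices))


safe_frame = has_collision(65535)      # ids 0..65534
crafted_frame = has_collision(65536)   # ids 0..65535 (assumes zero-based ids)
```

Under these assumptions a frame needs exactly 65536 slices before a real slice id equals the sentinel, which helps explain why random inputs from a fuzzer are unlikely to reach the condition.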

Memory-Safe VMM Guest-to-Host Corruption#

A vulnerability in unsafe code within a production Rust VMM gives a malicious guest an out-of-bounds write to host process memory. Easy DoS, potentially chainable. Demonstrates that memory-safe languages don't eliminate the attack surface in systems that must interact with hardware.

Thousands More#

The findings extend to an estimated 1000+ critical-severity and thousands of high-severity vulnerabilities across the open-source ecosystem. Model and human severity assessments agreed in 89% of the 198 reports that were manually reviewed.

Exploit Sophistication#

Mythos Preview doesn't just find bugs — it chains them into full exploits:

FreeBSD NFS RCE (CVE-2026-4747): stack overflow in RPCSEC_GSS → 20-gadget ROP chain split across 6 sequential RPC packets, bypassing stack canary (function uses int32_t[] not char[], so -fstack-protector skips it), no KASLR on FreeBSD kernel. Leaks hostid via unauthenticated NFSv4 EXCHANGE_ID.
Linux kernel privilege escalation: chains 2–4 vulnerabilities (KASLR bypass + read primitive + write primitive + heap spray) for full root. Nearly a dozen working examples.
Browser JIT heap sprays: discovers read/write primitives, chains them into a JIT heap spray, then escalates to cross-origin bypass and sandbox escape → kernel write.
N-day exploit generation: given a CVE ID and git commit, autonomously produces working privilege escalation exploits. Two detailed examples:
ipset one-bit write → cross-cache page-table manipulation → PTE R/W bit flip → writable mapping of setuid binary → root. Cost: <$1000, half a day.
unix socket UAF one-byte read → cross-cache reclaim via AF_PACKET ring → HARDENED_USERCOPY bypass via cpu_entry_area/vmalloc stack/non-slab pages → KASLR defeat → stack scanning for ring address → fake cred via init_cred copy → tc qdisc UAF for controlled function call → commit_creds(fake_root_cred) → root. Cost: <$2000.
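The ipset chain above ends with a single flipped bit, and the reason one bit is enough follows from the x86-64 page-table entry layout, where bit 0 is Present, bit 1 is Read/Write, and bit 2 is User/Supervisor. The sketch below only illustrates that bit arithmetic on a made-up entry value; it is not part of any exploit, and the helper names are hypothetical.

```python
PTE_PRESENT = 1 << 0  # page is mapped
PTE_RW = 1 << 1       # 0 = read-only, 1 = writable
PTE_USER = 1 << 2     # accessible from user mode


def is_writable(pte: int) -> bool:
    """Writable at this level only if the entry is present and R/W is set."""
    return bool(pte & PTE_PRESENT) and bool(pte & PTE_RW)


def flip_bit(value: int, bit: int) -> int:
    """Model a one-bit write primitive as an XOR on a single bit position."""
    return value ^ (1 << bit)


# Made-up entry: frame 0x12345, present, user-accessible, read-only.
read_only_pte = (0x12345 << 12) | PTE_PRESENT | PTE_USER
patched_pte = flip_bit(read_only_pte, 1)

same_frame = (patched_pte >> 12) == (read_only_pte >> 12)
```

The entry still points at the same physical frame; only the permission changes. On real hardware the effective permission also depends on the higher-level paging entries, which this sketch ignores.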

Emergent, Not Trained#

These capabilities were not explicitly trained. They emerged as downstream consequences of general improvements in code understanding, reasoning, and autonomy. The same improvements that make a model better at patching bugs also make it better at exploiting them. This implies the capability trajectory will continue with future general-purpose model improvements.

Attacker-Defender Asymmetry and the Transitional Period#

Anthropic argues:

Long-term: LLMs benefit defenders more than attackers (like fuzzers before them). Defenders can direct resources, fix bugs before shipping, scale bugfinding across entire codebases.
Short-term: attackers may have the advantage during the transition, especially if frontier labs aren't careful about model release.
Friction-based defenses degrade: mitigations whose value comes from making exploitation tedious (as opposed to impossible) weaken against model-assisted adversaries that grind through tedious steps cheaply. Hard barriers (KASLR, W^X) remain important.
N-day window shrinks: autonomous CVE-to-exploit pipelines mean the time between disclosure and mass exploitation collapses. Patch cycles must tighten accordingly.

Project Glasswing#

Anthropic's response: limited release of Mythos Preview to critical industry partners and open-source developers to begin securing critical infrastructure before models with similar capabilities become broadly available. Mythos Preview is not planned for general availability. An upcoming Claude Opus model will ship with new safeguards developed against Mythos-class outputs.

Update (2026-04-17): the "upcoming Claude Opus model" is now named and shipped — see Claude Opus 4.7. Opus 4.7 is the first post-Glasswing GA model. Notable details:

Cyber capabilities were differentially reduced during training (not only filtered at inference).
Ships with classifier safeguards that "automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses."
Legitimate researchers route through the new Cyber Verification Program for vulnerability research, pentest, and red-teaming use.
CyberGym score updated: the Opus 4.6 baseline was revised from 66.6 to 73.8 after harness-parameter tuning (same harness, better elicitation).

Update (2026-05-28): the Opus 4.8 System Card (§3) reports cyber evaluations on a benchmark suite including some used for the first time (ExploitBench, CyberGym, Firefox exploits, OSS-Fuzz). Pattern: without safeguards Opus 4.8 is somewhat more capable than Opus 4.7 on most cyber evals; with safeguards it performs comparably to 4.7; and it remains substantially behind Mythos Preview on cyber capability. So the capability ladder above still holds — the general-access frontier is rising but the Glasswing-class gap to the gated model persists, and safeguards continue to neutralize the bare-model uplift. This is consistent with the broader RSP determination that 4.8 does not advance the catastrophic-risk frontier.

Update (2026-06-07): the Anthropic Institute essay When AI builds itself quantifies Glasswing's impact: in its first weeks, Mythos Preview found more than ten thousand high- and critical-severity vulnerabilities across "the world's most important systems" — enough that the cyber-defense bottleneck has already shifted from finding vulnerabilities to patching them fast enough. The essay cites this as evidence that even if model capability froze today, the world would still change materially (its first future: stalled trend, wide diffusion). It also sharpens the N-day window argument — patch velocity, not discovery, is now the binding constraint.

Update (2026-06-14): the ladder gains a new top rung. Mythos 5 ships as the Glasswing upgrade to Mythos Preview with "the strongest cybersecurity capabilities of any model in the world," now including agentic hacking (reconnaissance, discovery, lateral movement — not just exploit-finding). Its general-access sibling Fable 5 keeps that capability in the model but interposes a cyber classifier that "prevent[s] Fable from making any progress" on offensive cyber tasks, falling back to Opus 4.8 instead of refusing (see Capability-Gated Model Fallback). One external partner judged Fable 5's cyber safeguards the most robust of any model tested (including Opus 4.8 and 4.7): zero harmful single-turn compliance on attack-planning, exploit development, or defense evasion — even against 30 public jailbreak techniques. No universal jailbreak was found in 1,000+ bug-bounty hours (the UK AISI made partial progress on a short task). This is the operational realization of "ship Mythos-class capability without shipping the offensive uplift."
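The classifier-plus-fallback behaviour described for Fable 5 can be sketched as a routing function: a flagged request is answered by a fallback model instead of being refused. Everything here is a hypothetical stand-in, including the model labels and the keyword stub, which is only a placeholder for a learned classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Routed:
    model: str
    reply: str


def route(
    prompt: str,
    is_offensive_cyber: Callable[[str], bool],
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Routed:
    """Answer with the fallback model when the classifier flags the prompt."""
    if is_offensive_cyber(prompt):
        return Routed("fallback-model", fallback(prompt))
    return Routed("primary-model", primary(prompt))


# Placeholder classifier: a real deployment would use a trained model.
def keyword_stub(prompt: str) -> bool:
    return "exploit" in prompt.lower()


def primary_model(prompt: str) -> str:
    return f"[primary] {prompt}"


def fallback_model(prompt: str) -> str:
    return f"[fallback] {prompt}"


benign = route("Summarize this changelog", keyword_stub, primary_model, fallback_model)
flagged = route("Write an exploit for this bug", keyword_stub, primary_model, fallback_model)
```

Routing rather than refusing means the user still receives an answer, just from a model without the gated capability.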

Recommendations for Defenders#

Use current frontier models (Opus 4.6) for vulnerability finding now — they find many hundreds of bugs even without exploit capability
Build scaffolds and procedures with current models as preparation for Mythos-class availability
Think beyond vuln-finding: triage, dedup, reproduction steps, patch proposals, config audits, PR review, legacy migrations
Shorten patch cycles; treat CVE-carrying dependency bumps as urgent
Review and scale vulnerability disclosure processes for model-generated volume
Automate technical incident response pipelines (triage, hunting, artifact capture, postmortem drafting)
Prepare contingency plans for vulnerabilities in abandoned/acquired software

Connections#

Agent Harness Engineering — the vulnerability-finding scaffold is a minimal harness: isolated container, single prompt, agentic experimentation loop. The file-ranking pre-pass and validation agent mirror the initializer/coding agent split
Claude Code Best Practices — Claude Code is the runtime used for all vulnerability research; the scaffold relies on its agentic capabilities (tool use, shell access, debugging)
LLM-as-Compiler Knowledge Base — the responsible disclosure process uses SHA-3 cryptographic commitments to prove possession of vulnerabilities without revealing them — a form of verifiable knowledge compilation
Client-Side Agent Optimization — the file-ranking 1–5 pre-pass and final validation agent are hand-tuned instances of exactly what AgentOpt searches over automatically; the scaffold can be modeled as a pipeline with planner (file-ranker) / solver (bug-finder) / critic (validator) roles subject to combo optimization
Scale-Dependent Prompt Sensitivity — the paragraph-level prompt ("find a security vulnerability…") rewards thoroughness, which is the behavior larger models over-produce. A case where large-model verbosity aligns with task utility rather than working against it
Claude Opus 4.7 — first GA model shipped under Project Glasswing with differentially-reduced cyber capabilities and classifier safeguards; the operational answer to "what comes after Mythos Preview for the general public"
Claude Opus 4.8 — next GA model; somewhat more cyber-capable than 4.7 without safeguards, comparable with them, still far behind Mythos
Responsible Scaling Policy Evaluations — cyber is one of the catastrophic-risk domains the RSP gates; the 4.8 determination is that the frontier is not advanced
Claude Code Auto Mode — classifier-gating at the tool-call boundary mirrors the Glasswing request-level classifier; both use secondary-model pre-flight to filter primary-agent actions
Mythos Model — entity page for the preview model that produced these findings; internal use at Anthropic acknowledged in 2026 Q2 sources
Claude Mythos 5 — the June 2026 Glasswing upgrade; current apex of the cyber-capability ladder ("strongest cybersecurity capabilities of any model in the world")
Capability-Gated Model Fallback — the cyber classifier + Opus-4.8 fallback that neutralizes Fable 5's offensive-cyber capability for general users
Anthropic — the vendor behind Mythos Preview and Project Glasswing, the context for these findings
AI-Accelerated Offense — the threat-landscape generalization of these findings: vuln-to-exploit compressed from months to hours, motivating the Zero Trust for AI Agents framework
Impossible, Not Tedious (Design Test) — the "friction-based defenses degrade" observation here is turned into a prescriptive Zero Trust design test
Agent Supply Chain Risk — the same capability that finds zero-days recognizes known-vuln signatures in unpatched upstream components, weaponizing the dependency tree
Autonomous Defense — the defensive deployment of this capability: model-driven triage, hunting, and artifact capture rather than exploitation
Recursive Self-Improvement — Glasswing's 10k+ findings are the essay's proof that even a stalled capability trend reshapes the world (its first future)
AI Accelerating AI Development — a worked example of AI-accelerated technical output, here in security research rather than internal engineering
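The SHA-3 commitment idea mentioned in the list above follows a commit-and-reveal pattern: publish a digest now, disclose the report after the fix ships. The sketch uses Python's standard `hashlib.sha3_256`; the random nonce and the function names are assumptions added for illustration, since the source only states that SHA-3 commitments are used.

```python
import hashlib
import hmac
import secrets


def commit(report: bytes) -> tuple[str, bytes]:
    """Return (public digest, private nonce) for an undisclosed report."""
    # The nonce stops anyone from confirming a guess at a short report.
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha3_256(nonce + report).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: bytes, report: bytes) -> bool:
    """Check a revealed report and nonce against the published digest."""
    expected = hashlib.sha3_256(nonce + report).hexdigest()
    return hmac.compare_digest(digest, expected)


report = b"Heap overflow in example parser, reproduction steps attached"
published_digest, private_nonce = commit(report)

honest_reveal = verify(published_digest, private_nonce, report)
tampered_reveal = verify(published_digest, private_nonce, report + b" (edited)")
```

Only the digest is public before disclosure, so it shows the report existed at commit time without revealing its contents.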

Open Questions#

How do these capabilities transfer to non-memory-safety bug classes (logic bugs, protocol-level flaws, supply chain attacks)?
What's the ceiling for autonomous exploit complexity? The N-day examples are remarkably sophisticated — is there a qualitative limit?
How will the security industry's equilibrium shift when multiple labs have Mythos-class models?
Can defensive scaffolds (continuous fuzzing + model-driven triage + auto-patching) close the attacker-defender gap during the transition?
What safeguards are effective against Mythos-class outputs without crippling legitimate security research?

Sources#

Claude Mythos Preview red.anthropic.com
Introducing Claude Opus 4.7 — first post-Glasswing GA model; operational safeguards
Claude Opus 4.8 System Card — §3 (Cyber): ExploitBench, CyberGym, Firefox exploits, OSS-Fuzz
When AI builds itself — Glasswing's 10k+ first-weeks findings; "bottleneck shifted from finding to patching"
Claude Fable 5 and Claude Mythos 5 — Mythos 5 as the Glasswing upgrade; Fable 5's cyber classifier and jailbreak-robustness results

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 22

Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
Agent Supply Chain Risk
Runtime-composed agent ecosystems expand the supply-chain attack surface: model poisoning (250 docs backdoor a 13B mode…
AI-Accelerated Offense
Frontier models compress the vulnerability-to-exploit timeline from months to hours at marginal dollar cost; both attac…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Autonomous Defense
Running security operations at the speed of AI-accelerated threats: put a model at the front of the alert queue, automa…
Capability-Gated Model Fallback
Fable 5's safeguard architecture: classifiers detect cyber / bio-chem / distillation queries and route the response to…
Claude Code Auto Mode
Claude Code permission mode using a classifier to auto-approve safe tool calls and block risky ones; middle ground betw…
Claude Code Best Practices
Anthropic's guide to effective Claude Code usage: context management, verification-driven development, explore→plan→cod…
Claude Mythos 5
The safeguards-lifted form of Claude Fable 5 (June 2026): same underlying Mythos-class model, deployed through Project…
Claude Opus 4.7
GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…
Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
Impossible, Not Tedious (Design Test)
Zero Trust design test for agentic security: does a control make the attack impossible, or just tedious? Friction-only…
LLM-as-Compiler Knowledge Base
Karpathy's architecture: LLM incrementally compiles raw docs into a persistent interlinked wiki, replacing RAG with a 4…
LLM Architecture, Training & Alignment
Map of Content for the llm-architecture domain — 19 concepts. Curated entry point; see Home for all domains.
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Opus 4.6 → 4.7 Changes and Multi-Agent Coding Considerations
4.6→4.7 delta table + six hazards for multi-agent coding teams: role-based model selection, prompt re-tuning, harness i…
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Responsible Scaling Policy Evaluations
Anthropic's RSP gates deployment on pre-release capability evaluations in CBRN, automated AI R&D, and high-stakes misal…
Scale-Dependent Prompt Sensitivity
Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…
When Does Verification Quality Determine Whether AI Automation Works?
Verification-quality ladder from Lean/formal proof search through software CI and vulnerability reproduction; autonomy…
