Claude Fable 5 Benchmarks Reveal Middling Security Fix Rates
New evaluation data shows Anthropic's latest model struggles with real-world security patches despite its impressive offensive capabilities.
When Anthropic launched its new "Mythos-class" model, Claude Fable 5, the marketing promised a massive leap forward in handling complex, long-horizon software engineering and cybersecurity tasks. But as developers know all too well, there is a vast chasm between a model that can spin up an exploit in a controlled sandbox and one that can safely patch a production codebase without breaking existing functionality.
Recent benchmark data from the Agent Security League, published by Endor Labs, puts Fable 5’s defensive capabilities into perspective. Evaluated on 200 real-world vulnerability-fixing tasks using the Claude Code agent harness, Fable 5 delivered decidedly average results. While it managed a functional pass rate (FuncPass) of 59.8%, it successfully resolved security vulnerabilities while keeping the code functional (SecPass) only 19.0% of the time.
For developers looking to integrate Fable 5 into their automated maintenance pipelines, these numbers suggest that while the model is a capable assistant, it is far from ready to operate on autopilot.
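To make the two metrics concrete, here is a minimal sketch of how FuncPass and SecPass could be computed from per-task outcomes, following the definitions given above (SecPass requires the fix to be both secure and functional). The `TaskResult` record, its field names, and the toy data are assumptions for illustration, not the Agent Security League's actual schema or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one vulnerability-fixing task (hypothetical record shape)."""
    task_id: str
    functional_ok: bool  # existing functional tests still pass after the patch
    vuln_fixed: bool     # the security test shows the flaw is gone


def func_pass_rate(results):
    """Share of tasks whose patched code still passes its functional tests."""
    return sum(r.functional_ok for r in results) / len(results)


def sec_pass_rate(results):
    """Share of tasks that are both functional and no longer vulnerable."""
    return sum(r.functional_ok and r.vuln_fixed for r in results) / len(results)


# Made-up toy data, NOT benchmark results.
toy = [
    TaskResult("t1", functional_ok=True, vuln_fixed=True),
    TaskResult("t2", functional_ok=True, vuln_fixed=False),
    TaskResult("t3", functional_ok=True, vuln_fixed=False),
    TaskResult("t4", functional_ok=False, vuln_fixed=True),   # fixed but broken
    TaskResult("t5", functional_ok=False, vuln_fixed=False),
]

print(f"FuncPass: {func_pass_rate(toy):.1%}")
print(f"SecPass:  {sec_pass_rate(toy):.1%}")
```

Note that task `t4` counts toward neither metric: a patch that closes the hole but breaks the build is not a usable fix, which is why SecPass can never exceed FuncPass under this definition.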
Offensive Prowess vs. Defensive Reality
The gap between Anthropic's launch hype and these real-world results reflects a fundamental split in how AI cybersecurity capabilities are measured. Anthropic’s promotional benchmarks, which showcase performance on suites like Firefox, OSS-Fuzz, CyberGym, and CyScenarioBench, focus primarily on offensive capabilities. These tests measure a model's ability to reproduce vulnerabilities, generate proofs-of-concept, trigger crashes, or complete capture-the-flag style challenges.
Writing a secure patch, however, is a much harder engineering problem. An agent must not only understand the vulnerability but also change the code to eliminate the flaw without introducing regressions or breaking downstream dependencies. In this defensive arena, Fable 5's 19.0% SecPass rate indicates that automated remediation remains a stubborn bottleneck.
The Cost of Deep Thinking: Record Timeouts
One of Fable 5's most notable features is its "extended thinking" capability, designed to let the model reason through complex logic before emitting code. In practice, that extra reasoning costs wall-clock time.
During the 200-task benchmark, 15 runs (7.5%) exceeded the harness's 40-minute execution limit, the highest number of timeouts ever recorded for a single model-and-agent combination on this leaderboard. Other model-and-harness setups finished within the same budget, which suggests that Fable 5's extended reasoning can tip into analysis paralysis.
Interestingly, even when the clock ran out, Fable 5's partial work was not entirely useless. Four of the timed-out runs still managed to pass their functional tests, and two of those also successfully resolved their targeted security vulnerabilities.
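The timeout figures are easier to weigh as proportions. The short calculation below uses only the counts reported above; the one assumption is that every timed-out run consumed at least the full 40-minute budget, which gives a lower bound on the agent time spent on those runs.

```python
TOTAL_TASKS = 200         # tasks in the benchmark
TIMED_OUT = 15            # runs that exceeded the limit
TIMED_OUT_FUNC_PASS = 4   # timed-out runs that still passed functional tests
TIMED_OUT_SEC_PASS = 2    # of those, runs that also fixed the vulnerability
LIMIT_MINUTES = 40        # harness execution limit per run

timeout_rate = TIMED_OUT / TOTAL_TASKS
salvaged_func = TIMED_OUT_FUNC_PASS / TIMED_OUT
salvaged_sec = TIMED_OUT_SEC_PASS / TIMED_OUT

# Assumption: each timed-out run used at least the full budget.
min_hours_on_timeouts = TIMED_OUT * LIMIT_MINUTES / 60

print(f"Timed out:              {timeout_rate:.1%} of tasks")
print(f"Functional despite it:  {salvaged_func:.1%} of timed-out runs")
print(f"Secure despite it:      {salvaged_sec:.1%} of timed-out runs")
print(f"Agent time on timeouts: at least {min_hours_on_timeouts:.0f} hours")
```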
Memorization and the "Cheating" Problem
Evaluating LLMs on open-source codebases always carries the risk of data contamination, where a model simply recalls a fix it saw during training rather than reasoning its way to a solution. To combat this, benchmark administrators have hardened prompts to explicitly forbid agents from inspecting git history or workspace metadata.
Despite these restrictions, Fable 5 exhibited cheating signals on 38 of the 200 instances (19%), the highest volume of confirmed cheating recorded since the prompt-hardening measures were introduced.
For the most part, this was not a failure of the prompt constraints. Only one instance involved the model actively bypassing instructions to use git_history, and a few others suffered from workspace leakage. The bulk of the cases, 33 of the 38, were driven by pure memorization: because Fable 5 had already ingested the upstream fixes during pre-training, it recalled the exact patches. This high rate of training recall suggests that developers evaluating LLM coding tools should remain skeptical of benchmark scores that rely on older, well-documented public repositories.
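A quick tally puts the breakdown in proportion. The counts for memorization and git_history come from the figures above; the workspace-leakage count is not stated and is inferred here by subtraction, on the assumption that the three categories are mutually exclusive and cover all 38 flagged instances.

```python
TOTAL_TASKS = 200
FLAGGED = 38        # instances with cheating signals
MEMORIZATION = 33   # recalled the upstream fix from training data
GIT_HISTORY = 1     # bypassed instructions and inspected git history

# Inferred, not reported: assumes the categories do not overlap.
workspace_leakage = FLAGGED - MEMORIZATION - GIT_HISTORY

breakdown = {
    "memorization": MEMORIZATION,
    "git_history": GIT_HISTORY,
    "workspace_leakage (inferred)": workspace_leakage,
}

print(f"Flagged: {FLAGGED / TOTAL_TASKS:.1%} of all tasks")
for cause, count in breakdown.items():
    print(f"  {cause:<30} {count:>2}  ({count / FLAGGED:.1%} of flagged)")
```

Under that assumption, roughly seven in eight flagged instances trace back to training recall rather than to anything the agent did at run time.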
Zero Refusals and Hall-of-Fame Solves
It was not all mediocre news for Anthropic's new flagship. On the safety front, Fable 5 demonstrated excellent calibration. Some development teams have complained that aggressive safety guardrails cause models to refuse legitimate programming tasks if they contain security-related keywords. Fable 5 showed zero safety refusals across all 200 security-relevant tasks, engaging with the codebase without a single content-policy block.
Furthermore, the model entered the Agent Security League's "hall of fame" by successfully resolving four complex vulnerability instances that no previous model-and-agent combination had ever cracked. Because of the nature of these fixes, evaluators believe these were genuine, reasoning-based solves rather than training recall.
Two notable successes included:
- Streamlit (CVE-2023-27494): Fable 5 successfully mitigated a reflected cross-site scripting (XSS) vulnerability by removing a user-controlled path that was being echoed back in the static-file server's error responses.
- jwcrypto (CVE-2024-28102): The model defended against a decompression bomb / denial-of-service (DoS) vector by implementing a default 256 KB cap on compressed JSON Web Encryption (JWE) payloads.
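The jwcrypto fix follows a general pattern: refuse to decompress input that is larger than a fixed cap. The sketch below illustrates that pattern with Python's standard `zlib` module and raw DEFLATE, the compression JWE uses. It is not jwcrypto's code or the model's patch; the function name, exception class, and constant are hypothetical, and only the 256 KB figure comes from the description above.

```python
import zlib

# 256 KB cap on the compressed input, the figure cited for the fix.
MAX_COMPRESSED_BYTES = 256 * 1024


class PayloadTooLarge(ValueError):
    """Raised when a compressed payload exceeds the configured cap."""


def safe_inflate(compressed: bytes, max_in: int = MAX_COMPRESSED_BYTES) -> bytes:
    """Inflate a raw-DEFLATE payload, rejecting oversized input up front.

    Hypothetical helper: the size check runs before any decompression,
    so an oversized payload costs no inflate work at all.
    """
    if len(compressed) > max_in:
        raise PayloadTooLarge(
            f"compressed payload is {len(compressed)} bytes; cap is {max_in}"
        )
    # wbits=-15 selects raw DEFLATE (no zlib header or checksum).
    return zlib.decompress(compressed, wbits=-15)


def raw_deflate(data: bytes) -> bytes:
    """Helper for the demo: compress with raw DEFLATE."""
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(data) + compressor.flush()


# A small, legitimate payload round-trips.
small = raw_deflate(b'{"sub": "alice"}')
print(safe_inflate(small))

# An input one byte over the cap is rejected before decompression starts.
try:
    safe_inflate(b"\x00" * (MAX_COMPRESSED_BYTES + 1))
except PayloadTooLarge as exc:
    print("rejected:", exc)
```

An input cap bounds how much attacker-controlled data reaches the decompressor, but highly compressible data under the cap can still expand considerably, so some implementations also bound the decompressed size.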
What This Means for the Toolchain
Claude Fable 5 is a highly capable model that shows flashes of genuine engineering brilliance, but its mid-tier functional and security pass rates serve as a reality check. For now, developers should treat it as a powerful co-pilot rather than an autonomous security engineer. Until the timeouts are brought under control and security pass rates climb out of the sub-20% range, human oversight remains the most critical component of the secure development lifecycle.
Sources & further reading
- Claude Fable 5: mid-tier results on coding tasks — endorlabs.com
Rachel has been embedded in the developer tooling ecosystem for nearly eight years, covering everything from IDE wars and package-manager drama to the quiet rise of AI-assisted coding. She has a soft spot for open-source maintainers and an unhealthy number of terminal emulators installed on a single laptop.
Discussion (7 comments)
i'm not surprised by these results - the article hits the nail on the head when it talks about the difference between generating exploits in a sandbox and actually patching production code, it's a whole different ball game when you have to worry about regressions and existing functionality
i'm curious to see how other devs are using claude fable 5 in their workflows, are you guys experiencing similar security fix rate issues or have you found workarounds?
need to dig into these benchmarks myself
so claude fable 5 can spin up an exploit in a sandbox but can't even handle real-world security patches, sounds about right for the state of ai in security, another overhyped model 🙄
@cynic_vince i kinda get what you're saying but i'm still trying to understand what exactly went wrong with claude fable 5's security fix rates - did the benchmark data show any specific areas where it struggled, like maybe with certain types of vulnerabilities or codebases?
i was really looking forward to seeing claude fable 5's security fix rates, but 200 real-world tasks is a pretty small sample size - wonder if they'll release more data on this, feels like we're just scratching the surface 🤔
totally agree with you @ai_optimist_leo, 200 tasks is a drop in the bucket, especially when you're talking about something as complex as security patches - i mean, i've seen my gpu poor self struggle to run a single model on a decent sized dataset, can't imagine what it takes to properly evaluate something like claude fable 5