GLM 5.2 Beats Claude in Security Benchmark
Zhipu AI's open-weight model outshines proprietary giants in detecting complex access control vulnerabilities without leaking code.
Finding security vulnerabilities in code is one of the most demanding tasks we hand over to large language models. Unlike simple syntax checks, identifying logical flaws like Insecure Direct Object References (IDORs) requires a deep understanding of authorization boundaries, routing, and state. For a long time, conventional wisdom said you needed massive, proprietary frontier models behind expensive APIs to even stand a chance.
A recent benchmark from security platform Semgrep turned that assumption on its head. In a head-to-head evaluation of IDOR detection, GLM 5.2, an open-weight model from Zhipu AI, achieved an F1 score of 39%. That result comfortably beat Claude Code, which posted an F1 score of 32%, and even outpaced Claude Opus 4.8 in raw prompting scenarios.
This matters for security teams. For organizations that cannot send proprietary code to external APIs because of compliance or privacy constraints, a highly capable, MIT-licensed model that runs locally changes the math entirely.
The Architecture of GLM 5.2
Zhipu AI rolled out GLM 5.2 to its coding plan members on June 13, 2026, and released the open weights under an MIT license on June 16, 2026.
Under the hood, GLM 5.2 is a Mixture-of-Experts (MoE) model. It has roughly 750 billion total parameters, but activates only about 40 billion of them per token. Because only a small share of the weights takes part in each forward pass, per-token compute is far lower than for a dense model of the same size, which keeps inference costs low. In Semgrep's testing, GLM 5.2 found vulnerabilities at an estimated cost of just $0.17 per bug.
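To put the sparsity in perspective, here is a minimal arithmetic sketch using only the two approximate figures quoted above. The function name is ours, chosen for illustration, and the result is a rough ratio, not a measured inference cost.

```python
# Approximate figures quoted in the article.
TOTAL_PARAMS = 750e9   # total parameters across all experts
ACTIVE_PARAMS = 40e9   # parameters activated per token


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that take part in each token's forward pass."""
    return active / total


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active per token: {fraction:.1%}")
```

Roughly one parameter in nineteen is used for any given token, which is where the compute savings come from.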
Equally important for security audits is the model's expanded context window, which now stretches to 1 million tokens, up from 200K. Security analysis is rarely self-contained. To find an IDOR, a model must trace a request from an HTTP controller, through middleware checks, down to the database query, often spanning dozens of files. Zhipu AI designed this context window to remain reliable across long, complex agent trajectories, so that the model is less likely to lose the thread when parsing deeply nested codebases.
Raw Prompting vs. The Harness
While GLM 5.2's victory over Claude Code is impressive, the benchmark highlights a critical architectural lesson: the model is only as good as the scaffolding around it.
In this evaluation, both models were tested using a basic Pydantic AI harness. They received the same IDOR prompt, a basic search strategy, and pointers on what IDORs look like, but no advanced assistance like endpoint discovery or guided navigation.
When we look at the broader picture, Semgrep's own multimodal pipeline scored between 53% and 61% F1. The difference? Semgrep's pipeline runs inside a custom harness designed specifically for static analysis. This harness does the heavy lifting: it enumerates application endpoints, prunes irrelevant code, and feeds the model only the most critical context.
[Figure: bar chart, "IDOR Detection Performance (F1 Score %)". Claude Code: 32; GLM 5.2: 39; Semgrep Pipeline (Max): 61.]
The data shows that while a superior model provides a better baseline, building a smart, agentic harness around the model is what moves the needle from experimental to production-ready.
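For readers less familiar with the metric, F1 is the harmonic mean of precision and recall, so a scanner cannot score well by flagging everything or by flagging almost nothing. The sketch below computes it from raw counts. The counts in the example are hypothetical and are not taken from Semgrep's benchmark.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical run: 10 real IDORs found, 20 false alarms, 10 real IDORs missed.
score = f1_score(tp=10, fp=20, fn=10)
print(f"F1: {score:.0%}")
```

In this made-up run, recall is 50% but precision is only about 33%, and the false alarms pull the combined score down to 40%.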
What This Means for Your Security Workflow
For developers looking to adopt AI-driven security scanning, GLM 5.2 offers a compelling path forward.
First, the MIT license means you can host this model on your own infrastructure. If you are working in fintech, healthcare, or any sector with strict data sovereignty rules, sending code to external APIs is often a non-starter. Running GLM 5.2 locally solves this bottleneck.
However, hosting a 750-billion-parameter MoE model is not trivial. Although only about 40 billion parameters are active per token, the router can select any expert, so all of the weights must stay loaded in memory (or be offloaded at a significant speed cost). On top of that, the KV cache for a 1-million-token context window consumes substantial memory of its own. Teams will need to balance the infrastructure costs of running high-end GPUs against the API costs of proprietary models.
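As a rough sizing aid, the sketch below estimates the memory needed just to store all 750 billion parameters at a few common precisions, since an MoE server typically keeps every expert loaded regardless of how many are active per token. The precisions listed are our assumptions, not formats confirmed for GLM 5.2, and the estimate excludes the KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 750e9  # approximate total parameter count quoted in the article


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumed storage precisions, for illustration only.
for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    size = weight_memory_gb(TOTAL_PARAMS, bytes_per_param)
    print(f"{label}: ~{size:,.0f} GB for weights")
```

Even at an aggressive 4-bit quantization, the weights alone come to hundreds of gigabytes, so a multi-GPU node is the realistic starting point.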
To get started, developers should avoid throwing raw code at the model in a single prompt. Instead, mimic the success of Semgrep's multimodal pipeline. Build an agentic workflow that maps out API endpoints, identifies authorization middleware, and extracts only the relevant controller code before feeding it to GLM 5.2.
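The endpoint-mapping step can be sketched with nothing but Python's standard `ast` module. This is an illustration under stated assumptions, not Semgrep's harness: it assumes a Flask-style app that registers routes with a `route` decorator, and `require_owner` is a hypothetical ownership-check decorator invented for the example.

```python
import ast
from dataclasses import dataclass

# Hypothetical decorator names that signal an ownership check in the target app.
AUTH_DECORATORS = {"require_owner"}


@dataclass
class Endpoint:
    path: str
    handler: str
    has_auth: bool
    source: str


def _decorator_name(node: ast.expr) -> str:
    """Trailing name of a decorator, e.g. 'route' for @app.route(...)."""
    target = node.func if isinstance(node, ast.Call) else node
    if isinstance(target, ast.Attribute):
        return target.attr
    if isinstance(target, ast.Name):
        return target.id
    return ""


def map_endpoints(code: str) -> list[Endpoint]:
    """List route handlers and whether a known ownership decorator guards them."""
    endpoints = []
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.FunctionDef):
            continue
        names = [_decorator_name(d) for d in node.decorator_list]
        if "route" not in names:
            continue
        route = node.decorator_list[names.index("route")]
        path = "<unknown>"
        if (isinstance(route, ast.Call) and route.args
                and isinstance(route.args[0], ast.Constant)):
            path = str(route.args[0].value)
        endpoints.append(Endpoint(
            path=path,
            handler=node.name,
            has_auth=bool(AUTH_DECORATORS & set(names)),
            source=ast.get_source_segment(code, node) or "",
        ))
    return endpoints


SAMPLE = '''
@app.route("/invoices/<invoice_id>")
@require_owner
def get_invoice(invoice_id):
    return db.get(invoice_id)

@app.route("/reports/<report_id>")
def get_report(report_id):
    return db.get(report_id)
'''

# Only handlers without an ownership check are forwarded to the model.
candidates = [e for e in map_endpoints(SAMPLE) if not e.has_auth]
for e in candidates:
    print(e.path, e.handler)
```

A missing decorator is only a coarse filter. A handler may enforce ownership inside its body or in middleware, so the extracted `source` still needs the model's (or a human's) judgment before anything is reported as an IDOR.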
The success of GLM 5.2 proves that open-weight models are no longer the underdogs in specialized, highly complex domains like cybersecurity. By combining the privacy of local execution with performance that rivals or exceeds proprietary giants, GLM 5.2 gives developers a powerful new tool to secure their codebases on their own terms.
Sources & further reading
- GLM 5.2 beats Claude in our benchmarks — semgrep.dev
Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.