GLM-5.2

Zhipu AI · Released June 13, 2026

9.4 /10 Overall Rating

What It Actually Is

There’s a number that’s easy to dismiss until you see where it came from: 1360. That’s GLM-5.2’s Elo on Design Arena — not a self-reported benchmark, but an independent, community-driven leaderboard where real users vote on real coding and design tasks. It’s the #1 spot. The first time an open-weight model has taken it.

And then there’s 87. That’s GLM-5.2’s score on AkitaOnRails’ custom coding benchmark — a practical, multi-turn evaluation that GLM-5.1 scored 46 on. A +41 point jump. Tier C to Tier A. The largest intra-family turnaround the benchmark has ever recorded.

These aren’t Zhipu AI’s numbers. These are independent evaluators measuring what the model actually does in practice. And they tell the same story as the official benchmarks, which is the part that matters.

Released by Zhipu AI on June 13, 2026, GLM-5.2 is a 744-billion-parameter Mixture-of-Experts model that activates roughly 40 billion parameters per forward pass. The architecture uses IndexShare to reduce per-token FLOPs by 2.9× at 1M context length, with MTP improvements that increase speculative decoding acceptance by 20%. Two reasoning effort levels — High for balanced efficiency, Max for depth — let you trade compute for capability.

The official benchmark table fills in the details. SWE-bench Pro 62.1% beats GPT-5.5 (58.6%), Qwen 3.7 Max (60.6%), and every other open model by a wide margin. Terminal-Bench 82.7 on the Claude Code harness actually edges Opus 4.8’s 78.9 — though Opus 4.8 leads on the Terminus-2 harness (85.0 vs 81.0). On FrontierSWE, the benchmark for multi-hour engineering projects, GLM-5.2 scores 74.4% — trailing Opus 4.8’s 75.1% by exactly 1%.

The MIT license is the force multiplier. No regional limits, no attribution requirements, no API lock-in. Download from Hugging Face, quantize, deploy on vLLM or SGLang or ktransformers. Works with Claude Code, ZCode, OpenCode, and any OpenAI-compatible endpoint. The strongest open coding model ever released, backed by both the builder’s benchmarks and the community’s independent validation.

Key Strengths

Design Arena #1 — Elo 1360: The first open-weight model to top Design Arena’s coding categories, surpassing the previously leading (now restricted) Claude Fable 5. Gained +27 Elo and +4 positions in a short window — one of the highest coding Elo scores ever recorded on the arena. This is independent, community-driven validation, not self-reported benchmarks.
AkitaOnRails 87/100 — Tier A: The most dramatic version-over-version improvement in the benchmark’s history. GLM-5.1 scored 46/100 (Tier C, #21). GLM-5.2 jumped to 87/100 (Tier A, tied #6) — a +41 point leap. Tied with Kimi K2.6/K2.7 variants; behind top closed models (Opus 4.7/4.8 and GPT-5.5 at 94-97). This is a practical multi-turn coding evaluation showing real reliability gains.
SWE-bench Pro 62.1%: Beats GPT-5.5 (58.6%), Qwen 3.7 Max (60.6%), DeepSeek-V4-Pro (55.4%), and Gemini 3.1 Pro (54.2%). Only Opus 4.8 (69.2%) scores higher. SWE-bench Verified subsets show ~78%+ in recent snapshots. The highest SWE-bench Pro score any open-weight model has achieved.
Terminal-Bench 82.7 (Claude Code harness): Actually edges Opus 4.8’s 78.9 on the same harness. On the Terminus-2 harness, 81.0 vs Opus 4.8’s 85.0. Both configurations show a massive 17.5+ point jump from GLM-5.1’s 63.5.
FrontierSWE 74.4% — Nearly Tied with Opus 4.8: Open-ended technical projects spanning hours to tens of hours. GLM-5.2 trails Opus 4.8 by only 1% and edges out GPT-5.5 by 1%. The highest-ranked open model on sustained engineering tasks. MIT license and 1M context make this the only open model competing at this level.

Benchmark Snapshot

Design Arena — #1 (Elo 1360) First open-weight model to top Design Arena's coding categories. Independent community-driven validation. Surpassed Claude Fable 5 with +27 Elo gain.
SWE-bench Pro — 62.1% Beats GPT-5.5 (58.6%), Qwen 3.7 Max (60.6%), and every open model. Only Opus 4.8 (69.2%) ranks higher. SWE-bench Verified subsets show ~78%+.
Terminal-Bench 2.1 — 81.0 / 82.7 81.0 on Terminus-2 (vs Opus 4.8's 85.0). 82.7 on Claude Code harness (edges Opus 4.8's 78.9). Massive +17.5 points from GLM-5.1.
AkitaOnRails — 87/100 Tier A Practical multi-turn coding eval. +41 points from GLM-5.1's 46/100 — the largest intra-family jump in the benchmark's history. Tied #6 overall.

Honest Limitations

Gap to Closed Leaders on Depth Benchmarks: Opus 4.8 still leads on SWE-bench Pro (69.2 vs 62.1), NL2Repo (69.7 vs 48.9), SWE-Marathon (26.0 vs 13.0), and DeepSWE (58.0 vs 46.2). GPT-5.5 leads on DeepSWE (70.0 vs 46.2). The gap is narrowing but hasn’t closed on the hardest tasks.
Heavy Architecture: 744B total parameters (~40B active per token) means even quantized deployments need multi-hundred-GB setups. Most users will access via API. The Coding Plan consumes quota at 3× peak / 2× off-peak.
Not Dominating General Chat: lmarena Code Arena places GLM-5.2 7th-10th range (Elo ~1447-1455). Strong in coding-specific slices but not leading overall text arenas. Coding-focused, not general-purpose.
No Native Vision: Text/code only. Can’t process screenshots or diagrams. For visual coding workflows, you need a separate vision model.

The Verdict: Something shifted. When an open-weight model takes the #1 spot on Design Arena, jumps 41 points on an independent practical coding benchmark, and trails the best closed model by single-digit percentages on FrontierSWE — that’s not incremental progress. GLM-5.2 doesn’t beat Opus 4.8 on every metric, and the gap on the hardest depth benchmarks is real. But for teams that need frontier-grade coding without API lock-in, or who want MIT-licensed weights they can inspect and deploy on their own infrastructure, this is the model that makes it viable. The combination of independent validation (Design Arena #1, AkitaOnRails Tier A) and official benchmarks (SWE-bench Pro 62.1%, FrontierSWE 74.4%) tells a consistent story: the strongest open coding model ever released, and it’s not even close to the previous open leaders.