Claude Opus 4.8

Anthropic · Released May 28, 2026

9.7/10 Overall Rating

What It Actually Is

There’s a number that makes this review easy to write: 69.2%. That’s Opus 4.8 on SWE-Bench Pro — the benchmark that doesn’t care about toy problems, only whether an AI can fix actual bugs in actual production codebases. GPT-5.5 scores 58.6%. Opus 4.7 scored 64.3%. Gemini 3.1 Pro manages 54.2%.

The gap isn’t just wide — it’s embarrassing for the competition.

Released today (May 28, 2026), Claude Opus 4.8 builds on everything that made 4.7 the coding king and fixes much of what held it back. The hybrid reasoning engine is sharper. The self-verification loops are 4× less likely to let code flaws slip through unflagged. And the new effort control system means you finally get to choose: think fast, or think deep.

But the headline feature is Dynamic Workflows. Claude Code can now spawn hundreds of parallel subagents — each tackling a slice of a massive codebase migration, bug sweep, or language port. It’s the closest thing AI has to a real engineering team. And on the Super-Agent benchmark, Opus 4.8 is the only model to complete every single test case end-to-end.

The honest caveat? GPT-5.5 still wins on Terminal-Bench (78.2% vs 74.6%) — if your workflow is rapid shell iteration, OpenAI has the edge. And the deeper thinking traces mean higher token burn on complex tasks. But for the deep, multi-file, “ship a real feature” engineering work — the kind that actually matters — Opus 4.8 is in a league of its own.

Key Strengths

SWE-Bench Pro 69.2% (SOTA): The benchmark that measures whether AI can fix real bugs in real codebases. Opus 4.8 leads by 10.6 points over GPT-5.5 (58.6%), 4.9 points over its own predecessor Opus 4.7 (64.3%), and 15.0 points over Gemini 3.1 Pro (54.2%). The largest lead any model has ever held on this benchmark.
Self-verification that actually works: 4× less likely to let code flaws slip through without flagging them. Opus 4.8 catches its own mistakes, pushes back when a plan isn’t sound, and reports progress honestly instead of hallucinating completion. The ‘I’m done’ lie that plagued earlier models is largely gone.
Dynamic Workflows: Claude Code can now spawn and manage hundreds of parallel subagents for large-scale tasks — codebase migrations, bug sweeps, language ports. Think of it as AI project management, not just code generation.
100% Super-Agent completion: The only model to complete every case end-to-end on the Super-Agent benchmark, beating all prior Opus models and GPT-5.5. Agentic reliability isn’t just a talking point anymore — it’s measurable.
Effort control: You now choose how hard it thinks (Default, Extra, or Max). No more fighting the ‘laziness’ problem that plagued Opus 4.7 on simple tasks. Ask for quick, get quick. Ask for deep, get deep.
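
The fan-out pattern behind the Dynamic Workflows item above can be sketched in a few lines. This is only an illustration of splitting a job into slices and running them in parallel; it is not Claude Code's implementation or API. The names `split_into_slices`, `run_subagent`, and `fan_out` are hypothetical, and the "subagent" is a stub that just reports the files it was handed.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(files, slice_size):
    """Group a flat file list into fixed-size slices, one slice per subagent."""
    return [files[i:i + slice_size] for i in range(0, len(files), slice_size)]


def run_subagent(file_slice):
    """Hypothetical stand-in for one subagent's work on its slice.

    A real subagent would edit or migrate the files; this stub only
    echoes what it received so the control flow can be inspected.
    """
    return {"files": file_slice, "status": "done"}


def fan_out(files, slice_size=2, max_workers=4):
    """Run one stub subagent per slice in parallel; results keep slice order."""
    slices = split_into_slices(files, slice_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, slices))


results = fan_out(["a.py", "b.py", "c.py", "d.py", "e.py"])
for result in results:
    print(result)
```

In a real migration the interesting part is what happens after the fan-out: merging results and re-running slices that failed. The sketch leaves that out.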

Benchmark Snapshot

SWE-Bench Pro — 69.2% (SOTA) Real-world software engineering. The highest score any model has ever posted, beating GPT-5.5 (58.6%), Opus 4.7 (64.3%), and Gemini 3.1 Pro (54.2%). That is a 10.6-point lead over GPT-5.5, the nearest non-Anthropic competitor, and 4.9 points over Opus 4.7.
Terminal-Bench — 74.6% Rapid terminal-based coding. Strong, but GPT-5.5 retains the lead at 78.2%. Opus excels at deep reasoning tasks; GPT-5.5 at rapid iteration.
Super-Agent — 100% End-to-end agentic task completion across translation, deep research, slide-building, and analysis. The only model to complete every case.
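
The point margins quoted in this review follow directly from the scores above. A small sketch for checking them, using only the figures stated in the text (the `scores` table and `margins` helper are illustrative names, not part of any benchmark tooling):

```python
# Scores exactly as quoted in this review, in percent.
scores = {
    "SWE-Bench Pro": {
        "Opus 4.8": 69.2,
        "GPT-5.5": 58.6,
        "Opus 4.7": 64.3,
        "Gemini 3.1 Pro": 54.2,
    },
    "Terminal-Bench": {
        "Opus 4.8": 74.6,
        "GPT-5.5": 78.2,
    },
}


def margins(benchmark, model="Opus 4.8"):
    """Return the model's lead over each other model in percentage points.

    A negative value means the model trails on that benchmark.
    """
    table = scores[benchmark]
    return {
        other: round(table[model] - score, 1)
        for other, score in table.items()
        if other != model
    }


for name in scores:
    print(name, margins(name))
```

On SWE-Bench Pro this gives leads of 10.6, 4.9, and 15.0 points over GPT-5.5, Opus 4.7, and Gemini 3.1 Pro; on Terminal-Bench it gives -3.6, meaning Opus 4.8 trails GPT-5.5 by 3.6 points.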

Honest Limitations

Token cost is real: Same nominal pricing as 4.7 ($5/$25 per million tokens), but deeper thinking on complex tasks burns more tokens. The tokenizer still inflates costs 15–35% on code-heavy prompts. Budget accordingly.
Terminal-Bench gap: GPT-5.5 leads at 78.2% vs Opus 4.8’s 74.6% on rapid terminal iteration tasks. If your workflow is mostly shell-driven, GPT-5.5 has the edge.
Latency on hard problems: Deeper thinking traces mean longer waits on complex tasks. Fast mode (2.5× speed, 3× cheaper) helps for lighter work, but the hardest problems still require patience.
Strict safety guardrails: Enhanced cybersecurity protections block certain high-risk code patterns. Legitimate security researchers may hit false positives.
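
To make "budget accordingly" concrete, here is a rough cost sketch. It rests on several assumptions: that "$5/$25" means $5 per million input tokens and $25 per million output tokens, that the 15–35% tokenizer inflation applies equally to input and output token counts, and that the workload sizes are arbitrary placeholders. The function name `estimate_cost` is illustrative.

```python
# Assumed reading of the review's "$5/$25 per million tokens".
INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens (assumption)
OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens (assumption)


def estimate_cost(input_tokens, output_tokens, inflation=0.0):
    """Estimate cost in USD, scaling both token counts by (1 + inflation)."""
    factor = 1.0 + inflation
    cost_in = input_tokens * factor / 1_000_000 * INPUT_PRICE_PER_MTOK
    cost_out = output_tokens * factor / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return round(cost_in + cost_out, 2)


# Placeholder workload: 2M input tokens and 400k output tokens.
baseline = estimate_cost(2_000_000, 400_000)
low = estimate_cost(2_000_000, 400_000, inflation=0.15)
high = estimate_cost(2_000_000, 400_000, inflation=0.35)
print(f"baseline ${baseline:.2f}, code-heavy ${low:.2f} to ${high:.2f}")
```

For this placeholder workload, a $20.00 baseline becomes roughly $23.00 to $27.00 once the quoted inflation range is applied. Extra tokens from deeper thinking would come on top of that and are not modeled here.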

The Verdict: The coding crown, without the asterisk. Opus 4.7 was the undisputed king of hard engineering problems but fumbled on simple ones. Opus 4.8 improves on both fronts: the SWE-Bench Pro lead over GPT-5.5 widens to 10.6 points (69.2% vs 58.6%), while effort control eliminates the ‘laziness’ complaints. The self-verification improvement is the real story: a model that catches its own bugs before you do. GPT-5.5 still wins on terminal speed, but for the kind of deep, multi-file engineering work that actually ships features, this is it.