Claude Opus 4.7

Anthropic · Released April 16, 2026

9.6/10 Overall Rating

What It Actually Is

There’s a number that makes this review easy to write: 64.3%. That’s Opus 4.7 on SWE-Bench Pro — the benchmark that doesn’t care about toy problems, only whether an AI can fix actual bugs in actual production codebases. GPT-5.4 scores 57.7%. Kimi K2.6 scores 58.6%. Opus 4.6 scored 53.4%.

The race isn’t close. The nearest competitor, Kimi K2.6, trails by 5.7 points, and GPT-5.4 trails by 6.6. It’s a category break.

Released on April 16, 2026, Claude Opus 4.7 is what Anthropic calls a “hybrid reasoning model” — a system that can dynamically adjust how hard it thinks based on problem difficulty. The new “xhigh” effort level lets developers explicitly tell the model to reason deeper on hard problems, trading latency for accuracy. And on CursorBench — actual coding sessions with real developers in a real IDE — it scores 70%, up from 58% for Opus 4.6.

But the honest review requires the honest caveats. This model was optimized for hard, multi-step engineering work, and you feel it everywhere else. Simple prompts sometimes get less effort than they did on 4.6 (the “laziness” reports are real). The new tokenizer inflates costs 15–35% on code-heavy prompts. And some users report the 1M context window, while technically present, has weaker recall in the middle ranges than 4.6.

This is not a universal upgrade. It’s a specialist that happens to be the best specialist we’ve ever seen.

Key Strengths

SWE-Bench Pro 64.3% (SOTA): The benchmark that measures whether AI can fix real bugs in real codebases. Opus 4.7 leads by 6.6 points over GPT-5.4 (57.7%), 5.7 points over Kimi K2.6 (58.6%), and 10.9 points over its own predecessor, Opus 4.6 (53.4%). The biggest single-generation leap in coding benchmarks this year.
CursorBench 70%: Not a synthetic benchmark — actual Cursor IDE sessions with real developers. Opus 4.7 scored 70% vs Opus 4.6’s 58%. The model that developers actually want in their editor.
Hybrid reasoning with ‘xhigh’ effort: A new effort tier that lets you trade latency for deeper thinking on truly hard problems. When you need the model to actually solve something, not just pattern-match, xhigh delivers.
High-res vision (3.75 MP): Feed it dense screenshots, architecture diagrams, error dialogs, or entire dashboards at up to 2576px resolution. It doesn’t just see images — it reads them with the precision of a code review.
Agentic autonomy: Multi-file edits, tool-use chains, self-verification loops — Opus 4.7 handles complex autonomous workflows with noticeably less hand-holding than 4.6. Cursor, CodeRabbit, and Replit all report fewer mid-task failures.
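
For the high-res vision figures quoted above (3.75 MP, up to 2576px), a pre-upload sizing check can be sketched roughly as follows. This is a minimal sketch under stated assumptions: it treats 2576px as a cap on the longer edge and 3.75 MP as a cap on total pixels, and the function name fit_for_upload is hypothetical. The review does not say how the API itself handles oversized images.

```python
import math

# Limits as quoted in this review. Treating them as a long-edge cap and a
# total-pixel cap is an assumption, not documented API behavior.
MAX_LONG_EDGE_PX = 2576
MAX_PIXELS = 3_750_000  # 3.75 MP


def fit_for_upload(width, height):
    """Scale (width, height) down to satisfy both caps, keeping aspect ratio.

    Images already within both caps are returned unchanged (no upscaling).
    """
    edge_scale = MAX_LONG_EDGE_PX / max(width, height)
    area_scale = math.sqrt(MAX_PIXELS / (width * height))
    scale = min(1.0, edge_scale, area_scale)
    # Floor so rounding can never push the result back over a cap.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


for size in [(1920, 1080), (5152, 2898), (4000, 4000)]:
    print(size, "->", fit_for_upload(*size))
```

Wide screenshots tend to hit the edge cap first, while near-square images hit the pixel budget first, which is why the sketch takes the smaller of the two scale factors.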

Benchmark Snapshot

SWE-Bench Pro — 64.3% (SOTA) Real-world software engineering. The highest score any model has ever posted — beating GPT-5.4 (57.7%), Kimi K2.6 (58.6%), and Opus 4.6 (53.4%). The gap is enormous.
CursorBench — 70% Real IDE coding sessions with actual developers. Opus 4.7 jumped 12 points over Opus 4.6 (58%) — the kind of improvement that changes which model people reach for daily.
SWE-Bench Verified — 87.6% Curated subset of SWE-Bench with verified solutions. Opus 4.7 leads all models, up from Opus 4.6's 80.8%.
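
The SWE-Bench Pro margins follow directly from the scores quoted in this review. A short sketch of the arithmetic, using only those four numbers (the helper name lead_over is ours, for illustration):

```python
# SWE-Bench Pro scores as quoted in this review (percent).
SWE_BENCH_PRO = {
    "Opus 4.7": 64.3,
    "Kimi K2.6": 58.6,
    "GPT-5.4": 57.7,
    "Opus 4.6": 53.4,
}


def lead_over(scores, leader):
    """Return the leader's margin over each other model, in percentage points."""
    top = scores[leader]
    return {
        name: round(top - score, 1)
        for name, score in scores.items()
        if name != leader
    }


gaps = lead_over(SWE_BENCH_PRO, "Opus 4.7")
# Print the closest competitor first.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"Opus 4.7 leads {name} by {gap} points")
```

That works out to 5.7 points over Kimi K2.6, 6.6 over GPT-5.4, and 10.9 over Opus 4.6.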

Honest Limitations

Token inflation: The new tokenizer increases real costs 15–35% on code-heavy prompts compared to Opus 4.6 at the same nominal pricing ($5/$25 per million tokens). Budget accordingly for long sessions.
‘Lazy’ on easy prompts: Adaptive reasoning means it sometimes under-invests effort on straightforward requests. Power users report needing to explicitly set higher effort levels to get the thoroughness Opus 4.6 gave by default.
Long-context regressions: Some users report weaker recall in the 100K–1M token range compared to 4.6, particularly ‘lost in the middle’ issues. The 1M context window is still there, but recall degrades more noticeably as prompts grow than it did on 4.6.
Heavier safety guardrails: Enhanced cybersecurity protections block certain high-risk code patterns. Legitimate security researchers may hit false positives.
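
To put the token-inflation point in budget terms, here is a rough estimate at the quoted $5/$25 per million tokens. It is a sketch, not a billing formula: it assumes the 15–35% inflation applies equally to input and output tokens, and the session size (2M input, 400K output) is an illustrative figure, not a measurement.

```python
# Nominal pricing quoted in this review, in USD per million tokens.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0


def session_cost(input_tokens, output_tokens, inflation=0.0):
    """Estimate USD cost when old-tokenizer token counts grow by `inflation`.

    Assumption: the same inflation factor applies to input and output.
    """
    factor = 1.0 + inflation
    billed_input = input_tokens * factor
    billed_output = output_tokens * factor
    return (
        billed_input * INPUT_USD_PER_MTOK + billed_output * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Illustrative long coding session, counted with the older tokenizer.
baseline = session_cost(2_000_000, 400_000)
low = session_cost(2_000_000, 400_000, inflation=0.15)
high = session_cost(2_000_000, 400_000, inflation=0.35)

print(f"baseline ${baseline:.2f}, inflated ${low:.2f} to ${high:.2f}")
```

Under these assumptions a $20 session lands between roughly $23 and $27, so the nominal price is unchanged but the bill is not.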

The Verdict: The undisputed coding king, with an asterisk. On hard engineering problems, Opus 4.7 is in a league of its own. The 6.6-point SWE-Bench Pro gap over GPT-5.4 is the largest we’ve seen between any two frontier models this year. But Anthropic optimized this model for one thing, and it shows: on simple prompts it can feel ‘lazier’ than 4.6, and the token costs are real. Use it for the hard stuff. For quick questions, Sonnet is still your better bet.
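
One way to act on that advice is a small routing rule in front of your model calls. The sketch below is purely illustrative: the model labels, the effort labels other than the ‘xhigh’ tier named in this review, and the thresholds are all hypothetical, and it does not call any real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # hypothetical label, not an official model ID
    effort: str  # hypothetical label; "xhigh" is the tier named in this review


def pick_route(files_touched: int, needs_tools: bool, is_quick_question: bool) -> Route:
    """Pick a model and effort level from coarse task signals.

    The thresholds are illustrative guesses, not tuned values.
    """
    # Quick, self-contained questions go to the cheaper model.
    if is_quick_question and not needs_tools and files_touched <= 1:
        return Route(model="sonnet", effort="default")
    # Multi-file or tool-driven work gets the deepest reasoning tier.
    if needs_tools or files_touched > 3:
        return Route(model="opus-4.7", effort="xhigh")
    # Everything else: set effort explicitly instead of relying on adaptive defaults.
    return Route(model="opus-4.7", effort="high")


print(pick_route(files_touched=0, needs_tools=False, is_quick_question=True))
print(pick_route(files_touched=8, needs_tools=True, is_quick_question=False))
```

Setting effort explicitly in the middle case reflects the ‘lazy on easy prompts’ reports: the rule avoids depending on the model to decide how hard to think.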