GPT-5.5

OpenAI · Released April 23, 2026

9.8/10 Overall Rating

What It Actually Is

Here’s the thing about coding AI in 2026: the benchmarks that used to define the field no longer measure what matters most. SWE-Bench Pro tests whether a model can fix a single GitHub issue cleanly. That’s important, but it’s not what most developers actually need. Most developers need a model that can take a vague ticket, explore a messy repo, plan an approach, use tools, write code across multiple files, test it, and iterate until it works. That’s what Terminal-Bench measures. And GPT-5.5 owns it.

Terminal-Bench 2.0 at 82.7% isn’t just a number: it’s a 13.3-point lead over Claude Opus 4.7 (69.4%). Expert-SWE at 73.1% means GPT-5.5 solves most tasks that take senior engineers a median of 20 hours. And it does this while using 40% fewer output tokens than GPT-5.4, so your Codex sessions are faster and the doubled per-token price is partly offset, though not erased. The agentic era of coding, where you describe the problem and the model plans, executes, and verifies, isn’t a vision anymore. It’s a product, and GPT-5.5 in Codex is its clearest implementation.

Key Strengths

Terminal-Bench 2.0 — 82.7%: The benchmark for agentic coding and terminal workflows. GPT-5.5 crushes Opus 4.7 (69.4%) and Gemini 3.1 Pro (68.5%) by double-digit margins. This tests what actually matters: give the model a messy task in a real terminal and see if it finishes.
Expert-SWE — 73.1%: Tasks that take senior engineers a median of 20 hours. GPT-5.5 solves 73.1% of them, up from GPT-5.4’s 68.5%. This is the benchmark that separates ‘good autocomplete’ from ‘actual engineering partner.’
FrontierMath Tier 4 — 35.4%: The hardest tier of mathematical reasoning. Opus 4.7 scores 22.9% and Gemini 16.7%, so GPT-5.5 leads by at least 12.5 points, which matters when debugging novel algorithmic problems.
40% fewer output tokens: Same latency as GPT-5.4, but it communicates more efficiently. On Codex tasks, this translates to real speed gains and offsets much of the doubled per-token price: 60% of the tokens at twice the price is roughly 1.2x the output cost per task, not 2x.
1M context + Codex integration: Load entire monorepos. The model reads your architecture, understands your patterns, and writes code that fits, not generic boilerplate. The 1M-token window is the model’s maximum; inside Codex the window is 400K, with native screen reading and tool use.
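The token-efficiency point above is easy to sanity-check with arithmetic. The sketch below uses only the two figures quoted in this review (a doubled per-token price and 40% fewer output tokens) and assumes, as a simplification not taken from any OpenAI source, that only output tokens are counted. `relative_output_cost` is a hypothetical helper name.

```python
def relative_output_cost(price_multiplier: float, token_multiplier: float) -> float:
    """Output cost per task relative to the previous model (1.0 means unchanged)."""
    return price_multiplier * token_multiplier


# Figures quoted in the review: 2x per-token price, 40% fewer output tokens.
# Assumption: input-token spend is ignored in this simplified comparison.
ratio = relative_output_cost(price_multiplier=2.0, token_multiplier=0.6)
print(f"Output cost per task vs GPT-5.4: {ratio:.2f}x")
```

Under these assumptions, output spend per task lands at about 1.2x GPT-5.4’s rather than 2x. Input tokens are not reduced by this effect, so a session dominated by large context loads will sit closer to the full 2x.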

Benchmark Snapshot

Terminal-Bench 2.0 — 82.7%: Agentic coding and terminal workflows. 13+ points ahead of Opus 4.7 (69.4%), the biggest gap on any major coding benchmark.
Expert-SWE — 73.1%: Long-horizon engineering tasks (20-hour median). Up from GPT-5.4’s 68.5%. Proves the model can sustain quality across complex, multi-session work.
SWE-Bench Pro — 58.6%: Production-level GitHub issues. Improved from 57.7%, but Claude Opus 4.7 still leads at 64.3%. The honest gap.
FrontierMath Tier 4 — 35.4%: Hardest mathematical reasoning tier. 12.5 points ahead of Opus 4.7 (22.9%). Critical for novel algorithm design.
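The margins in this snapshot can be recomputed directly from the quoted scores. A minimal sketch follows; the `scores` table simply restates the numbers in this review, and `margin` is a hypothetical helper, not part of any benchmark tooling.

```python
# Scores (%) exactly as quoted in this review.
scores = {
    "Terminal-Bench 2.0": {"GPT-5.5": 82.7, "Claude Opus 4.7": 69.4},
    "SWE-Bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.7": 64.3},
    "FrontierMath Tier 4": {"GPT-5.5": 35.4, "Claude Opus 4.7": 22.9},
}


def margin(benchmark: str, model: str = "GPT-5.5", rival: str = "Claude Opus 4.7") -> float:
    """Percentage-point lead of `model` over `rival` (negative means it trails)."""
    row = scores[benchmark]
    return round(row[model] - row[rival], 1)


for name in scores:
    print(f"{name}: {margin(name):+.1f} points")
```

The quoted figures work out to +13.3 points on Terminal-Bench 2.0, -5.7 on SWE-Bench Pro, and +12.5 on FrontierMath Tier 4, which matches the gaps described above.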

Honest Limitations

SWE-Bench Pro — 58.6%: Claude Opus 4.7 still leads at 64.3%. For narrow, high-stakes single-issue debugging and architecturally complex refactors, Opus remains the depth king. GPT-5.5 wins the workflow; Opus wins the scalpel.
API pricing doubled: Announced rates are $5 per million input tokens and $30 per million output tokens, with Pro at $30/$180. The token efficiency helps, but long agentic sessions still add up. Budget before you build.
API not live yet: At launch, GPT-5.5 is in ChatGPT and Codex only. API access is coming ‘very soon’ — if you build automated pipelines, you’re waiting.
Hallucination caution: One early independent report flagged elevated hallucination rates on omniscience evaluations. For production code that touches safety-critical systems, pair with thorough review.
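To make "budget before you build" concrete, here is a rough estimator built on the per-million-token prices quoted above. The tier labels `"gpt-5.5"` and `"gpt-5.5-pro"` are illustrative dictionary keys, not confirmed API model IDs, and the session sizes are hypothetical.

```python
# USD per million tokens, as quoted in this review.
# The keys are illustrative labels, not confirmed API model identifiers.
PRICES = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "gpt-5.5-pro": {"input": 30.00, "output": 180.00},
}


def session_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one session at the quoted per-million-token rates."""
    price = PRICES[tier]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical agentic session: 2M input tokens read, 150K output tokens written.
standard = session_cost("gpt-5.5", 2_000_000, 150_000)
pro = session_cost("gpt-5.5-pro", 2_000_000, 150_000)
print(f"Standard: ${standard:.2f} per session, Pro: ${pro:.2f} per session")
```

For that hypothetical session the estimate is $14.50 on the standard tier and $87.00 on Pro. Multiply by expected sessions per day to see how quickly long agentic runs add up.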

The Verdict: The agentic coding king. GPT-5.5 doesn’t win every narrow benchmark (Opus 4.7 still owns SWE-Bench Pro depth), but it dominates the category that matters for most developers in 2026: getting complex, ambiguous, multi-file work across the finish line with minimal babysitting. Terminal-Bench 2.0 at 82.7% is the headline, but the real story is Expert-SWE at 73.1% on tasks that take humans a median of 20 hours. Give it a messy repo and walk away. It won’t match Opus for surgical refactors, but for the full spectrum of ‘give me a working solution,’ from terminal workflows to multi-file debugging to tool-using agents, it’s the strongest all-rounder available.