GPT-5.4 — Thinking
By OpenAI
What It Actually Is
Here’s the thing about generalist models: they’re not supposed to beat specialists. GPT-5.4 Thinking breaks that rule. OpenAI’s unified frontier model wasn’t designed as a coding tool — it was designed as an everything tool — and yet it matches or edges out purpose-built coding models on the benchmarks that matter most. It’s the equivalent of a decathlete who also happens to hold the 100m world record. GPT-5.4 doesn’t just complete your function; it thinks through the architecture, plans multi-file edits, uses tools to search documentation, and executes agentic coding tasks that span hours — not minutes. With a 1M context window and native tool-use that cuts token consumption by 47%, it can hold your entire monorepo in working memory while costing less per task than you’d expect.
Key Strengths
- SWE-Bench Pro 57.7%: The most demanding software engineering benchmark, testing complex real-world issues from production repositories. This edges GPT-5.3-Codex’s 56.8% — a generalist model outperforming a specialist.
- 1M token context window: Roughly 750,000 words of code and documentation in a single session. Load entire codebases and reason across them without chunking or summarization loss.
- 47% token savings: Native tool-search cuts redundant context, so agentic workflows burn fewer tokens. Real-world cost per task drops despite higher per-token pricing.
- 1.5x faster in Codex: Token velocity improvements mean coding tasks complete noticeably faster. Testers report solving complex bugs in hours that previously took days.
- Spreadsheet modeling 87.3%: Up from GPT-5.2’s 68.4%. Financial modeling, data transformation, and formula generation are dramatically improved — useful for data-heavy codebases.
Benchmark Scores
- SWE-Bench Pro — 57.7%: Production-level software engineering benchmark. Edges GPT-5.3-Codex (56.8%) as a generalist model — the first time a non-specialist has held this spot.
- GPQA Diamond — 92.8%: PhD-level reasoning for complex architectural decisions and debugging. Near-ceiling performance.
- ARC-AGI-2 — 73.3% (83.3% Pro): Novel reasoning benchmark — the model solves problems it has never seen in training data. Critical for debugging novel patterns.
- GDPval — 83.0%: Real-world professional task performance across 44 occupations, demonstrating broad utility beyond coding alone.
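The "47% token savings" claim under Key Strengths can be sanity-checked with simple arithmetic. A minimal sketch: the $2.50/M input and $15/M output rates are the standard-tier prices quoted under Honest Limitations, but the per-task token counts are hypothetical, and the sketch assumes the savings applies uniformly to input and output tokens (the source doesn't specify).

```python
# Back-of-envelope cost per agentic task, with and without tool-search savings.
# Rates are the quoted standard tier; token counts below are illustrative only.
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token
TOKEN_SAVINGS = 0.47             # claimed reduction from native tool-search

def task_cost(input_tokens: int, output_tokens: int, savings: float = 0.0) -> float:
    """Dollar cost of one task, optionally applying a token-savings factor."""
    factor = 1.0 - savings
    return input_tokens * factor * INPUT_RATE + output_tokens * factor * OUTPUT_RATE

# Hypothetical long session: 800K input tokens, 60K output tokens.
print(f"without savings: ${task_cost(800_000, 60_000):.2f}")                 # $2.90
print(f"with 47% cut:    ${task_cost(800_000, 60_000, TOKEN_SAVINGS):.2f}")  # $1.54
```

Under these assumed numbers, the savings roughly halve the per-task bill, which is how a higher per-token price can still yield a lower real-world cost per task.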
Honest Limitations
- Higher API costs: $2.50/M input, $15/M output. Pro tier at $30/M input, $180/M output. Long agentic sessions add up fast — budget before you build.
- Cyber safeguard friction: Security-related code (pen testing, exploit analysis) can trigger false positives in the safety system. Legitimate security work occasionally hits walls.
- 1M context at 2x rate in Codex: The full context window counts tokens at double rate in the Codex environment. Your 1M window effectively costs like 2M.
- Breadth vs. depth: Despite benchmark-topping numbers, purpose-built models like Opus 4.6 still produce more architecturally coherent code on sprawling refactors. GPT-5.4 wins on breadth; Opus wins on depth.
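The "1M context at 2x rate" caveat is easiest to see in dollars. A minimal sketch using the quoted standard-tier input rate; the fully loaded window is a hypothetical worst case, not a figure from the review:

```python
# Tokens billed through the full context window in Codex count at double
# rate, so a maxed-out 1M-token prompt is priced like 2M tokens.
INPUT_RATE = 2.50 / 1_000_000  # dollars per input token, standard tier
CODEX_MULTIPLIER = 2           # double-rate billing for full-window usage

context_tokens = 1_000_000     # entire monorepo loaded into context

nominal = context_tokens * INPUT_RATE
effective = context_tokens * CODEX_MULTIPLIER * INPUT_RATE
print(f"nominal input cost:   ${nominal:.2f}")    # $2.50
print(f"effective Codex cost: ${effective:.2f}")  # $5.00
```

In other words, budgeting a long Codex session off the headline per-token price will underestimate input costs by a factor of two whenever the full window is in play.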
The Verdict: The surprise new #1. GPT-5.4 Thinking wasn’t designed to be a coding model, but its SWE-Bench Pro score, 1M context window, and native tool-use make it the most complete coding assistant available. It won’t match Opus 4.6’s architectural depth on massive refactors, but for the full spectrum of professional coding tasks — from quick fixes to multi-hour agentic sessions — it’s the new default. Use it for breadth, call Opus for depth.