ChatGPT — GPT‑5.4 Thinking
By OpenAI · Updated
What It Actually Is
If the history of AI were a rock band, ChatGPT would be The Beatles: not necessarily the most technically sophisticated at every moment, but the one that changed what everyone expected music to sound like. GPT-5.4 Thinking is OpenAI’s unified frontier model, and it represents a genuine generational leap: one model that blends reasoning, coding, and agentic execution into a single thinking engine that plans before it acts. Think of it as upgrading from a very smart assistant to a very smart colleague. GPT-5.4 doesn’t just answer questions. It thinks through multi-step problems, uses tools on its own, operates your computer when needed, and executes tasks that used to require multiple models and manual orchestration. It reads, writes, generates images, browses the web, and runs code, and it now does all of this with 33% fewer hallucinations and a 1M-token context window.
Key Strengths
- GDPval dominance (83.0%): Tested across 44 real-world occupations, from legal analysis to financial modeling, GPT-5.4 beats GPT-5.2’s 70.9% by 12.1 points. This isn’t a benchmark designed in a lab; it measures whether the model actually helps professionals do their jobs.
- Computer use that beats humans: OSWorld-Verified score of 75.0%, compared to the human baseline of 72.4%. The model can navigate desktop applications, fill forms, and execute multi-step workflows autonomously across your screen.
- Thinking that saves you money: A new tool-search mechanism cuts token usage by 47%, and the 1M-token context window means you can throw entire projects at it without chunking. Because fewer tokens are billed per task, real-world cost can drop even though per-token pricing is higher.
- 33% fewer hallucinations: OpenAI’s most significant reliability improvement. When GPT-5.4 doesn’t know, it’s measurably more likely to say so rather than confidently fabricating an answer.
- Ecosystem breadth: Available on web, iOS, Android, desktop apps, and via API. Custom GPTs, plugin store, and Codex integration extend it for niche tasks.
- GDPval (83.0%): Real-world professional task performance across 44 occupations. 12.1 points above GPT-5.2 (70.9%), the largest single-generation jump on this benchmark.
- GPQA Diamond (92.8%): PhD-level science exam with 198 questions. Near-ceiling performance on graduate-level reasoning.
- OSWorld-Verified (75.0%): Computer-use benchmark where the model operates desktop applications. Human baseline is 72.4%, so GPT-5.4 exceeds it.
- ARC-AGI-2 (73.3% / 83.3% Pro): Novel reasoning benchmark testing pattern recognition on tasks never seen in training data.
Honest Limitations
- Pricing jump: API costs rise to $2.50/M input and $15/M output (GPT-5.2 was $1.75/$14). Pro tier is $30/M input / $180/M output. The best performance costs genuinely more.
- Long-context accuracy dips: At 512K–1M tokens, accuracy on the MRCR v2 benchmark drops to 36.6%. The 1M context window exists, but don’t trust it blindly at the far end.
- Cyber safeguard false positives: Enhanced safety systems occasionally block legitimate security-related prompts. If you work in cybersecurity, expect friction.
- Gradual rollout: Not everything is available to everyone yet. GPT-5.2 is scheduled for retirement in June 2026 — plan your migration.
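To see how the higher per-token prices and the 47% token reduction quoted above might net out, here is a rough back-of-the-envelope sketch. It rests on a simplifying assumption that is not from OpenAI: that the 47% reduction applies equally to input and output tokens for a given task. The model keys and the function names `cost_usd` and `compare` are illustrative labels only, not API identifiers.

```python
# Prices in USD per million tokens, as quoted in this review.
# The dictionary keys are plain labels for this sketch, not API model IDs.
PRICES = {
    "gpt-5.2": {"input": 1.75, "output": 14.00},
    "gpt-5.4": {"input": 2.50, "output": 15.00},
}


def cost_usd(model, input_tokens, output_tokens):
    """Cost of one workload at the per-million-token prices above."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def compare(input_tokens, output_tokens, token_reduction=0.47):
    """Compare GPT-5.2 cost with GPT-5.4 cost for the same task.

    Assumption (ours, for illustration): GPT-5.4 needs `token_reduction`
    fewer tokens on both the input and the output side.
    Returns (old_cost, new_cost, fractional_saving).
    """
    keep = 1.0 - token_reduction
    old = cost_usd("gpt-5.2", input_tokens, output_tokens)
    new = cost_usd("gpt-5.4", input_tokens * keep, output_tokens * keep)
    return old, new, 1.0 - new / old


if __name__ == "__main__":
    old, new, saving = compare(1_000_000, 1_000_000)
    print(f"GPT-5.2: ${old:.2f}  GPT-5.4: ${new:.2f}  saving: {saving:.1%}")

    # With no token reduction at all, the newer model is simply pricier.
    _, _, no_reduction = compare(1_000_000, 1_000_000, token_reduction=0.0)
    print(f"saving with no token reduction: {no_reduction:.1%}")
```

Under that assumption, a workload of one million input and one million output tokens comes out roughly 41% cheaper on GPT-5.4, while a workload that sees no token reduction costs about 11% more. Your own mix of input and output tokens, and how much of the 47% you actually realize, will move the result.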
The Verdict: The default choice, upgraded. GPT-5.4 Thinking doesn’t change ChatGPT’s personality — it’s still the Swiss Army knife you know — but it sharpens every blade. The 12-point GDPval jump and human-beating computer use make it the clearest upgrade in everyday AI since GPT-4 arrived. If you subscribe to one AI, this remains the safe, capable pick — but now it’s more like a Lexus than a Toyota.