GPT-5.5
Coding
The agentic coding model that doesn't just autocomplete: it plans, tools up, debugs across files, and finishes the messy repo task while you walk the dog. Terminal-Bench 82.7% isn't a typo.
Terminal-Bench 2.0 82.7% (crushes Opus 4.7's 69.4%); Expert-SWE 73.1% on 20-hour human tasks; FrontierMath Tier 4 35.4%; ~40% fewer output tokens; 1M context with native tool use and Codex integration.
2× API price ($5/$30 per 1M tokens); trails Claude Opus 4.7 on SWE-Bench Pro (58.6% vs 64.3%); API not live at launch; early hallucination reports need verification.
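To put the pricing in perspective, here is a rough cost sketch. It assumes "$5/$30 per 1M tokens" means $5 per 1M input tokens and $30 per 1M output tokens. The function name and the token counts are hypothetical, chosen only to show how a 40% cut in output tokens would move the bill for a single task.

```python
# Assumption: $5 per 1M input tokens and $30 per 1M output tokens,
# read from the "$5/$30 per 1M tokens" figure above.
INPUT_USD_PER_MILLION = 5.0
OUTPUT_USD_PER_MILLION = 30.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost in USD for one request (hypothetical helper)."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical workload: 200k input tokens, 50k output tokens.
baseline_cost = estimate_cost_usd(200_000, 50_000)

# Same task with 40% fewer output tokens (50k -> 30k), for illustration only.
trimmed_cost = estimate_cost_usd(200_000, 30_000)

print(f"baseline: ${baseline_cost:.2f}")
print(f"with 40% fewer output tokens: ${trimmed_cost:.2f}")
```

Because output tokens cost six times as much as input tokens under this assumption, output-heavy agentic runs are where the token savings matter most.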