OpenAI's New GPT-5.5 Release, Performance and Benchmarks: Everything You Need to Know

OpenAI’s GPT-5.5 brings major gains in coding, tool use, knowledge work, and benchmark performance, with stronger agentic workflows, improved efficiency, and state-of-the-art results in several key evaluations.

By: Umar Farooq | Updated: April 28, 2026 | Original: April 28, 2026

OpenAI's New GPT-5.5 AI Model

OpenAI officially introduced GPT-5.5 on April 23, 2026, positioning it as a new frontier model for complex professional work rather than a routine incremental update. The company says GPT-5.5 is built to better understand messy multi-step goals, use tools more effectively, check its own work, and carry tasks through to completion. At launch, OpenAI said the model was rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with GPT-5.5 Pro going to Pro, Business, and Enterprise users, while API availability rolled out through the model docs and developer pages.


GPT-5.5’s Release Is About Workflow, Not Just Raw Chat Quality

The clearest theme in OpenAI’s launch materials is that GPT-5.5 is supposed to be better at real work across tools. In both the product announcement and the system card, OpenAI describes it as a model designed for coding, research, data analysis, document creation, spreadsheets, and other long-horizon tasks where the model has to plan, use tools, verify steps, and finish the job with less hand-holding. That is an important distinction because the release is framed less as “better chatbot answers” and more as “stronger agentic execution.”

OpenAI’s API documentation reinforces that positioning. The company lists GPT-5.5 as its newest frontier model for the most complex professional work, marks its reasoning level as “highest,” describes it as “fast,” and gives it a 1,050,000-token context window. The model page also shows support for multiple reasoning effort settings, from none through xhigh, which matters because several of the published benchmark gains depend on higher reasoning settings.
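
For developers, those settings surface directly in the API. The snippet below is a minimal sketch of a request at a high reasoning setting using the OpenAI Python SDK's Responses API; the model identifier gpt-5.5 and the xhigh effort value follow this article's reading of OpenAI's docs and are illustrative rather than verified against a live endpoint.

```python
# Minimal sketch: one Responses API call at a high reasoning effort.
# The model ID "gpt-5.5" and the "xhigh" effort value are taken from
# the docs as described above; confirm both on the live model page.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.5",                # frontier model ID per the launch docs
    reasoning={"effort": "xhigh"},  # several published gains depend on higher effort
    input="Plan and draft a migration guide for our v2 API.",
)

print(response.output_text)
```

Because several of the published benchmark gains depend on higher effort settings, that knob is worth treating as part of any comparison rather than as a fixed default.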

GPT-5.5 benchmark comparison with Claude Opus 4.7 and Gemini 3.1 Pro

Release Timing, Access, and Safety Posture

OpenAI says GPT-5.5 launched with what it calls its strongest safeguards to date, after evaluation under its preparedness and safety frameworks, internal and external red-teaming, and feedback from nearly 200 trusted early-access partners. The system card says GPT-5.5 Pro is the same underlying model in a configuration that uses more parallel test-time compute, meaning some GPT-5.5 results serve as proxies for Pro except where OpenAI judged that separate evaluation was necessary.
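
OpenAI does not detail how that parallel test-time compute is orchestrated, but the description loosely resembles best-of-n sampling, where several candidates are generated independently and a selection rule keeps the strongest one. The toy sketch below illustrates only that general pattern; the generate and score functions are hypothetical stand-ins, not OpenAI's method.

```python
# Toy illustration of parallel test-time compute as best-of-n sampling.
# This is NOT OpenAI's published method for GPT-5.5 Pro; generate() and
# score() are hypothetical stand-ins for a model call and a quality judge.
import concurrent.futures
import random

def generate(prompt: str) -> str:
    # Stand-in for one independent model sample (one API call in practice).
    return f"{prompt} -> draft #{random.randint(0, 9999)}"

def score(candidate: str) -> float:
    # Stand-in for a quality signal: a reward model, verifier, or vote.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample n candidates in parallel, then keep the highest-scoring one.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, [prompt] * n))
    return max(candidates, key=score)

print(best_of_n("Explain the failure mode in this stack trace"))
```

Spending more compute at inference time this way trades latency and cost for answer quality, which is consistent with Pro being positioned as the higher-tier configuration.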

That matters because OpenAI is clearly presenting GPT-5.5 as a model meant for broader deployment into real workflows, not just as a lab demo. The release notes describe it as especially useful for coding, online research, document-grounded work, multilingual retrieval, and producing polished artifacts such as reports, spreadsheets, and plans.


Performance: Where GPT-5.5 Appears Strongest

The strongest performance story in OpenAI’s own materials is agentic coding. In the launch post, OpenAI says GPT-5.5 is its strongest agentic coding model to date and highlights gains on terminal workflows, GitHub issue resolution, and long-horizon engineering tasks. The company also says GPT-5.5 often reaches higher-quality outputs with fewer tokens and fewer retries, which is an efficiency claim, not just an intelligence claim.

On Terminal-Bench 2.0, OpenAI reports 82.7% for GPT-5.5 versus 75.1% for GPT-5.4, while Claude Opus 4.7 and Gemini 3.1 Pro are listed at 69.4% and 68.5%, respectively. On OpenAI’s internal Expert-SWE eval for long-horizon coding tasks, GPT-5.5 is listed at 73.1% versus 68.5% for GPT-5.4. These are large enough gaps to support OpenAI’s claim that the new model is materially stronger for tool-heavy engineering work.

GPT-5.5 also posts gains on several tool-use and productivity benchmarks. OpenAI reports 84.9% on GDPval (wins or ties) versus 83.0% for GPT-5.4, 78.7% on OSWorld-Verified versus 75.0%, 55.6% on Toolathlon versus 54.6%, and 84.4% on BrowseComp versus 82.7%. On Tau2-bench Telecom (original prompts), GPT-5.5 is listed at 98.0% versus 92.8% for GPT-5.4. Together, those numbers suggest the gains are not limited to code generation; they extend into search, tool use, and structured knowledge-work tasks.


Academic and Reasoning Benchmarks Also Improved

OpenAI’s published academic-style evaluations show a similar pattern of improvement over GPT-5.4, though not universal dominance over every competitor. In the launch materials, GPT-5.5 is listed at 25.0% on GeneBench, compared with 19.0% for GPT-5.4, and 33.2% for GPT-5.5 Pro. On FrontierMath Tier 1–3, GPT-5.5 is listed at 51.7% versus 47.6% for GPT-5.4, while on FrontierMath Tier 4 it rises to 35.4% from 27.1%. OpenAI also reports 80.5% on BixBench and 93.6% on GPQA Diamond.

On Humanity’s Last Exam, the picture is more mixed. OpenAI reports GPT-5.5 at 41.4% without tools and 52.2% with tools. Those are slight gains over GPT-5.4’s 39.8% and 52.1%, but they do not top every competitor in the table; Claude Opus 4.7 and Gemini 3.1 Pro are listed higher on the no-tools version. That is a good reminder that GPT-5.5’s story is strongest in workflow-heavy and agentic tasks, not necessarily in every static benchmark category.


Health and Safety-Related Evaluation Results

The deployment safety hub also shows measurable gains over GPT-5.4 on some health-oriented evaluations. GPT-5.5 recorded a length-adjusted HealthBench score of 56.5, up from 54.0 for GPT-5.4, while HealthBench Hard improved to 31.5 from 29.1. HealthBench Professional rose to 51.8, up 3.7 points relative to GPT-5.4, while HealthBench Consensus was effectively flat-to-slightly-down at 95.6 versus 96.3. OpenAI characterizes this as generally improved performance on HealthBench, HealthBench Hard, and HealthBench Professional, with Consensus roughly flat.

These are not consumer-facing “medical approval” metrics, but they do matter because they show OpenAI is benchmarking the model in areas where answer quality and calibration can be high stakes. The modest-but-real gains also fit the broader GPT-5.5 pattern: steady improvement rather than one dramatic leap on every single axis.


Benchmarks Need Context, and OpenAI’s Own Tables Show That

One of the more useful things about OpenAI’s release materials is that they do not show GPT-5.5 winning every benchmark. For example, the launch tables list Claude Opus 4.7 ahead on SWE-Bench Pro, and Gemini 3.1 Pro ahead on BrowseComp and MCP Atlas. OpenAI also includes a footnote noting that labs have reported evidence of memorization on SWE-Bench-style evaluations, which is an important caveat for interpreting those numbers.

That makes the fairest reading of GPT-5.5 more nuanced than “best at everything.” A more accurate summary is that GPT-5.5 looks particularly strong where tasks involve planning, tool use, long context, ambiguity handling, and multi-step completion, while some rival models still post stronger numbers on specific benchmark slices.


Efficiency May Be One of the Biggest Practical Upgrades

OpenAI repeatedly emphasizes that GPT-5.5 is not only stronger, but also more efficient. The announcement says it often reaches better outputs with fewer tokens and fewer retries, and the API model page lists pricing of $5 per million input tokens and $30 per million output tokens. OpenAI also says, via its launch post, that on Artificial Analysis’s Coding Index GPT-5.5 delivers state-of-the-art intelligence at roughly half the cost of competing frontier coding models. That specific cost comparison is OpenAI’s characterization of an external index, so it is best treated as a directional claim rather than a standalone audited conclusion.
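
To put those list prices in concrete terms, here is a back-of-the-envelope cost calculation; only the per-million-token rates come from the model page, while the token counts are invented workload numbers.

```python
# Back-of-the-envelope job cost at the listed GPT-5.5 API rates.
# Rates come from the model page; the token counts are made up.
INPUT_PRICE_PER_M = 5.00    # USD per million input tokens
OUTPUT_PRICE_PER_M = 30.00  # USD per million output tokens

input_tokens = 2_000_000    # e.g., a large codebase plus instructions
output_tokens = 150_000     # e.g., patches, reports, and tool calls

cost = (input_tokens / 1e6) * INPUT_PRICE_PER_M \
     + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M
print(f"Estimated job cost: ${cost:.2f}")  # -> Estimated job cost: $14.50
```

Note that output tokens cost six times as much as input tokens at these rates, so the "fewer tokens, fewer retries" claim, if it holds, translates directly into lower bills.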

For users, that efficiency angle may be as important as the raw benchmark gains. If GPT-5.5 needs less steering, wastes fewer steps, and can preserve context over very large working sets, the improvement shows up not only in leaderboard scores but in less friction during real tasks.


Bottom Line on GPT-5.5

GPT-5.5 is a real OpenAI release, not just a rumor, and the official materials support the idea that it is a meaningful upgrade for coding, tool use, long-context work, and professional task completion. Its biggest benchmark wins are in areas like Terminal-Bench 2.0, GeneBench, and several productivity- and tool-oriented evals, while some competitors still lead on specific public tests. Overall, the strongest evidence suggests GPT-5.5 is less about chasing a single headline score and more about becoming a better model for getting real work done across tools and longer workflows.