GPT-5.5 Scores 93/100 in ZDNet Test Analysis

GPT-5.5 scored 93 out of 100 in ZDNet’s structured 10-round benchmark, earning full marks in five of ten categories, including mathematical reasoning and literary analysis.
The model failed to follow a specific sourcing instruction in Test 1, drawing from AP, The Wall Street Journal, The Guardian, The Sun, and Wikipedia instead of the instructed Yahoo News.
GPT-5.5 is currently available only to ChatGPT Plus subscribers and above; the review tested it at the Standard Thinking effort level.
The reviewer flagged the instruction-adherence failure as a direct concern for autonomous AI agent deployments on multi-step tasks.

What Happened

OpenAI’s GPT-5.5, positioned by the company as an improvement over GPT-5.4 in agentic coding, conceptual clarity, scientific research ability, and knowledge-work accuracy, scored 93 out of 100 in a structured 10-round evaluation administered by ZDNet contributor David Gewirtz and published in April 2026.

Gewirtz used ChatGPT Plus at the Standard Thinking effort level—the configuration currently available to paid subscribers. The model earned perfect scores in five of ten categories but lost seven points across two tests, primarily for expanding beyond the explicit boundaries of given prompts.

Why It Matters

The evaluation captures a model released during one of the most compressed upgrade cycles in OpenAI’s public history: GPT-5.5 arrived within days of both GPT-5.4 and the ChatGPT Images 2.0 launch, and OpenAI attributes the accelerated timeline to its internal use of AI-assisted coding.

The instruction-following failure documented in the test bears directly on industry debates about whether current large language models are reliable enough to power autonomous agents on multi-step, low-oversight tasks—an area all major AI labs have made a central product priority in 2026.

Technical Details

GPT-5.5 earned full marks in five test categories: mathematical pattern recognition, academic concept explanation, cultural argumentation, literary theme analysis, and structured reasoning. It dropped five points in the news summarization test after drawing from AP, The Sun, The Wall Street Journal, The Guardian, and Wikipedia instead of Yahoo News, the one source the prompt specified.

In the literary analysis test, the model produced a 632-word response identifying and explaining ten distinct themes in George R.R. Martin’s A Game of Thrones, including power and its cost, the collapse of heroic fantasy ideals, honor versus pragmatism, and the human cost of war. In the travel itinerary category, GPT-5.5 scored 9 out of 10; its plan accounted for Boston’s March weather by recommending a mix of indoor and outdoor activities with fallback options.

Gewirtz noted the sourcing failure was penalized more severely than a comparable error from GPT-5.2: the earlier model lost one point for mixing two sources, while GPT-5.5 lost five points for substituting five distinct external outlets for the one explicitly specified.

Who’s Affected

ChatGPT Plus subscribers and development teams building autonomous pipelines on OpenAI’s API are most directly affected by the instruction-adherence gap documented in the evaluation. Gewirtz wrote: “There’s a big push from all the AI companies about running autonomous agents. But if even a simple summary prompt can’t be followed correctly, it does not give me confidence that it’s safe to let agents run wild on long-horizon projects.”

Enterprise customers who have integrated GPT-5.5 into structured workflows—where prompt compliance is a functional requirement rather than a preference—face the most direct exposure to this behavior.
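
For teams in that position, one way to treat source compliance as a functional requirement is to check a model’s citations against an allowlist before accepting its output. The Python sketch below is illustrative only and is not drawn from Gewirtz’s review or from any OpenAI tooling: the function name, the allowlist, and the example URLs are all hypothetical, and it assumes the pipeline can already extract the cited URLs from a response.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the single outlet the prompt permitted.
ALLOWED_DOMAINS = {"yahoo.com"}


def off_source_urls(cited_urls, allowed_domains=ALLOWED_DOMAINS):
    """Return cited URLs whose host is neither an allowed domain nor a subdomain of one."""
    violations = []
    for url in cited_urls:
        # hostname is None for strings that are not absolute URLs; treat those as violations.
        host = (urlparse(url).hostname or "").lower()
        allowed = any(host == d or host.endswith("." + d) for d in allowed_domains)
        if not allowed:
            violations.append(url)
    return violations


# Example URLs are placeholders, not real articles.
citations = [
    "https://news.yahoo.com/example-story",
    "https://apnews.com/example-story",
]
print(off_source_urls(citations))  # ['https://apnews.com/example-story']
```

A check like this only catches citations that fall outside the permitted domains. It cannot confirm that the summary’s content actually came from the allowed source, so it would complement human review rather than replace it.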

What’s Next

Gewirtz indicated he will publish a follow-up evaluation covering GPT-5.5’s image generation capabilities through ChatGPT Images 2.0. OpenAI has not announced a timeline for making GPT-5.5 available on free-tier ChatGPT accounts, nor has the company detailed any engineering changes planned to address the instruction-following behavior identified in the review.
