GPT-5.4 Just Beat Humans at Real Office Work — 75% vs 72.4% on Desktop Tasks

OpenAI’s GPT-5.4 scored 75.0% on the OSWorld-Verified benchmark, a test of real desktop productivity tasks, surpassing the human baseline of 72.4%. That’s a 27.7 percentage point jump from GPT-5.2’s score of 47.3%. GPT-5.4 also features a 1-million-token context window and can autonomously execute multi-step workflows across files, browsers, and terminals.

What OSWorld-Verified Tests

OSWorld isn’t a language benchmark — it measures the ability to complete actual computer tasks. Test scenarios include:

Opening a spreadsheet, finding specific data, creating a chart, and exporting as PDF
Composing emails with attachments pulled from specific file paths
Navigating web interfaces to fill out forms with data from local documents
Multi-step file management: renaming, organizing, compressing, and uploading
Terminal operations: writing scripts, running commands, debugging errors

The “Verified” designation means tasks are checked by human evaluators, not just automated test harnesses. The 72.4% human baseline was established by giving the same tasks to office workers of varying skill levels.

What 75% Actually Means

Beating the human baseline doesn’t mean GPT-5.4 can do every office worker’s job. The benchmark measures task completion on isolated, well-defined tasks. Real office work involves:

Ambiguous requirements: “Make this look better” requires judgment the benchmark doesn’t test
Social context: Understanding office politics, team dynamics, and unstated preferences
Multi-day continuity: Tracking project state across weeks, not just completing one-off tasks
Error recovery: Handling unexpected situations that don’t fit the task description

Clearing the benchmark is a necessary condition for job replacement, not a sufficient one.

The 1-Million-Token Context Window

GPT-5.4’s context window holds approximately 750,000 words — enough to fit an entire novel, a year’s worth of email, or a complete codebase. This enables the model to:

Reference an entire project’s documentation while completing tasks
Process full datasets without chunking or summarization
Maintain context across long, complex workflows

For comparison: GPT-4o had 128K tokens, Arcee’s Trinity has 262K tokens, and Claude Opus 4.6 has 200K tokens. At 1M tokens, GPT-5.4 has the largest production context window of any frontier model.
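
To put those sizes side by side, here is a minimal Python sketch. It assumes "K" means 1,000 tokens and reuses the rough ratio implied above (about 0.75 words per token, since 1M tokens is quoted as roughly 750,000 words). Real tokenizers vary by language and content, so the word counts are estimates only, and the function names are illustrative.

```python
# Rough comparison of the context windows quoted above.
# Assumptions: "K" = 1,000 tokens, and about 0.75 words per token
# (the ratio implied by "1M tokens is roughly 750,000 words").
WORDS_PER_TOKEN = 0.75

CONTEXT_WINDOWS = {
    "GPT-5.4": 1_000_000,
    "Arcee Trinity": 262_000,
    "Claude Opus 4.6": 200_000,
    "GPT-4o": 128_000,
}


def approx_words(tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)


def size_ratio(model: str, baseline: str) -> float:
    """How many times larger `model`'s window is than `baseline`'s."""
    return CONTEXT_WINDOWS[model] / CONTEXT_WINDOWS[baseline]


if __name__ == "__main__":
    for name, tokens in CONTEXT_WINDOWS.items():
        ratio = size_ratio("GPT-5.4", name)
        print(
            f"{name:<16} {tokens:>9,} tokens  "
            f"~{approx_words(tokens):>7,} words  "
            f"(GPT-5.4 is {ratio:.1f}x)"
        )
```

Under these assumptions, GPT-5.4’s window works out to about 7.8 times GPT-4o’s and 5 times Claude Opus 4.6’s.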

The Jump From GPT-5.2

The 27.7 percentage point improvement from GPT-5.2 (47.3%) to GPT-5.4 (75.0%) is the largest single-generation improvement OpenAI has published on a real-world capability benchmark. For context, GPT-4 to GPT-4o was a 5-8 point improvement on most benchmarks. The GPT-5.2 to 5.4 jump suggests either an architectural change or a significant shift in training methodology. OpenAI hasn’t disclosed specifics.
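
A percentage point gain and a relative gain are easy to confuse, so here is a small Python sketch of the arithmetic using only the scores quoted in this article. The function names are illustrative.

```python
def point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute change between two scores, in percentage points."""
    return new_pct - old_pct


def relative_gain(old_pct: float, new_pct: float) -> float:
    """Change relative to the old score, expressed as a percentage."""
    return (new_pct - old_pct) / old_pct * 100


if __name__ == "__main__":
    gpt_52, gpt_54, human = 47.3, 75.0, 72.4

    # Absolute jump between model generations
    print(f"{point_gain(gpt_52, gpt_54):.1f} points")          # 27.7 points
    # The same jump measured against GPT-5.2's starting score
    print(f"{relative_gain(gpt_52, gpt_54):.1f}% relative")    # 58.6% relative
    # Margin over the human baseline
    print(f"{point_gain(human, gpt_54):.1f} points over human")  # 2.6 points over human
```

The 27.7-point jump is about a 58.6% relative improvement over GPT-5.2, while the margin over the human baseline is a much narrower 2.6 points.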

What This Means for Work

The practical question isn’t “can AI do my job?” but “which parts of my job can AI do reliably?” At 75% on desktop tasks, GPT-5.4 can handle routine, well-defined office work: data entry, document processing, email management, file organization, and simple analysis. The 25% failure rate means it still needs oversight. Research on cognitive surrender, which found that 79.8% of users follow AI advice even when it is wrong, makes that oversight requirement more critical, not less.
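
One way to see why a 25% failure rate still calls for oversight is to chain tasks together. The Python sketch below assumes each task succeeds independently with probability 0.75. That is a simplification for illustration: the benchmark reports per-task success and does not measure chained workflows, so read the numbers as a thought experiment rather than a result.

```python
def chain_success(per_task: float, n_tasks: int) -> float:
    """Probability that n independent tasks all succeed.

    Assumes every task has the same success rate and that failures
    are independent, which real workflows may not satisfy.
    """
    if not 0.0 <= per_task <= 1.0:
        raise ValueError("per_task must be a probability between 0 and 1")
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    return per_task ** n_tasks


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        print(f"{n:>2} tasks in a row: {chain_success(0.75, n):.1%}")
```

Under that assumption, the chance of an error-free run falls to roughly 42% after three tasks and under 6% after ten, which is why unattended multi-step workflows are riskier than the headline score suggests.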
