
GPT-5.4 Just Beat Humans at Real Office Work — 75% vs 72.4% on Desktop Tasks

Nikhil B · Apr 5, 2026 · 2 min read
Engine Score 7/10 — Important

OpenAI’s GPT-5.4 scored 75.0% on the OSWorld-Verified benchmark — a test of real desktop productivity tasks — surpassing the human baseline of 72.4%. That’s a 27.7 percentage point jump from GPT-5.2’s score. GPT-5.4 also features a 1-million-token context window and can autonomously execute multi-step workflows across files, browsers, and terminals.

What OSWorld-Verified Tests

OSWorld isn’t a language benchmark — it measures the ability to complete actual computer tasks. Test scenarios include:

  • Opening a spreadsheet, finding specific data, creating a chart, and exporting as PDF
  • Composing emails with attachments pulled from specific file paths
  • Navigating web interfaces to fill out forms with data from local documents
  • Multi-step file management: renaming, organizing, compressing, and uploading
  • Terminal operations: writing scripts, running commands, debugging errors

The “Verified” designation means tasks are checked by human evaluators, not just automated test harnesses. The 72.4% human baseline was established by giving the same tasks to office workers of varying skill levels.
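
To make the file-management scenario above concrete, here is a minimal Python sketch of the kind of multi-step task an agent has to carry out end to end: rename, organize, then compress. The paths and naming scheme are hypothetical, not taken from OSWorld's actual task configs, and the benchmark's agents drive a real desktop rather than calling these libraries directly.

```python
import shutil
import zipfile
from pathlib import Path

# Hypothetical paths for illustration only; OSWorld tasks specify their own.
SRC = Path("~/Downloads/reports").expanduser()
DEST = Path("~/Documents/reports_2026").expanduser()
ARCHIVE = DEST.with_suffix(".zip")

def organize_and_compress() -> Path:
    """Rename, organize, and compress report files in one pass."""
    DEST.mkdir(parents=True, exist_ok=True)
    for i, pdf in enumerate(sorted(SRC.glob("*.pdf")), start=1):
        # Copy into the target folder under a uniform naming scheme.
        shutil.copy2(pdf, DEST / f"report_{i:03d}.pdf")
    # Compress the organized folder into a single archive.
    with zipfile.ZipFile(ARCHIVE, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in DEST.iterdir():
            zf.write(f, arcname=f.name)
    return ARCHIVE

if __name__ == "__main__":
    print(f"Archive written to {organize_and_compress()}")
```

The point of tasks like this isn't that any one step is hard; it's that the agent must chain several steps correctly, and a mistake at any one of them fails the whole task.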

What 75% Actually Means

Beating the human baseline doesn’t mean GPT-5.4 can do every office worker’s job. The benchmark measures task completion on isolated, well-defined tasks. Real office work involves:

  • Ambiguous requirements: “Make this look better” requires judgment the benchmark doesn’t test
  • Social context: Understanding office politics, team dynamics, and unstated preferences
  • Multi-day continuity: Tracking project state across weeks, not just completing one-off tasks
  • Error recovery: Handling unexpected situations that don’t fit the task description

Passing the benchmark is a necessary condition for job replacement, not a sufficient one.

The 1-Million-Token Context Window

GPT-5.4’s context window holds approximately 750,000 words — enough to fit an entire novel, a year’s worth of email, or a complete codebase. This enables the model to:

  • Reference an entire project’s documentation while completing tasks
  • Process full datasets without chunking or summarization
  • Maintain context across long, complex workflows

For comparison: GPT-4o had 128K tokens, Arcee’s Trinity has 262K tokens, and Claude Opus 4.6 has 200K tokens. At 1M tokens, GPT-5.4 has the largest production context window of any frontier model.
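
As a rough sanity check on the 750,000-word figure, here's a short Python sketch using the tiktoken library. The o200k_base encoding is an assumption, a stand-in borrowed from GPT-4o, since OpenAI hasn't published GPT-5.4's tokenizer.

```python
import tiktoken

# Assumption: o200k_base (GPT-4o's encoding) as a stand-in, because
# GPT-5.4's actual tokenizer hasn't been published.
enc = tiktoken.get_encoding("o200k_base")

def fits_in_window(text: str, window: int = 1_000_000) -> bool:
    """Return True if `text` tokenizes to at most `window` tokens."""
    return len(enc.encode(text)) <= window

# Rule of thumb: English prose runs ~0.75 words per token, so a
# 1M-token window holds roughly 750,000 words.
sample = "The quick brown fox jumps over the lazy dog. " * 20_000
tokens = len(enc.encode(sample))
print(f"{tokens:,} tokens for {len(sample.split()):,} words")
```

The exact words-per-token ratio varies with the text (code and non-English prose tokenize less efficiently), so the 750,000-word figure is an approximation, not a hard limit.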

The Jump From GPT-5.2

The 27.7-percentage-point improvement from GPT-5.2 (47.3%) to GPT-5.4 (75.0%) is the largest single-generation gain OpenAI has published on a real-world capability benchmark. For context, GPT-4 to GPT-4o was a 5-8 point improvement on most benchmarks. A jump this large suggests either an architectural change or a significant shift in training methodology; OpenAI hasn't disclosed specifics.

What This Means for Work

The practical question isn't “can AI do my job?” but “which parts of my job can AI do reliably?” At 75% on desktop tasks, GPT-5.4 can handle routine, well-defined office work: data entry, document processing, email management, file organization, and simple analysis. The 25% failure rate means it still needs oversight. Research on cognitive surrender, which found that 79.8% of users follow AI advice even when it's wrong, makes that oversight requirement more critical, not less.
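
One way to see why the 25% failure rate matters: if task outcomes were independent (a simplifying assumption real workflows violate), a batch of routine tasks would rarely complete without at least one error. A quick back-of-the-envelope calculation in Python:

```python
# Back-of-the-envelope: how often does a batch of tasks complete cleanly
# at a 75% per-task success rate? Assumes independent outcomes, which is
# a simplification, so treat this as illustrative only.
success_rate = 0.75

for n_tasks in (1, 3, 5, 10, 20):
    p_all_clean = success_rate ** n_tasks
    print(f"{n_tasks:>2} tasks: {p_all_clean:6.1%} chance of zero failures")
```

At ten delegated tasks, the chance of zero failures is under 6%, which is why per-task accuracy this far below 100% still demands a human in the loop.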

