Alibaba's Qwen3.7-Plus: A New Multimodal AI Agent

Alibaba released Qwen3.7-Plus, a multimodal model that combines visual perception, GUI operation, and coding in a single agent loop.
In a demo, an agent built on the model autonomously developed a vocabulary app: more than 10,000 lines of code across more than 1,000 agent calls in over 11 hours.
It leads on interface-operation benchmarks like AndroidWorld and ScreenSpot Pro but trails on pure logic tasks.
The model is available as a comparatively inexpensive proprietary option through Alibaba Cloud.

What Happened

Alibaba’s Qwen team has released Qwen3.7-Plus, a multimodal model built on the text-only Qwen3.7 that combines visual perception with agent capabilities like coding and tool use, according to The Decoder. Alibaba bills it as a “multimodal interactive hybrid agent.”

The model is designed to read screen content, operate graphical interfaces, write code from visual templates, and navigate mobile apps end to end, with UI clicks and command-line instructions running in the same agent loop.
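
To make the "same agent loop" idea concrete, here is a minimal Python sketch of a loop in which GUI clicks and shell commands share one action history. It illustrates the general pattern only and is not Alibaba's implementation or API: the `Click`, `Shell`, and `Done` types, `run_agent`, and `scripted_policy` are hypothetical names, and the policy is a scripted stand-in for the model that performs no real clicks or commands.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Click:
    """A GUI action: click at screen coordinates (hypothetical schema)."""
    x: int
    y: int


@dataclass
class Shell:
    """A command-line action (hypothetical schema)."""
    command: str


@dataclass
class Done:
    """Signals that the task is finished."""
    summary: str


Action = Union[Click, Shell, Done]


def run_agent(policy: Callable[[list[str]], Action], max_steps: int = 10) -> list[str]:
    """Drive one loop in which GUI and shell actions share a single history."""
    history: list[str] = []
    for _ in range(max_steps):
        action = policy(history)
        if isinstance(action, Click):
            # A real system would send the click to the device here.
            history.append(f"click({action.x}, {action.y})")
        elif isinstance(action, Shell):
            # A real system would run the command in a sandbox here.
            history.append(f"shell({action.command!r})")
        else:
            history.append(f"done: {action.summary}")
            break
    return history


def scripted_policy(history: list[str]) -> Action:
    """Stand-in for the model: replays a fixed plan based on step count."""
    plan: list[Action] = [Click(120, 48), Shell("git status"), Done("checked repo")]
    return plan[min(len(history), len(plan) - 1)]


trace = run_agent(scripted_policy)
```

Because both action types are appended to the same history, the policy sees the result of a click before it chooses the next shell command, and vice versa. That shared context is the property the article describes.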

Why It Matters

Qwen3.7-Plus pushes the industry’s shift from chatbots to agents that act on a computer directly — the same frontier being contested by OpenAI’s GPT-5.x line and Claude Opus 4.8. As a comparatively cheap, China-origin option, it also extends the pricing pressure created by capable open and low-cost models like NVIDIA’s Nemotron 3 Ultra.

Technical Details

In one demo, a hybrid agent system built an English vocabulary learning app, running for over eleven hours and producing more than 10,000 lines of code across more than 1,000 agent calls — covering requirements, code generation, testing, and version management. In a second, the agent recreated the macOS Stocks app by parsing its UI and generating SwiftUI code, then connected a live data API and ran ten functional tests.
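
For a sense of scale, the vocabulary-app figures can be turned into rough per-call rates. The Python sketch below treats the reported lower bounds ("more than 10,000", "more than 1,000", "over eleven hours") as point estimates, which is an assumption, so the results are order-of-magnitude only. The variable names are illustrative.

```python
# Reported lower bounds, treated here as point estimates (an assumption).
LINES_OF_CODE = 10_000
AGENT_CALLS = 1_000
HOURS = 11

# Average output per agent call.
lines_per_call = LINES_OF_CODE / AGENT_CALLS

# Average call rate over the whole run.
calls_per_hour = AGENT_CALLS / HOURS

# Average wall-clock time per call, in seconds.
seconds_per_call = HOURS * 3600 / AGENT_CALLS
```

Under these assumptions the run averages about 10 lines of code per call, roughly 91 calls per hour, and just under 40 seconds per call. Since the calls also covered requirements, testing, and version management, many of them would have produced no code at all.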

On the AndroidWorld and ScreenSpot Pro benchmarks, Qwen3.7-Plus sits well ahead of GPT-5.4 (xhigh) and Claude Opus on interface operation, but its published numbers trail on pure logic tasks.

Who’s Affected

Developers building GUI-automation and computer-use agents gain a low-cost, capable option via Alibaba Cloud. The release also affects Western labs whose computer-use features command premium pricing, since Qwen3.7-Plus undercuts them on cost while leading specific interface benchmarks.

What’s Next

The benchmarks are Alibaba’s own, and the logic-task gap is a stated limitation rather than an independent finding. Independent evaluation on AndroidWorld and ScreenSpot Pro will determine whether the interface-operation lead holds outside the vendor’s testing.
