Alibaba released Qwen3.5-Omni on March 31, 2026, an omnimodal AI model that processes text, images, audio, and video natively. Among its reported behaviors is an ability to write code from spoken instructions and video demonstrations — a capability the Qwen team says emerged without being an explicit training objective.
- The Qwen team claims state-of-the-art results for Qwen3.5-Omni-Plus on 215 audio and audiovisual subtasks, including 82.2 on the MMAU audio comprehension benchmark versus 81.1 for Google’s Gemini 3.1 Pro.
- Speech recognition support expanded from 11 languages in the previous generation to 74 languages and 39 Chinese dialects — 113 languages and dialects total.
- On the seed-hard speech generation benchmark, the model achieved a word error rate of 6.24, compared to GPT-Audio at 8.19, Minimax at 8.62, and ElevenLabs at 27.70.
- Alibaba has not released model weights; Qwen3.5-Omni is accessible only as an API service, departing from previous open-weight Qwen releases.
What Happened
Alibaba released Qwen3.5-Omni on or before March 31, 2026, the date Jonathan Kemper published his coverage for The Decoder. The model is available in three Instruct variants — Plus, Flash, and Light — and is accessible only through Alibaba’s API, with no publicly released weights. The Qwen team highlighted one notable finding: the model developed the ability to write code from spoken instructions and video input without being explicitly trained for that behavior.
Why It Matters
Qwen3.5-Omni follows Qwen3-Omni, which supported speech recognition in 11 languages and eight Chinese dialects. The new release expands that to 74 languages and 39 Chinese dialects (113 total), substantially broadening the model’s multilingual audio footprint.
The emergent coding behavior is notable because it suggests that large-scale pretraining on diverse audiovisual data — in this case, more than 100 million hours — can surface capabilities that were not objectives of the training process. The mechanism behind this behavior has not been analyzed in publicly available documentation from Alibaba.
Technical Details
Qwen3.5-Omni was natively pre-trained as an omnimodal model on over 100 million hours of audiovisual material. It supports context windows up to 256,000 tokens and, according to the Qwen team, can process more than ten hours of audio or over 400 seconds of 720p video at one frame per second in a single pass. The model generates speech output alongside text.
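These capacity claims can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes, purely for illustration, that the entire 256K-token window is consumed by a single modality; the actual per-second and per-frame token budgets of Qwen3.5-Omni's encoders are not documented in the available coverage.

```python
# Back-of-envelope check of the claimed capacities (assumption: the
# full 256K-token context is devoted to one modality at a time).
context_tokens = 256_000

# "more than ten hours of audio" in one pass implies a token budget of
# at most ~7 tokens per second of audio.
audio_seconds = 10 * 3600
tokens_per_second_audio = context_tokens / audio_seconds  # ~7.1

# "over 400 seconds of 720p video at one frame per second" means at
# least 400 frames, i.e. at most ~640 tokens per frame.
video_frames = 400  # 400 s at 1 fps
tokens_per_frame = context_tokens / video_frames  # 640.0

print(round(tokens_per_second_audio, 1), round(tokens_per_frame))
```

Both figures are upper bounds under the stated assumption; in practice some of the window would be shared with text and output tokens.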
On audio benchmarks, the Plus variant scores 82.2 on MMAU (audio comprehension) against Gemini 3.1 Pro’s 81.1. The margin widens on music comprehension via RUL-MuchoMusic: 72.4 for Qwen3.5-Omni-Plus versus 59.6 for Gemini. On the VoiceBench dialog benchmark, Qwen3.5-Omni-Plus scored 93.1 compared to Gemini’s 88.9. The Qwen team claims the Plus variant leads or matches Gemini 3.1 Pro on 215 audio and audiovisual subtasks spanning three audiovisual benchmarks, five audio benchmarks, eight speech recognition benchmarks, 156 language-specific translation tasks, and 43 language-specific recognition tasks. These figures are vendor-reported and have not been independently verified at time of publication.
For speech generation, the model was benchmarked against ElevenLabs, Gemini 2.5 Pro, GPT-Audio, and Minimax. On the seed-hard test set, Qwen3.5-Omni-Plus achieved a word error rate of 6.24, against 8.19 for GPT-Audio, 8.62 for Minimax, and 27.70 for ElevenLabs. Voice cloning across 20 languages reached a word error rate of 1.87 with a speaker cosine similarity of 0.79. Voice output supports 36 languages and dialects across 55 available voices, including user-defined, dialectal, scenario-specific, and multilingual options.
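The two metrics behind these figures follow standard definitions: word error rate is the word-level edit distance between a reference transcript and the recognized output, normalized by reference length, and the cloning score is the cosine similarity between speaker-embedding vectors. The sketch below shows the conventional computations; it is not Alibaba's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# One substitution against a five-word reference -> WER 0.2
print(word_error_rate("please turn on the light", "please turn off the light"))
```

Note that benchmark WER is usually reported as a percentage, which is how figures like 6.24 and 27.70 should be read.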
Who’s Affected
Developers building voice-first or multilingual applications are the most direct audience. The expansion from 11 to 74 speech recognition languages, combined with the 55-voice output system, allows a broader range of audio applications to be built without requiring separate transcription or synthesis pipelines.
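For developers evaluating the API, prior Qwen models have typically been served through an OpenAI-compatible chat-completions interface that accepts audio as a base64-encoded content part. The sketch below assembles such a request payload; the model name `qwen3.5-omni-plus` and the content-part shape are assumptions for illustration, not confirmed details from the coverage.

```python
import json

def build_audio_request(model: str, audio_b64: str, prompt: str) -> dict:
    """Assemble a chat-completion payload carrying one audio input,
    following the OpenAI-compatible input_audio content-part convention."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# Hypothetical usage: transcription plus follow-up instruction in one call.
payload = build_audio_request(
    "qwen3.5-omni-plus",            # assumed model identifier
    "<base64-encoded wav data>",    # placeholder, not real audio
    "Transcribe this clip, then summarize it in one sentence.",
)
print(json.dumps(payload, indent=2))
```

The single-payload shape is the point: one request carries both the audio and the text instruction, which is what removes the need for a separate transcription pipeline.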
The shift to API-only access directly affects organizations and developers that previously used open-weight Qwen models in self-hosted or fine-tuned deployments. Those workflows are not supported in this release.
Enterprises exploring voice-to-code pipelines may be interested in the emergent coding capability, but no benchmark data for that specific behavior exists in the available source material, leaving its production reliability uncharacterized.
What’s Next
Alibaba has not specified whether model weights will be released at a later date or whether API-only access is a permanent policy for this model line. No technical paper or model card for Qwen3.5-Omni was referenced in available coverage at time of publication.
The emergent code-generation capability — writing code from spoken instructions and video — has not been evaluated on any published benchmark. The Qwen team identified it as an observed outcome rather than a measured one. Determining its reliability, accuracy, and scope across programming languages would require systematic independent evaluation.
No direct quotes from Alibaba researchers or the Qwen team were available in the source material at time of publication.