ANALYSIS

Alibaba’s Qwen Team Introduces FIPO, Doubling AI Reasoning Chain Length

Anika Patel · Apr 6, 2026 · 4 min read
Engine Score 5/10 — Notable
  • Alibaba’s Qwen team developed FIPO (Future-KL Influenced Policy Optimization), a reinforcement learning algorithm that weights each token’s reward based on its downstream influence over subsequent tokens, rather than distributing end-result rewards uniformly.
  • On Qwen2.5-32B-Base trained with FIPO, average chain-of-thought length grew from roughly 4,000 tokens under the DAPO baseline to over 10,000 tokens.
  • FIPO achieved 56–58% accuracy on AIME 2024, clearly surpassing DeepSeek-R1-Zero-Math-32B (approximately 47%) and, at its 58% peak, edging past OpenAI’s o1-mini (approximately 56%).
  • The algorithm has been validated only on mathematical tasks so far; the Qwen team plans to release the training system as open source.

What Happened

Alibaba’s Qwen research team published a paper describing FIPO (Future-KL Influenced Policy Optimization), a reinforcement learning training algorithm that changes how large language models receive credit during chain-of-thought reasoning. The work was reported by The Decoder on April 6, 2026. Unlike standard RL methods that spread a binary pass/fail reward evenly across every token in a generated sequence, FIPO assigns each token a reward proportional to how much its generation shifts the probability distribution of all tokens that follow it.

Why It Matters

Popular RL training methods for reasoning models — including GRPO (Group Relative Policy Optimization) and its open-source variant DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) — hit a ceiling: reasoning chains grow to a fixed length and stop extending. Previous attempts to address this relied on PPO-based methods that require a separate value model pre-trained on synthetic long chain-of-thought data. The Qwen researchers argue this auxiliary model contaminates performance comparisons by importing outside knowledge, making it difficult to isolate algorithmic gains. FIPO eliminates the auxiliary model entirely.
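To make the contrast concrete, here is a minimal sketch of the uniform credit assignment that GRPO-style methods use. The function name, the toy reward group, and the sequence lengths are illustrative assumptions, not the teams' actual implementations.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style group-relative advantage: normalize each sampled
    solution's scalar reward against the group, then spread that one
    value uniformly across every token of the sequence."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    return (rewards - rewards.mean()) / (std if std > 0 else 1.0)

# Hypothetical group of 4 sampled solutions with binary pass/fail rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
seq_lengths = [120, 80, 95, 200]
per_token = [np.full(n, a)
             for a, n in zip(grpo_advantages(group_rewards), seq_lengths)]
# Every token of a passing sequence gets identical credit, whether it
# launched the key idea or merely restated the question.
```

Because the signal carries no information about which tokens mattered, there is no gradient pressure to grow the chain once it reliably reaches an answer, which is one way to read the length ceiling described above.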

Technical Details

FIPO calculates the cumulative KL divergence across all tokens downstream of a given generation step, using that forward-looking signal to allocate rewards more precisely. Tokens that initiate productive reasoning branches receive a larger share; tokens that lead to dead ends receive less. To maintain training stability, the algorithm applies a discount factor to distant tokens — whose downstream influence is harder to estimate reliably — and filters out tokens where model behavior has drifted too far between training iterations. The researchers report that without this drift filter, training diverged and response lengths collapsed.
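The mechanics described above can be sketched roughly as follows. This is an illustrative reconstruction from the article's description, not the paper's code: the data layout (`future_kl[t]` holding the KL shifts token t induces at each later step), the discount factor value, and the drift cap are all assumptions.

```python
import numpy as np

def fipo_token_weights(future_kl, gamma=0.95, drift_kl=None, drift_cap=0.5):
    """FIPO-style credit sketch: weight each token by the discounted
    cumulative KL shift it induces over downstream tokens, and mask
    tokens whose behavior drifted too far between training iterations."""
    T = len(future_kl)
    weights = np.zeros(T)
    for t in range(T):
        # Discount distant downstream influence, which is noisier to estimate.
        weights[t] = sum(gamma ** k * kl for k, kl in enumerate(future_kl[t]))
    if drift_kl is not None:
        # Stability filter: zero out tokens past the drift cap.
        weights = np.where(np.asarray(drift_kl) > drift_cap, 0.0, weights)
    total = weights.sum()
    return weights / total if total > 0 else weights

# Toy example: 4 tokens; token 0 strongly steers the rest of the chain,
# token 1 is filtered out for excessive policy drift.
fkl = [[0.8, 0.6, 0.4], [0.1, 0.1], [0.3], []]
w = fipo_token_weights(fkl, drift_kl=[0.1, 0.9, 0.1, 0.1])
```

The key design point, per the article, is that this forward-looking signal replaces the auxiliary value model of PPO-based pipelines, so no synthetic long chain-of-thought data enters the comparison.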

The team evaluated FIPO on Qwen2.5-32B-Base, a model with no prior exposure to synthetic long chain-of-thought data, trained exclusively on the public DAPO dataset. The DAPO baseline produced average chain-of-thought lengths of approximately 4,000 tokens; FIPO pushed that figure past 10,000 tokens. On AIME 2024, accuracy rose from 50% to 56%, peaking at 58%, putting FIPO well ahead of DeepSeek-R1-Zero-Math-32B (approximately 47%) and, at its peak, slightly ahead of OpenAI’s o1-mini (approximately 56%). On the harder AIME 2025 benchmark, scores climbed from 38% to 43%.

The paper identifies four behavioral phases that emerge during FIPO training. In phase one, the model produces shallow planning outlines with no real computation. Phase two — where DAPO-trained models remain throughout training — involves a single linear reasoning chain terminating at the first answer found. Phase three introduces spontaneous self-verification: after reaching an answer, the model switches methods, moving, for example, from algebraic manipulation to geometric interpretation to cross-check its result. Phase four adds systematic multi-pass verification, recalculating intermediate steps multiple times. The researchers state this behavior “looks a lot like the inference-time scaling strategies in OpenAI’s o-series and DeepSeek-R1, but FIPO pulls it off through reinforcement learning alone, with no long-CoT synthetic data.”

Who’s Affected

The most direct beneficiaries are AI research teams building or fine-tuning reasoning models under compute constraints that make PPO-based value-model pipelines expensive. Teams working with open-weight models in the Qwen2.5 family or DeepSeek variants, which rely on GRPO-style training, would be the primary users of FIPO once the training code is released. The tradeoff is cost at inference: chains exceeding 10,000 tokens increase compute requirements for any organization serving these models in production.
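A back-of-envelope sketch of that serving tradeoff, using the chain lengths reported above and ignoring prompt length, batching, and serving optimizations:

```python
def relative_decode_cost(new_len, old_len):
    """Rough bounds on serving cost growth: autoregressive decoding needs
    at least linearly more forward passes with output length, while the
    attention term grows with the square of sequence length."""
    linear = new_len / old_len
    quadratic = (new_len / old_len) ** 2
    return linear, quadratic

lo, hi = relative_decode_cost(10_000, 4_000)
# Roughly 2.5x more decode steps; attention FLOPs grow by up to ~6.25x.
```

Real deployments with KV caching and paged attention land between these bounds, but either way, longer chains translate directly into higher per-query cost.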

What’s Next

The Qwen team has stated plans to open-source the full FIPO training system along with all configurations, but has not specified a release date. As of publication, the algorithm has been tested only on mathematical benchmarks using a single training dataset and base models without long chain-of-thought pre-training; the researchers acknowledge that whether gains transfer to code generation, symbolic logic, or other domains is an open question. The team also notes a persistent performance gap relative to distillation from larger teacher models, which pure reinforcement learning approaches have not yet closed.
