Function Calling Harness Pushes Qwen From 6.75 Percent to 100 Percent Success on Complex Schemas

megaone_admin, Mar 27, 2026

A developer presenting at Qwen Meetup Korea on March 26 demonstrated how a self-healing harness architecture can push function calling success rates from 6.75 percent to effectively 100 percent, even on deeply recursive union types that the AI industry has generally written off as unsolvable. The talk centered on AutoBe, an open-source AI agent developed by Wrtn Technologies that generates production-grade backends from natural language.

The baseline numbers are stark. When asked to generate API data types for a shopping mall backend, qwen3-coder-next achieved a first-try success rate of just 6.75 percent: roughly 93 out of every 100 attempts produced invalid structured output. The entire Qwen 3.5 model family scored 0 percent on union types because of a consistent double-stringify bug, in which a value that should be a structured object arrives as a JSON-encoded string instead. These failures align with published benchmarks: NESTFUL (EMNLP 2025) measured GPT-4o at 28 percent accuracy on nested tool call sequences, and JSONSchemaBench (ICLR 2025) found 3 to 41 percent coverage on the hardest real-world schemas.

AutoBe’s solution does not improve the model itself. Instead, it wraps the model in a deterministic loop: type schemas constrain outputs, compilers verify results, and structured feedback pinpoints exactly where and why something went wrong so the LLM can correct itself. The system uses a five-phase waterfall pipeline running through four AST types and four-tier compiler validation, with self-healing loops at each stage.
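The generate, verify, and feed-back loop described above can be sketched in a few lines. This is a minimal illustration, not AutoBe's code: `Violation`, `Validate`, `ModelCall`, `selfHeal`, and the `Product` example are all hypothetical names, the model is a synchronous stand-in for what would be an async LLM call, and the three-attempt limit is an arbitrary assumption.

```typescript
// Hypothetical types for illustration; none of these names come from AutoBe.
interface Violation {
  path: string; // where the problem is, e.g. "$input.price"
  expected: string; // what the schema wanted
  value: unknown; // what the model actually produced
}

type Validate<T> = (
  input: unknown,
) => { success: true; data: T } | { success: false; errors: Violation[] };

// Stand-in for an LLM call; it receives the previous round's feedback, if any.
type ModelCall = (feedback: string | null) => unknown;

// Turn structured violations into text the model can act on.
function formatFeedback(errors: Violation[]): string {
  return errors
    .map((e) => `${e.path}: expected ${e.expected}, got ${JSON.stringify(e.value)}`)
    .join("\n");
}

// Call the model, validate deterministically, and retry with feedback on failure.
function selfHeal<T>(
  generate: ModelCall,
  validate: Validate<T>,
  maxAttempts = 3, // assumed limit, not taken from the talk
): { data: T; attempts: number } {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = validate(generate(feedback));
    if (result.success) return { data: result.data, attempts: attempt };
    feedback = formatFeedback(result.errors);
  }
  throw new Error(`no valid output after ${maxAttempts} attempts:\n${feedback}`);
}

// Toy schema and hand-written validator standing in for a compiler-generated one.
interface Product {
  name: string;
  price: number;
}

const validateProduct: Validate<Product> = (input) => {
  const obj = (typeof input === "object" && input !== null ? input : {}) as Record<string, unknown>;
  const errors: Violation[] = [];
  if (typeof obj.name !== "string")
    errors.push({ path: "$input.name", expected: "string", value: obj.name });
  if (typeof obj.price !== "number")
    errors.push({ path: "$input.price", expected: "number", value: obj.price });
  return errors.length === 0
    ? { success: true, data: obj as unknown as Product }
    : { success: false, errors };
};

// Fake model: wrong type on the first try, corrected once it sees feedback.
const fakeModel: ModelCall = (feedback) =>
  feedback === null
    ? { name: "Keyboard", price: "49.9" }
    : { name: "Keyboard", price: 49.9 };

const healed = selfHeal(fakeModel, validateProduct);
```

Here `healed.attempts` is 2: the first output fails on `$input.price`, and the retry passes once the model has seen the feedback. The design point is that the validator, not the model, decides when the loop ends.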

The infrastructure powering this approach is Typia, a TypeScript compiler plugin that analyzes a single type from source code and automatically generates a schema, a parser, a validator, and a feedback generator for it. When Qwen 3.5 models scored 0 percent on union types, Typia’s lenient JSON parsing, combined with schema-based type coercion and precise validation feedback, flipped the rate to 100 percent compilation success.
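To make the idea of schema-based coercion concrete, here is a small sketch of how a double-stringified value might be unwrapped when the schema expects something other than a string. It is not Typia's implementation, which the talk summary does not describe: the `Schema` type is a deliberately tiny stand-in for JSON Schema, and `coerce` is a hypothetical helper.

```typescript
// Hypothetical, deliberately tiny schema shape; real JSON Schema is far richer.
type Schema =
  | { type: "string" }
  | { type: "number" }
  | { type: "boolean" }
  | { type: "array"; items: Schema }
  | { type: "object"; properties: Record<string, Schema> };

// Walk the value alongside the schema. Wherever the schema expects a
// non-string but a string arrived, try to JSON-parse it and continue.
function coerce(value: unknown, schema: Schema): unknown {
  if (typeof value === "string" && schema.type !== "string") {
    try {
      return coerce(JSON.parse(value), schema);
    } catch {
      return value; // not parseable: leave it for the validator to report
    }
  }
  if (schema.type === "array" && Array.isArray(value)) {
    const items = schema.items;
    return value.map((element) => coerce(element, items));
  }
  if (
    schema.type === "object" &&
    typeof value === "object" &&
    value !== null &&
    !Array.isArray(value)
  ) {
    const out: Record<string, unknown> = { ...(value as Record<string, unknown>) };
    for (const [key, sub] of Object.entries(schema.properties)) {
      if (key in out) out[key] = coerce(out[key], sub);
    }
    return out;
  }
  return value;
}

// Example: the whole argument object was stringified twice, and one
// number inside it was also sent as a string.
const orderSchema: Schema = {
  type: "object",
  properties: {
    items: { type: "array", items: { type: "number" } },
    note: { type: "string" },
  },
};

const raw = JSON.stringify(JSON.stringify({ items: ["1", 2], note: "42" }));
const fixed = coerce(JSON.parse(raw), orderSchema);
```

After the call, `fixed` is `{ items: [1, 2], note: "42" }`: the outer string is unwrapped, `"1"` becomes a number because the schema asks for one, and `"42"` stays a string for the same reason. Anything that cannot be coerced is passed through unchanged so that a validator can report it with a precise path.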

The broader implication is that function calling reliability is an engineering problem, not a model capability problem. The presenter argued that the pattern applies to any domain where deterministic validators exist, such as semiconductors, chemical processes, and control systems. The presenter also argued that smaller models are actually preferable as QA engineers for harness development, because they expose system vulnerabilities more readily than capable frontier models, which mask infrastructure weaknesses.
