Function Calling Harness Pushes Qwen From 6.75 Percent to 100 Percent Success on Complex Schemas

A developer presenting at the Qwen Meetup Korea on March 26 demonstrated how a self-healing harness architecture can push function calling success rates from 6.75 percent to effectively 100 percent, even on deeply recursive union types that the AI industry has generally written off as unsolvable. The talk centered on AutoBe, an open-source AI agent developed by Wrtn Technologies that generates production-grade backends from natural language.

The baseline numbers are stark. When asked to generate API data types for a shopping mall backend, qwen3-coder-next achieved a first-try success rate of just 6.75 percent, meaning roughly 93 out of 100 attempts produced invalid structured output. The entire Qwen 3.5 model family hit 0 percent on union types due to a consistent double-stringify bug, in which a nested value is emitted as a JSON-encoded string rather than as the structure itself. These failures align with published benchmarks: NESTFUL (EMNLP 2025) measured GPT-4o at 28 percent accuracy on nested tool call sequences, and JSONSchemaBench (ICLR 2025) found 3 to 41 percent coverage on the hardest real-world schemas.

AutoBe’s solution does not improve the model itself. Instead, it wraps the model in a deterministic loop: type schemas constrain outputs, compilers verify results, and structured feedback pinpoints exactly where and why something went wrong so the LLM can correct itself. The system uses a five-phase waterfall pipeline running through four AST types and four-tier compiler validation, with self-healing loops at each stage.
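The loop described above can be sketched in a few lines. This is a minimal illustration, not AutoBe's code: `callWithHealing`, `Validator`, `Model`, `validateProduct`, and `makeFlakyModel` are hypothetical names, the "model" is a stub that returns a bad answer first and a good one once it receives feedback, and a hand-written validator stands in for the compiler stages.

```typescript
// Hypothetical sketch of a validate-and-retry harness. None of these names come from AutoBe.
interface ValidationError {
  path: string; // where the problem is, e.g. "$input.price"
  expected: string; // what the schema required
  value: unknown; // what the model actually produced
}

type Validator<T> = (
  input: unknown,
) => { success: true; data: T } | { success: false; errors: ValidationError[] };

// Stand-in for an LLM call: receives the prompt plus feedback from the previous attempt.
type Model = (prompt: string, feedback: string | null) => Promise<string>;

async function callWithHealing<T>(
  model: Model,
  validate: Validator<T>,
  prompt: string,
  maxAttempts = 3,
): Promise<T> {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await model(prompt, feedback);
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch (e) {
      feedback = `Output was not valid JSON: ${(e as Error).message}`;
      continue;
    }
    const result = validate(parsed);
    if (result.success) return result.data;
    // Turn each error into a precise, path-addressed hint for the next attempt.
    feedback = result.errors
      .map((e) => `${e.path}: expected ${e.expected}, got ${JSON.stringify(e.value)}`)
      .join("\n");
  }
  throw new Error(`No valid output after ${maxAttempts} attempts. Last feedback:\n${feedback}`);
}

// Illustrative target type and a hand-written validator for it.
interface Product {
  name: string;
  price: number;
}

const validateProduct: Validator<Product> = (input) => {
  const errors: ValidationError[] = [];
  const obj = (typeof input === "object" && input !== null ? input : {}) as Record<string, unknown>;
  if (typeof obj["name"] !== "string")
    errors.push({ path: "$input.name", expected: "string", value: obj["name"] });
  if (typeof obj["price"] !== "number")
    errors.push({ path: "$input.price", expected: "number", value: obj["price"] });
  return errors.length === 0
    ? { success: true, data: obj as unknown as Product }
    : { success: false, errors };
};

// Stub model: emits a string price first, then a numeric price once feedback arrives.
function makeFlakyModel(): Model {
  return async (_prompt, feedback) =>
    feedback === null ? '{"name":"Mug","price":"12.5"}' : '{"name":"Mug","price":12.5}';
}
```

Calling `callWithHealing(makeFlakyModel(), validateProduct, "Describe a mug")` fails validation on the first attempt, feeds the error text back, and resolves on the second. The point of the design is that the retry is driven by a deterministic check and a specific error location rather than by asking the model to "try again".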

The infrastructure powering this approach is Typia, a TypeScript compiler plugin that analyzes a single type from source code and automatically generates a schema, a parser, a validator, and a feedback generator for it. Where the Qwen 3.5 models scored 0 percent on union types, Typia’s lenient JSON parsing, combined with schema-based type coercion and precise validation feedback, flipped the rate to 100 percent compilation success.
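To make the coercion idea concrete, here is a rough sketch of how a schema can be used to undo double-stringified values. It is not Typia's implementation: the `Schema` type and the `coerce` function are hypothetical, the schema language is far simpler than JSON Schema, and union types are left out entirely.

```typescript
// Hypothetical, deliberately tiny schema language for illustration only.
type Schema =
  | { type: "string" }
  | { type: "number" }
  | { type: "boolean" }
  | { type: "array"; items: Schema }
  | { type: "object"; properties: Record<string, Schema> };

// Walk the value alongside the schema. Wherever the schema expects a
// non-string but a string arrived, try to JSON-parse it and continue.
function coerce(value: unknown, schema: Schema): unknown {
  if (typeof value === "string" && schema.type !== "string") {
    try {
      return coerce(JSON.parse(value), schema);
    } catch {
      return value; // not parseable: leave it for the validator to report
    }
  }
  if (schema.type === "array" && Array.isArray(value)) {
    const items = schema.items;
    return value.map((item) => coerce(item, items));
  }
  if (
    schema.type === "object" &&
    typeof value === "object" &&
    value !== null &&
    !Array.isArray(value)
  ) {
    const out: Record<string, unknown> = { ...(value as Record<string, unknown>) };
    for (const [key, sub] of Object.entries(schema.properties)) {
      if (key in out) out[key] = coerce(out[key], sub);
    }
    return out;
  }
  return value;
}

// Example: the model stringified the array and the number.
const productSchema: Schema = {
  type: "object",
  properties: {
    name: { type: "string" },
    tags: { type: "array", items: { type: "string" } },
    price: { type: "number" },
  },
};

const modelOutput = { name: "Mug", tags: '["kitchen","gift"]', price: "12.5" };
const repaired = coerce(modelOutput, productSchema);
```

Here `repaired` holds a real array for `tags` and a number for `price`, while `name` is untouched because the schema says it should be a string. Values that cannot be repaired pass through unchanged, so a validator still sees them and can report them.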

The broader implication is that function calling reliability is an engineering problem, not a model capability problem. The presenter argued that the pattern applies to any domain where deterministic validators exist, such as semiconductors, chemical processes, and control systems. The presenter also argued that smaller models are preferable as QA engineers for harness development, because they expose system vulnerabilities more readily than capable frontier models, which tend to mask infrastructure weaknesses.
