BENCHMARKS

Function Calling Harness Pushes Qwen From 6.75 Percent to 100 Percent Success on Complex Schemas

James Whitfield · Mar 27, 2026 · Updated Apr 7, 2026 · 4 min read
Engine Score 8/10 — Important

The story presents a substantial improvement in Qwen's function calling reliability, offering high actionability for developers using this prominent open-source model. Despite originating from a secondary source, the technical achievement holds significant relevance for the AI community.

  • A self-healing harness architecture pushed Qwen model function calling success from 6.75% first-try accuracy to 99.8%+ across five models on complex recursive schemas.
  • The approach uses deterministic compilers and structured error feedback rather than improving the model itself, making it model-agnostic.
  • Published benchmarks confirm that even frontier models struggle with nested function calling: NESTFUL measured GPT-4o at 28% accuracy on nested tool call sequences, and JSONSchemaBench found as little as 3% coverage on the hardest real-world schemas.
  • The open-source tool Typia recovers seven categories of malformed LLM output through lenient parsing and type coercion.

What Happened

Developer Jeongho Nam presented at the Qwen Meetup Korea on March 26, 2026, demonstrating how a deterministic harness can transform unreliable function calling into near-perfect structured output. The work centers on AutoBe, an open-source AI agent built by Wrtn Technologies that generates production-grade backends from natural language.

The baseline results were stark. When asked to generate API data types for a shopping mall backend, Qwen's qwen3-coder-next achieved a first-try success rate of just 6.75 percent, meaning more than 93 of every 100 attempts produced invalid structured output. The entire Qwen 3.5 model family scored zero percent on union types due to a consistent double-stringify bug in which nested JSON fields were serialized twice.

After applying the harness system, all five tested Qwen models reached between 99.8 and 100 percent compilation success. The largest model, qwen3.5-397b-a17b, hit 100 percent. Even the smallest tested configuration, qwen3.5-35b-a3b with just 3 billion active parameters in its mixture-of-experts architecture, reached 99.8 percent.

Why It Matters

Function calling reliability has been a persistent bottleneck for AI agents that need to produce valid structured output. Industry benchmarks confirm the problem extends well beyond Qwen. NESTFUL, published at EMNLP 2025, measured GPT-4o at just 28 percent accuracy on nested tool call sequences. JSONSchemaBench at ICLR 2025 found between 3 and 41 percent coverage on the hardest real-world schemas.

Nam’s central argument is that function calling reliability is an engineering problem, not a model capability problem. Rather than waiting for better models, developers can wrap existing ones in deterministic validation loops that converge on correct output through iterative correction. The implication is significant: teams building AI agents do not need to upgrade to the latest frontier model every time structured output fails. A well-designed harness can absorb model variance and deliver consistent results regardless of the underlying LLM.

Technical Details

The harness uses a five-phase waterfall pipeline running through four AST types with four-tier compiler validation. At each stage, type schemas constrain outputs, compilers verify results, and structured feedback pinpoints exactly where and why a failure occurred so the model can self-correct.

The infrastructure layer is powered by Typia, a TypeScript compiler plugin that generates schema definitions, runtime validators, lenient JSON parsers, and structured error feedback from a single type definition. Typia recovers seven categories of malformed LLM output, including markdown code blocks, unclosed brackets, unquoted JSON keys, trailing commas, incomplete keywords, type coercion errors, and double-stringified union fields.
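To make the recovery idea concrete, here is a minimal sketch of two of those categories, stripping markdown code fences and removing trailing commas. This is not Typia's implementation, only an illustration of what lenient parsing means in practice:

```typescript
// Illustrative lenient parser covering two recovery categories:
// markdown code fences and trailing commas. NOT Typia's actual code.

function lenientParse(raw: string): unknown {
  let text = raw.trim();
  // 1. Strip markdown fences like ```json ... ``` around the payload.
  const fence = text.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/);
  if (fence) text = fence[1];
  // 2. Remove trailing commas before a closing } or ].
  text = text.replace(/,\s*([}\]])/g, "$1");
  return JSON.parse(text);
}
```

A production recovery layer would handle the remaining categories (unclosed brackets, unquoted keys, type coercion, double-stringified fields) with similarly deterministic rules before ever asking the model to retry.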

Nam highlighted what he called the “pink elephant problem” in prompt engineering: telling a model not to do something actually increases the tendency toward that behavior. Schemas solve this through structural absence. If a field type is not defined in the schema, it is impossible to generate.
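Structural absence can be sketched with a closed schema. The field names below are hypothetical; the point is that a forbidden field is never mentioned, it simply does not exist in the schema, and a closed-schema validator rejects anything outside it:

```typescript
// Sketch of "structural absence": rather than prompting "do not include
// an internalNotes field", the schema never defines it, so conforming
// output cannot contain it. Field names are hypothetical examples.

const productSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    price: { type: "number" },
  },
  required: ["name", "price"],
  additionalProperties: false, // anything outside the schema is rejected
} as const;

// A validator enforcing additionalProperties: false rejects extras outright.
function conforms(value: Record<string, unknown>): boolean {
  const allowed = Object.keys(productSchema.properties);
  return (
    productSchema.required.every((k) => k in value) &&
    Object.keys(value).every((k) => allowed.includes(k))
  );
}
```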

Who’s Affected

The pattern is relevant to any developer building AI agents that depend on structured output from language models. Nam argued the approach extends beyond software engineering to domains where deterministic validators exist, including semiconductor design verification, chemical process mass-balance checks, BIM collision detection in construction, and stability analysis in control systems.

The results also carry implications for model selection. Smaller models with as few as 3 billion active parameters successfully ran through the harness, which Nam said makes them superior “QA engineers” for identifying system vulnerabilities. Larger frontier models can mask infrastructure weaknesses by compensating with raw capability.

What’s Next

AutoBe and Typia are both open-source and available for integration. From a single natural language prompt like “Build me a shopping mall backend,” AutoBe generates requirements analysis, database schemas, OpenAPI specifications, end-to-end test code, complete implementation code, and type-safe SDKs.

The approach remains limited to domains where deterministic validators can be constructed. Tasks requiring subjective judgment or creative output, where no compiler or formal verifier exists, fall outside the scope of this technique. Whether the pattern sees broader adoption depends on whether the engineering community prioritizes harness-level solutions over waiting for model-level improvements in structured output generation.
