UC Berkeley’s Gorilla project has released BFCL v4, the fourth iteration of the Berkeley Function Calling Leaderboard, expanding its evaluation framework to cover holistic agentic LLM behavior. Created by Shishir Patil and collaborators at UC Berkeley, BFCL is the first comprehensive and executable benchmark for measuring how accurately language models can invoke external functions, APIs, and developer-defined tools — a capability central to the emerging class of agentic AI applications.
Function calling — also referred to as tool use — enables LLMs to interact with external systems by generating structured API calls rather than plain text. As AI agents increasingly need to execute code, query databases, and chain multiple tool invocations together, standardized evaluation has become essential. Before BFCL, developers relied on ad hoc testing or model providers' self-reported benchmarks, making cross-model comparisons unreliable. The Gorilla project, which also produced the Gorilla LLM fine-tuned specifically for API usage, has evolved BFCL through four major versions since its initial release.
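To make the idea concrete, the sketch below shows a generic tool schema and the structured call a model is expected to emit in place of free text. The weather-lookup function and its parameters are hypothetical and follow the common JSON-Schema convention rather than any single provider's API.

```python
import json

# Hypothetical tool schema in the JSON-Schema style most providers use.
get_weather_schema = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berkeley'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# Given the prompt "What's the weather in Berkeley in Fahrenheit?", a
# function-calling model answers with a structured call, not prose:
model_output = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "fahrenheit"}}

print(json.dumps(model_output, indent=2))
```

The calling application, not the model, then executes the named function and feeds the result back into the conversation.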
BFCL v4, released in July 2025, introduces agentic evaluation scenarios including web search with multi-hop reasoning and error recovery, agent memory management, and format sensitivity evaluation. Previous versions focused on simpler tasks: v1 used Abstract Syntax Tree (AST) matching to verify function call syntax, v2 added enterprise and open-source-contributed function schemas, and v3 introduced multi-turn conversation evaluation. The v4 benchmark tests models across simple, parallel, and multiple function calls, multi-turn interactions, and scenarios where the model must decide whether a function call is appropriate at all. Results are verified through both AST evaluation and executable verification.
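As a rough illustration of the AST-style check, the snippet below parses a model's emitted call with Python's ast module and compares the function name and keyword arguments against a ground-truth specification. The function and accepted values are invented for the example, and the actual BFCL evaluator is considerably more involved (type coercion, multiple acceptable answers per parameter, parallel calls, and so on).

```python
import ast

def parse_call(call_str: str) -> tuple[str, dict]:
    """Parse a Python-style function call string into (name, kwargs)."""
    node = ast.parse(call_str, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = ast.unparse(node.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def ast_match(model_call: str, expected_name: str, expected_args: dict) -> bool:
    """Return True if the call names the right function with acceptable argument values."""
    try:
        name, kwargs = parse_call(model_call)
    except (ValueError, SyntaxError):
        return False
    if name != expected_name or set(kwargs) != set(expected_args):
        return False
    # Each expected argument lists the values the checker will accept.
    return all(kwargs[k] in accepted for k, accepted in expected_args.items())

# Hypothetical ground truth: any listed value is acceptable for that argument.
expected = {"city": ["Berkeley", "Berkeley, CA"], "unit": ["fahrenheit"]}
print(ast_match("get_weather(city='Berkeley', unit='fahrenheit')", "get_weather", expected))  # True
print(ast_match("get_weather(city='Berkeley', unit='kelvin')", "get_weather", expected))      # False
```

Executable verification goes a step further by actually running the generated call against a live or mocked API and checking the returned state, which catches errors that a purely syntactic comparison would miss.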
The leaderboard currently ranks models from major providers, including Anthropic’s Claude, OpenAI’s GPT-4o, Google’s Gemini, and open-source models such as Llama and Mistral. Developers can run evaluations locally using the bfcl-eval Python package (pip install bfcl-eval), which supports benchmarking both local and cloud-hosted models. This matters for teams building agentic applications in finance, healthcare, and enterprise software, where incorrect function calls can have material consequences — a model that hallucinates an API parameter name could trigger an incorrect transaction or return corrupted data.
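A lightweight guard of this sort, validating every model-emitted call against the declared schema before execution, is one way to stop hallucinated parameter names from reaching a live system. The sketch below is a hypothetical illustration of that pattern, not part of bfcl-eval; the transfer_funds schema and its fields are invented for the example.

```python
# Hypothetical pre-execution guard: reject calls whose arguments do not match the schema.
transfer_schema = {
    "name": "transfer_funds",
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string"},
            "amount": {"type": "number"},
        },
        "required": ["account_id", "amount"],
    },
}

def validate_call(arguments: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is safe to dispatch."""
    props = schema["parameters"]["properties"]
    required = set(schema["parameters"].get("required", []))
    problems = [f"unknown parameter: {name}" for name in arguments if name not in props]
    problems += [f"missing required parameter: {name}" for name in required - arguments.keys()]
    return problems

# A hallucinated 'account_number' parameter (the schema declares 'account_id') is caught
# before the call ever reaches a live payment system.
print(validate_call({"account_number": "12345", "amount": 250.0}, transfer_schema))
```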
The BFCL leaderboard and evaluation code are open source and available in the Gorilla project’s GitHub repository. Patil’s team has indicated that future versions will incorporate more complex agentic workflows, including tool-use planning and cross-session memory evaluation.