BFCL V4 Release #1019
Added some comments. Address as you see fit.
Added some comments here. Address as you see fit.
Overall, nothing blocking, mostly nits. Looks great.
…oc in format sensitivity
Hi @HuanzhiMao, thanks for adding the v4 content. Do you mind also updating the data README here: https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/data/README.md ?
BFCL V4
💥 At ICML 2025, we’re delighted to introduce BFCL V4 Agentic, a new benchmark focused on tool-calling in real-world agentic settings — including:
🔍 Web search with multi-hop reasoning and error recovery
⚠️ Evaluating Format Sensitivity
🧠 Evaluating Tool-Calling for Memory
Change Log
New agentic domain
Revised overall-accuracy formula
Leaderboard / model cleanup
Addresses "[BFCL] How should we calculate the overall accuracy for V2-Live dataset?" (#602).
The Non-Live Acc and Live Acc score calculation now excludes the Irrelevance/Relevance category scores.
Resolves "[BFCL] Verification Needed for Live-relevance Data and Ground Truth" (#1094).
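The exclusion described above can be sketched roughly as follows. This is only an illustration, not the leaderboard's actual formula: it assumes an unweighted mean over the remaining categories (the real calculation may weight by test-case count), and the function names and the suffix-based matching rule are hypothetical.

```python
from statistics import mean


def is_relevance_category(name: str) -> bool:
    # Assumption: irrelevance/relevance categories can be recognised by suffix
    # (note that "irrelevance" also ends with "relevance").
    return name.endswith("relevance")


def group_accuracy(scores: dict[str, float]) -> float:
    """Average per-category accuracy, skipping irrelevance/relevance categories.

    Assumption: an unweighted mean is used here purely for illustration.
    """
    kept = [acc for name, acc in scores.items() if not is_relevance_category(name)]
    if not kept:
        raise ValueError("no categories left after excluding relevance categories")
    return mean(kept)


# Example: the irrelevance score does not affect the group accuracy.
non_live = {"simple_python": 0.9, "simple_java": 0.7, "irrelevance": 0.1}
print(group_accuracy(non_live))
```

With this sketch, a low or high irrelevance score leaves the group accuracy unchanged, which is the behaviour the change log describes.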
Codebase refactor
enums.py
Test category rename
The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.
simple --> simple_python
java --> simple_java
javascript --> simple_javascript
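For scripts that still refer to the old category names, a small lookup like the one below may help during migration. It is a minimal sketch: the three renames come from the list above, while the helper name and the pass-through behaviour for other categories are assumptions, not part of the BFCL codebase.

```python
# Old name -> new name, as listed in the rename section above.
RENAMED_CATEGORIES = {
    "simple": "simple_python",
    "java": "simple_java",
    "javascript": "simple_javascript",
}


def migrate_category_name(name: str) -> str:
    """Return the new category name; names that were not renamed pass through.

    Hypothetical helper for illustration only.
    """
    return RENAMED_CATEGORIES.get(name, name)


for old in ["simple", "java", "javascript", "multi_turn"]:
    print(old, "-->", migrate_category_name(old))
```

Because the new names are not keys in the mapping, applying the helper twice is harmless.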
Directory layout overhaul
Results and scores now use a two-level hierarchy:
general_category ∈ { non_live, live, multi_turn, agentic, format_sensitivity }
• For agentic-memory tasks, an extra level distinguishes the memory backend.
Migrate existing outputs to this structure before upgrading; otherwise, the evaluation pipeline will fail to locate files.
New model support
Adds support for the following models:
claude-opus-4-1-20250805
gpt-5-2025-08-07
gpt-5-mini-2025-08-07
gpt-5-nano-2025-08-07
Qwen/Qwen3-30B-A3B-Instruct-2507
Qwen/Qwen3-235B-A22B-Instruct-2507
Qwen/Qwen3-4B-Instruct-2507