BFCL V4 Release by HuanzhiMao · Pull Request #1019 · ShishirPatil/gorilla · GitHub

@HuanzhiMao HuanzhiMao commented May 11, 2025

❗️Important: This PR introduces breaking changes and is NOT backward-compatible.

BFCL V4

💥 At ICML 2025, we’re delighted to introduce BFCL V4 Agentic, a new benchmark focused on tool-calling in real-world agentic settings — including:

🔍 Web search with multi-hop reasoning and error recovery
🧠 Evaluating Tool-Calling for Memory
⚠️ Evaluating Format Sensitivity

Change Log

  1. New agentic domain

    • Introduces the agentic domain with two categories: Web Search and Memory Management.
    • For more information, please see our accompanying blog posts.
  2. Revised overall-accuracy formula

    • As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks.
    | Segment     | Old % | New % |
    |-------------|-------|-------|
    | Live        | 33    | 10    |
    | Non-Live    | 33    | 10    |
    | Irrelevance | 0     | 10    |
    | Multi-Turn  | 33    | 30    |
    | Agentic     | 0     | 40    |
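
    As a rough illustration of the new weighting, the sketch below combines per-segment accuracies using the "New %" column. It assumes the overall score is a plain weighted sum of five segment-level accuracies in the range 0 to 1; the function name and dictionary keys are hypothetical and are not taken from the BFCL codebase.

```python
# New V4 weights (in percent) copied from the "New %" column of the table.
# The dictionary keys are illustrative names, not identifiers from the repo.
NEW_WEIGHTS_PERCENT = {
    "live": 10,
    "non_live": 10,
    "irrelevance": 10,
    "multi_turn": 30,
    "agentic": 40,
}


def overall_accuracy(segment_acc: dict[str, float]) -> float:
    """Weighted sum of per-segment accuracies (each expected in [0, 1]).

    Assumption: the overall score is a simple weighted sum. The PR only
    states the weights, not how each segment score is computed.
    """
    missing = NEW_WEIGHTS_PERCENT.keys() - segment_acc.keys()
    if missing:
        raise ValueError(f"missing segment accuracies: {sorted(missing)}")
    weighted = sum(
        weight * segment_acc[name] for name, weight in NEW_WEIGHTS_PERCENT.items()
    )
    return weighted / 100


if __name__ == "__main__":
    # Made-up accuracies, purely for demonstration.
    example = {
        "live": 0.80,
        "non_live": 0.90,
        "irrelevance": 0.85,
        "multi_turn": 0.50,
        "agentic": 0.30,
    }
    print(f"overall accuracy: {overall_accuracy(example):.3f}")
```

    Under these weights, a model that is perfect on every non-agentic segment but scores zero on agentic tasks would top out at 0.60.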
  3. Leaderboard / model cleanup

    • Retires several deprecated models from the leaderboard.
    • Removes unused model handlers to improve maintainability.
  4. Addresses #602, "[BFCL] How should we calculate the overall accuracy for V2-Live dataset?"

    • Non-Live Acc and Live Acc score calculation now excludes the Irrelevance/Relevance category scores.
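
    One way to picture this change: relevance-type categories are filtered out before a segment score is aggregated. The sketch below assumes an unweighted mean over the remaining categories and matches categories by name; both choices are assumptions made for illustration, and the real aggregation in the evaluation pipeline may differ.

```python
from statistics import mean


def segment_accuracy(category_acc: dict[str, float]) -> float:
    """Average category accuracies, skipping irrelevance/relevance categories.

    Assumptions: the aggregation is an unweighted mean, and the excluded
    categories can be recognised by "relevance" appearing in their name
    (this also matches "irrelevance").
    """
    kept = [acc for name, acc in category_acc.items() if "relevance" not in name]
    if not kept:
        raise ValueError("no categories left after excluding relevance categories")
    return mean(kept)


if __name__ == "__main__":
    # Made-up scores, purely for demonstration.
    live_scores = {
        "live_simple": 0.80,
        "live_multiple": 0.60,
        "live_irrelevance": 0.10,
        "live_relevance": 0.90,
    }
    print(f"Live Acc (relevance categories excluded): {segment_accuracy(live_scores):.2f}")
```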
  5. Resolves #1094, "[BFCL] Verification Needed for Live-relevance Data and Ground Truth".

  6. Codebase refactor

    • Reorganizes the response-generation pipeline and related modules for easier maintenance.
    • Simplifies the response-generation pipeline logic for locally hosted models.
    • Introduces enums.py.
  7. Test category rename
    The following categories have been renamed to avoid confusion. This applies to both dataset file names and leaderboard website columns.

    • simple --> simple_python
    • java --> simple_java
    • javascript --> simple_javascript
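
    For scripts that still refer to the old names, a small lookup table is enough to translate them. The helper below is a hypothetical convenience function, not part of the BFCL package; it encodes only the three renames listed above and leaves every other category name untouched.

```python
# Old -> new category names, exactly as listed in the change log.
RENAMED_CATEGORIES = {
    "simple": "simple_python",
    "java": "simple_java",
    "javascript": "simple_javascript",
}


def new_category_name(old_name: str) -> str:
    """Return the V4 name for a category; unknown names pass through unchanged."""
    return RENAMED_CATEGORIES.get(old_name, old_name)


if __name__ == "__main__":
    for name in ("simple", "java", "javascript", "simple_python"):
        print(f"{name} -> {new_category_name(name)}")
```

    Because names that are already in the new form pass through unchanged, applying the helper twice is harmless.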
  8. Directory layout overhaul
    Results and scores now use a two-level hierarchy:

    result/<model>/<general_category>/<category>.json
    score/<model>/<general_category>/<category>.json
    

    general_category ∈ { non_live, live, multi_turn, agentic, format_sensitivity }

    • For agentic-memory tasks, an extra level distinguishes the memory backend:

    result/<model>/agentic/<memory_backend>/<category>.json
    

    Migrate existing outputs to this structure before upgrading; otherwise, the evaluation pipeline will fail to locate the files.
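
    The layout above can be sketched as a small path builder, which may help when writing a migration script. The function name, the model name, and the memory backend name used here are hypothetical placeholders. Which general_category a given category belongs to is not spelled out in this description, so the caller has to supply it.

```python
from pathlib import PurePosixPath
from typing import Optional

# The five general categories named in the PR description.
GENERAL_CATEGORIES = {"non_live", "live", "multi_turn", "agentic", "format_sensitivity"}


def output_path(
    model: str,
    general_category: str,
    category: str,
    memory_backend: Optional[str] = None,
    root: str = "result",
) -> PurePosixPath:
    """Build <root>/<model>/<general_category>[/<memory_backend>]/<category>.json.

    The PR shows the memory-backend level only for result/ paths under
    agentic; whether score/ mirrors it is an assumption left to the caller.
    """
    if general_category not in GENERAL_CATEGORIES:
        raise ValueError(f"unknown general_category: {general_category!r}")
    if memory_backend is not None and general_category != "agentic":
        raise ValueError("memory_backend only applies to the agentic category")
    parts = [root, model, general_category]
    if memory_backend is not None:
        parts.append(memory_backend)
    parts.append(f"{category}.json")
    return PurePosixPath(*parts)


if __name__ == "__main__":
    # "my-model", "example_backend", and "example_memory_category" are placeholders.
    print(output_path("my-model", "non_live", "simple_python"))
    print(output_path("my-model", "non_live", "simple_python", root="score"))
    print(output_path("my-model", "agentic", "example_memory_category",
                      memory_backend="example_backend"))
```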

  9. New model support
    Adds support for the following models:

    • claude-opus-4-1-20250805
    • gpt-5-2025-08-07
    • gpt-5-mini-2025-08-07
    • gpt-5-nano-2025-08-07
    • Qwen/Qwen3-30B-A3B-Instruct-2507
    • Qwen/Qwen3-235B-A22B-Instruct-2507
    • Qwen/Qwen3-4B-Instruct-2507

@HuanzhiMao HuanzhiMao added the BFCL-General (General BFCL Issue) and BFCL-Dataset (BFCL Dataset-Related Issue) labels May 11, 2025
@HuanzhiMao HuanzhiMao marked this pull request as ready for review July 17, 2025 22:08

@Fanjia-Yan Fanjia-Yan left a comment

Added some comments. Address if see fit

@Fanjia-Yan Fanjia-Yan left a comment

Added some comments here. Address if see fit.

@CharlieJCJ CharlieJCJ left a comment

Overall, nothing blocking, nits in general. looks great

D-X-Y commented Sep 2, 2025

Hi @HuanzhiMao, thanks for adding the v4 content. Do you mind also updating the data README here: https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/data/README.md ?

Labels

BFCL-Dataset (BFCL Dataset-Related Issue)
BFCL-General (General BFCL Issue)
BFCL-New Model (Add New Model to BFCL)