This paper presents a novel framework for transforming existing NL2SQL datasets into large-scale, invocable API
collections, enabling the evaluation of large language models (LLMs) on realistic tool-calling tasks. The authors
introduce three API transformation paradigms (SLOT, SEL, and REST), each representing a different level of
tool-calling complexity. The generated API sets are derived from the BIRD-SQL dataset and are exposed as
executable endpoints or Python functions that can be invoked by LLMs or agents in both static and live testing
environments.
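To make the reviewed transformation concrete, the following is a minimal sketch of the general idea of wrapping a parameterized SQL SELECT as an invocable Python function (roughly the SLOT style, where the model fills argument slots of a fixed query template). The function name, table schema, and parameters are illustrative assumptions, not taken from the paper.

```python
import sqlite3

def get_schools_by_county(db_path: str, county: str, limit: int = 10):
    """Hypothetical SLOT-style tool: the LLM supplies the `county` and
    `limit` slots; the SQL template itself is fixed at generation time."""
    query = "SELECT name FROM schools WHERE county = ? ORDER BY name LIMIT ?"
    with sqlite3.connect(db_path) as conn:
        # Parameter binding keeps the model's slot values out of the SQL text.
        rows = conn.execute(query, (county, limit)).fetchall()
    return [r[0] for r in rows]
```

An agent would then see only the function signature and docstring as tool metadata, while execution runs against the real database backing the benchmark.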
The authors benchmark 10 LLMs and 3 ReAct-style agents on their proposed datasets and report very low task
completion rates (ranging from 7% to 47% for direct calls, and up to 50% for ReAct agents), highlighting current
LLM limitations in sequencing, intent detection, and slot filling for tool invocation. Extensive ablations assess
the impact of API count and function name obfuscation. All datasets and code are publicly available. However,
the paper's explanation of the methodology, particularly the processing flow of each transformation method,
lacks clarity and would benefit from visual illustration (figures or flowcharts). The writing also needs
attention; for instance, the abbreviation "TAO" on page 8 requires definition upon its first appearance.
Strengths:
1. Novelty and Practicality: The paper introduces an original and scalable method for repurposing NL2SQL data
into tool-calling datasets, a practical direction for enterprise-grade LLM deployments.
2. Comprehensive Benchmarking: It evaluates a wide range of LLMs under different usage modes (direct tool-
calling and agent-based), providing valuable insights into model capabilities and bottlenecks.
3. Realistic Evaluation Setup: By converting SQL into invocable APIs backed by real databases, the dataset
better simulates real-world applications compared to many synthetic tool-calling benchmarks.
4. Clear Structure and Contributions: The paper is well-structured with clearly defined sections, contributions,
and results that are easy to follow and interpret.
Weaknesses:
1. Limited SQL Coverage: Only SELECT queries are supported in the current setup; INSERT, DELETE, nested
queries, and temporal logic are not addressed, reducing task diversity.
2. Constrained Deployment Details: Although the authors state that all APIs are invocable, the paper lacks
detailed guidance on how to deploy and interact with these APIs in a live environment.
3. Evaluation Focus: The evaluation primarily focuses on offline or single-query tasks. Incorporating multi-turn
interactions or user feedback loops would improve the realism of tool usage.
4. Obfuscation Results Lack Depth: While the obfuscation experiments are insightful, a deeper analysis (e.g.,
how models compensate using metadata or descriptions) could provide more actionable insights.
Questions:
1. What are the primary barriers to supporting nested SELECTs or non-SELECT queries like INSERT/DELETE?
2. How sensitive are the models to tool name semantics, and could richer descriptions significantly improve
performance under name obfuscation?
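The second question could be probed with a setup like the following hedged sketch: replacing a semantic tool name with an opaque identifier while keeping the description intact, so that any remaining performance must come from the metadata. The helper name, hashing scheme, and tool-schema fields are illustrative assumptions, not the paper's actual obfuscation procedure.

```python
import hashlib

def obfuscate_name(name: str) -> str:
    """Map a semantic tool name to a deterministic opaque identifier."""
    return "fn_" + hashlib.md5(name.encode("utf-8")).hexdigest()[:8]

# A hypothetical tool record: the name is obfuscated, the description is not,
# so a model that degrades under obfuscation was leaning on name semantics.
tool = {
    "name": obfuscate_name("get_schools_by_county"),
    "description": "Return school names in a given county.",
}
```

Comparing completion rates with and without the description (and with richer versus terse descriptions) would separate name-semantics reliance from metadata reliance.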