Enhancing Function-Calling Capabilities in LLMs
MediaTek Research
{yi-chang.chen, pochun.hsu, chan.hsu, ds.shiu}@mtkresearch.com
Figure 1: An illustration of prompt templates used for function calling and instruction following in LLMs. During training, LLMs are given conditional prompts (shown on the left) and tasked with generating corresponding text completions (shown on the right). When a function call is required, the model generates structured function calls in the form of a list of functions, where each function is specified with its arguments in the format func_name(arg1=value1, ...). Special tokens, including <s>, <|im_start|>, <|im_end|>, <|answer|>, and <|use_tool|>, are each represented by a single token after tokenization. For more details, refer to Section 3.1.
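Concretely, each of these special tokens must map to exactly one token ID. Below is a minimal sketch of how this could be set up with a Hugging Face tokenizer; the checkpoint name and the assumption that <s> is already the BOS token are ours:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup; any causal LM with a Hugging Face tokenizer works the same way.
base = "MediaTek-Research/Breeze-7B-Base-v1_0"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Register the special tokens from Figure 1 so each becomes a single token.
# <s> is typically already the BOS token and needs no registration.
special_tokens = ["<|im_start|>", "<|im_end|>", "<|answer|>", "<|use_tool|>"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match

# Sanity check: each special token tokenizes to exactly one ID.
for tok in special_tokens:
    assert len(tokenizer.encode(tok, add_special_tokens=False)) == 1
```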
In the experiments, we compared the performance of different conditional prompts and different uses of training data across various metrics for instruction-following and function-calling capabilities, as discussed in Section 4.2.

3.2 Decision Token

Achieving high performance in relevance detection is challenging, often hindered by the scarcity of negative samples in most synthetic datasets (Liu et al., 2024a,b).

To address this, we propose the novel Decision Token mechanism. LLMs generate responses through next-token prediction, where each step involves a classification task to select the next token. The Decision Token concept leverages the fact that each token prediction is essentially a classification. By introducing a pair of special tokens, the model can perform a binary classification that determines whether to answer the query directly or to invoke function calls, before generating the detailed response or function calls, respectively. Specifically, this process introduces a pair of special tokens, <|answer|> and <|use_tool|>, as shown in Figure 1(e) and (g). If the model chooses to provide a direct answer, it outputs <|answer|> first; if it chooses function calling, it outputs <|use_tool|> first. This classification task forces the model to make a decision based on the user query and the provided functions before delving into the details, thereby enhancing the stability of its output.
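At inference time, the first generated token therefore acts as a binary classifier. The sketch below makes this explicit by restricting the first decoding step to the two Decision Tokens; it assumes a Hugging Face causal LM with the tokens registered as above, and the function name is illustrative:

```python
import torch

def decide_and_generate(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    """Read the Decision Token first, then let the model continue from it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # distribution over the next token

    answer_id = tokenizer.convert_tokens_to_ids("<|answer|>")
    tool_id = tokenizer.convert_tokens_to_ids("<|use_tool|>")

    # Binary classification: compare the logits of the two Decision Tokens only.
    decision = answer_id if next_logits[answer_id] >= next_logits[tool_id] else tool_id

    # Append the chosen token and generate the direct answer or function calls.
    input_ids = torch.cat([inputs["input_ids"], torch.tensor([[decision]])], dim=-1)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
```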
The Decision Token also facilitates the creation of non-function-call data from function-call data. To generate non-function-call data, consider an example where the original data involves three functions: func_A, func_B, and func_C. Based on the user query, func_A is helpful and is thus called in the original data point. By assuming that func_B and func_C are not helpful, we can create non-function-call data by removing func_A from the input. With only func_B and func_C as the remaining functions, function calling should not be triggered by the user query, and a direct answer should be provided instead. This allows us to easily obtain non-function-call data.

Previously, generating non-function-call data for training was challenging because it required specific LLM responses for non-function-call cases. With the Decision Token, however, we can train the model to output only <|answer|> in non-function-call cases. During inference, this is not an issue because the model will continue to provide an appropriate response after <|answer|>.
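Because the conversion is purely mechanical, it is easy to script. Below is a minimal sketch under the scheme above; the record field names (query, functions, target_calls) are our assumptions about the dataset layout:

```python
def to_non_function_call(example: dict) -> dict | None:
    """Derive a non-function-call example by dropping every called function
    (func_A in the text) and keeping only the unhelpful ones (func_B, func_C)."""
    called = {call["name"] for call in example["target_calls"]}
    remaining = [fn for fn in example["functions"] if fn["name"] not in called]
    if not remaining:
        return None  # no irrelevant functions left to offer; skip this example
    return {
        "query": example["query"],
        "functions": remaining,  # only functions irrelevant to the query remain
        "target": "<|answer|>",  # the model is trained to emit only <|answer|>
    }
```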
The experiments involving the Decision Token and training on synthetic non-function-call data are discussed in Section 4.3.
3.3 Chain-of-Thought Reasoning

CoT reasoning has been demonstrated to significantly enhance performance across various tasks by incorporating intermediate reasoning steps (Wei et al., 2022). Inspired by this, we explore whether CoT reasoning can similarly improve function-calling capabilities. To this end, we propose a synthetic data generation pipeline that constructs reasoning descriptions derived from sequences of conversations and function calls. The pipeline leverages single-turn queries to commercial-grade LLMs. In our prompt design, we first provide the history of the conversation and the available functions, and we require the model to identify the reasoning needed to determine how the available functions can achieve the target function calls. Additionally, we provide multiple examples to enhance stability (few-shot learning). More details are provided in Appendix A. Using this pipeline, we generate data that captures the thinking process, which is then used to fine-tune base LLMs. The fine-tuning process employs a structured prompt template, as illustrated in Figure 1(h). The experiments on incorporating CoT reasoning are presented in Section 4.4.
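A sketch of one generation step is shown below. It wraps the few-shot template of Appendix A around a conversation, its available functions, and the target function calls, then queries a commercial-grade LLM. The OpenAI client and model name are stand-ins for whichever backend is used, and FEW_SHOT_TEMPLATE elides the full Appendix A prompt:

```python
import json
from openai import OpenAI

client = OpenAI()  # stand-in for any commercial-grade LLM endpoint

FEW_SHOT_TEMPLATE = "..."  # the few-shot prompt of Appendix A, elided here

def generate_reason(conversation: list, functions: list, target_calls: list) -> str:
    """Ask a strong LLM to explain how the functions lead to the target calls."""
    prompt = (
        FEW_SHOT_TEMPLATE.replace("{FUNC_CALL}", json.dumps(target_calls, indent=4))
        + "\n"
        + json.dumps({"conversation": conversation, "functions": functions}, indent=4)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parsable JSON output
    )
    return json.loads(response.choices[0].message.content)["reason"]
```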
3.4 Multilingual Translation

To enhance the multilingual capabilities of function-calling tuning, a common approach is to translate existing English function-calling datasets into target languages. However, this process presents significant challenges: elements such as function names, enumeration items, and structured function calls cannot be directly translated without risking inconsistencies or errors. To address these issues and maintain the semantic and syntactic integrity of translated datasets, we propose a novel translation pipeline specifically designed to overcome the limitations of direct translation methods. The pipeline leverages a single-turn query to commercial-grade LLMs. In our prompt design, we first specify the JSON format of the provided conversation trajectory with function calls. We then instruct the LLMs to translate the data into the target language, adhering to the rules of not translating function names and descriptions, and translating arguments only when reasonable. More details are provided in Appendix B. The experiments verifying the effectiveness of the pipeline are presented in Section 4.5.
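A sketch of one translation call under these rules is shown below; the client, model name, and the functions field used in the sanity check are our assumptions, while the prompt itself is the Appendix B template:

```python
import json
from openai import OpenAI

client = OpenAI()  # stand-in for any commercial-grade LLM endpoint

APPENDIX_B_TEMPLATE = "..."  # the translation prompt of Appendix B, elided here

def translate_example(example: dict, target_lang: str = "Traditional Chinese") -> dict:
    """Translate one data point while leaving function names untouched."""
    prompt = (
        APPENDIX_B_TEMPLATE
        .replace("{DATA}", json.dumps(example, ensure_ascii=False, indent=4))
        .replace("{TARGET_LANG}", target_lang)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    translated = json.loads(response.choices[0].message.content)

    # Sanity check: function names must survive translation verbatim.
    original_names = [fn["name"] for fn in example["functions"]]
    translated_names = [fn["name"] for fn in translated["functions"]]
    assert original_names == translated_names
    return translated
```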
4 Experiments and Results

4.1 Experimental Setup

In this section, we describe the experimental setup used to evaluate our proposed methods, including details on datasets, model configurations, training parameters, and evaluation metrics.

We created a diverse dataset for fine-tuning, which includes both instruction-following and function-calling examples. The instruction-following data, marked as IF-110k, consists of 110k instances sampled from Open ORCA (Longpre et al., 2023), a synthetic dataset generated from GPT-4 completions. The function-calling data, marked as FC-110k, also includes 110k instances, sourced from a combination of APIGen (Liu et al., 2024b) and the glaive-function-calling-v2 dataset (https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2).

We used Breeze-7B (https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v1_0) as the base model for our experiments. Breeze-7B (Hsu et al., 2024) is an open-source language model based on Mistral-7B, designed to improve language comprehension and chatbot capabilities in Traditional Chinese. Using Breeze-7B, we can test the model's effectiveness in both English and Traditional Chinese.

The models were fine-tuned using the prompt templates described in Section 3.1. For fine-tuning, we applied the low-rank adaptation (LoRA) technique to the linear layers. The fine-tuning process used the following hyperparameters: a learning rate of 1e-4, a batch size of 48, 3 epochs, a cosine learning rate scheduler, the AdamW optimizer, 100 warmup steps, a LoRA rank (r) of 16, and a LoRA α of 32.
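These hyperparameters map directly onto a standard LoRA setup. Below is a sketch using the Hugging Face peft and transformers libraries; the exact list of target linear layers is our assumption, since the text only says the adaptation was applied to linear layers:

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# LoRA with r=16 and alpha=32, applied to the (Mistral-style) linear layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # wraps the base Breeze-7B model

training_args = TrainingArguments(
    output_dir="breeze-7b-function-calling",
    learning_rate=1e-4,
    per_device_train_batch_size=48,  # effective batch size of 48 from Section 4.1
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    warmup_steps=100,
)
```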
We evaluated the performance of our models using the following metrics:

AST Summary (%): This metric, used in the Berkeley Function Calling Leaderboard (BFCL) (Yan et al., 2024), assesses the structural correctness of language model outputs for function-calling tasks by comparing the Abstract Syntax Tree (AST) representations of generated and target function calls. It includes four problem types (Simple Function, Multiple Function, Parallel Function, and Parallel Multiple Function), categorized based on the combination of the number of provided functions and function calls. The dataset consists of 400 Simple Function tasks and 200 tasks for each of the other three types. The AST Summary is the average accuracy across these four types.
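Conceptually, the AST check parses the generated and target calls and compares function names and keyword arguments rather than raw strings. A simplified single-call matcher is sketched below; the actual BFCL evaluator additionally handles type coercion and multiple acceptable answers:

```python
import ast

def ast_match(generated: str, target: str) -> bool:
    """Compare one call of the form func_name(arg1=value1, ...) against the target."""
    try:
        gen = ast.parse(generated, mode="eval").body
        tgt = ast.parse(target, mode="eval").body
    except SyntaxError:
        return False
    if not (isinstance(gen, ast.Call) and isinstance(tgt, ast.Call)):
        return False
    same_name = ast.unparse(gen.func) == ast.unparse(tgt.func)
    gen_args = {kw.arg: ast.literal_eval(kw.value) for kw in gen.keywords}
    tgt_args = {kw.arg: ast.literal_eval(kw.value) for kw in tgt.keywords}
    return same_name and gen_args == tgt_args

# Argument order does not matter; only names and values do.
assert ast_match("get_weather(location='Palo Alto', units='Celsius')",
                 "get_weather(units='Celsius', location='Palo Alto')")
```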
Conditional Prompt                          IF-110k   FC-110k   MT Bench   AST Summary   Relevance Detection
(a) No provided functions                   ◯         ×         5.46       -             -
(b) Provide functions in a dedicated role   ◯         ◯         5.57       85.25         49.58
(c) Provide functions in the system role    ◯         ◯         5.29       85.94         39.58
(d) Provide functions in a dedicated role   ×         ◯         -          74.62         38.33
(e) Provide functions in the system role    ×         ◯         -          74.50         27.08

Table 1: Performance comparison of different prompts and the use of data on various metrics for instruction-following and function-calling capabilities. The IF-110k and FC-110k ("Use of Data?") columns indicate whether the respective datasets are included in the training process. Detailed experiments are discussed in Section 4.2.
Table 2: Impact of incrementally adding the Decision Token and synthetic non-function-call data. The table shows different prompt configurations for providing functions. The last three rows represent the configurations: baseline, Decision Token added, and both Decision Token and synthetic data added. See Section 4.3 for details.
Relevance Detection (%): This metric, also used in the BFCL, measures the success rate of making no function call when none of the provided functions are relevant. This scenario helps determine whether a model will hallucinate functions and parameters when the provided functions are irrelevant to the user's query.

MT-Bench (score): Unlike previous works, we also examine the impact on instruction-following capabilities when function-calling functionalities are enabled. MT-Bench (Zheng et al., 2023) is a benchmark for evaluating these capabilities. We use GPT-4o as a judge to give a score out of 10.

We also evaluated performance on Traditional Chinese function calling using the Function Calling Leaderboard for ZHTW (Lee et al., 2024), which is constructed by translating the BFCL; the AST Summary and Relevance Detection metrics are therefore computed in the same way.
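Under the Decision Token scheme of Section 3.2, scoring Relevance Detection reduces to checking whether the model's completion starts with <|answer|> when only irrelevant functions are supplied; a minimal sketch:

```python
def relevance_detection(completions: list[str]) -> float:
    """Success rate (%) of declining to call a function on irrelevant-function prompts.
    Each completion is the raw model output for one such prompt."""
    successes = sum(c.lstrip().startswith("<|answer|>") for c in completions)
    return 100.0 * successes / len(completions)
```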
4.2 Effects of Prompt Templates and Use of Training Data

We investigated the performance of different conditional prompts and different uses of training data on various metrics for instruction-following and function-calling capabilities, as shown in Table 1. The conditional prompts are described in Section 3.1; the training data, training setup, and metrics are described in Section 4.1.
Comparing Table 1(b) and (c), functions provided in a dedicated role and in the system role exhibit similar capabilities in terms of instruction following (MT Bench) and function-calling accuracy (AST Summary). However, Relevance Detection is superior when functions are provided in the dedicated role. We hypothesize that providing functions in the dedicated role makes the template with functions significantly different from the template without functions, making it easier for the model to learn when to use function calling and when to respond directly.

Comparing the results of Table 1(a), (b), and (c) on MT Bench, we find that enabling the function-calling capability does not reduce instruction-following performance, regardless of the conditional prompt given.

Comparing the results of Table 1(b), (c), (d), and (e) on the AST Summary and Relevance Detection metrics, we find that function-calling performance decreases when we exclude the instruction-following data (IF-110k). This observation is noteworthy. We hypothesize that the increase in function-calling capability is due to the additional instruction-following data,
which helps the model better understand the semantic structure of the prompts. Consequently, this improved understanding enhances the model's ability to accurately perform function calling. Moreover, the instruction-following data provided more non-function-call examples, further improving Relevance Detection.

In conclusion, our experiments demonstrate that the inclusion of function-calling capabilities does not compromise instruction-following performance. Additionally, the use of instruction-following data significantly enhances function-calling accuracy and relevance detection.

4.3 Effects of the Decision Token

To verify the effectiveness of the Decision Token, as described in Section 3.2, we examined the effects of incrementally adding the Decision Token and the synthetic non-function-call data.

In the baseline experiment, we used IF-110k and FC-110k as the training data to fine-tune the base model. Then, we added the Decision Token to the prompt templates and fine-tuned the base model on the same training data. In the final experiment, we used the synthetic methods described in Section 3.2 to generate 1k instances of non-function-call data, marked as NF-1k. The models were then fine-tuned on a combination of NF-1k, IF-110k, and FC-110k. The results of this investigation are presented in Table 2. In conclusion, our analysis shows that adopting the Decision Token, along with the accompanying synthetic non-function-call data, benefits Relevance Detection. However, it also results in a slight decrease in function-calling accuracy (AST Summary).

4.4 Effects of Chain-of-Thought Reasoning

To evaluate CoT reasoning (Section 3.3), we generated reasoning descriptions for each function call in FC-110k, creating FC-110k-Reason. Comparing models trained on IF-110k + FC-110k-Reason with those trained on IF-110k + FC-110k, we found no significant improvement in function-calling accuracy (AST Summary), which was 84.44% compared to the baseline of 85.25%. We hypothesize that BFCL problems may not require reasoning for function calling.

4.5 Effects of Translation Pipeline

To evaluate the effectiveness of the translation pipeline described in Section 3.4, we generated 18k Traditional Chinese function-calling instances from the FC-110k dataset using synthetic methods. Additionally, we applied the non-function-call case generation pipeline, as detailed in Section 3.2, to this dataset, producing 200 instances of non-function-call data in Traditional Chinese. The combined dataset is referred to as TC-19k.

In our baseline experiment, we used the Decision Token approach along with the IF-110k, FC-110k, and NF-1k datasets as training data. We then incorporated the TC-19k Traditional Chinese data into the training set. The results, presented in Table 3, demonstrate that even a small amount of translated data can significantly enhance function-calling performance.

How to provide functions in a prompt?     In a dedicated role      In the system role
Metrics on Function Calling Leaderboard   AST       Relevance      AST       Relevance
for ZHTW (Lee et al., 2024)               Summary   Detection      Summary   Detection
Baseline                                  52.37     36.67          50.81     47.08
+ Traditional Chinese Data (TC-19k)       61.56     41.25          58.56     45.83

Table 3: The impact of adding Traditional Chinese data, generated through a tailored translation pipeline (Section 3.4), is analyzed. Notably, the AST Summary and Relevance Detection metrics are evaluated on the benchmark for Traditional Chinese. Detailed experiments are discussed in Section 4.5.

5 Conclusion

Our research demonstrates that integrating instruction-following data with function-calling tasks significantly enhances function-calling capabilities. The Decision Token mechanism, combined with synthetic non-function-call data, further improves relevance detection. Additionally, a tailored translation pipeline effectively mitigates multilingual challenges. These findings underscore the potential for improving function-calling capabilities and expanding multilingual proficiency in LLMs, paving the way for more practical real-world applications.
References

Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachin Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Cox, Salim Roukos, Luis Lastras, and Pavan Kapanipathi. 2024. Granite-function calling model: Introducing function calling abilities via multi-task learning of granular tasks. Preprint, arXiv:2407.00121.

Maxwell Crouse, Ibrahim Abdelaziz, Kinjal Basu, Soham Dan, Sadhana Kumaravel, Achille Fokoue, Pavan Kapanipathi, and Luis A. Lastras. 2024. Formally specifying the high-level behavior of LLM-based agents.

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2022. PAL: Program-aided language models. arXiv preprint arXiv:2211.10435.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al. 2024. The Llama 3 herd of models. Preprint, arXiv:2407.21783.

Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A real-world webagent with planning, long context understanding, and program synthesis. In The Twelfth International Conference on Learning Representations.

Yilun Hao, Yongchao Chen, Yang Zhang, and Chuchu Fan. 2024. Large language models can plan your travels rigorously with formal verification tools. arXiv preprint arXiv:2404.11891.

Joy He-Yueya, Gabriel Poesia, Rose E. Wang, and Noah D. Goodman. 2023. Solving math word problems by combining language models with symbolic solvers. Preprint, arXiv:2304.09102.

Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, and Da-Shan Shiu. 2024. Breeze-7B technical report. arXiv preprint arXiv:2403.02712.

Shijue Huang, Wanjun Zhong, Jianqiao Lu, Qi Zhu, Jiahui Gao, Weiwen Liu, Yutai Hou, Xingshan Zeng, Yasheng Wang, Lifeng Shang, Xin Jiang, Ruifeng Xu, and Qun Liu. 2024. Planning, creation, usage: Benchmarking LLMs for comprehensive tool utilization in real-world complex scenarios. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4363–4400, Bangkok, Thailand. Association for Computational Linguistics.

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2021. Internet-augmented dialogue generation. In Annual Meeting of the Association for Computational Linguistics.

Liang-Chieh Lee, Cheng-Wei Lin, Pei-Chen Ho, Chien-Yu Yu, Yi-Chang Chen, and Da-Shan Shiu. 2024. Function calling leaderboard for ZHTW.

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, and Enhong Chen. 2024a. ToolACE: Winning the points of LLM function calling. Preprint, arXiv:2409.00920.

Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, et al. 2024b. APIGen: Automated pipeline for generating verifiable and diverse function-calling datasets. arXiv preprint arXiv:2406.18518.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. 2023. The Flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, pages 22631–22648. PMLR.

Nexusflow.ai. 2023. NexusRaven-V2: Surpassing GPT-4 for zero-shot function calling.

Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022. TALM: Tool augmented language models. Preprint, arXiv:2205.12255.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learning Representations.

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. 2024. Tool learning with large language models: A survey. arXiv preprint arXiv:2405.17935.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Preprint, arXiv:2303.11366.

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. 2023. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. Preprint, arXiv:2306.05301.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, and Dongkuan Xu. 2023. ReWOO: Decoupling reasoning from observations for efficient augmented language models. Preprint, arXiv:2305.18323.

Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html.

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2023a. GPT4Tools: Teaching large language model to use tools via self-instruction. In Thirty-seventh Conference on Neural Information Processing Systems.

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023b. MM-REACT: Prompting ChatGPT for multimodal reasoning and action.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.

Ruizhe Zhong, Xingbo Du, Shixiong Kai, Zhentao Tang, Siyuan Xu, Hui-Ling Zhen, Jianye Hao, Qiang Xu, Mingxuan Yuan, and Junchi Yan. 2023. LLM4EDA: Emerging progress in large language models for electronic design automation. Preprint, arXiv:2401.12224.
Appendix A: Prompt for Generating Reasoning Descriptions

## Example 1:
Please output JSON with the key `reason` for identifying the reason to figure out how to use the available functions and finally expect to get the answer shown below.
```json
[
    {
        "name": "weather_api.get_current_weather",
        "arguments": {
            "location": "Palo Alto",
            "units": "Celsius"
        }
    }
]
```
```json
{
    "reason": "The user wants to know the current weather conditions in Palo Alto. The available tool 'weather_api.get_current_weather' can be used to retrieve this information by specifying the location as 'Palo Alto'."
}
```

## Example 2:
Please output JSON with the key `reason` for identifying the reason to figure out how to use the available functions and finally expect to get the answer shown below.
```json
[
    {
        "name": "math_toolkit.sum_of_multiples",
        "arguments": {
            "lower_limit": 1,
            "upper_limit": 1000,
            "multiples": [3, 5]
        }
    },
    {
        "name": "math_toolkit.product_of_primes",
        "arguments": {
            "count": 5
        }
    }
]
```
```json
{
    "reason": "The user wants to find the sum of all multiples of 3 and 5 between 1 and 1000, and also find the product of the first five prime numbers. The available tools 'math_toolkit.sum_of_multiples' and 'math_toolkit.product_of_primes' can be used to retrieve this information. The 'math_toolkit.sum_of_multiples' tool can be used by specifying the lower limit as 1, the upper limit as 1000, and the multiples as [3, 5]. The 'math_toolkit.product_of_primes' tool can be used by specifying the count as 5."
}
```

## Start
Please output JSON with the key `reason` for identifying the reason to figure out how to use the available functions and finally expect to get the answer shown below.
```json
{FUNC_CALL}
```
Appendix B: Prompt for the Translation Pipeline

This JSON object outlines a conversation between a user and an assistant, including the available functions the assistant can utilize to meet the user's requests.
```json
{DATA}
```
AND NOW,
I want to translate this JSON into {TARGET_LANG}.
Note that:
- Do not translate any content in `functions`
- Translate the content in `arguments` if using {TARGET_LANG} is reasonable