ing results, potentially slowing down advancement in this field. As a first step towards addressing these challenges, in this work we present a holistic framework, namely DB-GPT-Hub. Apart from existing works that mostly focus on few-shot prompting strategies or on tuning relatively smaller LLMs, our work focuses on tuning larger LLMs. In all, DB-GPT-Hub consolidates essential research assets (e.g., data, model services, evaluation methods, documentation) with the following distinct merits:

• Standardization. We establish a standardized pipeline in an open-source codebase, with unified experimental settings and containerized environments, to enable transparent and consistent comparisons of LLMs after tuning on text-to-SQL tasks.

• Comprehensiveness. We conduct extensive benchmarking that covers a range of medium to large-sized, fine-tuned LLMs across various experimental scenarios and explore their relative performance compared to prompting methods. Our work comprises one of the most pragmatic and expansive sets of benchmark suites available.

• Extensibility. As a rapidly evolving field, novel LLM-based methods constantly emerge and the best practice continuously evolves. Following our documentation and protocols, one can effortlessly incorporate novel ingredients into our codebase: new datasets, new modules, new models (or model services), and new evaluation programs. Moreover, our framework offers easy compatibility with various prompting techniques. The high extensibility will eventually benefit the research area of text-to-SQL.

2 Background and Problem Formulation

2.1 A Generalized Setup

The input of the text-to-SQL task is a natural language question q and the database information D. The output is the SQL s corresponding to the question. The database D = {S, K_p, K_f} includes the database schema S, primary keys K_p and foreign keys K_f, where S usually contains multiple tables T_k: S = {T_1, T_2, ..., T_s, ...}. Each table T_k has a table name N_k, column names c_j and column data types t_j. Therefore, T_k = {N_k, (c_k1, t_k1), (c_k2, t_k2), ...}. Considering that the queries may come from various database domains, we formulate the data into a set of triples M = {(q_i, s_i, D_i)}, with i denoting the index of the query, the output and the source database.
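To make this setup concrete, the following minimal sketch spells out these objects as plain data structures. The class and field names are illustrative only and are not part of DB-GPT-Hub's actual interfaces.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Table:
    # T_k = {N_k, (c_k1, t_k1), (c_k2, t_k2), ...}
    name: str                              # table name N_k
    columns: List[Tuple[str, str]]         # (column name c_j, column data type t_j)

@dataclass
class Database:
    # D = {S, K_p, K_f}
    schema: List[Table]                    # S = {T_1, T_2, ...}
    primary_keys: List[str]                # K_p, e.g. "singer.singer_id"
    foreign_keys: List[Tuple[str, str]]    # K_f, pairs of linked columns

@dataclass
class Example:
    # one triple (q_i, s_i, D_i) from the set M
    question: str         # q_i
    sql: str              # s_i
    database: Database    # D_i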
2.2 Prompt-based and Fine-tuning Settings

Based on how LLMs are used for text-to-SQL generation, the problem settings can be categorized into two scenarios: zero-shot/few-shot prompting and fine-tuning.

Zero-shot / Few-shot Prompting. In zero-shot scenarios no exemplar is provided, while in few-shot scenarios a few input-output exemplars are provided to prompt LLMs. Formally, given a pretrained LLM parameterized by θ, the question q_i, and k exemplars (k ≥ 0), the objective is to maximize the probability of generating the correct SQL s_i from the LLM:

    max_{s_i} P_{LLM_θ}(s_i | σ(q_i, M)),   |M| = k        (1)

where σ(q_i, M) denotes a representation of the target question q_i that incorporates relevant information from the exemplars.¹

¹σ(q_i, M) technically denotes the information set generated by q_i and M.
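A minimal sketch of how such a representation σ(q_i, M) can be serialized into a prompt is given below. It mirrors the text representation of Listing 1 and the examples in Appendix A.4, but the function names and the exact exemplar formatting are illustrative assumptions rather than DB-GPT-Hub's exact implementation.

from typing import List, Sequence, Tuple

def serialize_schema(tables: List[Tuple[str, List[str]]]) -> str:
    # Render each table as "Table <name>, columns = [*, c1, c2, ...]".
    return "\n".join(
        f"Table {name}, columns = [*,{','.join(cols)}]" for name, cols in tables
    )

def build_prompt(question: str,
                 tables: List[Tuple[str, List[str]]],
                 exemplars: Sequence[Tuple[str, str]] = ()) -> str:
    # sigma(q_i, M): schema description + k (question, SQL) exemplars + target question.
    # An empty exemplar list gives the zero-shot prompt; k >= 1 gives the few-shot prompt.
    parts = [
        "I want you to act as a SQL terminal in front of a database and below is a "
        "description of the database schema. Write a response that appropriately "
        "completes the request.",
        serialize_schema(tables),
    ]
    for idx, (q, sql) in enumerate(exemplars, start=1):
        parts.append(f"/* example {idx} */\nQ: {q}\nResponse: {sql}")
    parts.append(f"Q: {question}\nResponse:")
    return "\n\n".join(parts)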
Fine-tuning. The fine-tuning process involves adapting the pretrained LLM_θ to generate SQL from the input sequences by tuning the model with text-to-SQL datasets, which contain a collection of pairs of serialized inputs q_i and corresponding SQL outputs s_i. The objective of fine-tuning is to minimize the empirical loss:

    min_θ L(ŝ_i(LLM_θ), s_i | σ(q_i))        (2)

where L is the loss function measuring the difference between the generated SQL and the ground truth.
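In practice, L is typically the token-level cross-entropy over the SQL tokens, with the serialized prompt σ(q_i) fed to the model but excluded from the loss. The sketch below shows this common instantiation; it is an assumption for illustration, not the exact loss code used in DB-GPT-Hub.

import torch
import torch.nn.functional as F

def sql_only_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    # logits: (batch, seq_len, vocab) from a causal LM; input_ids: (batch, seq_len).
    # Shift so that position t predicts token t + 1 (standard causal LM training).
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask the prompt sigma(q_i); only the SQL tokens s_i contribute to the loss.
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )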
Despite the significant advances achieved with few-shot prompting of LLMs, it remains a formidable challenge for a pretrained LLM to rely solely on its parametric knowledge and prompting to accurately process highly complex SQL queries.

Parameter-Efficient Fine-tuning. Medium to large-sized models with billions of parameters are prohibitively expensive to fine-tune in order to adapt them to particular tasks or domains. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by fine-tuning only a small number of (extra) model parameters instead of all of the model's parameters. Two of the most commonly used techniques are LoRA (Hu et al., 2021), which proposes to freeze the pretrained
model weights and inject trainable rank-decomposition matrices into each transformer block, and its quantized version QLoRA (Dettmers et al., 2023). Throughout the benchmark, we use these two strategies consistently to tune the LLMs. See Section 3 and Section 4 for details of the tuning benchmark design and experimental results.

Listing 1 below shows the prompt template used to guide LLMs; see listing 2 and listing 3 in Appendix A.4 for full examples.

I want you to act as a SQL terminal in front of a database and below is a description of the database schema. Write a response that appropriately completes the request.

<schema>

/* example 1 */
/* example 2 */
... (optional)

<query>
<response>

Listing 1: The text representation prompt template.
[Figure 1: Pipeline and architecture of the DB-GPT-Hub codebase, with four stages: Dataset Construction (Instruction Builder and Example Selector over open datasets), Trainer (fine-tuning of open-source LLMs with the PEFT strategies LoRA / QLoRA, Deepspeed optimization and adaptive parallelism), Predictor (SQL generation with fine-tuned open-source LLMs or closed-source LLMs, optionally with in-context learning prompting), and Evaluation (metrics suite with EX / EM / VES and a downloadable report).]
3 Tuning Benchmark Design

We benchmark open-source LLMs from several widely used model families, including ChatGLM, Qwen (Bai et al., 2023), Baichuan (Baichuan, 2023) and Llama (Touvron et al., 2023):

• ChatGLM3-6B, the up-to-date open version of ChatGLM, an open bilingual language model based on the GLM framework.

• Qwen-7B/14B/72B-Chat, a series of aligned models of Qwen.

• Baichuan2-7B/13B-Chat, the up-to-date collection of aligned models of Baichuan.

• LLaMA2-7B/13B/70B-Chat³, the up-to-date aligned version of LLaMA.

• CodeLLaMA-7B/13B/70B-Instruct, aligned versions of LLaMA-2 further tuned with code data.

³Due to the page limitation, we omit the suffix "-Chat" from the names of LLMs in the tables throughout the following sections. For instance, "Qwen-7B" should be read as the "Qwen-7B-Chat" model.

To ensure a fair comparison, we use the same maximal context length of 2048 for all the LLMs. During evaluation, we leave 512 tokens for response generation. We set the temperature argument to 0 to eliminate the influence of randomness.

Tuning Methods. As the scale of the dataset is notably smaller than that of the LLMs, we apply the PEFT strategies, LoRA and QLoRA, to tune the LLMs. For medium-sized models (7B/13B), we adopt 1 Nvidia A100 Tensor Core GPU to run training and inference. For large-sized models (70B), we adopt 8 A100s.
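For reference, a minimal LoRA/QLoRA setup along these lines can be written with the Hugging Face transformers and peft libraries as sketched below. The rank and alpha values follow Table 5 in Appendix A, but the model name, dropout and target modules are illustrative assumptions; this is not the exact training code of DB-GPT-Hub.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"  # any of the benchmarked chat models

# QLoRA: load the frozen base model in 4-bit to reduce GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,   # drop this argument for plain 16-bit LoRA
    device_map="auto",
)

# LoRA: inject trainable rank-decomposition matrices into the attention projections.
lora_config = LoraConfig(
    r=64,                                  # LoRA rank (Table 5)
    lora_alpha=32,                         # LoRA alpha (Table 5)
    lora_dropout=0.05,                     # illustrative; not specified in the paper
    target_modules=["q_proj", "v_proj"],   # assumption; varies per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable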
Benchmark Pipeline. Figure 1 presents the open benchmarking pipeline implemented in DB-GPT-Hub. This pipeline will facilitate future research in this area and help promote reproducible work.

3.2 Codebase

To facilitate innovation in the community, DB-GPT-Hub contains a well-modularized, easy-to-extend codebase for standardization of the implementation, evaluation, and ablation of text-to-SQL methods.

Software Architecture. Figure 1 presents the pipeline and architecture of our codebase. The pipeline is decomposed into the following parts:

• Dataset Construction. Raw text-to-SQL data is processed into a suitable format (e.g., the TRF shown in listing 1) to tune LLMs; a sketch of such a record is shown after this list. This includes integrating the schema and database description into a prompt as an instruction, along with various question representations to boost performance during training and evaluation. Additionally, we select different few-shot strategies, such as example selection and organization, to construct the evaluation dataset (Gao et al., 2023).

• Training. Our codebase supports the fine-tuning of open-source LLMs with PEFT strategies. We support most public architectures with small to large-sized model scales, such as Qwen, Llama, Baichuan, and ChatGLM.

• Prediction. Our codebase supports SQL query inference for open-source LLMs (and their fine-tuned versions) as well as closed-source LLMs. We support few-shot and zero-shot methods to generate SQL for specific scenarios.

• Evaluation. Our repository holds different metrics (EX, EM, and the valid efficiency score (VES)) to evaluate the performance of the generated SQL from different perspectives; a sketch of the EX computation is given at the end of this subsection.
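Below is a sketch of one record after Dataset Construction, in an instruction/input/output style that a Trainer could consume. The field names and the singer/concert schema are illustrative assumptions; the exact format of the DB-GPT-Hub data files may differ.

# One illustrative training record produced by Dataset Construction.
record = {
    "instruction": (
        "I want you to act as a SQL terminal in front of a database and below is a "
        "description of the database schema. Write a response that appropriately "
        "completes the request.\n"
        "Table singer, columns = [*,singer_id,name,country,age]\n"
        "Table concert, columns = [*,concert_id,concert_name,year]"
    ),
    "input": "Q: How many singers do we have?",
    "output": "SELECT count(*) FROM singer",
}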
Implementations. The codebase is built with the PyTorch framework (Paszke et al., 2017), upon the open-source project DB-GPT (Xue et al., 2023a, 2024a). We release the code under the Apache License 2.0 and we are committed to actively maintaining the repository.
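As a concrete reference for the Evaluation component, the sketch below computes EX for a single example by executing both queries on a SQLite database and comparing the result multisets. It is an illustrative simplification of the official Spider/BIRD evaluation scripts, which additionally handle ordering requirements, multiple gold values, VES and timeouts; EM is instead computed by clause-set comparison without executing anything.

import sqlite3
from collections import Counter

def execution_accuracy(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    # EX for one example: correct when the prediction's execution result
    # matches that of the gold SQL on the given database.
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False          # un-executable predictions count as wrong
    finally:
        conn.close()
    # Compare as multisets so that row order does not matter.
    return Counter(pred_rows) == Counter(gold_rows)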
4 Experiments

In this section, with the utility of DB-GPT-Hub, we formally evaluate the text-to-SQL process to determine the performance differences among various LLMs and to explore the effect of training paradigms on the tuning performance of LLMs.

4.1 Main Results

Table 1 and Table 2 show the evaluation results, measured by EX, on the Spider and BIRD datasets, respectively. The results in EM on both datasets can be found in Table 6 and Table 7 in Appendix B.

Best Models. Unsurprisingly, the tuned CodeLlama families, whose base models have been optimized for code generation and infilling, show consistently better performance than the other competitors on both datasets. Specifically, we obtain the following key insight:

• As shown in the right-most columns of Table 1 and Table 2, the fine-tuned, small-sized CodeLlama (e.g., CodeLlama-7B-LoRA) exhibits comparable, and in some cases even superior, performance to other tuned medium to large-sized open LLMs, such as Qwen-14B/72B-LoRA.

[Figure 2 (bar chart; y-axis: Improvement on EX; models: Llama-7B, CodeLlama-7B, Qwen-7B; groups: easy, medium, hard, extra, overall)]
Figure 2: The improvement from tuning with the LoRA strategy across subgroups of different complexities.

LoRA vs QLoRA. We summarize the EX, EM, time cost, and GPU memory metrics in Table 3. Firstly, not surprisingly, we see limited differences in generation performance, measured by EX and EM, between models tuned with LoRA and QLoRA. Secondly, consistent with the quantization mechanism, QLoRA takes more time to converge while using less GPU memory. For example, compared to Qwen-14B-LoRA, its QLoRA counterpart takes more than 2× the time with only about 50% of the GPU memory.

To conclude, in circumstances with restricted computational resources, QLoRA is an efficient tuning alternative that can save memory without sacrificing performance.
MODEL            EASY              MEDIUM            HARD              EXTRA             OVERALL
                 BASE  L/QL        BASE  L/QL        BASE  L/QL        BASE  L/QL        BASE  L/QL
LLaMA2-7B        0.000 0.887/0.847 0.000 0.641/0.623 0.000 0.489/0.466 0.000 0.331/0.361 0.000 0.626/0.608
LLaMA2-13B       0.000 0.907/0.911 0.000 0.729/0.700 0.000 0.552/0.552 0.000 0.343/0.319 0.000 0.680/0.664
LLaMA2-70B       0.411 0.915/−     0.229 0.732/−     0.190 0.560/−     0.072 0.392/−     0.241 0.687/−
CodeLLaMA-7B     0.214 0.923/0.911 0.177 0.756/0.751 0.092 0.586/0.598 0.036 0.349/0.331 0.149 0.702/0.696
CodeLLaMA-13B    0.698 0.940/0.940 0.600 0.789/0.744 0.408 0.684/0.626 0.271 0.404/0.392 0.529 0.746/0.727
CodeLLaMA-70B    0.722 0.962/−     0.625 0.812/−     0.443 0.716/−     0.302 0.432/−     0.567 0.771/−
Baichuan2-7B     0.577 0.871/0.891 0.352 0.630/0.637 0.201 0.448/0.489 0.066 0.295/0.331 0.335 0.603/0.624
Baichuan2-13B    0.581 0.903/0.895 0.413 0.702/0.675 0.264 0.569/0.580 0.187 0.392/0.343 0.392 0.678/0.659
Qwen-7B          0.395 0.855/0.911 0.256 0.688/0.675 0.138 0.575/0.575 0.042 0.331/0.343 0.235 0.652/0.662
Qwen-14B         0.871 0.895/0.919 0.632 0.702/0.744 0.368 0.552/0.598 0.181 0.367/0.458 0.573 0.663/0.701
Qwen-72B         0.831 0.927/−     0.635 0.756/−     0.489 0.621/−     0.277 0.367/−     0.600 0.712/−
ChatGLM3-6B      0.000 0.855/0.843 0.000 0.605/0.603 0.000 0.477/0.506 0.000 0.271/0.211 0.000 0.590/0.581
Table 1: Evaluations on Spider: EX of base models vs. fine-tuned models on each complexity split and on the overall dataset. "L" and "QL" denote the LoRA and QLoRA tuning methods, respectively.
Table 2: Evaluations on BIRD: EX of base models vs. fine-tuned models on each complexity split and on the overall dataset. "L" and "QL" denote the LoRA and QLoRA tuning methods, respectively.
4.2 Analysis I: Few-shot Evaluation

For strong base models, the margin brought by tuning is small in few-shot scenarios. For example, the EX of Qwen-72B-LoRA vs. Qwen-72B is 68.5 vs. 64.8 on 3-shot and 68.4 vs. 65.0 on 5-shot. This is more clearly observed from a different perspective in Figure 3, where the curves for Qwen-14B/72B are flat at low levels. This is possibly because Qwen-72B already has strong SQL reasoning capabilities, which has barely been discussed in other text-to-SQL benchmarking works.

In all, fine-tuned models exhibit superior SQL reasoning abilities compared to non-tuned models in few-shot generation scenarios; however, the margin of improvement is relatively small for robust models like Qwen-72B.

Effect of Model Size. From Figure 4, we interpret the few-shot performance w.r.t. the model size for four models (two base models and two tuned models) and observe that:

• Larger models consistently achieve better results in few-shot scenarios compared to their smaller-sized counterparts.

• For a given few-shot scenario, the performance margin of the tuning method over the prompting method becomes smaller as the size of the LLMs grows. For example, in the 1-shot scenario, the performance improvement in EX of Qwen-LoRA over Qwen is 31.0, 24.1 and 3.5 for 7B, 14B and 72B, respectively.

Recall that the exact figures of the few-shot evaluations can be found in Table 8 in Appendix B. Overall, tuning methods continue to outperform prompting methods, while the performance gap narrows as the size of the LLMs increases.
MODEL            EX             EM             TIME COST (HOURS)   GPU MEMORY (GB)
                 LoRA   QLoRA   LoRA   QLoRA   LoRA   QLoRA        LoRA   QLoRA
LLaMA2-7B        0.626  0.608   0.581  0.564   4.12   5.74         23.5   16.9
LLaMA2-13B       0.680  0.664   0.640  0.632   7.26   8.82         34.8   29.6
CodeLLaMA-7B     0.702  0.696   0.668  0.665   4.33   6.74         23.8   16.7
CodeLLaMA-13B    0.746  0.727   0.701  0.682   7.26   8.82         34.8   29.6
Baichuan2-7B     0.603  0.624   0.588  0.602   3.33   7.52         20.9   11.5
Baichuan2-13B    0.678  0.659   0.607  0.606   8.12   15.3         34.4   17.5
Qwen-7B          0.652  0.662   0.610  0.621   2.57   6.45         28.9   17.1
Qwen-14B         0.663  0.701   0.658  0.665   4.23   11.32        38.4   18.1
Table 3: The comparison between LoRA and QLoRA on Spider across different perspectives: EX and EM are the performance metrics; the training time and maximum GPU memory cost are the resource metrics.
[Figure 3 (three panels: (a) 7B, (b) 13B, (c) 70B / 72B; y-axis: Improvement on EX; x-axis: k-shot, k = 0 ... 5)]
Figure 3: Few-shot evaluations on Spider: EX improvement in few-shot scenarios over zero-shot. EX(k-shot) represents the EX of the target (untuned/tuned) model under the k-shot scenario minus the EX of the base model in the zero-shot scenario, i.e., in (a), Improvement on EX(Qwen-LoRA, 3-shot) = EX(Qwen-LoRA, 3-shot) − EX(Qwen, 0-shot).
4.3 Analysis II: Fine-tuning with More Exemplars

In this subsection, we explore the possibility of enhancing the performance of LLMs by adding more contextual examples during fine-tuning.

Setup. We use Qwen-7B as the base model and construct three additional few-shot (1/3/5-shot) training sets to fine-tune the model. Specifically, the 1/3/5-shot training sets consist of query-response pairs with an additional 1/3/5 exemplars, as sketched below. For a given model, we also evaluate its few-shot performance, as in Section 4.2.
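A minimal sketch of how such k-shot and random-shot training corpora can be assembled is given below. The field names and the sampling details are illustrative assumptions rather than the exact construction code.

import random
from typing import Dict, List, Union

def make_kshot_corpus(pairs: List[Dict], k: Union[int, List[int]], seed: int = 0) -> List[Dict]:
    # pairs: training examples with "schema_prompt", "question" and "sql" fields (illustrative names).
    # k: a fixed number of exemplars (1/3/5-shot) or a list of choices (random-shot, e.g. [0, 1, 3, 5]).
    rng = random.Random(seed)
    corpus = []
    for i, ex in enumerate(pairs):
        n = rng.choice(k) if isinstance(k, list) else k
        pool = pairs[:i] + pairs[i + 1:]          # never use the target pair as its own exemplar
        shots = rng.sample(pool, n) if n > 0 else []
        demos = "\n".join(f"Q: {s['question']}\nResponse: {s['sql']}" for s in shots)
        corpus.append({
            "instruction": ex["schema_prompt"],
            "input": (demos + "\n" if demos else "") + f"Q: {ex['question']}",
            "output": ex["sql"],
        })
    return corpus

# e.g. three_shot = make_kshot_corpus(train_pairs, 3); random_shot = make_kshot_corpus(train_pairs, [0, 1, 3, 5])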
Core Insights. As shown in Table 4, we draw the following insights:

• In the zero-shot evaluation scenario, tuning with additional exemplars does not yield a significant improvement in performance (see the "0-shot" column). This is possibly because the training corpus (more examples) mismatches the evaluation setting (no examples).

• In the 1/3/5-shot evaluation scenarios, adding more contextual examples contributes to a notable improvement over the counterpart tuned with the 0-shot training corpus. This means that the performance loss on few-shot evaluation for zero-shot training is caused by the prompt mismatch between the training and evaluation datasets.

• The random-shot strategy, which refers to randomly adding 0/1/3/5 examples to the training corpus, achieves the highest EM scores. This finding is consistent with the one proposed by Sun et al. (2023): a diverse training corpus benefits the fine-tuning of LLMs.

5 Related Work

5.1 LLM-empowered Text-to-SQL Methods

Driven by the considerable success of LLMs, the field of LLM-empowered text-to-SQL has recently captured the interest of a large number of researchers in both the natural language processing and database communities. Models for LLM-based text-to-SQL can be categorized into supervised fine-tuning based and prompting based methods. Popular fine-tuned text-to-SQL models are SQL-PaLM (Sun et al., 2023), PICARD (Scholak et al., 2021) and RESDSQL (Li et al., 2023a). In contrast to supervised fine-tuned models, prompting-based models do not require additional fine-tuning on task-specific training data. Instead, they rely solely on the zero-shot and few-shot (Rajkumar et al., 2022; Liu et al., 2023) capabilities inherent in LLMs. Within the prompting paradigm, the pivotal factor for query representation lies in the design of the prompt (Wei et al., 2022; Zhou et al., 2022; Wang et al., 2022a).
[Figure 4 (four panels: (a) Llama2, (b) Llama2-LoRA, (c) Qwen, (d) Qwen-LoRA; y-axis: EX; x-axis: model size, 7B/13B/70B for Llama2 and 7B/14B/72B for Qwen; curves: 0-shot, 1-shot, 3-shot, 5-shot)]
Figure 4: Few-shot evaluations on Spider: the EX performance of Llama2 / Qwen and their tuned counterparts with varying model size.
MODEL                         0-SHOT        1-SHOT        3-SHOT        5-SHOT
                              EM    EX      EM    EX      EM    EX      EM    EX
Qwen-7B                       16.1  22.9    27.4  34.0    27.6  33.9    25.9  33.8
Qwen-7B-LoRA (0-shot)         61.0  65.3    58.4  61.8    57.8  62.0    57.7  61.4
Qwen-7B-LoRA (1-shot)         61.2  64.0    61.7  64.8    60.8  63.8    61.8  64.8
Qwen-7B-LoRA (3-shot)         61.0  62.8    62.0  62.8    60.7  62.1    60.7  62.9
Qwen-7B-LoRA (5-shot)         60.4  62.7    62.0  64.0    61.5  63.2    60.9  63.5
Qwen-7B-LoRA (random-shot)    61.5  63.0    62.1  64.0    62.2  63.6    61.9  63.6
Table 4: Few-shot evaluations on Spider: EM and EX of fine-tuned models with different numbers of examples in the training corpus.
In particular, DIN-SQL (Pourreza and Rafiei, 2023) introduces adaptive prompt strategies via task decomposition to effectively address challenges associated with schema linking. DAIL-SQL (Gao et al., 2023) proposes a refined prompt selection and organization strategy to improve performance. In DB-GPT-Hub, we offer scripts to support researchers in fine-tuning LLMs in accordance with the methodologies established in SQL-PaLM. In addition, we also integrate the popular prompt techniques used in DAIL-SQL.

5.2 Text-to-SQL Benchmarks

A pivotal factor in the progression of text-to-SQL is the establishment of high-quality benchmarks. Early benchmarks focus on single databases, including ATIS (Dahl et al., 1994), GeoQuery (Zelle and Mooney, 1996), Academic (Li and Jagadish, 2014), Advising (Finegan-Dollak et al., 2018), and more recent additions such as SEDE (Hazoom et al., 2021) and MIMICSQL (Wang et al., 2019). These benchmarks and datasets are often adapted from real-life applications, with many containing domain-specific knowledge that may not generalize effectively to unseen SQL domains. Hence, large-scale cross-domain datasets featuring professional SQL queries, such as Squall (Shi et al., 2020), Spider (Yu et al., 2018b), Spider-Syn (Gan et al., 2021), WikiSQL (Zhong et al., 2017), and SparC (Yu et al., 2020), have been introduced to facilitate comprehensive method analyses.

In retrospect, we note two concurrent works (Gao et al., 2023; Zhang et al., 2024) which perform systematic benchmarking of text-to-SQL methods. Important distinctions of their work from ours include: 1. comprehensiveness of benchmark settings: we evaluate both ICL and medium to large-sized fine-tuning methods in an end-to-end manner, while Gao et al. (2023) focus on ICL methods and Zhang et al. (2024) assess various subtasks of the text-to-SQL process; 2. open source of the codebase: we release a well-maintained open repository on GitHub containing all code and data assets, which, to the best of our knowledge, is one of the most popular text-to-SQL benchmark repositories (over 1k stars so far), while neither of them has achieved this.

6 Conclusion

In this study, we conduct a systematic benchmarking of various LLMs within the text-to-SQL pipeline. Our benchmarking provides a meticulous perspective on the pipeline, equipping the research community with strategies to improve the semantic understanding of LLMs.

7 Limitations

The large computational resources required for LLM training might not be accessible to all researchers and practitioners, which may limit the reproducibility of our findings.
References

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS).

Deborah A. Dahl, Madeleine Bates, Michael Brown, William M. Fisher, Kate Hunicke-Smith, David S. Pallett, Christine Pao, Alexander I. Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Human Language Technology - The Baltic Perspective.

Naihao Deng, Yulong Chen, and Yue Zhang. 2022. Recent advances in text-to-SQL: A survey of what we have and what we expect. arXiv, abs/2208.10099.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.

Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Lu Chen, Jinshu Lin, and Dongfang Lou. 2023. C3: Zero-shot text-to-SQL with ChatGPT. CoRR, abs/2307.07306.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir R. Radev. 2018. Improving text-to-SQL evaluation methodology. arXiv, abs/1806.09029.

Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John Robert Woodward, Jinxia Xie, and Pengsheng Huang. 2021. Towards robustness of text-to-SQL models against synonym substitution. In Annual Meeting of the Association for Computational Linguistics.

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-SQL empowered by large language models: A benchmark evaluation.

Jiaqi Guo, Ziliang Si, Yu Wang, Qian Liu, Ming Fan, Jian-Guang Lou, Zijiang Yang, and Ting Liu. 2021. Chase: A large-scale and pragmatic Chinese dataset for cross-database context-dependent text-to-SQL. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 2316-2331, Online. Association for Computational Linguistics.

Moshe Hazoom, Vibhor Malik, and Ben Bogin. 2021. Text-to-SQL in the wild: A naturally-occurring dataset based on Stack Exchange data. arXiv, abs/2106.05006.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Gangwei Jiang, Caigao Jiang, Siqiao Xue, James Y. Zhang, Jun Zhou, Defu Lian, and Ying Wei. 2023. Towards anytime fine-tuning: Continually pre-trained language models with hypernetwork prompt. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, Shirui Pan, Vincent S. Tseng, Yu Zheng, Lei Chen, and Hui Xiong. 2023. Large models for time series and spatio-temporal data: A survey and outlook. Preprint, arXiv:2310.10196.

Fei Li and H. V. Jagadish. 2014. Constructing an interactive natural language interface for relational databases. Proc. VLDB Endow., 8:73-84.

Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. 2023a. RESDSQL: Decoupling schema linking and skeleton parsing for text-to-SQL. In Proceedings of the AAAI Conference on Artificial Intelligence.

Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Chenhao Ma, Kevin C. C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023b. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Preprint, arXiv:2305.03111.

Aiwei Liu, Xuming Hu, Lijie Wen, and Philip S. Yu. 2023. A comprehensive evaluation of ChatGPT's zero-shot text-to-SQL capability. arXiv, abs/2303.13547.

Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, and Dragomir Radev. 2023. Enhancing few-shot text-to-SQL capabilities of large language models: A study on prompt design strategies. CoRR, abs/2305.12586.

OpenAI. 2023a. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
OpenAI. 2023b. SQL translate. https://platform.openai.com/examples/default-sql-translate. Last accessed on 2023-07-24.

Chen Pan, Fan Zhou, Xuanwei Hu, Xinxin Zhu, Wenxin Ning, Zi Zhuang, Siqiao Xue, James Zhang, and Yunhua Hu. 2023. Deep optimal timing strategies for time series. In ICDM.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Mohammad Reza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. In Advances in Neural Information Processing Systems (NeurIPS).

Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and Zhouhan Lin. 2022. RASAT: Integrating relational structures into pretrained seq2seq model for text-to-SQL. arXiv, abs/2205.06983.

Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-SQL capabilities of large language models. arXiv, abs/2204.00498.

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv, abs/2109.05093.

Tianze Shi, Chen Zhao, Jordan L. Boyd-Graber, Hal Daumé, and Lillian Lee. 2020. On the potential of lexico-logical alignments for semantic parsing to SQL queries. arXiv, abs/2010.11246.

Ruoxi Sun, Sercan Ö. Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, and Tomas Pfister. 2023. SQL-PaLM: Improved large language model adaptation for text-to-SQL. arXiv, abs/2306.00739.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104-3112, Cambridge, MA, USA. MIT Press.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.

Immanuel Trummer. 2022. CodexDB: Synthesizing code for query processing from natural language instructions using GPT-3 Codex. PVLDB, 15(11):2921-2928.

Ping Wang, Tian Shi, and Chandan K. Reddy. 2019. Text-to-SQL generation for question answering on electronic medical records. Proceedings of The Web Conference 2020.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Huai hsin Chi, and Denny Zhou. 2022a. Self-consistency improves chain of thought reasoning in language models. arXiv, abs/2203.11171.

Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. 2022b. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139-149.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, F. Xia, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv, abs/2201.11903.

Siqiao Xue, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Danrui Qi, Hong Yi, Shaodong Liu, and Faqiang Chen. 2023a. DB-GPT: Empowering database interactions with private large language models. arXiv preprint arXiv:2312.17449.

Siqiao Xue, Danrui Qi, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Hong Yi, Shaodong Liu, Hongjun Yang, and Faqiang Chen. 2024a. Demonstration of DB-GPT: Next generation data interaction system empowered by large language models. Preprint, arXiv:2404.10209.

Siqiao Xue, Xiaoming Shi, Zhixuan Chu, Yan Wang, Hongyan Hao, Fan Zhou, Caigao Jiang, Chen Pan, James Y. Zhang, Qingsong Wen, Jun Zhou, and Hongyuan Mei. 2024b. EasyTPP: Towards open benchmarking temporal point processes. In International Conference on Learning Representations (ICLR).

Siqiao Xue, Fan Zhou, Yi Xu, Hongyu Zhao, Shuo Xie, Caigao Jiang, James Zhang, Jun Zhou, Dacheng Xiu, and Hongyuan Mei. 2023b. WeaverBird: Empowering financial decision-making with large language model, knowledge base, and search engine.
Yale. 2018. Spider 1.0, Yale semantic parsing and text-to-SQL challenge. https://yale-lily.github.io/spider.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir R. Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In North American Chapter of the Association for Computational Linguistics.

Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter Lasecki, and Dragomir Radev. 2019. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1962-1979, Hong Kong, China. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics.

Xiaojing Yu, Tianlong Chen, Zhengjie Yu, Huiyu Li, Yang Yang, Xiaoqian Jiang, and Anxiao Jiang. 2020. Dataset and enhanced model for eligibility criteria-to-SQL semantic parsing. In International Conference on Language Resources and Evaluation.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI, Vol. 2.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold Liu, Rui Zhao, Ziyue Li, and Hangyu Mao. 2024. Benchmarking the text-to-SQL capability of large language models: A comprehensive evaluation. Preprint, arXiv:2403.02951.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Huai hsin Chi. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv, abs/2205.10625.

Shiyi Zhu, Jing Ye, Wei Jiang, Siqiao Xue, Qi Zhang, Yifan Wu, and Jianguo Li. 2024. CoCA: Fusing position embedding with collinear constrained attention in transformers for long context window extending. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
Appendices

A Experimental Details

A.1 Dataset Details

Spider (Yu et al., 2018b). It consists of 10,181 questions and 5,693 unique complex SQL queries across 200 databases, covering 138 domains, each containing multiple tables. The standard protocol for this dataset divides it into 8,659 training examples across 146 databases, 1,034 development examples across 20 databases, and a holdout of 2,147 test examples across 34 databases. The databases used in each of these sets are non-overlapping. SQL queries are categorized into four difficulty levels, based on the number of SQL keywords used, the presence of nested subqueries, and the usage of column selections and aggregations.

BIRD (Li et al., 2023b). This dataset represents a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare and education. BIRD also introduces external knowledge as an additional resource to assist models in generating accurate SQL queries. Specifically, four sources of external knowledge were introduced: numeric reasoning knowledge, domain knowledge, synonym knowledge, and value illustration. Notably, the SQL queries in the BIRD dataset tend to be more intricate than those in the Spider dataset.

WikiSQL (Zhong et al., 2017). This dataset consists of a corpus of 80,654 natural statement expressions and SQL annotations of 24,241 tables. The queries in WikiSQL are limited to a single table and do not include complex operations such as sorting, grouping, subqueries, etc.

CoSQL (Yu et al., 2019). This dataset is a conversational version of the Spider task. CoSQL consists of 30,000 rounds and 10,000 annotated SQL queries from a Wizard-of-Oz collection of 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise inform.

Chase (Guo et al., 2021). This dataset is to date the largest Chinese dataset for the cross-database, context-dependent text-to-SQL problem. It consists of 5,459 question sequences (17,940 questions) over 280 databases. Each question in Chase has rich semantic annotations, including its SQL query, contextual dependency, and schema linking.

A.2 Metrics Details

We clarify the properties of the two metrics in detail.

Exact-set match accuracy (EM). EM treats each clause as a set and compares the prediction for each clause to its corresponding clause in the reference query. A predicted SQL query is considered correct only if all of its components match the ground truth. EM does not take values into account.

Execution accuracy (EX). EX compares the execution output of the predicted SQL query with that of the ground truth SQL query on some database instances. Execution accuracy provides a more precise estimate of the performance of the method, as there may be multiple valid SQL queries for a given question while EM only evaluates the predicted SQL against one of them.

A.3 Implementation Details

All models are implemented using the PyTorch framework (Paszke et al., 2017). For the 7B and 13B parameter scales, we adopt 1 Nvidia A100 Tensor Core GPU to run training. For the 70B parameter scale, we adopt 8 A100s to run training and inference.

Fine-tuning hyperparameter settings. The hyperparameters of the training are shown in Table 5.

Parameter            7B       13B      70B
GPUs                 1*A100   1*A100   8*A100
max source length    2048     2048     2048
max target length    512      512      512
fine-tuning type     lora     lora     lora
lora rank            64       64       64
lora alpha           32       32       32
lr                   0.0002   0.0002   0.0002
epoch                8        8        8
Table 5: Parameter settings of fine-tuning for different model scales.

A.4 Few Shot Prompting

Given the following database schema :

Table advisor, columns = [*,s_ID,i_ID]
Table classroom, columns = [*,building,room_number,capacity]
Table course, columns = [*,course_id,title,dept_name,credits]
Table department, columns = [*,dept_name,building,budget]
Table instructor, columns = [*,ID,name,dept_name,salary]
Table prereq, columns = [*,course_id,prereq_id]
Table section, columns = [*,course_id,sec_id,semester,year,building,room_number,time_slot_id]
Table student, columns = [*,ID,name,dept_name,tot_cred]
Table takes, columns = [*,ID,course_id,sec_id,semester,year,grade]
Table teaches, columns = [*,ID,course_id,sec_id,semester,year]
Table time_slot, columns = [*,time_slot_id,day,start_hr,start_min,end_hr,end_min]

Please write queries to answer the following questions:

Q: Find the title of courses that have two prerequisites.
Response: SELECT T1.title FROM course AS T1 JOIN prereq AS T2 ON T1.course_id = T2.course_id GROUP BY T2.course_id HAVING count(*) = 2.

Q: How many different instructors teach a course in the Spring of 2010?
Response: SELECT COUNT (DISTINCT ID) FROM teaches WHERE semester = 'Spring' AND YEAR = 2010.

Q: Find the name of the students and their department names sorted by their total credits in ascending order.
Response: SELECT name, dept_name FROM student ORDER BY tot_cred.

Q: Find the year which offers the largest number of courses.
Response: SELECT YEAR FROM SECTION GROUP BY YEAR ORDER BY count(*) DESC LIMIT 1.

Q: What are the names and average salaries for departments with average salary higher than 42000?
Response: SELECT dept_name , AVG (salary) FROM instructor GROUP BY dept_name HAVING AVG (salary) > 42000.

Q: Find the minimum salary for the departments whose average salary is above the average payment of all instructors.
Response: SELECT min(salary) , dept_name FROM instructor GROUP BY dept_name HAVING avg(salary) > (SELECT avg(salary) FROM instructor).

Q: What is the course title of the prerequisite of course Mobile Computing?
Response: SELECT title FROM course WHERE course_id IN (SELECT T1.prereq_id FROM prereq AS T1 JOIN course AS T2 ON T1.course_id = T2.course_id WHERE T2.title = 'Mobile Computing').
Listing 2: Full Examples of Text Representation Prompt on Spider Dataset.

Given the following database schema :

Table movies, columns = [*,movie_id,movie_title,movie_release_year,movie_url,movie_title_language,movie_popularity,movie_image_url,director_id,director_name,director_url]
Table ratings, columns = [*,movie_id,rating_id,rating_url,rating_score,rating_timestamp_utc,critic,critic_likes,critic_comments,user_id,user_trialist]
Table lists, columns = [*,user_id,list_id,list_title,list_movie_number,list_update_timestamp_utc,list_creation_timestamp_utc,list_followers,list_url,list_comments,list_description]
Table lists_users, columns = [*,user_id,list_id,list_update_date_utc,list_creation_date_utc,user_trialist,user_subscriber,user_avatar_image_url,user_cover_image_url,user_eligible_for_trial,user_has_payment_method]

Please write queries to answer the following questions:

Q: How many movies in the database were directed by Christopher Nolan?
Response: SELECT COUNT(*) FROM movies WHERE director_name = 'Christopher Nolan'.

Q: List all movies that have a popularity greater than 5000.
Response: SELECT movie_title FROM movies WHERE movie_popularity > 5000.

Q: Retrieve the URL of the most popular movie.
Response: SELECT movie_url FROM movies ORDER BY movie_popularity DESC LIMIT 1;

Q: Which user IDs have rated a movie on the 1st of January 2021.
Response: SELECT DISTINCT user_id FROM ratings WHERE rating_timestamp_utc BETWEEN '2021-01-01 00:00:00' AND '2021-01-01 23:59:59';

Q: What are the names of the directors for movies that have an image URL containing 'poster'?
Response: SELECT DISTINCT director_name FROM movies WHERE movie_image_url LIKE '%poster%'.

Q: Give me the IDs and release years of movies that have both a rating score higher than 4 and have been included in at least 10 lists created by users who had a payment method when they created the list.
Response: SELECT m.movie_id, m.movie_release_year FROM movies m JOIN ratings r ON m.movie_id = r.movie_id JOIN lists_users lu ON lu.user_id = ANY(SELECT user_id FROM lists WHERE list_id IN (SELECT list_id FROM lists WHERE movie_id = m.movie_id)) WHERE r.rating_score > 4 AND lu.user_has_payment_method = 1 GROUP BY m.movie_id, m.movie_release_year HAVING COUNT(DISTINCT lu.list_id) >= 10.

Q: Find the title of the most popular movie among those that have never received any critic comments.
Response: SELECT movie_title FROM movies JOIN ratings ON movies.movie_id = ratings.movie_id WHERE critic_comments = 0 ORDER BY movie_popularity DESC LIMIT 1;

Q: Find the names of movies from the year 2000 which have been added to at least 5 different lists and have an image URL available.
Response: SELECT DISTINCT m.movie_title FROM movies m JOIN lists l ON m.movie_id IN (SELECT movie_id FROM lists WHERE list_id = l.list_id) WHERE m.movie_release_year = 2000 AND m.movie_image_url IS NOT NULL GROUP BY m.movie_id HAVING COUNT(DISTINCT l.list_id) >= 5.

Q: Which user created the most number of lists while being a subscriber and having a profile cover image?
Response: SELECT user_id, COUNT(list_id) as num_lists FROM lists_users WHERE user_subscriber = 1 AND user_cover_image_url IS NOT NULL GROUP BY user_id ORDER BY num_lists DESC LIMIT 1.

Q: Provide the critic made by users who rated a movie more than 3 but less than 5 and got at least 10 likes on their review.
Response: SELECT critic FROM ratings WHERE rating_score BETWEEN 3 AND 5 AND critic_likes >= 10.

Q: How many lists were created by users who were subscribers and not trialists on January 1st, 2020?
Response: SELECT COUNT(DISTINCT list_id) FROM lists_users WHERE user_subscriber = 1 AND user_trialist = 0 AND list_creation_date_utc = '2020-01-01'.

Q: What are the titles of the lists which were created on '2022-05-15' and have more than 50 comments?
Response: SELECT list_title FROM lists WHERE list_creation_timestamp_utc = '2022-05-15' AND list_comments > 50.

Q: What is the name and URL of the movie that has the latest rating timestamp?
Response: SELECT movie_title, movie_url FROM movies WHERE movie_id = (SELECT movie_id FROM ratings ORDER BY rating_timestamp_utc DESC LIMIT 1).

Q: Which movie has the highest number of critic likes?
Response: SELECT movie_id FROM ratings ORDER BY critic_likes DESC LIMIT 1;
Q: Retrieve the list description and URL for lists created by trialists that have been updated since 2021 and contain movies directed by Christopher Nolan.
Response: SELECT l.list_description, l.list_url FROM lists l JOIN lists_users lu ON l.list_id = lu.list_id JOIN movies m ON m.movie_id IN (SELECT movie_id FROM lists WHERE list_id = l.list_id) WHERE lu.user_trialist = 1 AND l.list_update_timestamp_utc > '2021-01-01' AND m.director_name = 'Christopher Nolan'.

Q: List all the directors along with the average rating score for movies they directed that have over 1000 followers on Mubi lists.
Response: SELECT director_name, AVG(rating_score) AS avg_rating FROM movies JOIN ratings ON movies.movie_id = ratings.movie_id LEFT JOIN lists ON movies.movie_id = lists.list_movie_number GROUP BY director_name HAVING SUM(list_followers) > 1000.

Listing 3: Full Examples of Text Representation Prompt on BIRD Dataset.

• Integration of more model training techniques. In addition to pre-training, the community is also interested in continual learning techniques for language models, such as continual pre-training (Jiang et al., 2023), prompt learning (Wang et al., 2022b) or positional encoding techniques (Zhu et al., 2024). The integration of these methods will greatly facilitate the research community in these areas.
MODEL            EASY              MEDIUM            HARD              EXTRA             OVERALL
                 BASE  L/QL        BASE  L/QL        BASE  L/QL        BASE  L/QL        BASE  L/QL
LLaMA2-7B        0.000 0.827/0.810 0.000 0.614/0.574 0.000 0.408/0.443 0.000 0.307/0.295 0.000 0.581/0.564
LLaMA2-13B       0.000 0.867/0.835 0.000 0.670/0.670 0.000 0.483/0.517 0.000 0.386/0.349 0.000 0.640/0.632
LLaMA2-70B       0.327 0.847/−     0.112 0.679/−     0.075 0.454/−     0.018 0.382/−     0.142 0.635/−
CodeLLaMA-7B     0.174 0.883/0.871 0.127 0.736/0.721 0.063 0.523/0.553 0.012 0.309/0.291 0.121 0.643/0.628
CodeLLaMA-13B    0.617 0.910/0.910 0.545 0.727/0.688 0.377 0.624/0.556 0.224 0.365/0.382 0.487 0.706/0.682
CodeLLaMA-70B    0.688 0.928/−     0.582 0.723/−     0.400 0.655/−     0.278 0.366/−     0.527 0.713/−
Baichuan2-7B     0.326 0.832/0.815 0.104 0.588/0.621 0.025 0.402/0.454 0.000 0.225/0.286 0.119 0.579/0.602
Baichuan2-13B    0.363 0.839/0.827 0.141 0.632/0.650 0.040 0.483/0.460 0.000 0.325/0.313 0.155 0.607/0.606
Qwen-7B          0.365 0.802/0.778 0.101 0.643/0.608 0.063 0.517/0.471 0.024 0.331/0.313 0.161 0.610/0.578
Qwen-14B         0.758 0.867/0.851 0.318 0.713/0.735 0.172 0.529/0.506 0.066 0.398/0.367 0.359 0.623/0.668
Qwen-72B         0.754 0.903/−     0.316 0.726/−     0.241 0.523/−     0.102 0.386/−     0.374 0.680/−
ChatGLM3-6B      0.000 0.776/0.763 0.000 0.564/0.533 0.000 0.457/0.477 0.000 0.261/0.224 0.000 0.521/0.542
Table 6: Evaluations on Spider: EM of base models vs. fine-tuned models on each complexity split and on the overall dataset. "L" and "QL" denote the LoRA and QLoRA tuning methods, respectively.
Table 7: Evaluations on BIRD: EM of base models vs. fine-tuned models on each complexity split and on the overall dataset. "L" and "QL" denote the LoRA and QLoRA tuning methods, respectively.
MODEL            SPIDER              BIRD
                 LoRA     QLoRA      LoRA     QLoRA
LLaMA2-7B        ↑0.626   ↑0.608     ↑0.169   ↑0.168
LLaMA2-13B       ↑0.680   ↑0.664     ↑0.167   ↑0.163
LLaMA2-70B       ↑0.687   −          ↑0.186   −
CodeLLaMA-7B     ↑0.453   ↑0.447     ↑0.228   ↑0.214
CodeLLaMA-13B    ↑0.217   ↑0.198     ↑0.204   ↑0.204
CodeLLaMA-70B    ↑0.204   −          ↑0.179   −
Baichuan2-7B     ↑0.268   ↑0.289     ↑0.133   ↑0.123
Baichuan2-13B    ↑0.286   ↑0.267     ↑0.141   ↑0.101
Qwen-7B          ↑0.417   ↑0.427     ↑0.148   ↑0.133
Qwen-14B         ↑0.090   ↑0.128     ↑0.075   ↑0.068
Qwen-72B         ↑0.112   −          ↑0.019   −
ChatGLM3-6B      ↑0.590   ↑0.581     ↑0.156   ↑0.128