ing results, potentially slowing down advancement in this field. As a first step towards addressing these challenges, in this work we present a holistic framework, namely DB-GPT-Hub. Apart from existing works that mostly focus on few-shot prompting strategies or on tuning relatively smaller LLMs, our work focuses on tuning larger LLMs. In all, DB-GPT-Hub consolidates essential research assets (e.g., data, model services, evaluation methods, documentation) with the following distinct merits:

• Standardization. We establish a standardized pipeline in an open-source codebase, with unified experimental settings and containerized environments, to enable transparent and consistent comparisons of LLMs after tuning on text-to-SQL tasks.

• Comprehensiveness. We conduct extensive benchmarking that covers a range of medium to large-sized, fine-tuned LLMs across various experimental scenarios and explore their relative performance compared to prompting methods. Our work comprises one of the most pragmatic and expansive sets of benchmark suites available.

• Extensibility. As a rapidly evolving field, novel LLM-based methods constantly emerge and the best practice continuously evolves. Following our documentation and protocols, one can effortlessly incorporate novel ingredients into our codebase: new datasets, new modules, new models (or model services), and new evaluation programs. Moreover, our framework offers easy compatibility with various prompting techniques. The high extensibility will eventually benefit the research area of text-to-SQL.

2 Background and Problem Formulation

2.1 A Generalized Setup

The input of the text-to-SQL task is a natural language question q and the database information D. The output is the SQL s corresponding to the question. The database D = {S, K_p, K_f} includes the database schema S, primary keys K_p and foreign keys K_f, where S usually contains multiple tables T_k: S = {T_1, T_2, ..., T_s, ...}. Each table T_k has a table name N_k, column names c_j and column data types t_j. Therefore, T_k = {N_k, (c_k1, t_k1), (c_k2, t_k2), ...}. Considering that the queries may come from various database domains, we formulate the data into a set of triples M = {(q_i, s_i, D_i)}, with i denoting the index of the query, the output and the source database.
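To make this setup concrete, the following minimal sketch spells out these objects as plain data structures. The class and field names are illustrative only and are not part of DB-GPT-Hub's actual interfaces.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Table:
    # T_k = {N_k, (c_k1, t_k1), (c_k2, t_k2), ...}
    name: str                              # table name N_k
    columns: List[Tuple[str, str]]         # (column name c_j, column data type t_j)

@dataclass
class Database:
    # D = {S, K_p, K_f}
    schema: List[Table]                    # S = {T_1, T_2, ...}
    primary_keys: List[str]                # K_p, e.g. "singer.singer_id"
    foreign_keys: List[Tuple[str, str]]    # K_f, pairs of linked columns

@dataclass
class Example:
    # one triple (q_i, s_i, D_i) from the set M
    question: str         # q_i
    sql: str              # s_i
    database: Database    # D_i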
2.2 Prompt-based and Fine-tuning Settings

Based on how LLMs are used for text-to-SQL generation, the problem settings can be categorized into two scenarios: zero-shot/few-shot prompting and fine-tuning.

Zero-shot / Few-shot Prompting. In zero-shot scenarios no exemplar is provided, while in few-shot scenarios a few input-output exemplars are provided to prompt LLMs. Formally, given a pretrained LLM parameterized by θ, the question q_i, and k exemplars (k ≥ 0), the objective is to maximize the probability of generating the correct SQL s_i from the LLM:

    max_{s_i} P_{LLM_θ}(s_i | σ(q_i, M)),   |M| = k        (1)

where σ(q_i, M) denotes a representation of the target question q_i that incorporates relevant information from the exemplars.¹

¹σ(q_i, M) technically denotes the information set generated by q_i and M.
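A minimal sketch of how such a representation σ(q_i, M) can be serialized into a prompt is given below. It mirrors the text representation of Listing 1 and the examples in Appendix A.4, but the function names and the exact exemplar formatting are illustrative assumptions rather than DB-GPT-Hub's exact implementation.

from typing import List, Sequence, Tuple

def serialize_schema(tables: List[Tuple[str, List[str]]]) -> str:
    # Render each table as "Table <name>, columns = [*, c1, c2, ...]".
    return "\n".join(
        f"Table {name}, columns = [*,{','.join(cols)}]" for name, cols in tables
    )

def build_prompt(question: str,
                 tables: List[Tuple[str, List[str]]],
                 exemplars: Sequence[Tuple[str, str]] = ()) -> str:
    # sigma(q_i, M): schema description + k (question, SQL) exemplars + target question.
    # An empty exemplar list gives the zero-shot prompt; k >= 1 gives the few-shot prompt.
    parts = [
        "I want you to act as a SQL terminal in front of a database and below is a "
        "description of the database schema. Write a response that appropriately "
        "completes the request.",
        serialize_schema(tables),
    ]
    for idx, (q, sql) in enumerate(exemplars, start=1):
        parts.append(f"/* example {idx} */\nQ: {q}\nResponse: {sql}")
    parts.append(f"Q: {question}\nResponse:")
    return "\n\n".join(parts)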
Fine-tuning. The fine-tuning process involves adapting the pretrained LLM_θ to generate SQL from the input sequences by tuning the model with text-to-SQL datasets, which contain a collection of pairs of serialized inputs q_i and corresponding SQL outputs s_i. The objective of fine-tuning is to minimize the empirical loss:

    min_θ L(ŝ_i(LLM_θ), s_i | σ(q_i))        (2)

where L is the loss function measuring the difference between the generated SQL and the ground truth.
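In practice, L is typically the token-level cross-entropy over the SQL tokens, with the serialized prompt σ(q_i) fed to the model but excluded from the loss. The sketch below shows this common instantiation; it is an assumption for illustration, not the exact loss code used in DB-GPT-Hub.

import torch
import torch.nn.functional as F

def sql_only_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    # logits: (batch, seq_len, vocab) from a causal LM; input_ids: (batch, seq_len).
    # Shift so that position t predicts token t + 1 (standard causal LM training).
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask the prompt sigma(q_i); only the SQL tokens s_i contribute to the loss.
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )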
Despite the significant advances achieved with few-shot prompting of LLMs, it remains a formidable challenge for a pretrained LLM to rely solely on its parametric knowledge and prompting to accurately process highly complex SQL queries.

Parameter-Efficient Fine-tuning. Medium to large-sized models with billions of parameters are prohibitively expensive to fine-tune in order to adapt them to particular tasks or domains. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by fine-tuning only a small number of (extra) model parameters instead of all of the model's parameters. Two of the most commonly used techniques are LoRA (Hu et al., 2021), which proposes to freeze the pretrained
model weights and inject trainable rank-decomposition matrices into each transformer block, and its quantized version QLoRA (Dettmers et al., 2023). Throughout the benchmark, we use these two strategies consistently to tune the LLMs. See Section 3 and Section 4 for details of the tuning benchmark design and experimental results.

Listing 1 below shows the prompt template used to guide LLMs; see listing 2 and listing 3 in Appendix A.4 for full examples.

I want you to act as a SQL terminal in front of a database and below is a description of the database schema. Write a response that appropriately completes the request.

<schema>

/* example 1 */
/* example 2 */
... (optional)

<query>
<response>

Listing 1: The text representation prompt template.
[Figure 1: Pipeline and architecture of the DB-GPT-Hub codebase, with four stages: Dataset Construction (Instruction Builder and Example Selector over open datasets), Trainer (fine-tuning of open-source LLMs with the PEFT strategies LoRA / QLoRA, Deepspeed optimization and adaptive parallelism), Predictor (SQL generation with fine-tuned open-source LLMs or closed-source LLMs, optionally with in-context learning prompting), and Evaluation (metrics suite with EX / EM / VES and a downloadable report).]
3 Tuning Benchmark Design

We benchmark open-source LLMs from several widely used model families, including ChatGLM, Qwen (Bai et al., 2023), Baichuan (Baichuan, 2023) and Llama (Touvron et al., 2023):

• ChatGLM3-6B, the up-to-date open version of ChatGLM, an open bilingual language model based on the GLM framework.

• Qwen-7B/14B/72B-Chat, a series of aligned models of Qwen.

• Baichuan2-7B/13B-Chat, the up-to-date collection of aligned models of Baichuan.

• LLaMA2-7B/13B/70B-Chat³, the up-to-date aligned version of LLaMA.

• CodeLLaMA-7B/13B/70B-Instruct, aligned versions of LLaMA-2 further tuned with code data.

³Due to the page limitation, we omit the suffix "-Chat" from the names of LLMs in the tables throughout the following sections. For instance, "Qwen-7B" should be read as the "Qwen-7B-Chat" model.

To ensure a fair comparison, we use the same maximal context length of 2048 for all the LLMs. During evaluation, we leave 512 tokens for response generation. We set the temperature argument to 0 to eliminate the influence of randomness.

Tuning Methods. As the scale of the dataset is notably smaller than that of the LLMs, we apply the PEFT strategies, LoRA and QLoRA, to tune the LLMs. For medium-sized models (7B/13B), we adopt 1 Nvidia A100 Tensor Core GPU to run training and inference. For large-sized models (70B), we adopt 8 A100s.
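For reference, a minimal LoRA/QLoRA setup along these lines can be written with the Hugging Face transformers and peft libraries as sketched below. The rank and alpha values follow Table 5 in Appendix A, but the model name, dropout and target modules are illustrative assumptions; this is not the exact training code of DB-GPT-Hub.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"  # any of the benchmarked chat models

# QLoRA: load the frozen base model in 4-bit to reduce GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,   # drop this argument for plain 16-bit LoRA
    device_map="auto",
)

# LoRA: inject trainable rank-decomposition matrices into the attention projections.
lora_config = LoraConfig(
    r=64,                                  # LoRA rank (Table 5)
    lora_alpha=32,                         # LoRA alpha (Table 5)
    lora_dropout=0.05,                     # illustrative; not specified in the paper
    target_modules=["q_proj", "v_proj"],   # assumption; varies per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable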
Benchmark Pipeline. Figure 1 presents the open benchmarking pipeline implemented in DB-GPT-Hub. This pipeline will facilitate future research in this area and help promote reproducible work.

3.2 Codebase

To facilitate innovation in the community, DB-GPT-Hub contains a well-modularized, easy-to-extend codebase for standardization of the implementation, evaluation, and ablation of text-to-SQL methods.

Software Architecture. Figure 1 presents the pipeline and architecture of our codebase. The pipeline is decomposed into the following parts:

• Dataset Construction. Raw text-to-SQL data is processed into a suitable format (e.g., the TRF shown in listing 1) to tune LLMs; a sketch of such a record is shown after this list. This includes integrating the schema and database description into a prompt as an instruction, along with various question representations to boost performance during training and evaluation. Additionally, we select different few-shot strategies, such as example selection and organization, to construct the evaluation dataset (Gao et al., 2023).

• Training. Our codebase supports the fine-tuning of open-source LLMs with PEFT strategies. We support most public architectures with small to large-sized model scales, such as Qwen, Llama, Baichuan, and ChatGLM.

• Prediction. Our codebase supports SQL query inference for open-source LLMs (and their fine-tuned versions) as well as closed-source LLMs. We support few-shot and zero-shot methods to generate SQL for specific scenarios.

• Evaluation. Our repository holds different metrics (EX, EM, and the valid efficiency score (VES)) to evaluate the performance of the generated SQL from different perspectives; a sketch of the EX computation is given at the end of this subsection.
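Below is a sketch of one record after Dataset Construction, in an instruction/input/output style that a Trainer could consume. The field names and the singer/concert schema are illustrative assumptions; the exact format of the DB-GPT-Hub data files may differ.

# One illustrative training record produced by Dataset Construction.
record = {
    "instruction": (
        "I want you to act as a SQL terminal in front of a database and below is a "
        "description of the database schema. Write a response that appropriately "
        "completes the request.\n"
        "Table singer, columns = [*,singer_id,name,country,age]\n"
        "Table concert, columns = [*,concert_id,concert_name,year]"
    ),
    "input": "Q: How many singers do we have?",
    "output": "SELECT count(*) FROM singer",
}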
Implementations. The codebase is built with the PyTorch framework (Paszke et al., 2017), upon the open-source project DB-GPT (Xue et al., 2023a, 2024a). We release the code under the Apache License 2.0 and we are committed to actively maintaining the repository.
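As a concrete reference for the Evaluation component, the sketch below computes EX for a single example by executing both queries on a SQLite database and comparing the result multisets. It is an illustrative simplification of the official Spider/BIRD evaluation scripts, which additionally handle ordering requirements, multiple gold values, VES and timeouts; EM is instead computed by clause-set comparison without executing anything.

import sqlite3
from collections import Counter

def execution_accuracy(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    # EX for one example: correct when the prediction's execution result
    # matches that of the gold SQL on the given database.
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False          # un-executable predictions count as wrong
    finally:
        conn.close()
    # Compare as multisets so that row order does not matter.
    return Counter(pred_rows) == Counter(gold_rows)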
4 Experiments

In this section, with the utility of DB-GPT-Hub, we formally evaluate the text-to-SQL process to determine the performance differences among various LLMs and to explore the effect of training paradigms on the tuning performance of LLMs.

4.1 Main Results

Table 1 and Table 2 show the evaluation results, measured by EX, on the Spider and BIRD datasets, respectively. The results in EM on both datasets can be found in Table 6 and Table 7 in Appendix B.

Best Models. Unsurprisingly, the tuned CodeLlama families, whose base models have been optimized for code generation and infilling, show consistently better performance than the other competitors on both datasets. Specifically, we obtain the following key insight:

• As shown in the right-most columns of Table 1 and Table 2, the fine-tuned, small-sized CodeLlama (e.g., CodeLlama-7B-LoRA) exhibits comparable, and in some cases even superior, performance to other tuned medium to large-sized open LLMs, such as Qwen-14B/72B-LoRA.

[Figure 2 (bar chart; y-axis: Improvement on EX; models: Llama-7B, CodeLlama-7B, Qwen-7B; groups: easy, medium, hard, extra, overall)]
Figure 2: The improvement from tuning with the LoRA strategy across subgroups of different complexities.

LoRA vs QLoRA. We summarize the EX, EM, time cost, and GPU memory metrics in Table 3. Firstly, not surprisingly, we see limited differences in generation performance, measured by EX and EM, between models tuned with LoRA and QLoRA. Secondly, consistent with the quantization mechanism, QLoRA takes more time to converge while using less GPU memory. For example, compared to Qwen-14B-LoRA, its QLoRA counterpart takes more than 2× the time with only about 50% of the GPU memory.

To conclude, in circumstances with restricted computational resources, QLoRA is an efficient tuning alternative that can save memory without sacrificing performance.
MODEL            EASY              MEDIUM            HARD              EXTRA             OVERALL
                 BASE  L/QL        BASE  L/QL        BASE  L/QL        BASE  L/QL        BASE  L/QL
LLaMA2-7B        0.000 0.887/0.847 0.000 0.641/0.623 0.000 0.489/0.466 0.000 0.331/0.361 0.000 0.626/0.608
LLaMA2-13B       0.000 0.907/0.911 0.000 0.729/0.700 0.000 0.552/0.552 0.000 0.343/0.319 0.000 0.680/0.664
LLaMA2-70B       0.411 0.915/−     0.229 0.732/−     0.190 0.560/−     0.072 0.392/−     0.241 0.687/−
CodeLLaMA-7B     0.214 0.923/0.911 0.177 0.756/0.751 0.092 0.586/0.598 0.036 0.349/0.331 0.149 0.702/0.696
CodeLLaMA-13B    0.698 0.940/0.940 0.600 0.789/0.744 0.408 0.684/0.626 0.271 0.404/0.392 0.529 0.746/0.727
CodeLLaMA-70B    0.722 0.962/−     0.625 0.812/−     0.443 0.716/−     0.302 0.432/−     0.567 0.771/−
Baichuan2-7B     0.577 0.871/0.891 0.352 0.630/0.637 0.201 0.448/0.489 0.066 0.295/0.331 0.335 0.603/0.624
Baichuan2-13B    0.581 0.903/0.895 0.413 0.702/0.675 0.264 0.569/0.580 0.187 0.392/0.343 0.392 0.678/0.659
Qwen-7B          0.395 0.855/0.911 0.256 0.688/0.675 0.138 0.575/0.575 0.042 0.331/0.343 0.235 0.652/0.662
Qwen-14B         0.871 0.895/0.919 0.632 0.702/0.744 0.368 0.552/0.598 0.181 0.367/0.458 0.573 0.663/0.701
Qwen-72B         0.831 0.927/−     0.635 0.756/−     0.489 0.621/−     0.277 0.367/−     0.600 0.712/−
ChatGLM3-6B      0.000 0.855/0.843 0.000 0.605/0.603 0.000 0.477/0.506 0.000 0.271/0.211 0.000 0.590/0.581
Table 1: Evaluations on Spider: EX of base models vs. fine-tuned models on each complexity split and on the overall dataset. "L" and "QL" denote the LoRA and QLoRA tuning methods, respectively.
Table 2: Evaluations on BIRD: EX of base models vs. fine-tuned models on each complexity split and on the overall dataset. "L" and "QL" denote the LoRA and QLoRA tuning methods, respectively.
4.2 Analysis I: Few-shot Evaluation

For strong base models, the margin brought by tuning is small in few-shot scenarios. For example, the EX of Qwen-72B-LoRA vs. Qwen-72B is 68.5 vs. 64.8 on 3-shot and 68.4 vs. 65.0 on 5-shot. This is more clearly observed from a different perspective in Figure 3, where the curves for Qwen-14B/72B are flat at low levels. This is possibly because Qwen-72B already has strong SQL reasoning capabilities, which has barely been discussed in other text-to-SQL benchmarking works.

In all, fine-tuned models exhibit superior SQL reasoning abilities compared to non-tuned models in few-shot generation scenarios; however, the margin of improvement is relatively small for robust models like Qwen-72B.

Effect of Model Size. From Figure 4, we interpret the few-shot performance w.r.t. the model size for four models (two base models and two tuned models) and observe that:

• Larger models consistently achieve better results in few-shot scenarios compared to their smaller-sized counterparts.

• For a given few-shot scenario, the performance margin of the tuning method over the prompting method becomes smaller as the size of the LLMs grows. For example, in the 1-shot scenario, the performance improvement in EX of Qwen-LoRA over Qwen is 31.0, 24.1 and 3.5 for 7B, 14B and 72B, respectively.

Recall that the exact figures of the few-shot evaluations can be found in Table 8 in Appendix B. Overall, tuning methods continue to outperform prompting methods, while the performance gap narrows as the size of the LLMs increases.
MODEL            EX             EM             TIME COST (HOURS)   GPU MEMORY (GB)
                 LoRA   QLoRA   LoRA   QLoRA   LoRA   QLoRA        LoRA   QLoRA
LLaMA2-7B        0.626  0.608   0.581  0.564   4.12   5.74         23.5   16.9
LLaMA2-13B       0.680  0.664   0.640  0.632   7.26   8.82         34.8   29.6
CodeLLaMA-7B     0.702  0.696   0.668  0.665   4.33   6.74         23.8   16.7
CodeLLaMA-13B    0.746  0.727   0.701  0.682   7.26   8.82         34.8   29.6
Baichuan2-7B     0.603  0.624   0.588  0.602   3.33   7.52         20.9   11.5
Baichuan2-13B    0.678  0.659   0.607  0.606   8.12   15.3         34.4   17.5
Qwen-7B          0.652  0.662   0.610  0.621   2.57   6.45         28.9   17.1
Qwen-14B         0.663  0.701   0.658  0.665   4.23   11.32        38.4   18.1
Table 3: The comparison between LoRA and QLoRA on Spider across different perspectives: EX and EM are the performance metrics; the training time and maximum GPU memory cost are the resource metrics.
[Figure 3 (three panels: (a) 7B, (b) 13B, (c) 70B / 72B; y-axis: Improvement on EX; x-axis: k-shot, k = 0 ... 5)]
Figure 3: Few-shot evaluations on Spider: EX improvement in few-shot scenarios over zero-shot. EX(k-shot) represents the EX of the target (untuned/tuned) model under the k-shot scenario minus the EX of the base model in the zero-shot scenario, i.e., in (a), Improvement on EX(Qwen-LoRA, 3-shot) = EX(Qwen-LoRA, 3-shot) − EX(Qwen, 0-shot).
4.3 Analysis II: Fine-tuning with More Exemplars

In this subsection, we explore the possibility of enhancing the performance of LLMs by adding more contextual examples during fine-tuning.

Setup. We use Qwen-7B as the base model and construct three additional few-shot (1/3/5-shot) training sets to fine-tune the model. Specifically, the 1/3/5-shot training sets consist of query-response pairs with an additional 1/3/5 exemplars, as sketched below. For a given model, we also evaluate its few-shot performance, as in Section 4.2.
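A minimal sketch of how such k-shot and random-shot training corpora can be assembled is given below. The field names and the sampling details are illustrative assumptions rather than the exact construction code.

import random
from typing import Dict, List, Union

def make_kshot_corpus(pairs: List[Dict], k: Union[int, List[int]], seed: int = 0) -> List[Dict]:
    # pairs: training examples with "schema_prompt", "question" and "sql" fields (illustrative names).
    # k: a fixed number of exemplars (1/3/5-shot) or a list of choices (random-shot, e.g. [0, 1, 3, 5]).
    rng = random.Random(seed)
    corpus = []
    for i, ex in enumerate(pairs):
        n = rng.choice(k) if isinstance(k, list) else k
        pool = pairs[:i] + pairs[i + 1:]          # never use the target pair as its own exemplar
        shots = rng.sample(pool, n) if n > 0 else []
        demos = "\n".join(f"Q: {s['question']}\nResponse: {s['sql']}" for s in shots)
        corpus.append({
            "instruction": ex["schema_prompt"],
            "input": (demos + "\n" if demos else "") + f"Q: {ex['question']}",
            "output": ex["sql"],
        })
    return corpus

# e.g. three_shot = make_kshot_corpus(train_pairs, 3); random_shot = make_kshot_corpus(train_pairs, [0, 1, 3, 5])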
Core Insights. As shown in Table 4, we draw the following insights:

• In the zero-shot evaluation scenario, tuning with additional exemplars does not yield a significant improvement in performance (see the "0-shot" column). This is possibly because the training corpus (more examples) mismatches the evaluation setting (no examples).

• In the 1/3/5-shot evaluation scenarios, adding more contextual examples contributes to a notable improvement over the counterpart tuned with the 0-shot training corpus. This means that the performance loss on few-shot evaluation for zero-shot training is caused by the prompt mismatch between the training and evaluation datasets.

• The random-shot strategy, which refers to randomly adding 0/1/3/5 examples to the training corpus, achieves the highest EM scores. This finding is consistent with the one proposed by Sun et al. (2023): a diverse training corpus benefits the fine-tuning of LLMs.

5 Related Work

5.1 LLM-empowered Text-to-SQL Methods

Driven by the considerable success of LLMs, the field of LLM-empowered text-to-SQL has recently captured the interest of a large number of researchers in both the natural language processing and database communities. Models for LLM-based text-to-SQL can be categorized into supervised fine-tuning based and prompting based methods. Popular fine-tuned text-to-SQL models are SQL-PaLM (Sun et al., 2023), PICARD (Scholak et al., 2021) and RESDSQL (Li et al., 2023a). In contrast to supervised fine-tuned models, prompting-based models do not require additional fine-tuning on task-specific training data. Instead, they rely solely on the zero-shot and few-shot (Rajkumar et al., 2022; Liu et al., 2023) capabilities inherent in LLMs. Within the prompting paradigm, the pivotal factor for query representation lies in the design of the prompt (Wei et al., 2022; Zhou et al., 2022; Wang et al., 2022a).
[Figure 4 (four panels: (a) Llama2, (b) Llama2-LoRA, (c) Qwen, (d) Qwen-LoRA; y-axis: EX; x-axis: model size, 7B/13B/70B for Llama2 and 7B/14B/72B for Qwen; curves: 0-shot, 1-shot, 3-shot, 5-shot)]
Figure 4: Few-shot evaluations on Spider: the EX performance of Llama2 / Qwen and their tuned counterparts with varying model size.
MODEL                         0-SHOT        1-SHOT        3-SHOT        5-SHOT
                              EM    EX      EM    EX      EM    EX      EM    EX
Qwen-7B                       16.1  22.9    27.4  34.0    27.6  33.9    25.9  33.8
Qwen-7B-LoRA (0-shot)         61.0  65.3    58.4  61.8    57.8  62.0    57.7  61.4
Qwen-7B-LoRA (1-shot)         61.2  64.0    61.7  64.8    60.8  63.8    61.8  64.8
Qwen-7B-LoRA (3-shot)         61.0  62.8    62.0  62.8    60.7  62.1    60.7  62.9
Qwen-7B-LoRA (5-shot)         60.4  62.7    62.0  64.0    61.5  63.2    60.9  63.5
Qwen-7B-LoRA (random-shot)    61.5  63.0    62.1  64.0    62.2  63.6    61.9  63.6
Table 4: Few-shot evaluations on Spider: EM and EX of fine-tuned models with different numbers of examples in the training corpus.
In particular, DIN-SQL (Pourreza and Rafiei, 2023) introduces adaptive prompt strategies via task decomposition to effectively address challenges associated with schema linking. DAIL-SQL (Gao et al., 2023) proposes a refined prompt selection and organization strategy to improve performance. In DB-GPT-Hub, we offer scripts to support researchers in fine-tuning LLMs in accordance with the methodologies established in SQL-PaLM. In addition, we also integrate the popular prompt techniques used in DAIL-SQL.

5.2 Text-to-SQL Benchmarks

A pivotal factor in the progression of text-to-SQL is the establishment of high-quality benchmarks. Early benchmarks focus on single databases, including ATIS (Dahl et al., 1994), GeoQuery (Zelle and Mooney, 1996), Academic (Li and Jagadish, 2014), Advising (Finegan-Dollak et al., 2018), and more recent additions such as SEDE (Hazoom et al., 2021) and MIMICSQL (Wang et al., 2019). These benchmarks and datasets are often adapted from real-life applications, with many containing domain-specific knowledge that may not generalize effectively to unseen SQL domains. Hence, large-scale cross-domain datasets featuring professional SQL queries, such as Squall (Shi et al., 2020), Spider (Yu et al., 2018b), Spider-Syn (Gan et al., 2021), WikiSQL (Zhong et al., 2017), and SparC (Yu et al., 2020), have been introduced to facilitate comprehensive method analyses.

In retrospect, we note two concurrent works (Gao et al., 2023; Zhang et al., 2024) which perform systematic benchmarking of text-to-SQL methods. Important distinctions of their work from ours include: 1. comprehensiveness of benchmark settings: we evaluate both ICL and medium to large-sized fine-tuning methods in an end-to-end manner, while Gao et al. (2023) focus on ICL methods and Zhang et al. (2024) assess various subtasks of the text-to-SQL process; 2. open source of the codebase: we release a well-maintained open repository on GitHub containing all code and data assets, which, to the best of our knowledge, is one of the most popular text-to-SQL benchmark repositories (over 1k stars so far), while neither of them has achieved this.

6 Conclusion

In this study, we conduct a systematic benchmarking of various LLMs within the text-to-SQL pipeline. Our benchmarking provides a meticulous perspective on the pipeline, equipping the research community with strategies to improve the semantic understanding of LLMs.

7 Limitations

The large computational resources required for LLM training might not be accessible to all researchers and practitioners, which may limit the reproducibility of our findings.
References

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS).

Deborah A. Dahl, Madeleine Bates, Michael Brown, William M. Fisher, Kate Hunicke-Smith, David S. Pallett, Christine Pao, Alexander I. Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Human Language Technology - The Baltic Perspective.

Naihao Deng, Yulong Chen, and Yue Zhang. 2022. Recent advances in text-to-SQL: A survey of what we have and what we expect. arXiv, abs/2208.10099.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.

Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Lu Chen, Jinshu Lin, and Dongfang Lou. 2023. C3: Zero-shot text-to-SQL with ChatGPT. CoRR, abs/2307.07306.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir R. Radev. 2018. Improving text-to-SQL evaluation methodology. arXiv, abs/1806.09029.

Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John Robert Woodward, Jinxia Xie, and Pengsheng Huang. 2021. Towards robustness of text-to-SQL models against synonym substitution. In Annual Meeting of the Association for Computational Linguistics.

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-SQL empowered by large language models: A benchmark evaluation.

Jiaqi Guo, Ziliang Si, Yu Wang, Qian Liu, Ming Fan, Jian-Guang Lou, Zijiang Yang, and Ting Liu. 2021. Chase: A large-scale and pragmatic Chinese dataset for cross-database context-dependent text-to-SQL. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 2316-2331, Online. Association for Computational Linguistics.

Moshe Hazoom, Vibhor Malik, and Ben Bogin. 2021. Text-to-SQL in the wild: A naturally-occurring dataset based on Stack Exchange data. arXiv, abs/2106.05006.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Gangwei Jiang, Caigao Jiang, Siqiao Xue, James Y. Zhang, Jun Zhou, Defu Lian, and Ying Wei. 2023. Towards anytime fine-tuning: Continually pre-trained language models with hypernetwork prompt. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, Shirui Pan, Vincent S. Tseng, Yu Zheng, Lei Chen, and Hui Xiong. 2023. Large models for time series and spatio-temporal data: A survey and outlook. Preprint, arXiv:2310.10196.

Fei Li and H. V. Jagadish. 2014. Constructing an interactive natural language interface for relational databases. Proc. VLDB Endow., 8:73-84.

Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. 2023a. RESDSQL: Decoupling schema linking and skeleton parsing for text-to-SQL. In Proceedings of the AAAI Conference on Artificial Intelligence.

Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Chenhao Ma, Kevin C. C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023b. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Preprint, arXiv:2305.03111.

Aiwei Liu, Xuming Hu, Lijie Wen, and Philip S. Yu. 2023. A comprehensive evaluation of ChatGPT's zero-shot text-to-SQL capability. arXiv, abs/2303.13547.

Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, and Dragomir Radev. 2023. Enhancing few-shot text-to-SQL capabilities of large language models: A study on prompt design strategies. CoRR, abs/2305.12586.

OpenAI. 2023a. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
OpenAI. 2023b. SQL translate. https://platform.openai.com/examples/default-sql-translate. Last accessed on 2023-07-24.

Chen Pan, Fan Zhou, Xuanwei Hu, Xinxin Zhu, Wenxin Ning, Zi Zhuang, Siqiao Xue, James Zhang, and Yunhua Hu. 2023. Deep optimal timing strategies for time series. In ICDM.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Mohammad Reza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. In Advances in Neural Information Processing Systems (NeurIPS).

Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and Zhouhan Lin. 2022. RASAT: Integrating relational structures into pretrained seq2seq model for text-to-SQL. arXiv, abs/2205.06983.

Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-SQL capabilities of large language models. arXiv, abs/2204.00498.

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv, abs/2109.05093.

Tianze Shi, Chen Zhao, Jordan L. Boyd-Graber, Hal Daumé, and Lillian Lee. 2020. On the potential of lexico-logical alignments for semantic parsing to SQL queries. arXiv, abs/2010.11246.

Ruoxi Sun, Sercan Ö. Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, and Tomas Pfister. 2023. SQL-PaLM: Improved large language model adaptation for text-to-SQL. arXiv, abs/2306.00739.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104-3112, Cambridge, MA, USA. MIT Press.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.

Immanuel Trummer. 2022. CodexDB: Synthesizing code for query processing from natural language instructions using GPT-3 Codex. PVLDB, 15(11):2921-2928.

Ping Wang, Tian Shi, and Chandan K. Reddy. 2019. Text-to-SQL generation for question answering on electronic medical records. Proceedings of The Web Conference 2020.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Huai hsin Chi, and Denny Zhou. 2022a. Self-consistency improves chain of thought reasoning in language models. arXiv, abs/2203.11171.

Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. 2022b. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139-149.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, F. Xia, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv, abs/2201.11903.

Siqiao Xue, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Danrui Qi, Hong Yi, Shaodong Liu, and Faqiang Chen. 2023a. DB-GPT: Empowering database interactions with private large language models. arXiv preprint arXiv:2312.17449.

Siqiao Xue, Danrui Qi, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Hong Yi, Shaodong Liu, Hongjun Yang, and Faqiang Chen. 2024a. Demonstration of DB-GPT: Next generation data interaction system empowered by large language models. Preprint, arXiv:2404.10209.

Siqiao Xue, Xiaoming Shi, Zhixuan Chu, Yan Wang, Hongyan Hao, Fan Zhou, Caigao Jiang, Chen Pan, James Y. Zhang, Qingsong Wen, Jun Zhou, and Hongyuan Mei. 2024b. EasyTPP: Towards open benchmarking temporal point processes. In International Conference on Learning Representations (ICLR).

Siqiao Xue, Fan Zhou, Yi Xu, Hongyu Zhao, Shuo Xie, Caigao Jiang, James Zhang, Jun Zhou, Dacheng Xiu, and Hongyuan Mei. 2023b. WeaverBird: Empowering financial decision-making with large language model, knowledge base, and search engine.
Yale. 2018. Spider 1.0, Yale semantic parsing and text-to-SQL challenge. https://yale-lily.github.io/spider.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir R. Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In North American Chapter of the Association for Computational Linguistics.

Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter Lasecki, and Dragomir Radev. 2019. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1962-1979, Hong Kong, China. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics.

Xiaojing Yu, Tianlong Chen, Zhengjie Yu, Huiyu Li, Yang Yang, Xiaoqian Jiang, and Anxiao Jiang. 2020. Dataset and enhanced model for eligibility criteria-to-SQL semantic parsing. In International Conference on Language Resources and Evaluation.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI, Vol. 2.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold Liu, Rui Zhao, Ziyue Li, and Hangyu Mao. 2024. Benchmarking the text-to-SQL capability of large language models: A comprehensive evaluation. Preprint, arXiv:2403.02951.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Huai hsin Chi. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv, abs/2205.10625.

Shiyi Zhu, Jing Ye, Wei Jiang, Siqiao Xue, Qi Zhang, Yifan Wu, and Jianguo Li. 2024. CoCA: Fusing position embedding with collinear constrained attention in transformers for long context window extending. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
Appendices

A Experimental Details

A.1 Dataset Details

Spider (Yu et al., 2018b). It consists of 10,181 questions and 5,693 unique complex SQL queries across 200 databases, covering 138 domains, each containing multiple tables. The standard protocol for this dataset divides it into 8,659 training examples across 146 databases, 1,034 development examples across 20 databases, and a holdout of 2,147 test examples across 34 databases. The databases used in each of these sets are non-overlapping. SQL queries are categorized into four difficulty levels, based on the number of SQL keywords used, the presence of nested subqueries, and the usage of column selections and aggregations.

BIRD (Li et al., 2023b). This dataset represents a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare and education. BIRD also introduces external knowledge as an additional resource to assist models in generating accurate SQL queries. Specifically, four sources of external knowledge were introduced: numeric reasoning knowledge, domain knowledge, synonym knowledge, and value illustration. Notably, the SQL queries in the BIRD dataset tend to be more intricate than those in the Spider dataset.

WikiSQL (Zhong et al., 2017). This dataset consists of a corpus of 80,654 natural statement expressions and SQL annotations of 24,241 tables. The queries in WikiSQL are limited to a single table and do not include complex operations such as sorting, grouping, subqueries, etc.

CoSQL (Yu et al., 2019). This dataset is a conversational version of the Spider task. CoSQL consists of 30,000 rounds and 10,000 annotated SQL queries from a Wizard-of-Oz collection of 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise inform.

Chase (Guo et al., 2021). This dataset is to date the largest Chinese dataset for the cross-database, context-dependent text-to-SQL problem. It consists of 5,459 question sequences (17,940 questions) over 280 databases. Each question in Chase has rich semantic annotations, including its SQL query, contextual dependency, and schema linking.

A.2 Metrics Details

We clarify the properties of the two metrics in detail.

Exact-set match accuracy (EM). EM treats each clause as a set and compares the prediction for each clause to its corresponding clause in the reference query. A predicted SQL query is considered correct only if all of its components match the ground truth. EM does not take values into account.

Execution accuracy (EX). EX compares the execution output of the predicted SQL query with that of the ground truth SQL query on some database instances. Execution accuracy provides a more precise estimate of the performance of the method, as there may be multiple valid SQL queries for a given question while EM only evaluates the predicted SQL against one of them.

A.3 Implementation Details

All models are implemented using the PyTorch framework (Paszke et al., 2017). For the 7B and 13B parameter scales, we adopt 1 Nvidia A100 Tensor Core GPU to run training. For the 70B parameter scale, we adopt 8 A100s to run training and inference.

Fine-tuning hyperparameter settings. The hyperparameters of the training are shown in Table 5.

Parameter            7B       13B      70B
GPUs                 1*A100   1*A100   8*A100
max source length    2048     2048     2048
max target length    512      512      512
fine-tuning type     lora     lora     lora
lora rank            64       64       64
lora alpha           32       32       32
lr                   0.0002   0.0002   0.0002
epoch                8        8        8
Table 5: Parameter settings of fine-tuning for different model scales.

A.4 Few Shot Prompting

Given the following database schema :

Table advisor, columns = [*,s_ID,i_ID]
Table classroom, columns = [*,building,room_number,capacity]
Table course, columns = [*,course_id,title,dept_name,credits]
Table department, columns = [*,dept_name,building,budget]
Table instructor, columns = [*,ID,name,dept_name,salary]
Table prereq, columns = [*,course_id,prereq_id]
Table section, columns = [*,course_id,sec_id,semester,year,building,room_number,time_slot_id]
Table student, columns = [*,ID,name,dept_name,tot_cred]
Table takes, columns = [*,ID,course_id,sec_id,semester,year,grade]
Table teaches, columns = [*,ID,course_id,sec_id,semester,year]
Table time_slot, columns = [*,time_slot_id,day,start_hr,start_min,end_hr,end_min]

Please write queries to answer the following questions:

Q: Find the title of courses that have two prerequisites.
Response: SELECT T1.title FROM course AS T1 JOIN prereq AS T2 ON T1.course_id = T2.course_id GROUP BY T2.course_id HAVING count(*) = 2.

Q: How many different instructors teach a course in the Spring of 2010?
Response: SELECT COUNT (DISTINCT ID) FROM teaches WHERE semester = 'Spring' AND YEAR = 2010.

Q: Find the name of the students and their department names sorted by their total credits in ascending order.
Response: SELECT name, dept_name FROM student ORDER BY tot_cred.

Q: Find the year which offers the largest number of courses.
Response: SELECT YEAR FROM SECTION GROUP BY YEAR ORDER BY count(*) DESC LIMIT 1.

Q: What are the names and average salaries for departments with average salary higher than 42000?
Response: SELECT dept_name , AVG (salary) FROM instructor GROUP BY dept_name HAVING AVG (salary) > 42000.

Q: Find the minimum salary for the departments whose average salary is above the average payment of all instructors.
Response: SELECT min(salary) , dept_name FROM instructor GROUP BY dept_name HAVING avg(salary) > (SELECT avg(salary) FROM instructor).

Q: What is the course title of the prerequisite of course Mobile Computing?
Response: SELECT title FROM course WHERE course_id IN (SELECT T1.prereq_id FROM prereq AS T1 JOIN course AS T2 ON T1.course_id = T2.course_id WHERE T2.title = 'Mobile Computing').
Listing 2: Full Examples of Text Representation Prompt on Spider Dataset.

Given the following database schema :

Table movies, columns = [*,movie_id,movie_title,movie_release_year,movie_url,movie_title_language,movie_popularity,movie_image_url,director_id,director_name,director_url]
Table ratings, columns = [*,movie_id,rating_id,rating_url,rating_score,rating_timestamp_utc,critic,critic_likes,critic_comments,user_id,user_trialist]
Table lists, columns = [*,user_id,list_id,list_title,list_movie_number,list_update_timestamp_utc,list_creation_timestamp_utc,list_followers,list_url,list_comments,list_description]
Table lists_users, columns = [*,user_id,list_id,list_update_date_utc,list_creation_date_utc,user_trialist,user_subscriber,user_avatar_image_url,user_cover_image_url,user_eligible_for_trial,user_has_payment_method]

Please write queries to answer the following questions:

Q: How many movies in the database were directed by Christopher Nolan?
Response: SELECT COUNT(*) FROM movies WHERE director_name = 'Christopher Nolan'.

Q: List all movies that have a popularity greater than 5000.
Response: SELECT movie_title FROM movies WHERE movie_popularity > 5000.

Q: Retrieve the URL of the most popular movie.
Response: SELECT movie_url FROM movies ORDER BY movie_popularity DESC LIMIT 1;

Q: Which user IDs have rated a movie on the 1st of January 2021.
Response: SELECT DISTINCT user_id FROM ratings WHERE rating_timestamp_utc BETWEEN '2021-01-01 00:00:00' AND '2021-01-01 23:59:59';

Q: What are the names of the directors for movies that have an image URL containing 'poster'?
Response: SELECT DISTINCT director_name FROM movies WHERE movie_image_url LIKE '%poster%'.

Q: Give me the IDs and release years of movies that have both a rating score higher than 4 and have been included in at least 10 lists created by users who had a payment method when they created the list.
Response: SELECT m.movie_id, m.movie_release_year FROM movies m JOIN ratings r ON m.movie_id = r.movie_id JOIN lists_users lu ON lu.user_id = ANY(SELECT user_id FROM lists WHERE list_id IN (SELECT list_id FROM lists WHERE movie_id = m.movie_id)) WHERE r.rating_score > 4 AND lu.user_has_payment_method = 1 GROUP BY m.movie_id, m.movie_release_year HAVING COUNT(DISTINCT lu.list_id) >= 10.

Q: Find the title of the most popular movie among those that have never received any critic comments.
Response: SELECT movie_title FROM movies JOIN ratings ON movies.movie_id = ratings.movie_id WHERE critic_comments = 0 ORDER BY movie_popularity DESC LIMIT 1;

Q: Find the names of movies from the year 2000 which have been added to at least 5 different lists and have an image URL available.
Response: SELECT DISTINCT m.movie_title FROM movies m JOIN lists l ON m.movie_id IN (SELECT movie_id FROM lists WHERE list_id = l.list_id) WHERE m.movie_release_year = 2000 AND m.movie_image_url IS NOT NULL GROUP BY m.movie_id HAVING COUNT(DISTINCT l.list_id) >= 5.

Q: Which user created the most number of lists while being a subscriber and having a profile cover image?
Response: SELECT user_id, COUNT(list_id) as num_lists FROM lists_users WHERE user_subscriber = 1 AND user_cover_image_url IS NOT NULL GROUP BY user_id ORDER BY num_lists DESC LIMIT 1.

Q: Provide the critic made by users who rated a movie more than 3 but less than 5 and got at least 10 likes on their review.
Response: SELECT critic FROM ratings WHERE rating_score BETWEEN 3 AND 5 AND critic_likes >= 10.

Q: How many lists were created by users who were subscribers and not trialists on January 1st, 2020?
Response: SELECT COUNT(DISTINCT list_id) FROM lists_users WHERE user_subscriber = 1 AND user_trialist = 0 AND list_creation_date_utc = '2020-01-01'.

Q: What are the titles of the lists which were created on '2022-05-15' and have more than 50 comments?
Response: SELECT list_title FROM lists WHERE list_creation_timestamp_utc = '2022-05-15' AND list_comments > 50.

Q: What is the name and URL of the movie that has the latest rating timestamp?
Response: SELECT movie_title, movie_url FROM movies WHERE movie_id = (SELECT movie_id FROM ratings ORDER BY rating_timestamp_utc DESC LIMIT 1).

Q: Which movie has the highest number of critic likes?
Response: SELECT movie_id FROM ratings ORDER BY critic_likes DESC LIMIT 1;
Q: Retrieve the list description and URL for lists created by trialists that have been updated since 2021 and contain movies directed by Christopher Nolan.
Response: SELECT l.list_description, l.list_url FROM lists l JOIN lists_users lu ON l.list_id = lu.list_id JOIN movies m ON m.movie_id IN (SELECT movie_id FROM lists WHERE list_id = l.list_id) WHERE lu.user_trialist = 1 AND l.list_update_timestamp_utc > '2021-01-01' AND m.director_name = 'Christopher Nolan'.

Q: List all the directors along with the average rating score for movies they directed that have over 1000 followers on Mubi lists.
Response: SELECT director_name, AVG(rating_score) AS avg_rating FROM movies JOIN ratings ON movies.movie_id = ratings.movie_id LEFT JOIN lists ON movies.movie_id = lists.list_movie_number GROUP BY director_name HAVING SUM(list_followers) > 1000.

Listing 3: Full Examples of Text Representation Prompt on BIRD Dataset.

• Integration of more model training techniques. In addition to pre-training, the community is also interested in continual learning techniques for language models, such as continual pre-training (Jiang et al., 2023), prompt learning (Wang et al., 2022b) or positional encoding techniques (Zhu et al., 2024). The integration of these methods will greatly facilitate the research community in these areas.
MODEL            EASY              MEDIUM            HARD              EXTRA             OVERALL
                 BASE  L/QL        BASE  L/QL        BASE  L/QL        BASE  L/QL        BASE  L/QL
LLaMA2-7B        0.000 0.827/0.810 0.000 0.614/0.574 0.000 0.408/0.443 0.000 0.307/0.295 0.000 0.581/0.564
LLaMA2-13B       0.000 0.867/0.835 0.000 0.670/0.670 0.000 0.483/0.517 0.000 0.386/0.349 0.000 0.640/0.632
LLaMA2-70B       0.327 0.847/−     0.112 0.679/−     0.075 0.454/−     0.018 0.382/−     0.142 0.635/−
CodeLLaMA-7B     0.174 0.883/0.871 0.127 0.736/0.721 0.063 0.523/0.553 0.012 0.309/0.291 0.121 0.643/0.628
CodeLLaMA-13B    0.617 0.910/0.910 0.545 0.727/0.688 0.377 0.624/0.556 0.224 0.365/0.382 0.487 0.706/0.682
CodeLLaMA-70B    0.688 0.928/−     0.582 0.723/−     0.400 0.655/−     0.278 0.366/−     0.527 0.713/−
Baichuan2-7B     0.326 0.832/0.815 0.104 0.588/0.621 0.025 0.402/0.454 0.000 0.225/0.286 0.119 0.579/0.602
Baichuan2-13B    0.363 0.839/0.827 0.141 0.632/0.650 0.040 0.483/0.460 0.000 0.325/0.313 0.155 0.607/0.606
Qwen-7B          0.365 0.802/0.778 0.101 0.643/0.608 0.063 0.517/0.471 0.024 0.331/0.313 0.161 0.610/0.578
Qwen-14B         0.758 0.867/0.851 0.318 0.713/0.735 0.172 0.529/0.506 0.066 0.398/0.367 0.359 0.623/0.668
Qwen-72B         0.754 0.903/−     0.316 0.726/−     0.241 0.523/−     0.102 0.386/−     0.374 0.680/−
ChatGLM3-6B      0.000 0.776/0.763 0.000 0.564/0.533 0.000 0.457/0.477 0.000 0.261/0.224 0.000 0.521/0.542
Table 6: Evaluations on Spider: EM of base models vs. fine-tuned models on each complexity split and on the overall dataset. "L" and "QL" denote the LoRA and QLoRA tuning methods, respectively.
Table 7: Evaluations on BIRD: EM of base models vs. fine-tuned models on each complexity split and on the overall dataset. "L" and "QL" denote the LoRA and QLoRA tuning methods, respectively.
MODEL            SPIDER              BIRD
                 LoRA     QLoRA      LoRA     QLoRA
LLaMA2-7B        ↑0.626   ↑0.608     ↑0.169   ↑0.168
LLaMA2-13B       ↑0.680   ↑0.664     ↑0.167   ↑0.163
LLaMA2-70B       ↑0.687   −          ↑0.186   −
CodeLLaMA-7B     ↑0.453   ↑0.447     ↑0.228   ↑0.214
CodeLLaMA-13B    ↑0.217   ↑0.198     ↑0.204   ↑0.204
CodeLLaMA-70B    ↑0.204   −          ↑0.179   −
Baichuan2-7B     ↑0.268   ↑0.289     ↑0.133   ↑0.123
Baichuan2-13B    ↑0.286   ↑0.267     ↑0.141   ↑0.101
Qwen-7B          ↑0.417   ↑0.427     ↑0.148   ↑0.133
Qwen-14B         ↑0.090   ↑0.128     ↑0.075   ↑0.068
Qwen-72B         ↑0.112   −          ↑0.019   −
ChatGLM3-6B      ↑0.590   ↑0.581     ↑0.156   ↑0.128