NL2SQL Handbook

The document presents a comprehensive survey on Natural Language to SQL (NL2SQL) utilizing Large Language Models (LLMs), detailing the evolution of NL2SQL solutions, challenges, and practical guidance for development. It highlights the importance of high-quality training data, evaluation methods, and error analysis in enhancing NL2SQL systems. The survey also invites contributions and updates from the community to address overlooked significant works in this emerging field.

A Survey of NL2SQL with Large Language Models:

Where are we, and where are we going?


Acknowledgement
• The slides accompany our recent NL2SQL survey[1] and handbook[2]. I would like to
extend my gratitude to my students and collaborators for their valuable contributions.
• All materials in the slides, including the figures, are available for distribution and
use for educational and research purposes. If you find any resources in our survey,
handbook, or slides helpful, please feel free to cite them in your work[3].
• As NL2SQL is an emerging area, we would appreciate your informing us of any significant
work we may have overlooked in our survey, so we can address it in future updates.
[1] Liu, Xinyu, et al. "A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?." arXiv preprint arXiv:2408.05109 (2024).
[2] NL2SQL Handbook: https://github.com/HKUSTDial/NL2SQL_Handbook
[3] @misc{liu2024surveynl2sqllargelanguage,
title={A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?},
author={Xinyu Liu and Shuyu Shen and Boyan Li and Peixian Ma and Runzhi Jiang and Yuyu Luo and Yuxin Zhang and Ju Fan and Guoliang Li and Nan Tang},
year={2024},
eprint={2408.05109},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2408.05109},
}

1
An Overview of the Survey: the entire lifecycle of NL2SQL

(a) NL2SQL Model Evolution Stream Graph


- Recent advances in NL2SQL solutions, driven
by pre-trained language models (PLMs) and
large language models (LLMs)

(b) Benchmarks and Training Data Synthesis


- How to collect high-quality training data
- How to synthesize high-quality training data

(c) NL2SQL Evaluation


- Multi-angle evaluation
- Scenario-based evaluation

(d) NL2SQL Error Analysis


- Provide a taxonomy to summarize typical
errors produced by NL2SQL methods

2
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

3
Outline
• NL2SQL Problem and Background
• Problem Formulation
• NL2SQL Human Workflow
• NL2SQL Task Challenges
• Solving Challenges with Large Language Models
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

4
What is Natural Language to SQL (NL2SQL)?
Users pose an NL query over a database (DB); an NL2SQL model translates the query intent into an executable SQL query.

NL Query: Who are the three youngest winners across all matches? And their ranks?

SQL: SELECT winner_name, winner_rank FROM matches ORDER BY winner_age LIMIT 3

5
NL2SQL Model Evolution Stream Graph

6
Outline
• NL2SQL Problem and Background
• Problem Formulation
• NL2SQL Human Workflow
• NL2SQL Task Challenges
• Solving Challenges with Large Language Models
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

7
NL2SQL Human Workflow
Step-1 NL Understanding

Find the number of dog pets that are raised by female students.

Step-2 Schema Linking and Database Content Retrieval

Pets(PetID, PetType, PetAge, ...) with cell value 'Dog'; Student(StuID, Sex, Age, ...) with cell value 'F'; Has_Pet(PetID, StuID, ...)

Step-3 Translating the NL Intent into the SQL

SELECT count(*) FROM student AS T1 JOIN has_pet AS T2 ON T1.stuid=T2.stuid
JOIN pets AS T3 ON T2.petid=T3.petid WHERE T1.sex='F' AND T3.pettype='Dog'
8
Outline
• NL2SQL Problem and Background
• Problem Formulation
• NL2SQL Human Workflow
• NL2SQL Task Challenges
• Solving Challenges with Large Language Models
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

9
NL2SQL Task Challenges

C1. Ambiguous NL Query
C2. Requiring Domain Knowledge
C3. Complex Schema
10
NL2SQL Task Challenges

C1. Ambiguous NL Query
C2. Requiring Domain Knowledge
C3. Complex Schema
C4. Multiple Possible SQL Queries

11
NL2SQL Task Challenges

C1. Ambiguous NL Query
C2. Requiring Domain Knowledge
C3. Complex Schema
C4. Multiple Possible SQL Queries
C5. Database Schema Dependency
C6. Database Domain Adaptation

12
NL2SQL Challenges (taxonomy)

Task Challenges (inherent challenges)
• Uncertain NL Query: Lexical Ambiguity, Syntactic Ambiguity, Under-specification, User Mistakes
• Complex Database and Dirty Content: Complex Relationships Among Tables, Ambiguity in Attributes and Values, Domain-Specific Schema Designs, Large and Dirty Database Values
• NL2SQL Translation: Free-form NL vs. Constrained and Formal SQL, Multiple Possible SQL Queries, Database Schema Dependency

Technical Challenges in Developing NL2SQL Solutions
• SQL Efficiency
• Model Efficiency
• Insufficient and Noisy Training Data
• Cost-effective Solution
• Data Privacy
• Trustworthiness and Reliability

13
Uncertain NL Query
Type | Explanation | Example

Lexical Ambiguity | A single word with multiple meanings | The word "bat" can refer to: 1) an animal, 2) a baseball bat, or 3) the action of swinging

Syntactic Ambiguity | A sentence can be parsed in multiple ways | Mary [saw [the man [with the telescope]]] vs. Mary [saw [the man]] [with the telescope]

Under-specification | Linguistic expressions lack sufficient detail to convey specific intentions or meanings clearly | Labor Day in 2023 refers to September 4th in the US but May 1st in China

User Mistakes | Spelling mistakes and grammatical errors | calendar → calender
14
Complex Database and Dirty Content
Type | Explanation | Example

Complex Relationships Among Tables | Databases often contain hundreds of tables with complex interrelationships. | The FIBEN Benchmark[1] has 152 tables per database.

Ambiguity in Attributes and Values | Attributes and values within a database can be ambiguous, making it difficult for NL2SQL systems to determine the correct context. | An attribute or value named "status" could refer to an order's shipping status in the shipping table, while in the payment table, it might represent the payment status.

Large and Dirty Database Values | In large databases, efficiently handling vast data volumes is critical, and data values may sometimes be inaccurate or "dirty". | In a BI scenario, the database may contain thousands of tables; missing values, duplicates, or inconsistencies are common.

Domain-Specific Schema Designs | Different domains often have unique database designs and schema patterns. The variations in schema design across domains make it difficult to develop a one-size-fits-all solution. | Healthcare databases may include complex relational structures covering patient records, medical history, and treatment plans, requiring specialized joins and query patterns.
[1] Jaydeep Sen, Chuan Lei, Abdul Quamar, Fatma Özcan, Vasilis Efthymiou, Ayushi Dalmia, Greg Stager, Ashish R. Mittal, Diptikalyan Saha, Karthik Sankaranarayanan: 15
ATHENA++: Natural Language Querying for Complex Nested SQL Queries. Proc. VLDB Endow. 13(11): 2747-2759 (2020) 15
NL2SQL Translation
Challenge 1: Free-form NL vs. Constrained and Formal SQL
Natural language is flexible, while SQL queries must follow strict syntax rules.

Challenge 2: Multiple Possible SQL Queries


A single NL query can correspond to multiple SQL queries that fulfill the query intent,
leading to ambiguity in determining the appropriate SQL translation.

Example

16
NL2SQL Translation
Challenge 3: Database Schema Dependency
The NL2SQL translation process is highly dependent on the database schema it interacts with.

• Example: The same user intent can result in different SQL queries for different database schemas

17
NL2SQL Translation
Challenge 3: Database Schema Dependency
The NL2SQL translation process is highly dependent on the database schema it interacts with.

• Example: The same user intent can result in different SQL queries for different database schemas

The update in the DB schema

18
Technical Challenges in Developing Solutions

Type | Explanation

SQL Efficiency | The SQL generated by NL2SQL models must be both correct and optimized for execution efficiency.

Model Efficiency | There is often a trade-off between model performance and efficiency, as larger models generally deliver better results due to scaling laws but suffer from high latency.

Insufficient and Noisy Training Data | Acquiring high-quality NL2SQL training data is challenging. The limited publicly available datasets are often insufficient for training robust models, and noisy annotations further compromise their quality.

19
Technical Challenges in Developing Solutions

Type | Explanation

Cost-effective Solution | Deploying NL2SQL models, especially those powered by large language models, requires substantial resources, including hardware and API costs.

Data Privacy | Data privacy is crucial for NL2SQL systems, especially when the databases contain sensitive data.

Trustworthiness and Reliability | For NL2SQL models to be widely used, they must be trustworthy and reliable, consistently delivering accurate results across various datasets and use cases. Trustworthiness also requires that the model's decisions are transparent.

20
Outline
• NL2SQL Problem and Background
• Problem Formulation
• NL2SQL Human Workflow
• NL2SQL Task Challenges
• Solving Challenges with Language Models
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

21
Solving Challenges with Language Models

Figure: The Evolution of NL2SQL Solutions from the Perspective of Language Models. 22
Recap: LLM vs. PLM

PLMs (Pre-trained Language Models):
• Trained on large text corpora
• General language understanding
• Examples: BERT, T5

LLMs (Large Language Models):
• A subset of PLMs with larger size and more training data
• Superior language understanding and emergent capabilities
• Examples: GPT-4, LLaMA2

Figure: a timeline of publicly available (large) language models released between 2019 and 2023, e.g., T5, GPT-3, Codex, PaLM, LLaMA, GPT-4.

Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, Kaiwen Cai, Yanghao Zhang, Sihao Wu, Peipei Xu, Dengyu Wu, André Freitas, Mustafa A. Mustafa: A survey of safety and trustworthiness of large language models through the lens of verification and validation. Artif. Intell. Rev. 57(7): 175 (2024)

23
How to adapt LLM/PLM to NL2SQL?
• LLMs are a type of PLM characterized by superior language understanding and
emergent capabilities

Steps for applying PLMs to downstream tasks:

1. Pre-training: Training a model on large-scale data to learn general knowledge.
2. Fine-tuning: Adapting a pre-trained model to a specific task using specialized data.

Steps for applying LLMs to downstream tasks:
1. Pre-training (optional)
2. Fine-tuning (optional)
3. Prompting the LLMs with different strategies (e.g., in-context learning)

24
NL2SQL Solution in the Era of LLMs
Pre-train and Fine-tune LLMs for NL2SQL:

• Diverse datasets (Dp) include a broad range of linguistic patterns and domain-general knowledge, enabling
the model to develop robust understanding capabilities.
• Specialized datasets (Df) are closely aligned with the NL2SQL task.

Finetuning NL2SQL

More details are given in the section "Practical Guidance for Developing NL2SQL Solutions" and in
https://dbgroup.cs.tsinghua.edu.cn/ligl/papers/tutorial-llm-meets-db.pdf
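For intuition, a single specialized (Df) training example might look like the sketch below; the field names and schema serialization are illustrative assumptions rather than a format prescribed by the survey:

# A minimal sketch of one supervised fine-tuning example for NL2SQL (illustrative format).
example = {
    "prompt": (
        "Database schema:\n"
        "students(id, name, grade_level)\n"
        "grades(id, student_id, course_id, grade)\n\n"
        "Question: Display the names of all students who have received an 'A' grade.\n"
        "SQL:"
    ),
    "completion": (
        "SELECT T1.name FROM students AS T1 "
        "JOIN grades AS T2 ON T1.id = T2.student_id WHERE T2.grade = 'A'"
    ),
}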
25
NL2SQL Solution in the Era of LLMs

In-Context Learning for NL2SQL:


The goal is to optimize the prompt function P to guide the LLMs.

• K denotes additional information or domain-specific knowledge related to NL or DB.
• P is a prompt function that transforms the input (NL, DB, K) into a suitable textual prompt for the LLMs.
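A minimal sketch of such a prompt function P, assuming a simple textual schema serialization (the function and field layout below are illustrative, not the survey's definition):

def build_prompt(nl_query: str, schema_text: str, knowledge: str = "") -> str:
    """P(NL, DB, K): serialize the question, schema, and extra knowledge K into one prompt."""
    parts = [
        "You are an NL2SQL assistant. Translate the question into a SQL query.",
        f"Database schema:\n{schema_text}",
    ]
    if knowledge:
        parts.append(f"Domain knowledge:\n{knowledge}")
    parts.append(f"Question: {nl_query}")
    parts.append("SQL:")
    return "\n\n".join(parts)

Few-shot demonstrations, hints, or chain-of-thought instructions can be appended to the same prompt in the in-context learning setting.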

Example: In-context learning for NL2SQL

26
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• NL2SQL Translation Methods
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

27
An Overview of Language Model-Powered NL2SQL
• Pre-Processing
• Schema Linking
• Database Content Retrieval
• Additional Information Acquisition
• NL2SQL Translation Methods
• Encoding Strategy
• Decoding Strategy
• Task-specific Prompt Strategies
• Intermediate Representation
• Post-Processing
• Correction
• Consistency
• Execution-Guided
• N-best Rerankers
Figure: An Overview of NL2SQL Methods in the LM Era 28
TABLE I: Comparisons of Existing NL2SQL Solutions.
Columns: Methods | Year | Finetuning | Pre-Processing (Schema Linking, DB Content Retrieval, Additional Information Acquisition) | NL2SQL Translation Methods (Backbone Model, Encoding Strategy, Intermediate Representation, Task-specific Prompt Strategy, Decoding Strategy) | Post-Processing (Correction, Consistency, Execution-Guided, N-best Rerankers). A "✓" marks a component the method uses; "-" marks one it does not.
CHESS [51] 2024 - ✓ ✓ ✓ Decoder-Only Sequential Encoding - CoT Greedy Search ✓ ✓ ✓ -
CodeS [46] 2024 - ✓ ✓ - Decoder-Only Sequential Encoding - - Greedy Search - - ✓ -
SFT CodeS [46] 2024 ✓ ✓ ✓ ✓ Decoder-Only Sequential Encoding - - Greedy Search - - ✓ -
FinSQL [47] 2024 ✓ ✓ - ✓ Decoder-Only Sequential Encoding - - Greedy Search ✓ ✓ - -
DTS-SQL [52] 2024 ✓ ✓ - - Decoder-Only Sequential Encoding - - Greedy Search - - - -
TA-SQL [53] 2024 - ✓ - - Decoder-Only Sequential Encoding Sketch Structure CoT Greedy Search - - - -
SuperSQL [45] 2024 - ✓ ✓ ✓ Decoder-Only Sequential Encoding - - Greedy Search - ✓ - -
ZeroNL2SQL [44] 2024 ✓ - - - Encoder-Decoder Sequential Encoding Sketch Structure Decomposition Beam Search ✓ - ✓ -
PET-SQL [54] 2024 ✓ ✓ - ✓ Decoder-Only Sequential Encoding - - Greedy Search - ✓ - -
CoE-SQL [55] 2024 - - - ✓ Decoder-Only Sequential Encoding - CoT Greedy Search ✓ - - -
PURPLE [56] 2024 - ✓ - ✓ Decoder-Only Sequential Encoding - - Greedy Search ✓ ✓ ✓ -
MetaSQL [57] 2024 - ✓ - ✓ Decoder-Only Sequential Encoding - Decomposition Greedy Search - - - ✓
DEA-SQL [58] 2024 - ✓ - ✓ Decoder-Only Sequential Encoding - Decomposition Greedy Search ✓ - - -
DIN-SQL [5] 2023 - ✓ - ✓ Decoder-Only Sequential Encoding Syntax Language Decomposition Greedy Search ✓ - - -
DAIL-SQL [6] 2023 - - - ✓ Decoder-Only Sequential Encoding - - Greedy Search - ✓ - -
C3-SQL [59] 2023 - ✓ - - Decoder-Only Sequential Encoding - CoT Greedy Search - ✓ - -
RESDSQL [7] 2023 ✓ ✓ ✓ - Encoder-Decoder Sequential Encoding Syntax Language Decomposition Beam Search - - - -
T5-3B+NatSQL+Token Preprocessing [60] 2023 ✓ ✓ ✓ - Encoder-Decoder Sequential Encoding Syntax Language - Greedy Search - - - -
ACT-SQL [61] 2023 - ✓ - ✓ Decoder-Only Sequential Encoding - CoT Greedy Search - - - -
ODIS [62] 2023 - - - ✓ Decoder-Only Sequential Encoding - - Greedy Search - - - -
MAC-SQL [63] 2023 - ✓ - - Decoder-Only Sequential Encoding - Decomposition Greedy Search ✓ - ✓ -
SC-Prompt [64] 2023 ✓ - - - Encoder-Decoder Separate Encoding Sketch Structure - Beam Search ✓ - - -
CatSQL [65] 2023 ✓ - - - Encoder-Only Sequential Encoding Sketch Structure - Beam Search ✓ - - -
SQLFormer [66] 2023 ✓ ✓ - - Encoder-Decoder Graph-based Encoding - - Beam Search - - - -
G³R [67] 2023 ✓ ✓ ✓ - Encoder-Only Graph-based Encoding - CoT Beam Search - - - ✓
Graphix-T5 [43] 2022 ✓ ✓ ✓ - Encoder-Decoder Graph-based Encoding - - Constraint-aware Incremental - - - -
SHiP [68] 2022 ✓ - ✓ - Encoder-Decoder Graph-based Encoding - - Constraint-aware Incremental - - - -
N-best List Rerankers [69] 2022 ✓ ✓ ✓ - Encoder-Decoder Sequential Encoding - - Constraint-aware Incremental - - - ✓
RASAT [70] 2022 ✓ - ✓ - Encoder-Decoder Graph-based Encoding - - Constraint-aware Incremental - - - -
PICARD [71] 2022 ✓ - ✓ - Encoder-Decoder Sequential Encoding - - Constraint-aware Incremental - - - -
TKK [72] 2022 ✓ - ✓ - Encoder-Decoder Separate Encoding Sketch Structure Decomposition Constraint-aware Incremental - - - -
S²SQL [73] 2022 ✓ ✓ ✓ - Encoder-Only Graph-based Encoding - - Greedy Search - - - -
RAT-SQL [74] 2021 ✓ ✓ ✓ - Encoder-Only Graph-based Encoding Syntax Language - Beam Search - - - -
SmBoP [75] 2021 ✓ - ✓ - Encoder-Only Graph-based Encoding - - Beam Search - - - -
RaSaP [76] 2021 ✓ ✓ ✓ - Encoder-Only Graph-based Encoding - - Beam Search - - - -
BRIDGE [77] 2020 ✓ - ✓ - Encoder-Only Sequential Encoding - - Others - - - -

Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, and Nan Tang. 2024.
29
A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? arXiv:2408.05109 [cs.DB] https://arxiv.org/abs/2408.05109
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• Schema Linking
• Database Content Retrieval
• Additional Information Acquisition
• NL2SQL Translation Methods
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

30
Pre-Processing
Motivation:
• The pre-processing ensures the accurate mapping and processing of key
information within the limited input, by identifying the tables and columns related
to the given NL query.
• The pre-processing serves as an enhancement to the model’s inputs in the
NL2SQL translation.

Schema Linking Database Content Retrieval Additional Information


Acquisition
31
Schema Linking
Motivation:
• Database schemas are complex and often contain many tables and columns that are irrelevant to the given NL query.
• The model has input limitations, e.g., the input limit of ChatGPT is 4096 tokens.

Goal:
• Identify the tables and columns related to the given NL.

32
Workflow of Schema Linking
NL Query: Display the names of all students who have received an 'A' grade and the names of the courses they are enrolled in.

Full schema: Students(id, Name, Grade_level, Class, Age), Grades(id, Student_id, Course_id, Grade), Course(id, Course_name, Teacher_id), Teacher(id, Name, Subject, Age), Module(id, Describe)

After schema linking: the Students, Grades, and Course tables are kept, while irrelevant tables (Teacher, Module) and columns (e.g., Class, Age) are pruned.
33
Method Classification of Schema Linking

Method classification:
• String Matching-based Schema Linking
• Neural Network-based Schema Linking
• In-Context Learning for Schema Linking

34
String Matching-based Schema Linking
IRNet designs string-based schema linking and an intermediate representation called SemQL.
• It uses n-grams (of length 1 to 6) extracted from the user query as candidates.
• It evaluates the string-level similarity between the query candidates and the database schema,
then classifies them as Exact or Partial matches.
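A rough sketch of this style of string matching-based linking (the n-gram lengths and Exact/Partial labels follow the description above; the similarity test itself is simplified for illustration):

def string_match_linking(question_tokens, schema_items, max_n=6):
    # Label question n-grams (length 1-6) as Exact or Partial matches of schema item names.
    links = []
    for n in range(max_n, 0, -1):
        for i in range(len(question_tokens) - n + 1):
            gram = " ".join(question_tokens[i:i + n]).lower()
            for item in schema_items:
                name = item.lower()
                if gram == name:
                    links.append((gram, item, "Exact"))
                elif gram in name or name in gram:
                    links.append((gram, item, "Partial"))
    return links

print(string_match_linking("find dog pets of female students".split(),
                           ["Pets", "Student", "PetType", "Sex"]))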

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, Dongmei Zhang:
35
Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. ACL (1) 2019: 4524-4535
String Matching-based Schema Linking

Limitations:
• False positives when candidates share common words.
• Lack the ability to handle synonyms.

36
Neural Network-based Schema Linking
RESDSQL proposes a ranking-enhanced encoding and skeleton-aware decoding framework.
• A cross-encoder is trained to classify tables and columns based on the input query.
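As an illustration only (not RESDSQL's actual implementation), a cross-encoder can score how relevant each schema item is to the question; the sketch below uses the sentence-transformers CrossEncoder API with a generic pre-trained checkpoint, whereas RESDSQL trains its own classifier:

from sentence_transformers import CrossEncoder

scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative checkpoint

question = "Display the names of all students who have received an 'A' grade."
schema_items = ["students.name", "grades.grade", "teacher.subject", "module.describe"]

# Score each (question, schema item) pair and keep only the top-ranked items.
scores = scorer.predict([(question, item) for item in schema_items])
ranked = sorted(zip(schema_items, scores), key=lambda pair: -pair[1])
print([item for item, _ in ranked[:2]])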

Haoyang Li, Jing Zhang, Cuiping Li, Hong Chen:


37
RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. AAAI 2023: 13067-13075
Neural Network-based Schema Linking
FinSQL is a model-agnostic LLM-based NL2SQL framework for financial analysis.
• It organizes the tables into batches.
• It devises a Parallel Cross-Encoder to retrieve relevant tables and columns from
hundreds of schema items.

Original Cross-Encoder (RESDSQL)

Parallel Cross-Encoder (FinSQL)

Chao Zhang, Yuren Mao, Yijiang Fan, Yu Mi, Yunjun Gao, Lu Chen, Dongfang Lou, Jinshu Lin:
38
FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis. SIGMOD Conference Companion 2024: 93-105
Neural Network-based Schema Linking

Limitations:
• Highly rely on the quantity and quality of training data
• Struggle to generalize effectively across databases with new schemas
or domains

39
In-Context Learning for Schema Linking
CHESS breaks down the NL2SQL task into a 3-staged pipeline, including entity and context
retrieval, schema selection, and query generation.
• Utilizing GPT-4 to extract keywords from both NL and Hint.
• It implements an efficient three-stage schema selection with different prompts.

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, Amin Saberi:
40
CHESS: Contextual Harnessing for Efficient SQL Synthesis. CoRR abs/2405.16755 (2024) Shayan Talaei, Mohammadreza
In-Context Learning for Schema Linking
MAC-SQL is a multi-agent collaborative framework for NL2SQL.
• The Selector agent performs the schema linking task.
• The Selector is activated only when the length of the database schema prompt exceeds a
specified threshold.

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, Zhoujun Li:
41
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL. CoRR abs/2312.11242 (2023)
In-Context Learning for Schema Linking

Limitations:
• Large schema may exceed context length limitation
• Highly rely on the quality of the prompt
• High API cost

42
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• Schema Linking
• Database Content Retrieval
• Additional Information Acquisition
• NL2SQL Translation Methods
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

43
Database Content Retrieval

Motivation:
• Due to the large databases and input limitations of the model,
retrieving cell values is resource-intensive.
• Effectively handle SQL clauses such as WHERE and JOIN.

Goal:
• Efficiently retrieve cell values related to the given NL.

44
Workflow of Database Content Retrieval
NL Query: Display the names of all students who have received an 'A' grade and the names of the courses they are enrolled in.

Schema: Students(id, Name, Grade_level, Class, Age), Grades(id, Student_id, Course_id, Grade), Course(id, Course_name, Teacher_id), Teacher(id, Name, Subject, Age), Module(id, Describe)

Database content retrieval locates the cell values mentioned in the NL query within the stored data (e.g., the value 'A' in Grades.Grade), so that they can be used in the generated SQL.
45
Method Classification of DB Content Retrieval

Method classification:
• String Matching-based Database Content Retrieval
• Neural Network-based Database Content Retrieval
• Index Strategy for Database Content Retrieval

46
String Matching-based Database Content Retrieval
BRIDGE designs an anchor text matching to automatically extract cell values mentioned in
the NL.
• It uses a heuristic method to calculate the maximum sequence match between the
question and the cell values to determine the matching boundary.
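A rough sketch of this kind of heuristic matching using Python's difflib; the normalization and threshold are illustrative assumptions, not BRIDGE's exact heuristic:

from difflib import SequenceMatcher

def match_anchor_texts(question, cell_values, threshold=0.85):
    # Keep cell values whose longest common span with the question covers most of the value.
    q = question.lower()
    anchors = []
    for value in cell_values:
        v = str(value).lower()
        m = SequenceMatcher(None, q, v).find_longest_match(0, len(q), 0, len(v))
        if m.size / max(len(v), 1) >= threshold:
            anchors.append(value)
    return anchors

print(match_anchor_texts("students who received an 'A' grade", ["A", "B", "Dog"]))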

Xi Victoria Lin, Richard Socher, Caiming Xiong:


47
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing. EMNLP (Findings) 2020: 4870-4888
String Matching-based Database Content Retrieval

Limitations:
• Struggles with synonyms
• High computational cost for large databases

48
Neural Network-based Database Content Retrieval
TABERT uses a method called database content snapshots.
• It uses an attention mechanism to manage information between cell values across
different rows.

Pengcheng Yin, Graham Neubig, Wen-tau Yih, Sebastian Riedel:


49
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. ACL 2020: 8413-8426
Neural Network-based Database Content Retrieval
Graphix-T5 improves structural reasoning capabilities by modeling the relationship
between cell values and the NL query.
• It uses a value-match relation that defines the relationship between cell values and
questions.

Jinyang Li, Binyuan Hui, Reynold Cheng, Bowen Qin, Chenhao Ma, Nan Huo, Fei Huang, Wenyu Du, Luo Si, Yongbin Li:
50
Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing. CoRR abs/2301.07507 (2023)
Neural Network-based Database Content Retrieval

Limitations:
• Lead to the misinterpretation of ambiguous or context-sensitive
queries.
• Require significant computing resources.

51
Index Strategy for Database Content Retrieval
CodeS introduces a coarse-to-fine cell value matching approach.
• It builds the index for all values using BM25.
• The index identifies candidate values that are relevant to NL.
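For illustration, a coarse candidate-value lookup of this kind can be built with a BM25 index, here via the rank_bm25 package; the tokenization and top-n setting are assumptions rather than CodeS's exact configuration:

from rank_bm25 import BM25Okapi

cell_values = ["A", "B", "Dog", "Cat", "Computer Science", "Mathematics"]
bm25 = BM25Okapi([v.lower().split() for v in cell_values])

question = "how many students study computer science"
# Coarse matching: retrieve candidate values; a finer value-matching step would follow.
print(bm25.get_top_n(question.lower().split(), cell_values, n=3))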

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen:
52
CodeS: Towards Building Open-source Language Models for Text-to-SQL. CoRR abs/2402.16347 (2024)
Index Strategy for Database Content Retrieval
CHESS uses a Locality-sensitive Hashing algorithm for approximate nearest
neighbor searches.
• It indexes unique cell values to quickly identify the top similar values related to
the NL query.
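A simplified sketch of approximate value lookup with MinHash LSH (using the datasketch package; the shingling and parameters are illustrative, not CHESS's exact setup):

from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 2, 1)):   # character 3-gram shingles
        m.update(text[i:i + 3].lower().encode("utf8"))
    return m

values = ["Computer Science", "Mathematics", "Physics"]
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for v in values:
    lsh.insert(v, minhash_of(v))             # index unique cell values once, offline

# At query time, find stored values similar to a keyword from the NL question.
print(lsh.query(minhash_of("computer sciences")))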

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, Amin Saberi:
53
CHESS: Contextual Harnessing for Efficient SQL Synthesis. CoRR abs/2405.16755 (2024)
Index Strategy for Database Content Retrieval

Limitations:
• Time-consuming to build.
• Requires frequent updates for dynamic databases, which increases
inference overhead.

54
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• Schema Linking
• Database Content Retrieval
• Additional Information Acquisition
• NL2SQL Translation Methods
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems 55
Additional Information Acquisition
Motivation:
• Incorporate additional information (e.g., domain knowledge) to improve the comprehension
capabilities of NL2SQL models for understanding the NL query, performing the schema
linking, and benefiting the NL2SQL translation.

Goal:
• Retrieve useful information such as demonstration examples, specific domain knowledge,
formulaic evidence, and format information for the NL2SQL backbone model or specific modules.
56
Method Classification of Additional Information Acquisition

Method classification:

• Sample-based Methods
• Retrieval-based Methods

57
Sample-based Methods
DIN-SQL inserts additional information through few-shot learning across multiple stages of
the workflow.
• These stages comprise schema linking, query classification, task decomposition, and self-
correction.
• It allows DIN-SQL to effectively tackle various challenges, including the complexity of
schema links, identification of multiple table joins, and handling of nested queries.

58
Mohammadreza Pourreza, Davood Rafiei: DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. NeurIPS 2023
Sample-based Methods
CodeS utilizes metadata examples of cross-domain databases as the main additional
information, including data types and annotation text, which help the model resolve
potential ambiguity issues and understand entity relationships
• This extracted information is transformed into coherent text and concatenated with the
question query to form the final input context.

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen:
CodeS: Towards Building Open-source Language Models for Text-to-SQL. Proc. ACM Manag. Data 2(3): 127 (2024) 59
Sample-based Methods

Limitations:

• Increasing computational cost


• Potential hallucination caused by low-quality samples or wrong knowledge

60
Retrieval-based Methods
ReBoost engages with the LLMs model using the Explain-Squeeze Schema Linking
mechanism.
• This mechanism is a two-phase strategy. Initially, it presents a generalized schema to the
LLMs to establish a foundational understanding.
• Subsequently, it employs targeted prompting to elicit detailed associations between query
phrases and specific database entities, thereby enhancing accuracy in mapping queries to
database structures without incurring excessive token cost.

Guanghu Sui, Zhishuai Li, Ziyue Li, Sun Yang, Jingqing Ruan, Hangyu Mao, Rui Zhao: Reboost Large Language Model-based Text-to-
61
SQL, Text-to-Python, and Text-to-Function - with Real Applications in Traffic Domain. CoRR abs/2310.18752 (2023)
Retrieval-based Methods
REGROUP constructs a formulaic knowledge base encompassing various domains, such
as finance, real estate, and transportation.
• It leverages a Dense Passage Retriever (DPR) to compute similarity scores for the
retrieval results from the formulaic knowledge base.
• Subsequently, an Erasing-Then-Awakening (ETA) model is used to integrate the entities
in these formulaic knowledge items with the entities in NL and schema.

Longxu Dou, Yan Gao, Xuqi Liu, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Min-Yen Kan, Jian-Guang Lou:
62
Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge. CoRR abs/2301.01067 (2023)
Retrieval-based Methods

Limitations:

• Insufficient exploration of retrieving formulated or structured domain knowledge


bases
• Computational cost of retrieving and embedding additional information

63
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• NL2SQL Translation Methods
• Encoding Strategy
• Decoding Strategy
• Task-specific Prompt Strategies
• Intermediate Representation
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

64
Encoding Strategy
Motivation:
• Converts unstructured and semi-structured data into a form that can be
processed for generating SQL queries.
• The encoding process captures the semantic meaning of the NL input and the
structural information of the database schema, enabling the model to
understand and map the user’s intent to the corresponding SQL query.

Goal:
• Transform NL and database schema into a structured format that can
be effectively utilized by a language model.
65
Method classification of Encoding Strategy

Method classification:
• Sequential Encoding Strategy
• Graph-based Encoding Strategy
• Separate Encoding Strategy

An Overview of the Encoding Strategies

66
Sequential Encoding Strategy
BRIDGE enhances the alignment between text and the database schema.
• It represents the NL question and database schema as a tagged sequence.
• It inserts matched database cell values (anchor texts) after the corresponding fields.
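Roughly, the serialized input can be sketched as below; the [T]/[C]/[V] markers are simplified placeholders, while BRIDGE's real serialization uses model-specific separator tokens:

question = "Display the names of all students who have received an 'A' grade."
schema = {
    "students": ["id", "name", "grade_level"],
    "grades": ["id", "student_id", "grade"],
}
anchor_texts = {("grades", "grade"): ["A"]}      # matched cell values from the question

parts = [question]
for table, columns in schema.items():
    parts.append(f"[T] {table}")
    for col in columns:
        piece = f"[C] {col}"
        for value in anchor_texts.get((table, col), []):
            piece += f" [V] {value}"             # insert the anchor text after its field
        parts.append(piece)
print(" ".join(parts))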

Xi Victoria Lin, Richard Socher, Caiming Xiong:


Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing. EMNLP (Findings) 2020: 4870-4888 67
Sequential Encoding Strategy
RESDSQL uses a ranking-enhanced encoder to sort and filter schema items, thereby
reducing the complexity of schema linking during encoding.

• This method ensures that the most relevant schema items are prioritized, improving the
overall efficiency of the encoding process.

Haoyang Li, Jing Zhang, Cuiping Li, Hong Chen:


RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. AAAI 2023: 13067-13075 68
Sequential Encoding Strategy

Limitations:
• Its linear approach, which treats all inputs as sequences of tokens, may fail
to fully capture the complex relationships between the database schema
and NL query.
• This can affect the model’s ability to understand and generate complex
queries.

69
Graph-based Encoding Strategy
RAT-SQL introduces a relation-aware self-attention mechanism.
• It allows the model to explicitly consider and utilize predefined relational information when jointly
encoding the question and the database schema.
• These relationships are represented as a graph structure, and it can more effectively capture the
structural information in the schema and its alignment with the NL query.
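Concretely, relation-aware self-attention adds learned relation embeddings between every pair of input items to the standard attention computation; a simplified sketch of the formulation (notation lightly adapted):

e_{ij} = (x_i W_Q)(x_j W_K + r_{ij}^{K})^T / \sqrt{d_z},
\alpha_{ij} = \mathrm{softmax}_j(e_{ij}),
z_i = \sum_j \alpha_{ij} (x_j W_V + r_{ij}^{V})

where r_{ij}^{K} and r_{ij}^{V} embed the predefined relation (e.g., a foreign-key or schema-linking edge) between items i and j.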
Figures: an illustration of an example schema as a graph G; one RAT layer in the schema encoder.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, Matthew Richardson:
70
RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. ACL 2020: 7567-7578
Graph-based Encoding Strategy
S²SQL uses ELECTRA as its pre-trained language model.

• It enhances encoding by injecting syntactic structure information at the encoding stage, improving
the semantic understanding and generation of models.

An overview
of S²SQL framework.

Binyuan Hui, Ruiying Geng, Lihan Wang, Bowen Qin, Yanyang Li, Bowen Li, Jian Sun, Yongbin Li:
71
S²SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers. ACL (Findings) 2022: 1254-1262
Graph-based Encoding Strategy
G³R introduces the LGESQL encoder.
• It captures and integrates multisource heterogeneous information by constructing and utilizing a
heterogeneous graph and a Graph Attention Network (GAT) , thereby enhancing the representation
capability for NL and database schema.

An overview of the
architecture of G³R
framework.

Yanzheng Xiang, Qian-Wen Zhang, Xu Zhang, Zejie Liu, Yunbo Cao, Deyu Zhou:
72
G³R: A Graph-Guided Generate-and-Rerank Framework for Complex and Cross-domain Text-to-SQL Generation. ACL (Findings) 2023: 338-352
Graph-based Encoding Strategy

Limitation:
• Require more sophisticated algorithms for constructing and processing
graph structures.
• Need a large amount of training data to fully leverage its structural
advantages.

73
Separate Encoding Strategy
TKK employs task decomposition and multi-task learning strategies in encoding.
• It breaks down the complex NL2SQL task into multiple subtasks.
• In this way, it can progressively acquire and combine knowledge.

Chang Gao, Bowen Li, Wenxuan Zhang, Wai Lam, Binhua Li, Fei Huang, Luo Si, Yongbin Li:
74
Towards Generalizable and Robust Text-to-SQL Parsing. EMNLP (Findings) 2022: 2113-2125
Separate Encoding Strategy
SC-Prompt divides text encoding into two stages: the structure stage and the content
stage, with each stage being encoded separately.

SC-prompt framework with


two-stage prompt learning

Zihui Gu, Ju Fan, Nan Tang, Lei Cao, Bowen Jia, Sam Madden, Xiaoyong Du:
75
Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning. Proc. ACM Manag. Data 1(2): 147:1-147:28 (2023)
Separate Encoding Strategy

Limitations:
• Require multiple processing of input data
• May extend the training and inference time of the model

76
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• NL2SQL Translation Methods
• Encoding Strategy
• Decoding Strategy
• Task-specific Prompt Strategies
• Intermediate Representation
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

77
Decoding Strategy
Motivation:
• The choice of decoding strategy directly affects the quality and
performance of the generated SQL queries.
• An excellent decoding strategy not only produces syntactically correct SQL
queries but also ensures that the semantics of the SQL queries align with
the NL and can even optimize the execution efficiency of the queries.

Goal:
• Convert the representations generated by the encoder into the target
SQL queries.

78
Method classification of Decoding Strategy

Method classification:
• Greedy search-based decoding strategy
• Beam search-based decoding strategy
• Constraint-aware incremental decoding
strategy

An Overview of the Decoding Strategies

79
Greedy Search-based Decoding Strategy
The greedy search-based decoding strategy is a
simple and fast approach for decoding.
• At each decoding step, greedy search selects the token with
the highest current probability as the output.
• This strategy builds the final output sequence by continuously
choosing the locally optimal solution.
• Since the default decoding strategy of GPT series models
(e.g., GPT-4) is greedy search-based decoding, NL2SQL
solutions based on GPT fall into this category.
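For illustration, with the Hugging Face transformers generate API the choice between greedy search and the beam search strategy discussed next is a single argument; the checkpoint below is a placeholder, not an NL2SQL-tuned model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")          # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompt = ("translate to SQL: Display the names of all students with an 'A' grade. "
          "| students(id, name) grades(student_id, grade)")
inputs = tokenizer(prompt, return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=64, num_beams=1, do_sample=False)
beam = model.generate(**inputs, max_new_tokens=64, num_beams=4, num_return_sequences=4)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.batch_decode(beam, skip_special_tokens=True))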

The Transformer-Model Architecture


Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: 80
Attention is All you Need. NIPS 2017: 5998-6008
Greedy Search-based Decoding Strategy
Limitations:
• Greedy search only considers the optimal solution at the current step,
ignoring long-term dependencies and overall optimal solutions.
• May result in the generation of SQL queries that are not globally optimal.
• Errors at each step could lead to biases in decision-making in subsequent
steps, potentially accumulating errors throughout the decoding process,
especially when handling complex queries.

81
Beam Search-based Decoding Strategy
RAT-SQL combines relation-aware graph structure encoding and generation techniques.
• During the decoding process, RAT-SQL uses beam search to generate multiple
candidate SQL queries
• These queries are then reranked, and the optimal query is selected based on graph
structure information.

Choosing a column in a tree decoder.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, Matthew Richardson:
RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. ACL 2020: 7567-7578 82
Beam Search-based Decoding Strategy
SmBoP uses a semi-autoregressive bottom-up decoding approach, improving decoding
efficiency by parallelizing the construction and scoring of sub-trees, achieving
logarithmic time complexity.

An overview of the decoding


procedure of SMBOP.

Ohad Rubin, Jonathan Berant:


SmBoP: Semi-autoregressive Bottom-up Semantic Parsing. SPNLP@ACL-IJCNLP 2021: 12-21 83
Beam Search-based Decoding Strategy

Limitations:
• More data needs to be processed at each decoding step.
• This significantly increases the demand for memory and computational
resources and results in slower decoding speeds than greedy search.

84
Constraint-aware Incremental Decoding Strategy
PICARD (Parsing Incrementally for Constrained Auto-Regressive Decoding) introduces the
constraint-aware incremental decoding strategy.
• The constraint-aware incremental decoding strategy is specifically designed for NL2SQL
tasks.
• This strategy aims to ensure the generation of syntactically correct SQL queries by
incorporating constraints during the decoding process.
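The mechanism can be sketched as a per-step filter over candidate continuations: a continuation survives only if the partial SQL remains a parsable prefix. The is_valid_sql_prefix checker below is a toy stand-in for PICARD's incremental parser:

def constrained_step(partial_sql, candidates, is_valid_sql_prefix):
    # Keep only next-token candidates that leave the partial SQL syntactically viable.
    allowed = [(tok, score) for tok, score in candidates
               if is_valid_sql_prefix(partial_sql + tok)]
    return allowed or candidates      # fall back if the checker rejects everything

def toy_prefix_checker(text):
    upper = text.upper()
    return " FORM " not in upper and "SELCT" not in upper

print(constrained_step("SELECT name", [(" FROM students", -0.1), (" FORM students", -0.2)],
                       toy_prefix_checker))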

Illustration of constrained beam search with beam


size 2 and PICARD.

Torsten Scholak, Nathan Schucher, Dzmitry Bahdanau:


PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. EMNLP (1) 2021: 9895-9901 85
Constraint-aware Incremental Decoding Strategy
Due to its ability to improve the accuracy of SQL generation, PICARD has been adopted as a
decoding strategy by several models, such as N-best List Rerankers.

• Compared to versions without PICARD, these models have shown improved accuracy.

PICARD explained by an example. The prediction


pattern is “〈Database name〉 | 〈pred SQL〉”.

Lu Zeng, Sree Hari Krishnan Parthasarathi, Dilek Hakkani-Tur:


N-Best Hypotheses Reranking for Text-to-SQL Systems. SLT 2022: 663-670 86
Constraint-aware Incremental Decoding Strategy

Limitations:

• Incremental decoding and progressively adding constraints may require


more computational resources and processing time.

87
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• NL2SQL Translation Methods
• Encoding Strategy
• Decoding Strategy
• Task-specific Prompt Strategies
• Intermediate Representation
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

88
Task-Specific Prompt Strategy

Motivation:
• Instruct the LLMs to optimize the SQL query generation process
according to task-specific rules.

Goal:
• Improve the accuracy of translating complex semantic NL query into
the corresponding SQL query.

89
Task-Specific Prompt Strategy

Method classification:

• Chain-of-Thought
• Task Decomposition

90
Chain-of-Thought
CHESS transforms NL into SQL statements using a streamlined pipeline that relies on
LLMs and CoT.
• This CoT process comprises entity and context retrieval, schema selection, SQL
generation, and revision.

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, Amin Saberi:
CHESS: Contextual Harnessing for Efficient SQL Synthesis. CoRR abs/2405.16755 (2024) 91
Chain-of-Thought

Limitations:
• Hallucination caused by unknown domain knowledge
• Hard to control the output of key components

92
Decomposition
TKK divides the initial NL2SQL parsing tasks into various small individual subtasks, with
each corresponding to the mapping of the NL query to one or more clauses of the SQL query.
• This decomposition approach allows the model to concentrate on learning the generation of
each clause, thereby compelling the model to understand the problem, the database schema,
and the alignment between each SQL clause.

Chang Gao, Bowen Li, Wenxuan Zhang, Wai Lam, Binhua Li, Fei Huang, Luo Si, Yongbin Li:
93
Towards Generalizable and Robust Text-to-SQL Parsing. EMNLP (Findings) 2022: 2113-2125
Decomposition
MAC-SQL incorporates a Decomposer agent designed to break down the user's original
problem into several subproblems.
• This decomposition process aims to lessen the complexity of the original question,
enabling the generation of simpler SQL queries to solve each individual subproblem.

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, Zhoujun Li:
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL. CoRR abs/2312.11242 (2023) 94
Decomposition
DIN-SQL employs a sophisticated categorization module for decomposition.
• It classifies queries into distinct complexity groups: Easy, Non-Nested Complex, and Nested Complex,
with the reference of NL and database schema.
• By strategically identifying and separating schema linking, join conditions, and nested structures, the
module facilitates a structured generation of SQL queries and improves the accuracy of translating
complex NL queries into executable SQL.

Mohammadreza Pourreza, Davood Rafiei: DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. NeurIPS 2023
95
Decomposition

Limitations:
• Increasing computational cost
• Lower efficiency, because the model needs to finish more sub-tasks

96
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• NL2SQL Translation Methods
• Encoding Strategy
• Decoding Strategy
• Task-specific Prompt Strategies
• Intermediate Representation
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

97
Intermediate Representation
Motivation:
• Design a grammar-free intermediate
representation compared to SQL as the bridge
between the ''free-form'' NL query and the
''constrained and formal'' SQL query.

Goal:
• Capture the essential components and
relationships of an NL query without the
strict syntax rules of SQL.

98
Method Classification of Intermediate Representation

Method classification:

• SQL-Like Syntax Language

• SQL-Like Sketch Structure

99
SQL-Like Syntax Language
NatSQL is a widely recognized SQL-like
syntax language that eliminates SQL
statement operators, keywords, set
operators, and other elements seldom
found in user problem descriptions.
• It enhances schema linking by
minimizing the necessary number of
schema items.

Yujian Gan, Xinyun Chen, Jinxia Xie, Matthew Purver, John R. Woodward, John H. Drake, Qiaofu Zhang:
Natural SQL: Making SQL Easier to Infer from Natural Language Specifications. EMNLP (Findings) 2021: 2030-2042 100
SQL-Like Syntax Language
QPL leverages the problem decomposition
strategy to improve the parsing of intricate
SQL queries.
• By breaking down a SQL query into
modularized sub-queries, the complexity of
the original query is reduced.
• This approach mitigates parsing difficulties
associated with complex problems and
cross-domain complex queries.

Ben Eyal, Moran Mahabi, Ophir Haroche, Amir Bachar, Michael Elhadad:
Semantic Decomposition of Question and SQL for Text-to-SQL Parsing. EMNLP (Findings) 2023: 13629-13645 101
SQL-Like Syntax Language
QDMR decomposes the original question into a number of atomic questions
• Each atomic question serves as an intermediate representation of the original question and
can be translated into a set of small-scale formal operations involving tasks such as selecting
entities, retrieving attributes, or aggregating information.

Tomer Wolfson, Daniel Deutch, Jonathan Berant:


Weakly Supervised Text-to-SQL Parsing through Question Decomposition. NAACL-HLT (Findings) 2022: 2528-2542 102
SQL-Like Syntax Language

Limitations:

• High complexity and inadequate coverage of database structures


• Cost of manual design and deployment

103
SQL-Like Sketch Structure
CatSQL constructs a general template sketch with slots serving as initial placeholders.
• Its base model focuses on parsing user queries to fill these placeholders, consequently
decreasing the computational resource cost.
• It implements a novel semantic correction algorithm to assess the semantic accuracy of the
resulting SQL queries and rectify any semantic issues detected in the generated queries.

Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, Jianling Sun:
104
CatSQL: Towards Real World Natural Language to SQL Applications. Proc. VLDB Endow. 16(6): 1534-1547 (2023)
SQL-Like Sketch Structure
SC-Prompt utilizes a two-stage divide-and-conquer method for NL2SQL parsing.
• In the first stage, it instructs the PLM to generate the SQL structure.
• In the subsequent stage, it directs the PLM to generate the actual content values that fill the
previously provided placeholders.

Zihui Gu, Ju Fan, Nan Tang, Lei Cao, Bowen Jia, Sam Madden, Xiaoyong Du:
105
Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning. Proc. ACM Manag. Data 1(2): 147:1-147:28 (2023)
SQL-Like Sketch Structure
ZeroNL2SQL integrates the schema alignment capabilities of PLM with the complex reasoning
capabilities of LLMs.
• Initially, it utilizes PLM to produce SQL sketches for achieving schema alignment and
subsequently employs LLMs to execute complex content reasoning for populating missing
information.
• Additionally, it proposes a predicate calibration method that aligns the sketch's predicates with
database instances and selects the optimal SQL query.

Zihui Gu, Ju Fan, Nan Tang, Songyue Zhang, Yuxin Zhang, Zui Chen, Lei Cao, Guoliang Li, Sam Madden, Xiaoyong Du:
106
Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation. CoRR abs/2306.08891 (2023)
SQL-Like Sketch Structure

Limitations:
• High computational cost
• Hard to construct accurate sketch when the query is much more complex

107
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• NL2SQL Translation Methods
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

108
Post-Processing
Motivation:
• Post-processing is a crucial step to refine the generated SQL queries,
ensuring they meet user expectations more accurately.
• This involves enhancing the initial SQL output using various
strategies.

109
SQL Correction Strategies
Motivation:
• The SQL generated by the model may sometimes contain syntax errors.
• SQL Correction Strategies aim to prompt LLMs to correct syntax errors
in SQL.

Goal:
• Convert the buggy SQL query into the correct SQL queries.

110
SQL Correction Strategies
DIN-SQL introduces a self-correction module for refining SQL queries generated by
NL2SQL models, addressing issues like redundant keywords or missing functions.
Two prompts are used: a general prompt for the CodeX model, which explicitly asks to
identify and fix errors in the “BUGGY SQL,” and a milder prompt for GPT-4, which simply
asks the model to point out potential issues without assuming the presence of errors.

Pourreza, Mohammadreza, and Davood Rafiei. "Din-sql: Decomposed in-context learning of text-to-sql with self correction." Advances in
Neural Information Processing Systems 36 (2024). 111
SQL Correction Strategies
Fin-SQL introduces an output calibration
algorithm to avoid executing time-
consuming SQL queries on large
financial databases.
Fin-SQL corrects spelling errors,
performs fuzzy matching for invalid
columns, and ensures that all tables
containing referenced columns are
included in the FROM clause.
This approach helps improve SQL query
accuracy without actual query execution.

Zhang, C., Mao, Y., Fan, Y., Mi, Y., Gao, Y., Chen, L., Lou, D. and Lin, J. 2024. FinSQL: Model-Agnostic LLMs-based Text-to-SQL
Framework for Financial Analysis. arXiv. 112
SQL Correction Strategies
Limitations:
• Primarily focus on specific syntax errors correction
• Lack of generalized error correction for other SQL errors

113
Output Consistency
Motivation:
• In complex reasoning problems, there are multiple reasoning paths
that can lead to a unique correct answer. It first samples multiple
different reasoning paths and then selects the most consistent answer
to significantly improve the quality of the output.
• The NL2SQL task, similar to such reasoning tasks, can also admit multiple ways of
writing SQL queries that express the same meaning.

Goal:
• Vote to determine the final SQL to enhance the consistency of model
outputs.
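A minimal sketch of such voting, where execute_sql is an assumed helper that runs a query on the target database and returns its rows:

def vote_by_execution(candidate_sqls, execute_sql):
    # Group candidates by their (canonicalized) execution result and pick the majority group.
    groups = {}
    for sql in candidate_sqls:
        try:
            key = str(sorted(map(tuple, execute_sql(sql))))
        except Exception:
            continue                     # drop candidates that fail to execute
        groups.setdefault(key, []).append(sql)
    if not groups:
        return None
    majority = max(groups, key=lambda k: len(groups[k]))
    return groups[majority][0]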
114
Output Consistency
C3-SQL incorporates the Consistency
Output (CO) component to enhance
the reliability of SQL queries
generated by large language models,
addressing their inherent
randomness.
CO samples multiple reasoning
paths to generate different SQL
queries, executes them, and uses a
voting mechanism to select the most
consistent result after removing
errors.

Dong, X., Zhang, C., Ge, Y., Mao, Y., Gao, Y., Chen, lu, Lin, J. and Lou, D. 2023. C3: Zero-shot Text-to-SQL with ChatGPT. arXiv. 115
Output Consistency
Fin-SQL addresses the randomness of
LLM outputs by generating multiple
candidate SQL queries in parallel. It
evaluates the equivalence of these
queries based on keyword consistency
and clusters them accordingly.
The final query is selected from the
largest cluster, improving the consistency
and reliability of SQL query generation.

Zhang, C., Mao, Y., Fan, Y., Mi, Y., Gao, Y., Chen, L., Lou, D. and Lin, J. 2024. FinSQL: Model-Agnostic LLMs-based Text-to-SQL
Framework for Financial Analysis. arXiv. 116
Output Consistency
The self-consistency method for LLMs uses high temperatures to create diverse
outputs but may suffer from performance degradation due to increased
hallucinations.
PET-SQL addresses this by employing a cross-consistency strategy where multiple
LLMs generate SQL queries at lower temperatures and select results through
execution-based voting. This approach enhances query diversity while preserving
LLM performance at lower temperatures.

Li, Z., Wang, X., Zhao, J., Yang, S., Du, G., Hu, X., Zhang, B., Ye, Y., Li, Z., Zhao, R. and Mao, H. 2024. PET-SQL: A Prompt-Enhanced Two-
Round Refinement of Text-to-SQL with Cross-consistency. arXiv. 117
Output Consistency
Limitations:
• Increased inference cost and time
• Reliance on multiple queries may not address all types of errors
• Effectiveness may decrease as LLM capabilities improve

118
Execution-Guided Strategies
Motivation:
• The execution result of an SQL query is a crucial piece of information,
as it may indicate the quality of the query itself. For instance, if the
query execution results in errors or NULL values, it might indicate an
issue with the SQL query.

Goal:
• Incorporating the results of SQL queries to guide subsequent
processing of the SQL.
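A simplified sketch of an execution-guided refinement loop; execute_sql and llm_fix are assumed helpers (one runs the query, the other asks the LLM to revise the SQL given the feedback):

def execution_guided_generate(sql, execute_sql, llm_fix, max_rounds=3):
    # Execute the SQL; on errors or empty results, feed the issue back to the LLM and retry.
    for _ in range(max_rounds):
        try:
            rows = execute_sql(sql)
        except Exception as err:
            sql = llm_fix(sql, feedback=f"Execution error: {err}")
            continue
        if not rows:
            sql = llm_fix(sql, feedback="The query executed but returned no rows.")
            continue
        return sql, rows
    return sql, None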

119
Execution-Guided Strategies
ZeroNL2SQL generates SQL queries iteratively
by checking the executability of multiple
candidate SQL sketches.
It provides error feedback to the LLMs, refining
the query until a feedback limit is reached, at
which point the current sketch is discarded.
This process continues until the LLMs generate
an optimal, executable SQL query as the final
output.

Gu, Z., Fan, J., Tang, N., Zhang, S., Zhang, Y., Chen, Z., Cao, L., Li, G., Madden, S. and Du, X. 2023. Interleaving Pre-Trained Language
120
Models and Large Language Models for Zero-Shot NL2SQL Generation. arXiv.
Execution-Guided Strategies
MAC-SQL introduces a Refiner agent that detects and corrects SQL errors in a multi-
agent collaboration framework.
The agent evaluates SQL queries for syntactic correctness, execution feasibility, and
whether they retrieve non-empty results from the database. If errors are found, it uses
feedback or guidance to correct the SQL, enhancing the model’s self-diagnosis and
overall accuracy.

Wang, B., Ren, C., Yang, J., Liang, X., Bai, J., Chai, L., Yan, Z., Zhang, Q.-W., Yin, D., Sun, X. and Li, Z. 2024. MAC-SQL: A Multi-Agent
Collaborative Framework for Text-to-SQL. arXiv. 121
Execution-Guided Strategies
Limitations:
• Increased SQL generation time, especially with large databases
• Dependency on execution results may not address logical errors
• Limited scalability for complex or extensive databases

122
N-best Rerankers Strategies
Motivation:
• In cross-domain scenarios, the generated SQL queries may exhibit
subtle variances in both structure and semantics.
• We can rerank the generated SQL queries using a larger model or
additional knowledge.

Goal:
• Reorder the top-n results from the original model using a larger
model or incorporating additional knowledge sources.

123
N-best Rankers Strategies
G3R addresses the issue that reranking can be unstable and dependent on threshold settings by
introducing a feature-enhanced SQL Reranker based on pre-trained language models
with hybrid prompt tuning, allowing it to bridge domain gaps without adding extra
parameters.

Xiang, Y., Zhang, Q.-W., Zhang, X., Liu, Z., Cao, Y. and Zhou, D. G3R: A Graph-Guided Generate-and-Rerank Framework for Complex and
124
Cross-domain Text-to-SQL Generation.
N-best Rerankers Strategies
Limitations:
• Limited applicability to LLMs with stronger inference capabilities
• Reduced effectiveness as LLMs increase in size and sophistication
• Reliance on additional models or knowledge for effective filtering

125
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• Pre-Processing
• NL2SQL Translation Methods
• Post-Processing
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

126
Why Do We Need Benchmarks?
Motivation:
• To quantify the performance of NL2SQL models
• To provide sufficient data for training NL2SQL models
• To simulate the challenges in real-world application scenarios

127
An Overview of NL2SQL Benchmarks

128
Single-domain Dataset
Motivation:
• Early NL2SQL datasets focused on specific domains, often featuring
simple SQL query structures
• Recent single-domain datasets feature more complex databases and SQL
queries specific to particular scenarios

An example in BULL dataset


Zhang C, Mao Y, Fan Y, et al. FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis[J]. arXiv preprint arXiv:2401.10506, 2024.
129
Cross-domain Dataset
Motivation:
• Cross-domain datasets challenge
the generalization capabilities of
NL2SQL systems across
different domains, requiring
these systems to generalize not
only to new SQL queries but also
to new databases.

Database domain distribution in BIRD dataset

Li J, Hui B, Qu G, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls[J]. Advances in Neural Information Processing Systems, 2024, 36.
130
Multi-turn Dataset

Motivation:
• In a real-world setting, users tend to
ask a sequence of thematically related
questions to learn about a particular
topic or to achieve a complex goal.

An example in SParC dataset


Yu T, Zhang R, Yasunaga M, et al. Sparc: Cross-domain semantic parsing in context[J]. arXiv preprint arXiv:1906.02285, 2019.
131
NL2SQL Datasets: Considering Robustness
Motivation:
• In real-world applications,
NL2SQL systems need to
handle diverse user groups
and various database
domains, making robustness a
growing focus within the
community.

An example in Dr.Spider dataset

Chang S, Wang J, Dong M, et al. Dr. Spider: A diagnostic evaluation benchmark towards text-to-sql robustness[J]. arXiv preprint arXiv:2301.08881, 2023.
132
NL2SQL Datasets: Considering SQL Efficiency
Motivation:
• Databases in real-world
scenarios often contain massive
amounts of data, and a user
question can be solved by
multiple SQL queries.
• These SQL queries can vary in
execution efficiency, which has
attracted attention from the
community.
Database domain distribution
with size in BIRD dataset
Li J, Hui B, Qu G, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls[J]. Advances in Neural Information Processing Systems, 2024, 36.
133
Knowledge-Augmented NL2SQL Datasets

Motivation:
Ø NL2SQL systems often
require domain-specific
knowledge to effectively
perform the NL2SQL task
in real-world applications
within specific domains.

An example in KnowSQL

Dou L, Gao Y, Liu X, et al. Towards knowledge-intensive text-to-SQL semantic parsing with formulaic knowledge[J]. arXiv preprint arXiv:2301.01067, 2023.
134
NL2SQL Datasets: Considering NL Ambiguity
Motivation:
Ø In real-world NL2SQL tasks, various ambiguities often arise, such as
semantic ambiguities in NL and overlapping database schema.

An example in AmbiQT
Bhaskar A, Tomar T, Sathe A, et al. Benchmarking and improving text-to-sql generation under ambiguity[J]. arXiv preprint arXiv:2310.13659, 2023.
135
Synthetic NL2SQL Datasets
Motivation:
• Although many NL2SQL datasets have been manually annotated, the rapid development of LLMs in the NL2SQL field has created a growing demand for training data, motivating automatic data synthesis.

End-to-end architecture
for automatic training
data generation in
ScienceBenchmark

Zhang Y, Deriu J, Katsogiannis-Meimarakis G, et al. Sciencebenchmark: A complex real-world benchmark for evaluating natural language to sql systems. VLDB 2024.
136
The Statistics of NL2SQL Benchmarks

Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, and Nan Tang. 2024. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? arXiv:2408.05109 [cs.DB] https://arxiv.org/abs/2408.05109
137
NL2SQL Benchmark Discussion & Insights
• From the Redundancy Measure perspective
• We observe a trend from early datasets to recent ones where datasets have grown in
size, including increases in the number of questions and unique queries
• From the Database Complexity perspective
• The number of databases (and tables) in datasets correlates with the tasks (e.g.,
Single-domain vs. Robustness) they serve
• From the Query Complexity perspective
• Recent datasets show a growing emphasis on Scalar Functions and Mathematical
Computations in SQL queries, which introduces challenges in SQL generation
structure not seen in earlier datasets

138
The Next For NL2SQL Benchmark
• Despite the growing number of datasets proposed by the NL2SQL
community, we find that current datasets still exhibit a gap in SQL
complexity compared to real-world scenarios
• Recent datasets generally feature lower counts of the SELECT keyword, indicating fewer nested SQL queries or complex set operations
• Challenges related to Scalar Functions (e.g., ROUND) and
Mathematical Computations (e.g., Addition) also need further attention

139
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Evaluation Metrics
• Evaluation Toolkit
• Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

140
NL2SQL Evaluation Metrics
• Accuracy
• Execution Accuracy (EX)
• String-Match Accuracy (SM)
• Component-Match Accuracy (CM)
• Exact-Match Accuracy (EM)
• Efficiency
• Valid Efficiency Score (VES)
• Robustness
• Query Variance Testing (QVT)

141
NL2SQL Evaluation Metrics
• Accuracy
• Execution Accuracy (EX)

	EX = (1/N) · Σ_{i=1}^{N} 1(V_i = V̂_i)

• N : the size of the dataset
• V_i : the execution result set of the i-th ground-truth SQL query
• V̂_i : the execution result set of the i-th predicted SQL query
• 1(•) : an indicator function that equals 1 if the condition inside is satisfied, and 0 otherwise

Yu T, Zhang R, Yang K, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task[J]. arXiv preprint arXiv:1809.08887, 2018.
142
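In practice, EX can be computed by executing both queries against the database and comparing result sets, as in this simplified sketch (order-insensitive comparison; official benchmark scripts handle additional corner cases):

```python
import sqlite3

def result_set(db_path, sql):
    """Execute a query and return its rows in a canonical (order-insensitive) form."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()
    return sorted(rows, key=repr)

def execution_accuracy(examples):
    """examples: list of (db_path, gold_sql, predicted_sql) triples."""
    correct = 0
    for db_path, gold, pred in examples:
        try:
            if result_set(db_path, pred) == result_set(db_path, gold):
                correct += 1
        except sqlite3.Error:
            pass  # a prediction that fails to execute counts as incorrect
    return correct / len(examples)
```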
NL2SQL Evaluation Metrics
• Accuracy
• String-Match Accuracy (SM)

	SM = (1/N) · Σ_{i=1}^{N} 1(Y_i = Ŷ_i)

• N : the size of the dataset
• Y_i : the i-th ground-truth SQL query
• Ŷ_i : the i-th predicted SQL query
• 1(•) : an indicator function that equals 1 if the condition inside is satisfied, and 0 otherwise

Zhong V, Xiong C, Socher R. Seq2sql: Generating structured queries from natural language using reinforcement learning[J]. arXiv preprint arXiv:1709.00103, 2017.
143
NL2SQL Evaluation Metrics
• Accuracy
• Component-Match Accuracy (CM)

	CM^C = (1/N) · Σ_{i=1}^{N} 1(Y_i^C = Ŷ_i^C)

• SQL components C: SELECT, WHERE, GROUP, ...
• N : the size of the dataset
• Y_i^C : the component C of the i-th ground-truth SQL query Y_i
• Ŷ_i^C : the component C of the i-th predicted SQL query Ŷ_i
• 1(•) : an indicator function that equals 1 if the condition inside is satisfied, and 0 otherwise

Yu T, Zhang R, Yang K, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task[J]. arXiv preprint arXiv:1809.08887, 2018.
144
NL2SQL Evaluation Metrics
• Accuracy
• Exact-Match Accuracy (EM)

	EM = (1/N) · Σ_{i=1}^{N} ∏_{k} 1(Y_i^{C_k} = Ŷ_i^{C_k})

• N : the size of the dataset
• C_k : the k-th component of an SQL query (e.g., WHERE)
• Y_i^{C_k} : the k-th component of the i-th ground-truth SQL query
• Ŷ_i^{C_k} : the k-th component of the i-th predicted SQL query
• 1(•) : an indicator function that equals 1 if the condition inside is satisfied, and 0 otherwise

Yu T, Zhang R, Yang K, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task[J]. arXiv preprint arXiv:1809.08887, 2018.
145
NL2SQL Evaluation Metrics
• Efficiency
• Valid Efficiency Score (VES)

	VES = (1/N) · Σ_{i=1}^{N} 1(V_i = V̂_i) · √(E(Y_i) / E(Ŷ_i))

• N : the size of the dataset
• V_i : the execution result set of the i-th ground-truth SQL query Y_i
• V̂_i : the execution result set of the i-th predicted SQL query Ŷ_i
• E(•) : measures the efficiency of a specific SQL query, e.g., its execution time

Li J, Hui B, Qu G, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls[J]. Advances in Neural Information Processing Systems, 2024, 36.
146
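A simplified sketch of VES, assuming E(·) is measured as single-run wall-clock execution time (the BIRD benchmark repeats measurements and filters outliers, so treat this only as an illustration of the formula):

```python
import sqlite3
import time
from math import sqrt

def timed_execute(db_path, sql):
    """Return (rows sorted for comparison, elapsed seconds)."""
    with sqlite3.connect(db_path) as conn:
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
    return sorted(rows, key=repr), elapsed

def valid_efficiency_score(examples):
    """examples: list of (db_path, gold_sql, predicted_sql) triples."""
    total = 0.0
    for db_path, gold, pred in examples:
        try:
            gold_rows, gold_time = timed_execute(db_path, gold)
            pred_rows, pred_time = timed_execute(db_path, pred)
        except sqlite3.Error:
            continue  # invalid predictions contribute 0
        if gold_rows == pred_rows:
            # Relative efficiency: predictions faster than the gold query score above 1.
            total += sqrt(gold_time / max(pred_time, 1e-9))
    return total / len(examples)
```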
NL2SQL Evaluation Metrics
• Robustness
• Query Variance Testing (QVT)

	QVT = (1/N) · Σ_{i=1}^{N} [ (1/m_i) · Σ_{j=1}^{m_i} 1(Ŷ_{ij} = Y_i) ]

• N : the size of the dataset
• Ŷ_{ij} : the predicted SQL query for the j-th NL variation of Y_i
• m_i : the number of different NL variations for the SQL query Y_i

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?[J]. arXiv preprint arXiv:2406.01265, 2024.
147
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Evaluation Metrics
• Evaluation Toolkit
• Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

148
MT-TEQL
• MT-TEQL
• NL Utterance Variations
• Database Schema Variations

Categorizations of frequently-used prefixes in typical NLIDB utterances

Illustrative examples of schema-oriented MRs

Ma P, Wang S. Mt-teql: evaluating and augmenting neural nlidb on real-world linguistic and schema variations[J]. Proceedings of the VLDB Endowment, 2021, 15(3): 569-582.
149
NL2SQL360

• NL2SQL360
• SQL Complexity
• SQL Characteristics (e.g., JOIN)
• Database Domain Adaption
• NL Query Variance Testing
• Economy
• Efficiency
• Features
• Multi-angle & Fine-grained Evaluation
• Scenario-based Evaluation

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024.
150
NL2SQL360: User Interface

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024. 151
NL2SQL360: User Interface

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024. 152
NL2SQL360 Case Study
• Dataset: Spider • Metric:
• Model: • Execution Accuracy (EX)
• LLM-based methods: • Exact-Match Accuracy (EM)
• DAILSQL, DAILSQL(SC) • Valid Efficiency Score (VES)
• DINSQL • Query Variance Testing (QVT)
• C3SQL
• SFT CodeS-1B/3B/7B/15B • Multi-Angle:
• PLM-based methods: • SQL Characteristics
• RESDSQL-3B, RESDSQL-3B + NatSQL • SQL Complexity
• Graphix + PICARD • NL Variance
• Database Domain Adaption
• Efficiency
• SQL Efficiency
• Economy (Cost)

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024. 153
NL2SQL360: Leaderboard

Leaderboard Link: https://nl2sql360.github.io/

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024. 154
NL2SQL360 Case Study – JOIN Finding
We find that:
DINSQL performs best among prompt-based LLM methods.
RESDSQL + NatSQL performs best among PLM-based methods.

Commonality:
Both methods use NatSQL as the intermediate
representation (IR) of SQL.

Finding:
Adopting NatSQL as the IR reduces the complexity of predicting JOIN operations and potentially enhances model performance.

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024. 155
NL2SQL360 Case Study – Subquery Finding

(P): Prompt-based
(FT): Fine-tuning-based

In scenarios involving subqueries, LLM-based methods outperform PLM-based methods overall, with methods using GPT-4 showing particularly better performance.

The inherent reasoning ability of these models is likely crucial for success in predicting the subqueries.

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024. 156
NL2SQL360 Case Study – QVT
• To further evaluate the robustness and flexibility of NL2SQL solutions in handling variations
in natural language queries, we propose a new metric called Query Variance Testing (QVT).
This metric assesses how well the models can adapt to different forms of NL queries.

Query Variance Testing (QVT): Given a SQL query Qi, there typically exist multiple corresponding NL
queries, denoted as pairs {(N1, Qi), (N2, Qi), . . . , (Nm, Qi)}. The formula for computing QVT accuracy is
defined as follows:

	QVT = (1/M) · Σ_{i=1}^{M} [ (1/m_i) · Σ_{j=1}^{m_i} 1(F(N_ij) ≡ Q_i) ]

• M is the total number of SQL queries in the test set.
• m_i is the number of natural language query variations corresponding to the SQL query Q_i.
• F(N_ij) represents the SQL query generated by the model for the j-th natural language query variation of Q_i.

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024. 157
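A minimal sketch of computing QVT, assuming the NL2SQL system is wrapped in a callable `model(nl)` and correctness is decided by a user-supplied `is_correct` predicate (for example, execution-result comparison as in EX); both names are placeholders, not the NL2SQL360 API.

```python
def query_variance_testing(model, variation_groups, is_correct):
    """Compute QVT.

    variation_groups: list of (gold_sql, [nl_variation_1, ..., nl_variation_m]) pairs.
    model(nl):        returns the predicted SQL for one NL question (hypothetical wrapper).
    is_correct(pred_sql, gold_sql): equivalence check, e.g. by comparing execution results.
    """
    per_query_scores = []
    for gold_sql, variations in variation_groups:
        hits = sum(is_correct(model(nl), gold_sql) for nl in variations)
        per_query_scores.append(hits / len(variations))
    return sum(per_query_scores) / len(per_query_scores)
```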
NL2SQL360 Case Study – QVT

• Fine-tuned LLMs generally exhibit higher QVT than prompting LLMs.


• Notably, although the Graphix+PICARD method underperforms in overall EX compared to all prompt-based
LLMs, it surpasses them in QVT.
• Fine-tuning the model with task-specific datasets may help stabilize its performance against NL variations.

Li B, Luo Y, Chai C, et al. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024. 158
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Evaluation Metrics
• Evaluation Toolkit
• Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

159
Error Analysis
Motivation:
• Error Analysis provides valuable insights into the limitations and
challenges faced by current models.

Goal:
• By systematically examining errors, researchers can identify specific
areas for improvement, enhance model robustness, and develop more
effective training strategies.

160
Existing Taxonomies for NL2SQL Errors

• Schema Linking
• Misunderstanding Database
• Misunderstanding Knowledge
• Reasoning
• Syntax-Related Errors
• Database related Errors

SQL-PaLM

Ruoxi Sun, Sercan Ö. Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, Tomas Pfister: SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL. CoRR abs/2306.00739 (2023) 161
Existing Taxonomies for NL2SQL Errors

• Schema Linking
• JOIN
• GROUP BY
• Queries with Nesting and
Set Operations
• Invalid SQL
• Miscellaneous
DIN-SQL
Mohammadreza Pourreza, Davood Rafiei: DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. NeurIPS 2023
162
Existing Taxonomies for NL2SQL Errors
Limitations:
• Designed for specific datasets
• Limited scalability
• Due to inconsistent taxonomies, the annotated data cannot be shared
or compared between different studies

Therefore, it is imperative to develop a standardized and effective


taxonomy for analyzing NL2SQL Errors

163
Principles for NL2SQL Errors Taxonomy
• Comprehensiveness: The taxonomy should encompass all potential errors that
could occur during the NL2SQL conversion process.
• Mutual Exclusivity: Each error type should be clearly distinct with no overlap,
to avoid ambiguity in error classification.
• Extensibility: The taxonomy should be adaptable to incorporate new error
types as NL2SQL technologies and methodologies evolve.
• Practicality: The taxonomy should be practical and applicable in real-world
settings, aiding developers in diagnosing and correcting errors effectively

164
Our Two-level Taxonomy
• Error Localization: Focus on identifying the specific parts of the SQL where errors occur

• Cause of Error: Focus on understanding why the model is wrong when generating SQL.

165
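As a toy illustration of the first level (error localization), the sketch below naively splits flat SQL strings into top-level clauses and reports which clauses differ between the prediction and the gold query; real error analyses use a proper SQL parser and also annotate the cause of each error.

```python
import re

CLAUSE_KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]

def clauses(sql):
    """Naively split a flat SQL string into its top-level clauses (nesting is not handled)."""
    text = re.sub(r"\s+", " ", sql.strip().rstrip(";")).upper()
    spans = sorted((m.start(), kw) for kw in CLAUSE_KEYWORDS
                   for m in re.finditer(re.escape(kw), text))
    parts = {}
    for (start, kw), (next_start, _) in zip(spans, spans[1:] + [(len(text), "")]):
        parts[kw] = text[start + len(kw):next_start].strip()
    return parts

def localize_errors(pred_sql, gold_sql):
    """Return the clause names where the prediction deviates from the gold SQL."""
    pred, gold = clauses(pred_sql), clauses(gold_sql)
    return sorted(c for c in set(pred) | set(gold) if pred.get(c) != gold.get(c))

# localize_errors("SELECT name FROM singer WHERE age > 40",
#                 "SELECT name FROM singer WHERE age >= 40")   # -> ['WHERE']
```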
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• A Data-Driven Roadmap of Optimizing LLMs for NL2SQL
• Decision Flow of Selecting NL2SQL Modules
• Open Problems

166
A Roadmap of Optimizing LLMs for NL2SQL

Condition 1: Data Privacy.

Condition 2: Training Data Volume.

Note that hardware resources and API costs also need to be considered.

167
A Roadmap of Optimizing LLMs for NL2SQL
A Case Study
CodeS builds open-source NL2SQL LLMs to address the data privacy risks of relying on GPT-4

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen: CodeS: Towards Building Open-source Language Models for Text-to-SQL. Proc. ACM Manag. Data 2(3): 127 (2024)
168
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• A Data-Driven Roadmap of Optimizing LLMs for NL2SQL
• Decision Flow of Selecting NL2SQL Modules
• Open Problems

169
Decision Flow of Selecting Modules

170
Decision Flow of Selecting Modules
• Scenario 1:
• Complex Database Schema with
Numerous Tables and Columns.
• Recommendation:
• Schema Linking.
• Advantages:
• Reduced token costs.
• Decreased noise from the
redundant database schema.
• Disadvantages:
• Increased time costs.
171
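A minimal sketch of the schema linking step, using naive token overlap between the question and table/column names as the relevance signal (production systems typically rely on embeddings, cross-encoders, or fine-tuned classifiers):

```python
def schema_linking(question, schema, top_k=3):
    """Keep only the tables (with their columns) most relevant to the question.

    schema: dict mapping table name -> list of column names.
    Relevance is naive token overlap here; real linkers use embeddings or classifiers.
    """
    q_tokens = set(question.lower().replace("?", " ").split())

    def score(table, columns):
        names = {table.lower()} | {c.lower() for c in columns}
        words = {w for name in names for w in name.split("_")}   # split snake_case names
        return len(q_tokens & (names | words))

    ranked = sorted(schema.items(), key=lambda kv: score(*kv), reverse=True)
    return dict(ranked[:top_k])

# Example:
# schema = {"singer": ["singer_id", "name", "age"],
#           "concert": ["concert_id", "theme", "year"],
#           "stadium": ["stadium_id", "location", "capacity"]}
# schema_linking("Show the name and age of each singer", schema, top_k=1)
# -> {"singer": ["singer_id", "name", "age"]}
```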
Decision Flow of Selecting Modules
• Scenario 2:
• Mismatch between Database Content and NL.
• Recommendation:
• Database Content Retrieval.
• Advantages:
• Increased accuracy of SQL value selection.
• Disadvantages:
• Increased time costs.
• Increased token costs.

172
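A minimal sketch of database content retrieval under simple assumptions (a full scan of one text column with fuzzy matching against question n-grams); real systems usually pre-build value indexes such as LSH or BM25 so that this lookup stays cheap on large databases.

```python
import sqlite3
from difflib import SequenceMatcher

def retrieve_matching_values(db_path, table, column, question, threshold=0.8):
    """Find stored cell values that closely match a word or bigram of the question."""
    tokens = question.lower().replace("?", " ").split()
    phrases = tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]  # unigrams + bigrams
    with sqlite3.connect(db_path) as conn:
        values = [row[0] for row in
                  conn.execute(f'SELECT DISTINCT "{column}" FROM "{table}"')]
    matches = []
    for value in values:
        if not isinstance(value, str):
            continue
        similarity = max(SequenceMatcher(None, value.lower(), p).ratio() for p in phrases)
        if similarity >= threshold:
            matches.append(value)
    return matches

# If the question mentions "Los Angles" (a typo), this can surface the stored value
# "Los Angeles", so the generated WHERE clause uses the literal actually in the database.
```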
Outline
• NL2SQL Problem and Background
• Language Model-Powered NL2SQL Solutions
• NL2SQL Benchmarks
• Evaluation and Error Analysis
• Practical Guidance for Developing NL2SQL Solutions
• Open Problems

173
Open Problem- Open NL2SQL Problem
• Problem & Goal:
• Open NL2SQL Problem: Querying Multiple Databases.
• Motivation:
• In many real-world scenarios, such as government open data
platforms, answering complex questions often requires querying
multiple databases and aggregating results.
• Key Idea:
• Database Retrieval
• Handling Heterogeneous Schemas
• Answer Aggregation
• Domain Adaptation
• Scalability and Efficiency
• Evaluating and Benchmarking 174
Open Problem- Cost-effective NL2SQL Methods

• Problem & Goal:


• Cost-effective NL2SQL Methods: Balancing Performance and Costs.
• Motivation:
• LLM-based NL2SQL methods show potential but face challenges due
to high token consumption, which impacts both costs and inference
times. Conversely, PLM-based methods excel at complex SQL query
processing and schema interpretation.
• Key Idea:
• Modular NL2SQL Solutions
• Multi-agent Framework

175
Open Problem- Trustworthy NL2SQL Solutions

• Problem & Goal:


• Trustworthy NL2SQL Solutions: Enhancing Accuracy and Reliability.
• Motivation:
• Ensuring the accuracy and reliability of generated SQL queries is
essential to reduce risks and minimize manual intervention.
• Key Idea:
• Interpreting NL2SQL Solutions
• Multi-agent Framework
• NL2SQL Debugging Tools

176
Open Problem- NL2SQL with Ambiguous and
Unspecified NL Queries
• Problem & Goal:
• NL2SQL with Ambiguous and Unspecified NL Queries: Addressing
Vague Inputs.
• Motivation:
• Ambiguous or unspecified queries are common in real-world
scenarios and pose challenges for NL2SQL systems.
• Key Idea:
• NL Query Rewriter.
• NL Query Auto-completion.
• Training with Ambiguous Queries.

177
Open Problem- Adaptive Training Data Synthesis

• Problem:
• Adaptive Training Data Synthesis: Improving Model Generalization.
• Motivation:
• Learning-based NL2SQL methods often struggle with new domains
and require high-quality, diverse training data for effective
performance.
• Key Idea:
• Adaptive Data Synthesis.
• Incorporating Feedback.

178
1. Make NL2SQL Methods Trustworthy
• NL2SQL Debugging
• NL Rewriter
2. Modularized NL2SQL Solutions

3. Adaptive Data Synthesis
4. Human and AI Teaming Evaluation


Thanks!
Yuyu LUO (骆昱宇)
Assistant Professor, HKUST (Guangzhou)/HKUST
yuyuluo@hkust-gz.edu.cn
https://arxiv.org/abs/2408.05109
https://github.com/HKUSTDial/NL2SQL_Handbook
