Speech Language Models that Can Listen and Speak
Speech LLM
LLM
Content, speaker, emotion, environment
Examples: ChatGPT Voice Mode, Gemini Live
More examples:
• Moshi
• https://arxiv.org/abs/2410.00037
• GLM-4-Voice
• https://arxiv.org/abs/2412.02612
• Step-Audio
• https://arxiv.org/abs/2502.11946
• Qwen2.5-Omni
• https://arxiv.org/abs/2503.20215
• Kimi-Audio
• https://arxiv.org/abs/2504.18425
• SpeechGPT
• https://github.com/OpenMOSS/SpeechGPT-2.0-preview
• Sesame
• https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
• ……
We have talked about speech input; this lecture
will focus on speech generation.
https://youtu.be/Z6b5-77EfGk?si=st0d4IukGWAc__F2
“Text Token”: a Text LLM maps the input “how are you” to the output “I am good”.
“Speech Token”: a Speech LLM does the same over speech, with Tokenization on the input side and Detokenization on the output side.
How to Train Speech LLM
1. Pre-training: unlabeled speech data → next “speech token” prediction → pre-trained Speech LLM (see the sketch below)
2. SFT: human-annotated data → supervised fine-tuning → Speech LLM
3. RLHF: preference data → reinforcement learning from human feedback → Speech LLM
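To make the first stage concrete, here is a minimal, hypothetical PyTorch sketch of next-“speech token” prediction pre-training. The model size, codebook size, and random batch are placeholder assumptions, not the recipe of any of the systems listed above.

```python
import torch
import torch.nn as nn

VOCAB = 1024      # size of the discrete speech-token codebook (assumed)
D, LAYERS = 512, 6

class TinySpeechLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=LAYERS)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens):                      # tokens: (batch, time) int64
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.body(self.emb(tokens), mask=causal)
        return self.head(h)                         # (batch, time, VOCAB)

lm = TinySpeechLM()
opt = torch.optim.AdamW(lm.parameters(), lr=3e-4)

# stand-in for one batch of tokenized, unlabeled speech
batch = torch.randint(0, VOCAB, (8, 256))

logits = lm(batch[:, :-1])                          # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1)
)
loss.backward()
opt.step()
```

SFT and RLHF then reuse the same model with labeled conversations and preference data, respectively.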
What is the basic unit of speech generation?
(Speech Token)
What is a “token” in the context of speech?
• Text: “I want to learn generative AI” → token sequence (https://platform.openai.com/tokenizer)
• Speech: waveform → token sequence = ???
Synthesized with gpt-4o-mini-tts
One option: use text itself as the token, with ASR as the tokenizer and TTS as the detokenizer around a Text LLM, instead of a Speech LLM with its own Tokenization/Detokenization.
However, text drops paralinguistic information. Example (same words, different tone): “你實在是真的好棒喔” (“You are really amazing”)
• said sincerely → “謝謝你的誇獎” (“Thank you for the compliment”)
• said sarcastically → “怎麼了……” (“What's wrong? …”)
If raw waveform samples were used as tokens, there would be at least 8,000 tokens per second on both the input and output sides of the Speech LLM, so Tokenization and Detokenization are needed to define a more compact “token”.
Various Types of Speech Tokenizers Haibin Wu
Source of image: https://www.linkedin.com/in/haibin-wu-479a39252/recent-activity/all/
Overview papers about speech tokenization
https://arxiv.org/abs/2402.13236
https://arxiv.org/abs/2502.06490
What is the best choice of tokens?
• Codec-SUPERB https://codecsuperb.github.io/ evaluates the tokenizer → tokens (3 77 23) → de-tokenizer loop: reconstruction quality and various tasks.
• DASB https://poonehmousavi.github.io/DASB-website/ evaluates the tokens (3 77 23) themselves on various downstream tasks.
Learn more from the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge https://www.wavlab.org/activities/2024/Interspeech2024-Discrete-Speech-Unit-Challenge/
A possible pipeline of speech tokenization
Speech SSL Model (one representation for every 0.02s frame)
https://arxiv.org/abs/2205.10643
https://www.youtube.com/watch?v=lMIN1iKYNmA
Quantization: K-means or VQ-layer
3 2 2 2 77 3 3 2
Deduplicate →
3 2 77 3 2
BPE (Byte Pair Encoding): merge the frequent pair “3 2” into a new token “5” →
5 77 5
https://arxiv.org/abs/2310.14580
http://arxiv.org/abs/2205.01086
https://ieeexplore.ieee.org/abstract/document/10096788
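A small Python sketch of the deduplicate-then-BPE steps on the slide's example sequence; the k-means/VQ quantization is assumed to have already produced the unit ids, and the new id “5” is just the slide's illustrative choice.

```python
# Deduplicate repeated units, then apply one BPE-style merge.
from itertools import groupby

units = [3, 2, 2, 2, 77, 3, 3, 2]          # frame-level unit ids from quantization

# 1) Deduplicate consecutive repeats  ->  [3, 2, 77, 3, 2]
dedup = [u for u, _ in groupby(units)]

# 2) One BPE merge: the frequent pair (3, 2) becomes the new unit 5 (illustrative id)
def merge_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

bpe = merge_pair(dedup, (3, 2), 5)          # -> [5, 77, 5]
print(dedup, bpe)
```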
A possible pipeline of speech tokenization
Tokenization (Speech SSL Model + quantization): 3 2 2 2 77 3 3 2 → deduplicate → 3 2 77 3 2 → BPE → 5 77 5
Detokenization: a separate model has to map the token sequence back to speech. But what model? (?????)
Another possible pipeline of speech tokenization: Neural Speech Codec
The tokenizer (compression) and detokenizer (decompression) are learned jointly: speech → Codec Tokenizer → 3 77 23 4 → Codec Detokenizer → speech.
AudioLM https://arxiv.org/abs/2209.03143
Various Types of Speech Tokenizers
Two Types of Tokens
SSL Tokenizer → “Semantic Token”; Neural Codec Tokenizer → “Acoustic Token”
• “Semantic” does not refer to its usual meaning in linguistics. Instead, “semantic tokens” are closer to content information (usually containing phonetic information).
• The distinction between the two types can be vague: “semantic tokens” also include acoustic information, and vice versa.
Various Types of Speech Tokenizers
RVQ (Residual Vector Quantization) https://arxiv.org/abs/2210.13438
Some tokenizers use RVQ to combine both kinds of information: the first codebook is closer to a “semantic token” and the later ones to “acoustic tokens”.
• SpeechTokenizer https://arxiv.org/abs/2308.16692
• Mimi (used in Moshi) https://arxiv.org/abs/2410.00037
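A minimal numpy sketch of the RVQ idea, with random toy codebooks rather than the trained codebooks of EnCodec, SpeechTokenizer, or Mimi: each stage quantizes the residual left by the previous stage, so stage 1 is the coarsest codebook and later stages add finer acoustic detail.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, STAGES = 8, 16, 4                      # feature dim, codebook size, number of codebooks
codebooks = rng.normal(size=(STAGES, K, D))  # random toy codebooks: (stage, entry, dim)

def rvq_encode(x):
    """x: (T, D) frame features -> (T, STAGES) integer codes."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = dists.argmin(1)                 # nearest codeword per frame
        codes.append(idx)
        residual = residual - cb[idx]         # pass what is left to the next stage
    return np.stack(codes, axis=1)

def rvq_decode(codes):
    return sum(cb[codes[:, s]] for s, cb in enumerate(codebooks))

frames = rng.normal(size=(5, D))
codes = rvq_encode(frames)                    # 5 frames x 4 codebooks of ids
print(codes.shape, np.abs(frames - rvq_decode(codes)).mean())
```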
Various Types of Speech Tokenizers
“Semantic Token” (SSL tokenizer) or “Acoustic Token” (neural codec): which one to choose?
Choosing is for rookies, I want it all!
Choice of Decoding Strategies
Each time step has several types of tokens: coarse tokens (closer to content), fine-grained tokens, and finer tokens (closer to acoustics). Assumption: all token streams are of equal length, for simplicity.
One option: an LLM first generates the coarse token stream for the whole utterance, and the finer streams are generated afterwards.
e.g., AudioLM https://arxiv.org/abs/2209.03143, VALL-E https://arxiv.org/abs/2301.02111
Choice of Decoding Strategies
A second model (LLM 2) then generates the fine-grained and finer tokens from the coarse ones; in VALL-E, LLM 2 is a non-autoregressive language model.
e.g., AudioLM https://arxiv.org/abs/2209.03143, VALL-E https://arxiv.org/abs/2301.02111
Choice of Decoding Strategies
This coarse-then-fine strategy is challenging for streaming: the detokenizer cannot start producing audio until the later token levels have been generated.
Choice of Decoding Strategies
A streamable alternative: generate all token types frame by frame (1 1 1, 2 2 2, 3 3 3, …) so the detokenizer can run as frames complete. However, when the different types of tokens are generated sequentially, the sequence can become very lengthy:
sequence length = tokens per second × types of tokens × dialogue length
Take Moshi as an example: 12.5 Hz × 8 token types × 5 minutes (300 seconds) = 30k tokens, already a sizeable LLM context window (checked below).
https://arxiv.org/abs/2402.05755
Source of image: https://towardsdatascience.com/towards-infinite-llm-context-windows-e099225abaaf
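A quick check of the arithmetic with these Moshi-style numbers:

```python
# Back-of-the-envelope sequence length for a 5-minute dialogue.
frame_rate = 12.5          # tokens per second per stream
token_types = 8            # number of token streams per frame
dialogue_seconds = 5 * 60  # 5 minutes

seq_len = frame_rate * token_types * dialogue_seconds
print(seq_len)             # 30000.0 -> about 30k tokens
```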
Choice of Decoding Strategies
Another option: generate multiple types of tokens in one step (all types for frame 1, then all types for frame 2, …: 1 1 1, 2 2 2, 3 3 3, 4 4 4, 5 5 5).
https://arxiv.org/abs/2402.05755
Choice of Decoding Strategies
Acoustic Delay: shift the finer token streams one step behind the coarser ones, so each step emits the coarse token of the current frame together with delayed finer tokens of previous frames (1 / 2 1 / 3 2 1 / 4 3 2 / 5 4 3 / …). See the sketch below.
https://arxiv.org/abs/2306.05284
https://arxiv.org/abs/2410.00037 (Moshi)
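A tiny, purely illustrative sketch of how such a delay pattern rearranges the multi-codebook frames; the frame values match the slide and PAD is an arbitrary placeholder id.

```python
# Codebook k is shifted k steps to the right, so each decoding step emits the
# coarse token of frame t together with the finer tokens of earlier frames.
PAD = 0
frames = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 5]]  # (time, codebooks)
n_codebooks = len(frames[0])

T = len(frames) + n_codebooks - 1
delayed = [[PAD] * n_codebooks for _ in range(T)]
for t, frame in enumerate(frames):
    for k, tok in enumerate(frame):
        delayed[t + k][k] = tok

for step in delayed:
    print(step)
# step 0: [1, 0, 0]
# step 1: [2, 1, 0]
# step 2: [3, 2, 1]   <- matches the "1 / 2 1 / 3 2 1 / 4 3 2 / 5 4 3" layout on the slide
```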
Choice of Decoding Strategies
Another option: a Temporal Transformer runs along the time axis, and a small Depth Transformer generates the different token types within each step (1 1 1 → 2 2 2 → …).
https://arxiv.org/abs/2109.03264
https://arxiv.org/abs/2410.00037
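A toy sketch of this temporal/depth factorization; `temporal_lm` and `depth_lm` are hypothetical callables, not Moshi's real modules, and the loop only illustrates the control flow.

```python
# The big temporal model runs once per frame; the small depth model then emits
# the n codebook tokens inside that frame autoregressively.
import random

def generate(temporal_lm, depth_lm, n_frames, n_codebooks):
    history, output = [], []
    for t in range(n_frames):
        frame_context = temporal_lm(history)          # one temporal step over past frames
        frame_tokens = []
        for k in range(n_codebooks):                  # cheap inner loop per frame
            tok = depth_lm(frame_context, frame_tokens)
            frame_tokens.append(tok)
        history.append(frame_tokens)
        output.append(frame_tokens)
    return output

# toy stand-ins so the sketch runs end-to-end
toy_temporal = lambda history: len(history)           # fake "context"
toy_depth = lambda ctx, partial: random.randrange(1024)
print(generate(toy_temporal, toy_depth, n_frames=3, n_codebooks=8))
```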
Why discrete tokens?
A tokenizer maps the waveform to a token sequence (e.g., 3 77 23 4). How about continuous representations instead?
The discrete tokens are crucial for generation: given the same input, there can be many possible outputs.
For understanding, there is no remarkable difference between continuous representations and discrete tokens.
Why discrete tokens?
• Let's say we train a speech LM to generate continuous representations. When either of two continuations is correct, regression training drives the model toward their average, which is incorrect.
Why discrete tokens?
• How do discrete tokens solve the issue? Given the context (3 77 23 12 71), the Speech LM learns a probability distribution over the next token (e.g., 60% for token 1, 40% for token 2) and samples from the distribution during inference, so the output is always one of the valid options.
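A toy numpy illustration of this argument; the vectors, probabilities, and token ids are made up.

```python
# If two continuations a and b are equally valid, a regression model trained with MSE
# converges to their average, which is neither; a model over discrete tokens learns
# P(token) = {1: 60%, 2: 40%} and sampling always returns one of the valid options.
import numpy as np

rng = np.random.default_rng(0)
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # two equally valid continuations

mse_optimum = (a + b) / 2                            # what continuous regression learns
print("continuous prediction:", mse_optimum)         # [0.5 0.5] -> incorrect "average"

probs = {1: 0.6, 2: 0.4}                             # distribution over discrete tokens
samples = rng.choice(list(probs), size=10, p=list(probs.values()))
print("sampled discrete tokens:", samples)           # each sample is a valid option
```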
Why discrete tokens?
• A speech LM can still generate continuous representations with special designs that avoid the incorrect “average” output; solutions are borrowed from image generation.
https://arxiv.org/abs/2406.11838
https://arxiv.org/abs/2312.02116
https://arxiv.org/abs/2403.05196
MELLE
https://arxiv.org/pdf/2407.08551
Good performance in Text-to-Speech (TTS)
BreezyVoice
GitHub: https://github.com/mtkresearch/BreezyVoice
Paper: https://arxiv.org/abs/2501.17790
TTS demo: synthesizing “大家好” (“hello everyone”), including code-switched input such as “hello~ how are you? 大家好”.
The source of the real audio is from the BIIC Podcast.
Pre-trained Speech LLM: trained with next-token prediction on a large amount of unlabeled speech data.
Speech continuation example (https://arxiv.org/abs/2306.02207): prompted with “He assassinated the president”, the pre-trained speech LLM continues with “He assassinated the president and gave mister johnson the last charge of improvement in his writing possible three point eight nine.”
Does this sentence make sense? GPT-4: “… while the sentence has recognizable English words and phrases, as it is currently constructed, it doesn't coherently communicate a clear, singular idea or sequence of connected ideas. …”
Using a Text Model as the Foundation Model for the Speech Model
Why is training solely on unlabeled speech data inefficient?
At roughly 100 (text) tokens per minute of speech, 1M hours of speech data corresponds to only about 6B text tokens.
LLaMA 3 was pre-trained on 15T text tokens, which at that rate would correspond to about 285k years of speech data.
Text is a compressed version of speech.
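The arithmetic behind these numbers, using the rates assumed on the slide:

```python
# Rough token-budget comparison between speech and text pre-training data.
tokens_per_minute = 100                       # text tokens "contained" in one minute of speech
hours = 1_000_000                             # 1M hours of unlabeled speech
speech_as_text_tokens = hours * 60 * tokens_per_minute
print(f"{speech_as_text_tokens / 1e9:.0f}B text-equivalent tokens")   # ~6B

llama3_tokens = 15e12                         # LLaMA 3 pre-training budget (15T text tokens)
years_of_speech = llama3_tokens / tokens_per_minute / 60 / 24 / 365
print(f"{years_of_speech / 1e3:.0f}k years of speech")                # ~285k years
```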
Why is training solely on unlabeled speech data inefficient? https://arxiv.org/abs/2404.00685
The linguistic performance of speech LLMs scales up to three orders of magnitude more slowly than that of text LLMs.
Besides content, speech LLMs also have to learn other information (such as speaker identity, emotion, etc.) that text LLMs do not need to model.
Leveraging Text: Starting from Text LLM
• Initializing spoken QA models with text models
• GSQA https://arxiv.org/abs/2312.09781
• DUAL https://arxiv.org/abs/2203.04911
Leveraging Text: Starting from Text LLM
A Text LM (“How are you?” → “I am good.”) serves as the initialization of the pre-trained Spoken LLM, which is then trained on speech tokens (3 77 23 12 71 34 3 23).
TWIST https://arxiv.org/abs/2305.13009
Leveraging Text: Speech-Text Hybrid Generation
The Spoken LLM (initialized from a Text LM) generates both text (“how are you”) and speech tokens (3 77 23 12 71 34 3 23).
This is similar to an inner monologue, allowing the model to consider what it wants to say in text before actually expressing it in speech.
Leveraging Text: Speech-Text Hybrid Generation
• Text then speech: “how are you” followed by 3 77 23 12 71 34 3 23. This is almost TTS; drawback: it cannot be streamed. (Spectron https://arxiv.org/abs/2305.15255)
• Text then speech (token-level): how 3 77 23 are 12 71 34 you 3 23. We need alignment between text and speech during training.
Leveraging Text: Speech-Text Hybrid Generation
• Text and speech at the same time: how 3 are 77 you 23 …, generated by the Spoken LLM. Problem: text tokens and speech tokens are not on the same scale (their lengths differ significantly). Ways to keep the two streams aligned (see the sketch after this list):
• Mini-Omni https://arxiv.org/abs/2408.16725
• LLaMA-Omni https://arxiv.org/abs/2409.06666: uses a CTC loss to align the text (how are) with the speech tokens (3 77 23 12 71)
• Moshi https://arxiv.org/abs/2410.00037: a fixed number of speech tokens per text slot, which is similar to a duration model
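A small sketch of the alignment problem and of the “fixed number of speech tokens per text slot” idea; the padding symbol and block size are illustrative assumptions, not Moshi's actual special tokens.

```python
# The short text stream is padded so both streams advance in lockstep,
# one text slot per fixed block of speech tokens (what a duration model would decide).
PAD = "<pad>"
text = ["how", "are", "you"]
speech = [3, 77, 23, 12, 71, 34, 3, 23]
block = 3                                           # speech tokens per text slot (assumed)

padded_text = []
for i in range(0, len(speech), block):
    slot = i // block
    padded_text.append(text[slot] if slot < len(text) else PAD)  # PAD fills leftover slots

chunks = (speech[i:i + block] for i in range(0, len(speech), block))
for txt, chunk in zip(padded_text, chunks):
    print(txt, chunk)
# how [3, 77, 23]
# are [12, 71, 34]
# you [3, 23]
```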
Speech Tokenization that Takes Text into Account
https://arxiv.org/abs/2504.07053
Liang-Hsuan Tseng Yi-Chang Chen Kuan-Yi Lee
(NTU) (MediaTek) (NTU)
Can we have text-aligned speech representations?
Instead of producing a long speech-token sequence (3 77 23 12 71 34 3 23) that needs complex alignment with the text, the tokenizer outputs the same number of speech tokens as text tokens (how are you).
These tokens need not include content information; they focus on information beyond the content.
A pre-trained speech encoder (e.g., Whisper Encoder) provides features from different layers, where each frame corresponds to a fixed period of time.
TASTE (Text-Aligned Speech Tokenization and Embedding)
Aggregator (several attention layers): the ASR transcript tokens (“how are you”) serve as queries, and the outputs of the pre-trained speech encoder (e.g., Whisper Encoder) serve as keys and values, so each text token gets one speech token describing how to pronounce it.
Detokenizer: uses the network architecture of the TTS model (CosyVoice) to reconstruct the speech from the text tokens and the text-aligned speech tokens.
Tokenization and detokenization are trained by minimizing the reconstruction error. A minimal sketch of the aggregator idea follows.
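To make the aggregator idea concrete, here is a minimal PyTorch sketch with random stand-in features; the dimensions and the single attention layer are illustrative assumptions, not the released TASTE implementation.

```python
# Text tokens act as queries, speech-encoder frames act as keys/values,
# so the output has exactly one text-aligned "speech" vector per text token.
import torch
import torch.nn as nn

D = 256
n_text_tokens, n_speech_frames = 3, 120            # "how are you" vs. ~2.4s of encoder frames

text_queries = torch.randn(1, n_text_tokens, D)    # from the ASR transcript embeddings
speech_frames = torch.randn(1, n_speech_frames, D) # from a Whisper-like speech encoder

aggregator = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
aligned, attn = aggregator(query=text_queries, key=speech_frames, value=speech_frames)

print(aligned.shape)   # torch.Size([1, 3, 256]) -> same length as the text sequence
```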
Reconstruction demo: each utterance goes through the Tokenizer and then the Detokenizer, e.g., “… but now Murdoch came around to stare at the gang …” and “Any news on the dancer's assault case?”
https://arxiv.org/abs/2504.07053
Training the Speech LLM
The Tokenizer turns “look at my eyes” into the TASTE tokens 101 56 33 162, and the Speech LLM is trained on the interleaved sequence “look 101 at 56 my 33 eyes 162” (predicting, e.g., “at 56 my 33 eyes 162” from “look 101 at 56 my 33”).
llama-3.2-1B, a text LLM, is used as the initialization.
Emilia is used as the training dataset (the English subset is about 40,000 hours).
Speech Continuation Demonstration
• Prompt: “I'll take the armchair” → Pre-trained speech LLM: “in the corner and just sit there. It'll feel better. I'll try to not sleep.”
• Prompt: “I reserve your services” → “for a long time. I am very happy with the result. I will definitely recommend you to my friends.”
• Prompt: “It's hot and loud and so many people” (Raj in TBBT) → “I don't know what to do ……”
• Prompt: “look at my eyes” → “I am not the one who is wrong. I am not the one who is wrong. (noise) I am not the one who is wrong. I am not the one who is wrong.”
Source of video: https://www.youtube.com/watch?v=Dc7gc7BECk0
• Prompt: “露比醬~” (“Ruby-chan~”) → Pre-trained speech LLM: “嗨!” (“Hi!”)
• Prompt: “歩夢醬~” (“Ayumu-chan~”) → “嗨!” (“Hi!”)
• Prompt: “四季醬~” (“Shiki-chan~”) → “嗨!” (“Hi!”)
We have discussed many ways to generate speech; so how do we train the model?
How to Train Speech LLM
1. Pre-training: unlabeled speech data → next “speech token” prediction → pre-trained Speech LLM
2. SFT: human-annotated data → supervised fine-tuning → Speech LLM
   • One way to obtain SFT data: a Text LLM generates text conversations, and TTS converts them into speech conversations.
3. RLHF: preference data → reinforcement learning from human feedback → Speech LLM
Alignment with Feedback (Speech LLM)
Some related work improves audio quality:
https://arxiv.org/abs/2404.05600
https://arxiv.org/abs/2407.02243
https://arxiv.org/abs/2404.09956
https://arxiv.org/abs/2402.00744
Some related work improves audio understanding:
https://arxiv.org/abs/2406.00654
https://arxiv.org/abs/2503.11197
https://arxiv.org/abs/2504.15900
https://arxiv.org/abs/2505.09439
Alignment with Feedback
Guan-Ting Lin (with researchers from the Amazon GAI team)
https://arxiv.org/abs/2411.01834
Beyond the Turn-based Game
• Text conversation is turn-based: User 1 and User 2 take turns.
• Speech conversation is full-duplex: Speaker 1 and Speaker 2 can overlap.
How can we enable spoken LLMs to interact with interlocutors in a full-duplex way? (See the sketch below.)
• Dialogue GSLM https://arxiv.org/abs/2203.16502
• https://arxiv.org/abs/2407.01911
• Moshi https://arxiv.org/abs/2410.00037
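A toy sketch of the full-duplex idea under these assumptions: the user's audio and the model's audio are two parallel token streams at the same frame rate, so listening and speaking overlap instead of alternating turns. `spoken_lm` is a hypothetical per-frame callable, not Moshi's actual interface.

```python
# At every frame the model reads the newest user frame and emits its own frame,
# which may be a "silence" token while it is only listening.
def full_duplex_loop(spoken_lm, user_frames):
    model_frames, history = [], []
    for user_tok in user_frames:                 # one iteration per audio frame
        history.append(("user", user_tok))
        model_tok = spoken_lm(history)
        history.append(("model", model_tok))
        model_frames.append(model_tok)
    return model_frames

# toy stand-in so the sketch runs: stay silent (0), then start talking (42)
toy_lm = lambda hist: 0 if len(hist) < 10 else 42
print(full_duplex_loop(toy_lm, user_frames=list(range(8))))
```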
Evaluation
Chih-Kai Yang
https://arxiv.org/abs/2505.15957
To Learn More
For a paper list: https://github.com/ga642381/speech-trident
https://arxiv.org/abs/2410.03751
https://arxiv.org/abs/2410.18908
https://arxiv.org/abs/2411.13577
https://arxiv.org/abs/2504.08528