Speech LLM

The document discusses the development and training of speech language models (speech LLMs) that can understand and generate speech. It covers speech tokenization, decoding strategies, and the importance of leveraging both text and speech data for effective model training, and it highlights the challenges and methods involved in building efficient speech generation systems.


Speech Language Models That Can Listen and Speak
(Speech LLM)

Unlike a text LLM, a speech LLM has to deal not only with content but also with speaker identity, emotion, and the acoustic environment.

Examples: ChatGPT voice mode, Gemini Live.
Examples:
• Moshi
  • https://arxiv.org/abs/2410.00037
• GLM-4-Voice
  • https://arxiv.org/abs/2412.02612
• Step-Audio
  • https://arxiv.org/abs/2502.11946
• Qwen2.5-Omni
  • https://arxiv.org/abs/2503.20215
• Kimi-Audio
  • https://arxiv.org/abs/2504.18425
• SpeechGPT
  • https://github.com/OpenMOSS/SpeechGPT-2.0-preview
• Sesame
  • https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
• ……
We have talked about speech input; this lecture
will focus on speech generation.

https://youtu.be/Z6b5-77EfGk?si=st0d4IukGWAc__F2
A text LLM reads and writes “text tokens”: e.g., input “how are you”, output “I am good”.

A speech LLM reads and writes “speech tokens”: a tokenization module turns input speech into speech tokens, and a detokenization module turns generated speech tokens back into speech.
How to Train a Speech LLM
• Pre-training: next “speech token” prediction on unlabeled speech data → pre-trained speech LLM
• SFT: supervised fine-tuning on human-annotated data → speech LLM
• RLHF: alignment with preference data → speech LLM
What is the basic unit of speech generation? (Speech Token)
What is a “token” in the context of speech?
• Text: a sentence such as “I want to learn generative AI” is split into a token sequence (see https://platform.openai.com/tokenizer).
• Speech: the input is a waveform, so what the corresponding token sequence should be is far less obvious. (Speech example synthesized with gpt-4o-mini-tts.)

One option is a cascade: ASR turns the input speech into text, a text LLM generates a text reply, and TTS turns the reply back into speech. But text alone drops information. For example, given the input “你實在是真的好棒喔” (“You're really so great”), a cascade that only sees the text replies “謝謝你的誇獎” (“Thank you for the compliment”), yet if the utterance is spoken sarcastically, the appropriate reply would be “怎麼了……” (“What's wrong…?”).

An end-to-end speech LLM instead works on speech directly, with tokenization on the input side and detokenization on the output side. It cannot simply treat raw waveform samples as tokens, since that would amount to at least 8,000 tokens per second; it needs tokenization and detokenization modules that work on far coarser speech tokens.
Various Types of Speech Tokenizers (figure by Haibin Wu)
Source of image: https://www.linkedin.com/in/haibin-wu-479a39252/recent-activity/all/


Overview papers about speech tokenization:
https://arxiv.org/abs/2402.13236
https://arxiv.org/abs/2502.06490
What is the best choice of tokens?
• Codec-SUPERB (https://codecsuperb.github.io/): passes speech through the tokenizer and de-tokenizer and evaluates the reconstruction quality as well as performance on various tasks.
• DASB (https://poonehmousavi.github.io/DASB-website/): evaluates the discrete tokens themselves on various downstream tasks.

Learn more from the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge: https://www.wavlab.org/activities/2024/Interspeech2024-Discrete-Speech-Unit-Challenge/
A possible pipeline of speech tokenization
• Feed the waveform into a speech SSL model (https://arxiv.org/abs/2205.10643); each output frame covers a fixed span of about 0.02 s (see also https://www.youtube.com/watch?v=lMIN1iKYNmA).
• Quantization (K-means or a VQ layer) turns each frame into a discrete unit, e.g., 3 2 2 2 77 3 3 2.
• Deduplication collapses consecutive repeats: 3 2 77 3 2.
• BPE (Byte Pair Encoding) merges frequent pairs, e.g., (3, 2) → 5, giving 5 77 5.
(A sketch of these steps follows below.)
References: https://arxiv.org/abs/2310.14580, http://arxiv.org/abs/2205.01086, https://ieeexplore.ieee.org/abstract/document/10096788
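A minimal sketch of these three steps, with random vectors standing in for real SSL features (e.g., HuBERT frames at one vector per 0.02 s); the cluster count and feature dimension are arbitrary choices here, and the toy unit sequence mirrors the one on the slide.

```python
# Quantize SSL-like frame features, then deduplicate and apply one BPE-style merge.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(500, 768)           # stand-in for SSL frame features
kmeans = KMeans(n_clusters=100, n_init=10).fit(frames)
units = kmeans.predict(frames).tolist()       # one discrete unit per 20 ms frame

def deduplicate(seq):
    """Collapse consecutive repeats: [3,2,2,2,77,3,3,2] -> [3,2,77,3,2]."""
    return [u for i, u in enumerate(seq) if i == 0 or u != seq[i - 1]]

def merge_pair(seq, pair, new_id):
    """One BPE-style merge: replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

toy = [3, 2, 2, 2, 77, 3, 3, 2]
dedup = deduplicate(toy)                      # [3, 2, 77, 3, 2]
merged = merge_pair(dedup, (3, 2), 5)         # [5, 77, 5]
print(dedup, merged)
```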

Detokenization for this pipeline requires training a separate model that maps the discrete units back to speech. The problem is that deduplication and BPE throw information away: from 5 77 5 one can recover 3 2 77 3 2, but how many times each unit was originally repeated (3 2 2 2 77 3 3 2) is no longer known, which makes reconstruction ambiguous.
Another possible pipeline of speech tokenization: a neural speech codec. The tokenizer (encoder) and detokenizer (decoder) are learned jointly: the tokenizer compresses the waveform into a short sequence of discrete codes (e.g., 3 77 23 4), and the detokenizer decompresses the codes back into a waveform. This is the kind of tokenizer used by AudioLM (https://arxiv.org/abs/2209.03143). A round-trip sketch follows below.
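For concreteness, here is a small round-trip sketch assuming the Hugging Face transformers interface for the EnCodec codec (EncodecModel); the checkpoint name and exact signatures are assumptions here and should be checked against the current documentation.

```python
# Round trip through a neural codec: waveform -> discrete codes -> waveform.
import torch
from transformers import EncodecModel, AutoProcessor

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

waveform = torch.randn(24000).numpy()          # 1 second of fake audio at 24 kHz
inputs = processor(raw_audio=waveform, sampling_rate=24000, return_tensors="pt")

# Tokenization: compress the waveform into discrete codebook indices.
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
print(encoded.audio_codes.shape)               # discrete codes, several codebooks per frame

# Detokenization: decompress the codes back into a waveform.
decoded = model.decode(encoded.audio_codes, encoded.audio_scales,
                       inputs["padding_mask"])[0]
print(decoded.shape)
```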

Various Types of Speech Tokenizers: Two Types of Tokens
• SSL tokenizers produce “semantic tokens”; neural codec tokenizers produce “acoustic tokens”.
• “Semantic” here does not refer to its usual meaning in linguistics. Instead, “semantic tokens” are closer to content information (usually containing phonetic information).
• The distinction between the two types can be vague: “semantic tokens” also include acoustic information, and vice versa.
Neural codec tokenizers often use RVQ (Residual Vector Quantization, https://arxiv.org/abs/2210.13438): the first codebook quantizes each frame, and every subsequent codebook quantizes the residual left by the previous one. Some tokenizers combine both worlds, distilling semantic information into the first RVQ codebook while the remaining codebooks carry acoustic detail (a toy sketch of RVQ follows below):
• SpeechTokenizer: https://arxiv.org/abs/2308.16692
• Mimi (used in Moshi): https://arxiv.org/abs/2410.00037
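A toy sketch of the RVQ idea with random codebooks (not any particular codec's implementation): each stage quantizes the residual left by the previous stage, so codebook 1 captures coarse structure and later codebooks add finer detail.

```python
# Toy residual vector quantization over a single frame vector.
import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 4, 256, 64
codebooks = [rng.standard_normal((codebook_size, dim)) for _ in range(num_stages)]

def rvq_encode(x, codebooks):
    """Return one code index per stage; `x` is a single frame vector."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = np.argmin(((cb - residual) ** 2).sum(axis=1))  # nearest codeword
        codes.append(int(idx))
        residual = residual - cb[idx]                        # quantize the leftover
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected codewords from every stage to reconstruct the frame."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

frame = rng.standard_normal(dim)
codes = rvq_encode(frame, codebooks)           # e.g. [3, 77, 23, 4]: one token per stage
recon = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(frame - recon))    # in a trained codec, each stage shrinks this error
```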
So which one to choose, semantic tokens or acoustic tokens? Choosing is for rookies, I want it all!
Choice of Decoding Strategies
Suppose each frame has several types of tokens, from coarse (content-oriented) to fine-grained and finer (acoustic detail). (Assumption: all token types have the same sequence length, for simplicity.)
One strategy is hierarchical: an LLM first generates the whole coarse-token sequence, and the finer token sequences are generated afterwards, conditioned on it.
e.g., AudioLM (https://arxiv.org/abs/2209.03143), VALL-E (https://arxiv.org/abs/2301.02111)
In this hierarchical strategy, a first LLM generates the coarse tokens and a second model (LLM 2) generates the fine-grained tokens; in VALL-E, LLM 2 is a non-autoregressive language model.
e.g., AudioLM (https://arxiv.org/abs/2209.03143), VALL-E (https://arxiv.org/abs/2301.02111)
Because the finer tokens are only produced after the full coarse sequence is finished, this strategy is challenging for streaming: the detokenizer cannot start producing audio until all token types are available.
An alternative, streamable strategy is to flatten the token types frame by frame: for each frame, generate its coarse, fine, and finer tokens in turn (1 1 1 2 2 2 3 3 3 …), so the detokenizer can start as soon as the first frames are complete (https://arxiv.org/abs/2402.05755). The drawback is that generating the different token types sequentially makes the sequence very long:

sequence length = tokens per second × types of tokens × dialogue length

Take Moshi as an example: 12.5 Hz × 8 token types × 5 minutes (300 seconds) = 30k tokens, which puts heavy pressure on the LLM's context window.
Source of image: https://towardsdatascience.com/towards-infinite-llm-context-windows-e099225abaaf


Another option is to generate multiple types of tokens in one step: at each time step the LLM predicts all the token types of that frame simultaneously (1 1 1, then 2 2 2, …), keeping the sequence length equal to the number of frames (https://arxiv.org/abs/2402.05755).
A refinement is the acoustic delay pattern (MusicGen: https://arxiv.org/abs/2306.05284; Moshi: https://arxiv.org/abs/2410.00037): the finer codebooks are shifted by one or more steps relative to the coarse one, so step t emits the coarse token of frame t together with the finer tokens of earlier frames (1 / 2 1 / 3 2 1 / 4 3 2 / 5 4 3 …). A sketch of this pattern follows below.
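A small sketch of the delay pattern itself, using the frame numbering from the slide; the padding symbol and the per-codebook delay of exactly one step are illustrative choices.

```python
# Shift codebook k by k steps so all codebooks can be emitted in parallel per step.
PAD = None  # placeholder emitted before a delayed codebook has started

def apply_delay(frames, num_codebooks):
    """frames[t][k] is the k-th codebook token of frame t (0-indexed).
    Returns a list of time steps, each holding one token per codebook."""
    T = len(frames)
    steps = []
    for t in range(T + num_codebooks - 1):
        step = []
        for k in range(num_codebooks):
            src = t - k                        # codebook k is delayed by k steps
            step.append(frames[src][k] if 0 <= src < T else PAD)
        steps.append(step)
    return steps

# Frames 1..5, each with 3 codebook tokens (written "1 1 1", "2 2 2", ... above).
frames = [[f, f, f] for f in [1, 2, 3, 4, 5]]
for step in apply_delay(frames, num_codebooks=3):
    print(step)
# First steps: [1, None, None], [2, 1, None], [3, 2, 1], [4, 3, 2], [5, 4, 3], ...
```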
When all token types of a frame are generated in one step, a common architecture splits the model in two: a Temporal Transformer runs along the frame axis, and a small Depth Transformer generates the different token types within each frame one after another (https://arxiv.org/abs/2109.03264, https://arxiv.org/abs/2410.00037). A toy sketch of this factorization follows below.
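A toy sketch of the temporal/depth factorization, with made-up sizes and a GRU standing in for the small depth model; this is illustrative only, not Moshi's or the RQ-Transformer's actual architecture.

```python
# Temporal model runs once per frame; a small depth model emits that frame's K tokens.
import torch
import torch.nn as nn

V, K, D = 256, 4, 64          # vocab per codebook, codebooks per frame, model dim

class TemporalDepthLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(V, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)   # over frames
        self.depth = nn.GRU(D, D, batch_first=True)                  # within a frame
        self.head = nn.Linear(D, V)

    def frame_embedding(self, frame_tokens):            # (B*T, K) -> (B*T, D)
        return self.tok_emb(frame_tokens).sum(dim=1)    # sum the K codebook embeddings

    @torch.no_grad()
    def generate_next_frame(self, past_frames):         # (B, T, K) -> (B, K)
        B, T, _ = past_frames.shape
        ctx = self.frame_embedding(past_frames.flatten(0, 1)).view(B, T, D)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.temporal(ctx, mask=causal)[:, -1]       # context vector for the next frame
        tokens, inp, state = [], h.unsqueeze(1), None
        for _ in range(K):                               # depth-wise, one codebook at a time
            out, state = self.depth(inp, state)
            probs = self.head(out[:, -1]).softmax(-1)
            nxt = torch.multinomial(probs, 1)            # sample this codebook's token
            tokens.append(nxt)
            inp = self.tok_emb(nxt)                      # feed it back to the depth model
        return torch.cat(tokens, dim=1)

model = TemporalDepthLM()
history = torch.randint(0, V, (1, 5, K))                 # 5 past frames, K tokens each
print(model.generate_next_frame(history))                # the next frame's K tokens
```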
Why discrete tokens?
The tokenizer turns a waveform into a discrete token sequence (e.g., 3 77 23 4), but it could just as well output a continuous representation. For understanding tasks, there is no remarkable difference between continuous representations and discrete tokens. Discrete tokens are crucial for generation, because given the same input there can be many possible outputs.
Why discrete tokens?
• Suppose we train a speech LM to regress continuous representations. When either of two continuations is correct, the regression-optimal output is their average, which is itself incorrect.
Why discrete tokens?
• How do discrete tokens solve the issue? With discrete tokens, the speech LM learns a probability distribution over the next token (e.g., 60% for token 1, 40% for token 2) and samples from that distribution during inference, so it commits to one valid output instead of blending them. A toy illustration follows below.
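A toy numerical illustration of the difference between regressing continuous targets and sampling discrete tokens; the 60%/40% numbers mirror the slide, everything else is made up.

```python
# MSE-optimal regression averages the valid outputs; sampling picks one of them.
import numpy as np

rng = np.random.default_rng(0)

# Two equally valid continuous continuations (two "modes").
mode_a, mode_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
targets = np.stack([mode_a if rng.random() < 0.5 else mode_b for _ in range(1000)])
mse_optimal = targets.mean(axis=0)
print(mse_optimal)          # ~[0.5, 0.5]: the average, which is neither valid output

# With discrete tokens the model predicts a distribution and samples from it.
vocab = {1: mode_a, 2: mode_b}
probs = {1: 0.6, 2: 0.4}    # the 60% / 40% distribution on the slide
token = rng.choice(list(probs), p=list(probs.values()))
print(token, vocab[token])  # a single, coherent choice rather than a blend
```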


That said, a speech LM can generate continuous representations if the model includes special designs that avoid the averaging problem; several such solutions come from image generation (https://arxiv.org/abs/2406.11838, https://arxiv.org/abs/2312.02116, https://arxiv.org/abs/2403.05196). MELLE (https://arxiv.org/pdf/2407.08551) takes this route and achieves good performance in Text-to-Speech (TTS).
Breezy Voice
GitHub: https://github.com/mtkresearch/BreezyVoice
Paper: https://arxiv.org/abs/2501.17790
TTS example: synthesizing “大家好” (“Hello everyone”); the reference utterance is “hello~ how are you?”. The source of the real audio is the BIIC Podcast.
Pre-trained Speech LLM: trained on a large amount of unlabeled speech data by next-speech-token prediction (https://arxiv.org/abs/2306.02207).
Speech continuation example: given the prompt “He assassinated the president”, the model continues with “… and gave mister johnson the last charge of improvement in his writing possible three point eight nine.”
Does this sentence make sense? GPT-4's judgment: “… while the sentence has recognizable English words and phrases, as it is currently constructed, it doesn't coherently communicate a clear, singular idea or sequence of connected ideas. …”
Using a Text Model as the Foundation Model for a Speech Model
Why is training solely on unlabeled speech data inefficient?
Spoken content amounts to roughly 100 text tokens per minute, so 1M hours of speech data corresponds to only about 6B text tokens. LLaMA 3 was pre-trained on 15T text tokens; matching that amount of content with speech would require about 285k years of speech data. Text is a compressed version of speech.
Why is training solely on unlabeled speech data inefficient? (https://arxiv.org/abs/2404.00685)
The linguistic performance of speech LLMs scales up three orders of magnitude more slowly than that of text LLMs. Besides content, speech LLMs also have to learn to understand other information (such as speaker identity, emotion, etc.) that text LLMs do not have to.
Leveraging Text: Starting from a Text LLM
• Initializing spoken QA models with text models: GSQA (https://arxiv.org/abs/2312.09781), DUAL (https://arxiv.org/abs/2203.04911)
Leveraging Text: Starting from a Text LLM
A text LM (which maps “How are you?” to “I am good.”) is used to initialize the pre-trained spoken LLM, which is then trained on speech token sequences (3 77 23 12 71 34 3 23 …).
TWIST: https://arxiv.org/abs/2305.13009
Leveraging Text: Speech-Text Hybrid Generation
Instead of producing speech tokens alone, the spoken LLM (initialized from a text LM) generates text (“how are you”) together with the speech tokens (3 77 23 12 71 34 3 23). This is similar to an inner monologue, allowing the model to consider what it wants to say in text before actually expressing it in speech.
Leveraging Text: Speech-Text Hybrid Generation
• Text then speech: generate the full text first (“how are you”), then the speech tokens (3 77 23 12 71 34 3 23). This is almost TTS. Example: Spectron (https://arxiv.org/abs/2305.15255). Drawback: it cannot stream.
• Text then speech (token-level): interleave each word with its speech tokens (how 3 77 23 are 12 71 34 you 3 23). This requires an alignment between text and speech during training.
Leveraging Text: Speech-Text Hybrid Generation
• Text and speech at the same time: generate a text token and its speech tokens in parallel at every step (how 3, are 77, you 23, …). The difficulty is that text tokens and speech tokens do not have the same scale: their sequence lengths differ significantly. Different systems handle the mismatch differently (see the alignment sketch below):
  • Mini-Omni (https://arxiv.org/abs/2408.16725): pairs the text stream with the speech token stream directly.
  • LLaMA-Omni (https://arxiv.org/abs/2409.06666): uses a CTC loss to align the text with the speech tokens.
  • Moshi (https://arxiv.org/abs/2410.00037): aligns the text stream to a fixed number of speech-token steps, which is similar to a duration model.
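One simple way to picture the alignment, assuming the word-to-frame alignment is given: pad the short text stream so it has one slot per speech frame. This is an illustrative sketch, not the exact scheme used by any of the systems above.

```python
# Stretch a short text stream to the speech frame rate with padding tokens.
PAD = "<pad>"

def align_text_to_frames(words, word_start_frames, num_frames):
    """Produce one text slot per speech frame, padding between word starts."""
    stream = [PAD] * num_frames
    for word, start in zip(words, word_start_frames):
        stream[start] = word
    return stream

speech_tokens = [3, 77, 23, 12, 71, 34, 3, 23]           # 8 frames of speech tokens
text_stream = align_text_to_frames(["how", "are", "you"], [0, 3, 6], len(speech_tokens))
for t, s in zip(text_stream, speech_tokens):
    print(f"{t:>6} {s}")
# how 3 / <pad> 77 / <pad> 23 / are 12 / <pad> 71 / <pad> 34 / you 3 / <pad> 23
```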


Speech Tokenization That Takes Text Into Account
https://arxiv.org/abs/2504.07053
Liang-Hsuan Tseng (NTU), Yi-Chang Chen (MediaTek), Kuan-Yi Lee (NTU)
Can we have text-aligned speech representations?
Instead of a speech-token sequence like 3 77 23 12 71 34 3 23 that needs a complex alignment with the text “how are you”, the tokenizer could output exactly one speech token per text token (the same number of tokens). Such tokens have no need to include content information; they can focus on information beyond the content.
TASTE (Text-Aligned Speech Tokenization and Embedding)
Tokenization: a pre-trained speech encoder (e.g., the Whisper encoder) turns the waveform into frame-level representations from different layers, where each frame corresponds to a fixed period of time, and ASR provides the text “how are you”. An aggregator (several attention layers) uses the text tokens as queries and the speech-encoder outputs as keys and values, producing one embedding per text token that captures how that text token is pronounced.
Detokenization: a detokenizer built on the network architecture of a TTS model (CozyVoice) takes the text together with these text-aligned embeddings and reconstructs the speech; the whole system is trained to minimize the reconstruction error. (A toy sketch of the aggregator follows below.)
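A toy stand-in for the aggregator (not the actual TASTE implementation): text-token embeddings act as queries and speech-encoder frames as keys and values, so the output has exactly one vector per text token.

```python
# Cross-attention aggregator: queries from text, keys/values from speech frames.
import torch
import torch.nn as nn

D = 256

class Aggregator(nn.Module):
    def __init__(self, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(D, num_heads=4, batch_first=True)
             for _ in range(num_layers)])

    def forward(self, text_emb, speech_frames):
        # text_emb: (B, N_text, D) queries; speech_frames: (B, N_frames, D) keys/values
        q = text_emb
        for attn in self.layers:
            out, _ = attn(q, speech_frames, speech_frames)
            q = q + out                              # residual; one vector per text token
        return q

text_emb = torch.randn(1, 3, D)                      # "how", "are", "you"
speech_frames = torch.randn(1, 120, D)               # e.g. Whisper-encoder frames
aligned = Aggregator()(text_emb, speech_frames)
print(aligned.shape)                                 # (1, 3, D): same length as the text
```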
Reconstruction examples: utterances such as “… but now Murdoch came around to stare at the gang …” and “Any news on the dancer's assault case?” are passed through the tokenizer and then the detokenizer to resynthesize the speech.
https://arxiv.org/abs/2504.07053
Training the Speech LLM
The tokenizer converts “look at my eyes” into text-aligned speech tokens (101 56 33 162), and the speech LLM, initialized from llama-3.2-1B as the text LLM, is trained to predict the interleaved sequence: given “look 101 at 56 my 33”, predict “at 56 my 33 eyes 162”. Emilia is used as the training dataset (the English subset is about 40,000 hours). A sketch of the interleaving follows below.
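A tiny sketch of building the interleaved training sequence; the token IDs are the illustrative ones from the slide.

```python
# Interleave each text token with its aligned speech token.
def interleave(text_tokens, speech_tokens):
    assert len(text_tokens) == len(speech_tokens)   # text-aligned speech tokens
    seq = []
    for t, s in zip(text_tokens, speech_tokens):
        seq.extend([t, s])
    return seq

print(interleave(["look", "at", "my", "eyes"], [101, 56, 33, 162]))
# ['look', 101, 'at', 56, 'my', 33, 'eyes', 162]
```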
Speech Continuation Demonstration
• Prompt: “I’ll take the armchair” → Continuation: “in the corner and just sit there. It’ll feel better. I’ll try to not sleep.”
• Prompt: “I reserve your services” → Continuation: “for a long time. I am very happy with the result. I will definitely recommend you to my friends.”
• Prompt: “It’s hot and loud and so many people” (Raj in TBBT) → Continuation: “I don’t know what to do ……”
• Prompt: “look at my eyes” → Continuation: “I am not the one who is wrong. I am not the one who is wrong. (noise) I am not the one who is wrong. I am not the one who is wrong.”
Source of video: https://www.youtube.com/watch?v=Dc7gc7BECk0
More demonstrations with the pre-trained speech LLM:
• “露比醬~ 嗨!” (“Ruby-chan~ Hi!”)
• “歩夢醬~ 嗨!” (“Ayumu-chan~ Hi!”)
• “四季醬~ 嗨!” (“Shiki-chan~ Hi!”)
We have discussed many ways to generate speech; so how do we train these models?
How to Train a Speech LLM
• Pre-training: next “speech token” prediction on unlabeled speech data → pre-trained speech LLM
• SFT: supervised fine-tuning on human-annotated data → speech LLM. One way to obtain SFT data is to have a text LLM generate text conversations and then use TTS to turn them into speech conversations (a sketch of this recipe follows below).
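A rough sketch of that data recipe; `generate_text_dialogue`, `tts`, and `speech_tokenize` are hypothetical placeholders for whichever text LLM, TTS system, and tokenizer are actually used.

```python
# Build one spoken-conversation SFT example from text-LLM output plus TTS.
def build_sft_example(topic, generate_text_dialogue, tts, speech_tokenize):
    dialogue = generate_text_dialogue(topic)          # [(speaker, text), ...] from a text LLM
    example = []
    for speaker, text in dialogue:
        waveform = tts(text, voice=speaker)           # synthesize one turn of speech
        example.append({"speaker": speaker,
                        "text": text,
                        "speech_tokens": speech_tokenize(waveform)})
    return example                                    # a spoken conversation for SFT
```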
• RLHF: alignment with preference data → speech LLM
Alignment with Feedback: apply preference-based alignment to the speech LLM. Some related work improves audio quality, and some improves audio understanding:
https://arxiv.org/abs/2404.05600, https://arxiv.org/abs/2406.00654, https://arxiv.org/abs/2407.02243, https://arxiv.org/abs/2404.09956, https://arxiv.org/abs/2402.00744, https://arxiv.org/abs/2503.11197, https://arxiv.org/abs/2504.15900, https://arxiv.org/abs/2505.09439
Alignment with Feedback: Guan-Ting Lin (with researchers from the Amazon GAI team)
https://arxiv.org/abs/2411.01834
Beyond the Turn-based Game
Text conversation is turn-based: User 1 writes, then User 2 writes. Speech conversation is full-duplex: both speakers can talk at once, and their speech overlaps. How can we enable spoken LLMs to interact with interlocutors in a full-duplex way?
• Dialogue GSLM: https://arxiv.org/abs/2203.16502, https://arxiv.org/abs/2407.01911
• Moshi: https://arxiv.org/abs/2410.00037
Evaluation (Chih-Kai Yang)
https://arxiv.org/abs/2505.15957
To Learn More
For a paper list: https://github.com/ga642381/speech-trident
https://arxiv.org/abs/2410.03751
https://arxiv.org/abs/2410.18908
https://arxiv.org/abs/2411.13577
https://arxiv.org/abs/2504.08528