Speech Language Models that Can Listen and Speak
Speech LLM
LLM
Content, speaker, emotion, environment
Examples: ChatGPT Voice Mode, Gemini Live
More examples:
• Moshi
• https://arxiv.org/abs/2410.00037
• GLM-4-Voice
• https://arxiv.org/abs/2412.02612
• Step-Audio
• https://arxiv.org/abs/2502.11946
• Qwen2.5-Omni
• https://arxiv.org/abs/2503.20215
• Kimi-Audio
• https://arxiv.org/abs/2504.18425
• SpeechGPT
• https://github.com/OpenMOSS/SpeechGPT-2.0-preview
• Sesame
• https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
• ……
We have talked about speech input; this lecture
will focus on speech generation.
https://youtu.be/Z6b5-77EfGk?si=st0d4IukGWAc__F2
“Text Token”: a Text LLM maps the input “how are you” to the output “I am good”.
“Speech Token”: a Speech LLM does the same over speech, with Tokenization on the input side and Detokenization on the output side.
How to Train Speech LLM
1. Pre-training: unlabeled speech data → next “speech token” prediction → pre-trained Speech LLM (see the sketch below)
2. SFT: human-annotated data → supervised fine-tuning → Speech LLM
3. RLHF: preference data → reinforcement learning from human feedback → Speech LLM
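To make the first stage concrete, here is a minimal, hypothetical PyTorch sketch of next-“speech token” prediction pre-training. The model size, codebook size, and random batch are placeholder assumptions, not the recipe of any of the systems listed above.

```python
import torch
import torch.nn as nn

VOCAB = 1024      # size of the discrete speech-token codebook (assumed)
D, LAYERS = 512, 6

class TinySpeechLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=LAYERS)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens):                      # tokens: (batch, time) int64
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.body(self.emb(tokens), mask=causal)
        return self.head(h)                         # (batch, time, VOCAB)

lm = TinySpeechLM()
opt = torch.optim.AdamW(lm.parameters(), lr=3e-4)

# stand-in for one batch of tokenized, unlabeled speech
batch = torch.randint(0, VOCAB, (8, 256))

logits = lm(batch[:, :-1])                          # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1)
)
loss.backward()
opt.step()
```

SFT and RLHF then reuse the same model with labeled conversations and preference data, respectively.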
What is the basic unit of speech generation?
(Speech Token)
What is a “token” in the context of speech?
• Text: “I want to learn generative AI” → token sequence (https://platform.openai.com/tokenizer)
• Speech: waveform → token sequence = ???
Synthesized with gpt-4o-mini-tts
One option: use text itself as the token, with ASR as the tokenizer and TTS as the detokenizer around a Text LLM, instead of a Speech LLM with its own Tokenization/Detokenization.
However, text drops paralinguistic information. Example (same words, different tone): “你實在是真的好棒喔” (“You are really amazing”)
• said sincerely → “謝謝你的誇獎” (“Thank you for the compliment”)
• said sarcastically → “怎麼了……” (“What's wrong? …”)
If raw waveform samples were used as tokens, there would be at least 8,000 tokens per second on both the input and output sides of the Speech LLM, so Tokenization and Detokenization are needed to define a more compact “token”.
Various Types of Speech Tokenizers Haibin Wu
Source of image: https://www.linkedin.com/in/haibin-wu-479a39252/recent-activity/all/
Overview papers about speech tokenization
https://arxiv.org/abs/2402.13236
https://arxiv.org/abs/2502.06490
What is the best choice of tokens?
• Codec-SUPERB https://codecsuperb.github.io/ evaluates the tokenizer → tokens (3 77 23) → de-tokenizer loop: reconstruction quality and various tasks.
• DASB https://poonehmousavi.github.io/DASB-website/ evaluates the tokens (3 77 23) themselves on various downstream tasks.
Learn more from the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge https://www.wavlab.org/activities/2024/Interspeech2024-Discrete-Speech-Unit-Challenge/
A possible pipeline of speech tokenization
Speech SSL Model (one representation for every 0.02s frame)
https://arxiv.org/abs/2205.10643
https://www.youtube.com/watch?v=lMIN1iKYNmA
Quantization: K-means or VQ-layer
3 2 2 2 77 3 3 2
Deduplicate →
3 2 77 3 2
BPE (Byte Pair Encoding): merge the frequent pair “3 2” into a new token “5” →
5 77 5
https://arxiv.org/abs/2310.14580
http://arxiv.org/abs/2205.01086
https://ieeexplore.ieee.org/abstract/document/10096788
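A small Python sketch of the deduplicate-then-BPE steps on the slide's example sequence; the k-means/VQ quantization is assumed to have already produced the unit ids, and the new id “5” is just the slide's illustrative choice.

```python
# Deduplicate repeated units, then apply one BPE-style merge.
from itertools import groupby

units = [3, 2, 2, 2, 77, 3, 3, 2]          # frame-level unit ids from quantization

# 1) Deduplicate consecutive repeats  ->  [3, 2, 77, 3, 2]
dedup = [u for u, _ in groupby(units)]

# 2) One BPE merge: the frequent pair (3, 2) becomes the new unit 5 (illustrative id)
def merge_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

bpe = merge_pair(dedup, (3, 2), 5)          # -> [5, 77, 5]
print(dedup, bpe)
```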
A possible pipeline of speech tokenization
Tokenization (Speech SSL Model + quantization): 3 2 2 2 77 3 3 2 → deduplicate → 3 2 77 3 2 → BPE → 5 77 5
Detokenization: a separate model has to map the token sequence back to speech. But what model? (?????)
Another possible pipeline of speech tokenization: Neural Speech Codec
The tokenizer (compression) and detokenizer (decompression) are learned jointly: speech → Codec Tokenizer → 3 77 23 4 → Codec Detokenizer → speech.
AudioLM https://arxiv.org/abs/2209.03143
Various Types of Speech Tokenizers
Two Types of Tokens
SSL Tokenizer → “Semantic Token”; Neural Codec Tokenizer → “Acoustic Token”
• “Semantic” does not refer to its usual meaning in linguistics. Instead, “semantic tokens” are closer to content information (usually containing phonetic information).
• The distinction between the two types can be vague: “semantic tokens” also include acoustic information, and vice versa.
Various Types of Speech Tokenizers
RVQ (Residual Vector Quantization) https://arxiv.org/abs/2210.13438
Some tokenizers use RVQ to combine both kinds of information: the first codebook is closer to a “semantic token” and the later ones to “acoustic tokens”.
• SpeechTokenizer https://arxiv.org/abs/2308.16692
• Mimi (used in Moshi) https://arxiv.org/abs/2410.00037
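A minimal numpy sketch of the RVQ idea, with random toy codebooks rather than the trained codebooks of EnCodec, SpeechTokenizer, or Mimi: each stage quantizes the residual left by the previous stage, so stage 1 is the coarsest codebook and later stages add finer acoustic detail.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, STAGES = 8, 16, 4                      # feature dim, codebook size, number of codebooks
codebooks = rng.normal(size=(STAGES, K, D))  # random toy codebooks: (stage, entry, dim)

def rvq_encode(x):
    """x: (T, D) frame features -> (T, STAGES) integer codes."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = dists.argmin(1)                 # nearest codeword per frame
        codes.append(idx)
        residual = residual - cb[idx]         # pass what is left to the next stage
    return np.stack(codes, axis=1)

def rvq_decode(codes):
    return sum(cb[codes[:, s]] for s, cb in enumerate(codebooks))

frames = rng.normal(size=(5, D))
codes = rvq_encode(frames)                    # 5 frames x 4 codebooks of ids
print(codes.shape, np.abs(frames - rvq_decode(codes)).mean())
```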
Various Types of Speech Tokenizers
“Semantic Token” (SSL tokenizer) or “Acoustic Token” (neural codec): which one to choose?
Choosing is for rookies, I want it all!
Choice of Decoding Strategies
Each time step has several types of tokens: coarse tokens (closer to content), fine-grained tokens, and finer tokens (closer to acoustics). Assumption: all token streams are of equal length, for simplicity.
One option: an LLM first generates the coarse token stream for the whole utterance, and the finer streams are generated afterwards.
e.g., AudioLM https://arxiv.org/abs/2209.03143, VALL-E https://arxiv.org/abs/2301.02111
Choice of Decoding Strategies
A second model (LLM 2) then generates the fine-grained and finer tokens from the coarse ones; in VALL-E, LLM 2 is a non-autoregressive language model.
e.g., AudioLM https://arxiv.org/abs/2209.03143, VALL-E https://arxiv.org/abs/2301.02111
Choice of Decoding Strategies
This coarse-then-fine strategy is challenging for streaming: the detokenizer cannot start producing audio until the later token levels have been generated.
Choice of Decoding Strategies
A streamable alternative: generate all token types frame by frame (1 1 1, 2 2 2, 3 3 3, …) so the detokenizer can run as frames complete. However, when the different types of tokens are generated sequentially, the sequence can become very lengthy:
sequence length = tokens per second × types of tokens × dialogue length
Take Moshi as an example: 12.5 Hz × 8 token types × 5 minutes (300 seconds) = 30k tokens, already a sizeable LLM context window (checked below).
https://arxiv.org/abs/2402.05755
Source of image: https://towardsdatascience.com/towards-infinite-llm-context-windows-e099225abaaf
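A quick check of the arithmetic with these Moshi-style numbers:

```python
# Back-of-the-envelope sequence length for a 5-minute dialogue.
frame_rate = 12.5          # tokens per second per stream
token_types = 8            # number of token streams per frame
dialogue_seconds = 5 * 60  # 5 minutes

seq_len = frame_rate * token_types * dialogue_seconds
print(seq_len)             # 30000.0 -> about 30k tokens
```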
Choice of Decoding Strategies
Another option: generate multiple types of tokens in one step (all types for frame 1, then all types for frame 2, …: 1 1 1, 2 2 2, 3 3 3, 4 4 4, 5 5 5).
https://arxiv.org/abs/2402.05755
Choice of Decoding Strategies
Acoustic Delay: shift the finer token streams one step behind the coarser ones, so each step emits the coarse token of the current frame together with delayed finer tokens of previous frames (1 / 2 1 / 3 2 1 / 4 3 2 / 5 4 3 / …). See the sketch below.
https://arxiv.org/abs/2306.05284
https://arxiv.org/abs/2410.00037 (Moshi)
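A tiny, purely illustrative sketch of how such a delay pattern rearranges the multi-codebook frames; the frame values match the slide and PAD is an arbitrary placeholder id.

```python
# Codebook k is shifted k steps to the right, so each decoding step emits the
# coarse token of frame t together with the finer tokens of earlier frames.
PAD = 0
frames = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 5]]  # (time, codebooks)
n_codebooks = len(frames[0])

T = len(frames) + n_codebooks - 1
delayed = [[PAD] * n_codebooks for _ in range(T)]
for t, frame in enumerate(frames):
    for k, tok in enumerate(frame):
        delayed[t + k][k] = tok

for step in delayed:
    print(step)
# step 0: [1, 0, 0]
# step 1: [2, 1, 0]
# step 2: [3, 2, 1]   <- matches the "1 / 2 1 / 3 2 1 / 4 3 2 / 5 4 3" layout on the slide
```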
Choice of Decoding Strategies
Another option: a Temporal Transformer runs along the time axis, and a small Depth Transformer generates the different token types within each step (1 1 1 → 2 2 2 → …).
https://arxiv.org/abs/2109.03264
https://arxiv.org/abs/2410.00037
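A toy sketch of this temporal/depth factorization; `temporal_lm` and `depth_lm` are hypothetical callables, not Moshi's real modules, and the loop only illustrates the control flow.

```python
# The big temporal model runs once per frame; the small depth model then emits
# the n codebook tokens inside that frame autoregressively.
import random

def generate(temporal_lm, depth_lm, n_frames, n_codebooks):
    history, output = [], []
    for t in range(n_frames):
        frame_context = temporal_lm(history)          # one temporal step over past frames
        frame_tokens = []
        for k in range(n_codebooks):                  # cheap inner loop per frame
            tok = depth_lm(frame_context, frame_tokens)
            frame_tokens.append(tok)
        history.append(frame_tokens)
        output.append(frame_tokens)
    return output

# toy stand-ins so the sketch runs end-to-end
toy_temporal = lambda history: len(history)           # fake "context"
toy_depth = lambda ctx, partial: random.randrange(1024)
print(generate(toy_temporal, toy_depth, n_frames=3, n_codebooks=8))
```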
Why discrete tokens?
A tokenizer maps the waveform to a token sequence (e.g., 3 77 23 4). How about continuous representations instead?
The discrete tokens are crucial for generation: given the same input, there can be many possible outputs.
For understanding, there is no remarkable difference between continuous representations and discrete tokens.
Why discrete tokens?
• Let's say we train a speech LM to generate continuous representations. When either of two continuations is correct, regression training drives the model toward their average, which is incorrect.
Why discrete tokens?
• How do discrete tokens solve the issue? Given the context (3 77 23 12 71), the Speech LM learns a probability distribution over the next token (e.g., 60% for token 1, 40% for token 2) and samples from the distribution during inference, so the output is always one of the valid options.
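A toy numpy illustration of this argument; the vectors, probabilities, and token ids are made up.

```python
# If two continuations a and b are equally valid, a regression model trained with MSE
# converges to their average, which is neither; a model over discrete tokens learns
# P(token) = {1: 60%, 2: 40%} and sampling always returns one of the valid options.
import numpy as np

rng = np.random.default_rng(0)
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # two equally valid continuations

mse_optimum = (a + b) / 2                            # what continuous regression learns
print("continuous prediction:", mse_optimum)         # [0.5 0.5] -> incorrect "average"

probs = {1: 0.6, 2: 0.4}                             # distribution over discrete tokens
samples = rng.choice(list(probs), size=10, p=list(probs.values()))
print("sampled discrete tokens:", samples)           # each sample is a valid option
```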
Why discrete tokens?
• A speech LM can still generate continuous representations with special designs that avoid the incorrect “average” output; solutions are borrowed from image generation.
https://arxiv.org/abs/2406.11838
https://arxiv.org/abs/2312.02116
https://arxiv.org/abs/2403.05196
MELLE
https://arxiv.org/pdf/2407.08551
Good performance in Text-to-Speech (TTS)
BreezyVoice
GitHub: https://github.com/mtkresearch/BreezyVoice
Paper: https://arxiv.org/abs/2501.17790
TTS demo: synthesizing “大家好” (“hello everyone”), including code-switched input such as “hello~ how are you? 大家好”.
The source of the real audio is from the BIIC Podcast.
Pre-trained Speech LLM: trained with next-token prediction on a large amount of unlabeled speech data.
Speech continuation example (https://arxiv.org/abs/2306.02207): prompted with “He assassinated the president”, the pre-trained speech LLM continues with “He assassinated the president and gave mister johnson the last charge of improvement in his writing possible three point eight nine.”
Does this sentence make sense? GPT-4: “… while the sentence has recognizable English words and phrases, as it is currently constructed, it doesn't coherently communicate a clear, singular idea or sequence of connected ideas. …”
Using a Text Model as the Foundation Model for the Speech Model
Why is training solely on unlabeled speech data inefficient?
At roughly 100 (text) tokens per minute of speech, 1M hours of speech data corresponds to only about 6B text tokens.
LLaMA 3 was pre-trained on 15T text tokens, which at that rate would correspond to about 285k years of speech data.
Text is a compressed version of speech.
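The arithmetic behind these numbers, using the rates assumed on the slide:

```python
# Rough token-budget comparison between speech and text pre-training data.
tokens_per_minute = 100                       # text tokens "contained" in one minute of speech
hours = 1_000_000                             # 1M hours of unlabeled speech
speech_as_text_tokens = hours * 60 * tokens_per_minute
print(f"{speech_as_text_tokens / 1e9:.0f}B text-equivalent tokens")   # ~6B

llama3_tokens = 15e12                         # LLaMA 3 pre-training budget (15T text tokens)
years_of_speech = llama3_tokens / tokens_per_minute / 60 / 24 / 365
print(f"{years_of_speech / 1e3:.0f}k years of speech")                # ~285k years
```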
Why is training solely on unlabeled speech data inefficient? https://arxiv.org/abs/2404.00685
The linguistic performance of speech LLMs scales up to three orders of magnitude more slowly than that of text LLMs.
Besides content, speech LLMs also have to learn other information (such as speaker identity, emotion, etc.) that text LLMs do not need to model.
Leveraging Text: Starting from Text LLM
• Initializing spoken QA models with text models
• GSQA https://arxiv.org/abs/2312.09781
• DUAL https://arxiv.org/abs/2203.04911
Leveraging Text: Starting from Text LLM
A Text LM (“How are you?” → “I am good.”) serves as the initialization of the pre-trained Spoken LLM, which is then trained on speech tokens (3 77 23 12 71 34 3 23).
TWIST https://arxiv.org/abs/2305.13009
Leveraging Text: Speech-Text Hybrid Generation
The Spoken LLM (initialized from a Text LM) generates both text (“how are you”) and speech tokens (3 77 23 12 71 34 3 23).
This is similar to an inner monologue, allowing the model to consider what it wants to say in text before actually expressing it in speech.
Leveraging Text: Speech-Text Hybrid Generation
• Text then speech: “how are you” followed by 3 77 23 12 71 34 3 23. This is almost TTS; drawback: it cannot be streamed. (Spectron https://arxiv.org/abs/2305.15255)
• Text then speech (token-level): how 3 77 23 are 12 71 34 you 3 23. We need alignment between text and speech during training.
Leveraging Text: Speech-Text Hybrid Generation
• Text and speech at the same time: how 3 are 77 you 23 …, generated by the Spoken LLM. Problem: text tokens and speech tokens are not on the same scale (their lengths differ significantly). Ways to keep the two streams aligned (see the sketch after this list):
• Mini-Omni https://arxiv.org/abs/2408.16725
• LLaMA-Omni https://arxiv.org/abs/2409.06666: uses a CTC loss to align the text (how are) with the speech tokens (3 77 23 12 71)
• Moshi https://arxiv.org/abs/2410.00037: a fixed number of speech tokens per text slot, which is similar to a duration model
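A small sketch of the alignment problem and of the “fixed number of speech tokens per text slot” idea; the padding symbol and block size are illustrative assumptions, not Moshi's actual special tokens.

```python
# The short text stream is padded so both streams advance in lockstep,
# one text slot per fixed block of speech tokens (what a duration model would decide).
PAD = "<pad>"
text = ["how", "are", "you"]
speech = [3, 77, 23, 12, 71, 34, 3, 23]
block = 3                                           # speech tokens per text slot (assumed)

padded_text = []
for i in range(0, len(speech), block):
    slot = i // block
    padded_text.append(text[slot] if slot < len(text) else PAD)  # PAD fills leftover slots

chunks = (speech[i:i + block] for i in range(0, len(speech), block))
for txt, chunk in zip(padded_text, chunks):
    print(txt, chunk)
# how [3, 77, 23]
# are [12, 71, 34]
# you [3, 23]
```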
Speech Tokenization that Takes Text into Account
https://arxiv.org/abs/2504.07053
Liang-Hsuan Tseng Yi-Chang Chen Kuan-Yi Lee
(NTU) (MediaTek) (NTU)
Can we have text-aligned speech representations?
Instead of producing a long speech-token sequence (3 77 23 12 71 34 3 23) that needs complex alignment with the text, the tokenizer outputs the same number of speech tokens as text tokens (how are you).
These tokens need not include content information; they focus on information beyond the content.
A pre-trained speech encoder (e.g., Whisper Encoder) provides features from different layers, where each frame corresponds to a fixed period of time.
TASTE (Text-Aligned Speech Tokenization and Embedding)
Aggregator (several attention layers): the ASR transcript tokens (“how are you”) serve as queries, and the outputs of the pre-trained speech encoder (e.g., Whisper Encoder) serve as keys and values, so each text token gets one speech token describing how to pronounce it.
Detokenizer: uses the network architecture of the TTS model (CosyVoice) to reconstruct the speech from the text tokens and the text-aligned speech tokens.
Tokenization and detokenization are trained by minimizing the reconstruction error. A minimal sketch of the aggregator idea follows.
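To make the aggregator idea concrete, here is a minimal PyTorch sketch with random stand-in features; the dimensions and the single attention layer are illustrative assumptions, not the released TASTE implementation.

```python
# Text tokens act as queries, speech-encoder frames act as keys/values,
# so the output has exactly one text-aligned "speech" vector per text token.
import torch
import torch.nn as nn

D = 256
n_text_tokens, n_speech_frames = 3, 120            # "how are you" vs. ~2.4s of encoder frames

text_queries = torch.randn(1, n_text_tokens, D)    # from the ASR transcript embeddings
speech_frames = torch.randn(1, n_speech_frames, D) # from a Whisper-like speech encoder

aggregator = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
aligned, attn = aggregator(query=text_queries, key=speech_frames, value=speech_frames)

print(aligned.shape)   # torch.Size([1, 3, 256]) -> same length as the text sequence
```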
Reconstruction demo: each utterance goes through the Tokenizer and then the Detokenizer, e.g., “… but now Murdoch came around to stare at the gang …” and “Any news on the dancer's assault case?”
https://arxiv.org/abs/2504.07053
Training the Speech LLM
The Tokenizer turns “look at my eyes” into the TASTE tokens 101 56 33 162, and the Speech LLM is trained on the interleaved sequence “look 101 at 56 my 33 eyes 162” (predicting, e.g., “at 56 my 33 eyes 162” from “look 101 at 56 my 33”).
llama-3.2-1B, a text LLM, is used as the initialization.
Emilia is used as the training dataset (the English subset is about 40,000 hours).
Speech Continuation Demonstration
• Prompt: “I'll take the armchair” → Pre-trained speech LLM: “in the corner and just sit there. It'll feel better. I'll try to not sleep.”
• Prompt: “I reserve your services” → “for a long time. I am very happy with the result. I will definitely recommend you to my friends.”
• Prompt: “It's hot and loud and so many people” (Raj in TBBT) → “I don't know what to do ……”
• Prompt: “look at my eyes” → “I am not the one who is wrong. I am not the one who is wrong. (noise) I am not the one who is wrong. I am not the one who is wrong.”
Source of video: https://www.youtube.com/watch?v=Dc7gc7BECk0
• Prompt: “露比醬~” (“Ruby-chan~”) → Pre-trained speech LLM: “嗨!” (“Hi!”)
• Prompt: “歩夢醬~” (“Ayumu-chan~”) → “嗨!” (“Hi!”)
• Prompt: “四季醬~” (“Shiki-chan~”) → “嗨!” (“Hi!”)
We have discussed many ways to generate speech; so how do we train the model?
How to Train Speech LLM
1. Pre-training: unlabeled speech data → next “speech token” prediction → pre-trained Speech LLM
2. SFT: human-annotated data → supervised fine-tuning → Speech LLM
   • One way to obtain SFT data: a Text LLM generates text conversations, and TTS converts them into speech conversations.
3. RLHF: preference data → reinforcement learning from human feedback → Speech LLM
Alignment with Feedback (Speech LLM)
Some related work improves audio quality:
https://arxiv.org/abs/2404.05600
https://arxiv.org/abs/2407.02243
https://arxiv.org/abs/2404.09956
https://arxiv.org/abs/2402.00744
Some related work improves audio understanding:
https://arxiv.org/abs/2406.00654
https://arxiv.org/abs/2503.11197
https://arxiv.org/abs/2504.15900
https://arxiv.org/abs/2505.09439
Alignment with Feedback
Guan-Ting Lin (with researchers from the Amazon GAI team)
https://arxiv.org/abs/2411.01834
Beyond the Turn-based Game
• Text conversation is turn-based: User 1 and User 2 take turns.
• Speech conversation is full-duplex: Speaker 1 and Speaker 2 can overlap.
How can we enable spoken LLMs to interact with interlocutors in a full-duplex way? (See the sketch below.)
• Dialogue GSLM https://arxiv.org/abs/2203.16502
• https://arxiv.org/abs/2407.01911
• Moshi https://arxiv.org/abs/2410.00037
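A toy sketch of the full-duplex idea under these assumptions: the user's audio and the model's audio are two parallel token streams at the same frame rate, so listening and speaking overlap instead of alternating turns. `spoken_lm` is a hypothetical per-frame callable, not Moshi's actual interface.

```python
# At every frame the model reads the newest user frame and emits its own frame,
# which may be a "silence" token while it is only listening.
def full_duplex_loop(spoken_lm, user_frames):
    model_frames, history = [], []
    for user_tok in user_frames:                 # one iteration per audio frame
        history.append(("user", user_tok))
        model_tok = spoken_lm(history)
        history.append(("model", model_tok))
        model_frames.append(model_tok)
    return model_frames

# toy stand-in so the sketch runs: stay silent (0), then start talking (42)
toy_lm = lambda hist: 0 if len(hist) < 10 else 42
print(full_duplex_loop(toy_lm, user_frames=list(range(8))))
```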
Evaluation
Chih-Kai Yang
https://arxiv.org/abs/2505.15957
To Learn More
For a paper list: https://github.com/ga642381/speech-trident
https://arxiv.org/abs/2410.03751
https://arxiv.org/abs/2410.18908
https://arxiv.org/abs/2411.13577
https://arxiv.org/abs/2504.08528