Speechlm2 SALM improvements #13829

pzelasko · 2025-06-04T15:50:55Z

What does this PR do?

SpeechLM2 improvements:

LLM inputs are now left-padded rather than right-padded, which fixes an issue during generation with HF APIs
SALM ASR eval: choose the English/Basic/None text normalizer; the hardcoded user prompt is removed and made customizable
Qwen prompt formatter definition + tests
Ability to filter NeMoMultimodalConversation data by duration
~~Very experimental support for token bucket bins estimation and OOMptimizer for speechlm2 SALM~~ will add OOMptimizer in a separate PR
Fault-tolerant audio loading for NeMoMultimodalConversation
- Helpful when reading corrupted data, especially when you are unlucky enough to sample an entire mini-batch of corrupted, unreadable audio; in that case FallbackDataset re-uses the previous working mini-batch.
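
The left-padding change in the list above can be illustrated with a minimal pure-Python sketch. The PR itself pads tensors with pad_sequence(..., padding_side="left") in salm_dataset.py; the helper pad_batch and the pad id 0 below are hypothetical names chosen for illustration and are not NeMo or HF APIs. The point is that a decoder-only model appends new tokens after the last position, so that position should hold a real token for every row of the batch.

```python
from typing import List

PAD = 0  # hypothetical pad token id, for illustration only


def pad_batch(seqs: List[List[int]], pad_id: int = PAD, side: str = "left") -> List[List[int]]:
    """Pad variable-length token lists to a common length on the chosen side."""
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_id] * (max_len - len(s))
        padded.append(fill + s if side == "left" else s + fill)
    return padded


batch = [[11, 12, 13], [21]]
left = pad_batch(batch, side="left")    # [[11, 12, 13], [0, 0, 21]]
right = pad_batch(batch, side="right")  # [[11, 12, 13], [21, 0, 0]]

# Last column of each layout: this is where generation continues from.
last_left = [row[-1] for row in left]    # real tokens only: [13, 21]
last_right = [row[-1] for row in right]  # shorter row ends in padding: [13, 0]
```

With right padding, the shorter prompt would be continued from a pad token, which is the kind of generation issue left padding avoids.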

Collection: SpeechLM2
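
The fault-tolerant loading behaviour described in the list above (re-using the previous working mini-batch when an entire mini-batch is unreadable) can be sketched as follows. This is not the FallbackDataset implementation from this PR; FallbackBatchLoader and load_readable are hypothetical names, and the sketch assumes that a failed load either raises or returns None.

```python
from typing import Any, Callable, Optional, Sequence


class FallbackBatchLoader:
    """Hypothetical wrapper: returns the last good mini-batch when loading fails."""

    def __init__(self, load_batch: Callable[[Sequence[Any]], Optional[Any]]):
        self.load_batch = load_batch
        self._previous: Optional[Any] = None

    def __call__(self, items: Sequence[Any]) -> Any:
        try:
            batch = self.load_batch(items)
        except Exception:
            # A real implementation would likely catch narrower error types and log them.
            batch = None
        if batch is None:
            if self._previous is None:
                raise RuntimeError("No valid mini-batch available to fall back on.")
            return self._previous
        self._previous = batch
        return batch


def load_readable(items: Sequence[str]) -> Optional[list]:
    """Toy loader: drops unreadable items and returns None if nothing is left."""
    good = [x for x in items if x != "corrupted"]
    return good or None


loader = FallbackBatchLoader(load_readable)
first = loader(["a.wav", "corrupted"])       # partially corrupted: keeps the readable item
second = loader(["corrupted", "corrupted"])  # fully corrupted: falls back to `first`
```

Re-using a batch slightly biases training toward the repeated examples, which is usually an acceptable trade-off compared with crashing a long run on a handful of bad files.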

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

examples/speechlm2/salm_train.py

 from lightning.pytorch import Trainer
 from omegaconf import OmegaConf

+from nemo.collections.common.data.fallback import FallbackDataset


To fix the problem, remove the unused import statement `from nemo.collections.common.data.fallback import FallbackDataset` on line 20 of examples/speechlm2/salm_train.py. This cleans up the code and eliminates an unnecessary dependency, improving readability and maintainability.


zhehuaichen

Many thanks!

Is there anything that needs to be done to support OOMptimizer in duplex?

pzelasko · 2025-06-17T13:54:17Z

> Many thanks!
>
> Is there anything that needs to be done to support OOMptimizer in duplex?

@zhehuaichen I have a version of OOMptimizer that works for Duplex S2S, I'll commit in a separate PR together with tests for it.

github-actions · 2025-06-17T19:30:34Z

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

zhehuaichen · 2025-06-17T21:04:52Z

nemo/collections/speechlm2/data/salm_dataset.py

+    return pad_sequence(tensors, batch_first=True, padding_value=padding_value, padding_side="left")
+
+
+def drop_in_memory_data(conversations: CutSet) -> CutSet:


Why do we need this?

pzelasko · 2025-06-18

I added conversations to the return dict of the dataset's __getitem__ so that we can fetch the relevant metadata for each example in a batch. This is useful in the salm_generate script, which reads cuts/conversations and writes the prediction results as a conversations manifest.

However, because the items in a CutSet (cuts, conversations) may hold audio data in memory when reading tarred formats, passing them between the dataloading subprocess and the main training/inference process is very inefficient. This function drops the binary data from memory (it is no longer needed once the tensors have been created), so only metadata is passed around.

I'll add a comment in a follow-up PR (I don't want to drag this one out by rerunning CI).
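
The idea behind drop_in_memory_data can be sketched without Lhotse or NeMo. The Example class, drop_binary_payload, and payload_bytes below are hypothetical stand-ins (not the actual CutSet API or the function from this PR): each item carries metadata plus an optional in-memory audio payload, and the payload is dropped before the items leave the dataloading worker.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Example:
    """Hypothetical stand-in for a cut/conversation: metadata plus optional in-memory audio."""

    id: str
    duration: float
    audio_bytes: Optional[bytes] = None


def drop_binary_payload(examples: List[Example]) -> List[Example]:
    """Return copies that keep the metadata but no longer reference the audio bytes."""
    return [replace(ex, audio_bytes=None) for ex in examples]


def payload_bytes(examples: List[Example]) -> int:
    """Total size of the in-memory audio that would travel with the examples."""
    return sum(len(ex.audio_bytes or b"") for ex in examples)


examples = [Example("utt-1", 2.5, audio_bytes=b"\x00" * 80_000)]
light = drop_binary_payload(examples)

heavy_size = payload_bytes(examples)  # 80000 bytes of audio attached
light_size = payload_bytes(light)     # 0 bytes: only metadata remains
```

The originals are left untouched, so code that still needs the audio inside the worker can keep using them while only the lightweight copies cross the process boundary.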

* Add customizable user prompt and text normalizer to salm_eval.py
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Qwen prompt format support
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support AutoTokenizer in estimate_token_bins.py
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add FallbackDataset for handling very corrupted data in SALM
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix estim token bins
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support duration filtering for multimodal conversations
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* import custom prompt format fns in estimate token bins script
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix in SALM generation, various fixes
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix WER
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add SALM generation without eval script and add it to tests
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix CI
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Amir Hussein <amhussein@nvidia.com>

pzelasko added 9 commits May 27, 2025 11:35

Add customizable user prompt and text normalizer to salm_eval.py

430b807

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

Qwen prompt format support

0b7ff13

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

Support AutoTokenizer in estimate_token_bins.py

971aa95

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

Add FallbackDataset for handling very corrupted data in SALM

051b2e8

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

fix

b84fcbd

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

fix

9d3c71d

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

fix estim token bins

6984c50

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

Support duration filtering for multimodal conversations

9ac9098

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

import custom prompt format fns in estimate token bins script

84fe2a7

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko requested review from ankitapasad and zhehuaichen June 4, 2025 15:50

github-actions bot added the common label Jun 4, 2025

pzelasko added the Run CICD label Jun 4, 2025

github-advanced-security bot found potential problems Jun 4, 2025

pzelasko changed the title ~~Speechlm2 salm eval improv~~ Speechlm2 SALM improvements Jun 4, 2025

ko3n1g and others added 2 commits June 4, 2025 18:20

Merge branch 'main' into speechlm2-salm-eval-improv

51e65c9

Merge branch 'speechlm2-salm-eval-improv' of https://github.com/nvidi…

a8f804e

…a/nemo into speechlm2-salm-eval-improv

ko3n1g added Run CICD and removed Run CICD labels Jun 4, 2025

github-actions bot removed the Run CICD label Jun 4, 2025

ko3n1g added the Run CICD label Jun 4, 2025

github-actions bot removed the Run CICD label Jun 4, 2025

pzelasko added 2 commits June 4, 2025 12:39

fix ci

c74a082

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

fix ci

081ac94

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko added the Run CICD label Jun 4, 2025

pzelasko temporarily deployed to test June 4, 2025 16:47 — with GitHub Actions Inactive

github-advanced-security bot found potential problems Jun 4, 2025

scripts/speech_llm/estimate_token_bins.py (dismissed)

github-actions bot removed the Run CICD label Jun 4, 2025

zhehuaichen previously approved these changes Jun 6, 2025

fix

9594e23

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko dismissed zhehuaichen’s stale review via 9594e23 June 9, 2025 18:24

pzelasko added 2 commits June 17, 2025 09:38

Fix in SALM generation, various fixes

97fd70a

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

Fix WER

28eff4f

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko added the Run CICD label Jun 17, 2025

pzelasko had a problem deploying to test June 17, 2025 13:55 — with GitHub Actions Error

Add SALM generation without eval script and add it to tests

536af0c

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko added Run CICD and removed Run CICD labels Jun 17, 2025

pzelasko temporarily deployed to test June 17, 2025 14:03 — with GitHub Actions Inactive

fix CI

f36c170

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko added Run CICD and removed Run CICD labels Jun 17, 2025

pzelasko temporarily deployed to test June 17, 2025 14:35 — with GitHub Actions Inactive

github-actions bot removed the Run CICD label Jun 17, 2025

zhehuaichen approved these changes Jun 17, 2025

pzelasko merged commit 081ef78 into main Jun 18, 2025
454 of 457 checks passed

pzelasko deleted the speechlm2-salm-eval-improv branch June 18, 2025 12:56

Copilot Autofix suggested change for examples/speechlm2/salm_train.py:

@@ -19,3 +19,3 @@
-from nemo.collections.common.data.fallback import FallbackDataset
 from nemo.collections.speechlm2 import SALM, DataModule, SALMDataset
