KEMBAR78
Speechlm2 SALM improvements by pzelasko · Pull Request #13829 · NVIDIA-NeMo/NeMo · GitHub
Skip to content

Conversation

@pzelasko
Copy link
Collaborator

@pzelasko pzelasko commented Jun 4, 2025

What does this PR do ?

SpeechLM2 improvements:

  • LLM inputs are now left-padded rather than right-padded which removes an issue during generation using HF APIs
  • SALM ASR eval: choose English/Basic/None text normalizer, remove hardcoded user prompt and make customizable
  • Qwen prompt formatter definition + tests
  • Ability to filter NeMoMultimodalConversation data by duration
  • Very experimental support for token bucket bins estimation and OOMptimizer for speechlm2 SALM will add OOMptimizer in a separate PR
  • Fault-tolerant audio loading for NeMoMultimodalConversation
    • helpful when reading corrupted data, especially when you're unlucky and sample an entire mini-batch of corrupted unreadable audio, in which case FallbackDataset re-uses the previous working mini-batch.

Collection: SpeechLM2

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

pzelasko added 9 commits May 27, 2025 11:35
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
from lightning.pytorch import Trainer
from omegaconf import OmegaConf

from nemo.collections.common.data.fallback import FallbackDataset

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'FallbackDataset' is not used.

Copilot Autofix

AI 5 months ago

To fix the problem, we will remove the unused import statement from nemo.collections.common.data.fallback import FallbackDataset on line 20. This will clean up the code and eliminate the unnecessary dependency, improving readability and maintainability.

Suggested changeset 1
examples/speechlm2/salm_train.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/examples/speechlm2/salm_train.py b/examples/speechlm2/salm_train.py
--- a/examples/speechlm2/salm_train.py
+++ b/examples/speechlm2/salm_train.py
@@ -19,3 +19,3 @@
 
-from nemo.collections.common.data.fallback import FallbackDataset
+
 from nemo.collections.speechlm2 import SALM, DataModule, SALMDataset
EOF
@@ -19,3 +19,3 @@

from nemo.collections.common.data.fallback import FallbackDataset

from nemo.collections.speechlm2 import SALM, DataModule, SALMDataset
Copilot is powered by AI and may make mistakes. Always verify output.
@pzelasko pzelasko changed the title Speechlm2 salm eval improv Speechlm2 SALM improvements Jun 4, 2025
pzelasko added 2 commits June 4, 2025 12:39
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions github-actions bot removed the Run CICD label Jun 4, 2025
zhehuaichen
zhehuaichen previously approved these changes Jun 6, 2025
Copy link
Collaborator

@zhehuaichen zhehuaichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks!

Is there anything needs to be done to support oomoptimizer in duplex?

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
pzelasko added 2 commits June 17, 2025 09:38
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@pzelasko
Copy link
Collaborator Author

Many thanks!

Is there anything needs to be done to support oomoptimizer in duplex?

@zhehuaichen I have a version of OOMptimizer that works for Duplex S2S, I'll commit in a separate PR together with tests for it.

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions
Copy link
Contributor

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

return pad_sequence(tensors, batch_first=True, padding_value=padding_value, padding_side="left")


def drop_in_memory_data(conversations: CutSet) -> CutSet:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added conversations to the return dict of dataset's getitem so we can fetch relevant metadata for each example in a batch. This is useful in salm_generate script which reads cuts/conversations and writes the prediction results as conversations manifest.

But, because items in CutSet (cuts, conversations) may be holding audio data in memory when reading tarred formats, it is very inefficient to pass them between dataloading subprocess and the main training/inference process. This function drops the binary data from memory (no longer needed since we already created tensors) so we only pass metadata around.

I'll add a comment in a follow up PR (don't want to drag this out due to rerunning CI)

@pzelasko pzelasko merged commit 081ef78 into main Jun 18, 2025
454 of 457 checks passed
@pzelasko pzelasko deleted the speechlm2-salm-eval-improv branch June 18, 2025 12:56
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Jul 23, 2025
* Add customizable user prompt and text normalizer to salm_eval.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Qwen prompt format support

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support AutoTokenizer in estimate_token_bins.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add FallbackDataset for handling very corrupted data in SALM

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix estim token bins

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support duration filtering for multimodal conversations

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* import custom prompt format fns in estimate token bins script

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix in SALM generation, various fixes

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix WER

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add SALM generation without eval script and add it to tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix CI

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Amir Hussein <amhussein@nvidia.com>
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Aug 5, 2025
* Add customizable user prompt and text normalizer to salm_eval.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Qwen prompt format support

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support AutoTokenizer in estimate_token_bins.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add FallbackDataset for handling very corrupted data in SALM

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix estim token bins

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support duration filtering for multimodal conversations

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* import custom prompt format fns in estimate token bins script

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix in SALM generation, various fixes

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix WER

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add SALM generation without eval script and add it to tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix CI

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Amir Hussein <amhussein@nvidia.com>
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Aug 5, 2025
* Add customizable user prompt and text normalizer to salm_eval.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Qwen prompt format support

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support AutoTokenizer in estimate_token_bins.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add FallbackDataset for handling very corrupted data in SALM

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix estim token bins

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support duration filtering for multimodal conversations

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* import custom prompt format fns in estimate token bins script

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix in SALM generation, various fixes

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix WER

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add SALM generation without eval script and add it to tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix CI

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Amir Hussein <amhussein@nvidia.com>
nasretdinovr pushed a commit to nasretdinovr/NeMo that referenced this pull request Aug 8, 2025
* Add customizable user prompt and text normalizer to salm_eval.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Qwen prompt format support

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support AutoTokenizer in estimate_token_bins.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add FallbackDataset for handling very corrupted data in SALM

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix estim token bins

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Support duration filtering for multimodal conversations

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* import custom prompt format fns in estimate token bins script

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix ci

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix in SALM generation, various fixes

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix WER

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Add SALM generation without eval script and add it to tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix CI

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants