Speechlm2 SALM improvements #13829
Conversation
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
```python
from lightning.pytorch import Trainer
from omegaconf import OmegaConf

from nemo.collections.common.data.fallback import FallbackDataset
```
Check notice — Code scanning / CodeQL: Unused import

Copilot Autofix (AI, 5 months ago):
To fix the problem, we will remove the unused import statement `from nemo.collections.common.data.fallback import FallbackDataset` on line 20. This cleans up the code and eliminates the unnecessary dependency, improving readability and maintainability.
```diff
@@ -19,3 +19,3 @@
-from nemo.collections.common.data.fallback import FallbackDataset
-
 from nemo.collections.speechlm2 import SALM, DataModule, SALMDataset
```
Many thanks!
Is there anything that needs to be done to support OOMptimizer in Duplex?
@zhehuaichen I have a version of OOMptimizer that works for Duplex S2S, I'll commit it in a separate PR together with tests for it.
[🤖]: Hi @pzelasko 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
```python
    return pad_sequence(tensors, batch_first=True, padding_value=padding_value, padding_side="left")


def drop_in_memory_data(conversations: CutSet) -> CutSet:
```
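For context, padding with `padding_side="left"` right-aligns variable-length sequences so that, for decoder-style generation, padding precedes the content rather than trailing it. A minimal pure-Python sketch of the same idea (illustrative only, not NeMo's actual helper):

```python
def left_pad(sequences, padding_value=0):
    """Pad variable-length sequences on the left so all rows right-align to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[padding_value] * (max_len - len(seq)) + list(seq) for seq in sequences]
```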
why do we need this?
I added conversations to the return dict of the dataset's `__getitem__` so we can fetch the relevant metadata for each example in a batch. This is useful in the salm_generate script, which reads cuts/conversations and writes the prediction results as a conversations manifest.
But because items in a CutSet (cuts, conversations) may hold audio data in memory when reading tarred formats, it is very inefficient to pass them between the dataloading subprocess and the main training/inference process. This function drops the binary data from memory (it is no longer needed since we have already created the tensors) so we only pass metadata around.
I'll add a comment in a follow-up PR (I don't want to drag this out by rerunning CI).
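The pattern described above — stripping heavy binary payloads before examples cross the dataloader-to-trainer process boundary — can be sketched generically. This is an illustration of the idea, not lhotse's actual CutSet API; the `Example` class and its fields are made up for the sketch:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Example:
    """Illustrative stand-in for a cut/conversation: metadata plus an optional raw audio payload."""
    id: str
    duration: float
    audio_bytes: Optional[bytes] = None  # in-memory payload present when reading tarred data


def drop_in_memory_data(examples):
    """Return copies with the binary payload removed; only metadata crosses the process boundary."""
    return [replace(ex, audio_bytes=None) for ex in examples]
```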
Squashed commit history:
* Add customizable user prompt and text normalizer to salm_eval.py
* Qwen prompt format support
* Support AutoTokenizer in estimate_token_bins.py
* Add FallbackDataset for handling very corrupted data in SALM
* fix
* fix
* fix estim token bins
* Support duration filtering for multimodal conversations
* import custom prompt format fns in estimate token bins script
* fix ci
* fix ci
* fix
* Fix in SALM generation, various fixes
* Fix WER
* Add SALM generation without eval script and add it to tests
* fix CI

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Amir Hussein <amhussein@nvidia.com>
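One of the commits above adds FallbackDataset for handling very corrupted data. The general idea — wrap a dataset so that an example that fails to load is replaced with the last known-good one instead of crashing training — can be sketched as follows (illustrative only, not NeMo's actual implementation):

```python
class FallbackDataset:
    """Wraps a dataset; if fetching an item raises, return the last successfully fetched item."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._last_good = None

    def __getitem__(self, index):
        try:
            item = self.dataset[index]
            self._last_good = item
            return item
        except Exception:
            if self._last_good is None:
                raise  # nothing to fall back to yet
            return self._last_good

    def __len__(self):
        return len(self.dataset)
```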
What does this PR do?
SpeechLM2 improvements:
* Very experimental support for token bucket bins estimation for speechlm2 SALM (OOMptimizer will be added in a separate PR)

Collection: SpeechLM2
Changelog
Usage
# Add a code snippet demonstrating how to use this

GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information