Sketch dist-ckpt content versioning by mikolajblaz · Pull Request #13839 · NVIDIA-NeMo/NeMo · GitHub
@mikolajblaz
Collaborator

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@github-actions bot added the core (Changes to NeMo Core), NLP, and Multi Modal labels on Jun 5, 2025

To accelerate checkpoint saving, it is recommended to set ``dist_ckpt_assume_constant_structure=True``.

**9. Q: I get an error about an "invalid access pattern". What does it mean?**
Collaborator Author

A bonus unrelated to the MR :)

@mikolajblaz mikolajblaz force-pushed the mblaz/dist-ckpt-content-versioning branch from a35348c to df69593 Compare June 24, 2025 14:30
mikolajblaz and others added 14 commits June 24, 2025 16:35
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

# Conflicts:
#	nemo/lightning/pytorch/strategies/megatron_strategy.py
Signed-off-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@mikolajblaz mikolajblaz force-pushed the mblaz/dist-ckpt-content-versioning branch from df69593 to 05ceba1 Compare June 24, 2025 14:36
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@mikolajblaz mikolajblaz force-pushed the mblaz/dist-ckpt-content-versioning branch from 518afd7 to 1cc6744 Compare July 2, 2025 09:39
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jul 2, 2025
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

try:
    scaler.update()
except AssertionError:
    pass

Check notice (Code scanning / CodeQL): Empty except (severity: Note, tag: test)

'except' clause does nothing but pass and there is no explanatory comment.
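
A notice like this is usually addressed by stating why the exception is ignored, either in a comment or in a log line. The following is only a minimal sketch of that pattern, not the change made in this PR: `FakeScaler`, `update_scaler`, and the assertion message are hypothetical names introduced for illustration, and the reason the original code tolerates `AssertionError` is not stated in this conversation.

```python
import logging

logger = logging.getLogger(__name__)


class FakeScaler:
    """Hypothetical stand-in for the real scaler; its update() always asserts."""

    def update(self):
        raise AssertionError("nothing to update")


def update_scaler(scaler):
    """Call scaler.update(), tolerating AssertionError but recording why.

    Returns True if the update ran, False if it was skipped.
    """
    try:
        scaler.update()
        return True
    except AssertionError as err:
        # The failure is ignored on purpose (assumption for this sketch);
        # logging it keeps the decision visible instead of a bare `pass`.
        logger.debug("scaler.update() skipped: %s", err)
        return False


result = update_scaler(FakeScaler())
```

Returning a boolean is a design choice of this sketch; it lets a caller (or a test) tell a skipped update from a successful one without inspecting logs.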
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@mikolajblaz mikolajblaz force-pushed the mblaz/dist-ckpt-content-versioning branch from bfb19ca to 77742bb Compare August 6, 2025 09:20
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Aug 6, 2025
Collaborator

@dimapihtar dimapihtar left a comment

LGTM. Thank you!

@dimapihtar dimapihtar merged commit e295dbc into NVIDIA-NeMo:main Aug 13, 2025
296 of 302 checks passed
guyueh1 pushed a commit to guyueh1/NeMo that referenced this pull request Aug 25, 2025
* Sketch dist-ckpt content versioning

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

# Conflicts:
#	nemo/lightning/pytorch/strategies/megatron_strategy.py

* Apply isort and black reformatting

Signed-off-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add docs

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Change dist_opt_sharding_type name

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Remove MappingProxyType

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Expand docs

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix .rst formatting

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix .rst formatting

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix .rst formatting

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix .rst formatting

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Unindent code

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Match MLM metadata creation logic

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Remove one Nemo1 TODO

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add doc

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Remove collections/nlp TODOs (NeMo 1)

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix some TODOs

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Remove f string

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix MegatronStrategy typo

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Handle TODOs

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add missing import

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add content metadata flag to async

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add missing load_content_metadata

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Load content_metadata through unwrapped_ckpt_io

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Pass content_metadata through storage_options

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix linting problems

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add sharded_state_dict_metadata to FabricMegatronStrategy

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix indentation

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix last unwrapped_checkpoint_io

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix None type

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add local ckpt versioning note

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add safe_import fix

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Revert nemo_logger changes

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Remove chained_optim_avoid_prefix flag

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* add unit tests

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>

* Fix OptimizerWrapper init signature

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

---------

Signed-off-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: dimapihtar <dpihtar@gmail.com>
Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Labels

core (Changes to NeMo Core), NLP, Run CICD

5 participants