feat: refit metadata optimization #686

ZhiyuLi-Nvidia · 2025-07-17T18:14:53Z

What does this PR do?

- fix: maintain fp32 mlp.router.expert_bias even with bf16 enabled
  - the bug was caused by a conflict between the fp32 router bias and the bf16 module
- track refit time inside prepare_for_generation
- refit metadata optimization: pass only a list of keys during refitting
- benchmark: refit is about 8% faster, and the code is cleaner
  - baseline: 51s
  - with the change: 47s
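A minimal sketch of the dtype rule in the first bullet: when the model is cast to bf16, the MoE router's expert bias is exempted and stays in fp32. It assumes parameters are addressed by fully qualified names; FP32_KEEP_SUFFIXES, target_dtype and plan_cast are hypothetical names, and dtypes are plain strings so the sketch runs without PyTorch. It is not the code in this PR.

```python
# Hypothetical sketch: pick the target dtype per parameter when casting a
# model to bf16, keeping the MoE router expert bias in fp32.
# Names and the suffix rule are illustrative assumptions.

FP32_KEEP_SUFFIXES = ("mlp.router.expert_bias",)


def target_dtype(param_name: str, model_dtype: str = "bfloat16") -> str:
    """Return the dtype a parameter should have after the model cast."""
    if param_name.endswith(FP32_KEEP_SUFFIXES):
        # The router bias stays in fp32 even when the rest is bf16.
        return "float32"
    return model_dtype


def plan_cast(param_names, model_dtype="bfloat16"):
    """Map every parameter name to its target dtype."""
    return {name: target_dtype(name, model_dtype) for name in param_names}


if __name__ == "__main__":
    names = [
        "decoder.layers.0.self_attention.linear_qkv.weight",
        "decoder.layers.0.mlp.router.weight",
        "decoder.layers.0.mlp.router.expert_bias",
    ]
    for name, dtype in plan_cast(names).items():
        print(f"{name}: {dtype}")
```

Keeping the exception in a single name-based rule makes it easy to apply consistently wherever the model is cast, including on the refit path.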

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

nemo_rl/algorithms/grpo.py

yfw · 2025-07-18T18:05:10Z

PR looks good to me. I noticed that the function serialization is also part of this PR. Out of the 8% perf gain, do we have a breakdown of how much can be attributed to the function serialization and how much to the metadata optimization? Just wondering how much we are actually gaining from that part.

yfw · 2025-07-18T18:07:59Z

Also a minor thing since we are timing the refit separately now. Do you mind removing this print which spams the logs for large models?
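A minimal sketch of timing the refit as its own phase, which is what makes per-parameter prints redundant: one accumulated total is reported instead of a line per tensor. PhaseTimer and the prepare_for_generation(timer, refit_fn) signature are hypothetical stand-ins, not the NeMo RL API.

```python
# Hypothetical sketch: time the refit as its own phase and report one total,
# instead of printing a line per transferred parameter.
# PhaseTimer and this prepare_for_generation signature are illustrative
# assumptions, not the NeMo RL API.
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase."""

    def __init__(self):
        self.totals = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the wrapped block raises.
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed


def prepare_for_generation(timer, refit_fn):
    """Stand-in: run the weight refit inside a timed 'refit' phase."""
    with timer.phase("refit"):
        refit_fn()


if __name__ == "__main__":
    timer = PhaseTimer()
    prepare_for_generation(timer, lambda: time.sleep(0.01))
    print(f"refit: {timer.totals['refit']:.3f}s")
```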

ZhiyuLi-Nvidia · 2025-07-18T23:06:40Z

> PR looks good to me. I noticed that the function serialization is also part of this PR. Out of the 8% perf gain, do we have a breakdown of how much can be attributed to the function serialization and how much to the metadata optimization? Just wondering how much we are actually gaining from that part.

@yfw almost all 8% perf gain is from the function serialization.

> pass list of keys only during refitting

This change may not offer any speed-up, but it makes the code cleaner and more readable. Little improvement should be expected from switching the serialized metadata from a dictionary of (key, offset) pairs to a plain list of keys, because the receiver then has to reconstruct the offsets locally.
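A minimal sketch of the "local offset reconstruction" trade-off: the sender ships only keys, and the receiver rebuilds each tensor's (offset, size) in a packed buffer from metadata it already holds. It assumes every receiver has cached shape and dtype per key and that tensors are packed back to back with no alignment padding; DTYPE_BYTES, nbytes and offsets_from_keys are hypothetical names, not the PR's code.

```python
# Hypothetical sketch: rebuild byte offsets from cached per-parameter
# metadata, so the sender only needs to send an ordered list of keys.
# Assumes tightly packed buffers (no alignment padding).
from math import prod

DTYPE_BYTES = {"float32": 4, "bfloat16": 2}


def nbytes(shape, dtype):
    """Size in bytes of a tensor with this shape and dtype."""
    return prod(shape) * DTYPE_BYTES[dtype]


def offsets_from_keys(keys, param_info):
    """Return {key: (offset, size)} for a buffer packed in key order."""
    offsets = {}
    cursor = 0
    for key in keys:
        shape, dtype = param_info[key]
        size = nbytes(shape, dtype)
        offsets[key] = (cursor, size)
        cursor += size
    return offsets


if __name__ == "__main__":
    info = {"w": ((2, 3), "bfloat16"), "b": ((3,), "float32")}
    print(offsets_from_keys(["w", "b"], info))  # {'w': (0, 12), 'b': (12, 12)}
```

The reconstruction is a single linear pass over the keys, so the payload saved by dropping offsets is roughly offset by this local work. That is consistent with the change being mainly a readability improvement.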

yuki-97

Thanks @ZhiyuLi-Nvidia , LGTM!

guyueh1

One small comment otherwise LGTM

nemo_rl/algorithms/grpo.py

ZhiyuLi-Nvidia · 2025-07-21T21:09:03Z

> Also a minor thing since we are timing the refit separately now. Do you mind removing this print which spams the logs for large models?

Done. Could you take another look @yfw?

Signed-off-by: Yuki Huang <yukih@nvidia.com>

nemo_rl/models/policy/megatron_policy_worker.py

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

github-actions · 2025-07-29T23:21:35Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 898903a (PR #686 from zhiyul/yukih/refit-optimization-minimal-serialization)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

github-actions · 2025-07-29T23:26:35Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 19703f7 (PR #686 from zhiyul/yukih/refit-optimization-minimal-serialization)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

github-actions · 2025-07-29T23:35:42Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 0d26765 (PR #686 from zhiyul/yukih/refit-optimization-minimal-serialization)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

github-actions · 2025-07-29T23:40:03Z

✅ Submodule Fast-Forward Check Results

Check based on commit: dd74850 (PR #686 from zhiyul/yukih/refit-optimization-minimal-serialization)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Signed-off-by: tpoisonooo <khj.application@aliyun.com>

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>

ZhiyuLi-Nvidia changed the title ~~fit: refit metadata optimization~~ feat: refit metadata optimization Jul 17, 2025

ZhiyuLi-Nvidia force-pushed the zhiyul/yukih/refit-optimization-minimal-serialization branch 2 times, most recently from bd680b0 to bb491df July 18, 2025 08:37

ZhiyuLi-Nvidia requested review from guyueh1, parthchadha, terrykong, yfw and yuki-97 July 18, 2025 08:52

yuki-97 reviewed Jul 18, 2025

nemo_rl/algorithms/grpo.py (outdated, resolved)

yuki-97 previously approved these changes Jul 21, 2025

guyueh1 reviewed Jul 21, 2025

nemo_rl/algorithms/grpo.py (outdated, resolved)

ZhiyuLi-Nvidia dismissed yuki-97’s stale review via 2025a57 July 21, 2025 21:06

ZhiyuLi-Nvidia force-pushed the zhiyul/yukih/refit-optimization-minimal-serialization branch from 2025a57 to e677b9c July 21, 2025 21:08

github-actions bot added documentation Improvements or additions to documentation CI Relating to CI labels Jul 21, 2025

ZhiyuLi-Nvidia force-pushed the zhiyul/yukih/refit-optimization-minimal-serialization branch from e677b9c to 2931ae2 July 21, 2025 21:16

github-actions bot removed documentation Improvements or additions to documentation CI Relating to CI labels Jul 21, 2025

yfw previously approved these changes Jul 21, 2025

yuki-97 previously approved these changes Jul 22, 2025

guyueh1 previously approved these changes Jul 22, 2025

yuki-97 added a commit that referenced this pull request Jul 22, 2025

squash #686

fffd732

Signed-off-by: Yuki Huang <yukih@nvidia.com>

ZhiyuLi-Nvidia mentioned this pull request Jul 23, 2025

fix: maintain fp32 mlp.router.expert_bias even with bf16 enabled #674

Closed

ZhiyuLi-Nvidia requested a review from SahilJain314 July 23, 2025 21:01

yuki-97 added a commit that referenced this pull request Jul 28, 2025

squash #686

949f849

Signed-off-by: Yuki Huang <yukih@nvidia.com>

terrykong reviewed Jul 29, 2025

nemo_rl/models/policy/megatron_policy_worker.py (resolved)

yuki-97 and others added 6 commits July 29, 2025 11:26

feat: cache refit_param_info_mcore (#698)

715f522

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

better context manager name

35cffa7

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

update .gitmodules

c3f015a

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

lint

3a21a4e

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

remove unused comment

4eb076f

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

fix tests failure in dtensor

898903a

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia dismissed stale reviews from guyueh1, yuki-97, and yfw via 898903a July 29, 2025 23:20

ZhiyuLi-Nvidia force-pushed the zhiyul/yukih/refit-optimization-minimal-serialization branch from 9f29928 to 898903a July 29, 2025 23:20

lint

0d26765

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia force-pushed the zhiyul/yukih/refit-optimization-minimal-serialization branch from 19703f7 to 0d26765 July 29, 2025 23:34

add nemo_rl/models/generation/vllm_backend.py to pyrefly whitelist

dd74850

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

terrykong enabled auto-merge July 29, 2025 23:43

terrykong approved these changes Jul 29, 2025

terrykong added this pull request to the merge queue Jul 29, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 30, 2025

terrykong added this pull request to the merge queue Jul 30, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 30, 2025

terrykong added this pull request to the merge queue Jul 30, 2025

Merged via the queue into main with commit 7b3fad8 Jul 30, 2025
15 checks passed

terrykong deleted the zhiyul/yukih/refit-optimization-minimal-serialization branch July 30, 2025 20:37

5 participants