fix: [FSDP2] reshard_after_forward=False for root model by weifengpy · Pull Request #464 · NVIDIA-NeMo/RL · GitHub

Conversation

@weifengpy
Contributor

What does this PR do ?

Hi from PyTorch FSDP2! This PR sets fully_shard(reshard_after_forward=False) to keep the memory behavior the same after the PyTorch-side change: pytorch/pytorch#154704

For the root model, reshard_after_forward=False keeps the root parameters unsharded after forward, since they are used in backward immediately. This is an A/A change (no change in behavior).
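
A minimal sketch (not the exact NeMo-RL diff) of the wrapping order this assumes: inner blocks are sharded with reshard_after_forward=True, then the root module with reshard_after_forward=False. The toy model is illustrative, an initialized process group is assumed, and on torch < 2.6 fully_shard lives under torch.distributed._composable.fsdp instead.

import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # torch >= 2.6 import path


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(128, 128) for _ in range(4))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


model = ToyModel()
for block in model.layers:
    # Inner modules: reshard after forward to free unsharded parameter memory between layers.
    fully_shard(block, reshard_after_forward=True)
# Root module: keep parameters unsharded after forward, since backward uses
# them immediately and would otherwise re-all-gather them right away.
fully_shard(model, reshard_after_forward=False)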

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
uv run python examples/run_grpo_math.py cluster.gpus_per_node=8

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@weifengpy
Contributor Author

Would love to support more from the PyTorch FSDP2 side. @gshennvm @yuki-666 @terrykong @parthchadha

@gshennvm
Contributor

gshennvm commented Jun 2, 2025

Would love to support more from the PyTorch FSDP2 side. @gshennvm @yuki-666 @terrykong @parthchadha

thanks for the contribution! Just for my understanding -- doesn't pytorch set this to False by default already?

from the pytorch docs:

The root FSDP state has its value specially set to False as a heuristic since its parameters would typically be immediately all-gathered for backward.

cc @terrykong on how we can work together for better fsdp2 integration :)

@weifengpy
Contributor Author

weifengpy commented Jun 2, 2025

doesn't pytorch set this to False by default already?

from the pytorch docs:

The root FSDP state has its value specially set to False as a heuristic since its parameters would typically be immediately all-gathered for backward.

I am about to remove that heuristic in a future PyTorch release: pytorch/pytorch#154704

For the root model, if the user sets fully_shard(reshard_after_forward=True):

  • in existing PyTorch releases, we override it to False, which is too implicit
  • in future PyTorch releases, we respect the user's reshard_after_forward=True config

I will also update the docs to recommend reshard_after_forward=False for the root model (see the sketch below).
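
A short sketch of the point above; the root module is illustrative and an initialized process group is assumed:

import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # torch >= 2.6 import path

root = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))

# With reshard_after_forward=True on the root module:
#   * existing releases: the root heuristic silently overrides it to False
#   * after pytorch/pytorch#154704: True is respected, so the root is
#     resharded after forward and all-gathered again at the start of backward
# Passing False explicitly keeps today's memory behavior in both cases.
fully_shard(root, reshard_after_forward=False)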

gshennvm
gshennvm previously approved these changes Jun 2, 2025
Contributor

@gshennvm gshennvm left a comment


Ah, that makes sense. Thanks for the info! Approved.

We can merge once the tests are resolved.

@terrykong terrykong changed the title [FSDP2] reshard_after_forward=False for root model fix: [FSDP2] reshard_after_forward=False for root model Jun 2, 2025
@terrykong
Contributor

Hi @weifengpy. Thanks, could you rebase your commits with --signoff to pass our DCO check?

@weifengpy
Contributor Author

Hi @weifengpy. Thanks, could you rebase your commits with --signoff to pass our DCO check?

Just committed with --signoff

weifengpy added 2 commits June 2, 2025 20:32
Signed-off-by: Wei Feng <weif@meta.com>
Signed-off-by: Wei Feng <weif@meta.com>
@terrykong terrykong enabled auto-merge June 3, 2025 06:11
@terrykong terrykong added this pull request to the merge queue Jun 3, 2025
Merged via the queue into NVIDIA-NeMo:main with commit a1bf952 Jun 3, 2025
13 of 14 checks passed
YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jun 10, 2025
