[FSDP] Use post_reduce_stream.record_event() on hsdp+cpuoffload by mori360 · Pull Request #160481 · pytorch/pytorch · GitHub

Conversation

@mori360
Contributor

@mori360 mori360 commented Aug 12, 2025

Fixes #160291
`post_reduce_stream` is `all_reduce_stream` during HSDP, but the CPU-GPU sync is hard-coded to `reduce_scatter_stream`.
This hard-coding can break the HSDP + CPU offload case, so this PR records the event on `post_reduce_stream` instead and adds a unit test for that configuration.
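Below is a minimal sketch (not the actual FSDP2 internals; the function name, dummy arithmetic, and stream setup are illustrative stand-ins for the real collectives) of why the event has to come from `post_reduce_stream`: under HSDP the reduced gradient is produced on the all-reduce stream, so the device-to-host copy for CPU offload must wait on that stream rather than on `reduce_scatter_stream`.

```python
import torch

def reduce_grad_with_cpu_offload(grad: torch.Tensor, hsdp: bool) -> torch.Tensor:
    """Hypothetical helper illustrating the event ordering; requires a CUDA device."""
    reduce_scatter_stream = torch.cuda.Stream()
    all_reduce_stream = torch.cuda.Stream()
    current = torch.cuda.current_stream()

    # Reduce-scatter runs on its own stream; wait for the producer of `grad`.
    reduce_scatter_stream.wait_stream(current)
    with torch.cuda.stream(reduce_scatter_stream):
        reduced = grad * 0.5  # stand-in for the reduce_scatter collective

    post_reduce_stream = reduce_scatter_stream
    if hsdp:
        # Under HSDP the sharded result is further all-reduced across replicas
        # on a separate stream, which then becomes the post-reduce stream.
        all_reduce_stream.wait_stream(reduce_scatter_stream)
        with torch.cuda.stream(all_reduce_stream):
            reduced = reduced + 0.0  # stand-in for the all_reduce collective
        post_reduce_stream = all_reduce_stream

    # The fix: record on post_reduce_stream. With the event recorded on
    # reduce_scatter_stream instead, the D2H copy below could start while the
    # all_reduce is still running under HSDP.
    post_reduce_event = post_reduce_stream.record_event()
    current.wait_event(post_reduce_event)

    # CPU offload: asynchronous device-to-host copy into pinned memory.
    cpu_grad = torch.empty(reduced.shape, dtype=reduced.dtype,
                           device="cpu", pin_memory=True)
    cpu_grad.copy_(reduced, non_blocking=True)
    return cpu_grad
```

Recording on the wrong stream lets the offload copy overlap the still-running all-reduce, which is consistent with the NaN-gradient symptom reported in #160291.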

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

@pytorch-bot

pytorch-bot bot commented Aug 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160481

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 5002451 with merge base 211c988:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed (distributed oncall triage queue), and release notes: distributed (fsdp) labels Aug 12, 2025
@mori360 mori360 added the ciflow/trunk (Trigger trunk jobs on your pull request) label Aug 12, 2025
@mori360 mori360 changed the title [FSDP] Fix bug on hsdp+cpuoffload [FSDP] Use post_reduce_stream.record_event() on hsdp+cpuoffload Aug 13, 2025
@mori360
Contributor Author

mori360 commented Aug 13, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #160481, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@mori360 mori360 requested a review from weifengpy August 13, 2025 17:19
@mori360 mori360 marked this pull request as ready for review August 13, 2025 17:19
@mori360 mori360 marked this pull request as draft August 15, 2025 00:02
)
model = Transformer(model_args)
ref_model = copy.deepcopy(model)
if device_type == device_type:
Contributor Author


`device_type == device_type` here is always true, so this branch is taken unconditionally.
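For illustration only, a tiny sketch of the issue the comment flags; the comparison against a `"cuda"` literal in the second check is an assumed example of an intended guard, not taken from the PR:

```python
device_type = "cuda"  # hypothetical value for illustration

# Comparing a variable with itself is always true, so the guarded block
# runs regardless of which device is actually in use.
assert (device_type == device_type) is True

# A check against a concrete value is presumably what such a guard intends
# (assumption for illustration):
if device_type == "cuda":
    pass  # device-specific setup would go here
```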

@mori360 mori360 marked this pull request as ready for review August 15, 2025 17:56
@mori360 mori360 marked this pull request as draft August 18, 2025 22:32
@mori360
Contributor Author

mori360 commented Aug 18, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job failed: inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team. Raised by workflow job.

@mori360
Contributor Author

mori360 commented Aug 19, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
…rch#160481)

Fixes pytorch#160291
`post_reduce_stream` is `all_reduce_stream` during HSDP, but CPU-GPU sync is hard coded to `reduce_scatter_stream`
The hard-code could fail unit test on HSDP+CPU offload, add unit test here.

Pull Request resolved: pytorch#160481
Approved by: https://github.com/weifengpy
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…rch#160481)

Fixes pytorch#160291
`post_reduce_stream` is `all_reduce_stream` during HSDP, but CPU-GPU sync is hard coded to `reduce_scatter_stream`
The hard-code could fail unit test on HSDP+CPU offload, add unit test here.

Pull Request resolved: pytorch#160481
Approved by: https://github.com/weifengpy

Labels

ciflow/inductor · ciflow/trunk (Trigger trunk jobs on your pull request) · Merged · oncall: distributed (Add this issue/PR to distributed oncall triage queue) · release notes: distributed (fsdp) (release notes category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FSDP2] Bug: NaN gradient when both HSDP and CPU offload are enabled

3 participants