[FSDP] Use post_reduce_stream.record_event() on hsdp+cpuoffload by mori360 · Pull Request #160481 · pytorch/pytorch · GitHub

Conversation

@mori360
Contributor

@mori360 mori360 commented Aug 12, 2025

Fixes #160291
`post_reduce_stream` is `all_reduce_stream` during HSDP, but the CPU-GPU sync is hard-coded to `reduce_scatter_stream`.
This hard-coding can break the HSDP + CPU offload case, so this PR records the event on `post_reduce_stream` instead and adds a unit test for that configuration.
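Below is a minimal sketch (not the actual FSDP2 internals; the function name, dummy arithmetic, and stream setup are illustrative stand-ins for the real collectives) of why the event has to come from `post_reduce_stream`: under HSDP the reduced gradient is produced on the all-reduce stream, so the device-to-host copy for CPU offload must wait on that stream rather than on `reduce_scatter_stream`.

```python
import torch

def reduce_grad_with_cpu_offload(grad: torch.Tensor, hsdp: bool) -> torch.Tensor:
    """Hypothetical helper illustrating the event ordering; requires a CUDA device."""
    reduce_scatter_stream = torch.cuda.Stream()
    all_reduce_stream = torch.cuda.Stream()
    current = torch.cuda.current_stream()

    # Reduce-scatter runs on its own stream; wait for the producer of `grad`.
    reduce_scatter_stream.wait_stream(current)
    with torch.cuda.stream(reduce_scatter_stream):
        reduced = grad * 0.5  # stand-in for the reduce_scatter collective

    post_reduce_stream = reduce_scatter_stream
    if hsdp:
        # Under HSDP the sharded result is further all-reduced across replicas
        # on a separate stream, which then becomes the post-reduce stream.
        all_reduce_stream.wait_stream(reduce_scatter_stream)
        with torch.cuda.stream(all_reduce_stream):
            reduced = reduced + 0.0  # stand-in for the all_reduce collective
        post_reduce_stream = all_reduce_stream

    # The fix: record on post_reduce_stream. With the event recorded on
    # reduce_scatter_stream instead, the D2H copy below could start while the
    # all_reduce is still running under HSDP.
    post_reduce_event = post_reduce_stream.record_event()
    current.wait_event(post_reduce_event)

    # CPU offload: asynchronous device-to-host copy into pinned memory.
    cpu_grad = torch.empty(reduced.shape, dtype=reduced.dtype,
                           device="cpu", pin_memory=True)
    cpu_grad.copy_(reduced, non_blocking=True)
    return cpu_grad
```

Recording on the wrong stream lets the offload copy overlap the still-running all-reduce, which is consistent with the NaN-gradient symptom reported in #160291.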

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

@pytorch-bot

pytorch-bot bot commented Aug 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160481

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 5002451 with merge base 211c988:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed (distributed oncall triage queue), and release notes: distributed (fsdp) labels Aug 12, 2025
@mori360 mori360 added the ciflow/trunk (Trigger trunk jobs on your pull request) label Aug 12, 2025
@mori360 mori360 changed the title [FSDP] Fix bug on hsdp+cpuoffload [FSDP] Use post_reduce_stream.record_event() on hsdp+cpuoffload Aug 13, 2025
@mori360
Contributor Author

mori360 commented Aug 13, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #160481, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@mori360 mori360 requested a review from weifengpy August 13, 2025 17:19
@mori360 mori360 marked this pull request as ready for review August 13, 2025 17:19
@mori360 mori360 marked this pull request as draft August 15, 2025 00:02
)
model = Transformer(model_args)
ref_model = copy.deepcopy(model)
if device_type == device_type:
Contributor Author


`device_type == device_type` here is always true, so this branch is taken unconditionally.
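For illustration only, a tiny sketch of the issue the comment flags; the comparison against a `"cuda"` literal in the second check is an assumed example of an intended guard, not taken from the PR:

```python
device_type = "cuda"  # hypothetical value for illustration

# Comparing a variable with itself is always true, so the guarded block
# runs regardless of which device is actually in use.
assert (device_type == device_type) is True

# A check against a concrete value is presumably what such a guard intends
# (assumption for illustration):
if device_type == "cuda":
    pass  # device-specific setup would go here
```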

@mori360 mori360 marked this pull request as ready for review August 15, 2025 17:56
@mori360 mori360 marked this pull request as draft August 18, 2025 22:32
@mori360
Contributor Author

mori360 commented Aug 18, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job failed: inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team. Raised by workflow job.

@mori360
Contributor Author

mori360 commented Aug 19, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
…rch#160481)

Fixes pytorch#160291
`post_reduce_stream` is `all_reduce_stream` during HSDP, but CPU-GPU sync is hard coded to `reduce_scatter_stream`
The hard-code could fail unit test on HSDP+CPU offload, add unit test here.

Pull Request resolved: pytorch#160481
Approved by: https://github.com/weifengpy
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…rch#160481)

Fixes pytorch#160291
`post_reduce_stream` is `all_reduce_stream` during HSDP, but CPU-GPU sync is hard coded to `reduce_scatter_stream`
The hard-code could fail unit test on HSDP+CPU offload, add unit test here.

Pull Request resolved: pytorch#160481
Approved by: https://github.com/weifengpy

Labels

ciflow/inductor · ciflow/trunk (Trigger trunk jobs on your pull request) · Merged · oncall: distributed (Add this issue/PR to distributed oncall triage queue) · release notes: distributed (fsdp) (release notes category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FSDP2] Bug: NaN gradient when both HSDP and CPU offload are enabled

3 participants