[FSDP2] Use reduceOpSum for world size 1 by mori360 · Pull Request #157529 · pytorch/pytorch · GitHub

@mori360 mori360 commented Jul 3, 2025


pytorch-bot bot commented Jul 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157529

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 4f2fa3e with merge base f56bfb3:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels Jul 3, 2025
@mori360 mori360 added the ciflow/trunk label Jul 4, 2025
@mori360 mori360 changed the title Use reduceOpSum for world size 1 [FSDP2] Use reduceOpSum for world size 1 Jul 4, 2025
@mori360 mori360 marked this pull request as ready for review July 4, 2025 05:09
@mori360 mori360 requested review from lw and weifengpy July 4, 2025 05:09

from torch.distributed.distributed_c10d import ReduceOp

model = ModelOfModel()

Why do we invent ModelOfModel for the world_size 1 unit test?

dist.reduce_scatter_tensor(
output=reduce_output_sum, input=reduce_scatter_input, op=ReduceOp.SUM
)
self.assertNotEqual(reduce_output_avg, reduce_scatter_input)

Why do we need to assert NotEqual for avg? It's an NCCL bug.

self.assertEqual(reduce_output_sum, reduce_scatter_input)

model = MLP(4)
fully_shard(model)

I think your previous unit test is right. You just need to replace ModelOfModel with this simple MLP.
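For readers following the review thread, the world-size-1 expectation behind `assertEqual(reduce_output_sum, reduce_scatter_input)` can be sketched without a process group. The snippet below is a plain-Python reference model of reduce-scatter semantics, not the FSDP2 or NCCL implementation; `reduce_scatter_reference` and `pick_reduce_op` are hypothetical names introduced only for illustration. It shows that, mathematically, SUM and AVG coincide when there is a single rank, which is why a backend result that differs for AVG is treated in the review as a bug, and why selecting SUM at world size 1 is a safe substitute.

```python
def reduce_scatter_reference(inputs_per_rank, op="sum"):
    """Reference semantics of reduce-scatter over plain Python lists.

    inputs_per_rank[r] is rank r's full input. All inputs share one length,
    which must be divisible by the number of ranks. Entry r of the result
    is the output shard that rank r would receive.
    """
    world_size = len(inputs_per_rank)
    numel = len(inputs_per_rank[0])
    if numel % world_size != 0:
        raise ValueError("input length must be divisible by world size")

    # Element-wise reduction across ranks.
    reduced = [sum(values) for values in zip(*inputs_per_rank)]
    if op == "avg":
        reduced = [value / world_size for value in reduced]
    elif op != "sum":
        raise ValueError(f"unsupported op: {op}")

    # Scatter: rank r keeps the r-th contiguous shard.
    shard = numel // world_size
    return [reduced[r * shard:(r + 1) * shard] for r in range(world_size)]


def pick_reduce_op(world_size, requested="avg"):
    """Hypothetical helper mirroring the PR title: use SUM with one rank."""
    return "sum" if world_size == 1 else requested


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    # With a single rank, both ops should return the input unchanged.
    print(reduce_scatter_reference([x], op="sum"))
    print(reduce_scatter_reference([x], op="avg"))
    print(pick_reduce_op(1), pick_reduce_op(2))
```

The strings `"sum"` and `"avg"` stand in for `ReduceOp.SUM` and `ReduceOp.AVG`; how the actual change selects the op inside FSDP2 is not shown in this conversation.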

@mori360 mori360 marked this pull request as draft July 8, 2025 02:54
@mori360 mori360 marked this pull request as ready for review July 8, 2025 22:09
@mori360 mori360 requested review from lw and weifengpy July 8, 2025 22:09

mori360 commented Jul 8, 2025

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2)

Details for Dev Infra team Raised by workflow job


mori360 commented Jul 9, 2025

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started



Labels

ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (fsdp)


5 participants