[FSDP2] Use reduceOpSum for world size 1 #157529

mori360 · 2025-07-03T06:22:17Z

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot · 2025-07-03T06:22:22Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157529

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 4f2fa3e with merge base f56bfb3:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / unit-test / cuda12.8-py3.10-gcc9-sm86 / test (inductor_cpp_wrapper, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (disabled by #126867 but the issue was closed recently and a rebase is needed to make it pass)
inductor/test_max_autotune.py::TestMaxAutotune::test_non_contiguous_input_mm_plus_mm

This comment was automatically generated by Dr. CI and updates every 15 minutes.

weifengpy · 2025-07-04T15:05:07Z

test/distributed/_composable/fsdp/test_fully_shard_comm.py

+
+        from torch.distributed.distributed_c10d import ReduceOp
+
+        model = ModelOfModel()


Why do we invent ModelOfModel for a world_size 1 unit test?

weifengpy · 2025-07-07T21:46:36Z

test/distributed/_composable/fsdp/test_fully_shard_comm.py

+        dist.reduce_scatter_tensor(
+            output=reduce_output_sum, input=reduce_scatter_input, op=ReduceOp.SUM
+        )
+        self.assertNotEqual(reduce_output_avg, reduce_scatter_input)


Why do we need to assertNotEqual for avg? It's an NCCL bug.
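For context on the comment above, here is a minimal pure-Python sketch of why SUM can stand in for AVG on a single rank: in a reference (bug-free) reduce-scatter, both ops return the input unchanged when the world size is 1. The names `Op`, `simulated_reduce_scatter`, and `pick_reduce_op` are hypothetical stand-ins for illustration. They are not the PR's code, and this does not call `torch.distributed`.

```python
from enum import Enum
from typing import List


class Op(Enum):
    """Hypothetical stand-in for torch.distributed.ReduceOp (illustration only)."""

    SUM = "sum"
    AVG = "avg"


def simulated_reduce_scatter(
    inputs_per_rank: List[List[float]], op: Op
) -> List[List[float]]:
    """Reference reduce-scatter over in-memory lists (no NCCL involved).

    Every rank contributes a list of the same length. Element i of the
    reduced vector is the sum (or mean) over ranks, and rank r keeps the
    r-th contiguous chunk of that reduced vector.
    """
    world_size = len(inputs_per_rank)
    length = len(inputs_per_rank[0])
    assert length % world_size == 0, "input must split evenly across ranks"

    # Elementwise reduction across ranks.
    reduced = [sum(rank_input[i] for rank_input in inputs_per_rank) for i in range(length)]
    if op is Op.AVG:
        reduced = [value / world_size for value in reduced]

    # Scatter: each rank receives one contiguous chunk.
    chunk = length // world_size
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def pick_reduce_op(world_size: int) -> Op:
    """Hypothetical helper mirroring the PR title: prefer SUM on a single rank."""
    return Op.SUM if world_size == 1 else Op.AVG


if __name__ == "__main__":
    single_rank = [[1.0, 2.0, 3.0, 4.0]]
    print(simulated_reduce_scatter(single_rank, Op.SUM))
    print(simulated_reduce_scatter(single_rank, Op.AVG))
    print(pick_reduce_op(1), pick_reduce_op(2))
```

Because the two ops agree mathematically on one rank, a test only needs to check that the SUM output equals the input. Any difference seen with AVG would come from the backend, not from the math.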

weifengpy · 2025-07-07T21:47:09Z

test/distributed/_composable/fsdp/test_fully_shard_comm.py

+        self.assertEqual(reduce_output_sum, reduce_scatter_input)
+
+        model = MLP(4)
+        fully_shard(model)


I think your previous unit test is right. You just need to replace ModelOfModel with this simple MLP.

mori360 · 2025-07-08T23:37:56Z

@pytorchmergebot merge

pytorchmergebot · 2025-07-08T23:40:09Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2025-07-09T00:27:53Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2)

Details for Dev Infra team

Raised by workflow job

mori360 · 2025-07-09T18:01:23Z

@pytorchmergebot merge

pytorchmergebot · 2025-07-09T18:03:11Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

use reduceopSum for world size 1

b325f88

pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels Jul 3, 2025

mori360 added 2 commits July 3, 2025 08:47

Update _fsdp_collectives.py

883e511

unit test

64df17e

mori360 added the ciflow/trunk label Jul 4, 2025

lint

b449a35

mori360 changed the title ~~Use reduceOpSum for world size 1~~ [FSDP2] Use reduceOpSum for world size 1 Jul 4, 2025

mori360 marked this pull request as ready for review July 4, 2025 05:09

mori360 requested review from lw and weifengpy July 4, 2025 05:09

Skylion007 approved these changes Jul 4, 2025


lw approved these changes Jul 4, 2025


weifengpy reviewed Jul 4, 2025


change the test

fa2ed9d

weifengpy reviewed Jul 7, 2025


mori360 marked this pull request as draft July 8, 2025 02:54

mori360 added 3 commits July 7, 2025 20:06

Merge branch 'main' of github.com:mori360/pytorch into size1_reduceop

4e175bb

change test

eed77ca

remove #

4f2fa3e

mori360 marked this pull request as ready for review July 8, 2025 22:09

mori360 requested review from lw and weifengpy July 8, 2025 22:09

weifengpy approved these changes Jul 8, 2025


pytorchmergebot added the merging label Jul 8, 2025

pytorchmergebot removed the merging label Jul 9, 2025

pytorchmergebot added the merging label Jul 9, 2025

pytorchmergebot added the Merged label Jul 9, 2025

pytorchmergebot closed this in 81c7445 Jul 9, 2025

pytorchmergebot removed the merging label Jul 9, 2025

5 participants

