[C10D] support group_src/dst in broadcast/reduce ops by wconstab · Pull Request #140843 · pytorch/pytorch · GitHub

@wconstab wconstab commented Nov 15, 2024

Also add mypy annotations

[ghstack-poisoned]

pytorch-bot bot commented Nov 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140843

✅ No Failures

As of commit 8307988 with merge base b379a28 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Also add mypy annotations

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 d4l3k c-p-i-o

[ghstack-poisoned]
@wconstab wconstab mentioned this pull request Nov 16, 2024
Also add mypy annotations

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460

[ghstack-poisoned]

@kwen2501 kwen2501 left a comment

Code looks a lot cleaner now! Great work!

if group_rank:
    c10d.broadcast(x, group_src=1, group=subgroup)
else:
    c10d.broadcast(x, src=self.rank + 1, group=subgroup)

@kwen2501 kwen2501 Nov 18, 2024

nit: not introduced by this PR, but src=self.rank + 1 could cause a collective mismatch if someone scales the world size for this test suite.

Contributor Author

Is it? I saw a hardcoded 'world_size=4' and an 'if self.rank >= world_size: return', so I thought it was correct, even though it insists on using exactly 4 GPUs rather than being flexible to all available GPUs.
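
To illustrate the difference between a global src and a group-relative group_src discussed in this thread, here is a minimal pure-Python sketch. It models a subgroup as a plain list of global ranks. The helper names `to_global_rank` and `to_group_rank` and the subgroup layout are assumptions made for illustration; they are not the PyTorch API or the layout used by the test.

```python
from typing import Sequence


def to_global_rank(group_ranks: Sequence[int], group_rank: int) -> int:
    """Translate a group-relative rank into a global rank (hypothetical helper)."""
    if not 0 <= group_rank < len(group_ranks):
        raise ValueError(
            f"group rank {group_rank} is out of range for {list(group_ranks)}"
        )
    return group_ranks[group_rank]


def to_group_rank(group_ranks: Sequence[int], global_rank: int) -> int:
    """Translate a global rank into its position in the group (hypothetical helper)."""
    if global_rank not in group_ranks:
        raise ValueError(
            f"global rank {global_rank} is not a member of {list(group_ranks)}"
        )
    return list(group_ranks).index(global_rank)


# Assumed layout for illustration: 4 global ranks split into two subgroups.
subgroups = [[0, 1], [2, 3]]
for ranks in subgroups:
    # group_src=1 always means "the second member of my subgroup",
    # whichever global ranks that subgroup happens to contain.
    print(ranks, "-> global src", to_global_rank(ranks, 1))
```

Under this assumed layout, the same group-relative value names a valid member of every subgroup, whereas a global src has to be computed per subgroup, and every member of a subgroup must arrive at the same value.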

Comment on lines 3351 to 3352
global_src = _canonicalize_group_rank(group, src, group_src, return_global=True)
group_src = _canonicalize_group_rank(group, src, group_src, return_global=False)
Contributor

Remind me why we need both?

Contributor Author

Well, the existing code actually uses both. I prioritized changing as little of the existing code as possible, to avoid introducing mistakes. It might be possible to avoid having both.

Contributor Author

Wait, you're right, I think I can delete group_src here. It seems totally unused. Maybe I refactored more later and forgot to clean up.
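
The two calls quoted in this thread differ only in return_global. As a rough sketch of what a helper of that shape might do, assume it accepts exactly one of the two rank forms and converts to whichever form the caller requests. This is an illustration over a plain list of global ranks, not the code from this PR; `canonicalize_rank` and `group_ranks` are hypothetical names.

```python
from typing import Optional, Sequence


def canonicalize_rank(
    group_ranks: Sequence[int],
    global_rank: Optional[int] = None,
    group_rank: Optional[int] = None,
    return_global: bool = False,
) -> int:
    """Return the rank in the requested form (hypothetical sketch).

    Exactly one of global_rank / group_rank must be provided.
    """
    if (global_rank is None) == (group_rank is None):
        raise ValueError("specify exactly one of global_rank or group_rank")

    if group_rank is not None:
        # Caller gave a group-relative rank.
        if not 0 <= group_rank < len(group_ranks):
            raise ValueError(f"group rank {group_rank} is out of range")
        return group_ranks[group_rank] if return_global else group_rank

    # Caller gave a global rank.
    if global_rank not in group_ranks:
        raise ValueError(f"global rank {global_rank} is not in the group")
    if return_global:
        return global_rank
    return list(group_ranks).index(global_rank)


# Assumed subgroup made of global ranks 4, 5 and 6.
ranks = [4, 5, 6]
print(canonicalize_rank(ranks, group_rank=2, return_global=True))
print(canonicalize_rank(ranks, global_rank=5, return_global=False))
```

With a helper like this, a caller that only needs the global form can make a single call with return_global=True, which matches the conclusion reached in the thread that the second call was unnecessary.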

@kwen2501 kwen2501 left a comment

Change looks good.

pytorchmergebot pushed a commit that referenced this pull request Nov 19, 2024
…0847)

Also add mypy annotations

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460

Pull Request resolved: #140847
Approved by: https://github.com/H-Huang
ghstack dependencies: #140843
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Also add mypy annotations

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in pytorch#140460
Pull Request resolved: pytorch#140843
Approved by: https://github.com/kwen2501
Esquains pushed a commit to Esquains/study1 that referenced this pull request Dec 15, 2024
Also add mypy annotations

ghstack-source-id: 1776a64
Pull Request resolved: pytorch/pytorch#140843
@github-actions github-actions bot deleted the gh/wconstab/365/head branch December 19, 2024 02:11

Labels

Merged
oncall: distributed
release notes: distributed (c10d)
