Refactor commonalities between two approaches by awgu · Pull Request #62624 · pytorch/pytorch

Conversation

@awgu
Collaborator

@awgu awgu commented Aug 2, 2021

**Overview:**
This refactors some commonalities between the two approaches to overlapping DDP with ZeRO. It also partially addresses this comment: #62157 (comment)
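For context, a deliberately simplified, hypothetical sketch of the refactor pattern: both overlap approaches need the same per-bucket bookkeeping, so that logic can live in a single shared helper that each hook variant calls. None of the names below are the actual PyTorch internals.

```python
from typing import Dict

# Hypothetical sketch only: the shared per-bucket logic lives in one helper
# instead of being duplicated inside each of the two hook variants.

def _bucket_offset_for_rank(offsets: Dict[int, int], bucket_index: int, rank: int) -> int:
    """Common bookkeeping: map a DDP gradient bucket to this rank's shard offset."""
    assert bucket_index in offsets, (
        f"Bucket index {bucket_index} was not assigned to rank {rank}"
    )
    return offsets[bucket_index]

def hook_variant_a(offsets: Dict[int, int], bucket_index: int, rank: int) -> int:
    # First overlap approach: reuses the shared helper.
    return _bucket_offset_for_rank(offsets, bucket_index, rank)

def hook_variant_b(offsets: Dict[int, int], bucket_index: int, rank: int) -> int:
    # Second overlap approach: same helper, different surrounding logic.
    return _bucket_offset_for_rank(offsets, bucket_index, rank)
```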

**Test Plan:**
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```

Stack from ghstack:

Differential Revision: D30058543

@facebook-github-bot
Contributor

facebook-github-bot commented Aug 2, 2021

💊 CI failures summary and remediations

As of commit 30ebafb (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

1 failure not recognized by patterns:

| Job | Step |
| --- | --- |
| GitHub Actions Lint / flake8-py3 | Fail if there were any warnings |

This comment was automatically generated by Dr. CI.

awgu pushed a commit that referenced this pull request Aug 3, 2021
ghstack-source-id: 3bf8cdf
Pull Request resolved: #62624
@awgu
Collaborator Author

awgu commented Aug 3, 2021

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli mrshenli left a comment
Contributor

LGTM! Thanks for improving code readability.

```python
assert bucket_index in overlap_info.offsets, \
    f"Bucket index {bucket_index} was not assigned to rank {rank}"
offset = overlap_info.offsets[bucket_index]
bucket_gradients = bucket.gradients()
```

nit: bucket_gradients seems to be used only once; it looks like we don't need to create a variable for it?
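For illustration, a minimal, self-contained sketch of the suggested inlining; FakeBucket and the function names are hypothetical stand-ins, not the actual hook internals:

```python
class FakeBucket:
    """Stand-in for dist.GradBucket, just for this sketch."""
    def __init__(self, grads):
        self._grads = grads

    def gradients(self):
        return self._grads


def offset_and_grads_verbose(bucket, offsets, bucket_index):
    offset = offsets[bucket_index]
    bucket_gradients = bucket.gradients()  # single-use local
    return offset, bucket_gradients


def offset_and_grads_inlined(bucket, offsets, bucket_index):
    # Per the review nit: the single-use local is inlined at its only call site.
    return offsets[bucket_index], bucket.gradients()


bucket = FakeBucket([1.0, 2.0])
assert offset_and_grads_verbose(bucket, {0: 4}, 0) == offset_and_grads_inlined(bucket, {0: 4}, 0)
```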

```python
bucket: dist.GradBucket,
zero: ZeroRedundancyOptimizer,
rank: int,
rank_to_update: int,
```

owner_rank?

And also, would I be correct to assume that owner_rank is always the same as the source_rank argument in _broadcast_bucket, since the owner of those params should both update and broadcast? If yes, let's consolidate these two args to use the same name.
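For illustration only, a hypothetical sketch of the consolidation being suggested: if the rank that updates a bucket is always the rank that broadcasts it, both call sites can share a single source_rank parameter instead of one using rank_to_update and the other source_rank. Names and signatures below are illustrative, not the actual PyTorch internals:

```python
def perform_local_optimizer_step(bucket_index: int, rank: int, source_rank: int) -> bool:
    """Only the owning rank (source_rank) runs the optimizer step for this bucket."""
    # Placeholder for the real per-bucket optimizer step.
    return rank == source_rank


def broadcast_bucket_params(bucket_index: int, source_rank: int) -> str:
    """The same owning rank then broadcasts the updated parameters."""
    # In the real hook this would broadcast from src=source_rank; here it is a
    # no-op placeholder so the sketch stays self-contained.
    return f"bucket {bucket_index}: broadcast from rank {source_rank}"


# Both helpers now agree on the single name `source_rank`.
if perform_local_optimizer_step(bucket_index=0, rank=0, source_rank=0):
    print(broadcast_bucket_params(bucket_index=0, source_rank=0))
```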

awgu pushed a commit that referenced this pull request Aug 3, 2021
ghstack-source-id: d6043f8
Pull Request resolved: #62624
@awgu
Collaborator Author

awgu commented Aug 3, 2021

@andwgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@andwgu merged this pull request in 43327cc.

@facebook-github-bot deleted the gh/andwgu/14/head branch August 7, 2021 14:17

Labels

cla signed · Merged · oncall: distributed
