[Reland2][DDP] Merge work and future_work in reducer by wayi1 · Pull Request #59574 · pytorch/pytorch

Conversation

@wayi1 (Contributor) commented Jun 7, 2021

Stack from ghstack:

Remove the `work` attribute from the `Reducer` class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace the plain allreduce and also support handling uneven inputs.

  1. Compared with the reverted [DDP] Merge work and future_work in reducer #58937, `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` is updated to apply the division first and hence avoid FP16 overflow (see the sketch below).

  2. Compared with the reverted [Reland][DDP] Merge work and future_work in reducer #59520, `test_DistributedDataParallel_non_default_stream` is disabled on AMD, because applying the division first now hurts the gradient-averaging accuracy on AMD.
     See [07:48:26]:
     https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/1129/console

#Original PR Issue: #41266

Differential Revision: [D28940800](https://our.internmc.facebook.com/intern/diff/D28940800/)

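The following is a minimal Python-level sketch of the divide-then-allreduce pattern described above. It is an illustration only: the actual change is the C++ hook `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp`, and the hook/bucket names below follow the public DDP comm-hook API of recent PyTorch releases (the exact `GradBucket` accessor name has varied across versions).

```python
import torch.distributed as dist

def allreduce_div_first_hook(process_group, bucket):
    # Divide the bucketed gradients by the world size BEFORE the allreduce,
    # so an FP16 sum across ranks never leaves the FP16 range.
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()

    tensor = bucket.buffer()   # flattened gradients for this bucket
    tensor.div_(world_size)    # divide first to avoid FP16 overflow

    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])

# Hypothetical usage on a DistributedDataParallel model:
#   model.register_comm_hook(state=None, hook=allreduce_div_first_hook)
# With uneven inputs (the DDP join API), the effective divide factor may differ
# from the static world size; this sketch uses the static size for simplicity.
```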
@facebook-github-bot (Contributor) commented Jun 7, 2021

💊 CI failures summary and remediations

As of commit 65c1c2a (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

1 failure not recognized by patterns:

Job: GitHub Actions / Label PRs & Issues / auto-label-rocm
Step: Unknown
Action: rerun


wayi1 pushed a commit that referenced this pull request Jun 7, 2021
ghstack-source-id: 130752393
Pull Request resolved: #59574
@facebook-github-bot added the oncall: distributed label Jun 7, 2021
@rohan-varma (Contributor) left a comment

LGTM to unblock for now; it would be great to file an issue to investigate why it fails on ROCm. Thanks!

@agolynski (Contributor) commented:
A couple of questions:

  1. Wondering why it is called `_AllReduceCommHookWithDivFactor`; are you planning to deprecate/remove it later?
  2. Are we changing the computation logic here, or is this just a refactoring PR (e.g., why did the overflow problem only surface in this PR)?

@wayi1 (Contributor, Author) commented Jun 7, 2021

@jithunnair-amd `test_DistributedDataParallel_non_default_stream` is disabled on AMD because, when we now compute the average gradients, we first divide the local gradient by the group size and then sum the local gradients up, in order to prevent overflow in the FP16 range. However, this has caused a non-trivial discrepancy in the averaged output on AMD. The same test passes on NVIDIA GPUs.

See [07:48:26]:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/1129/console
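A small standalone illustration (not part of the PR; the values are made up to sit near the FP16 maximum of about 65504) of why the order of division and summation matters in FP16:

```python
import torch

world_size = 4
# Hypothetical per-worker gradients close to the FP16 maximum (~65504).
local_grads = [torch.full((3,), 20000.0, dtype=torch.float16) for _ in range(world_size)]

# Sum first, then divide: the running sum passes 65504 and overflows to inf.
summed = local_grads[0].clone()
for g in local_grads[1:]:
    summed += g
avg_sum_first = summed / world_size   # tensor([inf, inf, inf], dtype=torch.float16)

# Divide first, then sum (what the new hook does): every addend stays in range.
avg_div_first = sum(g / world_size for g in local_grads)   # tensor([20000., 20000., 20000.])

print(avg_sum_first, avg_div_first)
```

Dividing first also changes the per-addend rounding, which is presumably the accuracy difference the disabled ROCm test is sensitive to.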

@facebook-github-bot (Contributor) commented:
This pull request has been merged in 6575975.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Pull Request resolved: pytorch#59574

ghstack-source-id: 130752393

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork --  test_DistributedDataParallel_non_default_stream

Reviewed By: rohan-varma

Differential Revision: D28940800

fbshipit-source-id: 1ba727ac951ebc1e7875dc1a1be8108a2c8d9462
@facebook-github-bot deleted the gh/SciPioneer/144/head branch June 11, 2021 14:17
facebook-github-bot pushed a commit that referenced this pull request Jul 9, 2021
…n when no comm hook is specified (#61379)

Summary:
Pull Request resolved: #61379

The optimization was accidentally removed in #59574.

This optimization saves a scan over all the input parameters by fusing the copy and div operations.

Now the default temporary hook is an allreduce by sum, and no extra division is done inside the hook.
ghstack-source-id: 133288529

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork --  test_DistributedDataParallel_non_default_stream

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_sparse_gradient

buck test mode/dev-nosan caffe2/test/distributed:c10 -- test_ddp_checkpointing_once
buck test mode/dev-nosan caffe2/test/distributed:c10 -- test_ddp_checkpointing_twice

Reviewed By: rohan-varma

Differential Revision: D29597614

fbshipit-source-id: 2434e4fd4e6abad7871cfe47886fe97b6e4ba28f
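A rough single-process sketch of the fused copy-and-divide idea described in that commit (an illustration only; `bucket_view` and the shapes are hypothetical stand-ins for the reducer's flattened bucket buffer):

```python
import torch

world_size = 8
grad = torch.randn(1_000_000)
bucket_view = torch.empty_like(grad)   # stands in for a view into the flattened bucket

# Unfused: two passes over the data -- copy into the bucket, then divide in place.
bucket_view.copy_(grad)
bucket_view.div_(world_size)

# Fused: a single pass that divides while writing into the bucket view.
torch.div(grad, world_size, out=bucket_view)
```

With no comm hook registered, the division can happen during this copy, so the default hook can remain a plain allreduce-by-sum with no extra division (and no extra scan) inside the hook.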
@taozhiwei (Contributor) commented:
May I ask for your advice: in reducer.cpp, `bucket.future_work->wait()` is called instead of `work->wait()`. How does this ensure that the computation stream waits on the NCCL stream?
