[dtensor] support convolution ops by KingsleyLiu-NV · Pull Request #113123 · pytorch/pytorch · GitHub

Conversation

@KingsleyLiu-NV
Contributor

@KingsleyLiu-NV KingsleyLiu-NV commented Nov 7, 2023

This PR creates a prototype for training convolutional neural networks based on DTensor.

  • Register required ops and implement operator dispatch
  • Add unit tests and example

Basically, in this prototype we shard the activations and replicate the model weights. With this approach we can scale out to multiple GPUs, reduce the per-GPU memory footprint, and achieve weak scaling in training performance (i.e., time per iteration).
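
To make the sharding scheme concrete, here is a minimal, hedged sketch (not the PR's actual test code) of distributing a conv2d with DTensor: the activations get a Shard placement and the weights/bias a Replicate placement, so each rank computes the convolution on its local shard. The mesh size, shapes, and the choice of sharding dim 3 (width) are illustrative assumptions; the script is meant to run under torchrun with one process per GPU.

import torch
import torch.nn.functional as F
from torch.distributed._tensor import DeviceMesh, Shard, Replicate, distribute_tensor

# assumes torch.distributed is already initialized (e.g. via torchrun + NCCL)
mesh = DeviceMesh("cuda", list(range(2)))        # 1-D mesh over 2 GPUs (illustrative)

x = torch.randn(8, 64, 56, 56)                   # activations
w = torch.randn(128, 64, 3, 3)                   # conv weight
b = torch.randn(128)                             # conv bias

# shard activations along a spatial dim, replicate the parameters;
# distribute_tensor places the local shards on the mesh's device
x_dt = distribute_tensor(x, mesh, [Shard(3)])
w_dt = distribute_tensor(w, mesh, [Replicate()])
b_dt = distribute_tensor(b, mesh, [Replicate()])

# with the convolution ops registered by this PR, this call dispatches through
# DTensor and each rank only sees its local shard of the input
out = F.conv2d(x_dt, w_dt, b_dt, padding=1)
print(out.shape, out.placements)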

Reference log (on 2xA100 GPU):

Unit Test

root@luna-prod-78-80gb:/pytorch# python3 test/distributed/_tensor/test_convolution_ops.py
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (Triggered internally at /opt/conda/conda-bld/pytorch_1699257304556/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2170.)
  return F.conv2d(input, weight, bias, self.stride,
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (Triggered internally at /opt/conda/conda-bld/pytorch_1699257304556/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2170.)
  return F.conv2d(input, weight, bias, self.stride,
..
----------------------------------------------------------------------
Ran 2 tests in 30.354s

OK
root@luna-prod-78-80gb:/pytorch# python3 test/distributed/_tensor/test_other_ops.py
[rank0]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank0]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank1]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank1]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
...
----------------------------------------------------------------------
Ran 3 tests in 16.343s

OK

ConvNeXt Example

root@luna-prod-78-80gb:/pytorch# python3 torch/distributed/_tensor/examples/convnext_example.py
rank 3, 20 iterations, latency     584.80 ms, forward     102.84 ms, backward     297.80 ms, max reserved    16.34 GiB, max allocated    14.75 GiB
rank 1, 20 iterations, latency     584.64 ms, forward     104.85 ms, backward     297.60 ms, max reserved    16.40 GiB, max allocated    14.74 GiB
rank 0, 20 iterations, latency     584.48 ms, forward     104.64 ms, backward     297.90 ms, max reserved    16.39 GiB, max allocated    14.75 GiB
rank 2, 20 iterations, latency     584.96 ms, forward      93.21 ms, backward     297.95 ms, max reserved    16.40 GiB, max allocated    14.74 GiB

@wanchaol @fduwjj FYI

cc @wanchaol @XilunWu

@pytorch-bot

pytorch-bot bot commented Nov 7, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113123

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b6a6728 with merge base 7963aaa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 7, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@Aidyn-A Aidyn-A added the topic: new features, module: dtensor, and release notes: distributed (dtensor) labels Nov 8, 2023
@Aidyn-A Aidyn-A requested a review from wanchaol November 9, 2023 03:57
Collaborator

@wanchaol wanchaol left a comment

Thanks for contributing! The convolution op is quite important and also hard to implement in the distributed tensor context, so I'm glad you already got this working :)

I have a few inline comments about how the ops (conv and the other ops) are currently implemented. The main thing is that we should see whether the convolution op can be implemented like other ops, without special-casing to do pre/post communication.

Collaborator

Cool! I am wondering if you'd be willing to spend some time rewriting the ops to the "strategy"-based approach? We recently changed the way ops are implemented, and going forward we want all ops implemented with the strategy-based approach; there are some examples.

It's also fine to land some of these ops using the existing approach you have, but we'll need to refactor them later. We can chat in more detail about how the new approach works if you want.

Contributor Author

I'd prefer to keep the current approach and refactor it later.

Collaborator

I am trying to see whether we should make these batch send/recv calls a type of redistribute/resharding, so that instead of having a custom convolution op implemented and special-cased in torch_dispatch, we should try to see if:

  • for the case of input data exchange, we can make these input send/recvs a data-order permutation (i.e., something like mesh [0, 1, 2, 3] to mesh [1, 2, 3, 0]); see the sketch after this list for what the send/recvs achieve.
  • I see duplicate comms happening in the backward op too; we should merge these two into a common function.
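
To make the intent of those batch send/recvs concrete, here is an illustrative, hedged sketch (not the code in this PR) of a halo exchange built on torch.distributed.batch_isend_irecv: each rank holds a width-shard of the activation and trades halo-width boundary columns with its neighbours before running the local convolution. The ring-style neighbour wrap-around is a simplifying assumption; boundary ranks in a real implementation would rely on zero padding instead.

import torch
import torch.distributed as dist

def exchange_halo(local_shard: torch.Tensor, halo: int) -> torch.Tensor:
    """Append the neighbours' boundary columns to a width-sharded activation."""
    rank, world = dist.get_rank(), dist.get_world_size()
    left, right = (rank - 1) % world, (rank + 1) % world   # ring neighbours (assumption)

    send_left = local_shard[..., :halo].contiguous()       # my leftmost columns -> left rank
    send_right = local_shard[..., -halo:].contiguous()     # my rightmost columns -> right rank
    recv_left = torch.empty_like(send_left)                # left rank's rightmost columns
    recv_right = torch.empty_like(send_right)              # right rank's leftmost columns

    ops = [
        dist.P2POp(dist.isend, send_left, left),
        dist.P2POp(dist.isend, send_right, right),
        dist.P2POp(dist.irecv, recv_left, left),
        dist.P2POp(dist.irecv, recv_right, right),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()

    # the padded shard now has enough context for the local conv window
    return torch.cat([recv_left, local_shard, recv_right], dim=-1)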

Contributor Author

I have merged the communications into a common function for the conv forward/backward.

Regarding replacing the batch send/recv calls with redistribute/resharding, we can continue that discussion, but I don't think it will happen in this PR.

Collaborator

It's a bit hard to follow what this distributed communication is doing; could you add some comments to explain what this batch send/recv is trying to achieve for the conv input?

Contributor Author

Sure, I have sent you the slides covering the implementation details in the Slack channel.

Collaborator

Given that those "other ops" are not specifically related to convolution, it would be better to separate the other-ops enablement from convolution itself, with its own tests.

Contributor Author

Can you elaborate on this? Currently we need to register slice_backward, bernoulli_, nll_loss_forward, and nll_loss_backward within other_ops.py to run the TP training example convnext_example.py.

Would it be acceptable to keep other_ops.py as it is and add some unit tests in a new file test_other_ops.py within test/distributed/_tensor?
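
As a rough illustration of what a case in such a test_other_ops.py could look like (a hedged sketch using the existing DTensor test utilities; the specific op, shapes, and assertion here are assumptions, not the test ultimately added in this PR):

import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor
from torch.testing._internal.common_utils import run_tests
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    with_comms,
)


class TestOtherOps(DTensorTestBase):
    @with_comms
    def test_bernoulli_(self):
        mesh = DeviceMesh(self.device_type, list(range(self.world_size)))
        # shard a tensor of probabilities and sample in place; bernoulli_ is one
        # of the ops this PR registers in other_ops.py
        dt = distribute_tensor(torch.full((8, 8), 0.5), mesh, [Shard(0)])
        dt.bernoulli_()
        local = dt.to_local()
        self.assertTrue(((local == 0) | (local == 1)).all())


if __name__ == "__main__":
    run_tests()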

Contributor Author

I have added unit tests for the other ops and included the reference log in this PR's description.

@ezyang ezyang added the triaged label Nov 9, 2023
Comment on lines 125 to 93
Contributor

@wanchaol shouldn't this resharding happen in _operator_dispatch?

Contributor Author

I have moved it to convolution_backward_handler to avoid polluting the dispatch logic.

Comment on lines 306 to 370
Contributor

@wanchaol what can we do to avoid this customized DTensor op?

Contributor Author

I have moved it to convolution_backward_handler to avoid polluting the dispatch logic.

Contributor

@XilunWu XilunWu left a comment

Thanks for the hard work! I'm working on backward layer norm support; I hope it's not blocking you right now. I left some questions.

Collaborator

@wanchaol wanchaol left a comment

Thanks for fixing all the lint errors! Please see inlined comments.

I think the other changes look good. My major comment now is that we shouldn't hijack the dispatch logic like this in dispatch.py; given that we are doing custom op handling, let's just make convolution a custom op handler and keep the custom logic inside tp_conv.py.

Collaborator

@wanchaol wanchaol left a comment

LGTM, I have some small suggestions inlined; please address them before landing. Thanks for contributing!

@wanchaol
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 20, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.
