Relax tensor contiguity requirement for P2P ops #114982
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114982
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 Unrelated Failures)
As of commit bf80d3f with merge base 624f202:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```
self.assertEqual(send_tensor, recv_tensor)

# Test with non-contiguous tensors.
send_tensor_view = send_tensor.t()
```
Can we keep this test and assert that the values are as expected on rank 1?
Good point! When we confirm that we can officially support this use case, we should add that test back.
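If that test is added back, here is a minimal sketch of what it could look like, assuming a two-rank NCCL process group is already initialized; the helper name `_test_send_recv_noncontiguous` and the shapes are illustrative, not the actual test code:

```
import torch
import torch.distributed as dist

def _test_send_recv_noncontiguous(rank: int):
    # Hypothetical test body; assumes dist.init_process_group("nccl", ...)
    # was called with world_size == 2 and one GPU per rank.
    device = torch.device(f"cuda:{rank}")
    send_tensor = torch.arange(6.0, device=device).reshape(2, 3)
    # .t() returns a non-contiguous view over the same storage.
    send_tensor_view = send_tensor.t()
    if rank == 0:
        dist.send(send_tensor_view, dst=1)
    else:
        # Receive into a view with the same (non-contiguous) layout,
        # then assert the values are what rank 0 sent.
        recv_tensor = torch.zeros(2, 3, device=device)
        recv_tensor_view = recv_tensor.t()
        dist.recv(recv_tensor_view, src=0)
        expected = torch.arange(6.0, device=device).reshape(2, 3).t()
        torch.testing.assert_close(recv_tensor_view, expected)
```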
## Description
This example requires PyTorch PR pytorch/pytorch#114982 to work, because stage 0 and stage 2 seem to be transmitting non-contiguous tensors.

## Test
https://gist.github.com/kwen2501/33fa5723496992691f8b1cc7daaadd89
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Needs pytorch/pytorch#114982 to work.
```
BlenderbotForCausalLM(
  (model): BlenderbotDecoderWrapper(
    (decoder): BlenderbotDecoder(
      (embed_tokens): Embedding(8008, 2560, padding_idx=0)
      (embed_positions): BlenderbotLearnedPositionalEmbedding(128, 2560)
      (layers): ModuleList(
        (0-23): 24 x BlenderbotDecoderLayer(
          (self_attn): BlenderbotAttention(
            (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (out_proj): Linear(in_features=2560, out_features=2560, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): BlenderbotAttention(
            (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (out_proj): Linear(in_features=2560, out_features=2560, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
          (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
    )
  )
  (lm_head): Linear(in_features=2560, out_features=8008, bias=False)
)
```
Needs pytorch/pytorch#114982 to work.
```
PLBartForCausalLM(
  (model): PLBartDecoderWrapper(
    (decoder): PLBartDecoder(
      (embed_tokens): Embedding(50005, 768, padding_idx=1)
      (embed_positions): PLBartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x PLBartDecoderLayer(
          (self_attn): PLBartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): PLBartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
  )
  (lm_head): Linear(in_features=768, out_features=50005, bias=False)
)
```
Requires pytorch/pytorch#114982 to work.
```
TrOCRForCausalLM(
  (model): TrOCRDecoderWrapper(
    (decoder): TrOCRDecoder(
      (embed_tokens): Embedding(50265, 1024, padding_idx=1)
      (embed_positions): TrOCRLearnedPositionalEmbedding(514, 1024)
      (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x TrOCRDecoderLayer(
          (self_attn): TrOCRAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): TrOCRAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (output_projection): Linear(in_features=1024, out_features=50265, bias=False)
)
```
I hit the following error when performing pipeline parallel for T5:
```
return default_pg.send([tensor], dst, tag)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Tensors must be contiguous
```
In theory, we shouldn't require the tensors to be contiguous, especially for P2P ops, because the transfer is just a bit-wise copy of the tensor's underlying data.
Thus, this PR relaxes the requirement and instead calls out that it is the user's responsibility to guarantee that the source and destination tensors have the same contiguity setting.
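For illustration, a minimal sketch of the relaxed contract from the user's side, assuming the NCCL backend, a two-rank process group already initialized, and this PR applied; variable names here are illustrative:

```
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already run with world_size == 2.
rank = dist.get_rank()
x = torch.randn(4, 8, device=f"cuda:{rank}")
view = x.t()  # non-contiguous view over the same storage
assert not view.is_contiguous()

if rank == 0:
    # With the relaxed check, the transposed view can be passed to send()
    # directly; the bytes of its underlying storage are transmitted as-is.
    dist.send(view, dst=1)
else:
    # The receiving buffer must have the same layout for values to land in
    # the right places; otherwise, stage through a contiguous buffer and
    # copy_() into the view afterwards (also the workaround on older
    # versions that still raise "Tensors must be contiguous").
    buf = torch.empty(4, 8, device=f"cuda:{rank}").t()
    dist.recv(buf, src=0)
```

The trade-off is that the runtime no longer validates layouts for you: if the strides on the two ends don't match, the received values can end up in the wrong positions without any error.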
Pull Request resolved: pytorch#114982
Approved by: https://github.com/H-Huang
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @kiukchung @d4l3k @LucasLLC