Register Intel distributed Backend (`XCCL`) in PyTorch distributed package by zhangxiaoli73 · Pull Request #141856 · pytorch/pytorch · GitHub

Conversation

@zhangxiaoli73
Contributor

@zhangxiaoli73 zhangxiaoli73 commented Dec 2, 2024

Motivation:

As illustrated in the design in the Intel distributed support RFC #141741, two pieces of work are needed to enable Intel distributed backend (XCCL) support in PyTorch:

  1. Intel GPU distributed backend integration in PyTorch torch-xpu-ops.
  2. Intel distributed backend registration in the PyTorch distributed package. This PR contributes the change for item 2.

Example:

Here is a simple example that uses spawn to launch the XCCL backend and perform an allreduce on XPU tensors.

```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # Rendezvous over TCP on localhost; the default process group is created
    # without naming a backend explicitly.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    # Each spawned process drives one XPU device and contributes one tensor.
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```
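
If you want to request the backend explicitly rather than rely on device-based selection, the registered name can presumably be passed to init_process_group; a minimal sketch, assuming the backend is registered under the name "xccl":

```
# Sketch: explicitly select the newly registered backend by name
# (assumes the registered backend name is "xccl").
dist.init_process_group(backend="xccl", rank=rank, world_size=world_size)
```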

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Dec 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141856

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (4 Unrelated Failures)

As of commit 9667909 with merge base 5d36224:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Dec 2, 2024
@cpuhrsch cpuhrsch requested a review from wconstab December 3, 2024 23:49
@cpuhrsch cpuhrsch added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 3, 2024
@gujinghui
Collaborator

gujinghui commented Dec 4, 2024

Addressing the RFC #141741

@gujinghui gujinghui requested a review from kwen2501 December 4, 2024 05:27
Contributor

@kwen2501 kwen2501 left a comment

LGTM.
Ideally we wouldn't need the pybind as part of the registration, but that comes from the limitation of how we create backends today -- at the Python level. We can host the code as is, while taking some time to figure out a better solution.
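
For context, a rough sketch of the Python-level mechanism referred to here, i.e. registering a third-party backend creator with torch.distributed (names below are illustrative, not the actual XCCL registration code):

```
import torch.distributed as dist

def _create_example_backend(store, rank, world_size, timeout):
    # Would construct and return the backend's process-group object,
    # typically a pybind-exposed C++ class -- hence the pybind dependency
    # mentioned above.
    raise NotImplementedError

# Register the creator under a backend name so that
# init_process_group(backend="example_ccl") can locate it.
dist.Backend.register_backend(
    "example_ccl",            # hypothetical backend name
    _create_example_backend,
    devices=["xpu"],
)
```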

@guangyey guangyey added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks release notes: xpu release notes category labels Dec 5, 2024
@gujinghui
Collaborator

gujinghui commented Dec 5, 2024

> LGTM. Ideally we wouldn't need the pybind as part of the registration, but that comes from the limitation of how we create backends today -- at the Python level. We can host the code as is, while taking some time to figure out a better solution.

Sounds great! Let's keep it as is for now, and start thinking about how to refine the registration process. Thanks.

@zhangxiaoli73 please fix the CI issues.

@guangyey
Collaborator

guangyey commented Dec 5, 2024

@zhangxiaoli73 you need to add xccl to

```
class ProcessGroup:
    class BackendType(Enum):
        UNDEFINED = ...
        GLOO = ...
        NCCL = ...
        UCC = ...
        MPI = ...
        CUSTOM = ...
```

@zhangxiaoli73 zhangxiaoli73 force-pushed the cherry/register-xccl-in-pytorch branch from e1cc5c7 to 018473b on December 5, 2024 07:51
@zhangxiaoli73
Contributor Author

> @zhangxiaoli73 you need to add xccl to
>
> ```
> class ProcessGroup:
>     class BackendType(Enum):
>         UNDEFINED = ...
>         GLOO = ...
>         NCCL = ...
>         UCC = ...
>         MPI = ...
>         CUSTOM = ...
> ```

Got it. Missing this entry causes a lint check failure. Fix done.
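
For reference, the resulting stub plausibly looks like the following once xccl is added (a sketch; the exact contents of the .pyi in the PR may differ):

```
class ProcessGroup:
    class BackendType(Enum):
        UNDEFINED = ...
        GLOO = ...
        NCCL = ...
        UCC = ...
        MPI = ...
        XCCL = ...
        CUSTOM = ...
```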

@guangyey
Collaborator

guangyey commented Dec 6, 2024

@zhangxiaoli73 you need to add the full name torch.distributed.distributed_c10d.is_xccl_available next to

.. autofunction:: is_nccl_available

in the docs to fix the doc issue.

@guangyey
Collaborator

guangyey commented Dec 6, 2024

"rebase to the latest viable/strict to fix XPU build issue."
@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cherry/register-xccl-in-pytorch onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cherry/register-xccl-in-pytorch && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the cherry/register-xccl-in-pytorch branch from 018473b to 532ade1 on December 6, 2024 02:06
@zhangxiaoli73 zhangxiaoli73 force-pushed the cherry/register-xccl-in-pytorch branch from 532ade1 to 796812a on December 6, 2024 02:10
@zhangxiaoli73
Contributor Author

"rebase to the latest viable/strict to fix XPU build issue."
@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #141856, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

"is_initialized",
"is_mpi_available",
"is_nccl_available",
"is_xccl_available",
Collaborator

Adding new entries to this file is not ok.
Given you added the function to the .rst below, this should not be needed unless something is wrong with the doc setup there.

Collaborator

@guangyey guangyey Dec 7, 2024

@zhangxiaoli73, please add the full name torch.distributed.distributed_c10d.is_xccl_available to the doc and remove is_xccl_available from this list. We will request review from @albanD once the update is finished.

Contributor Author

@albanD Thanks for the reminder. Reverted the change in this file. Please help review.
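
As background, the new availability check is meant to be used like the existing ones; a minimal sketch, assuming is_xccl_available is re-exported at the torch.distributed level alongside is_nccl_available:

```
import torch.distributed as dist

# Sketch: choose a collective backend based on what this build supports.
if dist.is_xccl_available():
    backend = "xccl"
elif dist.is_nccl_available():
    backend = "nccl"
else:
    backend = "gloo"
```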

@zhangxiaoli73 zhangxiaoli73 requested a review from albanD December 9, 2024 02:31
Collaborator

@albanD albanD left a comment

LGTM, thanks for the update!

@zhangxiaoli73
Contributor Author

@pytorchbot merge -i

@pytorch-bot

pytorch-bot bot commented Dec 10, 2024

-i flag is only allowed for users with write permissions

@zhangxiaoli73
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
Register Intel distributed Backend (`XCCL`) in PyTorch distributed package (pytorch#141856)


Pull Request resolved: pytorch#141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
pbchekin pushed a commit to intel/intel-xpu-backend-for-triton that referenced this pull request Mar 21, 2025
With "Register Intel distributed Backend (XCCL) in PyTorch distributed package" (pytorch/pytorch#141856), XCCL has been enabled natively in PyTorch.
With "Fix when USE_C10D_XCCL define in pytorch will not correct update cache" (torch-xpu-ops#1441), USE_XCCL is the build option that turns on XCCL.
Users of Intel Triton need XCCL enabled to run distributed inference and to further test distributed kernels.

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
ciflow/xpu (Run XPU CI tasks)
Merged
module: cpu (CPU specific problem, e.g. perf, algorithm)
module: dynamo
module: inductor
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
release notes: distributed (c10d) (release notes category)
release notes: quantization (release notes category)
release notes: xpu (release notes category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
