Register Intel distributed Backend (`XCCL`) in PyTorch distributed package by zhangxiaoli73 · Pull Request #141856 · pytorch/pytorch · GitHub

Conversation

@zhangxiaoli73
Contributor

@zhangxiaoli73 zhangxiaoli73 commented Dec 2, 2024

Motivation:

As illustrated in the design in the Intel distributed support RFC #141741, two pieces of work are needed to enable Intel distributed backend (XCCL) support in PyTorch:

  1. Intel GPU distributed backend integration in PyTorch torch-xpu-ops.
  2. Intel distributed backend registration in the PyTorch distributed package. This PR contributes the change for item 2.

Example:

Here is a simple example that uses spawn to launch the XCCL backend and perform an allreduce on XPU tensors.

```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # Rendezvous over TCP on localhost; the default process group is created
    # without naming a backend explicitly.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    # Each spawned process drives one XPU device and contributes one tensor.
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```
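
If you want to request the backend explicitly rather than rely on device-based selection, the registered name can presumably be passed to init_process_group; a minimal sketch, assuming the backend is registered under the name "xccl":

```
# Sketch: explicitly select the newly registered backend by name
# (assumes the registered backend name is "xccl").
dist.init_process_group(backend="xccl", rank=rank, world_size=world_size)
```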

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Dec 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141856

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (4 Unrelated Failures)

As of commit 9667909 with merge base 5d36224:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Dec 2, 2024
@cpuhrsch cpuhrsch requested a review from wconstab December 3, 2024 23:49
@cpuhrsch cpuhrsch added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 3, 2024
@gujinghui
Collaborator

gujinghui commented Dec 4, 2024

Addressing the RFC #141741

@gujinghui gujinghui requested a review from kwen2501 December 4, 2024 05:27
Contributor

@kwen2501 kwen2501 left a comment

LGTM.
Ideally we wouldn't need the pybind as part of the registration, but that comes from the limitation of how we create backends today -- at the Python level. We can host the code as is, while taking some time to figure out a better solution.
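
For context, a rough sketch of the Python-level mechanism referred to here, i.e. registering a third-party backend creator with torch.distributed (names below are illustrative, not the actual XCCL registration code):

```
import torch.distributed as dist

def _create_example_backend(store, rank, world_size, timeout):
    # Would construct and return the backend's process-group object,
    # typically a pybind-exposed C++ class -- hence the pybind dependency
    # mentioned above.
    raise NotImplementedError

# Register the creator under a backend name so that
# init_process_group(backend="example_ccl") can locate it.
dist.Backend.register_backend(
    "example_ccl",            # hypothetical backend name
    _create_example_backend,
    devices=["xpu"],
)
```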

@guangyey guangyey added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks release notes: xpu release notes category labels Dec 5, 2024
@gujinghui
Collaborator

gujinghui commented Dec 5, 2024

> LGTM. Ideally we wouldn't need the pybind as part of the registration, but that comes from the limitation of how we create backends today -- at the Python level. We can host the code as is, while taking some time to figure out a better solution.

Sounds great! Let's keep it as is for now, and start thinking about how to refine the registration process. Thanks.

@zhangxiaoli73 please fix the CI issues.

@guangyey
Collaborator

guangyey commented Dec 5, 2024

@zhangxiaoli73 you need to add xccl to

```
class ProcessGroup:
    class BackendType(Enum):
        UNDEFINED = ...
        GLOO = ...
        NCCL = ...
        UCC = ...
        MPI = ...
        CUSTOM = ...
```

@zhangxiaoli73 zhangxiaoli73 force-pushed the cherry/register-xccl-in-pytorch branch from e1cc5c7 to 018473b on December 5, 2024 07:51
@zhangxiaoli73
Contributor Author

> @zhangxiaoli73 you need to add xccl to
>
> ```
> class ProcessGroup:
>     class BackendType(Enum):
>         UNDEFINED = ...
>         GLOO = ...
>         NCCL = ...
>         UCC = ...
>         MPI = ...
>         CUSTOM = ...
> ```

Got it. Missing this entry causes a lint check failure. Fix done.
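
For reference, the resulting stub plausibly looks like the following once xccl is added (a sketch; the exact contents of the .pyi in the PR may differ):

```
class ProcessGroup:
    class BackendType(Enum):
        UNDEFINED = ...
        GLOO = ...
        NCCL = ...
        UCC = ...
        MPI = ...
        XCCL = ...
        CUSTOM = ...
```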

@guangyey
Collaborator

guangyey commented Dec 6, 2024

@zhangxiaoli73 you need to add the full name torch.distributed.distributed_c10d.is_xccl_available next to

.. autofunction:: is_nccl_available

in the docs to fix the doc issue.

@guangyey
Collaborator

guangyey commented Dec 6, 2024

"rebase to the latest viable/strict to fix XPU build issue."
@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cherry/register-xccl-in-pytorch onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cherry/register-xccl-in-pytorch && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the cherry/register-xccl-in-pytorch branch from 018473b to 532ade1 on December 6, 2024 02:06
@zhangxiaoli73 zhangxiaoli73 force-pushed the cherry/register-xccl-in-pytorch branch from 532ade1 to 796812a on December 6, 2024 02:10
@zhangxiaoli73
Contributor Author

"rebase to the latest viable/strict to fix XPU build issue."
@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #141856, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

"is_initialized",
"is_mpi_available",
"is_nccl_available",
"is_xccl_available",
Collaborator

Adding new entries to this file is not ok.
Given you added the function to the .rst below, this should not be needed unless something is wrong with the doc setup there.

Collaborator

@guangyey guangyey Dec 7, 2024

@zhangxiaoli73, please add the full name torch.distributed.distributed_c10d.is_xccl_available to the doc and remove is_xccl_available from this list. We will request review from @albanD once the update is finished.

Contributor Author

@albanD Thanks for the reminder. Reverted the change in this file. Please help review.
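
As background, the new availability check is meant to be used like the existing ones; a minimal sketch, assuming is_xccl_available is re-exported at the torch.distributed level alongside is_nccl_available:

```
import torch.distributed as dist

# Sketch: choose a collective backend based on what this build supports.
if dist.is_xccl_available():
    backend = "xccl"
elif dist.is_nccl_available():
    backend = "nccl"
else:
    backend = "gloo"
```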

@zhangxiaoli73 zhangxiaoli73 requested a review from albanD December 9, 2024 02:31
Collaborator

@albanD albanD left a comment

LGTM, thanks for the update!

@zhangxiaoli73
Contributor Author

@pytorchbot merge -i

@pytorch-bot

pytorch-bot bot commented Dec 10, 2024

-i flag is only allowed for users with write permissions

@zhangxiaoli73
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
Register Intel distributed Backend (`XCCL`) in PyTorch distributed package (pytorch#141856)


Pull Request resolved: pytorch#141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
pbchekin pushed a commit to intel/intel-xpu-backend-for-triton that referenced this pull request Mar 21, 2025
With "Register Intel distributed Backend (XCCL) in PyTorch distributed package" (pytorch/pytorch#141856), XCCL has been enabled natively in PyTorch.
With "Fix when USE_C10D_XCCL define in pytorch will not correct update cache" (torch-xpu-ops#1441), USE_XCCL is the build option that turns on XCCL.
Users of Intel Triton need XCCL enabled to run distributed inference and to further test distributed kernels.

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
ciflow/xpu (Run XPU CI tasks)
Merged
module: cpu (CPU specific problem, e.g. perf, algorithm)
module: dynamo
module: inductor
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
release notes: distributed (c10d) (release notes category)
release notes: quantization (release notes category)
release notes: xpu (release notes category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
