
[RFC][API-Unstable] Intel GPU distributed Backend integration in torch-xpu-ops and registration in PyTorch #141741

@zhangxiaoli73


🚀 The feature, motivation and pitch

Motivation

This Request for Comments (RFC) proposes and discusses Intel GPU distributed support in PyTorch. The initiative begins with integrating the Intel distributed backend (XCCL) into the PyTorch component torch-xpu-ops and registering it in the PyTorch distributed Python package.
The RFC outlines a high-level design strategy for this integration. NOTE: the device name for Intel GPU in PyTorch is XPU; therefore, XCCL stands for the XPU collective communications library throughout this RFC.

Design

1. Intel GPU distributed Backend integration in PyTorch torch-xpu-ops

In the current design, PyTorch distributed uses the c10d::ProcessGroup class as an interface that manages multiple communication backends (each inheriting from c10d::Backend) and provides collective APIs that are dispatched based on device type and backend.

Regarding the per-backend implementation, c10d::ProcessGroupNCCL targets CUDA devices under the backend name “nccl”. Similarly, we would like to add c10d::ProcessGroupXCCL for Intel GPU devices under the new backend name “xccl”. This design is visualized below.

(Figure P1)
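For illustration, the sketch below shows how the dispatch described above would look from Python once ProcessGroupXCCL exists. The "xccl" name and the "xpu:xccl" mapping are what this RFC proposes, not something available in current releases, and the single-process rendezvous values are only placeholders.

```python
# Minimal sketch, assuming the proposed "xccl" backend is registered.
# Collectives dispatch by the device type of the input tensor: XPU
# tensors go to ProcessGroupXCCL, CPU tensors go to ProcessGroupGloo.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # placeholder rendezvous
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="cpu:gloo,xpu:xccl",                    # device-to-backend mapping
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)

cpu_t = torch.ones(4)                               # routed to the gloo backend
xpu_t = torch.ones(4, device="xpu")                 # routed to the xccl backend
dist.all_reduce(cpu_t)
dist.all_reduce(xpu_t)
dist.destroy_process_group()
```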

Regarding code structure, the XCCL backend source code will live in PyTorch torch-xpu-ops and will be built into libtorch_xpu.so.

(Figure P2)

2. Intel distributed backend registration in the PyTorch distributed package

All XPU components in PyTorch are on par with their CUDA counterparts, as illustrated in the chart below. For example, libtorch.so supports both CUDA and XPU streams/devices, libtorch_cuda.so contains the CUDA ATen and collective ops, and libtorch_xpu.so contains the ATen and collective ops for XPU.
(Figure P3)
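As a small illustration of this parity on the Python side, the same device/stream pattern applies to both backends. The sketch below assumes a PyTorch build with XPU support and assumes that torch.xpu mirrors the torch.cuda stream API; it is not part of this RFC's deliverables.

```python
# Sketch: the same stream/device pattern works for CUDA and XPU builds,
# because the generic runtime lives in libtorch.so while the ATen kernels
# come from libtorch_cuda.so / libtorch_xpu.so respectively.
import torch

def run_on(device_type: str) -> None:
    mod = getattr(torch, device_type)          # torch.cuda or torch.xpu
    if not (hasattr(mod, "is_available") and mod.is_available()):
        print(f"{device_type}: not available in this build")
        return
    s = mod.Stream()                           # backend-specific stream
    with mod.stream(s):                        # assumes torch.xpu.stream mirrors torch.cuda.stream
        x = torch.randn(8, device=device_type)
        y = (x * 2).sum()
    s.synchronize()
    print(device_type, y.item())

for dev in ("cuda", "xpu"):
    run_on(dev)
```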

Consequently, we expect the XCCL backend to handle Python binding and backend registration in the same way as the NCCL backend.

  1. Add the ProcessGroupXCCL Python module binding in torch/csrc/distributed/c10d/init.cpp; this code will be built into libtorch_python.so.
  2. Register the XCCL name and XCCL backend in the PyTorch Python Backend class, including the distributed backend_type_map, backend_capability, and default_device_backend_map (see the sketch after this list).
  3. Add the XCCL name to the native ProcessGroup backend type list.
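A hedged sketch of what step 2 means on the Python side: after registration, the Backend class maps named above should carry "xccl" entries alongside the existing "nccl" ones. The attribute names come from this RFC; the xpu/xccl entries describe the expected end state rather than current behavior.

```python
# Sketch: inspect the registration maps on torch.distributed.Backend.
# After this RFC lands, the printed maps are expected to contain
# entries such as "xpu" -> "xccl" and "xccl" -> ["xpu"].
import torch.distributed as dist

Backend = dist.Backend
print(Backend.default_device_backend_map)  # e.g. {"cpu": "gloo", "cuda": "nccl", "xpu": "xccl"}
print(Backend.backend_capability)          # e.g. {"nccl": ["cuda"], "xccl": ["xpu"], ...}
print(Backend.backend_type_map)            # backend name -> native ProcessGroup.BackendType
```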

PR Plan

The code changes touch several parts of PyTorch. To keep reviews clear and concise, we will split the changes into 3 PRs, prioritized as below.

  • Backend ProcessGroupXCCL and the collective allreduce as an entrypoint into torch-xpu-ops (a smoke-test sketch follows this list).
  • ProcessGroupXCCL Python binding in PyTorch and registration of XCCL in the PyTorch distributed Python package.
  • Remaining collectives in torch-xpu-ops.
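As an illustration of what the first PR enables, a minimal allreduce smoke test might look like the sketch below, launched with torchrun. It assumes the proposed "xccl" backend and an XPU-enabled PyTorch build; the file name and launch command are hypothetical.

```python
# smoke_test_allreduce_xpu.py — minimal sketch, assuming the proposed
# "xccl" backend; launch with e.g.:
#   torchrun --nproc-per-node=2 smoke_test_allreduce_xpu.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.xpu.set_device(rank % torch.xpu.device_count())
    dist.init_process_group(backend="xccl", rank=rank, world_size=world_size)

    # Each rank contributes its rank id; the sum over all ranks is
    # world_size * (world_size - 1) / 2.
    t = torch.full((4,), float(rank), device="xpu")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size - 1) / 2
    assert torch.allclose(t.cpu(), torch.full((4,), expected))

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```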

Alternatives

No response

Additional context

No response

cc @gujinghui @EikanWang @fengyuan14 @guangyey
