Description
🚀 The feature, motivation and pitch
Motivation
This Request for Comments (RFC) proposes Intel GPU distributed support in PyTorch and opens it for discussion. The work begins with integrating the Intel distributed backend (XCCL) into the PyTorch component torch-xpu-ops and registering it in the PyTorch distributed Python package.
The RFC outlines a high-level design strategy for this integration. NOTE: the device name for Intel GPU in PyTorch is XPU. Therefore, XCCL stands for the XPU collective communications library in this RFC.
Design
1. Intel GPU distributed Backend integration in PyTorch torch-xpu-ops
In the current design, PyTorch distributed uses the c10d::ProcessGroup class as an interface that manages multiple communication backends (each inheriting from c10d::Backend) and provides collective APIs that are dispatched based on device type and backend.
Regarding per-backend implementation, c10d::ProcessGroupNCCL targets the CUDA device with backend name “nccl”. Similarly, we would like to add c10d::ProcessGroupXCCL for the Intel GPU device with a new backend name “xccl”. This design can be visualized as below.
Regarding code structure, the XCCL backend source code will live in PyTorch torch-xpu-ops and will be built into libtorch_xpu.so.
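The dispatch idea described above can be sketched in a few lines of plain Python. This is only a toy model, not PyTorch code: `ToyProcessGroup`, `ToyNCCL`, `ToyXCCL`, and the list-based `allreduce` are hypothetical names invented for illustration, and the real classes are C++ types in the `c10d` namespace.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Toy stand-in for c10d::Backend: a named object exposing collectives."""

    name: str

    @abstractmethod
    def allreduce(self, values_per_rank):
        """Return the reduced value as seen by every rank."""


class _SumBackend(Backend):
    # Simulates a sum allreduce over one value per rank, in a single process.
    def allreduce(self, values_per_rank):
        total = sum(values_per_rank)
        return [total] * len(values_per_rank)


class ToyNCCL(_SumBackend):
    name = "nccl"  # plays the role of c10d::ProcessGroupNCCL


class ToyXCCL(_SumBackend):
    name = "xccl"  # plays the role of the proposed c10d::ProcessGroupXCCL


class ToyProcessGroup:
    """Toy stand-in for c10d::ProcessGroup: routes a collective by device type."""

    def __init__(self):
        self._backends = {}

    def set_backend(self, device_type, backend):
        self._backends[device_type] = backend

    def get_backend(self, device_type):
        try:
            return self._backends[device_type]
        except KeyError:
            raise RuntimeError(f"no backend registered for device {device_type!r}")

    def allreduce(self, device_type, values_per_rank):
        # Dispatch: pick the backend that owns this device type.
        return self.get_backend(device_type).allreduce(values_per_rank)


pg = ToyProcessGroup()
pg.set_backend("cuda", ToyNCCL())
pg.set_backend("xpu", ToyXCCL())
print(pg.get_backend("xpu").name)
print(pg.allreduce("xpu", [1, 2, 3]))
```

The point of the sketch is that adding a new device only requires a new backend class plus one registration; the process-group interface that callers see stays unchanged.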
2. Intel distributed Backend registration in PyTorch distributed package
All XPU components in PyTorch are on par with their CUDA counterparts, as illustrated in the chart below. For example, libtorch.so supports both CUDA and XPU stream/device, libtorch_cuda.so contains CUDA ATen and collective ops, and libtorch_xpu.so contains ATen and collective ops for XPU.
Consequently, we expect the XCCL backend to handle Python binding and backend registration in the same way as the NCCL backend.
- Add the ProcessGroupXCCL Python module binding in torch/csrc/distributed/c10d/init.cpp; this code will be built into libtorch_python.so.
- Register the XCCL name and XCCL backend in the PyTorch Python Backend, including the distributed backend_type_map, backend_capability and default_device_backend_map.
- Add the XCCL name to the native ProcessGroup backend type list.
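The registration steps listed above can be pictured with a small stand-alone model. This is a hedged sketch, not the actual contents of the PyTorch distributed package: `ToyBackendType` and `ToyBackendRegistry` are hypothetical, and only the three map names (`backend_type_map`, `backend_capability`, `default_device_backend_map`) are taken from the RFC text.

```python
from enum import Enum


class ToyBackendType(Enum):
    """Toy stand-in for the native ProcessGroup backend type list."""

    GLOO = "gloo"
    NCCL = "nccl"
    XCCL = "xccl"  # the new entry this RFC proposes


class ToyBackendRegistry:
    """Toy model of the Python-side bookkeeping for distributed backends."""

    def __init__(self):
        self.backend_type_map = {}            # backend name -> native backend type
        self.backend_capability = {}          # backend name -> supported device types
        self.default_device_backend_map = {}  # device type -> default backend name

    def register(self, name, backend_type, devices):
        key = name.lower()
        if key in self.backend_type_map:
            raise ValueError(f"backend {key!r} is already registered")
        self.backend_type_map[key] = backend_type
        self.backend_capability[key] = list(devices)
        for device in devices:
            # The first backend registered for a device becomes its default.
            self.default_device_backend_map.setdefault(device, key)

    def default_backend_for(self, device):
        return self.default_device_backend_map[device]


registry = ToyBackendRegistry()
registry.register("gloo", ToyBackendType.GLOO, ["cpu"])
registry.register("nccl", ToyBackendType.NCCL, ["cuda"])
registry.register("xccl", ToyBackendType.XCCL, ["xpu"])
print(registry.default_backend_for("xpu"))
```

In this model, registering "xccl" touches all three maps at once, which mirrors why the RFC groups them into a single registration step.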
PR Plan
The code changes touch several parts of PyTorch. To keep them clear and easy to review, we will split them into 3 PRs, in the priority order below.
- Backend ProcessGroupXCCL and collective allreduce as an entrypoint to torch-xpu-ops.
- Backend ProcessGroupXCCL Python binding in PyTorch and register XCCL in PyTorch distributed Python package.
- Remaining collectives to torch-xpu-ops.
Alternatives
No response
Additional context
No response