
[RFC][API-Unstable] Intel GPU distributed Backend integration in torch-xpu-ops and registration in PyTorch #141741

@zhangxiaoli73


🚀 The feature, motivation and pitch

Motivation

This Request for Comments (RFC) proposes and discusses Intel GPU distributed support in PyTorch. The initiative begins with integrating the Intel distributed backend (XCCL) into the PyTorch component torch-xpu-ops and registering it in the PyTorch distributed Python package.
The RFC outlines a high-level design strategy for this integration. NOTE: the device name for Intel GPU in PyTorch is XPU; therefore, XCCL stands for the XPU collective communications library throughout this RFC.

Design

1. Intel GPU distributed Backend integration in PyTorch torch-xpu-ops

In the current design, PyTorch distributed uses the c10d::ProcessGroup class as an interface that manages multiple communication backends (each inheriting from c10d::Backend) and provides collective APIs that are dispatched based on device type and backend.

Regarding the per-backend implementation, c10d::ProcessGroupNCCL targets CUDA devices under the backend name “nccl”. Similarly, we would like to add c10d::ProcessGroupXCCL for Intel GPU devices under the new backend name “xccl”. This design is visualized below.

(Figure P1)
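For illustration, the sketch below shows how the dispatch described above would look from Python once ProcessGroupXCCL exists. The "xccl" name and the "xpu:xccl" mapping are what this RFC proposes, not something available in current releases, and the single-process rendezvous values are only placeholders.

```python
# Minimal sketch, assuming the proposed "xccl" backend is registered.
# Collectives dispatch by the device type of the input tensor: XPU
# tensors go to ProcessGroupXCCL, CPU tensors go to ProcessGroupGloo.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # placeholder rendezvous
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="cpu:gloo,xpu:xccl",                    # device-to-backend mapping
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)

cpu_t = torch.ones(4)                               # routed to the gloo backend
xpu_t = torch.ones(4, device="xpu")                 # routed to the xccl backend
dist.all_reduce(cpu_t)
dist.all_reduce(xpu_t)
dist.destroy_process_group()
```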

Regarding code structure, the XCCL backend source code will live in PyTorch torch-xpu-ops and will be built into libtorch_xpu.so.

(Figure P2)

2. Intel distributed backend registration in the PyTorch distributed package

All XPU components in PyTorch are on par with their CUDA counterparts, as illustrated in the chart below. For example, libtorch.so supports both CUDA and XPU streams/devices, libtorch_cuda.so contains the CUDA ATen and collective ops, and libtorch_xpu.so contains the ATen and collective ops for XPU.
(Figure P3)
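As a small illustration of this parity on the Python side, the same device/stream pattern applies to both backends. The sketch below assumes a PyTorch build with XPU support and assumes that torch.xpu mirrors the torch.cuda stream API; it is not part of this RFC's deliverables.

```python
# Sketch: the same stream/device pattern works for CUDA and XPU builds,
# because the generic runtime lives in libtorch.so while the ATen kernels
# come from libtorch_cuda.so / libtorch_xpu.so respectively.
import torch

def run_on(device_type: str) -> None:
    mod = getattr(torch, device_type)          # torch.cuda or torch.xpu
    if not (hasattr(mod, "is_available") and mod.is_available()):
        print(f"{device_type}: not available in this build")
        return
    s = mod.Stream()                           # backend-specific stream
    with mod.stream(s):                        # assumes torch.xpu.stream mirrors torch.cuda.stream
        x = torch.randn(8, device=device_type)
        y = (x * 2).sum()
    s.synchronize()
    print(device_type, y.item())

for dev in ("cuda", "xpu"):
    run_on(dev)
```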

Consequently, we expect the XCCL backend to handle Python binding and backend registration in the same way as the NCCL backend.

  1. Add the ProcessGroupXCCL Python module binding in torch/csrc/distributed/c10d/init.cpp; this code will be built into libtorch_python.so.
  2. Register the XCCL name and XCCL backend in the PyTorch Python Backend class, including the distributed backend_type_map, backend_capability, and default_device_backend_map (see the sketch after this list).
  3. Add the XCCL name to the native ProcessGroup backend type list.
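A hedged sketch of what step 2 means on the Python side: after registration, the Backend class maps named above should carry "xccl" entries alongside the existing "nccl" ones. The attribute names come from this RFC; the xpu/xccl entries describe the expected end state rather than current behavior.

```python
# Sketch: inspect the registration maps on torch.distributed.Backend.
# After this RFC lands, the printed maps are expected to contain
# entries such as "xpu" -> "xccl" and "xccl" -> ["xpu"].
import torch.distributed as dist

Backend = dist.Backend
print(Backend.default_device_backend_map)  # e.g. {"cpu": "gloo", "cuda": "nccl", "xpu": "xccl"}
print(Backend.backend_capability)          # e.g. {"nccl": ["cuda"], "xccl": ["xpu"], ...}
print(Backend.backend_type_map)            # backend name -> native ProcessGroup.BackendType
```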

PR Plan

The code changes touch several parts of PyTorch. To keep reviews clear and concise, we will split the changes into 3 PRs, prioritized as below.

  • Backend ProcessGroupXCCL and the collective allreduce as an entrypoint into torch-xpu-ops (a smoke-test sketch follows this list).
  • ProcessGroupXCCL Python binding in PyTorch and registration of XCCL in the PyTorch distributed Python package.
  • Remaining collectives in torch-xpu-ops.
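As an illustration of what the first PR enables, a minimal allreduce smoke test might look like the sketch below, launched with torchrun. It assumes the proposed "xccl" backend and an XPU-enabled PyTorch build; the file name and launch command are hypothetical.

```python
# smoke_test_allreduce_xpu.py — minimal sketch, assuming the proposed
# "xccl" backend; launch with e.g.:
#   torchrun --nproc-per-node=2 smoke_test_allreduce_xpu.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.xpu.set_device(rank % torch.xpu.device_count())
    dist.init_process_group(backend="xccl", rank=rank, world_size=world_size)

    # Each rank contributes its rank id; the sum over all ranks is
    # world_size * (world_size - 1) / 2.
    t = torch.full((4,), float(rank), device="xpu")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size - 1) / 2
    assert torch.allclose(t.cpu(), torch.full((4,), expected))

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```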

Alternatives

No response

Additional context

No response

cc @gujinghui @EikanWang @fengyuan14 @guangyey
