Register Intel distributed Backend (XCCL) in PyTorch distributed package
#141856
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141856
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV — there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (4 Unrelated Failures)
As of commit 9667909 with merge base 5d36224:
FLAKY — the following jobs failed, but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Addressing the RFC #141741.
LGTM.
Ideally we wouldn't need the pybind as part of the registration, but that comes from a limitation of how we create backends today -- at the Python level. We can host the code as is while taking some time to figure out a better solution.

Sounds great! Let's keep it as is, then, and start thinking about how to refine the registration process. Thanks. @zhangxiaoli73, please fix the CI issues.
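For readers following this thread: the Python-level creation path being discussed is the third-party backend hook, `torch.distributed.Backend.register_backend`. A minimal hedged sketch of that path (the backend name and the placeholder creator below are illustrative, not the actual XCCL registration):

```python
import torch.distributed as dist

def _create_backend(store, rank, world_size, timeout):
    # A real creator builds and returns the backend implementation here
    # (this is where the pybind'd C++ object enters the picture).
    raise NotImplementedError("illustrative placeholder only")

# Registering makes the name usable with init_process_group(backend=...).
dist.Backend.register_backend(
    "demo_ccl",        # hypothetical backend name for illustration
    _create_backend,   # Python callable -- hence the pybind dependency
    devices=["xpu"],   # device type(s) the backend claims to support
)
```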
@zhangxiaoli73 you need to add xccl to pytorch/torch/_C/_distributed_c10d.pyi (lines 298 to 306 in f675f64).
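The referenced stub lines are not reproduced in this thread. As a hedged illustration only (this assumes a `BackendType`-style enum, and the member values are made up, not the actual contents of torch/_C/_distributed_c10d.pyi), the requested change is a one-line addition of this shape:

```python
from enum import Enum

# Illustrative sketch -- values and surrounding members are assumptions,
# not the actual contents of torch/_C/_distributed_c10d.pyi.
class BackendType(Enum):
    UNDEFINED = 0
    GLOO = 1
    NCCL = 2
    XCCL = 3  # new entry so type checkers know about the Intel backend
```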
(force-pushed e1cc5c7 to 018473b)
Got it. This will cause a lint check failure. Fix done.
@zhangxiaoli73 you need to add the full name (pytorch/docs/source/distributed.rst, line 197 in ce22a01).
"rebase to the latest viable/strict to fix XPU build issue." |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased.
(force-pushed 018473b to 532ade1, then 532ade1 to 796812a)
"rebase to the latest viable/strict to fix XPU build issue."

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
docs/source/conf.py (Outdated)

    "is_initialized",
    "is_mpi_available",
    "is_nccl_available",
    "is_xccl_available",
Adding new entries to this file is not ok.
Given you added the function to the .rst below, this should not be needed unless something is wrong with the doc setup there.
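For context, a hedged sketch of the assumed surrounding structure in docs/source/conf.py (the exact list name and contents are an assumption based on the snippet above, not a quote of the file):

```python
# Assumed context in docs/source/conf.py -- names listed here are skipped
# by the Sphinx doc-coverage check rather than documented.
coverage_ignore_functions = [
    "is_initialized",
    "is_mpi_available",
    "is_nccl_available",
    # "is_xccl_available" is NOT added here; documenting the function in
    # docs/source/distributed.rst satisfies the coverage check instead.
]
```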
@zhangxiaoli73, please add the full name torch.distributed.distributed_c10d.is_xccl_available to the doc and remove is_xccl_available from this list. Request review from @albanD once we finish the update.
@albanD Thanks for the reminder. Reverted the change in this file. Please help review.
LGTM, thanks for the update!
@pytorchbot merge -i

@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ckage (pytorch#141856)

### Motivation:

As illustrated in the Intel distributed support RFC (pytorch#141741), two pieces are needed to enable Intel distributed backend (`XCCL`) support in PyTorch:

1. Intel GPU distributed backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed backend registration in the PyTorch distributed package.**

This PR contributes the change for item 2.

### Example:

Here is a simple example of using spawn to launch the XCCL backend and perform an allreduce on XPU tensors.

```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # Rendezvous settings for the default (env://) init method.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    # Each rank works on its own XPU device.
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```

Pull Request resolved: pytorch#141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
With "Register Intel distributed Backend (XCCL) in PyTorch distributed package" (pytorch/pytorch#141856), XCCL has been enabled natively in PyTorch. With "Fix when USE_C10D_XCCL define in pytorch will not correct update cache" (torch-xpu-ops#1441), USE_XCCL is the build option that turns XCCL on. Users of Intel Triton need XCCL enabled to run distributed inference and to further test distributed kernels.
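As a quick sanity check before running the all-reduce example above, a small hedged sketch (this assumes a PyTorch build compiled with USE_XCCL and an Intel GPU runtime present):

```python
import torch
import torch.distributed as dist

# Both checks assume a build compiled with USE_XCCL and an XPU device.
print("XCCL available:", dist.is_xccl_available())
print("XPU available:", torch.xpu.is_available())

# The spawn example relies on backend auto-detection; once registered, the
# backend can also be requested explicitly:
#   dist.init_process_group(backend="xccl", rank=rank, world_size=world_size)
```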
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov