-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[Inductor] Generalize inductor triton backend device agnostic #109486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109486
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit f3427f2 with merge base dee1009 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Hi, @jansel, |
torch/_inductor/device_properties.py
Outdated
| _compile_worker_device_properties: Dict[str, Any] = {} | ||
| _compile_worker_current_devices: Dict[str, int] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems pretty similar to RuntimeInterface in runtime.py above. Can we just merge both files into a single subclass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a try. But I met an issue. In my understanding, DeviceInterface is an interface of device runtime. So the API set_device and get_device_properties should be general. set_compiler_worker_current_device, current_device, and get_device_properties in device_properties.py are just used for compiling triton kernel in multi-processing.
Actually, get_device_properties in device_properties.py has a potential issue if the user uses set_device instead of set_compiler_worker_current_device to switch the device index. The reason is current_device in device_properties.py always fetch the cache from _compile_worker_current_devices instead of calling torch.cuda.current_device() in runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I would like to know if it is reasonable to keep device_properties.py. If you prefer to merge them into DeviceInterface, is it ok to add two specific methods set_device_for_cache and get_device_properties_for_cache to 'DeviceInterfaceto distinguish fromset_deviceandget_device_properties`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to merge them unless there is some technical reason not to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu) Details for Dev Infra teamRaised by workflow job |
|
Hi @jansel, are these failures unrelated to this PR? |
|
@guangyey , please check the failed cases locally. It might be caused by this PR. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Motivation
@jansel As discussed before, we expected to generalize some cuda-specific code. This can make inductor more friendly to third-party backend so that we can leverage inductor code as much as possible.
Solution
To implement this, we give a solution to introduce device runtime abstraction. We wrapper them inside
DeviceInterfaceand useregister_interface_for_deviceto register each kind of device to inductor. Then useget_interface_for_deviceto fetch the corresponding runtime from device type. Then usage is like this:The
DeviceInterfaceis a simple abstraction, which enables third-party backends that implement CUDA-like semantics to be integrated with inductor. This can prevent third-party backend from using monkey patch to override some utility functions, likedecode_devicethat is hard-coded with CUDA.Additional Context
The main code change:
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @gujinghui @arthuryuan1987