KEMBAR78
Refactor device index bound check for xpu code by guangyey · Pull Request #120768 · pytorch/pytorch · GitHub
Skip to content

Conversation

@guangyey
Copy link
Collaborator

@guangyey guangyey commented Feb 28, 2024

Movitation

refer to Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex., we use c10::Device::MAX_NUM_DEVICES to make sure the number of XPU devices is valid in PyTorch.

Solution

Use TORCH_CHECK to check if the number of XPU devices exceeds c10::Device::MAX_NUM_DEVICES when enum XPU devices.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot
Copy link

pytorch-bot bot commented Feb 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120768

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 83266ff with merge base 965555d (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey guangyey added ciflow/xpu Run XPU CI tasks topic: not user facing topic category labels Feb 28, 2024
@guangyey guangyey force-pushed the guangyey/max_device_for_xpu branch from b8286b7 to f2467e5 Compare February 28, 2024 03:34
@guangyey guangyey added the intel This tag is for PR from Intel label Feb 28, 2024
@EikanWang
Copy link
Collaborator

@guangyey could you describe more in your pr?

@guangyey guangyey requested review from albanD and malfet February 28, 2024 03:43
@guangyey guangyey added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 28, 2024
@guangyey guangyey changed the title [WIP] Refactor check device overflow for xpu code Refactor check device overflow for xpu code Feb 28, 2024
@guangyey guangyey changed the title Refactor check device overflow for xpu code Refactor device bound check for xpu code Feb 28, 2024
@guangyey guangyey force-pushed the guangyey/max_device_for_xpu branch from f2467e5 to 5d749a9 Compare February 28, 2024 05:11
@guangyey guangyey force-pushed the guangyey/max_device_for_xpu branch from 5d749a9 to c082909 Compare February 28, 2024 05:18
@guangyey guangyey requested a review from tringwald February 28, 2024 09:59

void enumDevices(std::vector<std::unique_ptr<sycl::device>>& devices) {
// Ensures we only call enumDevices only once.
void enumDevices() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could maybe be moved into a lambda function guarded by call_once (see my other comment below).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the parameter devices back to enumDevices to prevent unintentional corrupting of the device list, since the initGlobalDevicePoolState` only can be called once.

@tringwald
Copy link
Collaborator

FYI: #119639 got reverted (again).

@guangyey guangyey changed the title Refactor device bound check for xpu code [Don't merge] Refactor device bound check for xpu code Feb 29, 2024
@guangyey
Copy link
Collaborator Author

Don't merge this PR. It depends on #119639.

@github-actions
Copy link
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Apr 29, 2024
@guangyey guangyey added no-stale and removed Stale labels May 6, 2024
@guangyey guangyey force-pushed the guangyey/max_device_for_xpu branch 3 times, most recently from 9ff00da to 83266ff Compare November 12, 2024 07:12
@guangyey guangyey changed the title [Don't merge] Refactor device bound check for xpu code Refactor device index bound check for xpu code Nov 12, 2024
@guangyey
Copy link
Collaborator Author

Looks like #119639 is stale. I will rebase to the main branch and land it.

@guangyey
Copy link
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

ghstack-source-id: 1863aaf
Pull Request resolved: #140369
ghstack-source-id: c8c2e4a
Pull Request resolved: #140370
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
# Movitation
refer to [Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex.](pytorch#119639), we use `c10::Device::MAX_NUM_DEVICES` to make sure the number of XPU devices is valid in PyTorch.

# Solution
Use `TORCH_CHECK` to check if the number of XPU devices exceeds `c10::Device::MAX_NUM_DEVICES` when enum XPU devices.

Pull Request resolved: pytorch#120768
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/tringwald
@github-actions github-actions bot deleted the guangyey/max_device_for_xpu branch December 14, 2024 02:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks intel This tag is for PR from Intel Merged no-stale open source topic: not user facing topic category

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants