Friendly catch exception when fail to initialize XPU devices by guangyey · Pull Request #141658 · pytorch/pytorch · GitHub

@guangyey (Collaborator) commented Nov 27, 2024

pytorch-bot bot commented Nov 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141658

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 87d974a with merge base 6e61ff4:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey added the ciflow/xpu, module: xpu, and release notes: xpu labels Nov 27, 2024
@guangyey added this to the 2.6.0 milestone Nov 27, 2024
enumDevices(gDevicePool.devices);
} catch (const sycl::exception& e) {
TORCH_WARN(
"Failed to initialize XPU devices. Did you install the driver correctly?");
Collaborator

The message is not informative. Could you help refine the message?

Collaborator Author

@guangyey guangyey Nov 27, 2024

Updated. Do you think it is OK?

guangyey added a commit that referenced this pull request Nov 27, 2024
ghstack-source-id: 340c0e5
Pull Request resolved: #141658
@guangyey requested a review from EikanWang November 27, 2024 09:20
@guangyey changed the title from "Friendly catch execption when initialize XPU devices" to "Friendly catch exception when fail to initialize XPU devices" Nov 27, 2024
[ghstack-poisoned]
TORCH_WARN(
"Failed to initialize XPU devices. The driver may not be installed, installed incorrectly, or incompatible with the current setup. ",
"Please refer to the guideline (https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support) for proper installation and configuration.");
return;
Collaborator

With this change, torch no longer crashes when XPU device initialization fails, even if no XPU device is present.
Emitting a warning is much better than an ugly crash.

I assume the following behavior is well defined in this scenario:

  1. The device pool always starts out empty.
  2. No undefined behavior occurs even if the user does not check the device count.

Please note that initDevicePoolCallOnce is called only once, which means the warning is raised only once.

Collaborator Author

Yes, the device pool is always initialized to empty. We deliberately emit this warning only once so that it does not repeatedly interrupt the user. If the user goes on to call other runtime APIs, PyTorch raises a RuntimeError in the following code:

DeviceIndex device_count_ensure_non_zero() {
  auto count = device_count();
  // Zero gpus could produce a warning in `device_count` but we fail here.
  TORCH_CHECK(count, "No XPU devices are available.");
  return count;
}
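
The behavior discussed in this thread (an empty initial pool, a single warning on failed enumeration, and a hard error only when a caller actually needs a device) can be illustrated with a minimal standalone sketch. This is not the PyTorch implementation: every name below (DevicePool, enumerate_devices, g_warning_count, and so on) is hypothetical, std::runtime_error stands in for sycl::exception, std::cerr stands in for TORCH_WARN, and a function-local static stands in for the call-once helper used in the real code.

```cpp
#include <cassert>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the real device pool; it starts out empty.
struct DevicePool {
  std::vector<std::string> devices;
};

static DevicePool g_pool;
static int g_warning_count = 0;  // lets callers observe how often we warned

// Hypothetical stand-in for enumDevices(): simulates a broken driver setup.
void enumerate_devices(std::vector<std::string>& /*out*/) {
  throw std::runtime_error("simulated driver failure");
}

void init_device_pool() {
  try {
    enumerate_devices(g_pool.devices);
  } catch (const std::exception& e) {
    // Warn instead of propagating; the pool keeps its empty initial state.
    ++g_warning_count;
    std::cerr << "Warning: failed to initialize devices: " << e.what() << '\n';
    return;
  }
}

int device_count() {
  // A function-local static is initialized exactly once (thread-safe since
  // C++11), so the warning above can be emitted at most once.
  static const bool initialized = (init_device_pool(), true);
  (void)initialized;
  return static_cast<int>(g_pool.devices.size());
}

int device_count_ensure_non_zero() {
  const int count = device_count();
  // Zero devices only produced a warning earlier; here it becomes an error.
  if (count == 0) {
    throw std::runtime_error("No devices are available.");
  }
  return count;
}
```

In this sketch, repeated calls to device_count() return 0 and warn a single time, while device_count_ensure_non_zero() throws, which mirrors the split between the warning path and the TORCH_CHECK path described above.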

@guangyey guangyey requested a review from gujinghui November 28, 2024 02:47
@guangyey added the ciflow/trunk label Nov 28, 2024
Collaborator

@EikanWang EikanWang left a comment

LGTM

@guangyey
Collaborator Author

"Unrelated failure, there is an issue to track it #141705"
@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
@github-actions github-actions bot deleted the gh/guangyey/103/head branch December 30, 2024 02:08
Labels

ciflow/trunk, ciflow/xpu, Merged, module: xpu, open source, release notes: xpu

Projects

Status: Done

5 participants