Add device guard for xpu conv on multi device #153345

pytorchbot · 2025-05-11T15:23:16Z

Stack from ghstack (oldest at bottom):

-> Add device guard for xpu conv on multi device #153067

Motivation

fixes #153022
The root cause is that the XPU backend registers the convolution op using m.impl, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.

Additional Context

run the following script

import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)


with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")

The output is

         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168

# Motivation fixes #153022 The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set. # Additional Context run the following script ```python import torch import torchvision.models as models torch.manual_seed(0) model = models.resnet50(weights="ResNet50_Weights.DEFAULT") model.eval() data = torch.rand(1, 3, 224, 224) device = torch.device('xpu:1') # 'xpu:0' model = model.to(device=device, dtype=torch.float16) data = data.to(device, dtype=torch.float16) with torch.no_grad(): ret = model(data) print(ret) print("Execution finished") ``` The output is ```bash -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01, 6.4551e-01, -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01, 3.2715e-02, -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01, -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01, 2.0312e+00]], device='xpu:1', dtype=torch.float16) Execution finished ``` Pull Request resolved: #153067 Approved by: https://github.com/albanD, https://github.com/EikanWang (cherry picked from commit e06a080)

pytorch-bot · 2025-05-11T15:23:20Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153345

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 60d85d3 with merge base 924a247 ():

NEW FAILURES - The following jobs have failed:

pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) (gh)
MISSING REGRESSION TEST
pull / linux-focal-py3_9-clang9-xla / build (gh)
ninja: build stopped: subcommand failed

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge) (gh) (#144480)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

atalman

lgtm

pytorchbot requested review from EikanWang and gujinghui as code owners May 11, 2025 15:23

This was referenced May 11, 2025

[v2.7.1] Release Tracker #152627

Closed

Add device guard for xpu conv on multi device #153067

Closed

pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 11, 2025

pytorchbot added the open source label May 11, 2025

EikanWang approved these changes May 12, 2025

View reviewed changes

atalman approved these changes May 14, 2025

View reviewed changes

atalman merged commit 3bfe071 into release/2.7 May 14, 2025
181 of 187 checks passed

atalman mentioned this pull request May 28, 2025

Release 2.7.1 validations checklist and cherry-picks #154512

Closed

49 tasks

github-actions bot deleted the cherry-pick-153067-by-pytorch_bot_bot_ branch June 18, 2025 02:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add device guard for xpu conv on multi device #153345

Add device guard for xpu conv on multi device #153345

Uh oh!

pytorchbot commented May 11, 2025

Uh oh!

pytorch-bot bot commented May 11, 2025 •

edited

Loading

Uh oh!

atalman left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Add device guard for xpu conv on multi device #153345

Add device guard for xpu conv on multi device #153345

Uh oh!

Conversation

pytorchbot commented May 11, 2025

Motivation

Additional Context

Uh oh!

pytorch-bot bot commented May 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153345

❌ 2 New Failures, 1 Unrelated Failure

Uh oh!

atalman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pytorch-bot bot commented May 11, 2025 •

edited

Loading