KEMBAR78
Add device guard for xpu conv on multi device by pytorchbot · Pull Request #153345 · pytorch/pytorch · GitHub
Skip to content

Conversation

@pytorchbot
Copy link
Collaborator

Stack from ghstack (oldest at bottom):

Motivation

fixes #153022
The root cause is that the XPU backend registers the convolution op using m.impl, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.

Additional Context

run the following script

import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)


with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")

The output is

         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168

# Motivation
fixes #153022
The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.

# Additional Context
run the following script
```python
import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)

with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")
```
The output is
```bash
         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished

```

Pull Request resolved: #153067
Approved by: https://github.com/albanD, https://github.com/EikanWang

(cherry picked from commit e06a080)
@pytorch-bot
Copy link

pytorch-bot bot commented May 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153345

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 60d85d3 with merge base 924a247 (image):

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 11, 2025
Copy link
Contributor

@atalman atalman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@atalman atalman merged commit 3bfe071 into release/2.7 May 14, 2025
181 of 187 checks passed
@github-actions github-actions bot deleted the cherry-pick-153067-by-pytorch_bot_bot_ branch June 18, 2025 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module: cpu CPU specific problem (e.g., perf, algorithm) open source

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants