-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[cuda][cupy] Improve cupy device placement when device is provided with explicit index #158529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158529
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (2 Unrelated Failures)As of commit 792fcb5 with merge base 0e3e377 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but was present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
| if (device_opt.has_value() && device_opt->has_index()) { | ||
| // if device_opt is provided with explicit device index, use it | ||
| return device_opt; | ||
| } else { | ||
| // otherwise infer from cudaPointerGetAttributes later in from_blob | ||
| return std::nullopt; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wdvr I think this can solve the broken test in https://github.com/pytorch/pytorch/actions/runs/16317513144/job/46092420996 .
I tried that test locally, but it passes even before this fix. no ideas about what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can we trigger that test in the pr before merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youkaichao I added ciflow/trunk and ci-no-td labels, that should run all tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then how can i trigger a new ci run?
|
also cc @ezyang |
| ) | ||
|
|
||
|
|
||
| os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this, it will interfere with running this with other test files in a single process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the env var change to _init_device, then it should not interfere with other tests. hope it works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functional change is fine but we shouldn't be twiddling envvars in the test file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is still done improperly, because the environ is sticky. If there is no way to do this I recommend just running the test in a subprocess, check out test_cublas_config_nondeterministic_alert for an example
I thought the test were run in subprocess? |
|
@ezyang how about i use torch.cuda.memory._set_allocator_settings('expandable_segments:True')
torch.cuda.memory._set_allocator_settings('expandable_segments:False')before and after the test? |
|
This appears to match current usage in test/test_cuda.py, so sure, if you try...finally it I will accept |
|
@pytorchmergebot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…th explicit index (pytorch#158529) resubmit pytorch#158320 , fixing a potential bug when device index is not specified explicitly. Pull Request resolved: pytorch#158529 Approved by: https://github.com/ezyang
…th explicit index (pytorch#158529) resubmit pytorch#158320 , fixing a potential bug when device index is not specified explicitly. Pull Request resolved: pytorch#158529 Approved by: https://github.com/ezyang
resubmit #158320 , fixing a potential bug when device index is not specified explicitly.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta