- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25.7k
Upgrade distributed test to g4dn instances (T4 GPUs) #137161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
| 🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137161
 Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit f6f6620 with merge base a1899b5 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
 
 This comment was automatically generated by Dr. CI and updates every 15 minutes. | 
[ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nit: since these nodes are 2x as expensive it might be good to move the experimental split build duplicate tests to periodic.yml.
WDYT @PaliC
[ghstack-poisoned]
[ghstack-poisoned]
| @pytorchbot merge -f "Previously tested; just a rebase to swap stack order to unblock the other PR" | 
| Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).  Please use  Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team | 
This PR contains multiple fixes for issue #135279: ## First part: Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it May Init Ctx. ## Second part: Even with the above fix, additional contexts are still observed during Work object destruction, e.g. ``` work = dist.all_reduce(tensor, async_op=True) time.sleep(5) <-- no additional context yet del work <-- additional context shows up ``` ### Debug process Chasing it down to destruction of a `Future` object -- a member variable of `Work`. Then further down to the following member of `Future`: ``` std::vector<c10::Event> events_; ``` When the `events_` are destroyed, we hit the road down to: https://github.com/pytorch/pytorch/blob/1f3a79379012b408e0375e81fe9205dcba5e34ba/c10/cuda/impl/CUDAGuardImpl.h#L106-L121 When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- **that's where rank 1, 2, ... can create extra context on device 0!** ### Solution This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard. ## Test Added test_extra_cuda_context, implemented via - `pynvml` (if available), or - memory consumption check. `python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context` Pull Request resolved: #135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy ghstack dependencies: #137161
| @pytorchbot revert -m "broken tests on trunk" -c "nosignal" | 
| @pytorchbot successfully started a revert job. Check the current status here. | 
This reverts commit cdd8fa9. Reverted #135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](#137161 (comment)))
This reverts commit b6a64dc. Reverted #137161 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](#137161 (comment)))
| @kwen2501 your PR has been successfully reverted. | 
…dering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…ompute-comm reordering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…dering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…ompute-comm reordering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…dering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…ompute-comm reordering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…dering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…ompute-comm reordering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…dering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…s to test_inductor_distributed (#138178) `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. Pull Request resolved: #138178 Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
| @pytorchbot merge -f "unrelated failure" | 
| Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).  Please use  Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team | 
This PR contains multiple fixes for issue #135279: ## First part: Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it May Init Ctx. ## Second part: Even with the above fix, additional contexts are still observed during Work object destruction, e.g. ``` work = dist.all_reduce(tensor, async_op=True) time.sleep(5) <-- no additional context yet del work <-- additional context shows up ``` ### Debug process Chasing it down to destruction of a `Future` object -- a member variable of `Work`. Then further down to the following member of `Future`: ``` std::vector<c10::Event> events_; ``` When the `events_` are destroyed, we hit the road down to: https://github.com/pytorch/pytorch/blob/1f3a79379012b408e0375e81fe9205dcba5e34ba/c10/cuda/impl/CUDAGuardImpl.h#L106-L121 When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- **that's where rank 1, 2, ... can create extra context on device 0!** ### Solution This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard. ## Test Added test_extra_cuda_context, implemented via - `pynvml` (if available), or - memory consumption check. `python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context` Pull Request resolved: #135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy ghstack dependencies: #137161 Co-authored-by: Will Feng <yf225@cornell.edu>
Following change in #137161 , bumping world size for some test suites. Pull Request resolved: #138846 Approved by: https://github.com/fduwjj
Following change in #137161 , bumping world size for some test suites. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
Following change in #137161 , bumping world size for some test suites. Pull Request resolved: #138846 Approved by: https://github.com/fduwjj
If there aren't any GPUs the WORLD_SIZE would be zero which does not work. So skip those backends completely in that case. Fix after #137161 It might make sense to still run the (CPU-) part of the tests by using something like `world_size = max(3, gpu_count)` or `num_gpus if num_gpus else 3` instead of skipping them all Pull Request resolved: #150764 Approved by: https://github.com/kwen2501
Stack from ghstack (oldest at bottom):
async_op=Truecollective if underallow_inflight_collective_as_graph_input_ctx()context manager #137763cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o