Upgrade distributed test to g4dn instances (T4 GPUs) #137161
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137161
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit f6f6620 with merge base a1899b5. FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM. Nit: since these nodes are 2x as expensive, it might be good to move the experimental split-build duplicate tests to periodic.yml.
WDYT @PaliC
@pytorchbot merge -f "Previously tested; just a rebase to swap stack order to unblock the other PR"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR contains multiple fixes for issue #135279:

## First part
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it May Init Ctx.

## Second part
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work       <-- additional context shows up
```

### Debug process
Chased it down to the destruction of a `Future` object -- a member variable of `Work` -- and then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
https://github.com/pytorch/pytorch/blob/1f3a79379012b408e0375e81fe9205dcba5e34ba/c10/cuda/impl/CUDAGuardImpl.h#L106-L121

When there is no "preset" CUDA context (**which is the case for the Python garbage collector**), line 112, `c10::cuda::GetDevice(&orig_device)`, will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- **that's where rank 1, 2, ... can create extra contexts on device 0!**

### Solution
This PR adds an explicit destructor to `Future`. In this destructor, each event is destroyed with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- a memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: #135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy ghstack dependencies: #137161
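For illustration, here is a minimal sketch of how a pynvml-based variant of such a check could look. The helper name and the exact assertion are assumptions for this sketch, not the actual `test_extra_cuda_context` implementation:

```python
# Hypothetical sketch: count processes holding a compute context on GPU 0.
# The real test_extra_cuda_context in test/distributed/test_c10d_nccl.py may
# be structured differently (and falls back to a memory-consumption check).
import pynvml

def count_compute_procs_on_device0() -> int:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        # One entry per process that currently has a compute context on GPU 0.
        return len(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
    finally:
        pynvml.nvmlShutdown()

# After the collective finishes and the Work object is garbage-collected,
# only rank 0 should hold a context on device 0; with the bug, ranks 1, 2, ...
# would show up here as well.
assert count_compute_procs_on_device0() <= 1, "extra CUDA context on device 0"
```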
@pytorchbot revert -m "broken tests on trunk" -c "nosignal"
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit cdd8fa9. Reverted #135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](#137161 (comment)))
This reverts commit b6a64dc. Reverted #137161 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](#137161 (comment)))
@kwen2501 your PR has been successfully reverted.
…dering tests to test_inductor_distributed" `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…s to test_inductor_distributed (#138178) `test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now. Pull Request resolved: #138178 Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
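As background, the bf16 constraint boils down to a GPU capability gate along these lines. This is only a sketch under the assumption of a plain `unittest` skip; the helper `bf16_supported` and the test class are made up, and the real tests likely use PyTorch's own skip decorators:

```python
# Hypothetical sketch of a bf16 capability gate; the actual distributed tests
# likely rely on PyTorch's own skip decorators rather than this helper.
import unittest
import torch

def bf16_supported() -> bool:
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    # bf16 needs an SM80-or-newer GPU (e.g. A10G/A100); the T4 is SM75.
    return major >= 8 and torch.cuda.is_bf16_supported()

@unittest.skipUnless(bf16_supported(), "requires a bf16-capable (SM80+) GPU")
class CompiledDistributedSmokeTest(unittest.TestCase):
    def test_bf16_tensor(self) -> None:
        x = torch.randn(4, device="cuda", dtype=torch.bfloat16)
        self.assertEqual(x.dtype, torch.bfloat16)
```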
@pytorchbot merge -f "unrelated failure"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Re-land of the change described above (fixes for issue #135279). Pull Request resolved: #135273 Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy ghstack dependencies: #137161 Co-authored-by: Will Feng <yf225@cornell.edu>
Following the change in #137161, bump the world size for some test suites. Pull Request resolved: #138846 Approved by: https://github.com/fduwjj
If there aren't any GPUs, WORLD_SIZE would be zero, which does not work, so skip those backends completely in that case. Fix after #137161. It might make sense to still run the (CPU-)part of the tests by using something like `world_size = max(3, gpu_count)` or `num_gpus if num_gpus else 3` instead of skipping them all. Pull Request resolved: #150764 Approved by: https://github.com/kwen2501
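A minimal sketch of the suggested alternative (the helper name `pick_world_size` is made up for illustration):

```python
# Sketch of the suggested fallback: instead of skipping the backends when no
# GPU is present, keep a small fixed world size so the CPU part still runs.
import torch

def pick_world_size(cpu_fallback: int = 3) -> int:
    num_gpus = torch.cuda.device_count()
    return num_gpus if num_gpus else cpu_fallback
```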
Stack from ghstack (oldest at bottom):
async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager #137763

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o