Upgrade distributed test to g4dn instances (T4 GPUs) by kwen2501 · Pull Request #137161 · pytorch/pytorch · GitHub

Conversation

@pytorch-bot

pytorch-bot bot commented Oct 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137161

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit f6f6620 with merge base a1899b5:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label Oct 2, 2024
kwen2501 added a commit that referenced this pull request Oct 2, 2024
Member

@seemethere seemethere left a comment


LGTM. Nit: since these nodes are 2x as expensive, it might be good to move the experimental split-build duplicate tests to periodic.yml.

WDYT @PaliC

kwen2501 added a commit that referenced this pull request Oct 10, 2024
ghstack-source-id: 0c9e8cd
Pull Request resolved: #137161
@kwen2501 kwen2501 changed the title [Trial] Upgrade distributed test to g4dn instances (T4 GPUs) Upgrade distributed test to g4dn instances (T4 GPUs) Oct 10, 2024
@kwen2501 kwen2501 added the keep-going label (Don't stop on first failure, keep running tests until the end) Oct 10, 2024
@kwen2501
Contributor Author

@pytorchbot merge -f "Previously tested; just a rebase to swap stack order to unblock the other PR"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 10, 2024
This PR contains multiple fixes for issue #135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, that call may initialize a CUDA context.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```python
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  # no additional context yet
del work       # additional context shows up
```
### Debug process
The issue was chased down to the destruction of a `Future` object -- a member variable of `Work` -- and further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When `events_` is destroyed, we end up in:
https://github.com/pytorch/pytorch/blob/1f3a79379012b408e0375e81fe9205dcba5e34ba/c10/cuda/impl/CUDAGuardImpl.h#L106-L121

When there is no "preset" CUDA context (**which is the case for the Python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where ranks 1, 2, ... can create an extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future` that destroys each event under a device guard.

## Test
Added `test_extra_cuda_context`, implemented via
- `pynvml` (if available), or
- a memory-consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
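
For illustration, here is a minimal hedged sketch (not the test's actual code; the helper name and the threshold are assumptions) of what the `pynvml` variant of such a check could look like: with one rank per GPU, only rank 0 should ever hold a context on GPU 0.

```python
import pynvml  # NVIDIA management library bindings

def count_procs_on_gpu0() -> int:
    # Return the number of processes holding a compute context on GPU 0.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return len(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
    finally:
        pynvml.nvmlShutdown()

# With one process per rank and one rank per GPU, a count greater than 1 on
# GPU 0 indicates the extra-context bug described above.
assert count_procs_on_gpu0() <= 1, "unexpected extra CUDA context on GPU 0"
```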

Pull Request resolved: #135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
@PaliC
Contributor

PaliC commented Oct 10, 2024

@pytorchbot revert -m "broken tests on trunk" -c "nosignal"
check out https://hud.pytorch.org/pytorch/pytorch/commit/b6a64dce072240c0b06d2fb03ac81b3ed1b73d92

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 10, 2024
This reverts commit cdd8fa9.

Reverted #135273 on behalf of https://github.com/PaliC due to broken tests on trunk (comment on #137161)
pytorchmergebot added a commit that referenced this pull request Oct 10, 2024
This reverts commit b6a64dc.

Reverted #137161 on behalf of https://github.com/PaliC due to broken tests on trunk (comment on #137161)
@pytorchmergebot
Collaborator

@kwen2501 your PR has been successfully reverted.

@kwen2501 kwen2501 added the ciflow/trunk label (Trigger trunk jobs on your pull request) Oct 10, 2024
yf225 added a commit that referenced this pull request Oct 20, 2024
…dering tests to test_inductor_distributed"

`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now.




cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Oct 20, 2024
…s to test_inductor_distributed (#138178)

`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: #138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
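
As background on the bf16 requirement above, a minimal hedged sketch (illustrative only; the helper and test names are assumptions, not the actual guard used in these files) of how a test can be skipped on GPUs without native bf16 support, such as the T4s these jobs are moving to:

```python
import unittest
import torch

def has_native_bf16() -> bool:
    # Native bf16 support starts with Ampere (SM80); T4 is SM75.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8

class ReplicateCompileExample(unittest.TestCase):
    @unittest.skipUnless(has_native_bf16(), "requires a bf16-capable GPU")
    def test_bf16_tensor(self):
        x = torch.randn(8, 8, device="cuda", dtype=torch.bfloat16)
        self.assertEqual(x.dtype, torch.bfloat16)

if __name__ == "__main__":
    unittest.main()
```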
@yf225
Contributor

yf225 commented Oct 20, 2024

@pytorchbot merge -f "unrelated failure"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 21, 2024
This PR contains multiple fixes for issue #135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, that call may initialize a CUDA context.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```python
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  # no additional context yet
del work       # additional context shows up
```
### Debug process
The issue was chased down to the destruction of a `Future` object -- a member variable of `Work` -- and further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When `events_` is destroyed, we end up in:
https://github.com/pytorch/pytorch/blob/1f3a79379012b408e0375e81fe9205dcba5e34ba/c10/cuda/impl/CUDAGuardImpl.h#L106-L121

When there is no "preset" CUDA context (**which is the case for the Python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where ranks 1, 2, ... can create an extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future` that destroys each event under a device guard.

## Test
Added `test_extra_cuda_context`, implemented via
- `pynvml` (if available), or
- a memory-consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: #135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161

Co-authored-by: Will Feng <yf225@cornell.edu>
pytorchmergebot pushed a commit that referenced this pull request Oct 25, 2024
Following the change in #137161, bump the world size for some test suites.

Pull Request resolved: #138846
Approved by: https://github.com/fduwjj
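
For context, a hedged sketch (class name assumed; not the PR's actual diff) of what bumping the world size can look like in a MultiProcessTestCase-style suite: derive the rank count from the visible GPUs instead of hard-coding it.

```python
import torch
from torch.testing._internal.common_distributed import MultiProcessTestCase

class ExampleCollectiveTest(MultiProcessTestCase):
    @property
    def world_size(self) -> int:
        # One rank per visible GPU (e.g. 4 on a multi-GPU g4dn T4 instance),
        # falling back to 2 ranks when no GPU is present.
        return torch.cuda.device_count() or 2
```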
kwen2501 added a commit that referenced this pull request Oct 25, 2024
Following the change in #137161, bump the world size for some test suites.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
@github-actions github-actions bot deleted the gh/kwen2501/69/head branch November 21, 2024 02:08
pytorchmergebot pushed a commit that referenced this pull request Apr 29, 2025
If there aren't any GPUs, WORLD_SIZE would be zero, which does not work. So skip those backends completely in that case.

Fix after #137161

It might make sense to still run the CPU-only part of the tests by using something like `world_size = max(3, gpu_count)` or `num_gpus if num_gpus else 3` instead of skipping them all.
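
A hedged sketch of that alternative (illustrative only; the function name is an assumption): one rank per GPU, with a small CPU-only world size instead of skipping.

```python
import torch

def pick_world_size(cpu_fallback: int = 3) -> int:
    # One rank per visible GPU; on CPU-only machines, fall back to a small
    # fixed world size instead of skipping the backend entirely.
    num_gpus = torch.cuda.device_count()
    return num_gpus if num_gpus else cpu_fallback
```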

Pull Request resolved: #150764
Approved by: https://github.com/kwen2501

Labels

- ci-no-td (Do not run TD on this PR)
- ciflow/inductor
- ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
- ciflow/rocm (Trigger "default" config CI on ROCm)
- ciflow/trunk (Trigger trunk jobs on your pull request)
- keep-going (Don't stop on first failure, keep running tests until the end)
- Merged
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- topic: not user facing (topic category)
