[CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed #138178
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138178
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 7921287 with merge base 1f349ee.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
LGTM
@pytorchbot revert -m 'Sorry for reverting your change, but the new tests are failing inductor distributed jobs' -c nosignal
distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_compile_backward_only (GH job link, HUD commit link)
Let me add …
@pytorchbot successfully started a revert job. Check the current status here.
@yf225 your PR has been successfully reverted.
…or_distributed (#138178)"
This reverts commit 20af56d.
Reverted #138178 on behalf of https://github.com/huydhn due to: Sorry for reverting your change, but the new tests are failing inductor distributed jobs (comment on #138178)
…dering tests to test_inductor_distributed"
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support). This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now.
cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o
[ghstack-poisoned]
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed
Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed
Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed
Reason: 1 jobs have failed, first few of them are: periodic / linux-focal-rocm6.2-py3.10 / test (distributed, 1, 3, linux.rocm.gpu)
@pytorchbot merge -f "fixed the failing test, other tests are confirmed working by previous CI runs"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only blocking ones now.
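
The hardware constraint described above can be illustrated with a minimal sketch. It is not code from this PR: the helper names `has_native_bf16` and `requires_bf16` are hypothetical, and the threshold assumes that native bf16 support begins at compute capability 8.0 (SM80), while a T4 is SM75. In a real test the capability tuple would likely come from `torch.cuda.get_device_capability()` rather than being passed in by hand.

```python
import unittest

# Assumption: native bf16 support starts at compute capability 8.0 (SM80).
MIN_BF16_CAPABILITY = (8, 0)


def has_native_bf16(capability):
    """Return True if a (major, minor) capability meets the assumed bf16 threshold."""
    return tuple(capability) >= MIN_BF16_CAPABILITY


def requires_bf16(capability):
    """Hypothetical decorator: skip a test when the capability lacks bf16."""
    return unittest.skipUnless(
        has_native_bf16(capability),
        "bf16 requires compute capability 8.0 or newer",
    )


class ExampleCompiledTest(unittest.TestCase):
    @requires_bf16((7, 5))  # T4-class device (SM75): the test is skipped
    def test_on_t4(self):
        self.fail("should not run without bf16 support")

    @requires_bf16((8, 0))  # SM80-class device: the test runs
    def test_on_sm80(self):
        self.assertTrue(has_native_bf16((8, 0)))
```

Under this sketch, a bf16-dependent test placed in a job that runs on T4 machines would be skipped rather than exercised, which is why such tests need to live in a job with SM80-class hardware.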
Stack from ghstack (oldest at bottom):
- … `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager #137763

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o