[CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed by yf225 · Pull Request #138178 · pytorch/pytorch · GitHub

Conversation

@yf225
Contributor

@yf225 yf225 commented Oct 17, 2024

test_replicate_with_compiler.py and test_fully_shard_compile.py require bf16, so they need to run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only ones still blocking that migration.
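To illustrate why the runner matters, here is a minimal sketch of the kind of bf16 gate these tests depend on (assuming a plain unittest-style guard; the decorators actually used in the test files may differ):

```python
import unittest

import torch


# Hedged sketch: CUDA bf16 needs an Ampere-class GPU (compute capability >= 8.0),
# which is why these tests have to run on the A10G-backed test_inductor_distributed
# job rather than on T4 (SM75) runners.
def requires_cuda_bf16(fn):
    has_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    return unittest.skipUnless(has_bf16, "requires CUDA bf16 support (SM80+)")(fn)


class ExampleCompiledCollectiveTest(unittest.TestCase):
    @requires_cuda_bf16
    def test_bf16_tensor_placeholder(self):
        # Placeholder body; the real tests live in test_replicate_with_compiler.py
        # and test_fully_shard_compile.py.
        x = torch.ones(8, device="cuda", dtype=torch.bfloat16)
        self.assertEqual(x.dtype, torch.bfloat16)


if __name__ == "__main__":
    unittest.main()
```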

Stack from ghstack (oldest at bottom):

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@yf225 yf225 requested a review from a team as a code owner October 17, 2024 06:35
yf225 added a commit that referenced this pull request Oct 17, 2024
@pytorch-bot

pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138178

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7921287 with merge base 1f349ee:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@yf225 yf225 requested a review from xmfan October 17, 2024 06:37
@yf225 yf225 added the topic: not user facing (topic category) and keep-going (Don't stop on first failure, keep running tests until the end) labels Oct 17, 2024
@yf225 yf225 requested review from kwen2501 and seemethere October 17, 2024 06:43
@yf225
Contributor Author

yf225 commented Oct 17, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Oct 17, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Contributor

@kwen2501 kwen2501 left a comment


LGTM

@huydhn huydhn removed the keep-going (Don't stop on first failure, keep running tests until the end) label Oct 17, 2024
@huydhn
Contributor

huydhn commented Oct 17, 2024

@pytorchbot revert -m 'Sorry for reverting your change, but the new tests are failing inductor distributed jobs' -c nosignal

distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_compile_backward_only (GH job link, HUD commit link)

Let me add ciflow/inductor to the PR to get those signals.
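As a rough sketch (not the fix that eventually landed), distributed test cases like this are usually guarded so they skip rather than fail on runners without enough GPUs; the helper below is hypothetical and only stands in for PyTorch's internal multi-process test utilities:

```python
import unittest

import torch


# Hypothetical guard (illustration only): skip a 2-GPU distributed test case
# when the machine does not expose enough CUDA devices, instead of letting it
# fail the whole inductor distributed job.
def requires_n_gpus(n: int):
    available = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return unittest.skipUnless(
        available >= n, f"requires {n} CUDA devices, found {available}"
    )


class ExampleReplicateCompileTest(unittest.TestCase):
    @requires_n_gpus(2)
    def test_compile_backward_only_placeholder(self):
        # Placeholder standing in for the real ReplicateTest.test_compile_backward_only.
        self.assertGreaterEqual(torch.cuda.device_count(), 2)


if __name__ == "__main__":
    unittest.main()
```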

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@yf225 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 17, 2024
Revert "[CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178)"

This reverts commit 20af56d.

Reverted #138178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new tests are failing inductor distributed jobs (comment)
@yf225 yf225 added the ciflow/inductor and keep-going (Don't stop on first failure, keep running tests until the end) labels and removed the Reverted and Merged labels Oct 17, 2024
…dering tests to test_inductor_distributed"

`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in #137161, as the compiled distributed jobs are the only ones still blocking that migration.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
@yf225
Contributor Author

yf225 commented Oct 20, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)

Details for Dev Infra team (raised by workflow job).

@yf225
Contributor Author

yf225 commented Oct 20, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)

Details for Dev Infra team (raised by workflow job).

@yf225
Contributor Author

yf225 commented Oct 20, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / linux-focal-rocm6.2-py3.10 / test (distributed, 1, 3, linux.rocm.gpu)

Details for Dev Infra team (raised by workflow job).

yf225 added 2 commits October 20, 2024 02:58
@yf225
Contributor Author

yf225 commented Oct 20, 2024

@pytorchbot merge -f "fixed the failing test, other tests are confirmed working by previous CI runs"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

ciflow/inductor
ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
ciflow/rocm (Trigger "default" config CI on ROCm)
ciflow/trunk (Trigger trunk jobs on your pull request)
keep-going (Don't stop on first failure, keep running tests until the end)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
topic: not user facing (topic category)
