[ROCM][CI] Introduce tests-to-include as rocm-test workflow input by jithunnair-amd · Pull Request #110511 · pytorch/pytorch · GitHub

@jithunnair-amd
Collaborator

@jithunnair-amd jithunnair-amd commented Oct 4, 2023

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch topic: not user facing topic category labels Oct 4, 2023
@pytorch-bot

pytorch-bot bot commented Oct 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110511

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ac4ae2f with merge base 0fd856c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jithunnair-amd jithunnair-amd added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 4, 2023
@jithunnair-amd
Collaborator Author

Seeing a failure that is probably already fixed upstream:
2023-10-17T00:59:20.6930616Z FAILED [0.0036s] test_cuda.py::TestCudaMallocAsync::test_allocator_settings - RuntimeError: Unrecognized CachingAllocator option: release_lock_on_cudamalloc
@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased reduced_tests_for_rocm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout reduced_tests_for_rocm && git pull --rebase)

@jithunnair-amd
Collaborator Author

ROCm CI ran 1 shard as expected: https://github.com/pytorch/pytorch/actions/runs/6552121852/job/17795441333
Duration: 44 min
Python tests run: test_cuda (78s), test_torch (160s)
C++ tests run: ~30 min!

Since the aim of this PR is to introduce a way to run a smaller set of core unit tests, we should exclude the C++ tests.
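
The selection described above can be sketched as follows. This is a minimal illustration, not the PR's actual change: the function name `select_tests`, the `cpp/` prefix used to recognize C++ tests, and the sample test list are all assumptions made for the example.

```python
from typing import Iterable, List

# Assumption for this sketch: C++ test entries are identified by a "cpp/" prefix.
CPP_PREFIX = "cpp/"


def select_tests(all_tests: Iterable[str], include: Iterable[str]) -> List[str]:
    """Return the tests named in `include`, in their original order, skipping C++ tests."""
    wanted = set(include)
    return [
        name
        for name in all_tests
        if name in wanted and not name.startswith(CPP_PREFIX)
    ]


if __name__ == "__main__":
    # Hypothetical full test list; only test_cuda and test_torch come from the run above.
    all_tests = ["test_autograd", "cpp/example_test", "test_cuda", "test_torch"]
    print(select_tests(all_tests, ["test_cuda", "test_torch", "cpp/example_test"]))
```

Filtering on the prefix after applying the include list means a C++ test stays excluded even if it is requested by name, which matches the stated goal of keeping the reduced run short.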

@jithunnair-amd jithunnair-amd marked this pull request as ready for review October 20, 2023 19:33
@jithunnair-amd jithunnair-amd requested a review from a team as a code owner October 20, 2023 19:33
@jithunnair-amd
Collaborator Author

@huydhn @clee2000 The ROCm CI as part of ciflow/trunk passed and ran the reduced set of tests in 53m: https://github.com/pytorch/pytorch/actions/runs/6590967469/job/17909206391

test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd
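
For illustration, a space-separated list like the one above could be turned into a test-runner invocation roughly as follows. The helper name `build_test_command` is hypothetical, and passing the names to `test/run_test.py` via `--include` is an assumption about how the workflow input is consumed, not something this thread confirms.

```python
import shlex
from typing import List


def build_test_command(tests_to_include: str) -> List[str]:
    """Build an argv list that runs only the named tests; empty input adds no filter."""
    cmd = ["python", "test/run_test.py"]
    names = tests_to_include.split()
    if names:
        # Assumption: the runner accepts the selected test names after --include.
        cmd += ["--include", *names]
    return cmd


if __name__ == "__main__":
    reduced = (
        "test_nn test_torch test_cuda test_ops "
        "test_unary_ufuncs test_binary_ufuncs test_autograd"
    )
    # Print the command as a single shell-quoted string for inspection.
    print(shlex.join(build_test_command(reduced)))
```

Leaving the filter off when the input is empty keeps the default behavior (run everything) for callers that do not set the input.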

@huydhn
Contributor

huydhn commented Nov 10, 2023

Ping @jithunnair-amd to see if there is any update on this one. The context is that we start to see some ROCm failures landing in trunk, i.e. https://hud.pytorch.org/pytorch/pytorch/commit/7ccca60927cdccde63d6a1d40480950f24e9877a, because the PR didn't have ciflow/rocm

@jithunnair-amd
Collaborator Author

> Ping @jithunnair-amd to see if there is any update on this one. The context is that we start to see some ROCm failures landing in trunk, i.e. https://hud.pytorch.org/pytorch/pytorch/commit/7ccca60927cdccde63d6a1d40480950f24e9877a, because the PR didn't have ciflow/rocm

Just updated this PR to use ROCm 5.7, but otherwise it looks good from my end if all ROCm tests pass.

Requesting @jeffdaily to also take a look in case I'm missing something.

@jithunnair-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased reduced_tests_for_rocm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout reduced_tests_for_rocm && git pull --rebase)

@jeffdaily
Collaborator

@jithunnair-amd are the CI failures real?

@jithunnair-amd
Collaborator Author

jithunnair-amd commented Nov 13, 2023

> @jithunnair-amd are the CI failures real?

The torchvision build failure is real, but is due to some unsupported compiler flags:

clang++: error: unsupported option '--generate-dependencies-with-compile'
clang++: error: unsupported option '--dependency-output'

The previous CI run succeeded and used a different torchvision commit. There was another torchvision commit bump after the one in the most recent failing CI run. I'm assessing if that'll resolve this issue.
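
As a rough illustration of the failure mode, the sketch below drops the two options that clang++ rejected from a list of compiler flags. The function name `strip_unsupported` and the sample flag list are hypothetical, and treating `--dependency-output` as taking a separate file argument is an assumption; this is not the fix that was applied upstream.

```python
from typing import List

# Option reported as unsupported by clang++ in the log above; takes no value.
FLAG_ONLY = {"--generate-dependencies-with-compile"}
# Assumption: this option is followed by a separate file-path argument.
FLAG_WITH_VALUE = {"--dependency-output"}


def strip_unsupported(flags: List[str]) -> List[str]:
    """Return `flags` without the unsupported options (and their values)."""
    kept: List[str] = []
    skip_next = False
    for flag in flags:
        if skip_next:
            # This token is the value of a removed option; drop it too.
            skip_next = False
            continue
        if flag in FLAG_ONLY:
            continue
        if flag in FLAG_WITH_VALUE:
            skip_next = True
            continue
        kept.append(flag)
    return kept


if __name__ == "__main__":
    # Hypothetical command line for demonstration only.
    sample = [
        "-O3",
        "--generate-dependencies-with-compile",
        "--dependency-output", "foo.d",
        "-c", "foo.cu",
    ]
    print(strip_unsupported(sample))
```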

@jithunnair-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased reduced_tests_for_rocm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout reduced_tests_for_rocm && git pull --rebase)

@jithunnair-amd
Collaborator Author

> > @jithunnair-amd are the CI failures real?
>
> The torchvision build failure is real, but is due to some unsupported compiler flags:
>
> clang++: error: unsupported option '--generate-dependencies-with-compile'
> clang++: error: unsupported option '--dependency-output'
>
> The previous CI run succeeded and used a different torchvision commit. There was another torchvision commit bump after the one in the most recent failing CI run. I'm assessing if that'll resolve this issue.

Actually, 0a7eef9 fixed the issue wrt unsupported compiler flags, so expecting a rebase to help.

@jithunnair-amd
Collaborator Author

ROCm CI passing with rebase: https://github.com/pytorch/pytorch/actions/runs/6854687390/job/18638805996

Merging, as the pre-rebase commit had all CI checks passing except ROCm.

@pytorchbot merge -f "ROCm CI check passed post-rebase; all other CI checks passed pre-rebase already"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
@jithunnair-amd jithunnair-amd added the rocm This tag is for PRs from ROCm team label Dec 13, 2023

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: rocm (AMD GPU support for Pytorch)
open source
rocm (This tag is for PRs from ROCm team)
topic: not user facing (topic category)

Development

Successfully merging this pull request may close these issues.

CI: rocm (default, 1, 3, linux.rocm.gpu) is very slow

5 participants