Enable MI355X PyTorch CI testing. #158889
@pytorchbot merge -f "CI failures unrelated. 2 failing ROCm CI jobs were due to unrelated timeouts"
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Original patch from saienduri <saimanas.enduri@amd.com>

This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.

- Rework the aotriton CMake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION`, as aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
- Bump the composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](https://github.com/ROCm/composable_kernel/tree/df6023e305f389bbf7249b0c4414e649f3ad6598) for MI350 compatibility.
- Extend the "change docker permissions" step to the MI355X runners as well. This step applies the required permission change to the test folder so that artifacts upload successfully in k8s docker.
- Create a new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: https://hud.pytorch.org/pytorch/pytorch/commit/ca7d5fae112558ee3dde7ec3ce32e94b13f877fd#rocm-mi300

Unlike the original patch, this version does not change the `__AOTRITON_SHA256_LIST` for ROCm 6.5. (The change would cause a sha256 error after the aotriton download.)

fixes: ROCm/TheRock#1119

Signed-off-by: Mika Laitio <mika.laitio@amd.com>
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd