[ROCm] fastSpecializedAtomicAdd for MI300 #135770
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135770
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures as of commit 6df3cf1 with merge base 31c0467.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Hi Jeff, I tested it internally on ROCm 6.2.0 and the performance looks great—thanks! However, I noticed that the code specifies ROCM_VERSION >= 60201. Is this a requirement, or should it also work with 6.2.0?
@Mellonta it's very possible that our internal clang compiler is newer than the clang rpm in 6.2.0 |
While working on this PR I discovered a bug in our ROCm 6.2 compiler. To better support you, I got the release team to push the fix as a patch into ROCm 6.2.1. On ROCm 6.2.0 the code will still compile, but index_add will produce garbage results for bf16 and fp16 types. That's why I guard it as requiring 6.2.1 or newer.
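For context, here is a minimal sketch of the guard being described, assuming PyTorch's usual ROCM_VERSION encoding (major*10000 + minor*100 + patch, so 6.2.1 becomes 60201) and the hipcc-defined __gfx942__ architecture macro for MI300; it is illustrative only, not the actual PyTorch code.

```cpp
#include <hip/hip_runtime.h>

// Sketch of the guard shape only: the packed bf16/fp16 fast path is compiled in
// only when targeting MI300 (gfx942) AND building with ROCm >= 6.2.1, the first
// release carrying the compiler fix. ROCm 6.2.0 would compile the fast path but
// miscompile it, so those builds must fall back to the generic atomic add.
__device__ inline bool use_packed_fp16_atomics() {
#if defined(USE_ROCM) && defined(__gfx942__) && defined(ROCM_VERSION) && \
    (ROCM_VERSION >= 60201)
  return true;   // safe to emit packed 2x16-bit hardware atomics
#else
  return false;  // generic fallback path
#endif
}
```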
@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
The failed tests don't seem relevant.
@jianyuh @xw285cornell Build should be fixed now.

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
MI300 adds HW support for packed bfloat16 and fp16. Enable via existing fastSpecializedAtomicAdd. Pull Request resolved: pytorch#135770 Approved by: https://github.com/xw285cornell, https://github.com/jianyuh Co-authored-by: Jeff Daily <jeff.daily@amd.com>

MI300 adds HW support for packed bfloat16 and fp16. Enable via existing fastSpecializedAtomicAdd. Pull Request resolved: pytorch#135770 Approved by: https://github.com/xw285cornell, https://github.com/jianyuh (cherry picked from commit d33a5e2)

…) (#1746) MI300 adds HW support for packed bfloat16 and fp16. Enable via existing fastSpecializedAtomicAdd. Helps with improving [torch.scatter_add_ performance](https://ontrack-internal.amd.com/browse/SWDEV-497013), among others. Pull Request resolved: pytorch#135770 Co-authored-by: Jeff Daily <jeff.daily@amd.com>
MI300 adds HW support for packed bfloat16 and fp16. Enable via existing fastSpecializedAtomicAdd.
cc @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
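To make the description above concrete, here is a hedged sketch in HIP C++ of the general idea behind a specialized fp16 atomic add: pair the target 16-bit element with its neighbor inside the aligned 32-bit word, put zero in the untouched lane, and issue one packed atomic where the hardware supports it. The helper name, the lane handling, and the availability of a __half2 atomicAdd overload on the packed path are assumptions for illustration; this is not the fastSpecializedAtomicAdd implementation the PR modifies.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

// Illustrative helper (hypothetical name): atomically add `value` to base[index]
// for an at-least-4-byte-aligned __half buffer.
__device__ inline void sketch_fast_half_atomic_add(__half* base, size_t index,
                                                   __half value) {
  // Aligned 32-bit word holding elements (index & ~1) and (index | 1).
  __half2* target = reinterpret_cast<__half2*>(base + (index & ~size_t(1)));
  // Put `value` in the lane that corresponds to `index`, zero in the other lane,
  // so the neighboring element is left unchanged by the packed add.
  const __half zero = __float2half(0.0f);
  const __half2 packed = (index % 2 == 0) ? __halves2half2(value, zero)
                                          : __halves2half2(zero, value);
#if defined(USE_ROCM) && defined(__gfx942__) && defined(ROCM_VERSION) && \
    (ROCM_VERSION >= 60201)
  // Assumption: a __half2 atomicAdd overload is available and lowers to MI300's
  // packed 2x16-bit atomic (CUDA exposes this overload on sm_60+; the HIP
  // equivalent is treated as an assumption here).
  atomicAdd(target, packed);
#else
  // Portable fallback: compare-and-swap loop on the containing 32-bit word.
  unsigned int* word = reinterpret_cast<unsigned int*>(target);
  unsigned int old = *word, assumed;
  do {
    assumed = old;
    __half2 cur = *reinterpret_cast<__half2*>(&assumed);
    __half2 sum = __hadd2(cur, packed);
    old = atomicCAS(word, assumed, *reinterpret_cast<unsigned int*>(&sum));
  } while (old != assumed);
#endif
}
```

Updating two 16-bit lanes with one packed atomic matches the new MI300 instructions and avoids a per-element compare-and-swap loop, which is presumably where the index_add/scatter_add speedups for bf16 and fp16 come from.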