[ROCm] Enable USE_FBGEMM_GENAI #160676
Conversation
This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @cthi, please follow step 2 of the internal wiki to get write access so you do not need CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160676. Note: links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit 68b8727 with merge base bc4db2c. UNSTABLE - the following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D79564024
To add the ciflow label, please first approve the workflows that are awaiting approval. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
Summary: X-link: pytorch/FBGEMM#4703 X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Test Plan: Will ensure CI is green internally. To ensure the op can be called, it was added to the fbgemm benchmarking script:

```
HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```

```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```

Rollback Plan:

Differential Revision: D79564024
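For reference, here is a minimal sketch of how the newly enabled op could be invoked from Python, using the benchmark shapes above. The scale shapes, the `float8_e4m3fnuz` dtype, the offsets construction, and the layout requirements are assumptions carried over from the upstream CUDA `_scaled_grouped_mm` path, not something this PR documents:

```python
# Hedged sketch: invoking the FP8 grouped GEMM enabled by this PR on a gfx942
# (MI300) device. Argument layout and scale shapes are assumptions based on
# the upstream CUDA _scaled_grouped_mm path; verify against the actual op.
import torch

G, M, N, K = 16, 128, 2048, 5120  # benchmark shapes from the test plan
dev = "cuda"  # ROCm devices are exposed through the CUDA device type

# A stacks all groups along the row dimension; B holds one matrix per group.
# gfx942 uses the fnuz FP8 variant.
a = torch.randn(G * M, K, device=dev).to(torch.float8_e4m3fnuz)
b = torch.randn(G, N, K, device=dev).to(torch.float8_e4m3fnuz)

# Row-wise float32 scales: one per row of A, one per output column per group.
scale_a = torch.rand(G * M, device=dev, dtype=torch.float32)
scale_b = torch.rand(G, N, device=dev, dtype=torch.float32)

# Cumulative end offsets of each group in the stacked A (int32).
offs = torch.arange(M, G * M + 1, M, device=dev, dtype=torch.int32)

out = torch._scaled_grouped_mm(
    a, b.transpose(-2, -1), scale_a, scale_b,
    offs=offs, out_dtype=torch.bfloat16,
)
assert out.shape == (G * M, N)
```

Passing `b.transpose(-2, -1)` keeps B column-major in its last two dimensions, which (assuming the same constraints as the CUDA path) the FP8 grouped GEMM expects.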
Force-pushed from 77d547a to 68b8727.
```cmake
# This is rather hacky, I could not figure out a clean solution :(
set(HIP_CLANG_FLAGS_ORIGINAL ${HIP_CLANG_FLAGS})
string(REGEX REPLACE "--offload-arch=[^ ]*" "" FILTERED_HIP_CLANG_FLAGS "${HIP_CLANG_FLAGS}")
list(APPEND FILTERED_HIP_CLANG_FLAGS --offload-arch=gfx942)
```
Looks pretty similar to `torch_cuda_get_nvcc_gencode_flag(_existing_arch_flags)` (line 91 in 01f66d0).
Summary: X-link: pytorch/pytorch#160676 Pull Request resolved: #4703 X-link: facebookresearch/FBGEMM#1728 In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness. Reviewed By: drisspg Differential Revision: D79564024 fbshipit-source-id: bf2aa1a3eee43d0e47e9ba1e5514152e502da35f
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This reverts commit 69a25f6.
Summary:
X-link: pytorch/FBGEMM#4703
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable support for `gfx942`, as that is the architecture we have thoroughly tested for performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan:
Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)

Pull Request resolved: pytorch#160676
Approved by: https://github.com/drisspg

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
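Because the FBGEMM GenAI path is currently compiled in only for `gfx942`, callers may want a runtime guard before relying on it. A hedged sketch; the helper name is hypothetical, and the check is an assumption based on standard torch device properties rather than an API exposed by this PR:

```python
# Hedged sketch: detect whether this build/device combination matches what the
# FBGEMM-backed path targets (a ROCm build running on a gfx942 device).
import torch

def fbgemm_grouped_mm_available() -> bool:
    # torch.version.hip is None on non-ROCm builds.
    if torch.version.hip is None or not torch.cuda.is_available():
        return False
    # gcnArchName looks like "gfx942:sramecc+:xnack-" on MI300-class devices.
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return arch.split(":")[0] == "gfx942"

print(fbgemm_grouped_mm_available())
```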