[ROCm] Enable USE_FBGEMM_GENAI by cthi · Pull Request #160676 · pytorch/pytorch · GitHub

Conversation

@cthi
Contributor

@cthi cthi commented Aug 14, 2025

Summary:
X-link: pytorch/FBGEMM#4703

X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.
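For reference, a minimal sketch of how the op might be exercised once this lands. The exact `torch._scaled_grouped_mm` signature is a private API that varies across versions, and the FP8 dtype choice and the `group_offsets` helper here are assumptions, not part of this PR:

```python
def group_offsets(group_sizes):
    # Cumulative end offset of each group along the stacked M dimension,
    # e.g. [128, 128, 128] -> [128, 256, 384].
    offs, total = [], 0
    for m in group_sizes:
        total += m
        offs.append(total)
    return offs

try:
    import torch
    have_gpu = torch.cuda.is_available()
except ImportError:
    have_gpu = False

if have_gpu:  # requires a build with USE_FBGEMM_GENAI=1 on gfx942
    groups, N, K = 16, 2048, 5120
    sizes = [128] * groups
    # On MI300 the fp8e4m3 dtype is the fnuz variant (an assumption here).
    a = torch.randn(sum(sizes), K, device="cuda").to(torch.float8_e4m3fnuz)
    b = torch.randn(groups, N, K, device="cuda").to(torch.float8_e4m3fnuz)
    scale_a = torch.ones(sum(sizes), device="cuda")
    scale_b = torch.ones(groups, N, device="cuda")
    offs = torch.tensor(group_offsets(sizes), device="cuda", dtype=torch.int32)
    out = torch._scaled_grouped_mm(
        a, b.transpose(-2, -1), scale_a, scale_b,
        offs=offs, out_dtype=torch.bfloat16)
```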

Rollback Plan:

Differential Revision: D79564024

Test Plan:

Ensure builds with:

  • USE_FBGEMM_GENAI=1 and without gfx942
  • USE_FBGEMM_GENAI=1 and with gfx942
  • USE_FBGEMM_GENAI=1 and all current PYTORCH_ROCM_ARCH
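In other words, the build should succeed with any architecture list, but the GenAI kernels are only compiled in when `gfx942` is among the requested architectures. A hypothetical sketch of that gating (the real logic lives in PyTorch's CMake):

```python
def fbgemm_genai_kernels_built(use_fbgemm_genai, rocm_archs):
    # The build itself succeeds regardless of the arch list; the FP8
    # grouped-GEMM kernels are only compiled when gfx942 is requested.
    return bool(use_fbgemm_genai) and "gfx942" in rocm_archs

assert fbgemm_genai_kernels_built(True, ["gfx90a", "gfx942"])
assert not fbgemm_genai_kernels_built(True, ["gfx90a"])
assert not fbgemm_genai_kernels_built(False, ["gfx942"])
```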

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot

pytorch-bot bot commented Aug 14, 2025

This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @cthi, please follow step 2 of the internal wiki to get write access so you do not need CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch topic: not user facing topic category labels Aug 14, 2025
@pytorch-bot

pytorch-bot bot commented Aug 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160676

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 68b8727 with merge base bc4db2c:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

@pytorch-bot pytorch-bot bot added the ciflow/rocm Trigger "default" config CI on ROCm label Aug 14, 2025
@pytorch-bot

pytorch-bot bot commented Aug 14, 2025

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot added ciflow/rocm Trigger "default" config CI on ROCm and removed ciflow/rocm Trigger "default" config CI on ROCm labels Aug 14, 2025
@cthi cthi force-pushed the export-D79564024 branch from dfecffb to 2775878 Compare August 14, 2025 20:54
cthi added a commit to cthi/pytorch that referenced this pull request Aug 14, 2025
Summary:

X-link: pytorch/FBGEMM#4703

X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Test Plan:
Will ensure CI is green internally.

Ensure the op can be called; it was added to the FBGEMM testing script:
```
 HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```
```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```
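As a sanity check, the reported throughput is consistent with the benchmark shape, assuming the benchmark counts 2·M·N·K FLOPs per group:

```python
# Shape from the quantize_bench invocation: --M=128 --N=2048 --K=5120 --groups=16
groups, M, N, K = 16, 128, 2048, 5120
flops = 2 * groups * M * N * K             # about 42.95 GFLOP in total

ms = 0.074                                 # scaled_grouped_mm_rowwise runtime
tflops = flops / (ms * 1e-3) / 1e12
# About 580 TFLOPS, matching the reported 576.926 up to rounding of the
# millisecond figure.
```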

Rollback Plan:

Differential Revision: D79564024
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Aug 14, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

@drisspg drisspg added the ciflow/rocm Trigger "default" config CI on ROCm label Aug 14, 2025
cthi added a commit to cthi/FBGEMM-1 that referenced this pull request Aug 25, 2025
Summary:
X-link: pytorch/pytorch#160676


X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Differential Revision: D79564024
@cthi cthi force-pushed the export-D79564024 branch from 2775878 to 4a582fe Compare August 25, 2025 21:34
pytorch-bot bot pushed a commit that referenced this pull request Aug 25, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

cthi added a commit to cthi/FBGEMM-1 that referenced this pull request Aug 25, 2025
@cthi cthi force-pushed the export-D79564024 branch from 4a582fe to 77d547a Compare August 26, 2025 14:28
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

pytorch-bot bot pushed a commit that referenced this pull request Aug 26, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

@cthi cthi requested a review from drisspg September 3, 2025 13:34
@cthi cthi changed the title [WIP][ROCm] Enable USE_FBGEMM_GENAI [ROCm] Enable USE_FBGEMM_GENAI Sep 3, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 3, 2025
# This is rather hacky, I could not figure out a clean solution :(
set(HIP_CLANG_FLAGS_ORIGINAL ${HIP_CLANG_FLAGS})
string(REGEX REPLACE "--offload-arch=[^ ]*" "" FILTERED_HIP_CLANG_FLAGS "${HIP_CLANG_FLAGS}")
list(APPEND FILTERED_HIP_CLANG_FLAGS --offload-arch=gfx942;)
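The regex filtering above is doing the equivalent of the following (a Python paraphrase of the CMake, for illustration only):

```python
import re

def restrict_offload_arch(hip_clang_flags, arch="gfx942"):
    # Strip any pre-existing --offload-arch=... flags, then pin the single
    # architecture the FBGEMM GenAI kernels are built for.
    filtered = re.sub(r"--offload-arch=[^ ]*", "", hip_clang_flags).split()
    filtered.append(f"--offload-arch={arch}")
    return " ".join(filtered)

print(restrict_offload_arch("-O3 --offload-arch=gfx90a --offload-arch=gfx1100"))
# -O3 --offload-arch=gfx942
```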
Contributor

looks pretty similar to

torch_cuda_get_nvcc_gencode_flag(_existing_arch_flags)

facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this pull request Sep 4, 2025
Summary:
X-link: pytorch/pytorch#160676

Pull Request resolved: #4703

X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Reviewed By: drisspg

Differential Revision: D79564024

fbshipit-source-id: bf2aa1a3eee43d0e47e9ba1e5514152e502da35f
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pragupta added a commit to ROCm/pytorch that referenced this pull request Sep 9, 2025
pragupta added a commit to ROCm/pytorch that referenced this pull request Sep 10, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Summary:
X-link: pytorch/FBGEMM#4703

X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan:

Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)

Pull Request resolved: pytorch#160676
Approved by: https://github.com/drisspg
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025