[ROCm] Enable USE_FBGEMM_GENAI by cthi · Pull Request #160676 · pytorch/pytorch · GitHub

Conversation

@cthi
Contributor

@cthi cthi commented Aug 14, 2025

Summary:
X-link: pytorch/FBGEMM#4703

X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.
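For reference, a minimal sketch of how the op might be exercised once this lands. The exact `torch._scaled_grouped_mm` signature is a private API that varies across versions, and the FP8 dtype choice and the `group_offsets` helper here are assumptions, not part of this PR:

```python
def group_offsets(group_sizes):
    # Cumulative end offset of each group along the stacked M dimension,
    # e.g. [128, 128, 128] -> [128, 256, 384].
    offs, total = [], 0
    for m in group_sizes:
        total += m
        offs.append(total)
    return offs

try:
    import torch
    have_gpu = torch.cuda.is_available()
except ImportError:
    have_gpu = False

if have_gpu:  # requires a build with USE_FBGEMM_GENAI=1 on gfx942
    groups, N, K = 16, 2048, 5120
    sizes = [128] * groups
    # On MI300 the fp8e4m3 dtype is the fnuz variant (an assumption here).
    a = torch.randn(sum(sizes), K, device="cuda").to(torch.float8_e4m3fnuz)
    b = torch.randn(groups, N, K, device="cuda").to(torch.float8_e4m3fnuz)
    scale_a = torch.ones(sum(sizes), device="cuda")
    scale_b = torch.ones(groups, N, device="cuda")
    offs = torch.tensor(group_offsets(sizes), device="cuda", dtype=torch.int32)
    out = torch._scaled_grouped_mm(
        a, b.transpose(-2, -1), scale_a, scale_b,
        offs=offs, out_dtype=torch.bfloat16)
```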

Rollback Plan:

Differential Revision: D79564024

Test Plan:

Ensure builds with:

  • USE_FBGEMM_GENAI=1 and without gfx942
  • USE_FBGEMM_GENAI=1 and with gfx942
  • USE_FBGEMM_GENAI=1 and all current PYTORCH_ROCM_ARCH
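In other words, the build should succeed with any architecture list, but the GenAI kernels are only compiled in when `gfx942` is among the requested architectures. A hypothetical sketch of that gating (the real logic lives in PyTorch's CMake):

```python
def fbgemm_genai_kernels_built(use_fbgemm_genai, rocm_archs):
    # The build itself succeeds regardless of the arch list; the FP8
    # grouped-GEMM kernels are only compiled when gfx942 is requested.
    return bool(use_fbgemm_genai) and "gfx942" in rocm_archs

assert fbgemm_genai_kernels_built(True, ["gfx90a", "gfx942"])
assert not fbgemm_genai_kernels_built(True, ["gfx90a"])
assert not fbgemm_genai_kernels_built(False, ["gfx942"])
```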

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot

pytorch-bot bot commented Aug 14, 2025

This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @cthi, please follow step 2 of the internal wiki to get write access so you do not need CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch topic: not user facing topic category labels Aug 14, 2025
@pytorch-bot

pytorch-bot bot commented Aug 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160676

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 68b8727 with merge base bc4db2c:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

@pytorch-bot pytorch-bot bot added the ciflow/rocm Trigger "default" config CI on ROCm label Aug 14, 2025
@pytorch-bot

pytorch-bot bot commented Aug 14, 2025

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot added ciflow/rocm Trigger "default" config CI on ROCm and removed ciflow/rocm Trigger "default" config CI on ROCm labels Aug 14, 2025
@cthi cthi force-pushed the export-D79564024 branch from dfecffb to 2775878 Compare August 14, 2025 20:54
cthi added a commit to cthi/pytorch that referenced this pull request Aug 14, 2025
Summary:

X-link: pytorch/FBGEMM#4703

X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Test Plan:
Will ensure CI is green internally.

Ensure the op can be called; it was added to the FBGEMM testing script:
```
 HIP_VISIBLE_DEVICES=1 buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=ck_rowwise_grouped,ck_grouped_stacked_torch_2d3d,scaled_grouped_mm_rowwise --grouped --M=128 --N=2048 --K=5120 --groups=16
```
```
ck_rowwise_grouped sim: 115.344.
ck_rowwise_grouped ms: 0.167.
ck_rowwise_grouped TFLOPS: 257.254.
ck_rowwise_grouped GB/s: 1117.952.
Average metrics over 1 iterations:
ck_grouped_stacked_torch_2d3d sim: 115.344.
ck_grouped_stacked_torch_2d3d ms: 0.072.
ck_grouped_stacked_torch_2d3d TFLOPS: 594.511.
ck_grouped_stacked_torch_2d3d GB/s: 2583.570.
Average metrics over 1 iterations:
scaled_grouped_mm_rowwise sim: 115.344.
scaled_grouped_mm_rowwise ms: 0.074.
scaled_grouped_mm_rowwise TFLOPS: 576.926.
scaled_grouped_mm_rowwise GB/s: 2507.148.
```
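As a sanity check, the reported throughput is consistent with the benchmark shape, assuming the benchmark counts 2·M·N·K FLOPs per group:

```python
# Shape from the quantize_bench invocation: --M=128 --N=2048 --K=5120 --groups=16
groups, M, N, K = 16, 128, 2048, 5120
flops = 2 * groups * M * N * K             # about 42.95 GFLOP in total

ms = 0.074                                 # scaled_grouped_mm_rowwise runtime
tflops = flops / (ms * 1e-3) / 1e12
# About 580 TFLOPS, matching the reported 576.926 up to rounding of the
# millisecond figure.
```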

Rollback Plan:

Differential Revision: D79564024
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Aug 14, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

@drisspg drisspg added the ciflow/rocm Trigger "default" config CI on ROCm label Aug 14, 2025
cthi added a commit to cthi/FBGEMM-1 that referenced this pull request Aug 25, 2025
Summary:
X-link: pytorch/pytorch#160676


X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Differential Revision: D79564024
@cthi cthi force-pushed the export-D79564024 branch from 2775878 to 4a582fe Compare August 25, 2025 21:34
pytorch-bot bot pushed a commit that referenced this pull request Aug 25, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

cthi added a commit to cthi/FBGEMM-1 that referenced this pull request Aug 25, 2025
@cthi cthi force-pushed the export-D79564024 branch from 4a582fe to 77d547a Compare August 26, 2025 14:28
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

pytorch-bot bot pushed a commit that referenced this pull request Aug 26, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79564024

@cthi cthi requested a review from drisspg September 3, 2025 13:34
@cthi cthi changed the title [WIP][ROCm] Enable USE_FBGEMM_GENAI [ROCm] Enable USE_FBGEMM_GENAI Sep 3, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 3, 2025
# This is rather hacky, I could not figure out a clean solution :(
set(HIP_CLANG_FLAGS_ORIGINAL ${HIP_CLANG_FLAGS})
string(REGEX REPLACE "--offload-arch=[^ ]*" "" FILTERED_HIP_CLANG_FLAGS "${HIP_CLANG_FLAGS}")
list(APPEND FILTERED_HIP_CLANG_FLAGS --offload-arch=gfx942;)
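The regex filtering above is doing the equivalent of the following (a Python paraphrase of the CMake, for illustration only):

```python
import re

def restrict_offload_arch(hip_clang_flags, arch="gfx942"):
    # Strip any pre-existing --offload-arch=... flags, then pin the single
    # architecture the FBGEMM GenAI kernels are built for.
    filtered = re.sub(r"--offload-arch=[^ ]*", "", hip_clang_flags).split()
    filtered.append(f"--offload-arch={arch}")
    return " ".join(filtered)

print(restrict_offload_arch("-O3 --offload-arch=gfx90a --offload-arch=gfx1100"))
# -O3 --offload-arch=gfx942
```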
Contributor

looks pretty similar to

torch_cuda_get_nvcc_gencode_flag(_existing_arch_flags)

facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this pull request Sep 4, 2025
Summary:
X-link: pytorch/pytorch#160676

Pull Request resolved: #4703

X-link: facebookresearch/FBGEMM#1728

In this diff we enable support for the new FBGEMM-backed FP8 `torch._scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Reviewed By: drisspg

Differential Revision: D79564024

fbshipit-source-id: bf2aa1a3eee43d0e47e9ba1e5514152e502da35f
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pragupta added a commit to ROCm/pytorch that referenced this pull request Sep 9, 2025
pragupta added a commit to ROCm/pytorch that referenced this pull request Sep 10, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Summary:
X-link: pytorch/FBGEMM#4703

X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 `_scaled_grouped_mm` on ROCm. For now we only enable `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan:

Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](https://github.com/pytorch/pytorch/blob/9491d289b329e4ba4a9f5f5b1be7960671bb7840/.ci/docker/libtorch/build.sh#L48)

Pull Request resolved: pytorch#160676
Approved by: https://github.com/drisspg
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025