[sparse][semi-structured] Add float8 dtype support to 2:4 sparsity by jcaip · Pull Request #136397 · pytorch/pytorch · GitHub

Conversation

@jcaip
Contributor

@jcaip jcaip commented Sep 21, 2024

Stack from ghstack (oldest at bottom):

Summary:

This PR adds `torch.float8_e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This lets users run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cuSPARSELt >= 0.6.2, via the `torch._scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```
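
Here `rand_sparse_semi_structured_mask` and `to_float8` are test helpers rather than public APIs. A minimal sketch of what a `to_float8` helper could look like (a per-tensor scaled cast that returns the fp8 tensor and its dequantization scale; the exact helper used by the tests may differ):

```
def to_float8(x, dtype=torch.float8_e4m3fn):
    finfo = torch.finfo(dtype)
    # Scale so that the tensor's absmax maps to the dtype's largest representable value.
    scale = finfo.max / x.abs().max().clamp(min=1e-12)
    x_scaled = (x * scale).clamp(min=finfo.min, max=finfo.max)
    # torch._scaled_mm takes the dequantization scale, i.e. the reciprocal.
    return x_scaled.to(dtype), scale.float().reciprocal()
```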

Note that, to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a `NotImplementedError`.
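
A quick way to see that behavior, reusing the variables from the example above (sketch only):

```
try:
    torch.mm(A_fp8_sparse, B_fp8)
except NotImplementedError:
    # fp8 sparse matmuls are only exposed through torch._scaled_mm
    pass
```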

I also turned on cuSPARSELt by default and added `CUSPARSELT_MAX_ID` to the
backend to make the tests a bit cleaner.
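
For anyone who wants the previous behavior, the CUTLASS backend can still be forced; a small sketch, assuming the existing `_FORCE_CUTLASS` flag keeps working as it does today:

```
from torch.sparse import SparseSemiStructuredTensor

# cuSPARSELt is now the default backend when it is available; flip this flag
# to opt back into the CUTLASS kernels instead.
SparseSemiStructuredTensor._FORCE_CUTLASS = True
```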

Test Plan:

```
python test/test_sparse_semi_structured.py -k scaled_mm
python test/test_sparse_semi_structured.py -k fp8
```

@pytorch-bot

pytorch-bot bot commented Sep 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136397

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 4497986 with merge base 803ce50:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: sparse release notes category label Sep 21, 2024
…parsity"

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
…parsity"

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
…parsity"

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

jcaip added a commit that referenced this pull request Sep 22, 2024

…parsity"

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
…parsity"

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
…parsity"

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
…parsity"

Summary:

This PR adds float8 support to cuSPARSELt and `to_sparse_semi_structured`.

Support for this op happens through `torch._scaled_mm`.

It also turns on cuSPARSELt by default and adds CUSPARSELT_MAX_ID to the
backend.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

jcaip added a commit that referenced this pull request Sep 24, 2024

…parsity"

Summary:

This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@jcaip jcaip requested review from cpuhrsch, drisspg and vkuzo September 24, 2024 22:20
…parsity"

Summary:

This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

jcaip added a commit that referenced this pull request Sep 24, 2024

@jcaip jcaip added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 24, 2024
Contributor

@drisspg drisspg left a comment

I think this looks good. Is there a reason why we can't also add support for float8_e5m2?

@jcaip
Contributor Author

jcaip commented Sep 26, 2024

> I think this looks good. Is there a reason why we can't also add support for float8_e5m2?

I am purely targeting inference, which from what I understand is mostly float8_e4m3.
Actually, I think it would be best to get this merged first; I will add float8_e5m2 in a subsequent PR.

…parsity"

Summary:

This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
…parsity"

Summary:

This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
```
  tensor_alpha_mode = 1;
  TORCH_CUDASPARSE_CHECK(cusparseLtMatmulDescSetAttribute(
      &handle, &matmul, CUSPARSELT_MATMUL_ALPHA_VECTOR_SCALING,
      &tensor_alpha_mode, sizeof(tensor_alpha_mode)));
  alpha_ptr = (float*)alpha_tensor.data_ptr();
```
Contributor

Nit: use `static_cast` here instead of a C-style cast.
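
For example, something along these lines (same semantics, only the cast changes):

```
alpha_ptr = static_cast<float*>(alpha_tensor.data_ptr());
```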


```
@unittest.skipIf(not PLATFORM_SUPPORTS_FP8, "FP8 is only supported on H100+ and sm_89 and MI300+ devices")
@parametrize("out_dtype", [torch.float16, torch.bfloat16, torch.float32])
@parametrize("dense_input_shape", [(256, 128)])
```
Contributor

Maybe add an fp8-output test as well; you will need to cast up before comparing with `assert_close`.

Contributor Author

Do you have a reference test for this? allclose fails for me with this, and all the other float8 tests I saw just test these three.

Contributor

I have this one:

`def test_float8_bias(self, device) -> None:`

but it uses predefined inputs.
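
For reference, a minimal sketch of the kind of fp8-output check being suggested, reusing the names from the example in the summary (assumption: both results are cast up to float32 before `torch.testing.assert_close`, with loose tolerances; this is not the exact test that landed):

```
out_dtype = torch.float8_e4m3fn

dense_result = torch._scaled_mm(
    A_fp8, B_fp8, scale_a=A_scale, scale_b=B_scale, out_dtype=out_dtype
)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8, scale_a=A_scale, scale_b=B_scale, out_dtype=out_dtype
)

# fp8 tensors can't be compared directly, so cast up before assert_close.
torch.testing.assert_close(
    dense_result.to(torch.float32),
    sparse_result.to(torch.float32),
    rtol=7e-2,
    atol=7e-2,
)
```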

…parsity"

Summary:

This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
…parsity"

Summary:

This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

jcaip added a commit that referenced this pull request Sep 27, 2024

@jcaip
Contributor Author

jcaip commented Sep 27, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@jcaip
Contributor Author

jcaip commented Sep 27, 2024

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
