Add NVIDIA A100 optimized meta parameters to bsr_dense_mm #111760

pearu · 2023-10-22T10:18:29Z

As in the title.

The figures below compare the performance of bsr_dense_mm with optimized meta parameters against bsr_dense_mm with default meta parameters (GPU: NVIDIA A100-SXM4-80GB). The first figure shows the performance equilibrium point: the BSR tensor sparsity at which bsr_dense_mm has the same performance as torch.matmul. The second figure shows the speedup of bsr_dense_mm with optimized meta parameters over bsr_dense_mm with default meta parameters, measured at the performance equilibrium points.

In sum, this PR speeds up bsr_dense_mm by about 50 %, depending on the BSR tensor shape and blocksize, and lowers the sparsity at which matmul with a BSR tensor performs on par with matmul with a strided tensor.
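To make the two quantities above concrete, here is a minimal sketch of how an equilibrium sparsity and a speedup at that point could be computed from timings. It is not the benchmark used in this PR: `equilibrium_sparsity`, `toy_model`, and all numbers are hypothetical, and the sketch assumes that the sparse kernel's run time decreases monotonically as sparsity grows. In a real measurement, the timing callable would wrap bsr_dense_mm and the dense time would come from torch.matmul.

```python
from typing import Callable


def equilibrium_sparsity(
    sparse_time: Callable[[float], float],
    dense_time: float,
    lo: float = 0.0,
    hi: float = 1.0,
    tol: float = 1e-4,
) -> float:
    """Find the sparsity at which sparse_time(s) equals dense_time.

    Assumption (not from the PR): sparse_time decreases monotonically
    as the sparsity s grows from lo to hi.
    """
    if sparse_time(lo) <= dense_time:
        return lo  # the sparse kernel already wins at the lowest sparsity
    if sparse_time(hi) > dense_time:
        raise ValueError("sparse kernel never catches up with dense matmul")
    # Bisection: keep the sparse kernel slower at lo and not slower at hi.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sparse_time(mid) > dense_time:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def toy_model(overhead: float, full_cost: float) -> Callable[[float], float]:
    """Hypothetical cost model: fixed overhead plus work for nonzero blocks."""
    return lambda s: overhead + full_cost * (1.0 - s)


# Made-up timings in arbitrary units; these are NOT measurements from this PR.
dense = 1.0
default_meta = toy_model(overhead=0.1, full_cost=3.0)
tuned_meta = toy_model(overhead=0.1, full_cost=2.0)

eq_default = equilibrium_sparsity(default_meta, dense)
eq_tuned = equilibrium_sparsity(tuned_meta, dense)

# Speedup of the tuned kernel over the default one at the tuned equilibrium.
speedup = default_meta(eq_tuned) / tuned_meta(eq_tuned)

print(f"equilibrium sparsity (default meta): {eq_default:.3f}")
print(f"equilibrium sparsity (tuned meta):   {eq_tuned:.3f}")
print(f"speedup at tuned equilibrium:        {speedup:.2f}x")
```

In this toy model a faster kernel reaches parity with the dense baseline at a lower sparsity, which is the sense in which tuned meta parameters lower the equilibrium point.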


pytorch-bot · 2023-10-22T10:18:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111760

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a9ece3f with merge base 57c7aa1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

<img src="https://github.com/pytorch/pytorch/assets/402156/6fe9d35f-dd21-4aa0-bb01-6ee257254453" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/506921c6-3770-4209-ad3d-498d2ae4989d" width="48%">

ghstack-source-id: fb8f4ff Pull Request resolved: #111760

#111796) Pull Request resolved: #111796 Approved by: https://github.com/cpuhrsch ghstack dependencies: #110396, #111470, #111489, #111760

…1760) Pull Request resolved: pytorch#111760 Approved by: https://github.com/cpuhrsch ghstack dependencies: pytorch#110396, pytorch#111470, pytorch#111489



Add NVIDIA A100 optimized meta parameters to bsr_dense_mm

e00aa8b

[ghstack-poisoned]

This was referenced Oct 22, 2023

Add scatter_mm and bsr_scatter_mm operations. #110396

Closed

Use lru_cache to cache indices data for bsr_scatter_mm. #111470

Closed

pytorch-bot bot added the "release notes: sparse" label Oct 22, 2023

pearu mentioned this pull request Oct 18, 2023

Use more performant bsr_scatter_mm within bsr_dense_mm when blocksize is 16. #111489

Closed

pearu added the "module: performance", "topic: improvements", and "open source" labels Oct 22, 2023

pearu added a commit that referenced this pull request Oct 22, 2023

Add NVIDIA A100 optimized meta parameters to bsr_dense_mm

a3aadbe

ghstack-source-id: fb8f4ff Pull Request resolved: #111760

pearu mentioned this pull request Oct 23, 2023

Add batched dimensions support to the second operand of bsr_scatter_mm #111796

Closed

pearu requested review from amjames and cpuhrsch October 23, 2023 15:03

cpuhrsch approved these changes Oct 23, 2023


pytorchmergebot added the Merged label Oct 23, 2023

pytorchmergebot closed this in 6382011 Oct 23, 2023

facebook-github-bot deleted the gh/pearu/124/head branch October 27, 2023 14:25
