Add scaling arguments to bsr_dense_addmm by pearu · Pull Request #136104 · pytorch/pytorch · GitHub

Conversation


@pearu pearu commented Sep 14, 2024

As in the title.

Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413

The PR assumes that the existing tuning parameters remain good when scaling arguments are used. Verifying this is left as a follow-up task.
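For context, below is a minimal dense-tensor reference sketch of the scaled operation; the parameter names `left_alpha` and `right_alpha` and the row/column scaling semantics are illustrative assumptions here, not necessarily the kernel's final signature:

```python
import torch

def bsr_dense_addmm_reference(input, bsr, dense, *, beta=1, alpha=1,
                              left_alpha=None, right_alpha=None):
    # Dense reference: out = beta * input + left_alpha * (alpha * (bsr @ dense)) * right_alpha,
    # where left_alpha scales rows and right_alpha scales columns of the product.
    prod = alpha * (bsr.to_dense() @ dense)
    if left_alpha is not None:
        prod = left_alpha.unsqueeze(-1) * prod   # per-row scaling
    if right_alpha is not None:
        prod = prod * right_alpha.unsqueeze(-2)  # per-column scaling
    return beta * input + prod
```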

Also, this PR redefines triton-contiguous tensors: the tensor's smallest stride must not be larger than 1. Zero strides are now allowed; previously they triggered a `contiguous` call even though the underlying memory buffer was already contiguous.

Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to a code (torch or triton?) that implements the element/chunk-wise copy so that we could verify that allowing zero strides indeed would not trigger element-wise copies. Atm, the performance increase in ViT-H benchmarks (that involve using 0 strides) is an evidence that allowing zero strides does not lead to slow-downs.

Stack from ghstack (oldest at bottom):

@pytorch-bot pytorch-bot bot added the release notes: sparse release notes category label Sep 14, 2024

pytorch-bot bot commented Sep 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136104

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 825a557 with merge base 357b7fb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pearu pearu added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 15, 2024
@pearu pearu requested review from amjames and cpuhrsch September 15, 2024 17:32

pearu commented Sep 16, 2024

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
Pull Request resolved: pytorch#136104
Approved by: https://github.com/cpuhrsch
@github-actions github-actions bot deleted the gh/pearu/135/head branch October 17, 2024 02:05