[sparse] Fix semi-structured sparse shape mismatch bug #110420
Conversation
Summary:

Currently, PyTorch incorrectly calculates the size of the returned matrix when we pass a non-contiguous batched (>2d) input to the semi-structured sparse subclass. This is most common in MLP layers, where we have two linear layers back to back. This leads to an error like the following:

```
RuntimeError: shape '[20, 64, 64, 3072]' is invalid for input of size 62914560
```

where the size of the sparse matmul result is off by a factor of 4. I'm not sure exactly where this bug comes from, but I traced it to [this](https://github.com/pytorch/pytorch/blob/01b2f25ebda85d307b27847ad67efe2b5bb54265/aten/src/ATen/native/LinearAlgebra.cpp#L1959) function. Note that this error goes away in inference mode, since we avoid decomposing the aten.linear op.

This fix overloads `__torch_function__`, specifically for the F.linear op. The goal is to implement our own folding-to-2d / unfolding code so that we can avoid running into this issue. An alternative way to fix this issue is to set TORCH_FLATTEN_LINEAR_3D=True, which also fixes this error.

Test Plan:

```
python test/test_sparse_semi_structured.py -k test_mlp
```
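For readers unfamiliar with the mechanism: below is a minimal, self-contained sketch of the kind of `__torch_function__` interception described above. The class name, the small dimensions, and the fold/unfold details are illustrative assumptions, not the actual `SparseSemiStructuredTensor` code changed in this PR.

```python
import torch
import torch.nn.functional as F

class FoldingLinearTensor(torch.Tensor):
    """Toy subclass: intercept F.linear and do the 2d folding/unfolding ourselves."""

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is F.linear:
            input, weight, *rest = args
            bias = rest[0] if rest else kwargs.get("bias")
            # Fold the batched (>2d) input to 2d before the matmul, so the
            # decomposed aten.linear path never has to infer the batched shape.
            folded = input.reshape(-1, input.shape[-1])
            out = torch.mm(folded, weight.t())
            if bias is not None:
                out = out + bias
            # Unfold back to the original leading (batch) dimensions.
            return out.view(*input.shape[:-1], out.shape[-1])
        # Everything else falls through to the default tensor behavior.
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)

# Hypothetical usage: wrap the weight so F.linear routes through the override.
# (Dimensions are kept small here; in the reported error they were [20, 64, 64, ...].)
w = torch.randn(128, 64).as_subclass(FoldingLinearTensor)
x = torch.randn(2, 3, 4, 64)        # batched (>2d) input
print(F.linear(x, w).shape)         # torch.Size([2, 3, 4, 128])
```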
cc @albanD since I think this is a curious Tensor subclass "edge case"

LGTM (aside from the test failing at the moment because of an import).

I'm confused. Is this just a bug in the at::linear op? Why not just fix that instead of doing this?

cc @albanD I can create an issue to triage this more precisely and fix it, but this issue only arises for semi-structured sparse tensors and I'm not sure if it's a bug or a special case that needs to be handled. This fix seemed easier.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

Rebase failed due to a command failure. Raised by https://github.com/pytorch/pytorch/actions/runs/6434285998
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).

Merge failed. Reason: 1 job failed: linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-build / build
@pytorchbot merge -f "unrelated failures"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Pull Request resolved: #110420. Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
Stack from ghstack (oldest at bottom):
Summary:
Fixes: #110664
Currently, PyTorch incorrectly calculates the size of the returned
matrix when we pass a non-contiguous batched (>2d) input to the
semi-structured sparse subclass.
This is most common in MLP layers, where we have 2 linear layers back to back.
This will lead to an error like the following:
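```
RuntimeError: shape '[20, 64, 64, 3072]' is invalid for input of size 62914560
```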
Where the size of the sparse matmul result is off because we infer the
output shape with the wrong tensor shape.
This happens because of a bug where we did not update the subclass
tensor shape when doing transpose.
For semi-structured sparsity, transposing is a no-op where we just set
the boolean flag, but we forgot to also update the tensor shape.
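A rough sketch of the bookkeeping this refers to: when transpose is metadata-only, the wrapper's reported size has to be swapped along with the flag, otherwise downstream shape inference uses the stale shape. The `FakeSparseWrapper` stand-in below is illustrative, not the actual `SparseSemiStructuredTensor` internals.

```python
import torch
from dataclasses import dataclass

@dataclass
class FakeSparseWrapper:
    """Illustrative stand-in for a 2:4 sparse subclass: packed data plus metadata."""
    packed_data: torch.Tensor
    shape: torch.Size
    transposed: bool = False

def transpose_wrapper(w: FakeSparseWrapper) -> FakeSparseWrapper:
    rows, cols = w.shape
    return FakeSparseWrapper(
        packed_data=w.packed_data,        # no data movement: transpose is metadata-only
        shape=torch.Size([cols, rows]),   # <- the shape update the bug was missing
        transposed=not w.transposed,
    )

w = FakeSparseWrapper(torch.empty(3072 * 768 // 2), torch.Size([3072, 768]))
print(transpose_wrapper(w).shape)         # torch.Size([768, 3072])
```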
Note that this error goes away in inference mode, since we avoid
decomposing the aten.linear op and handle shape folding ourselves,
which changes the execution path.
An alternative way to fix this issue is to set
TORCH_FLATTEN_LINEAR_3D=True, which will also fix this error.
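For context, here is a rough reproduction sketch of the failing pattern: two back-to-back sparse linear layers fed a batched (>2d) input, where the intermediate activation is the non-contiguous batched tensor the description refers to. It assumes a CUDA device with 2:4 semi-structured sparse kernel support and the `torch.sparse.to_sparse_semi_structured` API; the dimensions and the 2:4 mask construction are illustrative, and this is not the PR's actual `test_mlp` test.

```python
import torch
import torch.nn as nn
from torch.sparse import to_sparse_semi_structured

def make_sparse_linear(in_f, out_f):
    # Prune the weight to a 2:4 pattern (keep 2 of every 4 elements), then
    # swap it for the semi-structured sparse subclass.
    lin = nn.Linear(in_f, out_f, bias=False).half().cuda()
    mask = torch.tensor([0, 0, 1, 1], device="cuda").tile(out_f, in_f // 4).bool()
    lin.weight = nn.Parameter(to_sparse_semi_structured(lin.weight.masked_fill(~mask, 0)))
    return lin

mlp = nn.Sequential(make_sparse_linear(768, 3072), make_sparse_linear(3072, 768))
x = torch.randn(20, 64, 64, 768, dtype=torch.half, device="cuda")

# The training-mode path (decomposed aten.linear) hit the shape mismatch before
# this fix; the inference-mode path did not, since shape folding is handled there.
out = mlp(x)
with torch.inference_mode():
    out = mlp(x)
```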
Test Plan:
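```
python test/test_sparse_semi_structured.py -k test_mlp
```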