KEMBAR78
[Inductor] fix broadcast logic for Triton by nandesuka · Pull Request #141027 · pytorch/pytorch · GitHub
Skip to content

Conversation

@nandesuka
Copy link
Contributor

@nandesuka nandesuka commented Nov 19, 2024

Summary: Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141027

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Unrelated Failures

As of commit b4276da with merge base 9dd3b85 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Nov 19, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: nandesuka / name: nanzha (b4276da)

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

nandesuka added a commit to nandesuka/pytorch that referenced this pull request Nov 19, 2024
Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

nandesuka added a commit to nandesuka/pytorch that referenced this pull request Nov 20, 2024
Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

@nandesuka
Copy link
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Nov 20, 2024
Copy link
Contributor

@blaine-rister blaine-rister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix. I left a comment about one of the tests being changed. Also it looks like the expected code for one of the CI tests needs to be updated. Once the CI is green, this looks ready to go :)

y
)

@parametrize("prefer_nd_tiling", [True])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this change was accidental? This used to be

@parametrize("prefer_nd_tiling", [False, True])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, good catch. Will update.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 20, 2024
nandesuka added a commit to nandesuka/pytorch that referenced this pull request Nov 20, 2024
Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

@nandesuka nandesuka force-pushed the export-D65518033 branch 2 times, most recently from 79d197f to 929d49a Compare November 25, 2024 20:20
nandesuka added a commit to nandesuka/pytorch that referenced this pull request Nov 25, 2024
Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

nandesuka added a commit to nandesuka/pytorch that referenced this pull request Nov 26, 2024
Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D65518033

@nandesuka
Copy link
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033
@pytorchmergebot
Copy link
Collaborator

Successfully rebased export-D65518033 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout export-D65518033 && git pull --rebase)

@nandesuka nandesuka closed this Nov 27, 2024
@nandesuka nandesuka deleted the export-D65518033 branch November 27, 2024 14:51
facebook-github-bot pushed a commit that referenced this pull request Nov 27, 2024
Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033
pytorchmergebot pushed a commit that referenced this pull request Nov 28, 2024
Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033

Pull Request resolved: #141693
Approved by: https://github.com/blaine-rister
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…41693)

Summary:

Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033

Pull Request resolved: pytorch#141693
Approved by: https://github.com/blaine-rister
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants