[Inductor][CPP] Fix layout for local buf in outer loop fusion by CaoE · Pull Request #160857 · pytorch/pytorch · GitHub

Conversation

@CaoE
Collaborator

@CaoE CaoE commented Aug 18, 2025

@pytorch-bot

pytorch-bot bot commented Aug 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160857

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d62fb9f with merge base a4fc051:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@CaoE CaoE added the topic: not user facing and ciflow/trunk labels Aug 18, 2025
@CaoE CaoE changed the title [Inductor][CPP] Fix layout for local buf of outer loop fusion [Inductor][CPP] Fix layout for local buf in outer loop fusion Aug 18, 2025
@CaoE CaoE requested review from jgong5 and leslie-fang-intel and removed request for jgong5 August 18, 2025 14:03
# Local Buffer is a view of global buffer
local_buffer_stride: list[int] = []
stride = global_buffer_layout.stride[-1]
local_buffer_size = get_call_ranges(scheduler_node)[
Collaborator

In this case, is scheduler_node also a view of global_buffer?

Collaborator Author

In this case global_buffer is scheduler_node.node.

Collaborator

Then why can't we use global_buffer_layout.size[size_offset:] directly?

Collaborator Author

Because global_buffer_layout is the tensor layout, while size_offset is the loop depth. The number of dimensions of global_buffer_layout may not match the number of loop levels, e.g., when there are merged dims.
In this case, the global_buffer_layout size is [5, 1, 32, 32] but the call ranges are [5, 1024]. If we used global_buffer_layout.size[size_offset:] to create local_buffer_layout we would get the size [32], but we need [1024].
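A minimal standalone sketch of the mismatch, with illustrative values (the concrete size_offset value below is assumed for the example, not taken from the PR):

# Tensor layout of the global buffer vs. the loop call ranges after dim merging.
global_buffer_size = [5, 1, 32, 32]   # dims of the global buffer layout
call_ranges = [5, 1024]               # loop ranges: the inner dims were merged into 1024
size_offset = 1                       # assumed for illustration: one outer loop level stays global

# Slicing the tensor layout by the loop depth picks tensor dims, not loop extents:
print(global_buffer_size[size_offset:])   # [1, 32, 32] -- not the 1024 the fused inner loop iterates over
# Slicing the loop call ranges gives the extent the local buffer must cover:
print(call_ranges[size_offset:])          # [1024]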

Collaborator

@leslie-fang-intel leslie-fang-intel left a comment

LGTM

@CaoE CaoE marked this pull request as ready for review August 19, 2025 02:48
@CaoE CaoE requested a review from jansel August 19, 2025 02:48
continue
# Local Buffer is a view of global buffer
local_buffer_stride: list[int] = []
stride = global_buffer_layout.stride[-1]
Contributor

Will this work for a size = [] tensor?

Collaborator Author

global_buffer is an instance of ir.ComputedBuffer. We haven't encountered a size = [] tensor yet. Is there a possible case for this? @leslie-fang-intel could you please help with this question?
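For context on the corner case being asked about, a small standalone illustration (not from the PR): a 0-dim tensor has empty size and stride, so indexing the innermost stride would fail.

import torch

scalar = torch.tensor(1.0)       # 0-dim (scalar) tensor
print(list(scalar.size()))       # []
print(list(scalar.stride()))     # []
# On such a buffer, global_buffer_layout.stride[-1] would raise IndexError,
# which is the situation the question above is probing.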

@CaoE
Collaborator Author

CaoE commented Aug 21, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.


Development

Successfully merging this pull request may close these issues.

torch.compile generates wrong code that leads to corrupted memory

5 participants