KEMBAR78
Fix OpSchema equality check by swolchok · Pull Request #161231 · pytorch/pytorch · GitHub
Skip to content

Conversation

@swolchok
Copy link
Contributor

@swolchok swolchok commented Aug 22, 2025

`__eq__` didn't compare lists of DTensorSpec, but `__hash__` did (and
it looks like attention was paid to hash, so I made comparison follow
suit).

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Aug 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161231

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7fd9430 with merge base a85711d (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

swolchok added a commit that referenced this pull request Aug 22, 2025
`__eq__` didn't compare lists of DTensorSpec, but `__hash__` did (and
it looks like attention was paid to hash, so I made comparison follow
suit).

ghstack-source-id: 33571f3
Pull Request resolved: #161231
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 22, 2025
@swolchok swolchok requested review from XilunWu and wanchaol and removed request for wanchaol August 22, 2025 01:18
@swolchok swolchok added topic: bug fixes topic category release notes: distributed (dtensor) release notes category labels Aug 22, 2025
@swolchok swolchok requested review from ezyang and zpcore August 22, 2025 22:29
Copy link
Contributor

@wconstab wconstab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was at first hesitant to stamp this because while we definitely should fix this, it is possible DTensor did something that profited from the divergence and I don't want to regress that. However looking at the details of how eq was broken, i think its a safe fix.

Still i'm yolo'ing a bit here becuase i have not thoroughly reviewed where DTensor uses hash and where it uses eq so it's possible i'm missing something.

@XilunWu XilunWu added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 22, 2025
@zpcore
Copy link
Member

zpcore commented Aug 22, 2025

I was at first hesitant to stamp this because while we definitely should fix this, it is possible DTensor did something that profited from the divergence and I don't want to regress that. However looking at the details of how eq was broken, i think its a safe fix.

Still i'm yolo'ing a bit here becuase i have not thoroughly reviewed where DTensor uses hash and where it uses eq so it's possible i'm missing something.

My understanding why we didn't suffer the issue before is that we do op_schema comparison after call unwrap_to_op_info:

op_info = self.unwrap_to_op_info(op_call, args, kwargs)
logger.debug("Dispatching op_call: %s", op_info.schema)
try:
self.sharding_propagator.propagate(op_info)

in this case, self_arg can be DTensorSpec, but not a list[DTensorSpec].

@pytorchmergebot
Copy link
Collaborator

Starting merge as part of PR stack under #161285

pytorchmergebot pushed a commit that referenced this pull request Aug 25, 2025
…h__` and `__eq__` (#161234)

The performance cost of `dict` lookups keyed by `OpSchema` is a
significant minority of DTensor overhead. With this change we shave a
net ~1% off the total running time of the benchmark from #160580, as
measured by using cProfile and comparing cumulative time spent in
propagate + OpSchema's `__post_init__`. (`__post_init__` grew from
2.5% to 6.4% (+3.9%) and propagate shrank from 12.5% to 7.8% (-4.7%)).

Pull Request resolved: #161234
Approved by: https://github.com/wconstab
ghstack dependencies: #161231
pytorchmergebot pushed a commit that referenced this pull request Aug 25, 2025
`self is other` means the same thing as `id(self) == id(other)`, but it's one operator instead of 3.

Pull Request resolved: #161235
Approved by: https://github.com/wconstab, https://github.com/zpcore, https://github.com/fduwjj
ghstack dependencies: #161231, #161234
pytorchmergebot pushed a commit that referenced this pull request Aug 25, 2025
…161240)

get_write_alias() call count reduction explained briefly in code comment.

We don't need to check write_aliases against None in the final outs_to_return calculation because we just did that check.
Pull Request resolved: #161240
Approved by: https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235
pytorchmergebot pushed a commit that referenced this pull request Aug 25, 2025
…ly_alias_match (#161284)

Containers are truthy iff they're non-empty.
Pull Request resolved: #161284
Approved by: https://github.com/Skylion007, https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235, #161240
pytorchmergebot pushed a commit that referenced this pull request Aug 25, 2025
Drives down the overhead of return_and_correct_storage_aliasing slightly. Hopefully you'll agree it doesn't compromise readability.
Pull Request resolved: #161285
Approved by: https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235, #161240, #161284
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
`__eq__` didn't compare lists of DTensorSpec, but `__hash__` did (and
it looks like attention was paid to hash, so I made comparison follow
suit).

Pull Request resolved: pytorch#161231
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/zpcore
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…h__` and `__eq__` (pytorch#161234)

The performance cost of `dict` lookups keyed by `OpSchema` is a
significant minority of DTensor overhead. With this change we shave a
net ~1% off the total running time of the benchmark from pytorch#160580, as
measured by using cProfile and comparing cumulative time spent in
propagate + OpSchema's `__post_init__`. (`__post_init__` grew from
2.5% to 6.4% (+3.9%) and propagate shrank from 12.5% to 7.8% (-4.7%)).

Pull Request resolved: pytorch#161234
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#161231
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
`self is other` means the same thing as `id(self) == id(other)`, but it's one operator instead of 3.

Pull Request resolved: pytorch#161235
Approved by: https://github.com/wconstab, https://github.com/zpcore, https://github.com/fduwjj
ghstack dependencies: pytorch#161231, pytorch#161234
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…ytorch#161240)

get_write_alias() call count reduction explained briefly in code comment.

We don't need to check write_aliases against None in the final outs_to_return calculation because we just did that check.
Pull Request resolved: pytorch#161240
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#161231, pytorch#161234, pytorch#161235
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…#161285)

Drives down the overhead of return_and_correct_storage_aliasing slightly. Hopefully you'll agree it doesn't compromise readability.
Pull Request resolved: pytorch#161285
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#161231, pytorch#161234, pytorch#161235, pytorch#161240, pytorch#161284
@github-actions github-actions bot deleted the gh/swolchok/788/head branch September 25, 2025 02:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (dtensor) release notes category topic: bug fixes topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants