-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Optimize increment summations [Latest Nov 15] #140822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140822
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 1 New FailureAs of commit a2b082e with merge base b740a1b ( NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D66008482 |
|
offline discussion todos:
|
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Test Plan: add tests add benchmark run tests Differential Revision: D66008482
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Test Plan: add tests add benchmark run tests Differential Revision: D66008482
2a716c4 to
9f05fc1
Compare
|
This pull request was exported from Phabricator. Differential Revision: D66008482 |
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Test Plan: add tests add benchmark run tests Differential Revision: D66008482
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Test Plan: add tests add benchmark run tests Differential Revision: D66008482
9f05fc1 to
57ffeb6
Compare
|
This pull request was exported from Phabricator. Differential Revision: D66008482 |
torch/fx/experimental/sym_node.py
Outdated
| self.expr, | ||
| other.expr, | ||
| getattr(self, "_optimized_summation", False), | ||
| getattr(other, "_optimized_summation", False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this in person, where you defended the ad hoc getattr/setattr because it was only set and accessed from two places. I think my preferred way of documenting this sort of situation is to have a # Note [blah blah blah] in one location describing the situation (probably the comment below), and then referencing that note consistently at both use sites.
The important thing to document, which is not directly documented at either of these sites, is what exactly the invariant specified by optimized summation is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also important to annotate the field on SymNode, if only to make sure no one accidentally clobbers it if they are adding their own one off field. I am also still not all that happy with bodging it in SymNode but I will refrain from blocking on it unless I can think of a good alternative (besides rewriting Add from scratch).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think I would even be happy with "this is a subclass of Add and is identical to Add in all respects except it respects the optimized summation invariant". This would probably have some annoying side effects in other parts of the code but from a layering perspective it's much cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tried the subclass it did not work, because we could get an add to co-live with custom add and they dont compare to be equal
| rhs._args[0] | ||
| ): | ||
| # (a0+a1) + (a2+a3) => (a0+a1+a2+a3) | ||
| return make_optimized(lhs._args + rhs._args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it cheap to test the other way too? (You have a cliff here if I accidentally swap the orders of the arguments to successive add here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| return RShift(a, b) | ||
|
|
||
|
|
||
| def _binary_search_insert_arg(ordered_args, new_arg): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert len(ordered_args) != 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks functionally correct, see also my comments on the other PR.
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Test Plan: add tests add benchmark run tests Reviewed By: ezyang Differential Revision: D66008482
57ffeb6 to
b6d7a6c
Compare
|
This pull request was exported from Phabricator. Differential Revision: D66008482 |
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Test Plan: add tests add benchmark run tests Reviewed By: ezyang Differential Revision: D66008482
b6d7a6c to
c5c0155
Compare
|
This pull request was exported from Phabricator. Differential Revision: D66008482 |
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Test Plan: add tests add benchmark run tests Reviewed By: ezyang Differential Revision: D66008482
c5c0155 to
28b1776
Compare
|
This pull request was exported from Phabricator. Differential Revision: D66008482 |
|
Address all comments |
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Test Plan: add tests add benchmark run tests Reviewed By: ezyang Differential Revision: D66008482
28b1776 to
a2b082e
Compare
|
This pull request was exported from Phabricator. Differential Revision: D66008482 |
|
@pytorchbot merge -f |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 1 checks: Lint / lintrunner-noclang / linux-job Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Differential Revision: D66008482 Pull Request resolved: pytorch#140822 Approved by: https://github.com/ezyang
Summary: **wins** on torchrec benchmark, for 2K nodes it save 40seconds with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on). ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200 ``` This diff optimizes construction expressions of the form a+b+c... (all unique symbols). which are very common in torchrec models. **How** Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them. If we have a+b+c and we are adding (d) to it, we can do a binary search to know the position of (d) and avoid optimizing the new expression by passing the new order. **Extensions**: 1. support constant terms. 2. support 10a+10b+.. (this will give even more wins will extend the support in second PR) Differential Revision: D66008482 Pull Request resolved: pytorch#140822 Approved by: https://github.com/ezyang
Summary:
wins
on torchrec benchmark, for 2K nodes it save 40seconds
with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 second ( with the max opt on).
This diff optimizes construction expressions of the form
a+b+c... (all unique symbols).
which are very common in torchrec models.
How
Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them.
If we have a+b+c and we are adding (d) to it, we can do a binary search to know
the position of (d) and avoid optimizing the new expression by passing the new order.
Extensions:
Differential Revision: D66008482
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @SherlockNoMad @EikanWang @wenzhe-nrv @voznesenskym @penguinwu @Guobing-Chen @zhuhaozhe @blzheng @jiayisunx @chenyang78 @kadeng @chauhang @amjames