Optimize increment summations [Latest Nov 15] #140822

laithsakka · 2024-11-15T16:27:36Z

Summary:
Wins
On the torchrec benchmark, for 2K nodes, this saves 40 seconds.
With the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 seconds (with the max opt on).

buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200

This diff optimizes the construction of expressions of the form
a+b+c... (all unique symbols),
which are very common in torchrec models.

How
Expressions of the form a+b+c are not simplified any further by Add; the only work needed is sorting the terms.
If we have a+b+c and we are adding (d) to it, we can do a binary search to find
the position of (d) and skip optimizing the new expression by passing the new order directly.
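
A minimal sketch of the binary-search insertion described above, using plain strings in place of sympy symbols. The function name `insert_sorted_unique` and the use of Python's default string ordering are assumptions for illustration only; the helper and sort key used in the PR may differ.

```python
from bisect import bisect_left


def insert_sorted_unique(ordered_terms, new_term):
    """Insert new_term into an already sorted tuple of unique terms.

    Returns the new sorted tuple, or None when new_term is already present.
    In that case the sum is no longer made of unique symbols, and a caller
    would have to fall back to the general construction path.
    """
    pos = bisect_left(ordered_terms, new_term)  # O(log n) comparisons
    if pos < len(ordered_terms) and ordered_terms[pos] == new_term:
        return None
    return ordered_terms[:pos] + (new_term,) + ordered_terms[pos:]


print(insert_sorted_unique(("a", "b", "d"), "c"))  # ('a', 'b', 'c', 'd')
print(insert_sorted_unique(("a", "b", "d"), "b"))  # None
```

The point of the sketch is that the existing order is reused: only the position of the new term is searched for, rather than re-sorting every term on each addition.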

Extensions:

1. support constant terms.
2. support 10a+10b+.. (this will give even more wins; support will be extended in a second PR)

Differential Revision: D66008482

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @SherlockNoMad @EikanWang @wenzhe-nrv @voznesenskym @penguinwu @Guobing-Chen @zhuhaozhe @blzheng @jiayisunx @chenyang78 @kadeng @chauhang @amjames

pytorch-bot · 2024-11-15T16:27:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140822

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[DomainsOnly] Jobs fail with GLIBC version not found

❌ 1 New Failure

As of commit a2b082e with merge base b740a1b:

NEW FAILURE - The following job has failed:

Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for test/test_nestedtensor.py:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-11-15T16:28:00Z

This pull request was exported from Phabricator. Differential Revision: D66008482

laithsakka · 2024-11-15T16:51:07Z

Offline discussion TODOs:

Document why we are using getattr(self, "_optimized_summation", False) on the SymNode.
Add a micro benchmark.

Test Plan: add tests, add benchmark, run tests.

facebook-github-bot · 2024-11-15T17:31:42Z

This pull request was exported from Phabricator. Differential Revision: D66008482

facebook-github-bot · 2024-11-15T17:34:29Z

This pull request was exported from Phabricator. Differential Revision: D66008482

ezyang · 2024-11-17T03:06:25Z

torch/fx/experimental/sym_node.py

+                    self.expr,
+                    other.expr,
+                    getattr(self, "_optimized_summation", False),
+                    getattr(other, "_optimized_summation", False),


We discussed this in person, where you defended the ad hoc getattr/setattr because it was only set and accessed from two places. I think my preferred way of documenting this sort of situation is to have a # Note [blah blah blah] in one location describing the situation (probably the comment below), and then referencing that note consistently at both use sites.

The important thing to document, which is not directly documented at either of these sites, is what exactly the invariant specified by optimized summation is.

It's also important to annotate the field on SymNode, if only to make sure no one accidentally clobbers it if they are adding their own one-off field. I am also still not all that happy with bodging it into SymNode, but I will refrain from blocking on it unless I can think of a good alternative (besides rewriting Add from scratch).

Yeah, I think I would even be happy with "this is a subclass of Add and is identical to Add in all respects except it respects the optimized summation invariant". This would probably have some annoying side effects in other parts of the code but from a layering perspective it's much cleaner.

I tried the subclass, but it did not work: a regular Add can end up coexisting with the custom Add, and the two do not compare equal.
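
The "Note plus annotated field" approach suggested earlier in this thread could look roughly like the sketch below. Everything here is hypothetical: the class `Node` stands in for SymNode, the note title is made up, and the wording of the invariant is inferred from the PR summary rather than taken from the merged code.

```python
# Note [optimized summation flag]  (hypothetical note title)
# When the flag is True, the node's expression is assumed to be a sum of
# unique symbols whose arguments are already stored in sorted order, so a
# new symbol can be inserted without re-optimizing the whole expression.
# This wording is an assumption based on the PR summary.


class Node:
    """Stand-in for SymNode, used only to illustrate the convention."""

    # See Note [optimized summation flag]
    # Declared on the class so the field is visible to anyone adding
    # their own attribute, and so plain attribute access always works.
    _optimized_summation: bool = False

    def __init__(self, expr, optimized_summation=False):
        self.expr = expr
        # See Note [optimized summation flag]
        self._optimized_summation = optimized_summation


plain = Node("x*y")
summed = Node("a + b + c", optimized_summation=True)
print(plain._optimized_summation, summed._optimized_summation)  # False True
```

With a declared default, the read sites would no longer need the `getattr(..., False)` fallback, and both the set site and the read site can point at the same note.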

ezyang · 2024-11-17T03:16:46Z

torch/fx/experimental/sym_node.py

+            rhs._args[0]
+        ):
+            # (a0+a1) + (a2+a3) => (a0+a1+a2+a3)
+            return make_optimized(lhs._args + rhs._args)


Is it cheap to test the other way too? (You have a cliff here if I accidentally swap the orders of the arguments to successive add here)
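
A small sketch of what checking both directions could look like, again with strings standing in for symbols. The name `concat_if_ordered` is hypothetical, and returning None to signal "use the general path" is an assumption, not the PR's actual behavior.

```python
def concat_if_ordered(lhs, rhs):
    """Join two non-empty sorted tuples of unique terms without re-sorting.

    Works when every term of one side sorts strictly before every term of
    the other side, in either argument order. Returns None when the two
    ranges interleave and a general path would have to take over.
    """
    assert len(lhs) != 0 and len(rhs) != 0
    if lhs[-1] < rhs[0]:
        # (a0+a1) + (a2+a3) => (a0+a1+a2+a3)
        return lhs + rhs
    if rhs[-1] < lhs[0]:
        # (a2+a3) + (a0+a1) => (a0+a1+a2+a3)
        return rhs + lhs
    return None


print(concat_if_ordered(("a0", "a1"), ("a2", "a3")))  # ('a0', 'a1', 'a2', 'a3')
print(concat_if_ordered(("a2", "a3"), ("a0", "a1")))  # ('a0', 'a1', 'a2', 'a3')
print(concat_if_ordered(("a0", "a2"), ("a1", "a3")))  # None
```

The second comparison costs one extra check and removes the performance cliff when the operands arrive in swapped order.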

ezyang · 2024-11-17T03:20:48Z

torch/fx/experimental/sym_node.py

    return RShift(a, b)


+def _binary_search_insert_arg(ordered_args, new_arg):


assert len(ordered_args) != 0

ezyang

This looks functionally correct, see also my comments on the other PR.

facebook-github-bot · 2024-11-19T06:20:58Z

This pull request was exported from Phabricator. Differential Revision: D66008482

facebook-github-bot · 2024-11-19T22:22:34Z

This pull request was exported from Phabricator. Differential Revision: D66008482

facebook-github-bot · 2024-11-19T22:33:09Z

This pull request was exported from Phabricator. Differential Revision: D66008482

laithsakka · 2024-11-19T23:38:57Z

Address all comments

facebook-github-bot · 2024-11-20T00:40:32Z

This pull request was exported from Phabricator. Differential Revision: D66008482

laithsakka · 2024-11-20T16:39:38Z

@pytorchbot merge -f

pytorch-bot · 2024-11-20T16:39:41Z

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Try @pytorchbot --help for more info.

laithsakka · 2024-11-20T16:40:42Z

@pytorchbot merge -i

pytorchmergebot · 2024-11-20T16:42:28Z

Merge started

Your change will be merged while ignoring the following 1 checks: Lint / lintrunner-noclang / linux-job

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Pull Request resolved: pytorch#140822
Approved by: https://github.com/ezyang

pytorch-bot bot added ciflow/inductor module: cpu CPU specific problem (e.g., perf, algorithm) release notes: fx release notes category labels Nov 15, 2024

facebook-github-bot added the fx label Nov 15, 2024

facebook-github-bot added the fb-exported label Nov 15, 2024

laithsakka force-pushed the export-D66008482 branch from 2a716c4 to 9f05fc1 Compare November 15, 2024 17:31

pytorch-bot bot added the module: dynamo label Nov 15, 2024

laithsakka force-pushed the export-D66008482 branch from 9f05fc1 to 57ffeb6 Compare November 15, 2024 17:34

ezyang reviewed Nov 17, 2024

ezyang reviewed Nov 17, 2024

ezyang reviewed Nov 17, 2024

ezyang reviewed Nov 17, 2024

ezyang approved these changes Nov 17, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 17, 2024

laithsakka force-pushed the export-D66008482 branch from 57ffeb6 to b6d7a6c Compare November 19, 2024 06:20

laithsakka force-pushed the export-D66008482 branch from b6d7a6c to c5c0155 Compare November 19, 2024 22:22

laithsakka force-pushed the export-D66008482 branch from c5c0155 to 28b1776 Compare November 19, 2024 22:32

laithsakka force-pushed the export-D66008482 branch from 28b1776 to a2b082e Compare November 20, 2024 00:40

pytorchmergebot added the merging label Nov 20, 2024

pytorchmergebot added the Merged label Nov 20, 2024

pytorchmergebot closed this in 8d70809 Nov 20, 2024

pytorchmergebot removed the merging label Nov 20, 2024
