KEMBAR78
[dynamo] Properly prune dead cell local variables by StrongerXi · Pull Request #136891 · pytorch/pytorch · GitHub
Skip to content

Conversation

@StrongerXi
Copy link
Contributor

@StrongerXi StrongerXi commented Sep 27, 2024

This patch updates the prune_dead_locals logic to do slightly more aggressive pruning for cell local variables, in absence of side-effects, e.g., a cell variable can be pruned when its user function(s) will never be used again.

See added tests for examples; note that a few tests in test/dynamo/test_higher_order_ops.py also got updated because we are no longer returning the unnecessary graph output.

Fixes #127350, #124653

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136891

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 42de084 with merge base 9a957e2 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Sep 27, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: StrongerXi / name: Xiangxi Guo (Ryan) (42de084)

@StrongerXi StrongerXi requested review from anijain2305, jansel and williamwen42 and removed request for williamwen42 September 27, 2024 19:45
@StrongerXi StrongerXi linked an issue Sep 27, 2024 that may be closed by this pull request
@StrongerXi
Copy link
Contributor Author

/easycla

@StrongerXi StrongerXi added the topic: not user facing topic category label Sep 27, 2024
Comment on lines 823 to 830
# 2. In the fast path of `OutputGraph::compile_subgraph`, we populate
# the tuple of cells _after_ emitting the `MAKE FUNCTION` bytecode,
# via `STORE DEREF`; these `STORE DEREF` are generated partly based
# on the current `symbolic_locals`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about another approach where we emit STORE_DEREF as part of NestedUserFunctionVariable::reconstruct, and get rid of the special handling of cellvars here; I might look into that more if I can't fix the existing test failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to think there might be a better solution with this approach, by replacing uses of ClosureVariable with NewCellVariable. It feels related to #136814 as well.

@StrongerXi StrongerXi force-pushed the fix-closure-var branch 2 times, most recently from 37a7f44 to d75022e Compare September 27, 2024 23:15
@StrongerXi
Copy link
Contributor Author

Fix side-effect related issues -- Dynamo must treat captured variables as live (therefore generate bytecode to update these cells), if any of their user functions escapes the lifetime of the function that defined the captured variables.

@StrongerXi
Copy link
Contributor Author

Run SideEffects::prune_dead_object_new first, which might enable more cellvar pruning. This undoes the limiting effect from the last push.

Copy link
Contributor

@jansel jansel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good change, but tests are failing

Comment on lines +884 to +913
# Manually release the self-referential nested function, which
# captures `self.symbolic_locals` and affects parts of PT test/logic
# that are sensitive to when certain objects get released.
del visit
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attempt to fix failing tests; this is somewhat surprising, or maybe expected?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is surprising to me - which tests are failing because of this? The memleak tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. These ones, I assume self.symbolic_locals had a reference to x, so before we run the full gc.collect(), x won't be freed (tests do pass if we add a gc.collect() after del x).
    https://github.com/pytorch/pytorch/blob/54e2b2a6d4f6b6d27637b941f35f254ef42edbad/test/dynamo/test_misc.py#L4753-L4797

  2. Somewhat similar to the above, but more behind-the-scene logic
    https://github.com/pytorch/pytorch/blob/54e2b2a6d4f6b6d27637b941f35f254ef42edbad/test/dynamo/test_misc.py#L10756-L10766

  3. This one, which can't seem to be fixed by adding a gc.collect() in the test itself.
    https://github.com/pytorch/pytorch/blob/54e2b2a6d4f6b6d27637b941f35f254ef42edbad/test/test_autograd.py#L5464-L5465

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is the reference cycle being caused here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite subtle:

Inside this frame, visit is a cell variable, because visit as a nested function references and captures itself. So you have visit.__closure__[0] being a cell that references visit. Example:

def foo():
    def bar():
        return bar()
    print(bar.__closure__[0].cell_contents is bar)
foo() # prints True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is annoying but I don't have any ideas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you able to try any of the other methods that we discussed on Monday to prevent the refcycle without using del?

Comment on lines +877 to +901
# NOTE: Don't trace from the cell locals which aren't explicitly
# read anymore; if they are indirectly used, they will be reached by
# other roots. These initially excluded cells are the ones that will
# hopefully be pruned.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another fix.

@StrongerXi
Copy link
Contributor Author

Rebase to fix a inductor failure which seems unrelated to my change.

@StrongerXi
Copy link
Contributor Author

@jansel I just reproduced the failed test inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_pointwise_op_fusion on my gpu devserver, on latest main c07ebaf, I think it's not related to my patch? It also looks like a inductor-only test.

@StrongerXi StrongerXi requested a review from jansel September 30, 2024 22:55
@zou3519
Copy link
Contributor

zou3519 commented Oct 1, 2024

@StrongerXi if the test isn't related -- try to rebase? Otherwise you'll need to force merge this when it's been approved


# Returning `func` forces dynamo to output `x` in the compiled graph, so
# that we can store it as `func`'s closure.
return func, func()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the outputs are (x, x + x) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm (func, x + x) I'd say?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I meant, what are the 2 outputs in the compiled graph?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yep. Lemme document that.

@facebook-github-bot
Copy link
Contributor

@StrongerXi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@StrongerXi
Copy link
Contributor Author

Chatted with @yanboliang, I'll update the test, which itself isn't super important. The main concern here is this patch breaks brittle patterns other parts of the stack rely on, and their effect on our internal workloads. I'll also hold off "fixing" the fusion pass, for the same reason.

Another 1 line fix in test/inductor/test_group_batch_fusion.py, for similar reason as above.

@facebook-github-bot
Copy link
Contributor

@StrongerXi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@StrongerXi
Copy link
Contributor Author

All green, but I'm still waiting for some internal experiments to guard against potential regression.

@StrongerXi
Copy link
Contributor Author

Need to investigate compile-time increase and qps drop for internal workflow.

StrongerXi added a commit that referenced this pull request Oct 7, 2024
@StrongerXi
Copy link
Contributor Author

Still investigating. Update documentation about why we skip InlinedClosureVariable in VariableTracker visiting.

@StrongerXi
Copy link
Contributor Author

Internal workflow passed (previous results turned out to be noise). Remove inaccurate documentation on why we skip InlinedClosureVariable in search for must-keep-alive cells, now that #137510 is about to land.

@facebook-github-bot
Copy link
Contributor

@StrongerXi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

This patch updates the `prune_dead_locals` logic to do slightly more
aggressive pruning for cell local variables, in absence of side-effects,
e.g., a cell variable can be pruned when its user function(s) will never
be used again.

See added tests for examples; note that a few tests in
`test/dynamo/test_higher_order_ops.py` also got updated because we are
no longer returning the unnecessary graph output.

Fixes pytorch#127350, pytorch#124653
@StrongerXi
Copy link
Contributor Author

Chatted with @yanboliang, I'll update the test, which itself isn't super important. The main concern here is this patch breaks brittle patterns other parts of the stack rely on, and their effect on our internal workloads. I'll also hold off "fixing" the fusion pass, for the same reason.

Another 1 line fix in test/inductor/test_group_batch_fusion.py, for similar reason as above.

Another 1 line fix of this sort, in a test that requires fbgemm.

@facebook-github-bot
Copy link
Contributor

@StrongerXi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@StrongerXi
Copy link
Contributor Author

All green externally and internally. Merging.

1 similar comment
@StrongerXi
Copy link
Contributor Author

All green externally and internally. Merging.

@StrongerXi
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 10, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dynamo should prune non-live captured variables dynamo creates unnecessary buffers

8 participants