-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Fix resurrection logic to trigger early enough #137267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137267
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit d8ac201 with merge base 88e338f ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Fixes #136358 The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++. Before this PR, they have the following gc methods: Tensor: - tp_clear subtype_clear - tp_traverse THPVariable_subclass_traverse - tp_dealloc THPVariable_subclass_dealloc TensorBase: - tp_clear THPVariable_clear - tp_traverse THPFunction_traverse (fake function that is just an error) - tp_dealloc object_dealloc The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it. [ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops! Glad it was simple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the quick fix @albanD
Fixes #136358 The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++. Before this PR, they have the following gc methods: Tensor: - tp_clear subtype_clear - tp_traverse THPVariable_subclass_traverse - tp_dealloc THPVariable_subclass_dealloc TensorBase: - tp_clear THPVariable_clear - tp_traverse THPFunction_traverse (fake function that is just an error) - tp_dealloc object_dealloc The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it. [ghstack-poisoned]
|
@pytorchbot merge -r |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
2.5.1 is an emergency patch release to address known large regressions, moving this to 2.6.0 |
|
@kit1980 if you're gonna move this fix to 2.6.0, you might as well say there's only one patch going in 2.5.1 lol. This one is clear, unambiguous and solves silent corruption (that happens rarely, but it is silent corruption!) |
|
@ezyang We were considering actually to do only one patch, but decided to include also couple of fixes for critical regressions where external users loudly complained after 2.5.0. |
|
Also, is this a regression introduced in 2.5.0? |
Fixes a bug introduced in #137267 While the test ensures the finalizer did run to make sure things are cleared, the objects are not properly collected by the gc due to the faulty tp_clear implementation. So, while the finalizer did run, the object was still alive. Fixing this by giving tp_clear the same treatment as tp_traverse and tp_dealloc on Tensor: make it a unique function that handles the full subclass hierarchy in one place. Pull Request resolved: #143203 Approved by: https://github.com/ezyang, https://github.com/colesbury ghstack dependencies: #143202
Stack from ghstack (oldest at bottom):
Fixes #136358
The bug here is that the Tensor object is actually 2 classes:
Tensorfrom_tensor.pyandTensorBasefrom c++.Before this PR, they have the following gc methods:
Tensor:
TensorBase:
The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its
__dict__attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.