KEMBAR78
Fix resurrection logic to trigger early enough by albanD · Pull Request #137267 · pytorch/pytorch · GitHub
Skip to content

Conversation

@albanD
Copy link
Collaborator

@albanD albanD commented Oct 3, 2024

Stack from ghstack (oldest at bottom):

Fixes #136358

The bug here is that the Tensor object is actually 2 classes: Tensor from _tensor.py and TensorBase from c++.

Before this PR, they have the following gc methods:
Tensor:

  • tp_clear subtype_clear
  • tp_traverse THPVariable_subclass_traverse
  • tp_dealloc THPVariable_subclass_dealloc

TensorBase:

  • tp_clear THPVariable_clear
  • tp_traverse THPFunction_traverse (fake function that is just an error)
  • tp_dealloc object_dealloc

The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its __dict__ attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.

@albanD albanD requested a review from soulitzer as a code owner October 3, 2024 14:09
@albanD albanD mentioned this pull request Oct 3, 2024
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137267

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d8ac201 with merge base 88e338f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

albanD added a commit that referenced this pull request Oct 3, 2024
ghstack-source-id: 3bee9e1
Pull Request resolved: #137267
@albanD albanD added release notes: python_frontend python frontend release notes category topic: bug fixes topic category labels Oct 3, 2024
Fixes #136358

The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++.

Before this PR, they have the following gc methods:
Tensor:
 - tp_clear subtype_clear
 - tp_traverse THPVariable_subclass_traverse
 - tp_dealloc THPVariable_subclass_dealloc

TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc

The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.

[ghstack-poisoned]
albanD added a commit that referenced this pull request Oct 3, 2024
ghstack-source-id: 6c61053
Pull Request resolved: #137267
Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops! Glad it was simple

Copy link
Collaborator

@kshitij12345 kshitij12345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the quick fix @albanD

Fixes #136358

The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++.

Before this PR, they have the following gc methods:
Tensor:
 - tp_clear subtype_clear
 - tp_traverse THPVariable_subclass_traverse
 - tp_dealloc THPVariable_subclass_dealloc

TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc

The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.

[ghstack-poisoned]
albanD added a commit that referenced this pull request Oct 4, 2024
ghstack-source-id: 3cff200
Pull Request resolved: #137267
@albanD
Copy link
Collaborator Author

albanD commented Oct 4, 2024

@pytorchbot merge -r

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 4, 2024
@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Copy link
Collaborator

Successfully rebased gh/albanD/2/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/137267)

pytorchmergebot pushed a commit that referenced this pull request Oct 4, 2024
ghstack-source-id: 5222915
Pull Request resolved: #137267
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@albanD albanD added this to the 2.5.1 milestone Oct 18, 2024
@kit1980
Copy link
Contributor

kit1980 commented Oct 23, 2024

2.5.1 is an emergency patch release to address known large regressions, moving this to 2.6.0

@kit1980 kit1980 modified the milestones: 2.5.1, 2.6.0 Oct 23, 2024
@ezyang
Copy link
Contributor

ezyang commented Oct 24, 2024

@kit1980 if you're gonna move this fix to 2.6.0, you might as well say there's only one patch going in 2.5.1 lol. This one is clear, unambiguous and solves silent corruption (that happens rarely, but it is silent corruption!)

@kit1980
Copy link
Contributor

kit1980 commented Oct 24, 2024

@ezyang We were considering actually to do only one patch, but decided to include also couple of fixes for critical regressions where external users loudly complained after 2.5.0.
We want to release 2.5.1 next week, and creating new release candidate will delay it and increase risk.
We might do 2.5.2 later.

@kit1980
Copy link
Contributor

kit1980 commented Oct 24, 2024

Also, is this a regression introduced in 2.5.0?
If not, then it's not for a patch release.

@github-actions github-actions bot deleted the gh/albanD/2/head branch November 24, 2024 02:13
pytorchmergebot pushed a commit that referenced this pull request Dec 14, 2024
Fixes a bug introduced in #137267

While the test ensures the finalizer did run to make sure things are cleared, the objects are not properly collected by the gc due to the faulty tp_clear implementation. So, while the finalizer did run, the object was still alive.
Fixing this by giving tp_clear the same treatment as tp_traverse and tp_dealloc on Tensor: make it a unique function that handles the full subclass hierarchy in one place.

Pull Request resolved: #143203
Approved by: https://github.com/ezyang, https://github.com/colesbury
ghstack dependencies: #143202
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: python_frontend python frontend release notes category topic: bug fixes topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants