KEMBAR78
Fix python gc race condition with THPVariable_traverse by zou3519 · Pull Request #4437 · pytorch/pytorch · GitHub
Skip to content

Conversation

@zou3519
Copy link
Contributor

@zou3519 zou3519 commented Jan 2, 2018

Fixes #3883: asserts being triggered in DataParallel for long training regiments like imagenet.

The bug is that a python Function in the computation graph gets garbage collected due to a C++ shared pointer to the C++ Function it wraps changing use_count() from 1 to more than 1 during python garbage collection. This happens because the "python state" of Variable changes: the THPVariable_traverse method indicates that when the shared pointer's use_count() is 1, the python Variable holds a python reference to the python Function, but not when use_count() > 1. See this gist for a more detailed exploration of the diagnosis.

The bug could be fixed by treating the shared pointer (a Variable's grad_fn) as part of the python state and intelligently grabbing the GIL when necessary. The downside to this approach is the complexity and potential overhead.

Alternatively, after talking with @colesbury, we think it might be better to remove the checking of the shared pointer's use_count() in THPVariable_traverse. This effectively disables reference cycle garbage collection for python Functions.

Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment next to the deleted C++ code that links to your gist and maybe says a few words about why we don't want to traverse to the grad_fn. Also, mention that this in fact leaks when hooks are involved in a cycle

@apaszke apaszke merged commit 2060f35 into pytorch:master Jan 2, 2018
@zou3519 zou3519 deleted the fix-race branch January 3, 2018 19:50
@soumith soumith mentioned this pull request Jan 23, 2018
16 tasks
@soumith soumith added the 0.3.1 label Feb 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Assertion pos >= 0 && pos < buffer.size() failed

3 participants