KEMBAR78
Fix compare_exchange_weak usage in weak_intrusive_ptr by mtbrandy · Pull Request #16302 · pytorch/pytorch · GitHub
Skip to content

Conversation

@mtbrandy
Copy link
Contributor

In the case of spurious failure, refcount is not incremented -- which leads to underflow once all references are released.

This was discovered when exercising multiprocessing on ppc64le.

@mtbrandy
Copy link
Contributor Author

The following test recreates the issue on ppc64le:

import torch

q = torch.multiprocessing.Queue()
while True:
    t = torch.ones(1)
    q.put(t)
    q.get()

It yields the following error:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/pytorch-ref-nodebug/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/anaconda3/envs/pytorch-ref-nodebug/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/anaconda3/envs/pytorch-ref-nodebug/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 304, in reduce_storage                                                                                                                           
    shared_cache[cache_key] = StorageWeakRef(storage)
  File "/opt/anaconda3/envs/pytorch-ref-nodebug/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 51, in __setitem__                                                                                                                               
    self.free_dead_references()
  File "/opt/anaconda3/envs/pytorch-ref-nodebug/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 59, in free_dead_references                                                                                                                      
    if storage_ref.expired():
  File "/opt/anaconda3/envs/pytorch-ref-nodebug/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 33, in expired                                                                                                                                   
    return torch.Storage._expired(self.cdata)
RuntimeError: owning_weak_ptr == NullType::singleton() || owning_weak_ptr->weakcount_.load() > 1 || (owning_weak_ptr->refcount_.load() == 0 && owning_weak_ptr->weakcount_.load() > 0) ASSERT FAILED at /opt/anaconda2/conda-bld/pytorch_1547774768751/work/c10/util/intrusive_ptr.h:605, please report a bug to PyTorch. weak_intrusive_ptr: Can only weak_intrusive_ptr::reclaim() owning pointers that were created using weak_intrusive_ptr::release().     

@mtbrandy
Copy link
Contributor Author

mtbrandy commented Jan 24, 2019

@smessmer please review

Note that this issue exists in v1.0.0 as well.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@smessmer
Copy link
Contributor

Fix looks good, thanks. Can we add a test case for this as well?

@smessmer
Copy link
Contributor

oh nvm, I actually don't think we can write a reliable unit test for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants