Implement reference counting for shared IPC CUDA tensors by VitalyFedyunin · Pull Request #16854 · pytorch/pytorch · GitHub

Conversation

@VitalyFedyunin
Contributor

This is to fix #16141 and similar issues.

The idea is to track a reference count for every shared CUDA Storage and to deallocate its memory only after the consumer process deallocates the received Storage.

@ezyang Done with the cleanup. Performance is the same as (insignificantly better than) the file-per-share solution, but this approach handles millions of shared tensors easily. Note [ ] documentation is in progress.
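For readers who want the gist before the Note lands: the scheme amounts to per-storage reference counting done on the producer side, with the counters apparently living in a shared-memory file that the consumer process can reach (cf. the handle/offset fields in the diff and the branch name). Below is a minimal, CPU-only sketch of that bookkeeping. All names are hypothetical stand-ins for the real CUDA IPC machinery; this is not the PR's code.

#include <cstdint>
#include <iostream>
#include <unordered_map>

struct SentBlock {                 // one record per storage sent over IPC
  int64_t refcount = 0;            // bumped on send, dropped when a consumer frees its copy
  void* data = nullptr;            // the CUDA allocation kept alive in the meantime
};

class SentBlocksRegistry {
 public:
  void on_send(void* data) {       // a storage was shared with another process
    auto& block = blocks_[data];
    block.data = data;
    ++block.refcount;
  }
  void on_consumer_release(void* data) {  // a consumer reported it freed its copy
    auto it = blocks_.find(data);
    if (it != blocks_.end() && --it->second.refcount == 0) {
      blocks_.erase(it);           // only now would the real code free the CUDA block
      std::cout << "freeing " << data << "\n";
    }
  }
 private:
  std::unordered_map<void*, SentBlock> blocks_;
};

int main() {
  SentBlocksRegistry registry;
  int dummy = 0;                           // stand-in for a CUDA allocation
  registry.on_send(&dummy);                // shared with consumer A
  registry.on_send(&dummy);                // shared with consumer B
  registry.on_consumer_release(&dummy);    // A is done; memory must stay alive for B
  registry.on_consumer_release(&dummy);    // B is done; now it is safe to free
  return 0;
}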

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Feb 8, 2019

What was the original performance problem, in the end?

@ezyang
Contributor

ezyang commented Feb 8, 2019

Test failures are real:

Feb 07 23:43:10 test_cuda (__main__.TestMultiprocessing) ... terminate called after throwing an instance of 'c10::Error'
Feb 07 23:43:10   what():  unable to open shared memory object </torch_5993_2546988397> in read-write mode (THMapAllocator at /var/lib/jenkins/workspace/aten/src/TH/THAllocator.cpp:234)
Feb 07 23:43:10 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7fbcfc569e2a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Feb 07 23:43:10 frame #1: THMapAllocator::THMapAllocator(WithFd, char const*, int, int, unsigned long) + 0x64e (0x7fbcfd22e78e in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #2: THRefcountedMapAllocator::THRefcountedMapAllocator(char const*, int, unsigned long) + 0x2f (0x7fbcfd22f55f in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #3: THRefcountedMapAllocator::makeDataPtr(char const*, int, unsigned long, unsigned long*) + 0x4b (0x7fbcfd22f5eb in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #4: <unknown function> + 0x4d3b96 (0x7fbd08bd0b96 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Feb 07 23:43:10 frame #5: THCIpcDeleter::~THCIpcDeleter() + 0xa5 (0x7fbd00ab9085 in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
Feb 07 23:43:10 frame #6: deleteTHCIpcDeleter(void*) + 0xe (0x7fbd00ab90ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
Feb 07 23:43:10 frame #7: c10::TensorImpl::release_resources() + 0x61 (0x7fbcfc55ebc1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Feb 07 23:43:10 frame #8: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fbcfbc104be in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

@ezyang
Contributor

ezyang commented Feb 8, 2019

I still don't see a Note explaining the general strategy in the PR ;) For example, the most critical information to add is under what circumstances collect gets run, since that affects the overall performance of this scheme. I shouldn't have to read the code to figure it out!

EDIT: Sorry, I didn't see your note about documentation being in progress :)

CudaIPCSentData(std::string handle, int64_t offset, int64_t* counter_ptr)
    : handle(handle), offset(offset), counter_ptr(counter_ptr) {}
~CudaIPCSentData();
int64_t get();
Contributor


I'd definitely appreciate a doc here
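For what it's worth, here is a rough reading of these fields, inferred from their names alone rather than from any documentation in the PR:

#include <cstdint>
#include <string>

// Interpretation inferred from the names in this hunk; not authoritative.
struct CudaIPCSentData {
  std::string handle;    // shared-memory handle of the file that holds the counters
  int64_t offset;        // where this storage's counter lives within that file
  int64_t* counter_ptr;  // mapped pointer to the counter; get() presumably reads its value
};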

@ezyang
Contributor

ezyang commented Feb 8, 2019

Needs tests

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin VitalyFedyunin force-pushed the cuda_ipc_deallocation_share_counters_file branch from 3ba803c to e146141 Compare February 14, 2019 18:46
@ezyang ezyang changed the title from "Implement reference counting for shared ICP CUDA tensors" to "Implement reference counting for shared IPC CUDA tensors" Feb 14, 2019
Contributor


Wondering if this shouldn't go in the torch/csrc/cuda folder. I'm not too familiar with how the build works here, but it seems worth looking into, or maybe asking @zdevito about it.

Contributor


This message could be even better if it offered some information about what this means and advice on how to remediate the situation. A link to more detailed docs is often good enough.


with leak_checker(self) as lc:
    for _ in range(repeat):
        do_test()
Contributor


This is almost assuredly failing lint

@ezyang
Contributor

ezyang commented Mar 21, 2019

OK, finished reviewing the new stuff. Note that you want to make the dev docs discoverable. The best way to do that is to cite them from the relevant code, so that when people are reading the code they know where to go to get the info. We use the convention Note [Blah blah] and See Note [Blah blah] for these cross-references.
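As an illustration of that convention (the note title and function name below are made up for this example, not taken from the PR):

// Note [CUDA IPC refcounting strategy]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// The long-form explanation lives here: what is tracked, where the
// counters are stored, and under what circumstances collection runs.

void collect_shared_blocks() {
  // See Note [CUDA IPC refcounting strategy]
}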

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor Author

@pytorchbot retest this please

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor Author

@pytorchbot retest this please

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 5653a91.

facebook-github-bot pushed a commit that referenced this pull request May 26, 2019
…ed (#19904)

Summary:
The multiprocessing notes were not updated after #16854 (the torch.multiprocessing page was).
Pull Request resolved: #19904

Differential Revision: D15509661

Pulled By: soumith

fbshipit-source-id: 7c11e14a6c804498dda3adbf19710e63e6a564a0
@neerajprad neerajprad mentioned this pull request Jul 12, 2019


Successfully merging this pull request may close these issues: [multiprocessing] does not play well with distributions in GPU
