Implement reference counting for shared IPC CUDA tensors by VitalyFedyunin · Pull Request #16854 · pytorch/pytorch · GitHub

Conversation

@VitalyFedyunin
Contributor

This is to fix #16141 and similar issues.

The idea is to track a reference count for every shared CUDA Storage and to deallocate its memory only after the consumer process deallocates the received Storage.

@ezyang Done with the cleanup. Performance is the same as (insignificantly better than) the file-per-share solution, but this approach handles millions of shared tensors easily. Note [ ] documentation is in progress.
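For readers who want the gist before the Note lands: the scheme amounts to per-storage reference counting done on the producer side, with the counters apparently living in a shared-memory file that the consumer process can reach (cf. the handle/offset fields in the diff and the branch name). Below is a minimal, CPU-only sketch of that bookkeeping. All names are hypothetical stand-ins for the real CUDA IPC machinery; this is not the PR's code.

#include <cstdint>
#include <iostream>
#include <unordered_map>

struct SentBlock {                 // one record per storage sent over IPC
  int64_t refcount = 0;            // bumped on send, dropped when a consumer frees its copy
  void* data = nullptr;            // the CUDA allocation kept alive in the meantime
};

class SentBlocksRegistry {
 public:
  void on_send(void* data) {       // a storage was shared with another process
    auto& block = blocks_[data];
    block.data = data;
    ++block.refcount;
  }
  void on_consumer_release(void* data) {  // a consumer reported it freed its copy
    auto it = blocks_.find(data);
    if (it != blocks_.end() && --it->second.refcount == 0) {
      blocks_.erase(it);           // only now would the real code free the CUDA block
      std::cout << "freeing " << data << "\n";
    }
  }
 private:
  std::unordered_map<void*, SentBlock> blocks_;
};

int main() {
  SentBlocksRegistry registry;
  int dummy = 0;                           // stand-in for a CUDA allocation
  registry.on_send(&dummy);                // shared with consumer A
  registry.on_send(&dummy);                // shared with consumer B
  registry.on_consumer_release(&dummy);    // A is done; memory must stay alive for B
  registry.on_consumer_release(&dummy);    // B is done; now it is safe to free
  return 0;
}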

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Feb 8, 2019

What was the original performance problem, in the end?

@ezyang
Contributor

ezyang commented Feb 8, 2019

Test failures are real:

Feb 07 23:43:10 test_cuda (__main__.TestMultiprocessing) ... terminate called after throwing an instance of 'c10::Error'
Feb 07 23:43:10   what():  unable to open shared memory object </torch_5993_2546988397> in read-write mode (THMapAllocator at /var/lib/jenkins/workspace/aten/src/TH/THAllocator.cpp:234)
Feb 07 23:43:10 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7fbcfc569e2a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Feb 07 23:43:10 frame #1: THMapAllocator::THMapAllocator(WithFd, char const*, int, int, unsigned long) + 0x64e (0x7fbcfd22e78e in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #2: THRefcountedMapAllocator::THRefcountedMapAllocator(char const*, int, unsigned long) + 0x2f (0x7fbcfd22f55f in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #3: THRefcountedMapAllocator::makeDataPtr(char const*, int, unsigned long, unsigned long*) + 0x4b (0x7fbcfd22f5eb in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
Feb 07 23:43:10 frame #4: <unknown function> + 0x4d3b96 (0x7fbd08bd0b96 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Feb 07 23:43:10 frame #5: THCIpcDeleter::~THCIpcDeleter() + 0xa5 (0x7fbd00ab9085 in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
Feb 07 23:43:10 frame #6: deleteTHCIpcDeleter(void*) + 0xe (0x7fbd00ab90ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
Feb 07 23:43:10 frame #7: c10::TensorImpl::release_resources() + 0x61 (0x7fbcfc55ebc1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Feb 07 23:43:10 frame #8: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fbcfbc104be in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)

@ezyang
Contributor

ezyang commented Feb 8, 2019

I still don't see a Note explaining the general strategy in the PR ;) For example, the most critical information to add is under what circumstances collect gets run, since that affects the overall performance of this scheme. I shouldn't have to read the code to figure it out!

EDIT: Sorry, I didn't see your note about documentation being in progress :)

CudaIPCSentData(std::string handle, int64_t offset, int64_t* counter_ptr)
    : handle(handle), offset(offset), counter_ptr(counter_ptr) {}
~CudaIPCSentData();
int64_t get();
Contributor


I'd definitely appreciate a doc here
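For what it's worth, here is a rough reading of these fields, inferred from their names alone rather than from any documentation in the PR:

#include <cstdint>
#include <string>

// Interpretation inferred from the names in this hunk; not authoritative.
struct CudaIPCSentData {
  std::string handle;    // shared-memory handle of the file that holds the counters
  int64_t offset;        // where this storage's counter lives within that file
  int64_t* counter_ptr;  // mapped pointer to the counter; get() presumably reads its value
};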

@ezyang
Contributor

ezyang commented Feb 8, 2019

Needs tests

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin VitalyFedyunin force-pushed the cuda_ipc_deallocation_share_counters_file branch from 3ba803c to e146141 Compare February 14, 2019 18:46
@ezyang ezyang changed the title from "Implement reference counting for shared ICP CUDA tensors" to "Implement reference counting for shared IPC CUDA tensors" Feb 14, 2019
Contributor


Wondering if this shouldn't go in the torch/csrc/cuda folder. I'm not too familiar with how the build works here, but it seems worth looking into, or maybe asking @zdevito about it.

Contributor


This message could be even better if it offered some information about what this means and advice on how to remediate the situation. A link to more detailed docs is often good enough.


with leak_checker(self) as lc:
    for _ in range(repeat):
        do_test()
Contributor


This is almost assuredly failing lint

@ezyang
Contributor

ezyang commented Mar 21, 2019

OK, finished reviewing the new stuff. Note that you want to make the dev docs discoverable. The best way to do that is to cite them from the relevant code, so that when people are reading the code they know where to go to get the info. We use the convention Note [Blah blah] and See Note [Blah blah] for these cross-references.
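As an illustration of that convention (the note title and function name below are made up for this example, not taken from the PR):

// Note [CUDA IPC refcounting strategy]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// The long-form explanation lives here: what is tracked, where the
// counters are stored, and under what circumstances collection runs.

void collect_shared_blocks() {
  // See Note [CUDA IPC refcounting strategy]
}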

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor Author

@pytorchbot retest this please

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor Author

@pytorchbot retest this please

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 5653a91.

facebook-github-bot pushed a commit that referenced this pull request May 26, 2019
…ed (#19904)

Summary:
The multiprocessing notes were not updated after #16854 (the torch.multiprocessing page was).
Pull Request resolved: #19904

Differential Revision: D15509661

Pulled By: soumith

fbshipit-source-id: 7c11e14a6c804498dda3adbf19710e63e6a564a0
@neerajprad neerajprad mentioned this pull request Jul 12, 2019


Successfully merging this pull request may close these issues: [multiprocessing] does not play well with distributions in GPU
