- OS: Ubuntu 16.04
- PyTorch version: e58a53a
- How you installed PyTorch (conda, pip, source): source
- Python version: 3.6.4
- CUDA/cuDNN version: 9.0 / 7.0.3
- GPU models and configuration: V100
- GCC version (if compiling from source): 5.4
There seems to have been a regression in master relative to PyTorch 0.3.0. You need a multi-GPU machine to reproduce this problem. Please have two windows open.
Window 1

Run `nvidia-smi -l 5`

Window 2

Run the following Python script, stepping through it with pdb:
```python
import torch
import pdb

gpu = 1
pdb.set_trace()
with torch.cuda.device(gpu):
    pdb.set_trace()
```
Here is what happens:

- Hit the first `set_trace` - no context created on any GPU
- Hit the second `set_trace` - context is created on GPU 0 (instead of GPU 1)
This is due to how initialization happens inside `torch/cuda/__init__.py`:
- First hit is `__enter__` for `class device`, which calls `_lazy_init()`
- Since this is the first call, `_lazy_init()` will actually do something. The problem starts at `torch._C._cuda_init()`
- This call goes to `THCPModule_initExtension` in `torch/csrc/cuda/Module.cpp`
- This will call `THCPModule_initCuda`, which in turn calls `state = at::globalContext().lazyInitCUDA();`
- Inside ATen, contexts will eventually get created on GPU 0, since it assumes that `cudaSetDevice` has already been called
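For reference, here is a condensed paraphrase of the Python side of that code path (not verbatim; bodies are trimmed to the parts that matter here):

```python
# Condensed paraphrase of torch/cuda/__init__.py (not verbatim)
import torch

_initialized = False

class device(object):
    def __init__(self, idx):
        self.idx = idx
        self.prev_idx = -1

    def __enter__(self):
        if self.idx == -1:
            return
        _lazy_init()  # runs BEFORE the device switch below
        self.prev_idx = torch._C._cuda_getDevice()
        if self.prev_idx != self.idx:
            torch._C._cuda_setDevice(self.idx)

def _lazy_init():
    global _initialized
    if _initialized:
        return
    # Lands in THCPModule_initExtension, which initializes on whatever
    # device is current at this point -- i.e. GPU 0.
    torch._C._cuda_init()
    _initialized = True
```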
In fact, any function in torch.cuda that calls `_lazy_init()` will create a context on GPU 0 the first time it is called, irrespective of what the user asked for. This is not harmful in a single-GPU setting, but in a multi-GPU setting, as is the case with our cluster, different users get different GPUs, and everybody ends up with these contexts on GPU 0, no matter what. And contexts take up a fair bit of memory.
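To illustrate (a sketch based on the behavior described above; `current_device()` is just one example of a `_lazy_init()` caller):

```python
import torch

# Either of these, as the first CUDA call in the process, triggers
# _lazy_init() and therefore leaves a context on GPU 0:
torch.cuda.current_device()   # only asks a question, still initializes on GPU 0

x = torch.randn(10).cuda(1)   # tensor lives on GPU 1, yet nvidia-smi also
                              # shows this process holding memory on GPU 0
```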
I think the right thing to do is to have a `_lazy_init(gpuId)` which in turn passes the device down into `THCPModule_initExtension` to do the right thing.
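Something along these lines (just a sketch; the `device_id` parameter and the call ordering are my assumptions, not an existing API):

```python
import torch

_initialized = False

def _lazy_init(device_id=None):
    global _initialized
    if _initialized:
        return
    if device_id is not None:
        # Hypothetical: make the requested device current BEFORE the global
        # init runs, so the context is created there instead of on GPU 0.
        torch._C._cuda_setDevice(device_id)
    torch._C._cuda_init()
    _initialized = True
```

`device.__enter__` would then call `_lazy_init(self.idx)` instead of the bare `_lazy_init()`.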
Quite interestingly, `set_device()` does not call `_lazy_init()`, which also looks like a bug to me, since `_lazy_init()` seems to do a lot of initialization.
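Paraphrasing the current `set_device()` next to what I would expect (again a sketch, assuming the `_lazy_init(device_id)` change above):

```python
# Current behavior (paraphrased): switches the device, never initializes
def set_device(device):
    if device >= 0:
        torch._C._cuda_setDevice(device)

# What I would expect instead (hypothetical)
def set_device(device):
    if device >= 0:
        _lazy_init(device)   # initialize on the GPU the user asked for
        torch._C._cuda_setDevice(device)
```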
This works just fine in PyTorch 0.3.0, by the way, where `with torch.cuda.device(gpu)` will only create contexts on the GPU you ask for.
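As a stopgap on shared machines, restricting device visibility avoids the stray GPU 0 context (standard CUDA environment variable, not specific to PyTorch):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must be set before the first CUDA call
import torch

with torch.cuda.device(0):  # device 0 now maps to physical GPU 1
    x = torch.cuda.FloatTensor(1)
```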
I am happy to try to fix this, but I wanted to get a broader context on why things are this way first.