[DDP] Remove train call to module copies by rohan-varma · Pull Request #58595 · pytorch/pytorch · GitHub

rohan-varma commented May 19, 2021

Stack from ghstack:

No longer needed since this list is always of size 1.

Differential Revision: D28548426


facebook-github-bot commented May 19, 2021

💊 CI failures summary and remediations

As of commit 2e0f63d (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cpu_build (1/1)

Step: "Build"

FAILED: caffe2/CMakeFiles/torch_cpu.dir/operators/conv_transpose_gradient_op.cc.obj
caused by: Failed to read response header
caused by: failed to fill whole buffer
[3850/5024] C:\Users\circleci\project\build\win_tmp\bin\sccache-cl.exe   /TP -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_OPENMP_NOFORCE_MANIFEST -Dtorch_cpu_EXPORTS -Iaten\src -I..\aten\src -I. -I..\ -I..\cmake\..\third_party\benchmark\include -Icaffe2\contrib\aten -I..\third_party\onnx -Ithird_party\onnx -I..\third_party\foxi -Ithird_party\foxi -I..\torch\csrc\api -I..\torch\csrc\api\include -I..\caffe2\aten\src\TH -Icaffe2\aten\src\TH -Icaffe2\aten\src -Icaffe2\..\aten\src -Icaffe2\..\aten\src\ATen -I..\torch\csrc -I..\third_party\miniz-2.0.8 -I..\third_party\kineto\libkineto\include -I..\third_party\kineto\libkineto\src -I..\aten\src\TH -I..\aten\..\third_party\catch\single_include -I..\aten\src\ATen\.. -Icaffe2\aten\src\ATen -I..\caffe2\core\nomnigraph\include -I..\c10\.. 
-Ithird_party\ideep\mkl-dnn\include -I..\third_party\ideep\mkl-dnn\src\..\include -I..\third_party\pthreadpool\include -I..\third_party\cpuinfo\include -I..\third_party\fbgemm\include -I..\third_party\fbgemm -I..\third_party\fbgemm\third_party\asmjit\src -I..\third_party\FP16\include -I..\third_party\fmt\include -Ithird_party\gloo -I..\cmake\..\third_party\gloo -I..\cmake\..\third_party\googletest\googlemock\include -I..\cmake\..\third_party\googletest\googletest\include -I..\third_party\protobuf\src -Iwin_tmp\mkl\include -I..\third_party\XNNPACK\include -I..\third_party -I..\cmake\..\third_party\eigen -IC:\Jenkins\Miniconda3\include -IC:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -I..\cmake\..\third_party\pybind11\include -I..\third_party\ideep\mkl-dnn\include -I..\third_party\ideep\include -I..\caffe2 /DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/Users/circleci/project/build/win_tmp/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG -DCAFFE2_USE_GLOO -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -IC:/Users/circleci/project/build/win_tmp/mkl/include -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\operators\collect_and_distribute_fpn_rpn_proposals_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\operators\collect_and_distribute_fpn_rpn_proposals_op.cc
FAILED: caffe2/CMakeFiles/torch_cpu.dir/operators/collect_and_distribute_fpn_rpn_proposals_op.cc.obj 
C:\Users\circleci\project\build\win_tmp\bin\sccache-cl.exe   /TP -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_OPENMP_NOFORCE_MANIFEST -Dtorch_cpu_EXPORTS -Iaten\src -I..\aten\src -I. -I..\ -I..\cmake\..\third_party\benchmark\include -Icaffe2\contrib\aten -I..\third_party\onnx -Ithird_party\onnx -I..\third_party\foxi -Ithird_party\foxi -I..\torch\csrc\api -I..\torch\csrc\api\include -I..\caffe2\aten\src\TH -Icaffe2\aten\src\TH -Icaffe2\aten\src -Icaffe2\..\aten\src -Icaffe2\..\aten\src\ATen -I..\torch\csrc -I..\third_party\miniz-2.0.8 -I..\third_party\kineto\libkineto\include -I..\third_party\kineto\libkineto\src -I..\aten\src\TH -I..\aten\..\third_party\catch\single_include -I..\aten\src\ATen\.. -Icaffe2\aten\src\ATen -I..\caffe2\core\nomnigraph\include -I..\c10\.. 
-Ithird_party\ideep\mkl-dnn\include -I..\third_party\ideep\mkl-dnn\src\..\include -I..\third_party\pthreadpool\include -I..\third_party\cpuinfo\include -I..\third_party\fbgemm\include -I..\third_party\fbgemm -I..\third_party\fbgemm\third_party\asmjit\src -I..\third_party\FP16\include -I..\third_party\fmt\include -Ithird_party\gloo -I..\cmake\..\third_party\gloo -I..\cmake\..\third_party\googletest\googlemock\include -I..\cmake\..\third_party\googletest\googletest\include -I..\third_party\protobuf\src -Iwin_tmp\mkl\include -I..\third_party\XNNPACK\include -I..\third_party -I..\cmake\..\third_party\eigen -IC:\Jenkins\Miniconda3\include -IC:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -I..\cmake\..\third_party\pybind11\include -I..\third_party\ideep\mkl-dnn\include -I..\third_party\ideep\include -I..\caffe2 /DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/Users/circleci/project/build/win_tmp/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG -DCAFFE2_USE_GLOO -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -IC:/Users/circleci/project/build/win_tmp/mkl/include -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\operators\collect_and_distribute_fpn_rpn_proposals_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\operators\collect_and_distribute_fpn_rpn_proposals_op.cc
error: failed to execute compile
caused by: error reading compile response from server
caused by: Failed to read response header
caused by: failed to fill whole buffer
[3851/5024] C:\Users\circleci\project\build\win_tmp\bin\sccache-cl.exe   /TP -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_OPENMP_NOFORCE_MANIFEST -Dtorch_cpu_EXPORTS -Iaten\src -I..\aten\src -I. -I..\ -I..\cmake\..\third_party\benchmark\include -Icaffe2\contrib\aten -I..\third_party\onnx -Ithird_party\onnx -I..\third_party\foxi -Ithird_party\foxi -I..\torch\csrc\api -I..\torch\csrc\api\include -I..\caffe2\aten\src\TH -Icaffe2\aten\src\TH -Icaffe2\aten\src -Icaffe2\..\aten\src -Icaffe2\..\aten\src\ATen -I..\torch\csrc -I..\third_party\miniz-2.0.8 -I..\third_party\kineto\libkineto\include -I..\third_party\kineto\libkineto\src -I..\aten\src\TH -I..\aten\..\third_party\catch\single_include -I..\aten\src\ATen\.. -Icaffe2\aten\src\ATen -I..\caffe2\core\nomnigraph\include -I..\c10\.. 
-Ithird_party\ideep\mkl-dnn\include -I..\third_party\ideep\mkl-dnn\src\..\include -I..\third_party\pthreadpool\include -I..\third_party\cpuinfo\include -I..\third_party\fbgemm\include -I..\third_party\fbgemm -I..\third_party\fbgemm\third_party\asmjit\src -I..\third_party\FP16\include -I..\third_party\fmt\include -Ithird_party\gloo -I..\cmake\..\third_party\gloo -I..\cmake\..\third_party\googletest\googlemock\include -I..\cmake\..\third_party\googletest\googletest\include -I..\third_party\protobuf\src -Iwin_tmp\mkl\include -I..\third_party\XNNPACK\include -I..\third_party -I..\cmake\..\third_party\eigen -IC:\Jenkins\Miniconda3\include -IC:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -I..\cmake\..\third_party\pybind11\include -I..\third_party\ideep\mkl-dnn\include -I..\third_party\ideep\include -I..\caffe2 /DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/Users/circleci/project/build/win_tmp/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG -DCAFFE2_USE_GLOO -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -IC:/Users/circleci/project/build/win_tmp/mkl/include -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\operators\conv_transpose_gradient_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\operators\conv_transpose_gradient_op.cc
FAILED: caffe2/CMakeFiles/torch_cpu.dir/operators/conv_transpose_gradient_op.cc.obj 
C:\Users\circleci\project\build\win_tmp\bin\sccache-cl.exe   /TP -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_OPENMP_NOFORCE_MANIFEST -Dtorch_cpu_EXPORTS -Iaten\src -I..\aten\src -I. -I..\ -I..\cmake\..\third_party\benchmark\include -Icaffe2\contrib\aten -I..\third_party\onnx -Ithird_party\onnx -I..\third_party\foxi -Ithird_party\foxi -I..\torch\csrc\api -I..\torch\csrc\api\include -I..\caffe2\aten\src\TH -Icaffe2\aten\src\TH -Icaffe2\aten\src -Icaffe2\..\aten\src -Icaffe2\..\aten\src\ATen -I..\torch\csrc -I..\third_party\miniz-2.0.8 -I..\third_party\kineto\libkineto\include -I..\third_party\kineto\libkineto\src -I..\aten\src\TH -I..\aten\..\third_party\catch\single_include -I..\aten\src\ATen\.. -Icaffe2\aten\src\ATen -I..\caffe2\core\nomnigraph\include -I..\c10\.. 
-Ithird_party\ideep\mkl-dnn\include -I..\third_party\ideep\mkl-dnn\src\..\include -I..\third_party\pthreadpool\include -I..\third_party\cpuinfo\include -I..\third_party\fbgemm\include -I..\third_party\fbgemm -I..\third_party\fbgemm\third_party\asmjit\src -I..\third_party\FP16\include -I..\third_party\fmt\include -Ithird_party\gloo -I..\cmake\..\third_party\gloo -I..\cmake\..\third_party\googletest\googlemock\include -I..\cmake\..\third_party\googletest\googletest\include -I..\third_party\protobuf\src -Iwin_tmp\mkl\include -I..\third_party\XNNPACK\include -I..\third_party -I..\cmake\..\third_party\eigen -IC:\Jenkins\Miniconda3\include -IC:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -I..\cmake\..\third_party\pybind11\include -I..\third_party\ideep\mkl-dnn\include -I..\third_party\ideep\include -I..\caffe2 /DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/Users/circleci/project/build/win_tmp/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG -DCAFFE2_USE_GLOO -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -openmp:experimental -IC:/Users/circleci/project/build/win_tmp/mkl/include -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\operators\conv_transpose_gradient_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\operators\conv_transpose_gradient_op.cc
error: failed to execute compile
caused by: error reading compile response from server
caused by: Failed to read response header
caused by: failed to fill whole buffer
ninja: build stopped: subcommand failed.
-- Building version 1.9.0a0+git2e0f63d
 --- Trying to initialize submodules
 --- Submodule initialization took 135.53 sec
cmake -GNinja -DBUILD_ENVIRONMENT=pytorch-win-vs2019-cpu-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DCMAKE_BUILD_TYPE=Release -DCMAKE_GENERATOR=Ninja -DCMAKE_INCLUDE_PATH=C:\Users\circleci\project\build\win_tmp\mkl\include -DCMAKE_INSTALL_PREFIX=C:\Users\circleci\project\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCMAKE_VERBOSE_MAKEFILE=1 -DCUDA_NVCC_EXECUTABLE=C:\Users\circleci\project\build\win_tmp\bin\randomtemp.exe -DJAVA_HOME=C:\Program Files\OpenJDK\jdk-12.0.2 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python36.lib -DTORCH_BUILD_VERSION=1.9.0a0+git2e0f63d -DUSE_CUDA=0 -DUSE_NUMPY=True C:\Users\circleci\project


facebook-github-bot added the "oncall: distributed" and "cla signed" labels May 19, 2021
rohan-varma added a commit that referenced this pull request May 19, 2021
No longer needed since this list is always of size 1.

Differential Revision: [D28548426](https://our.internmc.facebook.com/intern/diff/D28548426/)

ghstack-source-id: 129383153
Pull Request resolved: #58595
def train(self, mode=True):
    super(DistributedDataParallel, self).train(mode)
    for module in self._module_copies[1:]:
        module.train(mode)
Contributor commented:

Do we still need to keep the train(mode) call?

rohan-varma (Author) commented May 19, 2021

The call to the parent class method on L894 already runs train(mode) on the module for this process. The removed loop is for the rest of the module copies on this process, but there are none (see https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/distributed.py#L492).
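
A minimal sketch of the reasoning in this reply, assuming (as the PR description states) that the copy list always has exactly one element. ToyModule and ToyDDP are hypothetical stand-ins written for illustration; they are not the PyTorch classes and only mimic the parts discussed here.

```python
# Hypothetical stand-ins for illustration only; not PyTorch's nn.Module or DDP.
class ToyModule:
    def __init__(self, *children):
        self.training = True
        self.children = list(children)

    def train(self, mode=True):
        # Set the flag on this module, then recurse into its children.
        self.training = mode
        for child in self.children:
            child.train(mode)
        return self


class ToyDDP(ToyModule):
    def __init__(self, module):
        super().__init__(module)
        # Assumption taken from the PR description: always of size 1.
        self._module_copies = [module]

    def train(self, mode=True):
        # The parent call already reaches the single wrapped module,
        # so no extra loop over self._module_copies[1:] is needed.
        super().train(mode)
        return self


inner = ToyModule()
ddp = ToyDDP(inner)

# Slicing a one-element list from index 1 yields an empty list,
# so a loop over it would never execute.
remaining = ddp._module_copies[1:]

ddp.train(False)
print(remaining, inner.training)  # prints: [] False
```

With a one-element list the slice is empty, so dropping the loop should not change behavior in this toy model: the wrapped module still switches mode through the parent call.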

facebook-github-bot commented:

This pull request has been merged in 1d67c6d.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/320/head branch May 24, 2021 14:17

Labels: cla signed, Merged, oncall: distributed


3 participants