Update faq.rst so OOM section mentions checkpoint by cpatru · Pull Request #62709 · pytorch/pytorch · GitHub


@cpatru (Contributor) commented Aug 4, 2021

The FAQ's section on CUDA out-of-memory (OOM) errors consists mostly of don'ts, which limits the modeling solutions it offers. Deep nets can exhaust memory because intermediate outputs are cached during training for use in the backward pass.
This is a known problem with a known solution: trading compute for memory via checkpointing.
The FAQ should mention it.
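
As a rough illustration of that trade-off (not part of this PR), here is a minimal pure-Python sketch on a chain of scalar `tanh` layers. In PyTorch itself the technique is provided by `torch.utils.checkpoint`; the function names below (`grad_full_cache`, `grad_checkpointed`) and the toy layers are hypothetical and only meant to show the idea: keep one activation per segment, then recompute the rest during the backward pass.

```python
import math


def layer_forward(w, x):
    # Toy "layer": y = tanh(w * x)
    return math.tanh(w * x)


def layer_backward(w, x, grad_out):
    # Chain rule for y = tanh(w * x): dy/dx = w * (1 - y^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def grad_full_cache(weights, x):
    """Default behaviour: cache every layer input during the forward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = layer_backward(w, x_in, grad)
    return grad, len(inputs)


def grad_checkpointed(weights, x, segment_size):
    """Checkpointing: cache only each segment's input, recompute the rest."""
    starts = list(range(0, len(weights), segment_size))
    checkpoints = []
    for start in starts:
        checkpoints.append(x)
        for w in weights[start:start + segment_size]:
            x = layer_forward(w, x)

    grad = 1.0
    peak_cached = len(checkpoints)
    for start, x_seg in zip(reversed(starts), reversed(checkpoints)):
        segment = weights[start:start + segment_size]
        # Extra compute: rerun the forward pass for this segment only.
        inputs = []
        x = x_seg
        for w in segment:
            inputs.append(x)
            x = layer_forward(w, x)
        # Simple upper bound on values held at once: all checkpoints
        # plus the recomputed inputs of the current segment.
        peak_cached = max(peak_cached, len(checkpoints) + len(inputs))
        for w, x_in in zip(reversed(segment), reversed(inputs)):
            grad = layer_backward(w, x_in, grad)
    return grad, peak_cached


weights = [0.9 + 0.01 * i for i in range(16)]
g_full, n_full = grad_full_cache(weights, 0.5)
g_ckpt, n_ckpt = grad_checkpointed(weights, 0.5, segment_size=4)
print(g_full, g_ckpt)  # the two gradients agree
print(n_full, n_ckpt)  # 16 cached values vs. at most 8
```

The checkpointed version runs each layer's forward computation twice, which is the compute cost paid for the smaller cache.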

@facebook-github-bot (Contributor)

Hi @cpatru!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot (Contributor) commented Aug 4, 2021

💊 CI failures summary and remediations

As of commit e0b3e26 (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (1/2)

Step: "Run test scripts"

2021-08-04T21:07:15.5070923Z RuntimeError: test_jit failed!
2021-08-04T21:07:14.6718089Z   test_pattern_based_rewrite (__main__.TestJit) ... ok (0.000s)
2021-08-04T21:07:14.7498194Z   test_pattern_based_rewrite_with_source_range_preserved (__main__.TestJit) ... ok (0.080s)
2021-08-04T21:07:14.7615306Z   test_peephole_optimize_shape_ops (__main__.TestJit) ... skip (0.016s)
2021-08-04T21:07:14.8850002Z   test_pretty_printer (__main__.TestJit) ... ok (0.112s)
2021-08-04T21:07:14.8869614Z   test_print_op_module (__main__.TestJit) ... ok (0.016s)
2021-08-04T21:07:15.5067697Z   test_profiler (__main__.TestJit) ... Traceback (most recent call last):
2021-08-04T21:07:15.5069244Z   File "run_test.py", line 1092, in <module>
2021-08-04T21:07:15.5069629Z     main()
2021-08-04T21:07:15.5070099Z   File "run_test.py", line 1071, in main
2021-08-04T21:07:15.5070513Z     raise RuntimeError(err_message)
2021-08-04T21:07:15.5070923Z RuntimeError: test_jit failed!
2021-08-04T21:07:15.6925866Z 
2021-08-04T21:07:15.6926784Z (base) C:\actions-runner\_work\pytorch\pytorch\pytorch-1098963943\test>if ERRORLEVEL 1 exit /b 1 
2021-08-04T21:07:15.6948988Z + cleanup
2021-08-04T21:07:15.6949340Z + retcode=1
2021-08-04T21:07:15.6949590Z + set +x
2021-08-04T21:07:15.6980349Z ##[error]Process completed with exit code 1.
2021-08-04T21:07:15.7201125Z ##[group]Run # -ir => recursive include all files in pattern
2021-08-04T21:07:15.7201734Z # -ir => recursive include all files in pattern
2021-08-04T21:07:15.7202277Z 7z a "test-reports-$Env:COMMIT_SHA1-$Env:WORKFLOW_ID.zip" -ir'!test\*.xml'
2021-08-04T21:07:15.7219415Z shell: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.EXE -command ". '{0}'"

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/2)

Step: "Build"

Aug 04 20:05:14 collect2: error: ld returned 1 exit status
Aug 04 20:05:13 [ 94%] Building CXX object caffe2/CMakeFiles/static_runtime_test.dir/__/benchmarks/static_runtime/test_static_runtime.cc.o
Aug 04 20:05:13 Scanning dependencies of target init_test
Aug 04 20:05:13 [ 94%] Building CXX object caffe2/CMakeFiles/init_test.dir/core/init_test.cc.o
Aug 04 20:05:13 [ 94%] Linking CXX executable ../bin/init_test
Aug 04 20:05:13 [ 94%] Linking CXX executable ../bin/static_runtime_test
Aug 04 20:05:14 [ 94%] Built target init_test
Aug 04 20:05:14 Scanning dependencies of target module_test
Aug 04 20:05:14 [ 94%] Building CXX object caffe2/CMakeFiles/module_test.dir/core/module_test.cc.o
Aug 04 20:05:14 CMakeFiles/static_runtime_test.dir/__/benchmarks/static_runtime/test_utils.cc.o: In function `torch::jit::test::testStaticRuntime(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::IValue, std::allocator<c10::IValue> > const&, std::vector<c10::IValue, std::allocator<c10::IValue> > const&, bool, bool, bool)':
Aug 04 20:05:14 test_utils.cc:(.text+0x2f2f): undefined reference to `torch::jit::disableUnsafeMathOp(char const*)'
Aug 04 20:05:14 collect2: error: ld returned 1 exit status
Aug 04 20:05:14 caffe2/CMakeFiles/static_runtime_test.dir/build.make:157: recipe for target 'bin/static_runtime_test' failed
Aug 04 20:05:14 make[2]: *** [bin/static_runtime_test] Error 1
Aug 04 20:05:14 make[1]: *** [caffe2/CMakeFiles/static_runtime_test.dir/all] Error 2
Aug 04 20:05:14 make[1]: *** Waiting for unfinished jobs....
Aug 04 20:05:14 CMakeFiles/Makefile2:10153: recipe for target 'caffe2/CMakeFiles/static_runtime_test.dir/all' failed
Aug 04 20:05:15 [ 94%] Linking CXX executable ../bin/module_test
Aug 04 20:05:16 [ 94%] Built target module_test
Aug 04 20:05:16 [ 94%] Built target caffe2_pybind11_state
Aug 04 20:05:16 make: *** [all] Error 2
Aug 04 20:05:16 Makefile:138: recipe for target 'all' failed

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot (Contributor)

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@cpatru changed the title from "Update faq.rst so CUDA OOMs mentions checkpoint" to "Update faq.rst so OOM section mentions checkpoint" on Aug 4, 2021
@facebook-github-bot (Contributor)

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in 6d896cb.


4 participants