Update faq.rst so OOM section mentions checkpoint #62709

cpatru · 2021-08-04T09:16:10Z

The FAQ has a section on CUDA OOMs that consists mostly of don'ts, which limits the modeling options available. Deep nets can blow up memory because intermediate outputs are cached during training.
This is a known problem with a known solution: trading compute for memory via checkpointing.
The FAQ should mention it.

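To make the compute-for-memory trade concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not PyTorch's implementation: PyTorch exposes the technique through `torch.utils.checkpoint`, while the toy scalar "network" and the names `LAYERS`, `grad_full_cache`, and `grad_checkpointed` below are hypothetical and chosen just for this example.

```python
import math


def dtanh(x):
    """Derivative of tanh at x."""
    return 1.0 - math.tanh(x) ** 2


# Hypothetical toy network: 16 identical scalar layers,
# each stored as a (forward, derivative) pair.
LAYERS = [(math.tanh, dtanh)] * 16


def grad_full_cache(x, layers):
    """Baseline: keep every layer input alive until the backward pass."""
    cached = []
    for forward, _ in layers:
        cached.append(x)
        x = forward(x)
    grad = 1.0
    for (_, derivative), inp in zip(reversed(layers), reversed(cached)):
        grad *= derivative(inp)  # chain rule, last layer first
    return x, grad, len(cached)


def grad_checkpointed(x, layers, segment):
    """Keep only each segment's input; recompute the rest during backward."""
    checkpoints = []
    for i, (forward, _) in enumerate(layers):
        if i % segment == 0:
            checkpoints.append(x)
        x = forward(x)
    out, grad, peak = x, 1.0, len(checkpoints)
    for s in reversed(range(len(checkpoints))):
        chunk = layers[s * segment:(s + 1) * segment]
        inp, local = checkpoints[s], []
        for forward, _ in chunk:  # extra forward pass: the compute we trade
            local.append(inp)
            inp = forward(inp)
        # Values held at once: all checkpoints plus one re-cached segment.
        peak = max(peak, len(checkpoints) + len(local))
        for (_, derivative), value in zip(reversed(chunk), reversed(local)):
            grad *= derivative(value)
        # `local` is rebuilt per segment, so only one segment is cached at a time.
    return out, grad, peak
```

With 16 layers and `segment=4`, the baseline holds 16 cached inputs, while the checkpointed version holds at most 8 (4 checkpoints plus one 4-layer segment) at the cost of one extra forward pass per segment. Both return the same output and gradient.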

facebook-github-bot · 2021-08-04T09:16:14Z

Hi @cpatru!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

facebook-github-bot · 2021-08-04T09:16:16Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/62709
📄 Preview docs built from this PR

💊 CI failures summary and remediations

As of commit e0b3e26 (more details on the Dr. CI page):

3/3 failures possibly* introduced in this PR
- 1/3 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (1/2)

Step: "Run test scripts"

2021-08-04T21:07:15.5070923Z RuntimeError: test_jit failed!

2021-08-04T21:07:14.6718089Z   test_pattern_based_rewrite (__main__.TestJit) ... ok (0.000s)
2021-08-04T21:07:14.7498194Z   test_pattern_based_rewrite_with_source_range_preserved (__main__.TestJit) ... ok (0.080s)
2021-08-04T21:07:14.7615306Z   test_peephole_optimize_shape_ops (__main__.TestJit) ... skip (0.016s)
2021-08-04T21:07:14.8850002Z   test_pretty_printer (__main__.TestJit) ... ok (0.112s)
2021-08-04T21:07:14.8869614Z   test_print_op_module (__main__.TestJit) ... ok (0.016s)
2021-08-04T21:07:15.5067697Z   test_profiler (__main__.TestJit) ... Traceback (most recent call last):
2021-08-04T21:07:15.5069244Z   File "run_test.py", line 1092, in <module>
2021-08-04T21:07:15.5069629Z     main()
2021-08-04T21:07:15.5070099Z   File "run_test.py", line 1071, in main
2021-08-04T21:07:15.5070513Z     raise RuntimeError(err_message)
2021-08-04T21:07:15.5070923Z RuntimeError: test_jit failed!
2021-08-04T21:07:15.6925866Z 
2021-08-04T21:07:15.6926784Z (base) C:\actions-runner\_work\pytorch\pytorch\pytorch-1098963943\test>if ERRORLEVEL 1 exit /b 1 
2021-08-04T21:07:15.6948988Z + cleanup
2021-08-04T21:07:15.6949340Z + retcode=1
2021-08-04T21:07:15.6949590Z + set +x
2021-08-04T21:07:15.6980349Z ##[error]Process completed with exit code 1.
2021-08-04T21:07:15.7201125Z ##[group]Run # -ir => recursive include all files in pattern
2021-08-04T21:07:15.7201734Z # -ir => recursive include all files in pattern
2021-08-04T21:07:15.7202277Z 7z a "test-reports-$Env:COMMIT_SHA1-$Env:WORKFLOW_ID.zip" -ir'!test\*.xml'
2021-08-04T21:07:15.7219415Z shell: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.EXE -command ". '{0}'"

pytorch_linux_xenial_py3_6_gcc5_4_build (2/2)

Step: "Build"

Aug 04 20:05:14 collect2: error: ld returned 1 exit status

Aug 04 20:05:13 [ 94%] Building CXX object caffe2/CMakeFiles/static_runtime_test.dir/__/benchmarks/static_runtime/test_static_runtime.cc.o
Aug 04 20:05:13 Scanning dependencies of target init_test
Aug 04 20:05:13 [ 94%] Building CXX object caffe2/CMakeFiles/init_test.dir/core/init_test.cc.o
Aug 04 20:05:13 [ 94%] Linking CXX executable ../bin/init_test
Aug 04 20:05:13 [ 94%] Linking CXX executable ../bin/static_runtime_test
Aug 04 20:05:14 [ 94%] Built target init_test
Aug 04 20:05:14 Scanning dependencies of target module_test
Aug 04 20:05:14 [ 94%] Building CXX object caffe2/CMakeFiles/module_test.dir/core/module_test.cc.o
Aug 04 20:05:14 CMakeFiles/static_runtime_test.dir/__/benchmarks/static_runtime/test_utils.cc.o: In function `torch::jit::test::testStaticRuntime(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::IValue, std::allocator<c10::IValue> > const&, std::vector<c10::IValue, std::allocator<c10::IValue> > const&, bool, bool, bool)':
Aug 04 20:05:14 test_utils.cc:(.text+0x2f2f): undefined reference to `torch::jit::disableUnsafeMathOp(char const*)'
Aug 04 20:05:14 collect2: error: ld returned 1 exit status
Aug 04 20:05:14 caffe2/CMakeFiles/static_runtime_test.dir/build.make:157: recipe for target 'bin/static_runtime_test' failed
Aug 04 20:05:14 make[2]: *** [bin/static_runtime_test] Error 1
Aug 04 20:05:14 make[1]: *** [caffe2/CMakeFiles/static_runtime_test.dir/all] Error 2
Aug 04 20:05:14 make[1]: *** Waiting for unfinished jobs....
Aug 04 20:05:14 CMakeFiles/Makefile2:10153: recipe for target 'caffe2/CMakeFiles/static_runtime_test.dir/all' failed
Aug 04 20:05:15 [ 94%] Linking CXX executable ../bin/module_test
Aug 04 20:05:16 [ 94%] Built target module_test
Aug 04 20:05:16 [ 94%] Built target caffe2_pybind11_state
Aug 04 20:05:16 make: *** [all] Error 2
Aug 04 20:05:16 Makefile:138: recipe for target 'all' failed

ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-bionic-rocm4.2-py3.6

This comment was automatically generated by Dr. CI (expand for details).


facebook-github-bot · 2021-08-04T09:53:56Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

facebook-github-bot · 2021-08-04T15:56:05Z

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-08-04T19:38:39Z

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-08-05T14:41:35Z

@ezyang merged this pull request in 6d896cb.

pytorchbot added the open source label Aug 4, 2021

facebook-github-bot added the cla signed label Aug 4, 2021

cpatru changed the title ~~Update faq.rst so CUDA OOMs mentions checkpoint~~ Update faq.rst so OOM section mentions checkpoint Aug 4, 2021

cpatru and others added 2 commits August 4, 2021 14:11

Update faq.rst

9814797

Update faq.rst

ba86ac9

ezyang approved these changes Aug 4, 2021

Update faq.rst

e0b3e26

facebook-github-bot closed this in 6d896cb Aug 5, 2021

facebook-github-bot added the Merged label Aug 5, 2021
