[modeling_utils] use less cpu memory with sharded checkpoint loading #16844
This PR lowers the peak CPU memory usage during sharded checkpoint loading.

The following demonstration tells the full story. I'm using `/usr/bin/time -f %M` to report max RSS, i.e. the peak CPU memory used by the process. The demo uses T0, which is 42GB in fp32: https://huggingface.co/bigscience/T0/tree/main
With normal (non-sharded) loading, the program needs 87GB of CPU RAM (42GB x 2, plus a few GB for temporaries).

After this PR, sharded loading peaks at 1x the model size (42GB here) + the largest shard (10GB) + some temporaries = 53GB.
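The two memory budgets can be sanity-checked with quick arithmetic (figures taken from the demo above; the allowance for temporaries is a rough assumption, not a measured value):

```python
# Memory budgets for loading a 42GB fp32 model (bigscience/T0), in GB.
model_size = 42       # full fp32 model
largest_shard = 10    # biggest checkpoint shard
temps = 1             # rough allowance for temporaries (assumption)

# Normal loading: full checkpoint copy + instantiated model live at once.
normal_peak = 2 * model_size + temps

# Sharded loading after this PR: model + one shard at a time.
sharded_peak = model_size + largest_shard + temps

print(normal_peak, sharded_peak)  # roughly matches the 87GB / 53GB observed
```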
Before this PR, sharded loading was peaking an additional 15GB (about 1.5x the largest shard) higher than that.
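The core idea behind the saving can be sketched as follows. This is a pure-Python stand-in, not the actual `modeling_utils` code: the helper names, the toy shards, and the dict standing in for model parameters are all invented for illustration. The point is that only one shard's state dict is alive at a time, and it is freed before the next shard is read:

```python
import gc

def load_sharded(model_params, shard_files, load_shard):
    """Load checkpoint shards one at a time, freeing each before the next.

    model_params: dict to fill (stands in for the model's parameters)
    shard_files: ordered shard identifiers
    load_shard: callable returning one shard's state dict
    """
    for shard_file in shard_files:
        state_dict = load_shard(shard_file)  # only this shard is in memory
        model_params.update(state_dict)      # copy weights into the model
        del state_dict                       # drop the shard's dict early,
        gc.collect()                         # so peak RSS ~ model + 1 shard

# Toy demonstration: three "shards" of a six-parameter "model".
shards = {
    "shard-0": {"w0": 0.1, "w1": 0.2},
    "shard-1": {"w2": 0.3, "w3": 0.4},
    "shard-2": {"w4": 0.5, "w5": 0.6},
}
params = {}
load_sharded(params, list(shards), lambda f: dict(shards[f]))
print(sorted(params))  # all six parameters end up in the model
```

In the real code path the per-shard load is a `torch.load` of one checkpoint file; keeping a reference to each shard's state dict past its `load` is what produced the extra ~1.5x-shard peak.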
@sgugger