[modeling_utils] use less cpu memory with sharded checkpoint loading by stas00 · Pull Request #16844 · huggingface/transformers · GitHub

Conversation

@stas00
Contributor

@stas00 stas00 commented Apr 20, 2022

This PR lowers the peak CPU memory usage for sharded checkpoint loading.

The following demonstration tells the full story. I'm using /usr/bin/time -f %M to report the max RSS, i.e. the peak CPU memory used by the process.
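Note that %M is reported in kilobytes, so dividing by a million gives a rough GB figure for the numbers below (a throwaway helper, just for illustration):

# rough conversion of /usr/bin/time -f %M output (max RSS, in kilobytes) to GB
def max_rss_kb_to_gb(max_rss_kb: int) -> float:
    return max_rss_kb / 1_000_000

print(max_rss_kb_to_gb(87286376))  # ~87.3, i.e. the ~87GB quoted below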

This demo uses T0, which is 42GB in fp32: https://huggingface.co/bigscience/T0/tree/main

So with normal (non-sharded) loading the program needs 87GB of CPU RAM (2x 42GB plus a few GB for temporaries):

# full checkpoint
/usr/bin/time -f %M python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('bigscience/T0')"
87286376

# shard it to 10GB / shard
python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('bigscience/T0'); \
model.save_pretrained('t0-sharded')"
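# note: 10GB per shard is save_pretrained's default max_shard_size; an explicit
# size could be requested instead, e.g.:
#   model.save_pretrained('t0-sharded', max_shard_size='5GB')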

# before this PR
/usr/bin/time -f %M python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('t0-sharded')"
68358000

# after this PR
/usr/bin/time -f %M python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('t0-sharded')"
53529416

So after this PR the peak CPU memory usage is 1x the model size (42GB here) + the largest shard (10GB) + some temporaries ≈ 53GB.

Before this PR we were getting roughly an additional 15GB (~1.5x the shard size) of peak CPU memory.
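For reference, the idea is to keep only one shard's state_dict resident in CPU memory at a time and to explicitly release it before moving on to the next shard. Below is a minimal sketch of that pattern, not the actual modeling_utils code; the helper name is made up and error/metadata handling is omitted:

import gc
import json
import os

import torch


def load_sharded_checkpoint_lowmem(model, folder):
    # the index maps each parameter name to the shard file that contains it
    with open(os.path.join(folder, "pytorch_model.bin.index.json")) as f:
        index = json.load(f)
    shard_files = sorted(set(index["weight_map"].values()))

    for shard_file in shard_files:
        # materialize only this shard in CPU RAM (~10GB for T0)
        state_dict = torch.load(os.path.join(folder, shard_file), map_location="cpu")
        model.load_state_dict(state_dict, strict=False)

        # drop the shard's tensors before loading the next one, so peak usage stays
        # around 1x model + 1x largest shard instead of accumulating extra copies
        del state_dict
        gc.collect()

The explicit del plus gc.collect() is what forces Python to actually give the memory back before the next shard is read (see the review comment below).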

@sgugger

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 20, 2022

The documentation is not available anymore as the PR was closed or merged.

@stas00 stas00 changed the title [modeling_utils] less cpu memory with sharded checkpoint loading [modeling_utils] use less cpu memory with sharded checkpoint loading Apr 20, 2022
Collaborator

@sgugger sgugger left a comment


I confirm I had to do the same in other tools I'm developing, to make sure Python was releasing the memory.
Thanks for adding this!

@stas00 stas00 merged commit afa1ef0 into huggingface:main Apr 20, 2022
@stas00 stas00 deleted the low-cpu-mem-ds branch April 20, 2022 14:44
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…uggingface#16844)

* less cpu memory with sharded checkpoint loading

* Trigger CI

* Trigger CI
