[DSD] Implement broadcast_from_rank0 option for model state_dict #125338
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125338
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d113109 with merge base 196a0b1.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
if pg is None:
    pg = dist.distributed_c10d._get_default_group()
dist._broadcast_coalesced(pg, tensors, 500, 0)
After this _broadcast_coalesced call, does every rank have the full state dict in GPU memory?
I think no matter what, we want to interleave the broadcast with sharding. Otherwise, we either use 8x CPU memory (not mmap'd) across the host or 1x GPU memory per GPU.
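For illustration, a minimal sketch of the interleaved approach (hypothetical helper and argument names, not the PR's actual code): rank 0 broadcasts one full tensor at a time and every rank shards it into a DTensor right away, so at most one full tensor is resident in GPU memory at any point.

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import distribute_tensor

def _broadcast_and_shard(full_state_dict, dtensor_templates, device, pg=None):
    """Hypothetical sketch: broadcast one full tensor at a time from rank 0 and
    shard it immediately, so only one full tensor lives in GPU memory at once."""
    sharded = {}
    for key, template in dtensor_templates.items():
        if dist.get_rank(pg) == 0:
            full = full_state_dict[key].to(device)
        else:
            # Non-zero ranks only need the metadata (global shape/dtype).
            full = torch.empty(template.shape, dtype=template.dtype, device=device)
        dist.broadcast(full, src=0, group=pg)
        # Keep only this rank's shard; the full copy is freed right after.
        sharded[key] = distribute_tensor(full, template.device_mesh, template.placements)
        del full
    return sharded
```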
Good point, I forgot to address overlapping. The new version has already taken care of this issue. However, do you still recommend avoiding broadcast_coalesced and instead using regular broadcast?
if pg is None:
    pg = dist.distributed_c10d._get_default_group()
dist._broadcast_coalesced(pg, tensors, 500, 0)
Another detail: I think we should be careful about using dist._broadcast_coalesced today since it will call recordStream on the input tensors. To avoid this, I think we can either manually use the coalescing context from Python or just use dist.broadcast.
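For reference, a minimal sketch of the first alternative, the Python coalescing context, shown with all_reduce since, as noted below in the thread, broadcast was not supported by it at the time; the helper name and arguments are illustrative only.

```python
import torch
import torch.distributed as dist
from torch.distributed.distributed_c10d import _coalescing_manager

def coalesced_allreduce(tensors, device, pg=None):
    """Sketch of "manually using the coalescing context from Python".

    Shown with all_reduce because the coalescing manager did not support
    broadcast at the time. Assumes an initialized NCCL process group and that
    `tensors` already live on `device`.
    """
    with _coalescing_manager(group=pg, device=device):
        for t in tensors:
            dist.all_reduce(t, group=pg)
    # The collectives queued inside the context are flushed as one coalesced
    # operation when the context exits.
```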
Our current coalescing context does not support broadcast. I can take a look to see how we can add the support. But even with recordStream, is that really going to cause a problem if we only do this when loading the checkpoint?
We had a debugging session last week in torchtune around this. For 70B, recordStream during checkpoint loading peaks memory at 78 GB; it's reduced to 22 GB without recordStream.
We have time to figure out the coalescing ctx, though.
The above comment about avoiding the full state_dict in GPU memory is critical and time sensitive.
We were doing broadcast + distribute_tensor together in one for loop in torchtune; maybe that also applies here.
Oh, right. This implementation may blow out the GPU memory. The new version already changes this. If dist._broadcast_coalesced is really an issue, I can change it to the regular broadcast, since the current context manager version does not support broadcast.
@fegin I missed this. Users need to set TORCH_NCCL_AVOID_RECORD_STREAMS when using dist._broadcast_coalesced to release GPU memory in a timely manner. Maybe use regular broadcast to avoid recordStream until we have a solution for coalescing?
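As a sketch of that workaround (assuming the variable is picked up when the NCCL process group is created, so it should be set before init_process_group or exported in the launch environment):

```python
import os

# Workaround described in the comment above: disable recordStream in NCCL
# collectives so broadcast buffers are released promptly. Set this before the
# process group is created (or export it before launching the job).
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"
```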
@pytorchbot merge
Merge failed. Reason: This PR needs a label. To add a label, you can comment to pytorchbot. (Details for Dev Infra team: raised by workflow job.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…25339)
Summary: This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.
Pull Request resolved: #125339
Approved by: https://github.com/weifengpy
ghstack dependencies: #125708, #125338
else:
    assert device == value.device
assert device is not None
_broadcast_state_dict(state_dict, local_state_dict, device=device)
@fegin For a meta-init model, we will see an error because device=meta. Is there a way to resolve it? E.g., passing the device from StateDictOptions?
    dtype=tensor_info.dtype,
)
tensors.append(full_tensor)
With this approach, the entire state dict will have to fit into a single GPU? But that's not guaranteed to work. Shouldn't this broadcast and distribute each tensor individually?
There doesn't seem to be a way currently to efficiently load a full state dict checkpoint into a DTensor model with the new torch.distributed.checkpoint APIs.
No, that's not true; we don't need to fit the entire state_dict into a single GPU. The full_tensor is not kept after the broadcast, and we now broadcast in groups of 10 tensors.
I'm simply getting OOM trying to load Llama 70B the way it is currently. So what I did instead is load each parameter individually, which works and saves memory:
https://github.com/Lightning-AI/pytorch-lightning/blob/bd2843f6cbac769ad71b0b6404e411e9844ea9ce/src/lightning/fabric/strategies/model_parallel.py#L541-L553
Stack from ghstack (oldest at bottom):
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.
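A minimal usage sketch of the option this PR adds (assumptions: the checkpoint path and model are placeholders, rank 0 loads the full state_dict to CPU, and non-zero ranks pass an empty dict):

```python
import torch
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    set_model_state_dict,
)

def load_full_checkpoint(model, ckpt_path="checkpoint.pt"):
    # Rank 0 holds the full state_dict on CPU; the other ranks pass an empty
    # dict and receive their shards via broadcast from rank 0.
    full_sd = torch.load(ckpt_path, map_location="cpu") if dist.get_rank() == 0 else {}
    set_model_state_dict(
        model,
        full_sd,
        options=StateDictOptions(full_state_dict=True, broadcast_from_rank0=True),
    )
```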
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC