[RFC] Don't materialize ignored modules for FSDP #108032
Conversation
Per title. This seems needed for cases where I have a large embedding that I want to manage separately, but FSDP would otherwise initialize it and thus consume the memory. The interaction with torchdistX materialize_module is currently not tested; this can be done as follow-up work. Differential Revision: [D48722046](https://our.internmc.facebook.com/intern/diff/D48722046/)
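For readers skimming the RFC, the intended usage looks roughly like the sketch below (hedged: `EmbeddingModel` and its attribute names are made up for illustration; `FullyShardedDataParallel`, `ignored_modules`, and `device_id` are the real FSDP arguments involved). With this change, the ignored embedding is left to be managed separately rather than being materialized by FSDP:

```python
# Illustrative sketch only; assumes torch.distributed.init_process_group()
# has already been called and a GPU is available. EmbeddingModel is hypothetical.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class EmbeddingModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Large embedding that should be managed outside of FSDP.
        self.embedding = nn.Embedding(50_000_000, 128)
        self.proj = nn.Linear(128, 128)

model = EmbeddingModel()
fsdp_model = FSDP(
    model,
    ignored_modules=[model.embedding],  # FSDP should not materialize or shard this
    device_id=torch.cuda.current_device(),
)
```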
SGTM!
device_id=self.rank,
ignored_modules=[m.a],
use_orig_params=True,
param_init_fn=lambda m: m.cuda(),
In general, this lambda m: m.cuda() would do some repeated checks trying to move modules to CUDA since param_init_fn would be called on every module. This should just lead to some CPU overhead since copying to CUDA is a no-op if already on CUDA.
As a tiny nit, the variable shadowing of m is also a bit precarious.
Sorry, I should cancel the approve. The unit test is failing.
return self._apply(lambda t: t.cuda(device))
NotImplementedError: Cannot copy out of meta tensor; no data!
I think you need a different param_init_fn.
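One possible alternative (a hedged sketch, not necessarily the fix adopted in the PR): `Module.to_empty()` allocates storage on the target device without copying data, so it does not trip over meta tensors; actual weight initialization still has to happen afterwards, e.g. via `reset_parameters()`:

```python
import torch

def meta_safe_param_init_fn(module: torch.nn.Module) -> None:
    # Allocate (uninitialized) storage on the GPU; unlike .cuda(), this does not
    # try to copy data out of a meta tensor.
    module.to_empty(device=torch.device("cuda"))
    # Re-run initialization for any submodule that supports it.
    for submodule in module.modules():
        if hasattr(submodule, "reset_parameters"):
            submodule.reset_parameters()
```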
That's weird, I feel like I ran the test before sending the PR...
Since these modules are ignored by FSDP, don't move them to the device. Differential Revision: [D48727044](https://our.internmc.facebook.com/intern/diff/D48727044/) Pull Request resolved: #108033 Approved by: https://github.com/awgu ghstack dependencies: #108032
…NTRANT (#108435) We should use no_reentrant. There are a lot of users of this API, but it is in a prototype state, so it should be fine to change. Differential Revision: [D48898148](https://our.internmc.facebook.com/intern/diff/D48898148/) Pull Request resolved: #108435 Approved by: https://github.com/awgu ghstack dependencies: #108032, #108033
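For context on that follow-up, selecting the non-reentrant implementation explicitly looks roughly like this (a sketch; `layer` is a placeholder module, and `checkpoint_wrapper` lives under a private `_checkpoint` namespace, reflecting its prototype status):

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper,
)

layer = nn.Linear(1024, 1024)  # placeholder module
# Non-reentrant activation checkpointing, matching the new default.
wrapped = checkpoint_wrapper(layer, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
```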