[RFC] Don't materialize ignored modules for FSDP #108032
Conversation
Per title. This seems needed for cases where I have a large embedding that I want to manage separately, but FSDP would otherwise initialize it and thus consume the memory. The interaction with torchdistX materialize_module is currently not tested; this can be done as follow-up work. Differential Revision: [D48722046](https://our.internmc.facebook.com/intern/diff/D48722046/)
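For readers skimming the RFC, the intended usage looks roughly like the sketch below (hedged: `EmbeddingModel` and its attribute names are made up for illustration; `FullyShardedDataParallel`, `ignored_modules`, and `device_id` are the real FSDP arguments involved). With this change, the ignored embedding is left to be managed separately rather than being materialized by FSDP:

```python
# Illustrative sketch only; assumes torch.distributed.init_process_group()
# has already been called and a GPU is available. EmbeddingModel is hypothetical.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class EmbeddingModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Large embedding that should be managed outside of FSDP.
        self.embedding = nn.Embedding(50_000_000, 128)
        self.proj = nn.Linear(128, 128)

model = EmbeddingModel()
fsdp_model = FSDP(
    model,
    ignored_modules=[model.embedding],  # FSDP should not materialize or shard this
    device_id=torch.cuda.current_device(),
)
```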
SGTM!
device_id=self.rank,
ignored_modules=[m.a],
use_orig_params=True,
param_init_fn=lambda m: m.cuda(),
In general, this lambda m: m.cuda() would do some repeated checks trying to move modules to CUDA since param_init_fn would be called on every module. This should just lead to some CPU overhead since copying to CUDA is a no-op if already on CUDA.
As a tiny nit, the variable shadowing of m is also a bit precarious.
Sorry, I should cancel the approve. The unit test is failing.
return self._apply(lambda t: t.cuda(device))
NotImplementedError: Cannot copy out of meta tensor; no data!
I think you need a different param_init_fn.
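One possible alternative (a hedged sketch, not necessarily the fix adopted in the PR): `Module.to_empty()` allocates storage on the target device without copying data, so it does not trip over meta tensors; actual weight initialization still has to happen afterwards, e.g. via `reset_parameters()`:

```python
import torch

def meta_safe_param_init_fn(module: torch.nn.Module) -> None:
    # Allocate (uninitialized) storage on the GPU; unlike .cuda(), this does not
    # try to copy data out of a meta tensor.
    module.to_empty(device=torch.device("cuda"))
    # Re-run initialization for any submodule that supports it.
    for submodule in module.modules():
        if hasattr(submodule, "reset_parameters"):
            submodule.reset_parameters()
```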
That's weird, I feel like I ran the test before sending the PR...
Since these modules are ignored by FSDP, don't move them to the device. Differential Revision: [D48727044](https://our.internmc.facebook.com/intern/diff/D48727044/) Pull Request resolved: #108033 Approved by: https://github.com/awgu ghstack dependencies: #108032
…NTRANT (#108435) We should use no_reentrant. There are a lot of users of this API, but it is in a prototype state, so it should be fine to change. Differential Revision: [D48898148](https://our.internmc.facebook.com/intern/diff/D48898148/) Pull Request resolved: #108435 Approved by: https://github.com/awgu ghstack dependencies: #108032, #108033
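For context on that follow-up, selecting the non-reentrant implementation explicitly looks roughly like this (a sketch; `layer` is a placeholder module, and `checkpoint_wrapper` lives under a private `_checkpoint` namespace, reflecting its prototype status):

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper,
)

layer = nn.Linear(1024, 1024)  # placeholder module
# Non-reentrant activation checkpointing, matching the new default.
wrapped = checkpoint_wrapper(layer, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
```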