[FSDP2] Fixed incorrect tensor meta after `.to(dtype)` #137593

awgu · 2024-10-09T15:15:19Z

Stack from ghstack (oldest at bottom):

This fixes #137522. After a method that changes the module parameters (like `.to(torch.float64)`), we need to update the `DTensorSpec`, since the dtype in its `TensorMeta` may have changed.
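To make the failure mode concrete, here is a minimal pure-Python sketch of a cached spec going stale after a dtype change. It does not use PyTorch: `TensorMetaSketch`, `SpecSketch`, `ShardedParamSketch`, `FSDPParamSketch`, and `on_param_changed` are hypothetical stand-ins invented for illustration, and dtypes are plain strings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TensorMetaSketch:
    # Hypothetical stand-in for the tensor metadata stored in a spec.
    shape: tuple
    dtype: str


@dataclass(frozen=True)
class SpecSketch:
    # Hypothetical stand-in for a sharding spec that carries tensor metadata.
    placements: tuple
    tensor_meta: TensorMetaSketch


class ShardedParamSketch:
    def __init__(self, spec: SpecSketch) -> None:
        self._spec = spec

    def to(self, dtype: str) -> "ShardedParamSketch":
        # A dtype change yields a new parameter whose spec has new metadata.
        new_meta = replace(self._spec.tensor_meta, dtype=dtype)
        return ShardedParamSketch(replace(self._spec, tensor_meta=new_meta))


class FSDPParamSketch:
    def __init__(self, sharded_param: ShardedParamSketch) -> None:
        self.sharded_param = sharded_param
        self._sharding_spec = sharded_param._spec  # cached at construction

    def on_param_changed(self, new_param: ShardedParamSketch, refresh_spec: bool) -> None:
        # Hypothetical hook that runs after the module's parameters changed.
        self.sharded_param = new_param
        if refresh_spec:
            # Mirrors the one-line fix in the diff: re-read the spec.
            self._sharding_spec = self.sharded_param._spec


param = ShardedParamSketch(SpecSketch(("Shard(0)",), TensorMetaSketch((8, 4), "float32")))

stale = FSDPParamSketch(param)
stale.on_param_changed(param.to("float64"), refresh_spec=False)

fixed = FSDPParamSketch(param)
fixed.on_param_changed(param.to("float64"), refresh_spec=True)

print(stale._sharding_spec.tensor_meta.dtype)  # still the old dtype
print(fixed._sharding_spec.tensor_meta.dtype)  # matches the new parameter
```

Without the refresh, the cached spec keeps describing the old dtype even though the parameter itself has changed.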

cc @XilunWu @H-Huang @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot · 2024-10-09T15:15:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137593

✅ No Failures

As of commit 28daaa3 with merge base d1b87e2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: d922192
Pull Request resolved: #137593

Skylion007 · 2024-10-09T15:27:00Z

torch/distributed/_composable/fsdp/_fsdp_param.py

        if updated_local_tensor:
            # Only change the local tensor object if needed
            self.sharded_param._local_tensor = local_tensor[: self.sharded_size[0]]
+        self._sharding_spec = self.sharded_param._spec


Question: instead of caching `self._sharding_spec`, would it make sense to make it a property that always just queries `self.sharded_param._spec`?

That sounds good!

took a quick look -- this will require some refactoring, so I will defer this to a separate PR
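For reference, the property-based alternative suggested in this thread could look roughly like the sketch below. It is pure Python with hypothetical class names (`ParamSketch`, `FSDPParamWithProperty`) and is not the refactor that was deferred; it only shows why a property cannot go stale.

```python
class ParamSketch:
    # Hypothetical stand-in for a sharded parameter that owns its spec.
    def __init__(self, spec: str) -> None:
        self._spec = spec


class FSDPParamWithProperty:
    def __init__(self, sharded_param: ParamSketch) -> None:
        self.sharded_param = sharded_param

    @property
    def _sharding_spec(self) -> str:
        # Always read through to the current parameter; nothing is cached.
        return self.sharded_param._spec


wrapper = FSDPParamWithProperty(ParamSketch("spec(dtype=float32)"))
before = wrapper._sharding_spec

# Swapping the parameter is reflected immediately, with no manual refresh.
wrapper.sharded_param = ParamSketch("spec(dtype=float64)")
after = wrapper._sharding_spec
```

The trade-off is that every code path that currently assigns `_sharding_spec` would have to change, which is consistent with the author's note that this needs some refactoring.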

awgu · 2024-10-09T17:08:32Z

@pytorchbot merge

pytorchmergebot · 2024-10-09T17:10:28Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

## Overview

This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.

```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```

## Follow-Ups

- **Copy kernels:** For all-gather copy-out, we currently copy out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on a nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on a nonzero tensor dim. @yifuwang has ideas for adding additional split size args to the copy ops that allow fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.

Pull Request resolved: #137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
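The divisibility requirement and the "chunk-dim-0 -> cat-shard-dim" step in the commit message above can be illustrated with a small sketch. It uses nested Python lists in place of tensors and assumes the all-gather output is laid out as per-rank shards stacked along dim 0; the helper names (`shard_on_dim1`, `all_gather_dim0`, `chunk_dim0_cat_dim1`) are hypothetical and are not FSDP APIs.

```python
from typing import List

Matrix = List[List[int]]


def shard_on_dim1(full: Matrix, world_size: int) -> List[Matrix]:
    """Split the columns evenly across ranks (sharding on dim 1)."""
    num_cols = len(full[0])
    if num_cols % world_size != 0:
        raise ValueError("shard dim size must be divisible by the shard world size")
    width = num_cols // world_size
    return [
        [row[rank * width:(rank + 1) * width] for row in full]
        for rank in range(world_size)
    ]


def all_gather_dim0(shards: List[Matrix]) -> Matrix:
    """Model the gathered buffer: each rank's shard stacked along dim 0."""
    return [row for shard in shards for row in shard]


def chunk_dim0_cat_dim1(gathered: Matrix, world_size: int) -> Matrix:
    """Chunk along dim 0 per rank, then concatenate the chunks along dim 1."""
    rows_per_rank = len(gathered) // world_size
    chunks = [
        gathered[rank * rows_per_rank:(rank + 1) * rows_per_rank]
        for rank in range(world_size)
    ]
    return [
        sum((chunk[i] for chunk in chunks), [])
        for i in range(rows_per_rank)
    ]


full = [[0, 1, 2, 3], [4, 5, 6, 7]]
shards = shard_on_dim1(full, world_size=2)
gathered = all_gather_dim0(shards)  # rows are in rank order, not original layout
restored = chunk_dim0_cat_dim1(gathered, world_size=2)
```

In this model the gathered rows are not in the original layout until the chunk-and-concatenate step runs, which corresponds to the extra copy that the follow-up proposes to fuse into the existing copy ops.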

[FSDP2] Fixed incorrect tensor meta after .to(dtype)

28daaa3

awgu pushed a commit that referenced this pull request Oct 9, 2024

[FSDP2] Fixed incorrect tensor meta after .to(dtype)

d6afac4

ghstack-source-id: d922192
Pull Request resolved: #137593

pytorch-bot bot added the `oncall: distributed` and `release notes: distributed (fsdp)` labels Oct 9, 2024

awgu added the `release notes: distributed (fsdp2)` label and removed the `release notes: distributed (fsdp)` label Oct 9, 2024

awgu marked this pull request as ready for review October 9, 2024 15:16

awgu requested a review from weifengpy October 9, 2024 15:16

awgu added the `ciflow/trunk` label Oct 9, 2024

awgu mentioned this pull request Oct 9, 2024

[FSDP2] Added shard_placement_fn arg #137496

Closed

Skylion007 approved these changes Oct 9, 2024

Skylion007 reviewed Oct 9, 2024

Skylion007 approved these changes Oct 9, 2024

pytorchmergebot added the merging label Oct 9, 2024

pytorchmergebot added the Merged label Oct 9, 2024

pytorchmergebot closed this in ceb2fcc Oct 9, 2024

pytorchmergebot removed the merging label Oct 9, 2024

This was referenced Oct 9, 2024

NotImplementedError: Operator aten.unbind.int does not have a sharding #137649

Closed

[benchmark] mimic new copy-out kernels with out_views #137683

Closed

github-actions bot deleted the gh/awgu/650/head branch November 9, 2024 02:02
