[fsdp2] based on device, use stream and Event by jeejakp12 · Pull Request #136843 · pytorch/pytorch · GitHub

Conversation

@jeejakp12
Contributor

@jeejakp12 jeejakp12 commented Sep 27, 2024

Currently FSDP2 supports only CUDA; for other backends that need to use FSDP2, it won't work because streams and events are CUDA-based. To support other backends, use _get_device_handle with the device type to get the device module, and use that for streams and events.
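
A minimal sketch of the idea (not the PR's exact code; the import path for _get_device_handle is an assumption):

    import torch
    from torch.distributed.device_mesh import _get_device_handle  # import path is an assumption

    device = torch.device("cuda")  # could be "hpu", "xpu", ... on other backends
    device_handle = _get_device_handle(device.type)  # resolves e.g. torch.cuda / torch.hpu

    # Streams and events come from the resolved handle instead of torch.cuda directly,
    # so the same code path can work for non-CUDA backends.
    stream = device_handle.Stream()
    event = device_handle.Event()
    with device_handle.stream(stream):
        event.record()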

Fixes #ISSUE_NUMBER

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Sep 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136843

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7750cda with merge base f54e142:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Sep 27, 2024
@Skylion007 Skylion007 requested a review from awgu September 27, 2024 13:54
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Sep 27, 2024
dist.all_reduce(
    reduce_output,
    group=all_reduce_group,
    op=ReduceOp.AVG if predivide_factor is None else ReduceOp.SUM,
    # hpu need to add support fot AVG, just change it to proceed, will see accuracy issue
Collaborator

nit: maybe you could leave this comment out unless you are confident that you are going to fix it and then remove the comment later?

otherwise, I would prefer that you track the HPU to-do separately from the main code

Contributor Author

@awgu sorry, that was a temporary fix for verifying it on HPU; I forgot to remove it. Fixed in the latest patch.

@jeejakp12 jeejakp12 force-pushed the origin/jeeja_use_device_handle_for_stream_event branch from bb94015 to 949e2ce Compare September 27, 2024 15:59
@awgu
Collaborator

awgu commented Sep 27, 2024

change looks good to me overall, let me let CI run first, and I will do a second pass

@awgu
Collaborator

awgu commented Sep 27, 2024

@jeejakp12 sorry, could you also fix the failing unit tests? Fortunately, I think it should not be a complicated fix.

@awgu awgu added release notes: distributed (fsdp2) release notes category and removed release notes: distributed (fsdp) release notes category labels Sep 27, 2024
@jeejakp12 jeejakp12 force-pushed the origin/jeeja_use_device_handle_for_stream_event branch from 949e2ce to ea62d93 Compare September 30, 2024 09:38
@jeejakp12
Contributor Author

@pytorchbot drci

Collaborator

@awgu awgu left a comment

Can we please try to minimize the code surface affected? For example, let us not add default args if not needed; let us not add None paths if not needed, etc.

I left some comments regarding these inline.

    all_reduce_grads: bool,
    partial_reduce_output: Optional[torch.Tensor],  # only used for HSDP
-) -> Tuple[torch.Tensor, torch.cuda.Event, torch.cuda.Event, Optional[torch.Tensor]]:
+) -> Tuple[torch.Tensor, torch.Event, torch.Event, Optional[torch.Tensor]]:
+    device_handle = _get_device_handle_from_device(device)
Collaborator

nit: let us not put code before the block comment 🤔

Contributor Author

will fix it

@@ -150,3 +151,16 @@ def _cast_fp_tensor(dtype: torch.dtype, x: torch.Tensor) -> torch.Tensor:
    ):
        return x
    return x.to(dtype)


def _get_device_handle_from_device(device: Optional[torch.device] = None):
Collaborator

let us not use a default arg if we never expect to call this without passing a device

Comment on lines 157 to 158
    if device is None:
        device_type = "cuda" if torch.cuda.is_available() else "cpu"
Collaborator

when do we expect to pass device=None?
if never, let us get rid of this branch

I would even vote to just use _get_device_handle(device.type) directly instead of this function, because we do not really need this error checking on every call at runtime
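
A hedged sketch of that suggestion (call sites would then do something like):

    # Reviewer's suggestion, sketched: skip the wrapper and resolve the handle
    # directly from the device's type string at each call site.
    device_handle = _get_device_handle(device.type)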

Contributor Author

Below is the test that fails when FSDPParamGroup is created with device=None:

def test_dynamo_trace_use_training_state(self):
    torch._dynamo.reset()
    # Construct a dummy FSDPParamGroup, since we just want to test the use_training_state ctx manager.
    param_group = FSDPParamGroup(
        [],  # params: List[nn.Parameter],
        (torch.nn.Linear(1, 1),),  # module: Tuple[nn.Module, ...],
        None,  # mesh_info: FSDPMeshInfo,
        None,  # post_forward_mesh_info: Optional[FSDPMeshInfo],
        None,  # device: torch.device,
        None,  # mp_policy: MixedPrecisionPolicy,
        None,  # offload_policy: OffloadPolicy,
    )

Collaborator

Thanks! Can we try to pass a real device here?
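
A hedged sketch of that suggestion applied to the test above (the device choice is illustrative):

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    param_group = FSDPParamGroup(
        [],                        # params: List[nn.Parameter]
        (torch.nn.Linear(1, 1),),  # module: Tuple[nn.Module, ...]
        None,                      # mesh_info: FSDPMeshInfo
        None,                      # post_forward_mesh_info: Optional[FSDPMeshInfo]
        device,                    # device: torch.device (was None)
        None,                      # mp_policy: MixedPrecisionPolicy
        None,                      # offload_policy: OffloadPolicy
    )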

    device_handle = _get_device_handle(device_type)

    if device_handle is None:
        raise RuntimeError("InValid device handle for device type:", device)
Collaborator

if you decide to keep this

Suggested change:
-        raise RuntimeError("InValid device handle for device type:", device)
+        raise RuntimeError("Invalid device handle for device type:", device)

Contributor Author

will fix this

-        if not torch.cuda.is_available():
-            raise RuntimeError("FSDP requires CUDA for streams")
+    def lazy_init(self, device: torch.device):
+        if device is None:
Collaborator

when do we hit this branch? it mismatches the type annotation of device: torch.device

if we do not hit it, we should get rid of it

Contributor Author

If FSDPParamGroup is created with device as None, then device here is None. There were some tests where the device passed to FSDPParamGroup was None. If we allow None for the FSDPParamGroup torch.device, then we need the check. Do you think we should not allow None here?

Collaborator

ah makes sense! can we try to pass a proper device in those cases and disallow None here? sorry about this 😢
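
A minimal sketch of that direction, assuming callers always pass a concrete device (the body is illustrative, not the PR's final code):

    def lazy_init(self, device: torch.device):
        # Require a concrete device so the signature matches its annotation;
        # callers (including tests) are expected to pass a real device.
        assert device is not None, "lazy_init expects a concrete torch.device"
        device_handle = _get_device_handle(device.type)
        ...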

-    ) -> Tuple[torch.cuda.Stream, torch.cuda.Stream]:
+        self, async_op: bool, training_state: TrainingState, device: torch.device
+    ) -> Tuple[torch.Stream, torch.Stream]:
+        device_handle = _get_device_handle_from_device(device)
Collaborator

can we use self.device_handle?

Contributor Author

I will check and fix this

@jeejakp12
Contributor Author

Can we please try to minimize the code surface affected? For example, let us not add default args if not needed; let us not add None paths if not needed, etc.

I left some comments regarding these inline.

I will try to rework this and use _get_device_handle() directly.

@jeejakp12 jeejakp12 force-pushed the origin/jeeja_use_device_handle_for_stream_event branch 2 times, most recently from 0ab1669 to 1e00f21 Compare October 4, 2024 14:31
@jeejakp12
Contributor Author

@awgu I reworked this and used _get_device_handle() directly. Can you please help review?

Collaborator

@awgu awgu left a comment

LGTM! Thanks for working through this.

@jeejakp12 jeejakp12 force-pushed the origin/jeeja_use_device_handle_for_stream_event branch from 1e00f21 to 9afa427 Compare October 4, 2024 17:53
@awgu awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 4, 2024
@awgu
Collaborator

awgu commented Oct 4, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@awgu
Collaborator

awgu commented Oct 4, 2024

test failure looks real
will need to debug

@awgu
Collaborator

awgu commented Oct 4, 2024

This is weird. I patched your PR but cannot repro locally:

 pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k  test_to_float64_after_init  -s

The above passes for me. Maybe there was a regression in trunk.

Edit: ah, indeed there was

BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the viable/strict branch to avoid these failures

@awgu
Collaborator

awgu commented Oct 4, 2024

For pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, lf.linux.8xlarge.nvidia.gpu) (gh), the issue is that we did not previously assume that lazy init had to be called before running unshard. I will revisit this in a bit.

@awgu
Collaborator

awgu commented Oct 4, 2024

@jeejakp12 could you just do this for now to unblock:

def lazy_init(self):
        # Lazy init should be idempotent
+        if not hasattr(self.comm_ctx, "device_handle"):
+            self.comm_ctx.device_handle = _get_device_handle(self.device.type)

in _fsdp_param_group.py

currently FSDP2 support only CUDA, for other backends
that need to use FSDP2 it won’t work as stream and events
are based on CUDA. To support other backends, use
 _get_device_handle by device type to get the class and use this
for stream and events.

Signed-off-by: Jeeja <jeejakp@habana.ai>
@jeejakp12 jeejakp12 force-pushed the origin/jeeja_use_device_handle_for_stream_event branch from 9afa427 to 7750cda Compare October 5, 2024 15:18
@jeejakp12
Contributor Author

@jeejakp12 could you just do this for now to unblock:

def lazy_init(self):
        # Lazy init should be idempotent
+        if not hasattr(self.comm_ctx, "device_handle"):
+            self.comm_ctx.device_handle = _get_device_handle(self.device.type)

in _fsdp_param_group.py

@awgu I have pushed the above change. Thanks :-)

@cyyever
Collaborator

cyyever commented Oct 6, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here
