Update real device in FSDP state_dict_utils by ankurneog · Pull Request #134994 · pytorch/pytorch · GitHub

Conversation

@ankurneog

@ankurneog ankurneog commented Sep 3, 2024

Motivation

The device reported by tensor.device for both sharded and non-sharded tensors is assumed to be cuda by default. As a result, the FSDP unit tests fail with the errors below on builds without CUDA. This change derives the actual device type from the tensor that was created.

[rank3]   File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3]     sharded_tensor_sd = ref_model.state_dict()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3]     hook_result = hook(self, destination, prefix, local_metadata)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]     return func(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3]     tensor.device,
[rank3]   File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3]     return arg(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3]     return dispatch(st_instance, func)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3]     return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3]     return wrapped_func(types, args, kwargs, process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3]     dev = torch.device(torch.cuda.current_device())
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3]     _lazy_init()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3]     raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled 
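
The failing frame is dev = torch.device(torch.cuda.current_device()) inside the sharded tensor_device op, which assumes a CUDA backend even when the shards actually live on CPU or HPU. Below is a minimal, hypothetical sketch of the idea behind the fix (the helper name and the CPU fallback are assumptions for illustration, not the actual PR diff): resolve the device from the tensor that was materialized rather than from the CUDA runtime.

```python
# Hypothetical sketch (not the actual PR diff): derive the real device from the
# tensor that was actually created instead of assuming torch.cuda.current_device().
import torch
from torch.distributed._shard.sharded_tensor import ShardedTensor
from torch.distributed._tensor import DTensor


def _infer_device(tensor: torch.Tensor) -> torch.device:
    """Best-effort device lookup that works on CPU, CUDA, HPU, etc.

    ShardedTensor.device dispatches to a sharded op that assumes CUDA, so we
    inspect a local shard instead; DTensor and plain tensors already report
    the backing device through .to_local() / .device.
    """
    if isinstance(tensor, ShardedTensor):
        local_shards = tensor.local_shards()
        if local_shards:
            return local_shards[0].tensor.device
        return torch.device("cpu")  # no local shard on this rank; fall back to CPU
    if isinstance(tensor, DTensor):
        return tensor.to_local().device
    return tensor.device
```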

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Sep 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134994

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit c8fc022 with merge base c977bb7:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels Sep 3, 2024
@fegin
Contributor

fegin commented Sep 3, 2024

@ankurneog Thanks for the PR. Curious about the use case; my understanding is that FSDP cannot be used in a CPU-only environment?

@colesbury colesbury requested a review from wanchaol September 3, 2024 18:58
@colesbury colesbury added the triaged label Sep 3, 2024
@ankurneog
Author

@ankurneog Thanks for the PR. Curious about the use case; my understanding is that FSDP cannot be used in a CPU-only environment?

@fegin: This is for the Intel Gaudi / HPU device.
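
For context, here is a hedged sketch of what FSDP usage on a non-CUDA accelerator such as HPU can look like; the hccl backend, the habana_frameworks imports, and the exact device_id argument are assumptions about a typical Gaudi setup, not something specified in this PR. The point is only that the process group and tensors never touch CUDA, so any device query inside FSDP has to come from the tensors themselves.

```python
# Hypothetical HPU example (assumes the Habana PyTorch bridge is installed).
import torch
import torch.distributed as dist
import habana_frameworks.torch.core  # noqa: F401  registers the "hpu" device (assumption)
import habana_frameworks.torch.distributed.hccl  # noqa: F401  registers the "hccl" backend (assumption)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="hccl")  # Habana collective-communication backend

model = torch.nn.Linear(16, 16).to("hpu")
fsdp_model = FSDP(model, device_id=torch.device("hpu"))

# Saving the state dict used to raise "Torch not compiled with CUDA enabled"
# because the device lookup assumed CUDA; with this change it reports hpu.
state_dict = fsdp_model.state_dict()
```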

@ankurneog ankurneog force-pushed the fsdp_state_dict_device branch 2 times, most recently from b246e4e to c05d748 Compare September 4, 2024 16:08
@ankurneog
Author

@hippocookie: Can you please help with the approval? Thanks.

@zeshengzong
Contributor

@hippocookie: Can you please help with the approval? Thanks.

Sorry, I don't have permission to do that; this needs help from @fegin :D

@ankurneog ankurneog force-pushed the fsdp_state_dict_device branch from c05d748 to c824c7d Compare September 6, 2024 03:51
@ankurneog
Author

@fegin: Could you please help with the approval? Thanks.

@fegin fegin added the ciflow/trunk and ciflow/periodic labels Sep 9, 2024
@fegin
Contributor

fegin commented Sep 9, 2024

Let me initiate the tests and see if this PR breaks any existing tests before stamping it.

@ankurneog
Author

@fegin: Gentle reminder, could you please help with the approval? Thanks.

@fegin
Contributor

fegin commented Sep 13, 2024

Can you rebase and resubmit? There is too much noise in the CI. Thanks!

@wz337
Contributor

wz337 commented Sep 14, 2024

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fsdp_state_dict_device onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_state_dict_device && git pull --rebase)

@fegin
Contributor

fegin commented Sep 16, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, SIMULATOR, arm64, 1, 0, 1)

Details for Dev Infra team (raised by workflow job)

@ankurneog
Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fsdp_state_dict_device onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_state_dict_device && git pull --rebase)

@ankurneog
Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, SIMULATOR, arm64, 1, 0, 1)

Details for Dev Infra team (raised by workflow job)

@ankurneog
Author

@fegin: I believe the failures are not related to my change. Could you please help with the merge?

@fegin
Contributor

fegin commented Sep 17, 2024

@pytorchbot merge -f "The failing test is not related."

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024

Pull Request resolved: pytorch#134994
Approved by: https://github.com/fegin
aostrowski-hbn pushed a commit to HabanaAI/pytorch-fork that referenced this pull request Jan 7, 2025
Access the tensor.device variable in the right format from ShardedTensor, DTensor, and Tensor.

PR : pytorch#134994

Change-Id: Id1ae919b8cd902899386ee756af680b872fee8c9

Labels

ciflow/periodic - Trigger jobs ran periodically on master (periodic.yml) on the PR
ciflow/trunk - Trigger trunk jobs on your pull request
Merged
oncall: distributed - Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (fsdp) - release notes category
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
