Description
Platforms: linux
Broken on multigpu
To re-enable on your PR, put Fixes #<this issue number> in the PR body and add the ciflow/periodic tag to trigger multigpu.
Probably caused by #113547 or something in its stack. @wanchaol, do you mind providing a forward fix?
First known bad: https://hud.pytorch.org/pytorch/pytorch/commit/93372455a73043332c16a71cb9dccdf3e0412a57
Last known good: https://hud.pytorch.org/pytorch/pytorch/commit/a1e3c501652101e8b37baac62216db7ca22c9923
Ex. https://github.com/pytorch/pytorch/actions/runs/6863856295/job/18665805628
_______________ TestDTensorCompileE2E.test_2d_fsdp_tp_ac_compile _______________
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 542, in wrapper
self._join_processes(fn)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 761, in _join_processes
self._check_return_codes(elapsed_time)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 811, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
getattr(self, test_name)()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
fn()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2575, in wrapper
method(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 193, in wrapper
func(self, *args, **kwargs) # type: ignore[misc]
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
return func(*args, **kwargs)
File "/var/lib/jenkins/workspace/test/distributed/_tensor/test_dtensor_compile.py", line 328, in test_2d_fsdp_tp_ac_compile
compiled_output = compiled_2d(inp)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 408, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 840, in forward
args, kwargs = _pre_forward(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 412, in _pre_forward
unshard_fn(state, handle)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 447, in _pre_forward_unshard
_unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 331, in _unshard
handle.unshard()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1272, in unshard
self._use_unsharded_flat_param(padded_unsharded_flat_param)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1404, in _use_unsharded_flat_param
self._use_unsharded_views(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1847, in _use_unsharded_views
views = self._get_unflat_views()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1824, in _get_unflat_views_aligned
_ext_post_unflatten_transform(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_fsdp_extensions.py", line 113, in _ext_post_unflatten_transform
return fsdp_extension.post_unflatten_transform(tensor, param_extension)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/fsdp.py", line 334, in post_unflatten_transform
result = _unflatten_tensor(tensor, param_extension)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 569, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 671, in _convert_frame
result = inner_convert(frame, cache_entry, hooks, frame_state)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 377, in _convert_frame_assert
return _compile(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 614, in _compile
raise InternalTorchDynamoError(str(e)).with_traceback(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 595, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 243, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 512, in compile_inner
out_code = transform_code_object(code, transform)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
transformations(instructions, code_options)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 150, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 477, in transform
tracer.run()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2120, in run
super().run()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 815, in run
and self.step()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 778, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
return inner_fn(self, inst)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1259, in CALL_FUNCTION_KW
self.call_function(fn, args, kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 650, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py", line 572, in call_function
kwargs_as_value = {k: v.as_python_constant() for k, v in kwargs.items()}
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py", line 572, in <dictcomp>
kwargs_as_value = {k: v.as_python_constant() for k, v in kwargs.items()}
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/lists.py", line 66, in as_python_constant
return self.python_type()([x.as_python_constant() for x in self.items])
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/lists.py", line 66, in <listcomp>
return self.python_type()([x.as_python_constant() for x in self.items])
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/base.py", line 238, in as_python_constant
raise NotImplementedError(f"{self} is not a constant")
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant
from user code:
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 18, in _unflatten_tensor
result = DistributedTensor.from_local(
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
To execute this test, run the following from the base repo dir:
python test/distributed/_tensor/test_dtensor_compile.py -k test_2d_fsdp_tp_ac_compile
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------- Captured stdout call -----------------------------
Process 2 terminated with exit code 10, terminating remaining processes.
------------------------------ Captured log call -------------------------------
INFO numba.cuda.cudadrv.driver:driver.py:245 init
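For readers skimming the traceback, the last few frames reduce to a simple pattern: Dynamo tries to turn every keyword argument of the traced call into a Python constant, one of those arguments is a sequence, and one element of that sequence is symbolic. The following is a toy, pure-Python model of that pattern only. The classes are simplified stand-ins that borrow the names seen in the traceback, not the real torch._dynamo classes, and the kwarg name `shape` is an assumption chosen for illustration.

```python
# Toy model of the failing pattern; NOT the real torch._dynamo implementation.
# Class names mirror the traceback, but every class here is a simplified stand-in.


class VariableTracker:
    """Stand-in base class: by default a tracked value is not a constant."""

    def as_python_constant(self):
        raise NotImplementedError(f"{self} is not a constant")

    def __repr__(self):
        return f"{type(self).__name__}()"


class ConstantVariable(VariableTracker):
    """Stand-in for a value known at trace time."""

    def __init__(self, value):
        self.value = value

    def as_python_constant(self):
        return self.value


class SymNodeVariable(VariableTracker):
    """Stand-in for a symbolic (dynamic) integer; inherits the failing default."""


class TupleVariable(VariableTracker):
    """Stand-in for a tracked sequence; constant only if every item is."""

    def __init__(self, items):
        self.items = items

    def as_python_constant(self):
        return tuple(x.as_python_constant() for x in self.items)


def kwargs_as_constants(kwargs):
    # Same shape as the dict comprehension shown in the traceback.
    return {k: v.as_python_constant() for k, v in kwargs.items()}


# All-constant sequence: conversion succeeds.
static = {"shape": TupleVariable([ConstantVariable(4), ConstantVariable(8)])}
print(kwargs_as_constants(static))

# One symbolic element: conversion raises, as in the CI log.
dynamic = {"shape": TupleVariable([SymNodeVariable(), ConstantVariable(8)])}
try:
    kwargs_as_constants(dynamic)
except NotImplementedError as exc:
    print(f"failed: {exc}")
```

In this model a single symbolic element is enough to make the whole kwarg non-constant, which matches the error message in the log. Which argument of DistributedTensor.from_local is symbolic in the real failure is not stated in the traceback.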
This test was disabled because it is failing on main branch (recent examples).
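To dig into the failure locally, the log's own suggestions (the repro command plus TORCH_LOGS and TORCHDYNAMO_VERBOSE) can be combined in a small launcher. This is a minimal sketch, assuming it is run from the base repo dir on a machine with enough GPUs for the multigpu test. The helper names `build_repro` and `run_repro` are made up for this example.

```python
import os
import subprocess
import sys

# Paths and names copied from the failure message in the CI log.
TEST_FILE = "test/distributed/_tensor/test_dtensor_compile.py"
TEST_NAME = "test_2d_fsdp_tp_ac_compile"


def build_repro(base_env=None):
    """Return (command, environment) for a verbose local repro.

    base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Both settings are the ones the error message recommends.
    env["TORCH_LOGS"] = "+dynamo"
    env["TORCHDYNAMO_VERBOSE"] = "1"
    cmd = [sys.executable, TEST_FILE, "-k", TEST_NAME]
    return cmd, env


def run_repro():
    """Run the test in a subprocess and return its exit code."""
    cmd, env = build_repro()
    return subprocess.call(cmd, env=env)
```

Calling `run_repro()` from the repo root runs only the failing test with the extra Dynamo logging enabled. `build_repro()` is kept separate so the command and environment can be inspected without launching anything.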
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519