
DISABLED test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E) #113781

@clee2000

Description

Platforms: linux

Broken on multigpu

To re-enable this test on your PR, put "Fixes #113781" in the PR body and add the ciflow/periodic tag to trigger the multigpu jobs.

This was probably caused by #113547 or something else in its stack. @wanchaol, do you mind providing a forward fix?

First known bad: https://hud.pytorch.org/pytorch/pytorch/commit/93372455a73043332c16a71cb9dccdf3e0412a57
Last known good: https://hud.pytorch.org/pytorch/pytorch/commit/a1e3c501652101e8b37baac62216db7ca22c9923

Example failure: https://github.com/pytorch/pytorch/actions/runs/6863856295/job/18665805628

_______________ TestDTensorCompileE2E.test_2d_fsdp_tp_ac_compile _______________
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 542, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 761, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 811, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
    fn()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2575, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 193, in wrapper
    func(self, *args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
    return func(*args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/_tensor/test_dtensor_compile.py", line 328, in test_2d_fsdp_tp_ac_compile
    compiled_output = compiled_2d(inp)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 408, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 840, in forward
    args, kwargs = _pre_forward(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 412, in _pre_forward
    unshard_fn(state, handle)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 447, in _pre_forward_unshard
    _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 331, in _unshard
    handle.unshard()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1272, in unshard
    self._use_unsharded_flat_param(padded_unsharded_flat_param)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1404, in _use_unsharded_flat_param
    self._use_unsharded_views(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1847, in _use_unsharded_views
    views = self._get_unflat_views()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1824, in _get_unflat_views_aligned
    _ext_post_unflatten_transform(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/_fsdp_extensions.py", line 113, in _ext_post_unflatten_transform
    return fsdp_extension.post_unflatten_transform(tensor, param_extension)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/fsdp.py", line 334, in post_unflatten_transform
    result = _unflatten_tensor(tensor, param_extension)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 569, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 671, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 377, in _convert_frame_assert
    return _compile(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 614, in _compile
    raise InternalTorchDynamoError(str(e)).with_traceback(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 595, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 243, in time_wrapper
    r = func(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 512, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 150, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 477, in transform
    tracer.run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2120, in run
    super().run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 815, in run
    and self.step()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 778, in step
    getattr(self, inst.opname)(inst)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1259, in CALL_FUNCTION_KW
    self.call_function(fn, args, kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 650, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py", line 572, in call_function
    kwargs_as_value = {k: v.as_python_constant() for k, v in kwargs.items()}
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py", line 572, in <dictcomp>
    kwargs_as_value = {k: v.as_python_constant() for k, v in kwargs.items()}
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/lists.py", line 66, in as_python_constant
    return self.python_type()([x.as_python_constant() for x in self.items])
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/lists.py", line 66, in <listcomp>
    return self.python_type()([x.as_python_constant() for x in self.items])
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/base.py", line 238, in as_python_constant
    raise NotImplementedError(f"{self} is not a constant")
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant

from user code:
   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 18, in _unflatten_tensor
    result = DistributedTensor.from_local(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


To execute this test, run the following from the base repo dir:
     python test/distributed/_tensor/test_dtensor_compile.py -k test_2d_fsdp_tp_ac_compile

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
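Putting the pieces above together, the failure can be reproduced locally with verbose Dynamo logging enabled. This is just the logging flags from the error message combined with the repro command given in this report; it requires a multi-GPU machine and a PyTorch build from the affected commit range.

```shell
# Run from the base repo dir on a multi-GPU machine.
# TORCH_LOGS / TORCHDYNAMO_VERBOSE are the flags suggested in the error message.
TORCH_LOGS="+dynamo" TORCHDYNAMO_VERBOSE=1 \
    python test/distributed/_tensor/test_dtensor_compile.py -k test_2d_fsdp_tp_ac_compile
```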


----------------------------- Captured stdout call -----------------------------
Process 2 terminated with exit code 10, terminating remaining processes.
------------------------------ Captured log call -------------------------------
INFO     numba.cuda.cudadrv.driver:driver.py:245 init

This test was disabled because it is failing on the main branch (recent examples).

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519

Metadata


Assignees

No one assigned

    Labels

    oncall: distributed (Add this issue/PR to distributed oncall triage queue)
    oncall: pt2
    skipped (Denotes a (flaky) test currently skipped in CI.)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.)
