wrap cudaStreamSynchronize calls by ngimel · Pull Request #61889 · pytorch/pytorch · GitHub

Conversation

@ngimel (Collaborator) commented Jul 20, 2021

This is a first step towards creating a context manager that errors out on synchronizing calls.
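For illustration, a hedged sketch of the kind of usage such a context manager would enable. It assumes a CUDA build of PyTorch that ships torch.cuda.set_sync_debug_mode / get_sync_debug_mode (these landed in later releases); the error_on_cuda_sync wrapper below is illustrative only, not an existing PyTorch API.

```python
import contextlib
import torch

@contextlib.contextmanager
def error_on_cuda_sync():
    # Illustrative wrapper: make synchronizing CUDA calls raise inside the block.
    prev = torch.cuda.get_sync_debug_mode()
    torch.cuda.set_sync_debug_mode("error")
    try:
        yield
    finally:
        torch.cuda.set_sync_debug_mode(prev)

x = torch.randn(1000, device="cuda")
with error_on_cuda_sync():
    y = x * 2        # fine: purely asynchronous kernel launch
    # x.nonzero()    # would raise: nonzero must synchronize to learn its output size
```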

@facebook-github-bot added the oncall: jit (add this issue/PR to JIT oncall triage queue) and cla signed labels on Jul 20, 2021
@facebook-github-bot (Contributor) commented Jul 20, 2021

💊 CI failures summary and remediations

As of commit 5febc92 (more details on the Dr. CI page and at hud.pytorch.org/pr/61889):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 21 20:41:15 AssertionError: False is not tr...ot sizes torch.Size([5, 5, 5]) and torch.Size([]).
Jul 21 20:41:15   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 780, in test_wrapper
Jul 21 20:41:15     return test(*args, **kwargs)
Jul 21 20:41:15   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 411, in test_reference_eager
Jul 21 20:41:15     self.compare_with_eager_reference(op, sample_input)
Jul 21 20:41:15   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 402, in compare_with_eager_reference
Jul 21 20:41:15     self.assertEqual(actual, expected, exact_dtype=True, exact_device=False)
Jul 21 20:41:15   File "/var/lib/jenkins/workspace/xla/test/pytorch_test_base.py", line 608, in assertEqual
Jul 21 20:41:15     return DeviceTypeTestBase.assertEqual(self, x, y, *args, **kwargs)
Jul 21 20:41:15   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1524, in assertEqual
Jul 21 20:41:15     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jul 21 20:41:15 AssertionError: False is not true : Tensors failed to compare as equal!Attempted to compare equality of tensors with different sizes. Got sizes torch.Size([5, 5, 5]) and torch.Size([]).
Jul 21 20:41:15 
Jul 21 20:41:15 ----------------------------------------------------------------------
Jul 21 20:41:15 Ran 347 tests in 326.417s
Jul 21 20:41:15 
Jul 21 20:41:15 FAILED (failures=4)
Jul 21 20:41:15 
Jul 21 20:41:15 Generating XML reports...
Jul 21 20:41:16 + cleanup
Jul 21 20:41:16 + retcode=1
Jul 21 20:41:16 + set +x

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with a title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Preview docs built from this PR

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@ngimel ngimel requested a review from ezyang July 20, 2021 16:25
@ezyang (Contributor) left a comment


Handling the ifdef in one place is a nice bonus. Will you add a lint to flag bare occurrences of these calls?

@ngimel (Collaborator, Author) commented Jul 20, 2021

Yes, will add a lint!
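For context, a hypothetical sketch of such a lint (not the check that actually landed in pytorch/pytorch): scan C++/CUDA sources and flag bare cudaStreamSynchronize calls outside the file that hosts the wrapper. The ALLOWED_FILES entry is an assumption about where the wrapper lives.

```python
import pathlib
import re
import sys

PATTERN = re.compile(r"\bcudaStreamSynchronize\s*\(")
ALLOWED_FILES = {"CUDAFunctions.cpp"}  # assumed home of the wrapper; adjust as needed

def main(root: str) -> int:
    hits = []
    # Walk .c, .cc, .cpp, .cu, ... files under the given root.
    for path in pathlib.Path(root).rglob("*.c*"):
        if not path.is_file() or path.name in ALLOWED_FILES:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if PATTERN.search(line):
                hits.append(f"{path}:{lineno}: bare cudaStreamSynchronize call; use the wrapper instead")
    print("\n".join(hits))
    return 1 if hits else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "."))
```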

@facebook-github-bot (Contributor)

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)
@ngimel merged this pull request in 6284d2a.

@byronyi commented Jul 27, 2021

@ngimel Any chance cudaStreamSynchronize could be completely removed from at::nonzero? We found it quite difficult to support on accelerators, including NVIDIA GPUs.

Also cc @ezyang @ailzhang @asuhan @JackCaoG: we first identified this issue when supporting detection/segmentation models in PyTorch XLA, but then found that it mainly stems from a limitation in PyTorch core (tensor shapes must be concrete values).

@ezyang (Contributor) commented Jul 27, 2021

Given the existing semantics of the operation, removing the synchronization is not possible.

What we should do, however, is support JAX's extension to the nonzero API https://jax.readthedocs.io/en/latest/_autosummary/jax.numpy.nonzero.html, where an explicit size can be specified, giving an upper bound on the number of nonzero entries that will be returned (zero-padded if there are fewer). Can you file an issue for this?
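For reference, a minimal example of the JAX API linked above: passing size= gives nonzero a static output shape, so the caller never needs a sync to learn the count, and unused slots are padded (with 0 by default).

```python
import jax.numpy as jnp

x = jnp.array([0, 3, 0, 7, 0])
idx, = jnp.nonzero(x, size=4)  # at most 4 indices; extra slots are zero-padded
print(idx)                     # [1 3 0 0] -- two real hits, two padded entries
```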

@ngimel (Collaborator, Author) commented Jul 27, 2021

We would need this extension not only for nonzero but also for indexing ops with a mask; the most common situation where people encounter this particular sync is out = x[mask].
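For illustration, a small sketch (hypothetical tensors) of the sync being described: boolean-mask indexing routes through nonzero, so the host has to read back how many elements are True before it can allocate the output.

```python
import torch

x = torch.randn(1000, device="cuda")
mask = x > 0   # computed asynchronously on the GPU
out = x[mask]  # output size depends on mask.sum() -> forces a device-to-host sync
```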

@byronyi commented Jul 28, 2021

> Given the existing semantics of the operation, removing the synchronization is not possible.
>
> What we should do, however, is support JAX's extension to the nonzero API https://jax.readthedocs.io/en/latest/_autosummary/jax.numpy.nonzero.html, where an explicit size can be specified, giving an upper bound on the number of nonzero entries that will be returned (zero-padded if there are fewer). Can you file an issue for this?

Raised in #62320

@ngimel ngimel deleted the ngimel/wrap_cuda_calls branch December 26, 2021 06:44
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
This is a first step towards creating a context manager that errors out on synchronizing calls.

Pull Request resolved: pytorch/pytorch#61889

Reviewed By: albanD

Differential Revision: D29805280

Pulled By: ngimel

fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022

Labels

cla signed · Merged · oncall: jit


4 participants