A quick fix for Stream operation errors on non-current device by mrshenli · Pull Request #15689 · pytorch/pytorch

Conversation

@mrshenli (Contributor) commented Jan 3, 2019

see #15682

This is a quick fix implementing the simpler solution suggested by @colesbury. As the benchmark below shows, it slows down Stream.query() by ~20%. I would be happy to pursue the more complex solution of implementing this in C++/ATen, but I would still vote for merging this quick fix first to get rid of the bug sooner.

Test: added.

FYI @jeffreyksmithjr
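For context, the gist of the fix is to run the stream operation with the stream's own device made current. Below is a minimal Python-level sketch of that pattern, applied from the outside as a workaround; safe_stream_query is a hypothetical helper for illustration, not the actual change in this PR:

import torch

def safe_stream_query(stream):
    # Hypothetical helper illustrating the device-guard pattern: temporarily
    # make the stream's device the current device so that the underlying
    # cudaStreamQuery call is issued in the correct device context.
    with torch.cuda.device(stream.device):
        return stream.query()

# Usage sketch: querying a stream that lives on a non-current device.
# s1 = torch.cuda.Stream(device='cuda:1')
# print(safe_stream_query(s1))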

Benchmark

now

In [1]: def f():
   ...:     d0 = torch.device('cuda:0')
   ...:     d1 = torch.device('cuda:1')
   ...:     with torch.cuda.device(d0):
   ...:         s0 = torch.cuda.current_stream()
   ...:     with torch.cuda.device(d1):
   ...:         s1 = torch.cuda.current_stream()
   ...:     s0.query()
   ...:     s1.query()

In [4]: %timeit f()
38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit f()
37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

before

In [4]: %timeit f()
28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit f()
35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@ezyang (Contributor) left a review comment:

Someone else will need to merge

@facebook-github-bot: @mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli (Contributor, Author) commented Jan 3, 2019

Hi @ezyang, thanks for reviewing the code.

Could you please click the rerun button on the failed test? It was aborted for an unknown reason, and I don't have write permission to restart it myself. Thanks.

@ezyang (Contributor) commented Jan 3, 2019

@mrshenli If you go to 'oss pytorch' you can add yourself as a member which should give you access. (In any case I restarted your job.)

@gchanan (Contributor) commented Jan 3, 2019

Doesn't this issue exist on a bunch of functions (not just query)? Is there some more holistic approach we could take here?

@mrshenli (Contributor, Author) commented Jan 3, 2019

@gchanan

I am thinking about implementing something like torch._C._cuda_queryStream and replacing the cudaStreamQuery invocations with that new implementation (not sure yet). I initially wanted to change cudaStreamQuery directly, but it is a CUDA API that we cannot modify (maybe we can talk to NVIDIA?).
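For illustration only, one "more holistic" direction at the Python level would be to apply the same device-guard pattern to every Stream method via a decorator. This is a sketch of the idea, not what this PR or the proposed C++ change does; the decorator name is made up:

import functools
import torch

def _with_stream_device(method):
    # Hypothetical decorator: run a Stream method with the stream's own
    # device made current, so query(), synchronize(), etc. all issue their
    # CUDA calls against the right device without repeating the guard in
    # each method body.
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with torch.cuda.device(self.device):
            return method(self, *args, **kwargs)
    return wrapper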

@mrshenli (Contributor, Author) commented Jan 3, 2019

Hmm, the new test keeps hitting an error in one of the CI jobs...

@mrshenli (Contributor, Author) commented Jan 3, 2019

test_streams_multi_gpu_query (test_cuda.TestCuda) keeps failing (with no error message) on rocmdeb, so I will skip the test on that platform.

@facebook-github-bot: @mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@mrshenli deleted the stream branch on January 17, 2019.
@ezyang added the merged label on June 25, 2019.