Always synchronize on src and dst current streams when copying tensors by mrshenli · Pull Request #16966 · pytorch/pytorch · GitHub

Conversation

@mrshenli (Contributor) commented Feb 11, 2019

fixes #15568

    x1 = torch.zeros(5, 5, device=d1)
    _test_copy(self, x0, x1, torch.zeros(5, 5, device=d1))

    x2 = torch.zeros(5, 5, device=d0)
    _test_copy(self, x0, x2, torch.ones(5, 5, device=d1))
@mrshenli (Contributor, Author) commented:

@colesbury

This behavior is expected (as we do not sync if src and dst are the same), but looks weird to me. Should we do the sync using dst default stream regardless of whether src and dst are different?

Member

I don't understand. Which part is weird or unexpected here?

Contributor Author

These four lines both copy a zero tensor to x0, but the result differs depending on whether the src and dst are on the same device.

    x1 = torch.zeros(5, 5, device=d1)
    _test_copy(self, x0, x1, torch.zeros(5, 5, device=d1))

    x2 = torch.zeros(5, 5, device=d0)
    _test_copy(self, x0, x2, torch.ones(5, 5, device=d1))

Member

To make the multi-device case like the single-device case you would have to set the streams for both d0 and d1 (and synchronize with the current stream instead of default stream). Something like:

    def _test_copy(self, x, y, output):
        x_plus_one = x + 1

        s0 = torch.cuda.Stream()
        s1 = torch.cuda.Stream()
        s2 = torch.cuda.Stream(device=y.device)
        s3 = torch.cuda.Stream(device=y.device)

        with torch.cuda.stream(s2), torch.cuda.stream(s0):
            torch.cuda._sleep(50000000)
            y.copy_(x_plus_one)

        with torch.cuda.stream(s3), torch.cuda.stream(s1):
            y.copy_(x)

        s0.synchronize()
        s1.synchronize()

@colesbury
Member

The copy is still synchronizing on the default stream on the destination. It should be synchronizing on the current stream on the destination. That way it appears as-if the copy takes place in both the streams of the source and destination even though the kernel only runs on the current stream for the source's device.

dst_ready.record(getDefaultCUDAStream(dst_device.index()));

and

src_ready.block(getDefaultCUDAStream(dst_device.index()));
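The record/block pattern being discussed — record an event on the stream that produces the data, then make the consuming stream block on that event — can be mimicked CPU-side. Below is a minimal sketch using plain Python threading as a stand-in for CUDA streams and events (FakeEvent and the two "stream" functions are hypothetical illustrations, not the ATen API): the destination "stream" only proceeds once the source "stream" has recorded readiness, regardless of scheduling order.

```python
import threading

class FakeEvent:
    """Stand-in for a CUDA event: record() marks the event as having
    fired, block() makes the caller wait until it has been recorded."""
    def __init__(self):
        self._ev = threading.Event()

    def record(self):
        # Corresponds to src_ready.record(currentStream(src_device))
        self._ev.set()

    def block(self):
        # Corresponds to src_ready.block(currentStream(dst_device))
        self._ev.wait()

results = []
src_ready = FakeEvent()

def src_stream():
    results.append("src: produce data")
    src_ready.record()   # signal: source data is ready

def dst_stream():
    src_ready.block()    # dst's current "stream" waits for the signal
    results.append("dst: consume data")

# Start the consumer first to show ordering is enforced by the event,
# not by launch order.
t_dst = threading.Thread(target=dst_stream)
t_src = threading.Thread(target=src_stream)
t_dst.start()
t_src.start()
t_dst.join()
t_src.join()
```

The point of the suggested change is which stream plays the role of `dst_stream` here: the destination's *current* stream rather than its default stream, so the copy appears ordered with respect to whatever work the caller has queued on that stream.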

@mrshenli
Contributor Author

rerun tests after merging #17439

@mrshenli mrshenli changed the title [WIP] Always synchronize src and dst streams when copying tensors Always synchronize src and dst streams when copying tensors Feb 25, 2019
@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli mrshenli changed the title Always synchronize src and dst streams when copying tensors Always synchronize on src and dst current streams when copying tensors Feb 25, 2019
zdevito pushed a commit to zdevito/ATen that referenced this pull request Feb 28, 2019
Summary:
fixes #15568
Pull Request resolved: pytorch/pytorch#16966

Differential Revision: D14213144

Pulled By: mrshenli

fbshipit-source-id: 2fcf5e07895fde80b4aee72e2736b0def876d21f

Development

Successfully merging this pull request may close these issues.

*Possible* bug in P2P (device-to-device) copy overlapped with work on GPU

4 participants