MAINT Migrates rrelu_with_noise from THC to ATen on Cuda by thomasjpfan · Pull Request #57864 · pytorch/pytorch · GitHub

Conversation

@thomasjpfan
Contributor

Fixes #24618
Related to #24507

Benchmark script:
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    torch.cuda.synchronize()
    return time.time()

device = "cuda"
m = nn.RReLU().cuda()

for n in [100, 10_000, 100_000]:
    fwd_t = 0
    bwd_t = 0  # placeholder; the backward pass is not timed in this run
    input = torch.randn(128, n, device=device)
    grad_output = torch.ones(128, n, device=device)  # unused here; kept for a backward benchmark
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / 10000 * 1000
    print(f"input size(128, {n}) forward time is {fwd_avg:.2f} (ms)")

Results from benchmark:

This PR

input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.06 (ms)
input size(128, 100000) forward time is 0.54 (ms)

On master

input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.08 (ms)
input size(128, 100000) forward time is 0.66 (ms)

@facebook-github-bot
Contributor

facebook-github-bot commented May 7, 2021

💊 CI failures summary and remediations

As of commit 88bdcd7 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@ngimel ngimel self-requested a review May 7, 2021 22:47
@ngimel ngimel added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 7, 2021
@ezyang ezyang removed their request for review May 10, 2021 15:24
@ezyang
Contributor

ezyang commented May 10, 2021

removed myself

@thomasjpfan thomasjpfan added the module: nn Related to torch.nn label May 11, 2021
inline scalar_t __device__ curand_uniform_type(curandStatePhilox4_32_10_t *state);

template <>
inline THHalf __device__ curand_uniform_type<THHalf>(curandStatePhilox4_32_10_t *state) {

Don't use legacy THHalf type, use at::Half instead. There are implicit conversions between at::Half and float, so ScalarConvert is not necessary
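A minimal sketch of what that change could look like (assuming the curand_uniform_type template quoted above; not the exact PR code):

template <>
inline at::Half __device__ curand_uniform_type<at::Half>(curandStatePhilox4_32_10_t *state) {
  auto rand = curand_uniform4(state);
  return rand.x;  // implicit float -> at::Half conversion, no ScalarConvert needed
}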

template <>
inline THHalf __device__ curand_uniform_type<THHalf>(curandStatePhilox4_32_10_t *state) {
auto rand = curand_uniform4(state);
return ScalarConvert<float, THHalf>::to(rand.x);

using only .x out of 4 generated numbers is wasteful, you can have an unroll loop in the kernel that would use all the values, you can take a look e.g. at the non-vectorized fused_dropout_kernel in Dropout.cu
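A rough sketch of that unrolling pattern, loosely modeled on the non-vectorized fused_dropout_kernel (variable names such as rounded_size, numel, and the grid geometry are assumptions, not the exact PR code):

constexpr int unroll_factor = 4;                 // curand_uniform4 yields four floats
int full_stride = blockDim.x * gridDim.x;
for (int linear_index = idx; linear_index < rounded_size;
     linear_index += full_stride * unroll_factor) {
  float4 rand = curand_uniform4(&state);
  #pragma unroll
  for (int ii = 0; ii < unroll_factor; ii++) {
    int li = linear_index + ii * full_stride;
    if (li >= numel) {
      continue;
    }
    auto r = static_cast<scalar_t>((&rand.x)[ii] * (upper - lower) + lower);
    if (input[li] <= 0) {
      output[li] = input[li] * r;
      noise[li] = r;
    } else {
      output[li] = input[li];
      noise[li] = scalar_t(1);
    }
  }
}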

if (input[i] <= 0)
{
scalar_t r = curand_uniform_type<scalar_t>(&state);
r = ScalarConvert<double, scalar_t>::to(r * (b - a) + a);

having double is usually perf penalty, it should be scalar_t or at most accscalar_t
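i.e., something along these lines (a sketch; a and b keep their meaning from the quoted hunk):

scalar_t r = curand_uniform_type<scalar_t>(&state);
r = r * static_cast<scalar_t>(b - a) + static_cast<scalar_t>(a);  // stays in scalar_t, no double round-trip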

else
{
output[i] = input[i];
noise[i] = ScalarConvert<int, scalar_t>::to(1);

No ScalarConvert please
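i.e., a plain assignment is enough here (sketch):

noise[i] = scalar_t(1);  // implicit conversion, no ScalarConvert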


CUDA_KERNEL_LOOP(i, n)
{
if (input[i] <= 0)

to avoid warp divergence, you should generate randoms for every input, and then only diverge on fast operations like computing output and noise
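A minimal sketch of that divergence-free structure (variable names assumed, not the exact PR code): the random number is drawn unconditionally, and only the cheap selects depend on the sign of the input.

scalar_t r = curand_uniform_type<scalar_t>(&state);
r = r * static_cast<scalar_t>(upper - lower) + static_cast<scalar_t>(lower);
bool negative = input[i] <= 0;
output[i] = negative ? input[i] * r : input[i];
noise[i] = negative ? r : scalar_t(1);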

@thomasjpfan thomasjpfan force-pushed the rrelu_inplace_aten_migrate branch from c04f62b to bdd6e87 Compare June 11, 2021 17:01
@thomasjpfan
Contributor Author

Thank you for the review @ngimel! I updated the PR to use unrolling. I ran the following benchmark:

Benchmark script:
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    torch.cuda.synchronize()
    return time.time()

device = "cuda"
m = nn.RReLU().cuda()
n_runs = 1_000

for n in [10_000, 100_000, 1_000_000]:
    fwd_t = 0
    bwd_t = 0  # placeholder; the backward pass is not timed in this run
    input = torch.randn(128, n, device=device)
    grad_output = torch.ones(128, n, device=device)  # unused here; kept for a backward benchmark
    for i in range(n_runs):
        t1 = _time()
        output = m(input)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / n_runs * 1000
    print(f"input size(128, {n}) forward time is {fwd_avg:.2f} (ms)")

Results from benchmark:

This PR

input size(128, 10000) forward time is 0.06 (ms)
input size(128, 100000) forward time is 0.43 (ms)
input size(128, 1000000) forward time is 4.17 (ms)

On master

input size(128, 10000) forward time is 0.09 (ms)
input size(128, 100000) forward time is 0.69 (ms)
input size(128, 1000000) forward time is 6.66 (ms)


@ngimel ngimel left a comment


Thanks, this looks good, I left minor comments.

double range = upper - lower;

for (int linear_index = idx; linear_index < rounded_size; linear_index += grid_stride) {
auto rand = random_func(&state);

Can you please add static assert here that sizeof(rand)/sizeof(rand.x) == unroll_factor? Otherwise your (&rand.x)[ii] access is unsafe.

checkAllSameGPU("rrelu_with_noise_out_cuda", {self_arg, noise_arg, output_arg});

auto input = self.contiguous();
auto noise_ = noise.contiguous();

rrelu_with_noise_out_cuda is a user facing function, which means that output can also be discontiguous here.
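One possible way to handle that (a sketch, not necessarily what the PR ends up doing): run the kernel on a contiguous buffer and copy back into the user-provided output if it is strided.

auto output_ = output.contiguous();
// ... launch the kernel writing into output_ ...
if (!output.is_contiguous()) {
  output.copy_(output_);
}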

output, input, noise_, lower, upper, generator);
});
} else {
auto lower_tensor = scalar_to_tensor(lower);

you don't need to convert the Scalar to a tensor here; instead, convert it to a regular type (using .to<double>()) and convert negative_slope back to a Scalar
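Roughly like this (a sketch, assuming this is the non-training branch that falls through to leaky_relu; variable names are illustrative):

double lower_ = lower.to<double>();
double upper_ = upper.to<double>();
Scalar negative_slope = (lower_ + upper_) / 2.0;
// ... dispatch to the leaky_relu path with negative_slope ...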

auto rand = random_func(&state);

// ensure that (&rand.x)[ii] is safe
CUDA_KERNEL_ASSERT(sizeof(rand)/sizeof(rand.x) == unroll_factor);

it should be static_assert (to be done at compile time), not runtime assert.
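i.e. (sketch), move the check to compile time:

static_assert(sizeof(rand) / sizeof(rand.x) == unroll_factor,
              "random_func must return exactly unroll_factor values");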

@ngimel
Collaborator

ngimel commented Jun 16, 2021

Can you please try rebasing, to get CI signal?

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in a0ad4c2.


Labels

cla signed · Merged · module: nn (Related to torch.nn) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

Migrate rrelu_with_noise and rrelu_with_noise_ from the TH to Aten (CUDA)

5 participants