MAINT Migrates rrelu_with_noise from THC to ATen on Cuda #57864
Conversation
💊 CI failures summary and remediations (Dr. CI, as of commit 88bdcd7): 💚 Looks good so far! There are no failures yet. 💚
removed myself
```cpp
inline scalar_t __device__ curand_uniform_type(curandStatePhilox4_32_10_t *state);

template <>
inline THHalf __device__ curand_uniform_type<THHalf>(curandStatePhilox4_32_10_t *state) {
```
Don't use the legacy THHalf type; use at::Half instead. There are implicit conversions between at::Half and float, so ScalarConvert is not necessary.
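A minimal sketch of that suggestion (not the PR's exact code), assuming the primary `curand_uniform_type` template from the quoted diff:

```cpp
#include <ATen/ATen.h>      // at::Half
#include <curand_kernel.h>  // curand_uniform4, curandStatePhilox4_32_10_t

template <typename scalar_t>
inline scalar_t __device__ curand_uniform_type(curandStatePhilox4_32_10_t *state);

// Specialize for at::Half instead of the legacy THHalf; the implicit
// float -> at::Half conversion replaces ScalarConvert entirely.
template <>
inline at::Half __device__ curand_uniform_type<at::Half>(curandStatePhilox4_32_10_t *state) {
  auto rand = curand_uniform4(state);
  return rand.x;  // implicitly converted from float to at::Half
}
```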
```cpp
template <>
inline THHalf __device__ curand_uniform_type<THHalf>(curandStatePhilox4_32_10_t *state) {
  auto rand = curand_uniform4(state);
  return ScalarConvert<float, THHalf>::to(rand.x);
```
Using only .x out of the 4 generated numbers is wasteful. You can have an unroll loop in the kernel that uses all the values; take a look, e.g., at the non-vectorized fused_dropout_kernel in Dropout.cu. A sketch of the pattern follows.
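An illustrative sketch of that unroll pattern, loosely following the non-vectorized fused_dropout_kernel; the names `idx`, `rounded_size`, `grid_stride`, and `total_elements` are assumptions, not identifiers from this PR:

```cpp
// Each curand_uniform4 call yields four randoms; consume all of them
// by having each grid-stride step cover unroll_factor elements.
constexpr int unroll_factor = 4;
for (int linear_index = idx; linear_index < rounded_size;
     linear_index += grid_stride) {
  float4 rand = curand_uniform4(&state);  // four randoms per call
  #pragma unroll
  for (int ii = 0; ii < unroll_factor; ii++) {
    int li = linear_index + blockDim.x * gridDim.x * ii;
    if (li < total_elements) {
      scalar_t r = static_cast<scalar_t>((&rand.x)[ii]);  // use every lane
      // ... compute output[li] and noise[li] from r ...
    }
  }
}
```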
```cpp
if (input[i] <= 0)
{
  scalar_t r = curand_uniform_type<scalar_t>(&state);
  r = ScalarConvert<double, scalar_t>::to(r * (b - a) + a);
```
Having double here is usually a perf penalty; it should be scalar_t, or at most accscalar_t.
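For instance (a sketch, assuming `lower` and `upper` are narrowed once outside the hot loop), the per-element math could stay in the accumulation type:

```cpp
// Keep the arithmetic in accscalar_t (float for half inputs) rather
// than letting the double literals promote every operation.
using accscalar_t = at::acc_type<scalar_t, /*is_cuda=*/true>;
accscalar_t a = static_cast<accscalar_t>(lower);
accscalar_t b = static_cast<accscalar_t>(upper);
accscalar_t r = curand_uniform_type<accscalar_t>(&state);
r = r * (b - a) + a;  // no double anywhere in the per-element path
```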
```cpp
else
{
  output[i] = input[i];
  noise[i] = ScalarConvert<int, scalar_t>::to(1);
```
No ScalarConvert please
```cpp
CUDA_KERNEL_LOOP(i, n)
{
  if (input[i] <= 0)
```
To avoid warp divergence, you should generate randoms for every input, and then only diverge on fast operations like computing output and noise.
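A sketch of that shape (illustrative names; the selects could also be written as a branch, since both sides are cheap):

```cpp
// The expensive curand call runs for every element, so all threads in
// a warp stay converged through it; only the cheap output/noise
// selection differs per element.
CUDA_KERNEL_LOOP(i, n) {
  scalar_t r = curand_uniform_type<scalar_t>(&state);  // uniform work
  r = r * (b - a) + a;
  bool negative = input[i] <= scalar_t(0);
  output[i] = negative ? input[i] * r : input[i];
  noise[i]  = negative ? r : scalar_t(1);
}
```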
Force-pushed from c04f62b to bdd6e87 (compare).
Thank you for the review @ngimel! I updated the PR to use unrolling. I ran the following benchmark.

Benchmark script:

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    return time.time()

device = "cuda"
m = nn.RReLU().cuda()
n_runs = 1_000

for n in [10_000, 100_000, 1_000_000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(n_runs):
        t1 = _time()
        output = m(input)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / n_runs * 1000
    print(f"input size(128, {n}) forward time is {fwd_avg:.2f} (ms)")
```

Results from benchmark:

This PR

On master
Thanks, this looks good. I left minor comments.
```cpp
double range = upper - lower;

for (int linear_index = idx; linear_index < rounded_size; linear_index += grid_stride) {
  auto rand = random_func(&state);
```
Can you please add a static assert here that sizeof(rand)/sizeof(rand.x) == unroll_factor? Otherwise your (&rand.x)[ii] access is unsafe.
```cpp
checkAllSameGPU("rrelu_with_noise_out_cuda", {self_arg, noise_arg, output_arg});

auto input = self.contiguous();
auto noise_ = noise.contiguous();
```
rrelu_with_noise_out_cuda is a user-facing function, which means that output can also be discontiguous here.
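One possible shape for handling that (a sketch of how the review could be addressed; the `tmp_output` scratch-buffer approach is an assumption, not the merged code):

```cpp
// Compute into a contiguous scratch tensor, then copy the result back
// into the possibly-strided user-provided output.
auto input = self.contiguous();
auto noise_ = noise.contiguous();
Tensor tmp_output = output.contiguous();
// ... launch the kernel writing into tmp_output ...
if (!output.is_contiguous()) {
  output.copy_(tmp_output);
}
```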
```cpp
      output, input, noise_, lower, upper, generator);
  });
} else {
  auto lower_tensor = scalar_to_tensor(lower);
```
You don't need to convert the Scalar to a tensor here; instead, convert it to a regular type (using .to<double>()) and convert negative_slope back to a Scalar.
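That could look roughly like this (a sketch; it assumes this branch lowers to leaky_relu with the midpoint slope, as the RReLU eval path does):

```cpp
// No scalar_to_tensor: convert the Scalars to plain doubles, do the
// math, and wrap the result back into a Scalar for leaky_relu.
double l = lower.to<double>();
double u = upper.to<double>();
Scalar negative_slope((l + u) / 2.0);
return at::leaky_relu_out(output, self, negative_slope);
```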
```cpp
auto rand = random_func(&state);

// ensure that (&rand.x)[ii] is safe
CUDA_KERNEL_ASSERT(sizeof(rand)/sizeof(rand.x) == unroll_factor);
```
It should be a static_assert (checked at compile time), not a runtime assert.
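The compile-time version could look like this (a sketch; sizeof of a local variable is a constant expression, so static_assert is legal here):

```cpp
auto rand = random_func(&state);
// Compile-time guarantee that (&rand.x)[ii] stays in bounds: the
// vector type returned by random_func must have exactly
// unroll_factor lanes.
static_assert(sizeof(rand) / sizeof(rand.x) == unroll_factor,
              "random_func must return unroll_factor values");
```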
Can you please try rebasing, to get CI signal?
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes #24618
Related to #24507
Benchmark script:
Results from benchmark:
This PR
On master