MAINT Migrates multilabel_margin_loss from THC to ATen (CUDA) #60708
Conversation
💊 CI failures summary and remediations: as of commit a340ef2 (more details on the Dr. CI page and at hud.pytorch.org/pr/60708): 💚 Looks good so far! There are no failures yet. 💚 Preview docs built from this PR. (This comment was automatically generated by Dr. CI.)
This looks good. Please resolve the conflicts and don't use the legacy reduction functions. Also, such large performance gains are indeed suspicious; can you run correctness tests on some bigger sizes? (The tests are probably run only on very small inputs.)
namespace native {

namespace {

const int MULTILABELMARGIN_THREADS = 32;
You've changed the number of threads from 1024 to 32; maybe that's the reason for the perf improvement? (Usually 32 is too small; you need at least 64-128.)
Changing the number of threads back to 1024 made the performance of this PR align with master.
But we want better perf :-) so if 32 produces correct results we should be using it (or, probably better, 64 or 128).
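For context, MULTILABELMARGIN_THREADS only sets the block size of the kernel launches, so trying 64 or 128 is a one-line change. A rough sketch of the launch it feeds, with hypothetical kernel and argument names rather than the exact ones in MultiLabelMarginCriterion.cu:

const int MULTILABELMARGIN_THREADS = 128;  // candidate between the 32 and 1024 discussed above

// One block per sample; MULTILABELMARGIN_THREADS threads cooperate over the class dimension.
dim3 blocks(nframe);
dim3 threads(MULTILABELMARGIN_THREADS);
multilabel_margin_loss_forward_kernel<scalar_t>
    <<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
        /* output, input, target, is_target, nframe, dim, size_average */);
C10_CUDA_KERNEL_LAUNCH_CHECK();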
// reduce
using Op = ReduceAdd<accscalar_t>;
accscalar_t total_sum = reduceBlock<accscalar_t>(
Instead of the legacy reduceBlock, it's better to use BlockReduceSum from block_reduce.cuh (it also doesn't require shared memory).
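A hedged sketch of what that swap could look like at the reduction site, assuming the BlockReduceSum(val, shared) helper currently in ATen/native/cuda/block_reduce.cuh; thread_sum and output are placeholder names for the kernel's partial sum and result pointer:

// Inside the forward kernel, replacing the reduceBlock<accscalar_t>(...) call.
// Each thread passes in its partial sum; after the call, thread 0 holds the block total.
__shared__ accscalar_t smem[C10_WARP_SIZE];  // one slot per warp is enough
accscalar_t total_sum =
    at::native::cuda_utils::BlockReduceSum<accscalar_t>(thread_sum, smem);
if (threadIdx.x == 0) {
  *output = static_cast<scalar_t>(total_sum);
}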
    (target_.size(0) == nframe) && (target_.size(1) == dim),
    "inconsistent target size");
TORCH_CHECK(
    (is_target_.dim() == 2) && (is_target.size(0) == nframe) &&
nit: this check is slightly cleaner to write as target_.sizes() == is_target_.sizes()
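Applied to the quoted check, that nit would read roughly as follows (a sketch only; the error message wording is just an example):

TORCH_CHECK(
    target_.sizes() == is_target_.sizes(),
    "inconsistent is_target size, expected ", target_.sizes(),
    " but got ", is_target_.sizes());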
namespace native {

namespace {

const int MULTILABELMARGIN_THREADS = 1024;
Great, can you check if reducing the number of threads here brings good perf and correct results, and then we can land?
Here is the same benchmark but with bigger tensors. As for correctness, I wrote a script to compare the CUDA forward results and input gradients against the CPU implementation.

Correctness script:

from itertools import product

import torch
import torch.nn.functional as F

torch.manual_seed(0)

C = 100
n_runs = 3
reductions = ["none", "sum", "mean"]
Ns = [10, 100, 1_000, 10_000]

for reduction, N in product(reductions, Ns):
    print(f"Checking {reduction}, ({N}, {C})")
    for _ in range(n_runs):
        grad_out_cpu = torch.randn(N, device="cpu")
        if reduction != "none":
            # reduced outputs are 0-dim, so the incoming gradient must be too
            grad_out_cpu = grad_out_cpu[0]
        grad_out_gpu = grad_out_cpu.to("cuda")

        input_cpu = torch.randn(N, C, requires_grad=True)
        target_cpu = torch.randint(0, C, size=input_cpu.size())

        result_cpu = F.multilabel_margin_loss(input_cpu, target_cpu, reduction=reduction)
        result_cpu.backward(grad_out_cpu)

        # use an independent CUDA leaf so its gradient is recorded separately from the CPU one
        input_gpu = input_cpu.detach().to("cuda").requires_grad_(True)
        target_gpu = target_cpu.to("cuda")

        result_gpu = F.multilabel_margin_loss(input_gpu, target_gpu, reduction=reduction)
        result_gpu.backward(grad_out_gpu)

        assert torch.allclose(result_cpu, result_gpu.to("cpu"))
        assert torch.allclose(input_cpu.grad, input_gpu.grad.to("cpu"))
Thank you, results look great!

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes #24603
Fixes #24602
The implementation should be exactly the same, so it is strange that the benchmarks show such a significant improvement in this PR. (Update: the benchmarks are now the same.)
Benchmark script
master
this PR