-
Notifications
You must be signed in to change notification settings - Fork 282
Deprecate and Replace cub::BFE
#4031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🟨 CI finished in 1h 41m: Pass: 84%/93 | Total: 2d 21h | Avg: 44m 42s | Max: 1h 25m | Hits: 60%/115695
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| +/- | CUB |
| +/- | Thrust |
| CUDA Experimental | |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
|
Using |
|
Thanks! I'm out until Monday. Will review then. Could you, in the meantime, please run the benchmarks for radix sort wnd share the results here? |
|
I can take a look |
🟨 CI finished in 1h 15m: Pass: 90%/93 | Total: 18h 36m | Avg: 12m 00s | Max: 1h 00m | Hits: 97%/121785
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| +/- | CUB |
| +/- | Thrust |
| CUDA Experimental | |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟨 CI finished in 1h 33m: Pass: 98%/158 | Total: 2d 13h | Avg: 23m 16s | Max: 1h 16m | Hits: 84%/246942
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟩 CI finished in 1h 18m: Pass: 100%/158 | Total: 1d 05h | Avg: 11m 10s | Max: 1h 01m | Hits: 91%/250596
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| const int begin_bit = GENERATE_COPY(take(2, random(0, key_size - 1))); | ||
| const int end_bit = GENERATE_COPY(take(2, random(begin_bit + 1, key_size))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elstehle could you please have a quick look whether this fix is correct? We previously tested begin_bit == end_bit sometimes. Was this an invalid scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if begin_bit == end_bit is valid then the kernel should not be called (I guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think begin_bit == end_bit is generally valid, it's similar to num_items==0.
[...] then the kernel should not be called (I guess)
We can skip any kernel invocation only if the user invoked DeviceRardixSort via the DoubleBuffer interface. Otherwise the user will expect the output to end up in d_{keys,values}_out, in which case we need to copy the "sorted" output there.
|
added performance comparison in the description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the begin_bit==end_bit constraint, this looks good. Can we lift that new constraint? Otherwise, this will be a breaking change as it narrows the usage scenarios.
|
Btw, I already included this change in the incoming CCCL 3.0 migration guide: #4069 |
|
@elstehle sorry, what do you mean for lifting the new constraint? what do you think if we return from the host call when |
I assumed that the new
Is your question: "What should we do if the user passes |
|
|
After reconsidering, I would actually prefer failing explicitly when Broadly, we have two types of users:
The priority should be to avoid regressing user group (1), whether that means preventing performance degradation or avoiding the compilation of an extra copy kernel. If we can achieve that while still accommodating user group (2), I’m in favor. Otherwise, I’d prefer to fail in cases where |
🟨 CI finished in 1h 37m: Pass: 93%/93 | Total: 2d 21h | Avg: 44m 56s | Max: 1h 24m | Hits: 63%/125451
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| +/- | CUB |
| +/- | Thrust |
| CUDA Experimental | |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟨 CI finished in 1h 21m: Pass: 94%/158 | Total: 2d 10h | Avg: 22m 23s | Max: 1h 18m | Hits: 88%/240766
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟨 CI finished in 1h 17m: Pass: 96%/158 | Total: 2d 10h | Avg: 22m 18s | Max: 1h 15m | Hits: 92%/242650
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟩 CI finished in 1h 24m: Pass: 100%/158 | Total: 2d 10h | Avg: 22m 21s | Max: 1h 22m | Hits: 91%/251089
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
Fixes #4025
Description
Replace/deprecate
cub::BFEwith new<cuda/bit>functionalities nvidia.github.io/cccl/libcudacxx/extended_api/bit.htmlThe initial PR found the following problems:
bitfield.hcatch2_test_device_segmented_radix_sort_keys.cuincludes tests with 0 bit widthcatch2_test_block_radix_sort.cuincludes tests with 0 bit widthPerformance comparison PTX
BFEvs.cuda::bitfield_extractwith SM80. TLDR: slightly faster[0] NVIDIA RTX A6000
Summary