Add cuda::device::warp_match_all
#4746
Conversation
🟨 CI finished in 3h 05m: Pass: 97%/183 | Total: 1d 08h | Avg: 10m 32s | Max: 35m 42s | Hits: 96%/281996
🏃 Runner counts (total jobs: 183)
| # | Runner |
|---|---|
| 129 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 12 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-rtxa6000-latest-1 |
| 7 | linux-amd64-gpu-rtx2080-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟩 CI finished in 2h 59m: Pass: 100%/183 | Total: 1d 12h | Avg: 11m 58s | Max: 1h 24m | Hits: 93%/289460
🟩 CI finished in 2h 09m: Pass: 100%/183 | Total: 4d 05h | Avg: 33m 14s | Max: 1h 37m | Hits: 58%/290783
I've opened a PR #4804 for …
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
Yes, it is perfectly fine with me. Btw, thanks for reviewing the PR!
🟨 CI finished in 1h 07m: Pass: 98%/183 | Total: 1d 07h | Avg: 10m 23s | Max: 33m 57s | Hits: 97%/287451
🟩 CI finished in 1h 35m: Pass: 100%/187 | Total: 1d 07h | Avg: 10m 13s | Max: 57m 39s | Hits: 97%/292133
🏃 Runner counts (total jobs: 187)
| # | Runner |
|---|---|
| 129 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 12 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-rtxa6000-latest-1 |
| 11 | linux-amd64-gpu-rtx2080-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟩 CI finished in 1h 07m: Pass: 100%/187 | Total: 1d 07h | Avg: 10m 00s | Max: 33m 52s | Hits: 97%/292133
```cpp
_CCCL_ASSERT(__lane_mask != lane_mask::none(), "lane_mask must be non-zero");
constexpr int __ratio = ::cuda::ceil_div(sizeof(_Up), sizeof(uint32_t));
uint32_t __array[__ratio];
_CUDA_VSTD::memcpy(__array, _CUDA_VSTD::addressof(__data), sizeof(_Up));
```
I believe this implementation exhibits undefined behavior if `_Tp` has padding: the `memcpy` copies the entire object representation, including padding, into the array, so padding bytes end up being compared by `__match_all_sync`. Depending on compiler optimizations, or on the values those padding bytes take at runtime (e.g. depending on divergence), this implementation will non-deterministically produce different results for types with padding.
A compiler intrinsic should be able to implement this API properly, because the compiler knows which bytes of `_Tp` are padding bytes and can handle them appropriately.
If there is a way to `static_assert` that `_Tp` has no padding bytes, we could restrict this API to types without padding. Otherwise, a good library implementation seems challenging (passing this API a type with padding seems like an easy mistake to make), and we should at least update the documentation to call this out.
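There is in fact a standard trait that detects exactly this situation: `std::has_unique_object_representations_v` (C++17) is `true` only when every bit of the object representation participates in the value, i.e. when there are no padding bytes whose contents could differ between bitwise-equal values. A minimal host-side sketch of such a guard (the `warp_match_all_safe` name is illustrative, not part of libcu++):

```cpp
#include <type_traits>

// A type with internal padding: the bytes between 'c' and 'i' are unspecified,
// so a bytewise comparison of two equal values may still disagree.
struct Padded
{
  char c;
  int i;
};

// A type whose object representation is fully determined by its value.
struct Packed
{
  int a;
  int b;
};

// Hypothetical guard: reject types whose padding bytes would be compared
// by a bytewise __match_all_sync-based implementation.
template <class T>
constexpr bool warp_match_all_safe = std::has_unique_object_representations_v<T>;

static_assert(!warp_match_all_safe<Padded>, "padding bytes make bytewise comparison unreliable");
static_assert(warp_match_all_safe<Packed>, "no padding: bytewise comparison is well-defined");
```

libcu++ mirrors this trait in `<cuda/std/type_traits>`, so a device-side `static_assert` along these lines should be feasible if the maintainers decide to restrict the API.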
Description
Following the same idea as the existing utilities in `<cuda/warp>`, this PR adds `cuda::device::warp_match_all` to check whether a subset of lanes holds the same value. The function performs a bitwise comparison. We could raise a compile-time error if the data type overrides `operator==`; on the other hand, that would prevent comparing types like `cuda::std::array`.
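The bitwise-comparison semantics can be sketched on the host as a small reference function: the result is true iff every lane selected by the mask holds a bytewise-identical value. This is an illustrative model of the behavior described above, not the actual device implementation (names and signature are invented for the sketch):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side model of warp_match_all semantics: lane_values[i] stands for the
// value held by lane i, and lane_mask selects the participating lanes.
// Comparison is bytewise (memcmp), mirroring the __match_all_sync-based path.
template <class T>
bool match_all_reference(const std::vector<T>& lane_values, std::uint32_t lane_mask)
{
  const T* first = nullptr;
  for (std::size_t lane = 0; lane < lane_values.size(); ++lane)
  {
    if (((lane_mask >> lane) & 1u) == 0)
    {
      continue; // lane not participating
    }
    if (first == nullptr)
    {
      first = &lane_values[lane]; // first participating lane sets the reference
      continue;
    }
    if (std::memcmp(first, &lane_values[lane], sizeof(T)) != 0)
    {
      return false; // a participating lane holds a different bit pattern
    }
  }
  return true;
}
```

For example, `match_all_reference<int>({7, 7, 8, 7}, 0b1111u)` is false, while restricting the mask to lanes 0, 1, and 3 with `0b1011u` makes it true, since lane 2's differing value is excluded.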