Adds support for large number of buffers to DeviceMemcpy::Batched
#4065
Conversation
🟨 CI finished in 1h 31m: Pass: 51%/93 | Total: 1d 21h | Avg: 29m 20s | Max: 1h 20m | Hits: 79%/80264
Modifications in project?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper | |

Modifications in project or dependencies?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | |
| python | +/- |
| CCCL C Parallel Library | +/- |
| Catch2Helper | +/- |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟩 CI finished in 1h 08m: Pass: 100%/93 | Total: 17h 23m | Avg: 11m 13s | Max: 1h 00m | Hits: 94%/133878
Modifications in project?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper | |

Modifications in project or dependencies?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | |
| python | +/- |
| CCCL C Parallel Library | +/- |
| Catch2Helper | +/- |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟩 CI finished in 2h 14m: Pass: 100%/93 | Total: 1d 05h | Avg: 18m 45s | Max: 2h 03m | Hits: 91%/133890
Modifications in project?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper | |

Modifications in project or dependencies?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | |
| python | +/- |
| CCCL C Parallel Library | +/- |
| Catch2Helper | +/- |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟩 CI finished in 1h 16m: Pass: 100%/93 | Total: 17h 23m | Avg: 11m 13s | Max: 1h 06m | Hits: 95%/133890
Modifications in project?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper | |

Modifications in project or dependencies?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | |
| python | +/- |
| CCCL C Parallel Library | +/- |
| Catch2Helper | +/- |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
LGTM in principle. I find the test fairly complex though. Would it be possible to come up with a test that does not require this many levels of fancy iterators?
I agree. It's partly in the nature of these tests: we have to use fancy iterators to create some empty segments while not blowing up memory requirements. I added a few more comments to improve comprehension. I'm afraid we cannot do much more here.
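For context on the pattern this exchange refers to, here is a hedged sketch (not the actual test code from this PR; the functor, sizes, and buffer count are invented for illustration) of how composing Thrust's counting and transform iterators can describe billions of buffer sizes, some of them empty, without allocating memory proportional to the number of buffers:

```cpp
// Compile with nvcc. Illustrative only; not the test from this PR.
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

#include <cstdint>
#include <cstdio>

// Hypothetical size functor: every third buffer is empty, the rest are 16 B.
struct buffer_size_op
{
  __host__ __device__ std::uint32_t operator()(std::int64_t i) const
  {
    return (i % 3 == 0) ? 0u : 16u;
  }
};

int main()
{
  const std::int64_t num_buffers = (std::int64_t{1} << 31) + 5; // > INT_MAX

  // Sizes are computed lazily on dereference; no O(num_buffers) allocation.
  auto sizes = thrust::make_transform_iterator(
    thrust::make_counting_iterator(std::int64_t{0}), buffer_size_op{});

  std::printf("sizes[0] = %u, sizes[last] = %u\n",
              static_cast<unsigned>(sizes[0]),
              static_cast<unsigned>(sizes[num_buffers - 1]));
  return 0;
}
```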
🟩 CI finished in 1h 41m: Pass: 100%/93 | Total: 17h 30m | Avg: 11m 17s | Max: 1h 04m | Hits: 94%/133890
Modifications in project?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper | |

Modifications in project or dependencies?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | |
| python | +/- |
| CCCL C Parallel Library | +/- |
| Catch2Helper | +/- |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
Adds support for large number of buffers to DeviceMemcpy::Batched (NVIDIA#4065)

* offset iterator
* adds support for large number of buffers to memcpy batched
* fixes thrust ns
* fixes narrowing conversion
* expects user iterators to be advanceable on the host
* update the kernel to always use a 32-bit buffer offset type
* fixes benchmarks
* removes superfluous includes
* addresses review comments
Description

Partially addresses #3622

The idea is to use a streaming approach over the buffers being copied, processing at most 512 M buffers at a time. I chose 512 M instead of `INT_MAX` because that helps to lower the temporary storage requirements for a very large number of buffers.

Specifically:
* Removed the `BufferOffsetT` template parameter from `DispatchBatchMemcpy`. This type is supposed to represent the `num_buffers` value provided by the user; instead, I fixed this type to be `int64_t`. Having it as an extra template parameter would now be somewhat superfluous and confusing.
* Introduced the `detail::batch_memcpy::per_invocation_buffer_offset_t = ::cuda::std::uint32_t;` type alias, which is the hard-coded buffer offset type that we instantiate the relevant kernel templates with.
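To make the streaming idea above concrete, here is a minimal host-side sketch of the partitioning arithmetic. It is an assumption-laden illustration, not CUB's actual dispatch code: the loop structure and the placement of the 512 M constant are inferred from this description, and the real logic in `DispatchBatchMemcpy` is more involved.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main()
{
  // Assumed partition limit per kernel invocation (512 M buffers), chosen so
  // that buffer offsets within one invocation fit the 32-bit
  // per_invocation_buffer_offset_t while keeping temporary storage small.
  constexpr std::int64_t max_partition_size = std::int64_t{512} * 1024 * 1024;

  // A user-provided buffer count larger than INT_MAX.
  const std::int64_t num_buffers = (std::int64_t{1} << 31) + 123;

  for (std::int64_t processed = 0; processed < num_buffers; processed += max_partition_size)
  {
    // The per-invocation buffer count always fits in 32 bits by construction.
    const auto partition_size =
      static_cast<std::uint32_t>(std::min(max_partition_size, num_buffers - processed));

    // A real implementation would advance the user's buffer and size iterators
    // by `processed` on the host (hence the requirement that they be
    // advanceable on the host) and launch one kernel per partition.
    std::printf("invocation over buffers [%lld, %lld)\n",
                static_cast<long long>(processed),
                static_cast<long long>(processed + partition_size));
  }
  return 0;
}
```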