Add transform c parallel implementation #4048
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
2719dfb to 8b065af (compare)
8b065af to e659396 (compare)
/ok to test
🟩 CI finished in 1h 43m: Pass: 100%/93 | Total: 2d 19h | Avg: 43m 14s | Max: 1h 23m | Hits: 59%/133898
| Project | Modified? | Modifications in project or dependencies? |
|---|---|---|
| CCCL Infrastructure | | |
| libcu++ | | |
| CUB | +/- | +/- |
| Thrust | | +/- |
| CUDA Experimental | | |
| python | | +/- |
| CCCL C Parallel Library | +/- | +/- |
| Catch2Helper | | +/- |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
Looks good, left some comments for some potential refactoring
```cpp
size_t cubin_size;
CUlibrary library;
CUkernel transform_kernel;
int loaded_bytes_per_iteration;
```
Suggested change:

```diff
-int loaded_bytes_per_iteration;
+size_t loaded_bytes_per_iteration;
```
The type of loaded_bytes_per_iteration in CUB is int - should I still make this change?
I suggested this because you initialize it with input_it.value_type.size, which is of type size_t. But if CUB expects it to be an int, then I think you should keep it as is
c/parallel/src/transform.cu (Outdated)
```cpp
std::string get_input_iterator_name()
{
  std::string iterator_t;
  check(nvrtcGetTypeName<input_iterator_t>(&iterator_t));
  return iterator_t;
}

std::string get_input1_iterator_name()
{
  std::string iterator_t;
  check(nvrtcGetTypeName<input1_iterator_t>(&iterator_t));
  return iterator_t;
}

std::string get_input2_iterator_name()
{
  std::string iterator_t;
  check(nvrtcGetTypeName<input2_iterator_t>(&iterator_t));
  return iterator_t;
}

std::string get_output_iterator_name()
{
  std::string iterator_t;
  check(nvrtcGetTypeName<output_iterator_t>(&iterator_t));
  return iterator_t;
}
```
Georgii recently suggested returning the string with the type name directly, which I agree looks cleaner. There is some inconsistency in how the existing c.parallel algorithms do this, which we should resolve eventually.
I've refactored this in 6eb5e53. I tried to introduce a get_iterator_name function that is a bit more general and could be reused across all the algorithms. It relies on defining the iterator names as constexpr strings for the source of truth. Please let me know if you think it's an acceptable pattern and I'll go ahead and change this for the other algorithms as well.
yeah that looks good
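For illustration, here is a minimal, self-contained sketch of the pattern described above. The `cccl_*` stand-in types, the constant names, and the helper signature are assumptions made for this sketch, not the actual contents of commit 6eb5e53.

```cpp
#include <string>
#include <string_view>

// Simplified stand-ins for the real c.parallel types (illustration only).
enum class cccl_iterator_kind_t { CCCL_POINTER, CCCL_ITERATOR };
struct cccl_iterator_t
{
  cccl_iterator_kind_t type;
  std::string value_type_name; // e.g. "int" for a raw int* input
};

namespace transform {
// Source of truth for the iterator typedef names emitted into the generated kernel source.
inline constexpr std::string_view input_iterator_name  = "input_iterator_t";
inline constexpr std::string_view output_iterator_name = "output_iterator_t";
} // namespace transform

// Shared helper: raw pointers map to "<value_type>*", user-defined iterators map to
// the fixed struct name that the generated kernel source defines.
std::string get_iterator_name(const cccl_iterator_t& it, std::string_view fixed_name)
{
  if (it.type == cccl_iterator_kind_t::CCCL_POINTER)
  {
    return it.value_type_name + "*";
  }
  return std::string{fixed_name};
}

int main()
{
  const cccl_iterator_t pointer_input{cccl_iterator_kind_t::CCCL_POINTER, "int"};
  const cccl_iterator_t fancy_input{cccl_iterator_kind_t::CCCL_ITERATOR, ""};

  const bool ok = get_iterator_name(pointer_input, transform::input_iterator_name) == "int*"
               && get_iterator_name(fancy_input, transform::input_iterator_name) == "input_iterator_t";
  return ok ? 0 : 1;
}
```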
c/parallel/src/transform.cu (Outdated)
```cpp
const std::string input_iterator_t =
  input_it.type == cccl_iterator_kind_t::CCCL_POINTER //
    ? cccl_type_enum_to_name<input_storage_t>(input_it.value_type.type, true) //
    : transform::get_input_iterator_name();

const std::string output_iterator_t =
  (output_it.type == cccl_iterator_kind_t::CCCL_POINTER //
     ? cccl_type_enum_to_name<output_storage_t>(output_it.value_type.type, true) //
     : transform::get_output_iterator_name());
```
Maybe refactor the logic for how we get the type into a separate function to avoid repetition? I do something similar in cccl/c/parallel/src/unique_by_key.cu (line 133 in be4cc94):

```cpp
const std::string input_keys_iterator_t = get_iterator_name(input_keys_it, unique_by_key_iterator_t::input_keys);
```
c/parallel/src/transform.cu (Outdated)
```cpp
const std::string input1_iterator_t =
  input1_it.type == cccl_iterator_kind_t::CCCL_POINTER //
    ? cccl_type_enum_to_name<input1_storage_t>(input1_it.value_type.type, true) //
    : transform::get_input1_iterator_name();

const std::string input2_iterator_t =
  input2_it.type == cccl_iterator_kind_t::CCCL_POINTER //
    ? cccl_type_enum_to_name<input2_storage_t>(input2_it.value_type.type, true) //
    : transform::get_input2_iterator_name();

const std::string output_iterator_t =
  (output_it.type == cccl_iterator_kind_t::CCCL_POINTER //
     ? cccl_type_enum_to_name<output_storage_t>(output_it.value_type.type, true) //
     : transform::get_output_iterator_name());
```
Similar comment to above
c/parallel/src/transform.cu (Outdated)
```cpp
const std::string input_iterator_src =
  make_kernel_input_iterator(offset_t, "input_iterator_t", input_it_value_t, input_it);
const std::string output_iterator_src =
  make_kernel_output_iterator(offset_t, "output_iterator_t", output_it_value_t, output_it);
```
For consistency, instead of "input_iterator_t" maybe use get_input_iterator_name()? (Same comment for output_iterator_t)
c/parallel/src/transform.cu (Outdated)
```cpp
catch (const std::exception& exc)
{
  fflush(stderr);
  printf("\nEXCEPTION in cccl_device_transform(): %s\n", exc.what());
```
Suggested change:

```diff
-printf("\nEXCEPTION in cccl_device_transform(): %s\n", exc.what());
+printf("\nEXCEPTION in cccl_device_unary_transform(): %s\n", exc.what());
```
```cpp
const std::string input1_iterator_src =
  make_kernel_input_iterator(offset_t, "input1_iterator_t", input1_it_value_t, input1_it);
const std::string input2_iterator_src =
  make_kernel_input_iterator(offset_t, "input2_iterator_t", input2_it_value_t, input2_it);

const std::string output_iterator_src =
  make_kernel_output_iterator(offset_t, "output_iterator_t", output_it_value_t, output_it);
const std::string op_src =
  make_kernel_user_binary_operator(input1_it_value_t, input2_it_value_t, output_it_value_t, op);
```
Similar comment to above regarding using the functions to get the iterator name
c/parallel/src/transform.cu (Outdated)
```cpp
check(cuCtxGetDevice(&cu_device));
auto cuda_error = cub::detail::transform::dispatch_t<
  cub::detail::transform::requires_stable_address::no, // TODO implement yes
  ::cuda::std::int64_t,
```
Similar comment to above
c/parallel/src/transform.cu (Outdated)
```cpp
catch (const std::exception& exc)
{
  fflush(stderr);
  printf("\nEXCEPTION in cccl_device_transform(): %s\n", exc.what());
```
Suggested change:

```diff
-printf("\nEXCEPTION in cccl_device_transform(): %s\n", exc.what());
+printf("\nEXCEPTION in cccl_device_binary_transform(): %s\n", exc.what());
```
c/parallel/src/transform.cu (Outdated)
```cpp
#include <format>
#include <iostream>
#include <optional>
```
This header seems unused?
🟩 CI finished in 1h 36m: Pass: 100%/93 | Total: 2d 15h | Avg: 41m 01s | Max: 1h 22m | Hits: 74%/133898
```cpp
_CCCL_HOST_DEVICE auto make_iterator_kernel_arg(It it) -> kernel_arg<It>
{
  kernel_arg<It> arg;
  // since we switch the active member of the union, we must use placement new or construct_at. This also uses the copy
```
This comment should stay
Oops - fixed.
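To illustrate why that restored comment matters, here is a small self-contained example of the active-member-switch issue. `kernel_arg` and `strided_iterator` below are simplified stand-ins, not CUB's actual types.

```cpp
#include <new> // placement new

// Toy stand-ins for illustration only; CUB's real kernel_arg differs.
struct strided_iterator
{
  const int* ptr;
  int stride;
};

struct kernel_arg
{
  union
  {
    strided_iterator iterator; // active member when a fancy iterator is passed
    const void*      pointer;  // active member when a raw pointer is passed
  };
  kernel_arg() : pointer(nullptr) {} // `pointer` starts out as the active member
};

kernel_arg make_iterator_kernel_arg(strided_iterator it)
{
  kernel_arg arg;
  // Switching the active member from `pointer` to `iterator` requires starting the
  // lifetime of `iterator`, which placement new (or construct_at) does; it also
  // invokes the iterator's copy constructor. A plain `arg.iterator = it;` would only
  // be valid for sufficiently trivial types; for general iterator types it assigns to
  // an object whose lifetime has not begun, which is undefined behavior.
  ::new (&arg.iterator) strided_iterator(it);
  return arg;
}

int main()
{
  const int data[4] = {1, 2, 3, 4};
  const kernel_arg arg = make_iterator_kernel_arg(strided_iterator{data, 2});
  return arg.iterator.stride == 2 ? 0 : 1;
}
```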
🟩 CI finished in 2h 42m: Pass: 100%/93 | Total: 2d 14h | Avg: 40m 20s | Max: 1h 17m | Hits: 75%/133898
/ok to test
🟩 CI finished in 2h 32m: Pass: 100%/93 | Total: 1d 01h | Avg: 16m 10s | Max: 1h 22m | Hits: 93%/133898
CUB changes LGTM
* Enable setting _CUB_HAS_TRANSFORM_UBLKCP as flag
* Stale comment
* Update make_kernel_user_binary_operator to accept lhs/rhs types
* Add transform c.parallel implementation
* Add tests for c.parallel transform
* Use launcher factory to query ptx version
* Refactor how we get iterator names
* Add comment explaining why we need -default-device
* Address remaining review feedback
* Undo change to make_iterator_kernel_arg
* Restore the comment
* Restore comment

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Description
Closes #3877
This PR introduces `transform` to the c.parallel API, using only the `prefetch` algorithm (not `ublkcp`).

* `unary_transform` applies a unary operation to a single input iterator.
* `binary_transform` applies a binary operation to two input iterators.

Note that this is more limited than the C++ CUB API, which allows passing an arbitrary number of input iterators.
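For context, the sketch below shows the analogous C++ CUB calls that these new entry points mirror; the exact `cub::DeviceTransform::Transform` overloads and parameter types may differ across CCCL versions, so treat it as illustrative only.

```cpp
// Illustrative sketch of the C++ CUB calls mirrored by the new c.parallel entry points.
// Compile with nvcc and --extended-lambda; overload details may vary across CCCL versions.
#include <cub/device/device_transform.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <cuda/std/tuple>

int main()
{
  const int num_items = 4;
  thrust::device_vector<int> a(num_items);
  thrust::device_vector<int> b(num_items);
  thrust::device_vector<int> out(num_items);
  thrust::sequence(a.begin(), a.end(), 1);  // a = {1, 2, 3, 4}
  thrust::sequence(b.begin(), b.end(), 10); // b = {10, 11, 12, 13}

  // Unary transform: one input iterator, one output iterator.
  cub::DeviceTransform::Transform(
    a.begin(), out.begin(), num_items, [] __device__(int x) { return 2 * x; });

  // Binary transform: two input iterators passed as a tuple.
  cub::DeviceTransform::Transform(
    ::cuda::std::make_tuple(a.begin(), b.begin()),
    out.begin(),
    num_items,
    [] __device__(int x, int y) { return x + y; });

  return 0;
}
```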
Checklist