Add transform c parallel implementation by shwina · Pull Request #4048 · NVIDIA/cccl · GitHub

Conversation

@shwina
Contributor

@shwina shwina commented Mar 7, 2025

Description

Closes #3877

This PR introduces transform to the c.parallel API, using only the prefetch algorithm (not ublkcp).

  • unary_transform applies a unary operation on a single input iterator
  • binary_transform applies a binary operation on two input iterators

Note that this is more limited than the C++ CUB API, which allows passing an arbitrary number of input iterators.
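For readers unfamiliar with the two shapes, here is a minimal host-side sketch of the unary and binary forms using `std::transform`. It only illustrates the semantics; the function names `unary_transform_example` and `binary_transform_example` are made up for this illustration and are not the c.parallel entry points, which take type-erased iterators and operators and run on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Unary form: one input range, one output range, op(x).
std::vector<int> unary_transform_example(const std::vector<int>& in)
{
  std::vector<int> out(in.size());
  std::transform(in.begin(), in.end(), out.begin(), [](int x) { return 2 * x; });
  return out;
}

// Binary form: two input ranges of equal length, one output range, op(x, y).
std::vector<int> binary_transform_example(const std::vector<int>& a, const std::vector<int>& b)
{
  assert(a.size() == b.size());
  std::vector<int> out(a.size());
  std::transform(a.begin(), a.end(), b.begin(), out.begin(), [](int x, int y) { return x + y; });
  return out;
}
```

The C++ CUB API generalizes this to any number of input iterators; the sketch stops at two, matching the limitation noted above.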

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Mar 7, 2025
@shwina shwina force-pushed the add-transform-c-parallel-implementation branch 3 times, most recently from 2719dfb to 8b065af Compare March 10, 2025 14:58
@shwina shwina force-pushed the add-transform-c-parallel-implementation branch from 8b065af to e659396 Compare March 11, 2025 16:07
@shwina
Contributor Author

shwina commented Mar 11, 2025

/ok to test

@github-actions
Contributor

🟩 CI finished in 1h 43m: Pass: 100%/93 | Total: 2d 19h | Avg: 43m 14s | Max: 1h 23m | Hits: 59%/133898
  • 🟩 cub: Pass: 100%/45 | Total: 1d 16h | Avg: 54m 31s | Max: 1h 23m | Hits: 65%/53614

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  1d 14h | Avg: 54m 18s | Max:  1h 23m | Hits:  65%/51178 
      🟩 arm64              Pass: 100%/2   | Total:  1h 58m | Avg: 59m 01s | Max: 59m 27s | Hits:  68%/2436  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 05m | Avg:  1h 01m | Max:  1h 12m | Hits:  57%/5922  
      🟩 12.5               Pass: 100%/2   | Total:  2h 33m | Avg:  1h 16m | Max:  1h 17m | Hits:  16%/2254  
      🟩 12.8               Pass: 100%/38  | Total:  1d 09h | Avg: 52m 29s | Max:  1h 23m | Hits:  69%/45438 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m | Hits:  21%/2104  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 05m | Avg:  1h 01m | Max:  1h 12m | Hits:  57%/5922  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 33m | Avg:  1h 16m | Max:  1h 17m | Hits:  16%/2254  
      🟩 nvcc12.8           Pass: 100%/36  | Total:  1d 07h | Avg: 51m 54s | Max:  1h 23m | Hits:  71%/43334 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m | Hits:  21%/2104  
      🟩 nvcc               Pass: 100%/43  | Total:  1d 14h | Avg: 54m 07s | Max:  1h 23m | Hits:  67%/51510 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 47m | Avg: 56m 49s | Max: 58m 40s | Hits:  68%/4880  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 56m | Avg: 58m 01s | Max:  1h 00m | Hits:  68%/2436  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 55m | Avg: 57m 45s | Max:  1h 00m | Hits:  68%/2436  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 52m | Avg: 56m 02s | Max: 56m 34s | Hits:  68%/2436  
      🟩 Clang18            Pass: 100%/7   | Total:  5h 49m | Avg: 49m 58s | Max:  1h 04m | Hits:  65%/8194  
      🟩 GCC7               Pass: 100%/2   | Total:  1h 54m | Avg: 57m 11s | Max: 58m 28s | Hits:  67%/2440  
      🟩 GCC8               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m | Hits:  68%/1220  
      🟩 GCC9               Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m | Hits:  67%/2440  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 56m | Avg: 58m 06s | Max:  1h 00m | Hits:  68%/2440  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 59m | Avg: 59m 52s | Max:  1h 02m | Hits:  67%/2436  
      🟩 GCC12              Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 01m | Hits:  67%/2436  
      🟩 GCC13              Pass: 100%/11  | Total:  6h 43m | Avg: 36m 39s | Max:  1h 03m | Hits:  85%/13398 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 35m | Avg:  1h 17m | Max:  1h 23m | Hits:  14%/2084  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 41m | Avg:  1h 20m | Max:  1h 21m | Hits:  14%/2084  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 33m | Avg:  1h 16m | Max:  1h 17m | Hits:  16%/2254  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total: 15h 20m | Avg: 54m 09s | Max:  1h 04m | Hits:  67%/20382 
      🟩 GCC                Pass: 100%/22  | Total: 17h 42m | Avg: 48m 17s | Max:  1h 04m | Hits:  76%/26810 
      🟩 MSVC               Pass: 100%/4   | Total:  5h 16m | Avg:  1h 19m | Max:  1h 23m | Hits:  14%/4168  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 33m | Avg:  1h 16m | Max:  1h 17m | Hits:  16%/2254  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total:  1h 13m | Avg: 24m 28s | Max: 25m 42s | Hits:  89%/3654  
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 11h | Avg:  1h 02m | Max:  1h 23m | Hits:  57%/40216 
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 16m | Avg: 32m 04s | Max:  1h 04m | Hits:  91%/9744  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 13h | Avg:  1h 01m | Max:  1h 23m | Hits:  58%/43870 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 13s | Avg: 21m 13s | Max: 21m 13s | Hits:  99%/1218  
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 57s | Avg: 17m 57s | Max: 17m 57s | Hits:  99%/1218  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 12m | Avg: 24m 17s | Max: 25m 05s | Hits:  99%/3654  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 05m | Avg: 21m 42s | Max: 22m 38s | Hits:  99%/3654  
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total:  1h 13m | Avg: 24m 28s | Max: 25m 42s | Hits:  89%/3654  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 03m | Avg:  1h 03m | Max:  1h 03m | Hits:  67%/1218  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 21h 01m | Avg:  1h 03m | Max:  1h 23m | Hits:  56%/23591 
      🟩 20                 Pass: 100%/25  | Total: 19h 51m | Avg: 47m 40s | Max:  1h 21m | Hits:  72%/30023 
    
  • 🟩 thrust: Pass: 100%/45 | Total: 1d 00h | Avg: 33m 00s | Max: 1h 07m | Hits: 56%/79956

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 39m 39s | Avg: 19m 49s | Max: 28m 23s | Hits:  73%/3556  
    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total: 23h 54m | Avg: 33m 20s | Max:  1h 07m | Hits:  55%/76401 
      🟩 arm64              Pass: 100%/2   | Total: 51m 33s | Avg: 25m 46s | Max: 27m 25s | Hits:  77%/3555  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 11m | Avg: 38m 22s | Max:  1h 02m | Hits:  51%/8881  
      🟩 12.5               Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 07m | Hits:  21%/3554  
      🟩 12.8               Pass: 100%/38  | Total: 19h 24m | Avg: 30m 39s | Max:  1h 06m | Hits:  58%/67521 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 56m 46s | Avg: 28m 23s | Max: 29m 08s | Hits:  46%/3554  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 11m | Avg: 38m 22s | Max:  1h 02m | Hits:  51%/8881  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 07m | Hits:  21%/3554  
      🟩 nvcc12.8           Pass: 100%/36  | Total: 18h 27m | Avg: 30m 46s | Max:  1h 06m | Hits:  59%/63967 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 56m 46s | Avg: 28m 23s | Max: 29m 08s | Hits:  46%/3554  
      🟩 nvcc               Pass: 100%/43  | Total: 23h 48m | Avg: 33m 13s | Max:  1h 07m | Hits:  56%/76402 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 05m | Avg: 31m 18s | Max: 32m 10s | Hits:  56%/7108  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 05m | Avg: 32m 49s | Max: 34m 27s | Hits:  46%/3554  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 07m | Avg: 33m 38s | Max: 34m 41s | Hits:  46%/3554  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 02m | Avg: 31m 12s | Max: 31m 16s | Hits:  46%/3554  
      🟩 Clang18            Pass: 100%/7   | Total:  2h 42m | Avg: 23m 10s | Max: 32m 38s | Hits:  68%/12439 
      🟩 GCC7               Pass: 100%/2   | Total:  1h 05m | Avg: 32m 49s | Max: 32m 55s | Hits:  56%/3556  
      🟩 GCC8               Pass: 100%/1   | Total: 31m 39s | Avg: 31m 39s | Max: 31m 39s | Hits:  46%/1778  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 25s | Max: 33m 30s | Hits:  57%/3556  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 11m | Avg: 35m 38s | Max: 36m 12s | Hits:  46%/3556  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 06m | Avg: 33m 13s | Max: 33m 25s | Hits:  46%/3556  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 09m | Avg: 34m 48s | Max: 35m 09s | Hits:  46%/3556  
      🟩 GCC13              Pass: 100%/10  | Total:  3h 34m | Avg: 21m 28s | Max: 33m 17s | Hits:  77%/17780 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 02m | Hits:  31%/3542  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 42m | Avg: 54m 09s | Max:  1h 06m | Hits:  33%/5313  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 07m | Hits:  21%/3554  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  8h 02m | Avg: 28m 23s | Max: 34m 41s | Hits:  58%/30209 
      🟩 GCC                Pass: 100%/21  | Total:  9h 46m | Avg: 27m 55s | Max: 36m 12s | Hits:  63%/37338 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 47m | Avg: 57m 31s | Max:  1h 06m | Hits:  32%/8855  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 07m | Hits:  21%/3554  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 31m 02s | Avg: 15m 31s | Max: 20m 11s | Hits:  73%/3556  
      🟩 rtx2080            Pass: 100%/33  | Total: 20h 14m | Avg: 36m 48s | Max:  1h 07m | Hits:  48%/58637 
      🟩 rtx4090            Pass: 100%/10  | Total:  3h 59m | Avg: 23m 59s | Max:  1h 06m | Hits:  76%/17763 
    🟩 jobs
      🟩 Build              Pass: 100%/38  | Total: 23h 14m | Avg: 36m 41s | Max:  1h 07m | Hits:  48%/67519 
      🟩 TestCPU            Pass: 100%/3   | Total: 47m 25s | Avg: 15m 48s | Max: 32m 15s | Hits:  90%/5326  
      🟩 TestGPU            Pass: 100%/4   | Total: 44m 09s | Avg: 11m 02s | Max: 11m 39s | Hits:  99%/7111  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 31m 02s | Avg: 15m 31s | Max: 20m 11s | Hits:  73%/3556  
      🟩 90;90a;100         Pass: 100%/1   | Total: 30m 49s | Avg: 30m 49s | Max: 30m 49s | Hits:  77%/1778  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 12h 55m | Avg: 38m 47s | Max:  1h 07m | Hits:  45%/35531 
      🟩 20                 Pass: 100%/23  | Total: 11h 10m | Avg: 29m 08s | Max:  1h 06m | Hits:  63%/40869 
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 19m 13s | Avg: 9m 36s | Max: 16m 30s | Hits: 88%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 19m 13s | Avg:  9m 36s | Max: 16m 30s | Hits:  88%/328   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 19m 13s | Avg:  9m 36s | Max: 16m 30s | Hits:  88%/328   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 19m 13s | Avg:  9m 36s | Max: 16m 30s | Hits:  88%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 19m 13s | Avg:  9m 36s | Max: 16m 30s | Hits:  88%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 19m 13s | Avg:  9m 36s | Max: 16m 30s | Hits:  88%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 19m 13s | Avg:  9m 36s | Max: 16m 30s | Hits:  88%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 19m 13s | Avg:  9m 36s | Max: 16m 30s | Hits:  88%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 43s | Avg:  2m 43s | Max:  2m 43s | Hits:  77%/164   
      🟩 Test               Pass: 100%/1   | Total: 16m 30s | Avg: 16m 30s | Max: 16m 30s | Hits:  98%/164   
    
  • 🟩 python: Pass: 100%/1 | Total: 1h 02m | Avg: 1h 02m | Max: 1h 02m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 93)

# Runner
66 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina marked this pull request as ready for review March 11, 2025 18:16
@shwina shwina requested review from a team as code owners March 11, 2025 18:16
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Mar 11, 2025
Contributor

@NaderAlAwar NaderAlAwar left a comment

Looks good, left some comments for some potential refactoring

size_t cubin_size;
CUlibrary library;
CUkernel transform_kernel;
int loaded_bytes_per_iteration;
Contributor

Suggested change
- int loaded_bytes_per_iteration;
+ size_t loaded_bytes_per_iteration;

Contributor Author

The type of loaded_bytes_per_iteration in CUB is int - should I still make this change?

Contributor

I suggested this because you initialize it with input_it.value_type.size, which is of type size_t. But if CUB expects it to be an int, then I think you should keep it as is.
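If the field stays an `int` while its initializer is a `size_t`, one option is to make the narrowing explicit and checked. The following is only a sketch: `checked_size_to_int` and `example_loaded_bytes` are hypothetical names that do not exist in CUB or c.parallel, and `sizeof(double)` stands in for `input_it.value_type.size`.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical helper: converts a size_t to int and throws instead of
// silently truncating when the value does not fit.
int checked_size_to_int(std::size_t n)
{
  if (n > static_cast<std::size_t>(std::numeric_limits<int>::max()))
  {
    throw std::overflow_error("size does not fit in int");
  }
  return static_cast<int>(n);
}

// Usage: a size_t value size feeding an int field.
int example_loaded_bytes()
{
  const std::size_t value_size = sizeof(double); // stand-in for input_it.value_type.size
  const int loaded_bytes_per_iteration = checked_size_to_int(value_size);
  assert(loaded_bytes_per_iteration == static_cast<int>(sizeof(double)));
  return loaded_bytes_per_iteration;
}
```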

Comment on lines 98 to 124
std::string get_input_iterator_name()
{
std::string iterator_t;
check(nvrtcGetTypeName<input_iterator_t>(&iterator_t));
return iterator_t;
}

std::string get_input1_iterator_name()
{
std::string iterator_t;
check(nvrtcGetTypeName<input1_iterator_t>(&iterator_t));
return iterator_t;
}

std::string get_input2_iterator_name()
{
std::string iterator_t;
check(nvrtcGetTypeName<input2_iterator_t>(&iterator_t));
return iterator_t;
}

std::string get_output_iterator_name()
{
std::string iterator_t;
check(nvrtcGetTypeName<output_iterator_t>(&iterator_t));
return iterator_t;
}
Contributor

Georgii recently suggested returning the string with the type name directly, which I agree looks cleaner. The existing c.parallel algorithms are inconsistent in how they do this, which we should resolve eventually.

Contributor Author

@shwina shwina Mar 12, 2025

I've refactored this in 6eb5e53. I tried to introduce a get_iterator_name function that is a bit more general and could be reused across all the algorithms. It relies on defining the iterator names as constexpr strings that serve as the source of truth. Please let me know if you think it's an acceptable pattern and I'll go ahead and change this for the other algorithms as well.

Contributor

yeah that looks good

Comment on lines 131 to 139
const std::string input_iterator_t =
input_it.type == cccl_iterator_kind_t::CCCL_POINTER //
? cccl_type_enum_to_name<input_storage_t>(input_it.value_type.type, true) //
: transform::get_input_iterator_name();

const std::string output_iterator_t =
(output_it.type == cccl_iterator_kind_t::CCCL_POINTER //
? cccl_type_enum_to_name<output_storage_t>(output_it.value_type.type, true) //
: transform::get_output_iterator_name());
Contributor

Maybe refactor the logic on how we get the type into a separate function to avoid repetition? I do something similar in

const std::string input_keys_iterator_t = get_iterator_name(input_keys_it, unique_by_key_iterator_t::input_keys);
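A rough sketch of what such a shared helper could look like is below. Everything in it is a simplified stand-in so that it compiles on its own: `iterator_kind`, `iterator_desc`, and `get_iterator_name_sketch` are hypothetical, and the pointer branch assumes that the `true` argument to `cccl_type_enum_to_name` in the snippets above requests a pointer type name.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-ins for cccl_iterator_kind_t and cccl_iterator_t.
enum class iterator_kind
{
  pointer,
  iterator
};

struct iterator_desc
{
  iterator_kind kind;
  std::string value_type_name; // e.g. "int"
};

// One helper replaces the repeated ternary: raw pointers use the value type
// name plus '*', custom iterators use the name passed in by the caller.
std::string get_iterator_name_sketch(const iterator_desc& it, std::string_view custom_name)
{
  if (it.kind == iterator_kind::pointer)
  {
    return it.value_type_name + "*";
  }
  return std::string(custom_name);
}

// Usage: the same call covers every input and output iterator.
void example_iterator_names()
{
  const iterator_desc in{iterator_kind::pointer, "int"};
  const iterator_desc out{iterator_kind::iterator, "float"};
  assert(get_iterator_name_sketch(in, "input_iterator_t") == "int*");
  assert(get_iterator_name_sketch(out, "output_iterator_t") == "output_iterator_t");
}
```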

Comment on lines 162 to 175
const std::string input1_iterator_t =
input1_it.type == cccl_iterator_kind_t::CCCL_POINTER //
? cccl_type_enum_to_name<input1_storage_t>(input1_it.value_type.type, true) //
: transform::get_input1_iterator_name();

const std::string input2_iterator_t =
input2_it.type == cccl_iterator_kind_t::CCCL_POINTER //
? cccl_type_enum_to_name<input2_storage_t>(input2_it.value_type.type, true) //
: transform::get_input2_iterator_name();

const std::string output_iterator_t =
(output_it.type == cccl_iterator_kind_t::CCCL_POINTER //
? cccl_type_enum_to_name<output_storage_t>(output_it.value_type.type, true) //
: transform::get_output_iterator_name());
Contributor

Similar comment to above

Comment on lines 251 to 254
const std::string input_iterator_src =
make_kernel_input_iterator(offset_t, "input_iterator_t", input_it_value_t, input_it);
const std::string output_iterator_src =
make_kernel_output_iterator(offset_t, "output_iterator_t", output_it_value_t, output_it);
Contributor

For consistency, instead of "input_iterator_t" maybe use get_input_iterator_name()? (Same comment for output_iterator_t)

catch (const std::exception& exc)
{
fflush(stderr);
printf("\nEXCEPTION in cccl_device_transform(): %s\n", exc.what());
Contributor

Suggested change
- printf("\nEXCEPTION in cccl_device_transform(): %s\n", exc.what());
+ printf("\nEXCEPTION in cccl_device_unary_transform(): %s\n", exc.what());

Comment on lines 426 to 434
const std::string input1_iterator_src =
make_kernel_input_iterator(offset_t, "input1_iterator_t", input1_it_value_t, input1_it);
const std::string input2_iterator_src =
make_kernel_input_iterator(offset_t, "input2_iterator_t", input2_it_value_t, input2_it);

const std::string output_iterator_src =
make_kernel_output_iterator(offset_t, "output_iterator_t", output_it_value_t, output_it);
const std::string op_src =
make_kernel_user_binary_operator(input1_it_value_t, input2_it_value_t, output_it_value_t, op);
Contributor

Similar comment to above regarding using the functions to get the iterator name

check(cuCtxGetDevice(&cu_device));
auto cuda_error = cub::detail::transform::dispatch_t<
cub::detail::transform::requires_stable_address::no, // TODO implement yes
::cuda::std::int64_t,
Contributor

Similar comment to above

catch (const std::exception& exc)
{
fflush(stderr);
printf("\nEXCEPTION in cccl_device_transform(): %s\n", exc.what());
Contributor

Suggested change
- printf("\nEXCEPTION in cccl_device_transform(): %s\n", exc.what());
+ printf("\nEXCEPTION in cccl_device_binary_transform(): %s\n", exc.what());


#include <format>
#include <iostream>
#include <optional>
Contributor

This header seems unused?

@github-actions
Contributor

🟩 CI finished in 1h 36m: Pass: 100%/93 | Total: 2d 15h | Avg: 41m 01s | Max: 1h 22m | Hits: 74%/133898
  • 🟩 cub: Pass: 100%/45 | Total: 1d 16h | Avg: 53m 25s | Max: 1h 22m | Hits: 70%/53614

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  1d 14h | Avg: 53m 11s | Max:  1h 22m | Hits:  70%/51178 
      🟩 arm64              Pass: 100%/2   | Total:  1h 57m | Avg: 58m 41s | Max: 59m 22s | Hits:  68%/2436  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 00m | Avg:  1h 00m | Max:  1h 06m | Hits:  58%/5922  
      🟩 12.5               Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 12m | Hits:  68%/2254  
      🟩 12.8               Pass: 100%/38  | Total:  1d 08h | Avg: 51m 36s | Max:  1h 22m | Hits:  71%/45438 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 58m | Avg: 59m 15s | Max: 59m 30s | Hits:  74%/2104  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 00m | Avg:  1h 00m | Max:  1h 06m | Hits:  58%/5922  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 12m | Hits:  68%/2254  
      🟩 nvcc12.8           Pass: 100%/36  | Total:  1d 06h | Avg: 51m 10s | Max:  1h 22m | Hits:  71%/43334 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 58m | Avg: 59m 15s | Max: 59m 30s | Hits:  74%/2104  
      🟩 nvcc               Pass: 100%/43  | Total:  1d 14h | Avg: 53m 09s | Max:  1h 22m | Hits:  69%/51510 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 56m | Avg: 59m 05s | Max:  1h 00m | Hits:  68%/4880  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 50m | Avg: 55m 07s | Max: 56m 09s | Hits:  68%/2436  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 50m | Avg: 55m 27s | Max: 56m 44s | Hits:  68%/2436  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 57m | Avg: 58m 44s | Max:  1h 02m | Hits:  68%/2436  
      🟩 Clang18            Pass: 100%/7   | Total:  5h 36m | Avg: 48m 06s | Max: 59m 30s | Hits:  79%/8194  
      🟩 GCC7               Pass: 100%/2   | Total:  1h 53m | Avg: 56m 48s | Max: 57m 42s | Hits:  68%/2440  
      🟩 GCC8               Pass: 100%/1   | Total: 56m 16s | Avg: 56m 16s | Max: 56m 16s | Hits:  68%/1220  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 54m | Avg: 57m 26s | Max: 57m 37s | Hits:  68%/2440  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 57m | Avg: 58m 48s | Max:  1h 00m | Hits:  68%/2440  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 34s | Max: 59m 41s | Hits:  68%/2436  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 54m | Avg: 57m 28s | Max: 58m 03s | Hits:  68%/2436  
      🟩 GCC13              Pass: 100%/11  | Total:  6h 53m | Avg: 37m 37s | Max:  1h 10m | Hits:  85%/13398 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 28m | Avg:  1h 14m | Max:  1h 22m | Hits:  14%/2084  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 35m | Avg:  1h 17m | Max:  1h 18m | Hits:  14%/2084  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 12m | Hits:  68%/2254  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total: 15h 11m | Avg: 53m 38s | Max:  1h 02m | Hits:  72%/20382 
      🟩 GCC                Pass: 100%/22  | Total: 17h 26m | Avg: 47m 33s | Max:  1h 10m | Hits:  76%/26810 
      🟩 MSVC               Pass: 100%/4   | Total:  5h 03m | Avg:  1h 15m | Max:  1h 22m | Hits:  14%/4168  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 12m | Hits:  68%/2254  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total:  1h 11m | Avg: 23m 53s | Max: 25m 19s | Hits:  89%/3654  
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 10h | Avg:  1h 01m | Max:  1h 22m | Hits:  63%/40216 
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 10m | Avg: 31m 22s | Max:  1h 02m | Hits:  91%/9744  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 13h | Avg:  1h 00m | Max:  1h 22m | Hits:  63%/43870 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 23s | Avg: 21m 23s | Max: 21m 23s | Hits:  99%/1218  
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 11s | Avg: 17m 11s | Max: 17m 11s | Hits:  99%/1218  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 10m | Avg: 23m 27s | Max: 23m 49s | Hits:  99%/3654  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 07m | Avg: 22m 31s | Max: 23m 30s | Hits:  99%/3654  
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total:  1h 11m | Avg: 23m 53s | Max: 25m 19s | Hits:  89%/3654  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 10m | Avg:  1h 10m | Max:  1h 10m | Hits:  68%/1218  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 20h 30m | Avg:  1h 01m | Max:  1h 22m | Hits:  61%/23591 
      🟩 20                 Pass: 100%/25  | Total: 19h 34m | Avg: 46m 57s | Max:  1h 16m | Hits:  76%/30023 
    
  • 🟩 thrust: Pass: 100%/45 | Total: 22h 09m | Avg: 29m 32s | Max: 59m 59s | Hits: 77%/79956

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 35m 21s | Avg: 17m 40s | Max: 24m 03s | Hits:  88%/3556  
    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total: 21h 17m | Avg: 29m 42s | Max: 59m 59s | Hits:  77%/76401 
      🟩 arm64              Pass: 100%/2   | Total: 51m 18s | Avg: 25m 39s | Max: 27m 17s | Hits:  77%/3555  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  2h 45m | Avg: 33m 11s | Max: 52m 54s | Hits:  72%/8881  
      🟩 12.5               Pass: 100%/2   | Total:  1h 43m | Avg: 51m 39s | Max: 53m 39s | Hits:  65%/3554  
      🟩 12.8               Pass: 100%/38  | Total: 17h 39m | Avg: 27m 53s | Max: 59m 59s | Hits:  78%/67521 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 49m 54s | Avg: 24m 57s | Max: 26m 32s | Hits:  77%/3554  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  2h 45m | Avg: 33m 11s | Max: 52m 54s | Hits:  72%/8881  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 43m | Avg: 51m 39s | Max: 53m 39s | Hits:  65%/3554  
      🟩 nvcc12.8           Pass: 100%/36  | Total: 16h 49m | Avg: 28m 03s | Max: 59m 59s | Hits:  79%/63967 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 49m 54s | Avg: 24m 57s | Max: 26m 32s | Hits:  77%/3554  
      🟩 nvcc               Pass: 100%/43  | Total: 21h 19m | Avg: 29m 44s | Max: 59m 59s | Hits:  77%/76402 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 50m | Avg: 27m 34s | Max: 28m 18s | Hits:  77%/7108  
      🟩 Clang15            Pass: 100%/2   | Total: 56m 31s | Avg: 28m 15s | Max: 29m 18s | Hits:  77%/3554  
      🟩 Clang16            Pass: 100%/2   | Total: 56m 27s | Avg: 28m 13s | Max: 28m 25s | Hits:  77%/3554  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 01m | Avg: 30m 35s | Max: 31m 07s | Hits:  77%/3554  
      🟩 Clang18            Pass: 100%/7   | Total:  2h 25m | Avg: 20m 43s | Max: 27m 02s | Hits:  83%/12439 
      🟩 GCC7               Pass: 100%/2   | Total: 59m 44s | Avg: 29m 52s | Max: 30m 31s | Hits:  77%/3556  
      🟩 GCC8               Pass: 100%/1   | Total: 29m 57s | Avg: 29m 57s | Max: 29m 57s | Hits:  77%/1778  
      🟩 GCC9               Pass: 100%/2   | Total: 58m 04s | Avg: 29m 02s | Max: 29m 04s | Hits:  77%/3556  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 02m | Avg: 31m 17s | Max: 31m 32s | Hits:  77%/3556  
      🟩 GCC11              Pass: 100%/2   | Total: 59m 20s | Avg: 29m 40s | Max: 30m 53s | Hits:  77%/3556  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 01m | Avg: 30m 57s | Max: 33m 20s | Hits:  77%/3556  
      🟩 GCC13              Pass: 100%/10  | Total:  3h 25m | Avg: 20m 35s | Max: 32m 49s | Hits:  86%/17780 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 51m | Avg: 55m 40s | Max: 58m 27s | Hits:  54%/3542  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 27m | Avg: 49m 08s | Max: 59m 59s | Hits:  60%/5313  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 43m | Avg: 51m 39s | Max: 53m 39s | Hits:  65%/3554  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  7h 09m | Avg: 25m 15s | Max: 31m 07s | Hits:  80%/30209 
      🟩 GCC                Pass: 100%/21  | Total:  8h 57m | Avg: 25m 35s | Max: 33m 20s | Hits:  81%/37338 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 18m | Avg: 51m 45s | Max: 59m 59s | Hits:  58%/8855  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 43m | Avg: 51m 39s | Max: 53m 39s | Hits:  65%/3554  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 29m 36s | Avg: 14m 48s | Max: 18m 03s | Hits:  88%/3556  
      🟩 rtx2080            Pass: 100%/33  | Total: 17h 59m | Avg: 32m 41s | Max: 59m 59s | Hits:  74%/58637 
      🟩 rtx4090            Pass: 100%/10  | Total:  3h 40m | Avg: 22m 02s | Max: 55m 24s | Hits:  85%/17763 
    🟩 jobs
      🟩 Build              Pass: 100%/38  | Total: 20h 36m | Avg: 32m 32s | Max: 59m 59s | Hits:  74%/67519 
      🟩 TestCPU            Pass: 100%/3   | Total: 47m 57s | Avg: 15m 59s | Max: 32m 03s | Hits:  90%/5326  
      🟩 TestGPU            Pass: 100%/4   | Total: 44m 42s | Avg: 11m 10s | Max: 11m 33s | Hits:  99%/7111  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 29m 36s | Avg: 14m 48s | Max: 18m 03s | Hits:  88%/3556  
      🟩 90;90a;100         Pass: 100%/1   | Total: 31m 04s | Avg: 31m 04s | Max: 31m 04s | Hits:  77%/1778  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 11h 30m | Avg: 34m 31s | Max: 59m 59s | Hits:  73%/35531 
      🟩 20                 Pass: 100%/23  | Total: 10h 03m | Avg: 26m 13s | Max: 55m 24s | Hits:  80%/40869 
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 18m 49s | Avg: 9m 24s | Max: 16m 13s | Hits: 98%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 18m 49s | Avg:  9m 24s | Max: 16m 13s | Hits:  98%/328   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 18m 49s | Avg:  9m 24s | Max: 16m 13s | Hits:  98%/328   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 18m 49s | Avg:  9m 24s | Max: 16m 13s | Hits:  98%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 18m 49s | Avg:  9m 24s | Max: 16m 13s | Hits:  98%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 18m 49s | Avg:  9m 24s | Max: 16m 13s | Hits:  98%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 18m 49s | Avg:  9m 24s | Max: 16m 13s | Hits:  98%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 18m 49s | Avg:  9m 24s | Max: 16m 13s | Hits:  98%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 36s | Avg:  2m 36s | Max:  2m 36s | Hits:  98%/164   
      🟩 Test               Pass: 100%/1   | Total: 16m 13s | Avg: 16m 13s | Max: 16m 13s | Hits:  98%/164   
    
  • 🟩 python: Pass: 100%/1 | Total: 1h 02m | Avg: 1h 02m | Max: 1h 02m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
    

👃 Inspect Changes

Modifications in project?

     Project
     CCCL Infrastructure
     libcu++
+/-  CUB
     Thrust
     CUDA Experimental
     python
+/-  CCCL C Parallel Library
     Catch2Helper

Modifications in project or dependencies?

     Project
     CCCL Infrastructure
     libcu++
+/-  CUB
+/-  Thrust
     CUDA Experimental
+/-  python
+/-  CCCL C Parallel Library
+/-  Catch2Helper

🏃‍ Runner counts (total jobs: 93)

# Runner
66 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1

_CCCL_HOST_DEVICE auto make_iterator_kernel_arg(It it) -> kernel_arg<It>
{
kernel_arg<It> arg;
// since we switch the active member of the union, we must use placement new or construct_at. This also uses the copy
Contributor

This comment should stay

Contributor Author

Oops - fixed.
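
For readers following the thread, the code comment under discussion refers to switching the active member of a union. The sketch below illustrates that rule in isolation. It is not the CUB implementation: `arg_slot`, `make_arg_slot`, and `counting_iterator` are hypothetical names standing in for `kernel_arg<It>`, `make_iterator_kernel_arg`, and an arbitrary iterator type, and the host-only code omits `_CCCL_HOST_DEVICE`.

```cpp
#include <new>

// Hypothetical iterator type; stands in for whatever `It` is in the real code.
struct counting_iterator
{
  int value;
  explicit counting_iterator(int v) : value(v) {}
};

// Hypothetical stand-in for kernel_arg<It>: after default construction the
// active member is `placeholder`, not `iterator`.
template <typename It>
union arg_slot
{
  char placeholder;
  It iterator;
  arg_slot() : placeholder(0) {}
};

template <typename It>
arg_slot<It> make_arg_slot(It it)
{
  arg_slot<It> arg;
  // For a general It, `arg.iterator = it` would call operator= on an object
  // whose lifetime has not started. Placement new begins the lifetime of
  // `iterator` (making it the active member) by running It's copy constructor.
  ::new (static_cast<void*>(&arg.iterator)) It(it);
  return arg;
}
```

Calling `make_arg_slot(counting_iterator{42})` yields a union whose `iterator` member is active and holds 42. `std::construct_at` (C++20) would be an equivalent way to start the member's lifetime.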

@github-actions
Contributor

🟩 CI finished in 2h 42m: Pass: 100%/93 | Total: 2d 14h | Avg: 40m 20s | Max: 1h 17m | Hits: 75%/133898
  • 🟩 cub: Pass: 100%/45 | Total: 1d 15h | Avg: 53m 00s | Max: 1h 17m | Hits: 70%/53614

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  1d 13h | Avg: 52m 45s | Max:  1h 17m | Hits:  70%/51178 
      🟩 arm64              Pass: 100%/2   | Total:  1h 56m | Avg: 58m 13s | Max: 58m 45s | Hits:  69%/2436  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 05m | Avg:  1h 01m | Max:  1h 10m | Hits:  59%/5922  
      🟩 12.5               Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  68%/2254  
      🟩 12.8               Pass: 100%/38  | Total:  1d 08h | Avg: 51m 14s | Max:  1h 17m | Hits:  72%/45438 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 58m | Avg: 59m 28s | Max:  1h 01m | Hits:  76%/2104  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 05m | Avg:  1h 01m | Max:  1h 10m | Hits:  59%/5922  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  68%/2254  
      🟩 nvcc12.8           Pass: 100%/36  | Total:  1d 06h | Avg: 50m 46s | Max:  1h 17m | Hits:  71%/43334 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 58m | Avg: 59m 28s | Max:  1h 01m | Hits:  76%/2104  
      🟩 nvcc               Pass: 100%/43  | Total:  1d 13h | Avg: 52m 42s | Max:  1h 17m | Hits:  70%/51510 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 52m | Avg: 58m 07s | Max:  1h 01m | Hits:  69%/4880  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 53m | Avg: 56m 55s | Max: 58m 56s | Hits:  69%/2436  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 51m | Avg: 55m 46s | Max: 57m 05s | Hits:  69%/2436  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 56m | Avg: 58m 13s | Max: 59m 38s | Hits:  69%/2436  
      🟩 Clang18            Pass: 100%/7   | Total:  5h 39m | Avg: 48m 27s | Max:  1h 01m | Hits:  80%/8194  
      🟩 GCC7               Pass: 100%/2   | Total:  1h 55m | Avg: 57m 50s | Max: 58m 05s | Hits:  68%/2440  
      🟩 GCC8               Pass: 100%/1   | Total: 57m 51s | Avg: 57m 51s | Max: 57m 51s | Hits:  68%/1220  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 58m | Avg: 59m 05s | Max:  1h 02m | Hits:  68%/2440  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 58m | Avg: 59m 27s | Max:  1h 00m | Hits:  68%/2440  
      🟩 GCC11              Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 01m | Hits:  68%/2436  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 52m | Avg: 56m 13s | Max: 56m 30s | Hits:  68%/2436  
      🟩 GCC13              Pass: 100%/11  | Total:  6h 33m | Avg: 35m 46s | Max:  1h 09m | Hits:  85%/13398 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 27m | Avg:  1h 13m | Max:  1h 17m | Hits:  14%/2084  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 34m | Avg:  1h 17m | Max:  1h 17m | Hits:  14%/2084  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  68%/2254  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total: 15h 13m | Avg: 53m 44s | Max:  1h 01m | Hits:  73%/20382 
      🟩 GCC                Pass: 100%/22  | Total: 17h 16m | Avg: 47m 07s | Max:  1h 09m | Hits:  77%/26810 
      🟩 MSVC               Pass: 100%/4   | Total:  5h 01m | Avg:  1h 15m | Max:  1h 17m | Hits:  14%/4168  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  68%/2254  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total:  1h 10m | Avg: 23m 31s | Max: 23m 39s | Hits:  89%/3654  
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 10h | Avg:  1h 00m | Max:  1h 17m | Hits:  63%/40216 
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 01m | Avg: 30m 11s | Max: 57m 16s | Hits:  92%/9744  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 12h | Avg: 59m 43s | Max:  1h 17m | Hits:  64%/43870 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 21s | Avg: 21m 21s | Max: 21m 21s | Hits:  99%/1218  
      🟩 GraphCapture       Pass: 100%/1   | Total: 16m 48s | Avg: 16m 48s | Max: 16m 48s | Hits:  99%/1218  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 09m | Avg: 23m 08s | Max: 23m 25s | Hits:  99%/3654  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 07m | Avg: 22m 33s | Max: 23m 39s | Hits:  99%/3654  
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total:  1h 10m | Avg: 23m 31s | Max: 23m 39s | Hits:  89%/3654  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 09m | Avg:  1h 09m | Max:  1h 09m | Hits:  68%/1218  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 20h 17m | Avg:  1h 00m | Max:  1h 17m | Hits:  61%/23591 
      🟩 20                 Pass: 100%/25  | Total: 19h 27m | Avg: 46m 43s | Max:  1h 16m | Hits:  77%/30023 
    
  • 🟩 thrust: Pass: 100%/45 | Total: 21h 26m | Avg: 28m 34s | Max: 1h 01m | Hits: 79%/79956

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 35m 46s | Avg: 17m 53s | Max: 24m 23s | Hits:  89%/3556  
    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total: 20h 36m | Avg: 28m 46s | Max:  1h 01m | Hits:  79%/76401 
      🟩 arm64              Pass: 100%/2   | Total: 49m 10s | Avg: 24m 35s | Max: 26m 22s | Hits:  80%/3555  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  2h 47m | Avg: 33m 32s | Max: 56m 50s | Hits:  74%/8881  
      🟩 12.5               Pass: 100%/2   | Total:  1h 39m | Avg: 49m 43s | Max: 50m 08s | Hits:  65%/3554  
      🟩 12.8               Pass: 100%/38  | Total: 16h 59m | Avg: 26m 48s | Max:  1h 01m | Hits:  80%/67521 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 45m 28s | Avg: 22m 44s | Max: 22m 45s | Hits:  80%/3554  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  2h 47m | Avg: 33m 32s | Max: 56m 50s | Hits:  74%/8881  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 39m | Avg: 49m 43s | Max: 50m 08s | Hits:  65%/3554  
      🟩 nvcc12.8           Pass: 100%/36  | Total: 16h 13m | Avg: 27m 02s | Max:  1h 01m | Hits:  80%/63967 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 45m 28s | Avg: 22m 44s | Max: 22m 45s | Hits:  80%/3554  
      🟩 nvcc               Pass: 100%/43  | Total: 20h 40m | Avg: 28m 51s | Max:  1h 01m | Hits:  78%/76402 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 43m | Avg: 25m 59s | Max: 27m 17s | Hits:  79%/7108  
      🟩 Clang15            Pass: 100%/2   | Total: 55m 23s | Avg: 27m 41s | Max: 28m 59s | Hits:  79%/3554  
      🟩 Clang16            Pass: 100%/2   | Total: 59m 17s | Avg: 29m 38s | Max: 29m 45s | Hits:  79%/3554  
      🟩 Clang17            Pass: 100%/2   | Total: 57m 04s | Avg: 28m 32s | Max: 29m 52s | Hits:  79%/3554  
      🟩 Clang18            Pass: 100%/7   | Total:  2h 20m | Avg: 20m 00s | Max: 27m 51s | Hits:  85%/12439 
      🟩 GCC7               Pass: 100%/2   | Total: 55m 05s | Avg: 27m 32s | Max: 28m 14s | Hits:  79%/3556  
      🟩 GCC8               Pass: 100%/1   | Total: 26m 23s | Avg: 26m 23s | Max: 26m 23s | Hits:  78%/1778  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 00m | Avg: 30m 04s | Max: 31m 00s | Hits:  79%/3556  
      🟩 GCC10              Pass: 100%/2   | Total: 56m 15s | Avg: 28m 07s | Max: 28m 50s | Hits:  79%/3556  
      🟩 GCC11              Pass: 100%/2   | Total: 55m 18s | Avg: 27m 39s | Max: 28m 06s | Hits:  79%/3556  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 00s | Max: 31m 21s | Hits:  78%/3556  
      🟩 GCC13              Pass: 100%/10  | Total:  3h 18m | Avg: 19m 49s | Max: 30m 01s | Hits:  87%/17780 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 49m | Avg: 54m 50s | Max: 56m 50s | Hits:  54%/3542  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 29m | Avg: 49m 58s | Max:  1h 01m | Hits:  60%/5313  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 39m | Avg: 49m 43s | Max: 50m 08s | Hits:  65%/3554  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  6h 55m | Avg: 24m 27s | Max: 29m 52s | Hits:  81%/30209 
      🟩 GCC                Pass: 100%/21  | Total:  8h 31m | Avg: 24m 21s | Max: 31m 21s | Hits:  83%/37338 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 19m | Avg: 51m 54s | Max:  1h 01m | Hits:  58%/8855  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 39m | Avg: 49m 43s | Max: 50m 08s | Hits:  65%/3554  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 27m 50s | Avg: 13m 55s | Max: 16m 16s | Hits:  89%/3556  
      🟩 rtx2080            Pass: 100%/33  | Total: 17h 12m | Avg: 31m 16s | Max: 56m 50s | Hits:  76%/58637 
      🟩 rtx4090            Pass: 100%/10  | Total:  3h 46m | Avg: 22m 37s | Max:  1h 01m | Hits:  86%/17763 
    🟩 jobs
      🟩 Build              Pass: 100%/38  | Total: 19h 48m | Avg: 31m 17s | Max:  1h 01m | Hits:  75%/67519 
      🟩 TestCPU            Pass: 100%/3   | Total: 52m 31s | Avg: 17m 30s | Max: 36m 22s | Hits:  90%/5326  
      🟩 TestGPU            Pass: 100%/4   | Total: 44m 44s | Avg: 11m 11s | Max: 11m 35s | Hits:  99%/7111  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 27m 50s | Avg: 13m 55s | Max: 16m 16s | Hits:  89%/3556  
      🟩 90;90a;100         Pass: 100%/1   | Total: 30m 01s | Avg: 30m 01s | Max: 30m 01s | Hits:  78%/1778  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 10h 57m | Avg: 32m 52s | Max: 56m 50s | Hits:  74%/35531 
      🟩 20                 Pass: 100%/23  | Total:  9h 52m | Avg: 25m 46s | Max:  1h 01m | Hits:  81%/40869 
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 18m 59s | Avg: 9m 29s | Max: 16m 45s | Hits: 98%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 18m 59s | Avg:  9m 29s | Max: 16m 45s | Hits:  98%/328   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 18m 59s | Avg:  9m 29s | Max: 16m 45s | Hits:  98%/328   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 18m 59s | Avg:  9m 29s | Max: 16m 45s | Hits:  98%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 18m 59s | Avg:  9m 29s | Max: 16m 45s | Hits:  98%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 18m 59s | Avg:  9m 29s | Max: 16m 45s | Hits:  98%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 18m 59s | Avg:  9m 29s | Max: 16m 45s | Hits:  98%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 18m 59s | Avg:  9m 29s | Max: 16m 45s | Hits:  98%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 14s | Avg:  2m 14s | Max:  2m 14s | Hits:  98%/164   
      🟩 Test               Pass: 100%/1   | Total: 16m 45s | Avg: 16m 45s | Max: 16m 45s | Hits:  98%/164   
    
  • 🟩 python: Pass: 100%/1 | Total: 1h 01m | Avg: 1h 01m | Max: 1h 01m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    

👃 Inspect Changes

Modifications in project?

     Project
     CCCL Infrastructure
     libcu++
+/-  CUB
     Thrust
     CUDA Experimental
     python
+/-  CCCL C Parallel Library
     Catch2Helper

Modifications in project or dependencies?

     Project
     CCCL Infrastructure
     libcu++
+/-  CUB
+/-  Thrust
     CUDA Experimental
+/-  python
+/-  CCCL C Parallel Library
+/-  Catch2Helper

🏃‍ Runner counts (total jobs: 93)

# Runner
66 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1

@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shwina
Contributor Author

shwina commented Mar 16, 2025

/ok to test

@github-actions
Contributor

🟩 CI finished in 2h 32m: Pass: 100%/93 | Total: 1d 01h | Avg: 16m 10s | Max: 1h 22m | Hits: 93%/133898
  • 🟩 cub: Pass: 100%/45 | Total: 14h 19m | Avg: 19m 06s | Max: 1h 22m | Hits: 91%/53614

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total: 14h 08m | Avg: 19m 44s | Max:  1h 22m | Hits:  90%/51178 
      🟩 arm64              Pass: 100%/2   | Total: 11m 14s | Avg:  5m 37s | Max:  5m 56s | Hits:  99%/2436  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 33m | Avg: 18m 40s | Max:  1h 10m | Hits:  85%/5922  
      🟩 12.5               Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m | Hits:  68%/2254  
      🟩 12.8               Pass: 100%/38  | Total: 10h 29m | Avg: 16m 34s | Max:  1h 22m | Hits:  93%/45438 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 07s | Avg:  5m 03s | Max:  5m 06s | Hits: 100%/2104  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 33m | Avg: 18m 40s | Max:  1h 10m | Hits:  85%/5922  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m | Hits:  68%/2254  
      🟩 nvcc12.8           Pass: 100%/36  | Total: 10h 19m | Avg: 17m 12s | Max:  1h 22m | Hits:  92%/43334 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 07s | Avg:  5m 03s | Max:  5m 06s | Hits: 100%/2104  
      🟩 nvcc               Pass: 100%/43  | Total: 14h 09m | Avg: 19m 45s | Max:  1h 22m | Hits:  90%/51510 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 23m 44s | Avg:  5m 56s | Max:  6m 24s | Hits: 100%/4880  
      🟩 Clang15            Pass: 100%/2   | Total: 12m 20s | Avg:  6m 10s | Max:  6m 12s | Hits: 100%/2436  
      🟩 Clang16            Pass: 100%/2   | Total: 12m 28s | Avg:  6m 14s | Max:  6m 16s | Hits: 100%/2436  
      🟩 Clang17            Pass: 100%/2   | Total: 12m 51s | Avg:  6m 25s | Max:  6m 34s | Hits:  99%/2436  
      🟩 Clang18            Pass: 100%/7   | Total:  1h 11m | Avg: 10m 16s | Max: 23m 53s | Hits: 100%/8194  
      🟩 GCC7               Pass: 100%/2   | Total: 11m 44s | Avg:  5m 52s | Max:  6m 10s | Hits:  99%/2440  
      🟩 GCC8               Pass: 100%/1   | Total:  6m 04s | Avg:  6m 04s | Max:  6m 04s | Hits:  99%/1220  
      🟩 GCC9               Pass: 100%/2   | Total: 12m 19s | Avg:  6m 09s | Max:  6m 24s | Hits:  99%/2440  
      🟩 GCC10              Pass: 100%/2   | Total: 13m 31s | Avg:  6m 45s | Max:  6m 55s | Hits:  99%/2440  
      🟩 GCC11              Pass: 100%/2   | Total: 13m 06s | Avg:  6m 33s | Max:  6m 38s | Hits:  99%/2436  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 28s | Max: 54m 13s | Hits:  84%/2436  
      🟩 GCC13              Pass: 100%/11  | Total:  2h 46m | Avg: 15m 06s | Max: 24m 33s | Hits:  99%/13398 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 29m | Avg:  1h 14m | Max:  1h 18m | Hits:  15%/2084  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 36m | Avg:  1h 18m | Max:  1h 22m | Hits:  15%/2084  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m | Hits:  68%/2254  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  2h 13m | Avg:  7m 50s | Max: 23m 53s | Hits:  99%/20382 
      🟩 GCC                Pass: 100%/22  | Total:  4h 43m | Avg: 12m 54s | Max: 54m 13s | Hits:  98%/26810 
      🟩 MSVC               Pass: 100%/4   | Total:  5h 05m | Avg:  1h 16m | Max:  1h 22m | Hits:  15%/4168  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m | Hits:  68%/2254  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total: 53m 09s | Avg: 17m 43s | Max: 24m 33s | Hits:  99%/3654  
      🟩 rtx2080            Pass: 100%/34  | Total: 11h 03m | Avg: 19m 30s | Max:  1h 22m | Hits:  88%/40216 
      🟩 rtxa6000           Pass: 100%/8   | Total:  2h 23m | Avg: 17m 54s | Max: 23m 53s | Hits:  99%/9744  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total: 11h 20m | Avg: 18m 24s | Max:  1h 22m | Hits:  89%/43870 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 36s | Avg: 21m 36s | Max: 21m 36s | Hits:  99%/1218  
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 35s | Avg: 17m 35s | Max: 17m 35s | Hits:  99%/1218  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 11m | Avg: 23m 47s | Max: 23m 53s | Hits:  99%/3654  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 08m | Avg: 22m 47s | Max: 24m 33s | Hits:  99%/3654  
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total: 53m 09s | Avg: 17m 43s | Max: 24m 33s | Hits:  99%/3654  
      🟩 90;90a;100         Pass: 100%/1   | Total:  7m 23s | Avg:  7m 23s | Max:  7m 23s | Hits:  99%/1218  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total:  6h 28m | Avg: 19m 24s | Max:  1h 18m | Hits:  87%/23591 
      🟩 20                 Pass: 100%/25  | Total:  7h 51m | Avg: 18m 52s | Max:  1h 22m | Hits:  94%/30023 
    
  • 🟩 thrust: Pass: 100%/45 | Total: 9h 18m | Avg: 12m 24s | Max: 52m 10s | Hits: 95%/79956

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 17m 12s | Avg:  8m 36s | Max: 11m 18s | Hits:  99%/3556  
    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  9h 08m | Avg: 12m 45s | Max: 52m 10s | Hits:  94%/76401 
      🟩 arm64              Pass: 100%/2   | Total:  9m 45s | Avg:  4m 52s | Max:  5m 14s | Hits:  99%/3555  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 03m | Avg: 12m 41s | Max: 44m 04s | Hits:  94%/8881  
      🟩 12.5               Pass: 100%/2   | Total:  1h 43m | Avg: 51m 58s | Max: 52m 10s | Hits:  65%/3554  
      🟩 12.8               Pass: 100%/38  | Total:  6h 30m | Avg: 10m 17s | Max: 48m 50s | Hits:  96%/67521 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 06s | Avg:  5m 03s | Max:  5m 10s | Hits: 100%/3554  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 03m | Avg: 12m 41s | Max: 44m 04s | Hits:  94%/8881  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 43m | Avg: 51m 58s | Max: 52m 10s | Hits:  65%/3554  
      🟩 nvcc12.8           Pass: 100%/36  | Total:  6h 20m | Avg: 10m 34s | Max: 48m 50s | Hits:  96%/63967 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 06s | Avg:  5m 03s | Max:  5m 10s | Hits: 100%/3554  
      🟩 nvcc               Pass: 100%/43  | Total:  9h 08m | Avg: 12m 45s | Max: 52m 10s | Hits:  94%/76402 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 20m 19s | Avg:  5m 04s | Max:  5m 36s | Hits: 100%/7108  
      🟩 Clang15            Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  5m 42s | Hits: 100%/3554  
      🟩 Clang16            Pass: 100%/2   | Total: 10m 56s | Avg:  5m 28s | Max:  5m 30s | Hits: 100%/3554  
      🟩 Clang17            Pass: 100%/2   | Total: 11m 40s | Avg:  5m 50s | Max:  5m 54s | Hits: 100%/3554  
      🟩 Clang18            Pass: 100%/7   | Total: 42m 52s | Avg:  6m 07s | Max: 10m 17s | Hits: 100%/12439 
      🟩 GCC7               Pass: 100%/2   | Total: 10m 19s | Avg:  5m 09s | Max:  5m 14s | Hits:  99%/3556  
      🟩 GCC8               Pass: 100%/1   | Total:  5m 10s | Avg:  5m 10s | Max:  5m 10s | Hits:  99%/1778  
      🟩 GCC9               Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  5m 43s | Hits:  99%/3556  
      🟩 GCC10              Pass: 100%/2   | Total: 11m 19s | Avg:  5m 39s | Max:  6m 02s | Hits:  99%/3556  
      🟩 GCC11              Pass: 100%/2   | Total: 11m 38s | Avg:  5m 49s | Max:  5m 55s | Hits:  99%/3556  
      🟩 GCC12              Pass: 100%/2   | Total: 11m 43s | Avg:  5m 51s | Max:  5m 59s | Hits:  99%/3556  
      🟩 GCC13              Pass: 100%/10  | Total:  1h 17m | Avg:  7m 47s | Max: 12m 48s | Hits:  99%/17780 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 29m | Avg: 44m 38s | Max: 45m 12s | Hits:  70%/3542  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 09m | Avg: 43m 12s | Max: 48m 50s | Hits:  70%/5313  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 43m | Avg: 51m 58s | Max: 52m 10s | Hits:  65%/3554  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  1h 36m | Avg:  5m 42s | Max: 10m 17s | Hits: 100%/30209 
      🟩 GCC                Pass: 100%/21  | Total:  2h 18m | Avg:  6m 36s | Max: 12m 48s | Hits:  99%/37338 
      🟩 MSVC               Pass: 100%/5   | Total:  3h 38m | Avg: 43m 46s | Max: 48m 50s | Hits:  70%/8855  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 43m | Avg: 51m 58s | Max: 52m 10s | Hits:  65%/3554  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 17m 35s | Avg:  8m 47s | Max: 12m 48s | Hits:  99%/3556  
      🟩 rtx2080            Pass: 100%/33  | Total:  6h 32m | Avg: 11m 53s | Max: 52m 10s | Hits:  95%/58637 
      🟩 rtx4090            Pass: 100%/10  | Total:  2h 28m | Avg: 14m 50s | Max: 48m 50s | Hits:  94%/17763 
    🟩 jobs
      🟩 Build              Pass: 100%/38  | Total:  7h 43m | Avg: 12m 11s | Max: 52m 10s | Hits:  95%/67519 
      🟩 TestCPU            Pass: 100%/3   | Total: 49m 14s | Avg: 16m 24s | Max: 33m 52s | Hits:  90%/5326  
      🟩 TestGPU            Pass: 100%/4   | Total: 45m 53s | Avg: 11m 28s | Max: 12m 48s | Hits:  99%/7111  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 17m 35s | Avg:  8m 47s | Max: 12m 48s | Hits:  99%/3556  
      🟩 90;90a;100         Pass: 100%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s | Hits:  99%/1778  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total:  4h 35m | Avg: 13m 47s | Max: 51m 46s | Hits:  93%/35531 
      🟩 20                 Pass: 100%/23  | Total:  4h 25m | Avg: 11m 32s | Max: 52m 10s | Hits:  95%/40869 
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 18m 07s | Avg: 9m 03s | Max: 15m 51s | Hits: 98%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 18m 07s | Avg:  9m 03s | Max: 15m 51s | Hits:  98%/328   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 18m 07s | Avg:  9m 03s | Max: 15m 51s | Hits:  98%/328   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 18m 07s | Avg:  9m 03s | Max: 15m 51s | Hits:  98%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 18m 07s | Avg:  9m 03s | Max: 15m 51s | Hits:  98%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 18m 07s | Avg:  9m 03s | Max: 15m 51s | Hits:  98%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 18m 07s | Avg:  9m 03s | Max: 15m 51s | Hits:  98%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 18m 07s | Avg:  9m 03s | Max: 15m 51s | Hits:  98%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 16s | Avg:  2m 16s | Max:  2m 16s | Hits:  98%/164   
      🟩 Test               Pass: 100%/1   | Total: 15m 51s | Avg: 15m 51s | Max: 15m 51s | Hits:  98%/164   
    
  • 🟩 python: Pass: 100%/1 | Total: 1h 07m | Avg: 1h 07m | Max: 1h 07m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
    

👃 Inspect Changes

Modifications in project?

     Project
     CCCL Infrastructure
     libcu++
+/-  CUB
     Thrust
     CUDA Experimental
     python
+/-  CCCL C Parallel Library
     Catch2Helper

Modifications in project or dependencies?

     Project
     CCCL Infrastructure
     libcu++
+/-  CUB
+/-  Thrust
     CUDA Experimental
+/-  python
+/-  CCCL C Parallel Library
+/-  Catch2Helper

🏃‍ Runner counts (total jobs: 93)

# Runner
66 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1

Contributor

@bernhardmgruber bernhardmgruber left a comment

CUB changes LGTM

@shwina shwina merged commit 40eb959 into NVIDIA:main Mar 24, 2025
109 of 110 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Mar 24, 2025
caugonnet pushed a commit to caugonnet/cccl that referenced this pull request Mar 25, 2025
* Enable setting _CUB_HAS_TRANSFORM_UBLKCP as flag

* Stale comment

* Update make_kernel_user_binary_operator to accept lhs/rhs types

* Add transform c.parallel implementation

* Add tests for c.parallel transform

* Use launcher factory to query ptx version

* Refactor how we get iterator names

* Add comment explaining why we need -default-device

* Address remaining review feedback

* Undo change to make_iterator_kernel_arg

* Restore the comment

* Restore comment

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
caugonnet added a commit that referenced this pull request Mar 26, 2025
* Enable NVHPC in CUDASTF CI

* Use nvtx_range in the POTRI example

* no need to include nvtx3 anymore here

* Do not use nvtx3 directly

* fix compilation

* WIP: try to pass the type of the execution place to the parallel_for_scope

* pass the exec_place type to parallel_for_scope

* remove a device annotation for a host only lambda

* is_shape_of_v is not working

* experiment to see if we can avoid generating device code in parallel_for with a host place

* fixes for a previous conflict merge

* disable an invalid test

* better typing for ctx.parallel_for with a partition

* use ::std and more types for parallel_for

* Add transform c parallel implementation (#4048)

* Enable setting _CUB_HAS_TRANSFORM_UBLKCP as flag

* Stale comment

* Update make_kernel_user_binary_operator to accept lhs/rhs types

* Add transform c.parallel implementation

* Add tests for c.parallel transform

* Use launcher factory to query ptx version

* Refactor how we get iterator names

* Add comment explaining why we need -default-device

* Address remaining review feedback

* Undo change to make_iterator_kernel_arg

* Restore the comment

* Restore comment

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

* Drop duplicated system header blocks (#4245)

Those were accidentally duplicated

* Exclude sm101 from RDC testing. (#4247)

* Make `cuda::stream_ref` constructible on device (#4243)

* Make `cuda::stream_ref` constructible on device

There is no reason we should not be able to construct it or extract the pointer out of it on device.

* Pass by value

* Make nvbench work

* Fix logic in test_segmented_reduce (#4198)

* Fix logic in test_segmented_reduce, also test over different types of offsets

To resolve gh-4197, use `cupy.cumsum` to accumulate over random partition
sizes to form correct offsets sequence.

Add assertions to verify that `offsets` is a non-decreasing sequence, and
that its last element equals the size of the input array.

Perform the test for several plausible offset data types.

* Changes per PR review comments

1. Use `cupy.random` to draw random sample on GPU, rather than on CPU
   followed by a transfer
2. Use `cp.empty` to allocate output, rather than `cp.zeros`
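
The offsets construction described in the commit notes above (running sum over random partition sizes, then checking that the sequence is non-decreasing and ends at the input size) can be sketched as follows. This is a host-side C++ analogue, not the Python test itself: `std::partial_sum` stands in for `cupy.cumsum`, and `make_offsets` and `is_valid_offsets` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Draws random segment sizes that together cover `num_items` elements and
// returns their running sum with a leading zero, so offsets[i] is the start
// of segment i and offsets.back() is the end of the last segment.
// Assumes max_segment_size >= 1.
std::vector<std::size_t> make_offsets(std::size_t num_items, std::size_t max_segment_size, unsigned seed)
{
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> dist(1, max_segment_size);

  std::vector<std::size_t> sizes;
  std::size_t remaining = num_items;
  while (remaining > 0)
  {
    const std::size_t size = std::min(dist(rng), remaining); // clamp the last segment
    sizes.push_back(size);
    remaining -= size;
  }

  std::vector<std::size_t> offsets(sizes.size() + 1, 0);
  std::partial_sum(sizes.begin(), sizes.end(), offsets.begin() + 1);
  return offsets;
}

// The two properties the commit notes assert: non-decreasing offsets whose
// last element equals the size of the input array.
bool is_valid_offsets(const std::vector<std::size_t>& offsets, std::size_t num_items)
{
  return !offsets.empty() && std::is_sorted(offsets.begin(), offsets.end()) && offsets.back() == num_items;
}
```

Templating the element type of `offsets` would be one way to mirror the commit's coverage of several offset data types.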

* Add new `WarpReduce` overloadings (#3884)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

* Fix #4250 (#4251)

* Refactor fp masks (#4246)

* Implement `views::all` (#4244)

* Implement `views::all`

* [cudax] incorporate P3557 (constexpr completion signatures) into µstdex (#3841)

* incorporate P3557 (constexpr completion signatures) into ustdex

* remove the need for nvcc-specific workarounds

* review feedback

* tweaks to ustdex for clangd

* fix typo

* use concepts portability macros for C++17 support

* Add fixed size segmented reduce (#3969)

* Adds fixed size segmented reduce

* remove unnecessary changes

* revert changes

* Properly deal with parallel_for overloads with grids, add an is_host() for exec_place

* Remove test doing illegal things, and which is not really useful anymore

---------

Co-authored-by: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Allison Piper <alliepiper16@gmail.com>
Co-authored-by: Oleksandr Pavlyk <21087696+oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Federico Busato <50413820+fbusato@users.noreply.github.com>
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
Co-authored-by: Eric Niebler <eniebler@nvidia.com>
Co-authored-by: Srinivas Yadav <43375352+srinivasyadav18@users.noreply.github.com>
davebayer pushed a commit to davebayer/cccl that referenced this pull request Apr 7, 2025
* Enable setting _CUB_HAS_TRANSFORM_UBLKCP as flag

* Stale comment

* Update make_kernel_user_binary_operator to accept lhs/rhs types

* Add transform c.parallel implementation

* Add tests for c.parallel transform

* Use launcher factory to query ptx version

* Refactor how we get iterator names

* Add comment explaining why we need -default-device

* Address remaining review feedback

* Undo change to make_iterator_kernel_arg

* Restore the comment

* Restore comment

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
davebayer added a commit to davebayer/cccl that referenced this pull request Apr 7, 2025
* Enable NVHPC in CUDASTF CI

* Use nvtx_range in the POTRI example

* no need to include nvtx3 anymore here

* Do not use nvtx3 directly

* fix compilation

* WIP: try to pass the type of the execution place to the parallel_for_scope

* pass the exec_place type to parallel_for_scope

* remove a device annotation for a host only lambda

* is_shape_of_v is not working

* experiment to see if we can avoid generating device code in parallel_for with a host place

* fixes for a previous conflict merge

* disable an invalid test

* better typing for ctx.parallel_for with a partition

* use ::std and more types for parallel_for

* Add transform c parallel implementation (NVIDIA#4048)

* Enable setting _CUB_HAS_TRANSFORM_UBLKCP as flag

* Stale comment

* Update make_kernel_user_binary_operator to accept lhs/rhs types

* Add transform c.parallel implementation

* Add tests for c.parallel transform

* Use launcher factory to query ptx version

* Refactor how we get iterator names

* Add comment explaining why we need -default-device

* Address remaining review feedback

* Undo change to make_iterator_kernel_arg

* Restore the comment

* Restore comment

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

* Drop duplicated system header blocks (NVIDIA#4245)

Those were accidentally duplicated

* Exclude sm101 from RDC testing. (NVIDIA#4247)

* Make `cuda::stream_ref` constructible on device (NVIDIA#4243)

* Make `cuda::stream_ref` constructible on device

There is no reason we should not be able to construct it or extract the pointer out of it on device.

* Pass by value

* Make nvbench work

* Fix logic in test_segmented_reduce (NVIDIA#4198)

* Fix logic in test_segmented_reduce, also test over different types of offsets

To resolve NVIDIAgh-4197, use `cupy.cumsum` to accumulate over random partition
sizes to form correct offsets sequence.

Add assertions to verify that `offsets` is a non-decreasing sequence, and
that its last element equals the size of the input array.

Perform the test for several plausible offset data types.
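
The offsets construction described above can be sketched roughly as follows. This is a pure-Python illustration, not the test from the repository: it uses `itertools.accumulate` in place of `cupy.cumsum` so that it runs without a GPU, and the names `make_offsets` and `segmented_sum` are hypothetical helpers introduced only for this example.

```python
import random
from itertools import accumulate


def make_offsets(num_segments, max_segment_size, rng):
    """Accumulate random partition sizes into a segment offsets sequence.

    Hypothetical helper: stands in for drawing sizes with cupy.random
    and accumulating them with cupy.cumsum.
    """
    sizes = [rng.randint(0, max_segment_size) for _ in range(num_segments)]
    # Prepend 0 so that segment i spans offsets[i]:offsets[i + 1].
    return [0] + list(accumulate(sizes))


def segmented_sum(data, offsets):
    """Reference segmented reduction (sum) over the given offsets."""
    return [sum(data[start:end]) for start, end in zip(offsets, offsets[1:])]


rng = random.Random(0)
offsets = make_offsets(num_segments=5, max_segment_size=4, rng=rng)
data = list(range(offsets[-1]))  # input sized to match the last offset

# The two properties the commit message says the test asserts:
assert all(a <= b for a, b in zip(offsets, offsets[1:]))  # non-decreasing
assert offsets[-1] == len(data)  # last offset equals the input size

per_segment = segmented_sum(data, offsets)
assert len(per_segment) == 5
```

Because the sizes are non-negative, their running sum is non-decreasing by construction, and the segments partition the input exactly.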

* Changes per PR review comments

1. Use `cupy.random` to draw random samples on the GPU, rather than on the CPU
   followed by a transfer
2. Use `cp.empty` to allocate the output, rather than `cp.zeros`

* Add new `WarpReduce` overloadings (NVIDIA#3884)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

* Fix NVIDIA#4250 (NVIDIA#4251)

* Refactor fp masks (NVIDIA#4246)

* Implement `views::all` (NVIDIA#4244)

* Implement `views::all`

* [cudax] incorporate P3557 (constexpr completion signatures) into µstdex (NVIDIA#3841)

* incorporate P3557 (constexpr completion signatures) into ustdex

* remove the need for nvcc-specific workarounds

* review feedback

* tweaks to ustdex for clangd

* fix typo

* use concepts portability macros for C++17 support

* Add fixed size segmented reduce (NVIDIA#3969)

* Adds fixed size segmented reduce

* remove unnecessary changes

* revert changes

* Properly deal with parallel_for overloads with grids, add an is_host() for exec_place

* Remove test doing illegal things, and which is not really useful anymore

---------

Co-authored-by: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Allison Piper <alliepiper16@gmail.com>
Co-authored-by: Oleksandr Pavlyk <21087696+oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Federico Busato <50413820+fbusato@users.noreply.github.com>
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
Co-authored-by: Eric Niebler <eniebler@nvidia.com>
Co-authored-by: Srinivas Yadav <43375352+srinivasyadav18@users.noreply.github.com>
