Releases: NVIDIA/cccl
python-0.3.3
These are the release notes for the `cuda-cccl` Python package version 0.3.3, dated October 21st, 2025. The previous release was v0.3.2.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- This is the first release to publish Windows wheels to PyPI. You can now `pip install cuda-cccl[cu12]` or `pip install cuda-cccl[cu13]` on Windows for Python versions 3.10, 3.11, 3.12, and 3.13.
Bug Fixes
Breaking Changes
python-0.3.2
These are the release notes for the `cuda-cccl` Python package version 0.3.2, dated October 17th, 2025. The previous release was v0.3.1.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- Allow passing in a device array or `None` as the initial value in scan, as sketched below.
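A minimal sketch of what this enables is below. The entry-point names and argument order are assumptions modeled on the `reduce_into` example later in these notes, not the documented signatures; check the `cuda.compute` documentation before use.

```python
# Hypothetical sketch: function names and argument order are assumptions.
import cupy as cp
import numpy as np
import cuda.compute as compute

def add_op(a, b):
    return a + b

d_input = cp.asarray([1, 2, 3, 4], dtype=np.int32)
d_output = cp.empty_like(d_input)

# New in 0.3.2: the initial value may be a device array ...
d_init = cp.asarray([0], dtype=np.int32)
compute.exclusive_scan(d_input, d_output, add_op, d_input.size, d_init)

# ... or None, when no separate initial value is wanted
compute.inclusive_scan(d_input, d_output, add_op, d_input.size, None)
```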
Bug Fixes
Breaking Changes
v3.1.0
What's Changed
🚀 Thrust / CUB
- [Thrust] Perform asynchronous allocations by default for the `par_nosync` policy by @brycelelbach in #4204
- [Thrust] `reduce_into` by @brycelelbach in #4355
- Enable Catch2 tests in Thrust by @bernhardmgruber in #2669
- Add memcpy_async transform kernel for Ampere by @bernhardmgruber in #2394
- Allow default-initializing and skipping initialization of Thrust vectors by @bernhardmgruber in #4183
- Add thrust::strided_iterator and a step for thrust::counting_iterator by @bernhardmgruber in #4014
- Add new WarpReduce overloadings by @fbusato in #3884
- Optimize ThreadReduce by @fbusato in #3441
📚 Libcudacxx
- Enable device assertions in CUDA debug mode (`nvcc -G`) by @fbusato in #4444
- Avoid EDG bug by moving diagnostic push & pop out of templates by @ericniebler in #4416
- Add host/device/managed mdspan and accessors by @fbusato in #3686
- Add cuda::ptx::elect.sync by @fbusato in #4445
- Add pointer utilities cuda::is_aligned, cuda::align_up, cuda::align_down, cuda::ptr_rebind by @fbusato in #5037
- Add cuda::ceil_ilog2 by @fbusato in #4485
- Add cuda::is_power_of_two, cuda::next_power_of_two, cuda::prev_power_of_two by @fbusato in #4627
- Add cuda::device::warp_match_all by @fbusato in #4746
- Add cuda::static_for by @fbusato in #4855
- Improve/cleanup cuda::annotated_ptr implementation by @fbusato in #4503
- Add cuda::fast_mod_div Fast Modulo Division by @fbusato in #5210
📝 Documentation
- Making extended API documentation slightly more uniform by @fbusato in #4965
- Add memory space note to `cuda::memory` documentation by @fbusato in #5151
- Better specify `lane_mask::all_active()` behavior by @fbusato in #5183
🔄 Other Changes
- [CUDAX] Add universal comparison across memory resources by @pciolkosz in #4168
- Implement `ranges::range_adaptor` by @miscco in #4066
- Avoiding looping over problem size in individual tests by @oleksandr-pavlyk in #4140
- Replace CUB `util_arch.cuh` macros with `inline constexpr` variables by @fbusato in #4165
- Improves test times for `DeviceSegmentedRadixSort` by @elstehle in #4156
- Simplify Thrust iterator functions by @bernhardmgruber in #4178
- Remove `_LIBCUDACXX_UNUSED_VAR` by @davebayer in #4174
- Remove `_CCCL_NO_IF_CONSTEXPR` by @davebayer in #4187
- Implement `__fp_native_type_t` by @davebayer in #4173
- Adds support for large number of segments and large number of items to `DeviceSegmentedRadixSort` by @elstehle in #3402
- Implement inclusive scan in cuda.parallel by @NaderAlAwar in #4147
- Remove `_CCCL_NO_NOEXCEPT_FUNCTION_TYPE` by @davebayer in #4190
- Fix `not_fn` by @miscco in #4186
- Remove `_CCCL_NTTP_AUTO` by @davebayer in #4191
- Avoid instantiating discard_iterator while parsing by @bernhardmgruber in #4180
- Host/Device accessors for `mdspan` by @fbusato in #3686
- Remove `_CCCL_NO_DEDUCTION_GUIDES` by @davebayer in #4188
- Set NO_CMAKE_FIND_ROOT_PATH for cudax by @bdice in #4162
- Fix build breaking with setuptools by @miscco in #4212
- Replaces remaining uses of `thrust::{host,device}_vector` in our Catch2 tests by @elstehle in #4205
- Add check that CXX + CUDA_HOST compilers match when necessary by @alliepiper in #4201
- Disable test on 12.0 CTK by @miscco in #4214
- Implement fp properties by @davebayer in #4213
- [CUDAX] Separate non-async pinned memory resource into legacy_pinned_memory_resource by @pciolkosz in #4179
- Avoid errors in `get_device_address` tests by @miscco in #4209
- Implement extended fp traits by @davebayer in #4211
- Remove `_CCCL_INLINE_VAR` by @davebayer in #4192
- Improve host/device mdspan documentation by @fbusato in #4220
- Drop `_LIBCUDACXX_BEGIN_NAMESPACE_RANGES_ABI` by @miscco in #4210
- Fix C++ version used in CONTRIBUTING.md by @bernhardmgruber in #4224
- Extend tuning documentation by @bernhardmgruber in #4184
- Drop tuning params for benchmarks with custom ops by @bernhardmgruber in #4176
- Make compiler version comparisons safer by @davebayer in #4185
- Document python packages for sol plot script by @bernhardmgruber in #4228
- Remove `_CCCL_NO_FOLD_EXPRESSIONS` by @davebayer in #4189
- Remove python/cuda_cooperative/setup.py by @rwgk in #4221
- Allow cuda::par*.on() to take cuda::stream_ref by @bernhardmgruber in #4225
- Drop `_CCCL_NO_VARIABLE_TEMPLATES` by @miscco in #4229
- Fix typos in cuda mdspan documentation by @fbusato in #4231
- Simplify Thrust assign_value by @bernhardmgruber in #4227
- Remove double underscore limit macros by @davebayer in #4194
- Document deprecations from #4165 by @bernhardmgruber in #4237
- Implement `__fp_is_subset` trait by @davebayer in #4230
- Extend tuning verification docs by @bernhardmgruber in #4236
- Use `[[maybe_unused]]` in whole cccl by @davebayer in #4207
- Move implementation of `cuda::std::array` to libcu++ by @davebayer in #4239
- Implement `__cccl_fp` class by @davebayer in #4238
- Add transform c.parallel implementation by @shwina in #4048
- Drop duplicated system header blocks by @miscco in #4245
- Exclude sm101 from RDC testing. by @alliepiper in #4247
- Make `cuda::stream_ref` constructible on device by @miscco in #4243
- Fix logic in test_segmented_reduce by @oleksandr-pavlyk in #4198
- Add new `WarpReduce` overloadings by @fbusato in #3884
- Fix construction of host init value in test_reduce made incorrect after refactoring by @oleksandr-pavlyk in #4251
- Refac...
python-0.3.1
These are the release notes for the `cuda-cccl` Python package version 0.3.1, dated October 8th, 2025. The previous release was v0.3.0.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- The `cuda.cccl.parallel.experimental` package has been renamed to `cuda.compute`.
- The `cuda.cccl.cooperative.experimental` package has been renamed to `cuda.coop`.
- The old imports will continue to work for now, but will be removed in a subsequent release.
- Documentation at https://nvidia.github.io/cccl/python/ has been updated to reflect these changes.
Bug Fixes
Breaking Changes
- If you were previously importing subpackages of `cuda.cccl.parallel.experimental` or `cuda.cccl.cooperative.experimental`, those imports may not work as expected. Please import from `cuda.compute` and `cuda.coop`, respectively.
v3.0.3
What's Changed
🔄 Other Changes
- Backport #5442 to branch/3.0x by @shwina in #5469
- Backport to 3.0: Fix grid dependency sync in cub::DeviceMergeSort (#5456) by @bernhardmgruber in #5461
- Partial backport to 3.0: Fix SMEM alignment in DeviceTransform by @bernhardmgruber in #5463
- [Version] Update branch/3.0.x to v3.0.3 by @github-actions[bot] in #5502
- [Backport branch/3.0.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in #5481
- [BACKPORT 3.0]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in #5700
- Backport some MSVC test fixes to 3.0 by @miscco in #5819
- [Backport 3.0]: Work around `submdspan` compiler issue on MSVC (#5885) by @miscco in #5903
- Backport pin of llvmlite dependency to branch/3.0x by @shwina in #6000
- [Backport branch/3.0.x] Ensure that we are actually calling the cuda APIs ... (#4570) by @davebayer in #6098
- [Backport to 3.0] Add a specialization of `__make_tuple_types` for `complex<T>` (#6102) by @davebayer in #6117
- [Backport 3.0.x] Use proper qualification in allocate.h (#4796) by @wmaxey in #6126
Full Changelog: v3.0.2...v3.0.3
CCCL Python Libraries (v0.3.0)
These are the release notes for the `cuda-cccl` Python package version 0.3.0, dated October 2nd, 2025. The previous release was v0.2.1.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- ARM64 wheel and conda package support: Installation via `pip` and `conda` is now supported on the ARM64 (aarch64) architecture.
- New algorithm: three-way partitioning: The `three_way_partition` algorithm partitions an array (or iterator) into three partitions, given two selection operators (see the sketch after this list).
- Improved scan performance: The `inclusive_scan` and `exclusive_scan` APIs provide improved performance by automatically selecting the optimal tuning for the input data types and device architecture.
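For illustration, a hypothetical sketch of the new algorithm follows. The argument order is an assumption modeled on CUB's three-way `DevicePartition` overload; consult the `cuda.cccl.parallel` documentation for the actual signature.

```python
# Hypothetical sketch: argument names and order are assumptions.
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

def less_than_3(x):
    return x < 3

def less_than_7(x):
    return x < 7

d_in = cp.asarray([5, 1, 8, 2, 9, 6], dtype=np.int32)
d_first = cp.empty_like(d_in)       # elements selected by less_than_3
d_second = cp.empty_like(d_in)      # remaining elements selected by less_than_7
d_unselected = cp.empty_like(d_in)  # everything else
d_num_selected = cp.empty(2, dtype=np.int64)  # sizes of the first two partitions

parallel.three_way_partition(
    d_in, d_first, d_second, d_unselected, d_num_selected,
    less_than_3, less_than_7, d_in.size,
)
```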
Bug Fixes
None.
Breaking Changes
None.
CCCL Python Libraries v0.1.3.2.0.dev128 (pre-release)
These are the changes in the `cuda.cccl` libraries introduced in the pre-release 0.1.3.2.0.dev128, dated August 14th, 2025.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Major API improvements
Single-call APIs in cuda.cccl.parallel
Previously, performing an operation like `reduce_into` required four API invocations:
(1) create a reducer object, (2) compute the amount of temporary storage required for the reduction,
(3) allocate the required amount of temporary memory, and (4) perform the reduction.
In this version, `cuda.cccl.parallel` introduces simpler, single-call APIs. For example, reduction looks like:
```python
# New API - single function call with automatic temp storage
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
```
If you wish to have more control over temporary memory allocation, the previous API still exists (and always will). It has been renamed from `reduce_into` to `make_reduce_into`:
```python
# Object API - explicit control over temporary storage
reducer = parallel.make_reduce_into(d_input, d_output, add_op, h_init)
# Calling with None for the temp storage returns the required size
temp_storage_size = reducer(None, d_input, d_output, num_items, h_init)
temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
# Calling again with the allocated buffer performs the reduction
reducer(temp_storage, d_input, d_output, num_items, h_init)
```
New algorithms
Device-wide histogram
The `histogram_even` function provides Python exposure of the corresponding CUB C++ API `DeviceHistogram::HistogramEven`.
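As a rough illustration, a hypothetical sketch follows. The parameter order is an assumption modeled on the C++ `DeviceHistogram::HistogramEven` parameters; consult the documentation for the actual Python signature.

```python
# Hypothetical sketch: parameter order is an assumption, not the documented API.
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

d_samples = cp.asarray([0.1, 2.5, 6.3, 7.0, 9.9], dtype=np.float32)
d_histogram = cp.zeros(5, dtype=np.int32)  # 5 evenly spaced bins over [0, 10)

# 6 level boundaries delimit the 5 bins: bin i covers [2*i, 2*(i+1))
parallel.histogram_even(d_samples, d_histogram, 6, 0.0, 10.0, d_samples.size)
```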
StripedToBlock exchange
`cuda.cccl.cooperative` adds a `block.exchange` API providing Python exposure of the corresponding CUB C++ API `BlockExchange`. Currently, only the `StripedToBlock` exchange pattern is supported.
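A hypothetical sketch of kernel-side usage follows. The factory signature, the `.files` link attribute, and the in-kernel call are all assumptions modeled on other `cuda.cccl.cooperative` block primitives; check the documentation for the real API.

```python
# Hypothetical sketch: every cuda.cccl.cooperative detail below (factory
# signature, .files attribute, in-kernel call) is an assumption.
import cuda.cccl.cooperative.experimental as coop
import numba
from numba import cuda  # numba's CUDA kernel API

threads_per_block = 128
items_per_thread = 4

# Build a block-level exchange for int32 items (striped -> blocked)
exchange = coop.block.exchange(numba.int32, threads_per_block, items_per_thread)

@cuda.jit(link=exchange.files)
def kernel(d_in, d_out):
    # Load in a striped arrangement: thread t owns d_in[t], d_in[t + 128], ...
    items = cuda.local.array(items_per_thread, numba.int32)
    for i in range(items_per_thread):
        items[i] = d_in[i * threads_per_block + cuda.threadIdx.x]
    exchange(items)  # rearrange striped -> blocked across the thread block
    # Store in a blocked arrangement: thread t owns a contiguous chunk
    for i in range(items_per_thread):
        d_out[cuda.threadIdx.x * items_per_thread + i] = items[i]
```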
Infrastructure improvements
CuPy dependency replaced with cuda.core
Use of CuPy within the library has been replaced with the lighter-weight `cuda.core` package. This means that installing `cuda.cccl` won't install CuPy as a dependency.
Support for CUDA 13 drivers
`cuda.cccl` can be used with CUDA 13 compatible drivers. However, the CUDA 13 toolkit (runtime and libraries) is not yet supported, meaning you still need the CUDA 12 toolkit. Full support for the CUDA 13 toolkit is planned for the next pre-release.
v3.0.2
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.2 by @github-actions[bot] in #5348
- Backport to 3.0: Add a macro to disable PDL (#5316) by @bernhardmgruber in #5330
- [Backport branch/3.0x] Add gitlab devcontainers (#5325) by @wmaxey in #5352
Full Changelog: v3.0.1...v3.0.2
v3.0.1
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.1 by @github-actions[bot] in #5256
- [Backport branch/3.0.x] Disable assertions for QNX, they do not provide the machinery with their libc by @github-actions[bot] in #5258
- [BACKPORT 3.0] Make sure that nested `tuple` and `pair` have the expected size (#5246) by @miscco in #5265
- [BACKPORT] Add missed specializations of the new aligned vector types to cub (#5264) by @miscco in #5271
- [BACKPORT 3.0] Backport diagnostic suppression machinery by @miscco in #5281
Full Changelog: v3.0.0...v3.0.1
v3.0.0
CCCL 3.0 Release
The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernization of the codebase to support future growth.
While this release includes a number of breaking changes, many involve the consolidation of APIs (particularly in the `thrust::` and `cub::` namespaces) as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from `thrust::` or `cub::` has been replaced with equivalent or improved abstractions from the `cuda::` or `cuda::std::` namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.
Key Changes in CCCL 3.0
Requirements
- C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
- CUDA Toolkit 12.0+ is now required (support for CTK 11.x has been dropped). For details on version compatibility, see the README.
- Compilers:
- Dropped support for
Header Directory Changes in CUDA Toolkit 13.0
CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under ${CTK_ROOT}/include/cccl/.
| Before CUDA 13.0 | After CUDA 13.0 |
|---|---|
| ${CTK_ROOT}/include/cuda/ | ${CTK_ROOT}/include/cccl/cuda/ |
| ${CTK_ROOT}/include/cub/ | ${CTK_ROOT}/include/cccl/cub/ |
| ${CTK_ROOT}/include/thrust/ | ${CTK_ROOT}/include/cccl/thrust/ |
These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.
What you need to know
- ❌ Do NOT write `#include <cccl/...>`; this will break.
- If using CCCL headers only in files compiled with nvcc:
  - ✅ No action needed. This is the default for most users.
- If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
  - Using CMake and linking `CCCL::CCCL`:
    - ✅ No action needed. (This is the recommended path. See example)
  - Other build systems:
    - ⚠️ Add `${CTK_ROOT}/include/cccl` to your compiler's include search path (e.g., with `-I`)
These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
Major API Changes
Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.
Removed Macros
Over 50 legacy macros have been removed in favor of modern C++ alternatives:
- `CUB_{MIN,MAX}`: use `cuda::std::{min,max}` instead #3821
- `THRUST_NODISCARD`: use `[[nodiscard]]` instead #3746
- `THRUST_INLINE_CONSTANT`: use `inline constexpr` instead #3746
- See the CCCL 2.x to 3.0 Migration Guide for the complete list
Removed Functions and Classes
- `thrust::optional`: use `cuda::std::optional` instead #4172
- `thrust::tuple`: use `cuda::std::tuple` instead #2395
- `thrust::pair`: use `cuda::std::pair` instead #2395
- `thrust::numeric_limits`: use `cuda::std::numeric_limits` instead #3366
- `cub::BFE`: use `cuda::bitfield_insert` and `cuda::bitfield_extract` instead #4031
- `cub::ConstantInputIterator`: use `thrust::constant_iterator` instead #3831
- `cub::CountingInputIterator`: use `thrust::counting_iterator` instead #3831
- `cub::GridBarrier`: use cooperative groups instead #3745
- `cub::DeviceSpmv`: use cuSPARSE instead #3320
- `cub::Mutex`: use `cuda::std::mutex` instead #3251
- See the CCCL 2.x to 3.0 Migration Guide for the complete list
New Features
C++
cuda::
- `cuda::std::numeric_limits` now supports `__float128` #4059
- `cuda::std::optional<T&>` implementation (P2988) #3631
- `cuda::std::numbers` header for mathematical constants #3355
- NVFP8/6/4 extended floating-point types support in `<cuda/std/cmath>` #3843
- `cuda::overflow_cast` for safe numeric conversions #4151
- `cuda::ilog2` and `cuda::ilog10` integer logarithms #4100
- `cuda::round_up` and `cuda::round_down` utilities #3234
cub::
- `cub::DeviceSegmentedReduce` now supports a large number of segments #3746
- `cub::DeviceCopy::Batched` now supports a large number of buffers #4129
- `cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065
thrust::
- New `thrust::offset_iterator` iterator #4073
- Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204
Python
CUDA Python Core Libraries are now available on PyPI through the `cuda-cccl` package:

```
pip install cuda-cccl
```
cuda.cccl.cooperative
- Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
- Block-level data movement now supports multi-dimensional thread blocks #3161
- New block-level inclusive sum algorithm #3921
cuda.cccl.parallel
- New device-level segmented-reduce algorithm #3906
- New device-level unique-by-key algorithm #3947
- New device-level merge-sort algorithm #3763
What's Changed
🚀 Thrust / CUB
- Drop cub::Mutex by @bernhardmgruber in #3251
- Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in #3257
- Remove thrust::[unary|binary]_traits by @bernhardmgruber in #3260
- Drop thrust not1 and not2 by @bernhardmgruber in #3264
- Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in #3258
- Drop thrust::[unary|binary]_function by @bernhardmgruber in #3274
- Enable thrust::identity test for non-MSVC by @bernhardmgruber in #3281
- Enable PDL in triple chevron launch by @bernhardmgruber in #3282
- Drop Thrust legacy arch macros by @bernhardmgruber in #3298
- Drop Thrust's compiler_fence.h by @bernhardmgruber in #3300
- Drop CUB's util_compiler.cuh by @bernhardmgruber in #3302
- Drop Thrust's deprecated compiler macros by @bernhardmgruber in #3301
- Drop CUB_RUNTIME_ENABLED and THRUST_HAS_CUDART by @bernhardmgruber in #3305
- Require C++17 for compiling Thrust and CUB by @bernhardmgruber in #3255
- Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in #3299
- Deprecate cub::IterateThreadStore by @bernhardmgruber in #3337
- Drop CUB's BinaryFlip operator by @bernhardmgruber in #3332
- Deprecate cub::Swap by @bernhardmgruber in #3333
- Drop CUB APIs with a debug_synchronous parameter by ...