Releases: NVIDIA/cccl
python-0.3.3
These are the release notes for the `cuda-cccl` Python package version 0.3.3, dated October 21st, 2025. The previous release was v0.3.2.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- This is the first release to publish Windows wheels to PyPI. You can now `pip install cuda-cccl[cu12]` or `pip install cuda-cccl[cu13]` on Windows for Python versions 3.10, 3.11, 3.12, and 3.13.
Bug Fixes
Breaking Changes
python-0.3.2
These are the release notes for the `cuda-cccl` Python package version 0.3.2, dated October 17th, 2025. The previous release was v0.3.1.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- Allow passing in a device array or `None` as the initial value in scan, as sketched below.
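A minimal sketch of what this enables is below. The entry-point names and argument order are assumptions modeled on the `reduce_into` example later in these notes, not the documented signatures; check the `cuda.compute` documentation before use.

```python
# Hypothetical sketch: function names and argument order are assumptions.
import cupy as cp
import numpy as np
import cuda.compute as compute

def add_op(a, b):
    return a + b

d_input = cp.asarray([1, 2, 3, 4], dtype=np.int32)
d_output = cp.empty_like(d_input)

# New in 0.3.2: the initial value may be a device array ...
d_init = cp.asarray([0], dtype=np.int32)
compute.exclusive_scan(d_input, d_output, add_op, d_input.size, d_init)

# ... or None, when no separate initial value is wanted
compute.inclusive_scan(d_input, d_output, add_op, d_input.size, None)
```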
Bug Fixes
Breaking Changes
v3.1.0
What's Changed
🚀 Thrust / CUB
- [Thrust] Perform asynchronous allocations by default for the `par_nosync` policy by @brycelelbach in #4204
- [Thrust] `reduce_into` by @brycelelbach in #4355
- Enable Catch2 tests in Thrust by @bernhardmgruber in #2669
- Add memcpy_async transform kernel for Ampere by @bernhardmgruber in #2394
- Allow default-initializing and skipping initialization of Thrust vectors by @bernhardmgruber in #4183
- Add thrust::strided_iterator and a step for thrust::counting_iterator by @bernhardmgruber in #4014
- Add new WarpReduce overloadings by @fbusato in #3884
- Optimize ThreadReduce by @fbusato in #3441
📚 Libcudacxx
- Enable device assertions in CUDA debug mode (`nvcc -G`) by @fbusato in #4444
- Avoid EDG bug by moving diagnostic push & pop out of templates by @ericniebler in #4416
- Add host/device/managed mdspan and accessors by @fbusato in #3686
- Add cuda::ptx::elect.sync by @fbusato in #4445
- Add pointer utilities cuda::is_aligned, cuda::align_up, cuda::align_down, cuda::ptr_rebind by @fbusato in #5037
- Add cuda::ceil_ilog2 by @fbusato in #4485
- Add cuda::is_power_of_two, cuda::next_power_of_two, cuda::prev_power_of_two by @fbusato in #4627
- Add cuda::device::warp_match_all by @fbusato in #4746
- Add cuda::static_for by @fbusato in #4855
- Improve/cleanup cuda::annotated_ptr implementation by @fbusato in #4503
- Add cuda::fast_mod_div Fast Modulo Division by @fbusato in #5210
📝 Documentation
- Making extended API documentation slightly more uniform by @fbusato in #4965
- Add memory space note to `cuda::memory` documentation by @fbusato in #5151
- Better specify `lane_mask::all_active()` behavior by @fbusato in #5183
🔄 Other Changes
- [CUDAX] Add universal comparison across memory resources by @pciolkosz in #4168
- Implement `ranges::range_adaptor` by @miscco in #4066
- Avoiding looping over problem size in individual tests by @oleksandr-pavlyk in #4140
- Replace CUB `util_arch.cuh` macros with `inline constexpr` variables by @fbusato in #4165
- Improves test times for `DeviceSegmentedRadixSort` by @elstehle in #4156
- Simplify Thrust iterator functions by @bernhardmgruber in #4178
- Remove `_LIBCUDACXX_UNUSED_VAR` by @davebayer in #4174
- Remove `_CCCL_NO_IF_CONSTEXPR` by @davebayer in #4187
- Implement `__fp_native_type_t` by @davebayer in #4173
- Adds support for large number of segments and large number of items to `DeviceSegmentedRadixSort` by @elstehle in #3402
- Implement inclusive scan in cuda.parallel by @NaderAlAwar in #4147
- Remove `_CCCL_NO_NOEXCEPT_FUNCTION_TYPE` by @davebayer in #4190
- Fix `not_fn` by @miscco in #4186
- Remove `_CCCL_NTTP_AUTO` by @davebayer in #4191
- Avoid instantiating discard_iterator while parsing by @bernhardmgruber in #4180
- Host/Device accessors for `mdspan` by @fbusato in #3686
- Remove `_CCCL_NO_DEDUCTION_GUIDES` by @davebayer in #4188
- Set NO_CMAKE_FIND_ROOT_PATH for cudax by @bdice in #4162
- Fix build breaking with setuptools by @miscco in #4212
- Replaces remaining uses of `thrust::{host,device}_vector` in our Catch2 tests by @elstehle in #4205
- Add check that CXX + CUDA_HOST compilers match when necessary by @alliepiper in #4201
- Disable test on 12.0 CTK by @miscco in #4214
- Implement fp properties by @davebayer in #4213
- [CUDAX] Separate non-async pinned memory resource into legacy_pinned_memory_resource by @pciolkosz in #4179
- Avoid errors in `get_device_address` tests by @miscco in #4209
- Implement extended fp traits by @davebayer in #4211
- Remove `_CCCL_INLINE_VAR` by @davebayer in #4192
- Improve host/device mdspan documentation by @fbusato in #4220
- Drop `_LIBCUDACXX_BEGIN_NAMESPACE_RANGES_ABI` by @miscco in #4210
- Fix C++ version used in CONTRIBUTING.md by @bernhardmgruber in #4224
- Extend tuning documentation by @bernhardmgruber in #4184
- Drop tuning params for benchmarks with custom ops by @bernhardmgruber in #4176
- Make compiler version comparisons safer by @davebayer in #4185
- Document python packages for sol plot script by @bernhardmgruber in #4228
- Remove `_CCCL_NO_FOLD_EXPRESSIONS` by @davebayer in #4189
- Remove python/cuda_cooperative/setup.py by @rwgk in #4221
- Allow cuda::par*.on() to take cuda::stream_ref by @bernhardmgruber in #4225
- Drop `_CCCL_NO_VARIABLE_TEMPLATES` by @miscco in #4229
- Fix typos in cuda mdspan documentation by @fbusato in #4231
- Simplify Thrust assign_value by @bernhardmgruber in #4227
- Remove double underscore limit macros by @davebayer in #4194
- Document deprecations from #4165 by @bernhardmgruber in #4237
- Implement `__fp_is_subset` trait by @davebayer in #4230
- Extend tuning verification docs by @bernhardmgruber in #4236
- Use `[[maybe_unused]]` in whole cccl by @davebayer in #4207
- Move implementation of `cuda::std::array` to libcu++ by @davebayer in #4239
- Implement `__cccl_fp` class by @davebayer in #4238
- Add transform c.parallel implementation by @shwina in #4048
- Drop duplicated system header blocks by @miscco in #4245
- Exclude sm101 from RDC testing. by @alliepiper in #4247
- Make `cuda::stream_ref` constructible on device by @miscco in #4243
- Fix logic in test_segmented_reduce by @oleksandr-pavlyk in #4198
- Add new `WarpReduce` overloadings by @fbusato in #3884
- Fix construction of host init value in test_reduce made incorrect after refactoring by @oleksandr-pavlyk in #4251
- Refac...
python-0.3.1
These are the release notes for the `cuda-cccl` Python package version 0.3.1, dated October 8th, 2025. The previous release was v0.3.0.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- The `cuda.cccl.parallel.experimental` package has been renamed to `cuda.compute`.
- The `cuda.cccl.cooperative.experimental` package has been renamed to `cuda.coop`.
- The old imports will continue to work for now, but will be removed in a subsequent release.
- Documentation at https://nvidia.github.io/cccl/python/ has been updated to reflect these changes.
Bug Fixes
Breaking Changes
- If you were previously importing subpackages of `cuda.cccl.parallel.experimental` or `cuda.cccl.cooperative.experimental`, those imports may not work as expected. Please import from `cuda.compute` and `cuda.coop`, respectively.
v3.0.3
What's Changed
🔄 Other Changes
- Backport #5442 to branch/3.0x by @shwina in #5469
- Backport to 3.0: Fix grid dependency sync in cub::DeviceMergeSort (#5456) by @bernhardmgruber in #5461
- Partial backport to 3.0: Fix SMEM alignment in DeviceTransform by @bernhardmgruber in #5463
- [Version] Update branch/3.0.x to v3.0.3 by @github-actions[bot] in #5502
- [Backport branch/3.0.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in #5481
- [BACKPORT 3.0]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in #5700
- Backport some MSVC test fixes to 3.0 by @miscco in #5819
- [Backport 3.0]: Work around `submdspan` compiler issue on MSVC (#5885) by @miscco in #5903
- Backport pin of llvmlite dependency to branch/3.0x by @shwina in #6000
- [Backport branch/3.0.x] Ensure that we are actually calling the cuda APIs ... (#4570) by @davebayer in #6098
- [Backport to 3.0] Add a specialization of `__make_tuple_types` for `complex<T>` (#6102) by @davebayer in #6117
- [Backport 3.0.x] Use proper qualification in allocate.h (#4796) by @wmaxey in #6126
Full Changelog: v3.0.2...v3.0.3
CCCL Python Libraries (v0.3.0)
These are the release notes for the `cuda-cccl` Python package version 0.3.0, dated October 2nd, 2025. The previous release was v0.2.1.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- ARM64 wheel and conda package support: Installation via `pip` and `conda` is now supported on the ARM64 (aarch64) architecture.
- New algorithm: three-way partitioning: The `three_way_partition` algorithm partitions an array (or iterator) into three partitions, given two selection operators (see the sketch after this list).
- Improved scan performance: The `inclusive_scan` and `exclusive_scan` APIs provide improved performance by automatically selecting the optimal tuning for the input data types and device architecture.
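For illustration, a hypothetical sketch of the new algorithm follows. The argument order is an assumption modeled on CUB's three-way `DevicePartition` overload; consult the `cuda.cccl.parallel` documentation for the actual signature.

```python
# Hypothetical sketch: argument names and order are assumptions.
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

def less_than_3(x):
    return x < 3

def less_than_7(x):
    return x < 7

d_in = cp.asarray([5, 1, 8, 2, 9, 6], dtype=np.int32)
d_first = cp.empty_like(d_in)       # elements selected by less_than_3
d_second = cp.empty_like(d_in)      # remaining elements selected by less_than_7
d_unselected = cp.empty_like(d_in)  # everything else
d_num_selected = cp.empty(2, dtype=np.int64)  # sizes of the first two partitions

parallel.three_way_partition(
    d_in, d_first, d_second, d_unselected, d_num_selected,
    less_than_3, less_than_7, d_in.size,
)
```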
Bug Fixes
None.
Breaking Changes
None.
CCCL Python Libraries v0.1.3.2.0.dev128 (pre-release)
These are the changes in the `cuda.cccl` libraries introduced in the pre-release 0.1.3.2.0.dev128, dated August 14th, 2025.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Major API improvements
Single-call APIs in cuda.cccl.parallel
Previously, performing an operation like `reduce_into` required four API invocations:
(1) create a reducer object, (2) compute the amount of temporary storage required for the reduction,
(3) allocate the required amount of temporary memory, and (4) perform the reduction.
In this version, `cuda.cccl.parallel` introduces simpler, single-call APIs. For example, reduction looks like:
```python
# New API - single function call with automatic temp storage
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
```
If you wish to have more control over temporary memory allocation, the previous API still exists (and always will). It has been renamed from `reduce_into` to `make_reduce_into`:
```python
# Object API - explicit control over temporary storage
reducer = parallel.make_reduce_into(d_input, d_output, add_op, h_init)
# Calling with None for the temp storage returns the required size
temp_storage_size = reducer(None, d_input, d_output, num_items, h_init)
temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
# Calling again with the allocated buffer performs the reduction
reducer(temp_storage, d_input, d_output, num_items, h_init)
```
New algorithms
Device-wide histogram
The `histogram_even` function provides Python exposure of the corresponding CUB C++ API `DeviceHistogram::HistogramEven`.
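As a rough illustration, a hypothetical sketch follows. The parameter order is an assumption modeled on the C++ `DeviceHistogram::HistogramEven` parameters; consult the documentation for the actual Python signature.

```python
# Hypothetical sketch: parameter order is an assumption, not the documented API.
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

d_samples = cp.asarray([0.1, 2.5, 6.3, 7.0, 9.9], dtype=np.float32)
d_histogram = cp.zeros(5, dtype=np.int32)  # 5 evenly spaced bins over [0, 10)

# 6 level boundaries delimit the 5 bins: bin i covers [2*i, 2*(i+1))
parallel.histogram_even(d_samples, d_histogram, 6, 0.0, 10.0, d_samples.size)
```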
StripedToBlock exchange
`cuda.cccl.cooperative` adds a `block.exchange` API providing Python exposure of the corresponding CUB C++ API `BlockExchange`. Currently, only the `StripedToBlock` exchange pattern is supported.
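A hypothetical sketch of kernel-side usage follows. The factory signature, the `.files` link attribute, and the in-kernel call are all assumptions modeled on other `cuda.cccl.cooperative` block primitives; check the documentation for the real API.

```python
# Hypothetical sketch: every cuda.cccl.cooperative detail below (factory
# signature, .files attribute, in-kernel call) is an assumption.
import cuda.cccl.cooperative.experimental as coop
import numba
from numba import cuda  # numba's CUDA kernel API

threads_per_block = 128
items_per_thread = 4

# Build a block-level exchange for int32 items (striped -> blocked)
exchange = coop.block.exchange(numba.int32, threads_per_block, items_per_thread)

@cuda.jit(link=exchange.files)
def kernel(d_in, d_out):
    # Load in a striped arrangement: thread t owns d_in[t], d_in[t + 128], ...
    items = cuda.local.array(items_per_thread, numba.int32)
    for i in range(items_per_thread):
        items[i] = d_in[i * threads_per_block + cuda.threadIdx.x]
    exchange(items)  # rearrange striped -> blocked across the thread block
    # Store in a blocked arrangement: thread t owns a contiguous chunk
    for i in range(items_per_thread):
        d_out[cuda.threadIdx.x * items_per_thread + i] = items[i]
```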
Infrastructure improvements
CuPy dependency replaced with cuda.core
Use of CuPy within the library has been replaced with the lighter-weight `cuda.core` package. This means that installing `cuda.cccl` won't install CuPy as a dependency.
Support for CUDA 13 drivers
`cuda.cccl` can be used with CUDA 13 compatible drivers. However, the CUDA 13 toolkit (runtime and libraries) is not yet supported, meaning you still need the CUDA 12 toolkit. Full support for the CUDA 13 toolkit is planned for the next pre-release.
v3.0.2
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.2 by @github-actions[bot] in #5348
- Backport to 3.0: Add a macro to disable PDL (#5316) by @bernhardmgruber in #5330
- [Backport branch/3.0x] Add gitlab devcontainers (#5325) by @wmaxey in #5352
Full Changelog: v3.0.1...v3.0.2
v3.0.1
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.1 by @github-actions[bot] in #5256
- [Backport branch/3.0.x] Disable assertions for QNX, they do not provide the machinery with their libc by @github-actions[bot] in #5258
- [BACKPORT 3.0] Make sure that nested `tuple` and `pair` have the expected size (#5246) by @miscco in #5265
- [BACKPORT] Add missed specializations of the new aligned vector types to cub (#5264) by @miscco in #5271
- [BACKPORT 3.0] Backport diagnostic suppression machinery by @miscco in #5281
Full Changelog: v3.0.0...v3.0.1
v3.0.0
CCCL 3.0 Release
The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernization of the codebase to support future growth.
While this release includes a number of breaking changes, many involve the consolidation of APIs (particularly in the `thrust::` and `cub::` namespaces) as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from `thrust::` or `cub::` has been replaced with equivalent or improved abstractions from the `cuda::` or `cuda::std::` namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.
Key Changes in CCCL 3.0
Requirements
- C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
- CUDA Toolkit 12.0+ is now required (support for CTK 11.x has been dropped). For details on version compatibility, see the README.
- Compilers:
- Dropped support for
Header Directory Changes in CUDA Toolkit 13.0
CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under ${CTK_ROOT}/include/cccl/.
| Before CUDA 13.0 | After CUDA 13.0 |
|---|---|
| ${CTK_ROOT}/include/cuda/ | ${CTK_ROOT}/include/cccl/cuda/ |
| ${CTK_ROOT}/include/cub/ | ${CTK_ROOT}/include/cccl/cub/ |
| ${CTK_ROOT}/include/thrust/ | ${CTK_ROOT}/include/cccl/thrust/ |
These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.
What you need to know
- ❌ Do NOT write `#include <cccl/...>`; this will break.
- If using CCCL headers only in files compiled with nvcc:
  - ✅ No action needed. This is the default for most users.
- If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
  - Using CMake and linking `CCCL::CCCL`:
    - ✅ No action needed. (This is the recommended path. See example)
  - Other build systems:
    - ⚠️ Add `${CTK_ROOT}/include/cccl` to your compiler's include search path (e.g., with `-I`)
These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
Major API Changes
Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.
Removed Macros
Over 50 legacy macros have been removed in favor of modern C++ alternatives:
- `CUB_{MIN,MAX}`: use `cuda::std::{min,max}` instead #3821
- `THRUST_NODISCARD`: use `[[nodiscard]]` instead #3746
- `THRUST_INLINE_CONSTANT`: use `inline constexpr` instead #3746
- See the CCCL 2.x to 3.0 Migration Guide for the complete list
Removed Functions and Classes
- `thrust::optional`: use `cuda::std::optional` instead #4172
- `thrust::tuple`: use `cuda::std::tuple` instead #2395
- `thrust::pair`: use `cuda::std::pair` instead #2395
- `thrust::numeric_limits`: use `cuda::std::numeric_limits` instead #3366
- `cub::BFE`: use `cuda::bitfield_insert` and `cuda::bitfield_extract` instead #4031
- `cub::ConstantInputIterator`: use `thrust::constant_iterator` instead #3831
- `cub::CountingInputIterator`: use `thrust::counting_iterator` instead #3831
- `cub::GridBarrier`: use cooperative groups instead #3745
- `cub::DeviceSpmv`: use cuSPARSE instead #3320
- `cub::Mutex`: use `cuda::std::mutex` instead #3251
- See the CCCL 2.x to 3.0 Migration Guide for the complete list
New Features
C++
cuda::
- `cuda::std::numeric_limits` now supports `__float128` #4059
- `cuda::std::optional<T&>` implementation (P2988) #3631
- `cuda::std::numbers` header for mathematical constants #3355
- NVFP8/6/4 extended floating-point types support in `<cuda/std/cmath>` #3843
- `cuda::overflow_cast` for safe numeric conversions #4151
- `cuda::ilog2` and `cuda::ilog10` integer logarithms #4100
- `cuda::round_up` and `cuda::round_down` utilities #3234
cub::
- `cub::DeviceSegmentedReduce` now supports a large number of segments #3746
- `cub::DeviceCopy::Batched` now supports a large number of buffers #4129
- `cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065
thrust::
- New `thrust::offset_iterator` iterator #4073
- Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204
Python
CUDA Python Core Libraries are now available on PyPI through the `cuda-cccl` package:

```
pip install cuda-cccl
```
cuda.cccl.cooperative
- Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
- Block-level data movement now supports multi-dimensional thread blocks #3161
- New block-level inclusive sum algorithm #3921
cuda.cccl.parallel
- New device-level segmented-reduce algorithm #3906
- New device-level unique-by-key algorithm #3947
- New device-level merge-sort algorithm #3763
What's Changed
🚀 Thrust / CUB
- Drop cub::Mutex by @bernhardmgruber in #3251
- Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in #3257
- Remove thrust::[unary|binary]_traits by @bernhardmgruber in #3260
- Drop thrust not1 and not2 by @bernhardmgruber in #3264
- Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in #3258
- Drop thrust::[unary|binary]_function by @bernhardmgruber in #3274
- Enable thrust::identity test for non-MSVC by @bernhardmgruber in #3281
- Enable PDL in triple chevron launch by @bernhardmgruber in #3282
- Drop Thrust legacy arch macros by @bernhardmgruber in #3298
- Drop Thrust's compiler_fence.h by @bernhardmgruber in #3300
- Drop CUB's util_compiler.cuh by @bernhardmgruber in #3302
- Drop Thrust's deprecated compiler macros by @bernhardmgruber in #3301
- Drop CUB_RUNTIME_ENABLED and THRUST_HAS_CUDART by @bernhardmgruber in #3305
- Require C++17 for compiling Thrust and CUB by @bernhardmgruber in #3255
- Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in #3299
- Deprecate cub::IterateThreadStore by @bernhardmgruber in #3337
- Drop CUB's BinaryFlip operator by @bernhardmgruber in #3332
- Deprecate cub::Swap by @bernhardmgruber in #3333
- Drop CUB APIs with a debug_synchronous parameter by ...