Changelog for CuTe DSL API changes
4.3.0 (2025-10-07)
- Debuggability improvements:
  - Supported source location tracking for DSL APIs
  - Supported dumping PTX and SASS code
- Removed deprecated cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"] and cutlass.utils.ampere_helpers
- Supported calling nested functions without capturing variables inside dynamic control flow
- Replaced usage of cute.arch.barrier in examples with corresponding APIs in pipeline
  - Use pipeline.sync for simple cases like synchronizing the whole CTA
  - Use pipeline.NamedBarrier to customize barriers with different participating threads and barrier id
- Added new APIs repeat and repeat_as_tuple
- Added new API make_rmem_tensor to replace make_fragment with better naming
- Added new API make_rmem_tensor_like, which creates an rmem tensor from an existing tensor using the same shape with compact col-major strides
- Added TmemAllocator for allocating tensor memory
- Updated SmemAllocator.allocate to support allocation of a single scalar value
- Fixed TensorSSA.reduce to support a static value as the initial value
- Updated docstrings for the following APIs to be more concise and easier to understand:
  - make_layout_tv
  - is_static
  - PipelineAsync
  - SmemAllocator
- Fixed documentation for pipeline, utils and cute.math
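The make_rmem_tensor_like entry above refers to "compact col-major strides". As a rough illustration of what that phrase means, the following plain-Python sketch computes such strides for a flat shape. It does not call the CuTe DSL: compact_col_major_strides and linear_offset are hypothetical helpers written for this note, and the sketch assumes a flat (non-hierarchical) shape, whereas CuTe shapes may be nested.

```python
from typing import Tuple


def compact_col_major_strides(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Hypothetical helper (not a CuTe API): strides of a compact
    column-major layout, where the leftmost mode varies fastest."""
    strides = []
    running = 1
    for extent in shape:
        strides.append(running)  # stride of this mode = product of earlier extents
        running *= extent
    return tuple(strides)


def linear_offset(coord: Tuple[int, ...], strides: Tuple[int, ...]) -> int:
    """Hypothetical helper: inner product of a coordinate with the strides."""
    return sum(c * s for c, s in zip(coord, strides))


shape = (4, 8, 2)
strides = compact_col_major_strides(shape)
print(strides)                            # (1, 4, 32)
print(linear_offset((3, 7, 1), strides))  # 63, the last element of 4*8*2 = 64
```

"Compact" here means the layout has no gaps: the largest offset is exactly one less than the number of elements.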
4.2.0 (2025-09-10)
- Added back cute.make_tiled_copy per the request from community
- Added support for explicit and implicit broadcast in TensorSSA
  - cutlass.cute.TensorSSA: support broadcast_to and implicit broadcasting for binary operations.
- Supported printing TensorSSA value in cutlass.cute.print_tensor
- Updated cute.gemm to support all dispatch patterns and improved checks for illegal inputs
- Introduced automatic kernel smem usage calculation for launch config.
- Introduced per op fast-math control for math ops (e.g. exp, exp2, log2, log)
- Introduced CopyReduceBulkTensorTileS2GOp in tcgen05/copy.py to support TMA Reduce.
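To make the "implicit broadcasting for binary operations" entry above more concrete, here is a small plain-Python sketch of how a broadcast result shape can be derived from two operand shapes. It assumes NumPy-style alignment (trailing dimensions are matched, and an extent of 1 stretches), which this changelog does not spell out, so treat the rule as an assumption and check the TensorSSA documentation for the exact semantics. broadcast_shape is a hypothetical helper, not part of cutlass.cute.

```python
from itertools import zip_longest
from typing import Tuple


def broadcast_shape(lhs: Tuple[int, ...], rhs: Tuple[int, ...]) -> Tuple[int, ...]:
    """Hypothetical helper: result shape of a binary op under
    NumPy-style broadcasting (an assumption, see the note above)."""
    result = []
    # Align shapes from the trailing dimension; missing dims count as 1.
    for a, b in zip_longest(reversed(lhs), reversed(rhs), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError(f"shapes {lhs} and {rhs} are not broadcastable")
    return tuple(reversed(result))


print(broadcast_shape((4, 1), (1, 8)))  # (4, 8)
print(broadcast_shape((3,), (2, 3)))    # (2, 3)
```

An explicit broadcast_to corresponds to choosing the target shape up front; the implicit form infers it from both operands as sketched here.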
4.1.0 (2025-07-16)
- for loop
  - Python built-in range now always generates code and executes at runtime
  - cutlass.range is an advanced range with kernel-code-level unrolling and pipelining control
  - Deprecated cutlass.range_dynamic; please replace it with range or cutlass.range
  - Experimental: added pipelining control for compiler-generated software pipeline code
- while/if
  - while/if now by default generates code and executes at runtime unless cutlass.const_expr is specified for the predicate
  - Deprecated cutlass.dynamic_expr; please remove it
- Rename mbarrier functions to reduce ambiguity
- Modify SyncObject API (MbarrierArray, NamedBarrier, TmaStoreFence) to match std::barrier
- Change pipeline create function to take only keyword arguments, and make barrier_storage optional.
- Introduce cutlass.cute.arch.get_dyn_smem_size API to get runtime dynamic shared memory size.
- Various API support for SM100 BlockScaled Gemm
  - Introduce BlockScaled MmaOps in tcgen05/mma.py, and provide a make_blockscaled_trivial_tiled_mma function in blackwell_helpers.py to help construct a BlockScaled TiledMma.
  - Introduce S2T CopyOps in tcgen05/copy.py.
  - Introduce BlockScaled layout utilities in blockscaled_layout.py for creating the required scale factor layouts in global memory, shared memory and tensor memory.
- cutlass.cute.compile now supports compilation options. Refer to JIT compilation options for more details.
- cutlass.cute.testing.assert_ now works for device JIT function. Specify --enable-device-assertions as compilation option to enable.
- cutlass.cute.make_tiled_copy is now deprecated. Please use cutlass.cute.make_tiled_copy_tv instead.
- Shared memory capacity query
  - Introduce cutlass.utils.get_smem_capacity_in_bytes for querying the shared memory capacity.
  - <arch>_utils.SMEM_CAPACITY["<arch_str>"] is now deprecated.
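The 4.1.0 change that makes the pipeline create function keyword-only (with barrier_storage optional) can break callers that passed arguments positionally. The sketch below shows the general Python mechanism with a stand-in class; ToyPipeline and its num_stages parameter are hypothetical and are not the real pipeline classes or their actual parameter lists. Only barrier_storage is a name taken from the changelog.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPipeline:
    """Hypothetical stand-in for a pipeline class (not the cutlass API)."""
    num_stages: int
    barrier_storage: Optional[object] = None

    @staticmethod
    def create(*, num_stages: int,
               barrier_storage: Optional[object] = None) -> "ToyPipeline":
        # The bare `*` forces every argument to be passed by keyword.
        return ToyPipeline(num_stages=num_stages,
                           barrier_storage=barrier_storage)


# Keyword call works, and barrier_storage may be omitted.
pipe = ToyPipeline.create(num_stages=4)
print(pipe.barrier_storage is None)  # True

# A positional call is rejected by Python itself.
try:
    ToyPipeline.create(4)
except TypeError as err:
    print("rejected positional call:", type(err).__name__)
```

When migrating, naming every argument at each create call site is the safe fix, since keyword calls do not depend on parameter order.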
4.0.0 (2025-06-03)
- Fixed API mismatch in class cute.runtime.Pointer: change element_type to dtype to match typing.Pointer