Interface For Sparse Linear Algebra Operations
Abstract
The standardization of an interface for dense linear algebra operations in the
BLAS standard has enabled interoperability between different linear algebra libraries,
thereby boosting the success of scientific computing, in particular in scientific HPC:
from this point on, applications could interface highly optimized implementations
through a standardized API. Despite numerous efforts in the past, the community has
not yet agreed on a standardization for sparse linear algebra operations, for several
reasons. One is the fact that sparse linear algebra objects allow for many different
storage formats, and different hardware may favor different storage formats. This makes
the definition of a FORTRAN-style all-encompassing interface extremely challenging.
Another reason is that, as opposed to dense linear algebra functionality, in sparse
linear algebra the size of the sparse data structure holding the operation result is not
always known prior to the operation. Furthermore, as opposed to the standardization
effort for dense linear algebra, we are late in the technology readiness cycle, and many
production-ready software libraries using sparse linear algebra routines have implemented
and committed to their own sparse BLAS interface. At the same time, there exists a demand
for standardization that would improve interoperability and sustainability, and allow
for easier integration of building blocks. In an inclusive, cross-institutional effort
involving numerous academic institutions, US National Labs, and industry, we spent two
years designing a hardware-portable interface for basic sparse linear algebra
functionality that serves the user needs and is compatible with the different interfaces
currently used by different vendors. In this paper, we present a C++ API for sparse
linear algebra functionality, discuss the design choices, and detail how software
developers retain a large degree of freedom in how to implement functionality
behind this API.
Contents
1 Introduction and Motivation
2 Related Efforts
  2.1 The (dense) BLAS standard
  2.2 The GraphBLAS standard
  2.3 Existing Sparse BLAS standards
5 API Design
  5.1 Ownership and Opaqueness
  5.2 Horizontal and Vertical Interoperability
  5.3 Review of existing API designs: SpMV
  5.4 API Design Considerations
  5.5 Single-Stage API for Operations with known Output
  5.6 Multi-Stage API for Operations with unknown Output
6 Execution Model
7 Numerical considerations
  7.1 Choices of numerical formats
  7.2 Error bounds to be satisfied
  7.3 Consistent exception handling from input to output
  7.4 Bit-wise reproducibility from run to run
  7.5 Test Suites
8 Future extensions
been modernized by the C++ committee when integrating the functionality into
std::linalg in the C++26 standard library.
Learning from these previous and ongoing efforts and building on top of
the C++ committee’s recent work, we propose a new C++-based Sparse BLAS
standard. Our main objectives are to enable portability across vendors, and
to provide high-performance, flexible and extensible interfaces for sparse linear
algebra functionality. Our API is designed to run on accelerators and
traditional CPUs alike. This is especially important given the pervasiveness
of accelerators in computing and the challenges in achieving high performance
from such hardware, which further increases the value of Sparse BLAS.
We recognize that linear algebra is a fundamental topic beyond classical
mathematical disciplines and physical sciences. Over the past decade, sparsity
has emerged as an important element of computation in AI. We argue that
a Sparse BLAS standard would provide a useful set of tools for modern AI
frameworks, potentially expanding to standardize new functionality such as low-
and mixed-precision computation and batch processing, and easing integration
with high-level deep learning frameworks.
While we may include more high-level routines in the future, such as solvers,
we initially restrict ourselves to basic functionality and also retain the familiar
name BLAS (Basic Linear Algebra Subroutines). Given the breadth and scope
of Sparse functionality (numerous matrix formats, complex algorithms, ...), we
expect the standardization effort to be an iterative process. We focus in the
current iteration on functionalities for which we found there is sufficient interest
and which are achievable for a first version of this standard. Therefore, the list
of features, kernels, and options may expand in the future. The current API design
proposal is the fruit of an open, collegial collaboration between industry and
university experts. To suggest functionality for inclusion, or to get involved in
the project, feel free to contact the coordinator of the effort as identified in the
author list.
Concerning interfaces, we currently focus on the C++ interface as the key
building block, but plan to also suggest interfaces for C and Fortran. We ac-
knowledge the use of high-level languages such as Python and Julia in the HPC
community and plan to provide interfaces also to high-level programming lan-
guages in a later iteration.
2 Related Efforts
2.1 The (dense) BLAS standard
The origins of the BLAS standard can be traced back to 1973, when Hanson,
Krogh, and Lawson wrote an article in the SIGNUM Newsletter (Vol. 8, no.
4, p. 16) describing the advantages of adopting a set of basic routines for prob-
lems in linear algebra. This led to the development of the original BLAS [41],
which indeed turned out to be advantageous and very successful. It was adopted
as a standard and used in a wide range of numerical software, including LIN-
PACK [24]. An extended set, the Level 2 BLAS, was proposed for matrix-vector oper-
ations [23, 22]. At the time, these two levels were appropriate for vector proces-
sors and vector register chaining, respectively. Later in the 90s with the appear-
ance of cache-based machines, a strong need to restructure existing algorithms
to benefit from reusing data from cache emerged. The idea was to express algo-
rithms in terms of blocks that fit into cache memory, decreasing data movement
by providing a surface-to-volume effect for the ratio of O(n²) data movement to
O(n³) operations. Therefore, the Level 3 BLAS were proposed [20, 21], covering
matrix-matrix operations, and LINPACK was redesigned into LAPACK [3] to
use the new Level 3 BLAS where possible. With multi-core and GPU-accelerated
architectures, the BLAS have continued to be a valuable paradigm, including
in parallel task-based libraries such as PLASMA [1].
Several C++ interfaces have been proposed for the BLAS, ranging from
lightweight wrappers [29] to expression templates [46, 48, 30] that replace func-
tion calls with arithmetic expressions (C = A*B). Recently, the std::linalg
interface [35] was adopted as part of the C++26 standard, and has served as an
inspiration for this Sparse BLAS proposal. linalg is a templated, free-function
interface that leverages mdspan as an abstraction for dense matrices and ten-
sors. Matrix scaling and (conjugate) transposition are provided by lightweight
views of the matrix, which we have followed here. In a break from traditional
BLAS, functions use plain English names (multiply) instead of expert-focused
acronyms (gemm). Execution policies provide a mechanism for parallelization
and GPU acceleration.
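For illustration, a dense matrix-vector product y = 2 · A · x in the std::linalg style (a minimal sketch using only facilities specified in P1673; not part of this proposal) looks as follows:

#include <linalg>   // C++26 std::linalg (P1673)
#include <mdspan>
#include <vector>

int main() {
  std::size_t m = 3, n = 4;
  std::vector<double> A_data(m * n, 1.0), x_data(n, 1.0), y_data(m, 0.0);
  auto A = std::mdspan(A_data.data(), m, n);   // dense m x n matrix view
  auto x = std::mdspan(x_data.data(), n);
  auto y = std::mdspan(y_data.data(), m);
  // y = 2 * A * x, with the scaling expressed as a lightweight view
  std::linalg::matrix_vector_product(std::linalg::scaled(2.0, A), x, y);
}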
operations.
The GraphBLAS C API Specification [6] defines a set of opaque containers
such as matrices and vectors along with generalized linear algebra operations
in C. SuiteSparse provides a high-performance, multi-threaded implementation
of the GraphBLAS C API, along with some support for GPU execution [11,
12]. There are also several GraphBLAS-like libraries in other programming
languages [50, 53, 52]. The GraphBLAS C++ Specification provides a native C++
interface, providing better support for user-defined types through templates as
well as taking advantage of C++ concepts and views to simplify the interface [7,
8].
While GraphBLAS provides much of the functionality needed in
a Sparse BLAS interface, there are several issues that make it suboptimal for
adoption as a Sparse BLAS standard:
1. While GraphBLAS’ opaque objects provide a clean separation between
what data is owned by the user and what data is owned by the imple-
mentation, this separation generally requires data to be copied in and
out of GraphBLAS via its import/export API. For users who only want
to use a library to execute a few operations, this presents considerable
overhead. This issue can be resolved with the pack/unpack methods in
SuiteSparse:GraphBLAS, which take O(1) time and do not allocate
memory. However, these methods require the library and GraphBLAS to share the
same memory space, and they are extensions to GraphBLAS that are not in the
GraphBLAS C API Specification.
2. GraphBLAS provides some complex functionality, such as masked oper-
ations and extract and assign routines, that are not needed by Sparse
BLAS, and that vendors may be unlikely to implement. The GraphBLAS
C API allows an unimplemented method to return an error code to indi-
cate that it is not implemented, so this issue could be resolved by creating
a well-defined Sparse BLAS subset of GraphBLAS.
3. Sparse BLAS has a few key routines, such as triangular solve, that are not
needed or included in GraphBLAS.
4. The GraphBLAS C++ API does not currently provide explicit support for
accelerators or explicitly control how and where code should be executed,
although the C++ specification does allow implementations to support ac-
celerators through C++ execution policies [7, 8]. Parallel and accelerated
execution should be a first-class consideration for a SparseBLAS specifi-
cation.
5. Finally, there are currently no vendor-provided implementations of Graph-
BLAS, with vendors deferring to support open-source implementations
where necessary due to the breadth of the API. This Sparse BLAS ef-
fort includes math library developers from many vendors (Intel, NVIDIA,
AMD, Arm), and we anticipate working with vendors to ensure the inter-
face is suitable for them to produce high-performance implementations.
Nevertheless, we anticipate that the proposed standard and GraphBLAS will
overlap for a subset of functionalities, and the Sparse BLAS API will be capable
of efficiently expressing them. Several GraphBLAS innovations we may explore
utilizing in this iteration of the Sparse BLAS standard include: (1) operations
between two sparse objects, (2) masking, and (3) non-blocking or asynchronous
execution.
Filippone et al. [28]. The scope and breadth of the library make it akin to a
Sparse standard proposal, as most operations are supported and careful atten-
tion is given to interface design. They also reuse the work of Duff et al. [26],
e.g. use the same matrix storage type and rely on the serial Sparse BLAS of the
previous proposals. Their focus is on proposing a parallel (node-distributed)
implementation of the algorithms and implementing the library using multiple
layers, with base functionality implemented in Fortran 77 and higher-level inter-
faces in Fortran 90. In 2012, Filippone et al. [27] proposed an object-oriented
Fortran 2003 interface that, thanks to programming language innovations,
greatly simplifies Sparse BLAS usage and standardization. This design enables
the support of multiple formats through class inheritance while maintaining
an abstract matrix handle type. This handle type allows for conversions and
other optimizations while enabling simpler interfaces. Notably, the design also
considers move versus copy semantics for matrix conversions and user data.
We learn from and base our standard proposal upon the previous efforts to
provide an increasingly simple-to-use yet expressive Sparse BLAS set of func-
tionality. At the same time, we extend the scope by, e.g., introducing an inter-
face for the more complex sparse-matrix times sparse-matrix product deemed
complicated in the previous iterations. The hardware evolution, namely GPUs,
requires us to also introduce a new set of functionality, user-provided allocators,
as memory management can be critical for these complex operations on such
complex hardware. Furthermore, we acknowledge in this proposal that for re-
peated application of a sparse operator, the initial investment in optimization –
analyzing sparsity patterns and characteristics, and fine-tuning format and exe-
cution policies – can quickly be amortized by faster execution. We thus propose
interfaces also allowing for an optimization phase prior to repeated execution of
functionality.
plication (SDDMM) kernel common in machine learning frameworks. A list of
functionality falling into this scope is given in Table 1. For convenience, we
also summarize the API type for each functionality as detailed in Section 5.
Multi-stage APIs produce a Sparse Matrix as an output, whereas single-stage
APIs produce a dense object as an output.
Several additional storage formats have been proposed in the literature
throughout the years [40], outlining trade-offs between performance, memory
footprint, and creation overhead. These formats can be considered as optimiza-
tion targets, rather than input/exchange formats. Possible future extension
formats include CSR(4), CSC(4), BSR / BSC, variable BSR / BSC, Ellpack
(including variants ESB/SELL-P/SELL-C-σ), and diagonal/tridiagonal/penta-
diagonal formats. We will not go into specific details about these here; for now,
these are considered to be optional extension formats.
Some applications require the structure of sparse matrices only (i.e., the
pattern of non-zero elements) and not the values themselves. Other applications
may use the same value in all nonzero locations; we call these iso-valued
matrices. We do not provide a specific type for structure-only matrices but
instead define iso-valued matrices. We anticipate iso-valued matrix views being
created via CSR, CSC, or COO matrix views whose value arrays are replaced
by a symbolic “iso array” whose values are all the same.
Table 2 outlines the input parameters required to construct a CSR matrix.
These input parameters are defined by three important types: scalar_type,
which is the type of the scalar elements stored in the sparse matrix, index_type,
which is the type used to store explicit indices, and offset_type, which is the
type used to store offsets into other arrays. The index_type limits the maximum
dimensions of a matrix that can be stored, while offset_type limits the
maximum number of nonzeros that can be stored. This is useful, for example,
when large matrices may require int64_t for the rowptr elements in order to avoid
integer overflow during calculations involving rowptr. However, using int64_t
for the large colindx array would be unnecessary unless the matrix dimensions are
larger than 2^32, and could result in lower performance than using a smaller
integer type.
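As a concrete illustration (standard C++ only; the array names follow the table below, and the sizes are arbitrary), a matrix whose dimensions fit into 32-bit indices but whose number of nonzeros exceeds the 32-bit range could use:

#include <cstdint>
#include <vector>

using index_type  = std::int32_t;   // row/column indices and dimensions
using offset_type = std::int64_t;   // row-pointer entries, may exceed 2^31

index_type  nrows = 1'000'000, ncols = 1'000'000;
offset_type nnz   = 3'000'000'000;           // too large for int32_t
std::vector<offset_type> rowptr(nrows + 1);  // offsets into colindx/values
std::vector<index_type>  colindx(nnz);       // column index per nonzero
std::vector<float>       values(nnz);        // one value per nonzero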
The parameter nnz might easily be deduced from the input arrays,
Type                Name         Description
index_type          nrows        Number of rows of the matrix
index_type          ncols        Number of columns of the matrix
offset_type         nnz          Structural number of non-zeros
offset_type array   colptr       Column pointer array (length: ncols+1)
index_type array    rowindx      Row index array (length: nnz)
scalar_type array   values       Structural values (length: nnz)
base_type           index_base   Base indexing
matrices found in many simulation codes and obtained with the finite element
method (FEM) or the discontinuous Galerkin method (DG). These data struc-
tures reduce the memory footprint and provide storage suitable for using dense
linear algebra. For instance, the BSR (or BCSR, Block Compressed Sparse
Row) format extends the CSR format by using square block-dense matrices
rather than scalars for the values. Information describing the characteristics
of its blocks, such as the dimension bsize and the internal storage bstorage
(row-oriented or column-oriented layout), must be included. The array values
stores non-zero blocks contiguously.
While we acknowledge the widespread use of block formats, we exclude
them in this first version of an API design proposal. This is motivated by the
idea of keeping this document within reasonable length and complexity.
We anticipate using std::mdspan, introduced in C++23, to support dense
matrices in the same style as the recently standardized std::linalg C++
BLAS interface [34]. Users can create an mdspan view with a data handle
(a pointer is the most common example, although different data handles may
appear as defined by an accessor policy) along with matrix extents, represent-
ing the matrix’s dimensions. Once an mdspan view has been created, the user
can pass in a single argument to represent a dense matrix, instead of passing
in multiple arguments for the pointer, dimensions, and stride. std::mdspan
supports a large degree of compile-time customization, including compile-time
constant extents, customization of the index type used for extents, as well as custom
layout and accessor policies. Row- and column-major matrices are supported by
standard-defined layout policies (std::layout right and std::layout left).
We anticipate vendors restricting users to only a subset of the possible mdspans,
such as mdspans with standard-defined layout policies, default accessors, and an
order of one or two. These mdspans could be directly passed into most vendor-
provided implementations that expect a traditional dense matrix represented
by a pointer, dimensions, stride, and row or column major layout. Unsupported
mdspans would trigger a clear compile error message in a high-quality imple-
mentation.
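For example, a row-major dense matrix and a column-major view of the same buffer can be created as follows (standard C++23 mdspan only; sizes are arbitrary):

#include <mdspan>
#include <vector>

std::size_t m = 4, n = 3;
std::vector<float> data(m * n);
// default layout_right, i.e. row-major
auto B_row = std::mdspan(data.data(), m, n);
// explicit layout policy for a column-major interpretation
auto B_col = std::mdspan<float, std::dextents<std::size_t, 2>,
                         std::layout_left>(data.data(), m, n);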
5 API Design
The sparse BLAS interface design has to consider a wide range of intertwined
aspects: opaqueness, memory allocation for sparse outputs, data ownership,
multi-stage APIs, and object- and function-specific execution policies and
hints for optimization. In the following sections, we address these
aspects individually.
always be inspected and accessed at little cost, whereas accessors are required
for opaque objects and can come with an extra cost.
Because each of these aspects can be important in different situations and
to different users, we believe that the two options are both needed and comple-
mentary, and that providing these two types of API will satisfy most needs.
In this first version of an interface design, we focus on the second option, the
non-owning API for functionality. However, as we will detail next, we extend
this approach to allow for a moderate level of optimization. In this non-owning
API, the user owns the data and light-weight views are used to expose the
functionality to the library operations. In particular, the light-weight views
expose the raw pointers and size information of, for example, a CSR matrix,
dense matrix, or a COO matrix.
As discussed, this approach limits the optimization potential for developers.
In particular, the light-weight views do not allow storing additional informa-
tion that can be exploited by algorithms operating on the sparse matrices, like
symmetry, upper or lower triangular structure, matrix transpose, etc. Without
breaking the concept of light-weight views, we enable optimization and the storing
of additional information by wrapping the view into a matrix handle that con-
tains both the view of the user-owned raw data and a structure that can contain
additional data for optimization. Importantly, all the optimization data is
owned by the library, not by the user; the data is stored in library-allocated
memory and is generally opaque to the user.
Some users want to keep control of all available memory, in particular to control
how data structures are allocated. For this reason, we propose a design where
the user can pass an allocator to the library, which is then used by the library
to allocate additional data in the matrix handle. If no allocator is provided,
the library is permitted to use whichever allocator it deems appropriate.
Below we give an example of how a CSR matrix structure is passed as
a light-weight view to a library and a matrix handle is wrapped around it to
allow for storing additional information.
We note that the API is expected to accept both plain views and matrix handles.
If plain views are passed in, the library has no optimization structure at hand
and cannot store any additional matrix information that persists beyond the
operation.
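The following sketch illustrates this design; the construction style follows the commented-out handle lines in the later listings, and the exact matrix_handle signature is an assumption:

using namespace spblas;
// user-owned raw CSR data exposed through a light-weight, non-owning view
csr_view<float> A_view(values, rowptr, colind, shape, nnz);
// the handle wraps the view and may additionally store library-owned
// optimization data, allocated through the user-provided allocator
auto A = matrix_handle(A_view, allocator);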
5.2 Horizontal and Vertical Interoperability
The interface and functionality standardization we propose must allow for hor-
izontal and vertical interoperability. Vertical interoperability includes both up-
ward and downward interoperability. Upward interoperability refers to usability
and productivity inside high-level algorithms like iterative solvers, precondition-
ers, and sparse neural networks. Only interfaces and functionality decisions al-
lowing for high productivity for high-level algorithms and applications justify
the design of a sparse BLAS standard. We assess upward interoperability by us-
ing the sparse BLAS functionality inside mini apps. Downward interoperability
refers to the possibility of wrapping already existing sparse BLAS functionality
developed by hardware vendors into the sparse BLAS interface. The intention
for the downward interoperability is not to create yet another layer of inter-
faces, but to ensure that design decisions taken by the vendors in the kernel
implementations do not hinder the adoption of a new interface in future ker-
nel development. Horizontal interoperability refers to compatibility with dense
BLAS functionality and the possibility to use the functionality in other pro-
gramming languages. The latter takes on a new significance in the context of
Sparse BLAS, as we aim to focus on building a high-level C++ API, as opposed
to the Fortran-based interface used in the dense BLAS to allow for ABI compat-
ibility. Instead, our goal is to ensure that basic functionality is accessible from
a wide range of languages, including Fortran, C, Julia, Python, and others. The
strategy we envision is to create C-bindings for the most central functionality.
Most vendors implement linear algebra functionality using the C++ program-
ming language. Likewise, most already implemented sparse BLAS functional-
ity, e.g. in NVIDIA’s, Arm’s, Intel’s, and AMD’s vendor libraries, is
designed in C++. Also, for dense BLAS, the C++ standard officially adopted the
std::linalg API [34]. We adopt this strategy and build upon an interface in
C++. At the same time, we acknowledge that many users and application devel-
opers depend on Fortran, C, or Python interfaces as their application codes are
using the respective language. Designing an interface to sparse BLAS function-
ality in C or Fortran without the use of static and dynamic polymorphism is
almost impossible due to the explosion in variants, which would result in a pro-
hibitively long parameter list and a plethora of function variants. In response,
we will select the most common use cases and provide bindings in C, Fortran,
and Python to a subset of the sparse BLAS functionality.
blueprints guiding the design of the new API standard for sparse linear algebra
functionality. Thus, in the following, we provide a comparison of existing C
and C++ library APIs for sparse matrix-vector (SpMV) operations. Some APIs
are synchronous and blocking, while others extend to asynchronous, non-blocking
execution models.
SpMV API comparisons between existing libraries:
IEEE 754 Double Precision Sparse Matrix-Vector Multiplication (SpMV)
y = α · op(A) · x + β · y
(4) GraphBLAS SuiteSparse C API (implements y⟨mask⟩ = accum(y, A · x)
in some semiring)
// y = alpha*y
GrB_apply(y, GrB_NULL, GrB_NULL, GrB_TIMES_FP64, alpha, y, GrB_NULL);
// y += beta*A*x
GrB_mxm(y, GrB_NULL, GrB_PLUS_FP64, GrB_PLUS_TIMES_SEMIRING_FP64,
        A, x, GrB_NULL);
(7) rocSPARSE ROCm C, generic API (has multiple required stages)
status = rocsparse_spmv(handle, rocsparse_operation_none,
                        p_alpha, matA, vecX, p_beta, vecY,
                        rocsparse_datatype_f64_r,
                        rocsparse_spmv_alg_csr_adaptive,
                        rocsparse_spmv_stage_compute,
                        p_buffer_size, temp_buffer);
(9) oneMKL SYCL C++ API with an out-of-order queue (has an optional
optimize stage)
ev_gemv = oneapi::mkl::sparse::gemv(
    queue, oneapi::mkl::transpose::nontrans,
    alpha, csrA, x, beta, y, {ev_opt});
(10) oneMKL SYCL C++ API with an in-order queue (has an optional optimize
stage)
oneapi::mkl::sparse::gemv(
    queue, oneapi::mkl::transpose::nontrans,
    alpha, csrA, x, beta, y);
5.4 API Design Considerations
From a high-level point of view, there exist two classes of sparse BLAS opera-
tions:
1. Operations where the sparsity pattern of the output object is known a
priori, e.g. the multiplication of a sparse matrix with a vector or a dense
matrix, or the sampled multiplication of two dense matrices, or the scaling
of a matrix. These operations can be realized with a single kernel operating
on pre-allocated output data.
2. Operations where the sparsity pattern of the output object is not known
a priori, e.g. the multiplication of two sparse matrices, a filtering oper-
ation removing values below a pre-defined threshold, or the addition of
two sparse matrices (with – in general – different sparsity patterns). For
these operations, it is impossible to execute them using a single kernel as
the memory requirement for the output is not known a priori, and the
memory allocation can only happen after the number of nonzero elements
has been computed. For this reason, operations of this class are typically
implemented using the pattern compute, allocate, fill [26].
For both single-stage and multi-stage operations that produce sparse output,
the sparsity structure of the output is based on the symbolic operation, not on
the numeric operations. In other words, a numeric calculation resulting in a
numeric zero does not result in omitting this entry from the output sparsity
pattern.
While there exist strong arguments for unifying the interface for the differ-
ent functionality to always require the same number of stages (conceptually,
a matrix-matrix multiplication is the same operation, agnostic to the non-zero
structure and the storage format of the matrices), we decided to sacrifice consis-
tency in favor of reduced complexity. In particular, we use a one-stage API for
functionality that produces output of known sparsity pattern, and a multi-stage
API for functionality that produces output of unknown sparsity pattern.
Functionality operating on sparse objects also differs in another aspect from
functionality operating on dense objects: depending on the distribution of the
nonzero entries, different parameter choices may be optimal for fast execution
– up to the level where different algorithms may be chosen depending on the
nonzero distribution. To account for this, we consider an additional inspect
phase that is optional to all functionality (single-stage or multi-stage) and al-
lows for additional optimization with regard to the selection of the algorithm
and parameters for processing the operation and for storing some additional
information in the matrix handle. The optimization may in particular pay off
in case of repeated function invocation, e.g., the repeated invocation of a sparse
matrix vector product inside an iterative solver loop. If the user feeds in plain
views into the operation, the inspect phase can invoke operation-specific opti-
mization but cannot access additional information in the optimization structure
of a matrix handle. If the optional inspect phase is not invoked prior to the op-
eration, a default algorithm and parameter selection is used. This concept of
an optional inspect phase is in line with the current design of vendor interfaces,
see Section 5.3.
In the next sections, we will discuss the single-stage API for functional-
ity producing output with a-priori known sparsity pattern and the multi-stage
API for functionality producing output with a-priori unknown sparsity pattern.
Note that all operations take policy and state objects as inputs. The policy
object is opaque to the user. It may contain any vendor-specific execution poli-
cies, similar to a stream or queue. The state object is operation-specific and
can be used to store any operation-specific data, e.g. a dependency graph in a
triangular solve or data in an operation-specific library-internal representation.
For the multi-stage functionality, it may also contain information about which
stages have already been completed to avoid re-calculation.
Listing 2: Scaling, A := αA
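A minimal sketch of this operation in the single-stage style of the following listings (the scale function name and its state type are assumptions):

using namespace spblas;
csr_view<float> A(values, rowptr, colind, shape, nnz);
float alpha = ...;
scale_state_t state(allocator);
// in-place scaling of the stored values, A := alpha * A
scale(policy, state, alpha, A);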
using namespace sparseblas;
csr_view<float> A(values, rowptr, colind, shape, nnz);
matrix_frob_norm_state_t state(allocator);
// A is const; function returns Frobenius norm
auto frob_nrm = matrix_frob_norm(policy, state, A);
// or - without passing a policy, resulting in a default:
auto frob_nrm = matrix_frob_norm(state, A);
using namespace spblas;
csr_view<float> A(values, rowptr, colind, shape, nnz);
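// A_handle is assumed to wrap A in a matrix handle,
// e.g. auto A_handle = matrix_handle(A, allocator);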
// actual multiplication y = A * x
multiply(policy, state, A_handle, x, y);
using namespace spblas;
csr_view<float> A(values, rowptr, colind, shape, nnz);
float alpha = ..., beta = ...;
using namespace spblas;
// Matrix T data is already in lower or upper triangular form
// physically in the matrix arrays for the triangular
// solve
csr_view<float> T(values, rowptr, colind, shape, nnz);
sampled_multiply_state_t state(allocator);
sampled_multiply_inspect(policy, state, X, Y, C);
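// execute stage following the single-stage pattern above (name assumed):
sampled_multiply(policy, state, X, Y, C);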
5.6 Multi-Stage API for Operations with unknown Out-
put
As mentioned above, a fundamental difference to dense BLAS functionality is
that for basic operations on sparse objects, the sparsity pattern (and therewith
the sparse data structure) is not always known beforehand. An example is the
multiplication of two sparse matrices A, B, resulting in another sparse matrix
C = A · B with a pattern that typically is unknown prior to the computation.
In consequence, functionality producing sparse output with the sparsity pattern
unknown prior to the computation typically splits into three phases (or four if
including the optional inspect phase):
1. inspect phase – (optional) prepare any potential optimizations for subse-
quent phases, or may do nothing
2. compute phase – computing the size of the sparse output structure, which
typically requires significant work
3. allocation phase – allocating the memory for the sparse output data struc-
ture and placing it in the output matrix object to be filled
4. fill phase – complete execution of the operation and filling the output
structure with the result.
An important design question for functionality producing sparse output is
whether the memory allocation is done by the user or the library. While there
exist strong arguments on both sides, and passing a memory allocator to the
library poses an elegant way to resolve memory location and ownership, we ini-
tially propose an interface that requires the library user to allocate the memory
for the sparse output data after providing the memory requirement informa-
tion. While we, in the following, focus on the multiplication of sparse matrices
(typically called SpGEMM in literature), the considerations and design aspects
hold also for other basic functionalities that are producing output with sparsity
pattern unknown prior to the computation. In Table 1 we list several: sparse
matrix – sparse matrix multiplication, matrix sparse format conversion, filter
operations removing elements below a pre-defined threshold, and the addition
of sparse matrices with generally differing sparsity patterns.
Acknowledging the previously motivated inspect phase, the multiplication of
two sparse matrices
C = α · op(A) · op(B) + β · D
can be realized using the following sequence of steps (using APIs that will
be defined subsequently):
sparse_multiply_state_t state(allocator);
sparse_multiply_inspect(policy, state, A, B, C, D); // optional
sparse_multiply_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
// the user allocates the arrays for C
sparse_multiply_fill(policy, state, A, B, C, D);
// C structure and values are now ready to be used
We note that we change the function name from multiply (which we used for
the multiplication of sparse with dense matrices or vectors) to sparse_multiply
to indicate that this operation generates sparse output and that a fundamentally
different multi-stage API is needed. In consequence, it is not possible to just
call multiply(A, B, C) on sparse objects A, B, C.
We also note that this interface gives the software developers a significant
level of flexibility and room for optimization.
A standard way is to place the structural analysis in the multiply_compute
stage, then allocate the output data structure and fill it in the multiply_fill
stage. Internal buffers needed in the structural analysis can be allocated and
de-allocated by the library in the multiply_compute routine.
The only step that critically has to happen in the multiply_compute rou-
tine is the computation of the number of nonzero entries in the output, so that
the output data structure can be allocated. The multiply_fill routine
is required to fill the user-provided arrays with the output data so that it is
usable after the multiply_fill stage. These are the only requirements on the
multiply_compute and multiply_fill stages, and thus the developers are left
with a lot of freedom to realize the functionality. For example, developers can
decide to already fill the CSR rowptr in the multiply_compute routine. Or they can
pack even more into this first stage: an approach allowing for more library-internal
optimization is to pack both the structural analysis and the multiplication
into the multiply_compute routine, using an internal output data structure that
is part of the state object. The multiply_fill routine then only extracts the
result into the output data structure allocated by the user. Any library-internal
output data structures live in the state object and would be deleted when state
is destroyed. Such an implementation does not break the interface design.
A critical question is how the library allocates these internal buffers, via a
library-preferred allocator or an allocator passed to the library from the user.
This boils down to whether the user has buffer management on the application
level, i.e., manages all the device memory in the application. At the same time,
requiring the user to always provide an allocator to the library can become
cumbersome. In the current design proposal, we allow the user to pass in a
memory allocator (for the memory space where the computations are executed,
i.e. the “compute device”) that is then used by the library for the allocation of
buffers. If no allocator is passed, the library is free to use its own allocator.
Some scenarios additionally benefit from splitting the multiplication of sparse
matrices and other routines operating on sparse data structures into a symbolic
and a numeric stage. This, for example, can allow for additional optimization
in the repeated multiplication of matrices changing only the numeric values
but preserving the sparsity pattern. Hence, we propose a variant of the above
interface splitting the routines multiply_compute and multiply_fill each into a
symbolic and a numeric stage:
sparse_multiply_state_t state(allocator);
multiply_inspect(policy, state, A, B, C, D); // optional
multiply_symbolic_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
// allocate C arrays and put in C (possibly both structure
// and values or just structure arrays at this point)
multiply_symbolic_fill(policy, state, A, B, C, D);
// C structure is now able to be used
This supports several usage scenarios:
1. Single invocation of the functionality:
sparse_multiply_state_t state(allocator);
sparse_multiply_inspect(policy, state, A, B, C, D); // optional
sparse_multiply_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
// allocate C arrays
sparse_multiply_fill(policy, state, A, B, C, D);
// C structure is now able to be used
2. Repeated invocation of the functionality with numerically differing matri-
ces:
sparse_multiply_state_t state(allocator);
sparse_multiply_inspect(policy, state, A, B, C, D); // optional
for (i = 0; i < n; i++) {
// may do nothing if nnz is already computed
// and information is stored in the state
sparse_multiply_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
if (i == 0) {
// allocate C arrays
}
sparse_multiply_fill(policy, state, A, B, C, D);
// C structure is now able to be used
}
3. Using the split into numeric and symbolic phase for repeated invocation
of the functionality with numerically differing matrices:
sparse_multiply_state_t state(allocator);
sparse_multiply_inspect(policy, state, A, B, C, D); // optional
sparse_multiply_symbolic_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
// allocate C arrays
sparse_multiply_symbolic_fill(policy, state, A, B, C, D);
for (i = 0; i < n; i++) {
// only numeric computations inside the loop
multiply_numeric_compute(policy, state, A, B, C, D);
multiply_numeric_fill(policy, state, A, B, C, D);
// C structure is now able to be used
}
Up to now, we provided motivation and details for the design of the API for
the sparse matrix - sparse matrix multiply operation, but the same multi-stage
interface applies also to other functionality with nonzero structure unknown
a priori.
using namespace sparseblas;
auto A = std::mdspan(raw_x.data(), m, n);
csr_view<float> B(m, n);
convert_state_t state(allocator);
convert_inspect(policy, state, A, B);
convert_compute(policy, state, A, B);
index_t nnz = state.get_result_nnz();
// allocate the memory for B and put them in B
convert_fill(policy, state, A, B);
multiply_elementwise_state_t state(allocator);
multiply_elementwise_inspect(policy, state, A, B, C); // optional
multiply_elementwise_compute(policy, state, A, B, C);
index_t nnz = state.get_result_nnz();
// allocate C arrays and put in C
multiply_elementwise_fill(policy, state, A, B, C);
// C structure and values are now able to be used
using namespace sparseblas;
csr_view<float> A(values, rowptr, colind, shape, nnz);
// auto A = matrix_handle(csr_view<float>(...), allocator);
// ... likewise for B
csr_view<float> C(m, n);
// auto C = matrix_handle(csr_view<float>(...), allocator);
add_state_t state(allocator);
add_inspect(policy, state, A, B, C); // optional
add_compute(policy, state, A, B, C);
index_t nnz = state.get_result_nnz();
// allocate C arrays and put in C
add_fill(policy, state, A, B, C);
// C structure and values are now able to be used
filter_state_t state(allocator);
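// a sketch of the remaining filter stages, following the multi-stage
// pattern of the add and convert listings above (function names assumed):
filter_inspect(policy, state, A, C); // optional
filter_compute(policy, state, A, C);
index_t nnz = state.get_result_nnz();
// allocate C arrays and put in C
filter_fill(policy, state, A, C);
// C structure and values are now able to be used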
6 Execution Model
The world of high-performance computing has recently transformed from a
CPU-centric model to a heterogeneous model in which computations must be
performed on CPUs, GPUs, and other accelerators. To support these hetero-
geneous environments, computing languages and libraries have evolved from
supporting only synchronous execution models, where a kernel or function is
submitted, enqueued, and executed immediately, to asynchronous execution
models, where kernel submission returns without waiting for completion.
This separation between kernel submission and kernel execution has proved
beneficial for enabling performance on heterogeneous systems, both by avoiding
unnecessary synchronization and by allowing data movement, kernel launch,
and other overheads associated with heterogeneous execution to be hidden.
Standard C++ is transitioning toward a standard model for asynchronous
execution with std::execution (senders and receivers) [19, 42], which serves as the
glue allowing different libraries with asynchronous operations to interoperate with
one another. Given the increased performance and ease-of-use associated with
using standard tools to handle asynchrony instead of disparate vendor APIs, we
recognize the importance of using standard, interoperable tools for asynchrony.
In the future, we anticipate adopting the std::execution model, with a collec-
tion of asynchronous methods that extend the standard Sparse BLAS API using
std::execution concepts. However, given the early stage of std::execution
implementations, this full specification will likely have to wait until the ecosys-
tem is more mature.
With the current state of vendor support for asynchronous C++ program-
ming, we anticipate four guidelines at this stage for asynchronous support. First,
any Sparse BLAS API that supports asynchronous execution should be included
in the async namespace [42], to clearly denote expected behavior for users.
For example, spblas::async::multiply(). Second, all asynchronous and syn-
chronous APIs that inspect or execute shall take an execution policy as an
argument and respect it with regards to the parallel execution. In addition
to the standard execution policies defined by C++, vendors may define their
own execution policies, which have implementation-defined behavior. Vendors
should also provide APIs that do not require an execution policy. These exe-
cution policy–free APIs may perform execution in parallel so long as they do
not violate the semantic requirements of their respective algorithms. Third,
any implementation-defined objects that control asynchronous behavior (e.g.
CUDA/ROCm streams or SYCL sycl::queue and sycl::event objects) can be
provided to the algorithm through an implementation-defined execution pol-
icy. Fourth, we observe that scaling values in Sparse BLAS operations (e.g. α
and β in sparse matrix-vector multiplication, y = αA · x + βy) may be repre-
sented as either arguments provided on the host or as device-side scalars that
will be dynamically computed by another asynchronous kernel. A high-quality
implementation should support both sources of scaling values.
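As an illustration of these guidelines (vendor_policy, the state variable, and the async overload are assumptions):

// synchronous, parallel execution with a standard C++ execution policy
spblas::multiply(std::execution::par, state, A, x, y);

// asynchronous execution: an implementation-defined policy carries the
// vendor queue/stream, and the call lives in the async namespace
auto ev = spblas::async::multiply(vendor_policy{queue}, state, A, x, y);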
Some libraries provide a mechanism for the user to select which of one or
more internal algorithms should be used in a particular sparse BLAS
operation. The default behavior is for the library to make such algorithmic
choices, perhaps using information gained from analysis of the inputs in the
optional “inspect phase.” However, in the case that the user would like to select
a particular implementation-defined algorithm, this should be
supported by vendors via their own implementation-defined execution policies.
7 Numerical considerations
This section describes the design space impacting the numerical properties of the
Sparse BLAS, outlines the choices to be made, explains how these choices may
trade off performance and “consistency” (to be defined below), and proposes
which options should be required, recommended, or not recommended. The
design space is defined by the following, often intertwined considerations:
Numerical Reproducibility). This is a higher, and more expensive, level
of consistency than just exception handling. So again, our initial design
does not require reproducibility.
5. Do we need test suites? We describe how extensively test code should test
a proposed SparseBLAS implementation, given the design considerations
above.
to reduce accumulation errors. In particular, some architecture units already
implement a higher accumulation format in hardware. In the Sparse BLAS API,
we do not specify the internal precision, but anticipate it having at least the
accuracy of the input/output format. We anticipate that any deviation from
this concept is documented. We encourage the implementation of functionality
variants, as long as these are also well-documented (including their numerical
properties).
The first term, which is itself a dot product, can be bounded in various ways,
such as by $\|x - \hat{x}\|_1 \cdot \|y\|_\infty = \left(\sum_{i=1}^{n} |x_i - \hat{x}_i|\right) \cdot \max_{1 \le i \le n} |y_i|$, by
$\|x - \hat{x}\|_2 \cdot \|y\|_2$ (Cauchy-Schwarz), or by $\|x - \hat{x}\|_\infty \cdot \|y\|_1$; the second term is analogous. Thus it
suffices to bound the error of converting each $x_i$ to $\hat{x}_i$ (and analogously for $y_i$ to
$\hat{y}_i$). If there is no conversion, or if the internal format is higher precision than the
input precision, then $x = \hat{x}$ (and $y = \hat{y}$). If the conversion is from floating point
to block floating point, then the error could be as large as $|x_i - \hat{x}_i| \le \epsilon \, \|x\|_\infty$,
where $\epsilon$ depends on the length of the fixed point format used within the block
floating point format. If the conversion is from a higher to lower precision
floating point format, then $|x_i - \hat{x}_i| \le \epsilon |x_i| + UN$, where $\epsilon$ is the machine
precision of the internal format, and $UN$ is the error from underflow (which
depends on whether flush-to-zero or gradual underflow is used).
The third term, $\left|\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i - \hat{z}\right|$, is the error in actually computing the dot
product. If this is done in floating point with machine precision $\epsilon$ and underflow
error $UN$, the error can be bounded by
\[
\Bigl|\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i - \hat{z}\Bigr| \le f(n) \cdot \epsilon \cdot \sum_{i=1}^{n} |\hat{x}_i| \cdot |\hat{y}_i| + g(n) \cdot UN
\]
where $f(n)$ and $g(n)$ depend on the order of summation, ranging from $f(n) = n$
for conventional serial summation to $f(n) = \lceil \log_2 n \rceil + 1$ when using a binary
tree; $g(n) \le 2n$ in both cases (see [13] for a more detailed analysis of underflow).
We note that if one knows that only m < n of the summands x̂i · ŷi are nonzero,
due to sparsity, then one can replace n by m in these bounds.
We also note that there is a large literature on using instructions like fused-
multiply-add, and floating-point tricks like two-sum, to compute dot products
with arbitrarily higher accuracy. These could possibly be of interest in using
low-precision floating point to simulate higher precision.
If the dot product is computed with block floating point using a fixed point
to represent each number, then the error bound depends on the width of the
fixed point accumulator used to sum the fixed point products. If it is wide
enough to compute the sum exactly, the error is zero. If it is necessary to right-
shift the integer parts of $\hat{x}_i$ and $\hat{y}_i$ to avoid integer overflow, then the error can
be bounded by
\[
\Bigl|\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i - \hat{z}\Bigr| \le n \cdot \epsilon \cdot \max_{1 \le i \le n} |\hat{x}_i| \cdot \max_{1 \le i \le n} |\hat{y}_i|
\]
Other approaches to avoiding overflow, with different error bounds, are imagin-
able.
In the case of posits, where the precision ǫ depends on the magnitude of the
numbers (numbers closer to 1 are more accurate than much larger or smaller
numbers), we refer to [16] and [14], the latter of which discusses an older pro-
posed format with similar properties. We also note that the designers of posits
propose including a “quire”, a long accumulator capable of computing long dot
products exactly.
Long accumulators for exact dot products with standard floating point ar-
guments have also been proposed [38].
Finally, the last error term, $|\hat{z} - \hat{\hat{z}}|$, bounds the error when converting the
final result from the internal format $\hat{z}$ to the output format $\hat{\hat{z}}$. If the output is
lower precision than the input, then this can be bounded by $\epsilon |\hat{z}| + UN$, where
$\epsilon$ and $UN$ are those of the output format.
should not. Sparsity makes this more subtle because a sparse matrix contains
implicit zeros, which are not stored, are not expected to participate in arith-
metic operations, and so should not create NaNs (from 0*NaN=NaN or 0*Inf=NaN) to
propagate. For example, in SpMV, computing y = A*x, if x(i) = NaN and the
i-th column of A is all implicit zeros, then x(i)=NaN will not cause any entry
of y to be a NaN, which is the user’s expectation.
But in fact a sparse matrix may contain three kinds of zeros that we may treat
differently:
1. An implicit zero. As described above, this is an entry A(i,j) that is not
stored in the user’s input data structure, and the user does not expect it
to participate in any arithmetic operations. We emphasize “user’s input
data structure” to distinguish from case (3) below.
2. An explicit zero. This is a zero stored in a data structure, and it is
expected to participate in the same arithmetic operations that a nonzero
value would. The data structure could be the user’s input data structure,
or one that is created by the system to optimize performance (examples
below).
3. An explicit masked zero. This refers to a zero that is explicitly stored in
a data structure, but another part of that data structure (a “mask”) indi-
cates that it is not supposed to participate in any arithmetic operations,
like an implicit zero.
Here are two examples of why the system may create a new data structure
with explicit zeros replacing some implicit zeros. First, being able to access and
operate on a segment of matrix entries of some fixed length makes it easier to
use SIMD instructions on some architectures. Second, it may only be necessary
to store one column index for such a segment instead of multiple indices (e.g.
blocked CSR, or BCSR), saving memory and reducing memory access time.
One example of a user’s input data structure that can contain explicit masked
zeros is ELLPACK, in which the nonzeros of an m x n sparse matrix are stored
in an m x k matrix A, and row i of A contains the nonzeros in row i of the
sparse matrix, where k is the maximum number of nonzeros in any row. This
means row i of A is padded with additional explicit zeros if it contains fewer
than k nonzeros. There is also an m x k Col matrix containing the column
indices of the matrix entries stored in A, with an index of -1 to mask out any
padded zeros.
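For illustration (values chosen arbitrarily), a 3 x 3 sparse matrix with at most k = 2 nonzeros per row could be padded as follows, with the padded explicit zero masked out by the column index -1:

sparse matrix      A (m x k values)    Col (m x k indices)
[ 1 0 2 ]          [ 1  2 ]            [ 0  2 ]
[ 0 3 0 ]          [ 3  0 ]            [ 1 -1 ]
[ 4 5 0 ]          [ 4  5 ]            [ 0  1 ]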
To be “consistent” given these different kinds of zeros, we require both that
implicit zeros and explicit masked zeros do not create NaNs which would then
propagate to the output.
So far we have discussed the SpMV operation defined as y = A · x. If it
were defined as y = α · A · x + β · y, as in the dense BLAS, then we need to
consider the cases α = 0 and β = 0. In the dense BLAS, β = 0 is defined to
mean that the requested operation is y = α · A · x, i.e. the input y is never read,
only written, so any Infs or NaNs it may contain are irrelevant, and should not
propagate. Similarly, α = 0 is defined to mean y = β · y, so Infs and NaNs in
A or x are irrelevant, and never accessed. And if α = β = 0, the vector y is
simply initialized to 0.
We note that we are not trying to define all the cases in which an entry
of y must be an Inf or a NaN. To see why, suppose the result is the sum of four
numbers: [z, z, −z, −z] where z is a finite number greater than half the overflow
threshold. Then depending on the order of summation, the result could be NaN,
+Inf, -Inf or 0 (the correct answer). And if there is a fifth summand equal to
+Inf, the result could be +Inf or NaN. So consistency requires only that an Inf
input propagates to the output as an Inf or NaN; a NaN will naturally propagate
as a NaN. The issue of getting the bitwise identical result every time SpMV is
performed is the separate question of reproducibility, discussed below.
Finally, for the more complicated case of solving a triangular system of
equations, see [15] for a discussion of the dense case.
cluding on separate computer architectures. Can be less performant than
Strict CNR property algorithms.
While the “Full Numerical Reproducibility” property might seem the most
desirable, it can be extremely difficult to achieve on modern architectures and
would likely sacrifice a great amount of performance, so it is unlikely to be
supported by any Sparse BLAS library. We will restrict the remaining
discussions to the support of the CNR and strict CNR properties in Sparse
BLAS, which mean run-to-run reproducibility within a single executable with
all things equal except possibly the number of computational threads.
While it is common for some sparse formats and algorithms to be naturally
reproducible due to the method of parallelizing the work (for instance, using row-
based parallelism for SpMV in CSR format, where a single computation thread
processes the multiply-adds with the dense vector across a sparse matrix row
in the order that the column indices were stored in that row), there are other
cases where reductions, atomic operations, or general algorithms with multiple
computational threads contributing to the floating point result require more
care in ensuring the order of floating point operations is maintained from run
to run. We leave it to the Sparse BLAS function implementer to come up with
algorithmic ways to ensure such reproducible results, and we now turn
our attention to how a user could signal to the function that such care is
required.
We propose to introduce a global CNR property for the library that can be
set via the following APIs. We also recommend that, when algorithms are avail-
able for selection in an API, library implementers denote in the documentation
which algorithms have which CNR properties. This would
allow a fine-grained approach to be applied by users as well, without the global
switch being set. If the global property and the local algorithm selection do not
align, the local algorithm selection will be preferred.
namespace spblas {
cnrProperty get_cnr_property();
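// a matching setter and the namespace close are assumed to follow, e.g.:
void set_cnr_property(cnrProperty property);
} // namespace spblas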
Finally, we note that this notion of conditional numerical reproducibility is
orthogonal to the notion of consistent exception handling discussed previously.
Both can be set or not set independently. There is a small intersection: the
combination of “permissive” and “cnr” would mean that you always get
an Inf or a NaN in the same output location from run to run, but it may not
be “consistent” in the sense that we cannot guarantee which NaN payload is
propagated in an operation like NaN+NaN, just that one of the arguments is
propagated (see Sec. 6.2.3 of the IEEE 754 standard).
8 Future extensions
As hinted at in Section 4, we also recognize the value of many other sparse
formats that are not covered so far in this proposal, like the CSR-4 array for-
mat, the Block CSR formats, the Ellpack format and its several variants, and
many others found in literature and existing sparse BLAS libraries. Subsequent
extensions of this proposal should address the addition of more sparse matrix
formats to the sparse BLAS support. We do recommend that format conver-
sions between these additional formats go back through one
of the original core CSR/CSC/COO sparse matrix formats to limit the scope
and potential need for conversion routines. Likewise, as the formats expand,
each implementation will need to define how they will handle matrix products
or matrix additions between disparate input sparse formats, and denote which
format pairs for inputs are actually supported.
Functionality-wise, we plan to expand the sparse triangular solves to also
accept dense matrices and sparse vectors, and extend the scope to more complex
algorithms like solvers and graph algorithms.
We also recognize the increasing use of tensors, which are higher-dimensional
analogs of matrices, in computing.
Sparse tensor computations such as tensor decompositions and contractions
present unique challenges, ranging from sparse tensor formats to different func-
tions such as sparse tensor times tensor multiplication and the sparse matricized
tensor times Khatri-Rao product (MTTKRP).
Our current design is not optimal for representing and computing on tensors,
even when they are matricized, but we plan to provide extensions of our API
for sparse tensor computations.
References
[1] Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub
Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire To-
mov. Numerical linear algebra on emerging architectures: The PLASMA
and MAGMA projects. In Journal of Physics: Conference Series, volume
180, page 012037. IOP Publishing, 2009.
[2] AMD Corporation. AMD Instinct MI250X accelerator.
[9] Aydın Buluç and John R Gilbert. The Combinatorial BLAS: Design, im-
plementation, and applications. The International Journal of High Perfor-
mance Computing Applications, 25(4):496–509, 2011.
[10] Cerebras Corporation. Cerebras. 2022.
[11] Timothy A. Davis. Algorithm 1000: SuiteSparse:GraphBLAS: Graph algorithms
in the language of sparse linear algebra. ACM Transactions on Mathematical
Software, 45(4), 2019.
[20] Jack J. Dongarra, J. Du Croz, Iain S. Duff, and Sven Hammarling. Al-
gorithm 679: A set of Level 3 Basic Linear Algebra Subprograms. ACM
Transactions on Mathematical Software, 16:1–17, March 1990.
[21] Jack J. Dongarra, J. Du Croz, Iain S. Duff, and Sven Hammarling. A
set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on
Mathematical Software, 16:18–28, March 1990.
[22] Jack J. Dongarra, J. Du Croz, Sven Hammarling, and R. Hanson. Al-
gorithm 656: An extended set of FORTRAN Basic Linear Algebra Sub-
programs. ACM Transactions on Mathematical Software, 14:18–32, March
1988.
[23] Jack J. Dongarra, J. Du Croz, Sven Hammarling, and R. Hanson. An ex-
tended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans-
actions on Mathematical Software, 14:1–17, March 1988.
[24] Jack J. Dongarra, Cleve B. Moler, J. R. Bunch, and G. W. Stewart. LIN-
PACK Users’ Guide. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 1979.
[25] Iain S. Duff, Michael A. Heroux, and Roldan Pozo. An overview of the
sparse basic linear algebra subprograms: The new standard from the BLAS
Technical Forum. ACM Trans. Math. Softw., 28(2):239–267, jun 2002.
[26] Iain S. Duff, Michele Marrone, Giuseppe Radicati, and Carlo Vittoli. Level
3 basic linear algebra subprograms for sparse matrices: A user-level inter-
face. ACM Trans. Math. Softw., 23(3):379–401, sep 1997.
[27] Salvatore Filippone and Alfredo Buttari. Object-oriented techniques for
sparse matrix computations in Fortran 2003. ACM Trans. Math. Softw.,
38(4), aug 2012.
[28] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel
linear algebra computation on sparse matrices. ACM Trans. Math. Softw.,
26(4):527–550, dec 2000.
[29] Mark Gates, Piotr Luszczek, Jakub Kurzak, Jack Dongarra, Konstantin
Arturov, Cris Cecka, and Chip Freitag. SLATE working note 2: C++
API for BLAS and LAPACK. Technical Report ICL-UT-17-03, Innovative
Computing Laboratory, University of Tennessee, June 2017. revision 06-
2017.
[30] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3.
[31] J. Gustafson and I. Yonemoto. Beating floating point at its own game:
Posit arithmetic. Supercomputing Frontiers and Innovations, 4:71–86, 2017.
https://posithub.org/docs/posit_standard-2.pdf.
[32] Azzam Haidar, Harun Bayraktar, Stanimire Tomov, Jack Dongarra, and
Nicholas J Higham. Mixed-precision iterative refinement using tensor cores
on GPUs to accelerate solution of linear systems. Proceedings of the Royal
Society A, 476(2243):20200110, 2020.
[33] Nicholas J Higham and Theo Mary. Mixed precision algorithms in numer-
ical linear algebra. Acta Numerica, 31:347–414, 2022.
[34] Mark Hoemmen, Daisy Hollman, Christian Trott, Daniel Sunderland,
Nevin Liber, Alicia Klinvex, Li-Ta Lo, Damien Lebrun-Grandie, Graham
Lopez, Peter Caday, Sarah Knepper, Piotr Luszczek, and Timothy Costa.
P1673R13: A free function linear algebra interface based on the BLAS.
[35] Nevin Liber, Li-Ta Lo, Damien Lebrun-Grandie, Graham Lopez, Peter Ca-
day, Sarah Knepper, et al. P1673: A free function linear algebra interface
based on the BLAS. Technical report, Open Standards, 2023.
[45] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza
Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric
Dellinger, Kristof Denolf, et al. Microscaling data formats for deep learning.
arXiv preprint arXiv:2310.10537, 2023.
[46] Jeremy G Siek and Andrew Lumsdaine. The Matrix Template Library:
A generic programming approach to high performance numerical linear
algebra. In International Symposium on Computing in Object-Oriented
Parallel Environments, pages 59–70. Springer, 1998.
[47] William J Starke, Brian W Thompto, Jeff A Stuecheli, and José E Moreira.
IBM’s POWER10 processor. IEEE Micro, 41(2):7–14, 2021.
[48] Joerg Walter, Mathias Koch, et al. Boost uBLAS: Basic Linear Algebra
Library.
[49] Shibo Wang and Pankaj Kanwar. BFloat16: The secret to high performance
on cloud TPUs. Google Cloud Blog, 4, 2019.
[50] Erik Welch, Jim Kitchen, Sultan Orazbayev, ParticularMiner, Stan Seibert,
William Zijie Zhang, Adam Lugowski, and Paul Nguyen.
python-graphblas/python-graphblas: 2024.2.0, February 2024.
[51] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice Hall,
1963.
[52] Carl Yang, Aydın Buluç, and John D Owens. GraphBLAST: A high-
performance linear algebra-based graph framework on the GPU. ACM
Transactions on Mathematical Software (TOMS), 48(1):1–51, 2022.
[53] AN Yzelman, D Di Nardo, JM Nash, and WJ Suijlen. A C++ GraphBLAS:
specification, implementation, parallelisation, and evaluation. Preprint,
2020.