
arXiv:2411.13259v1 [cs.MS] 20 Nov 2024

Interface for Sparse Linear Algebra Operations

November 21, 2024

Abstract
The standardization of an interface for dense linear algebra operations in the
BLAS standard has enabled interoperability between different linear algebra libraries,
thereby boosting the success of scientific computing, in particular in scientific HPC:
from this point on, applications could interface highly optimized implementations
through a standardized API. Despite numerous efforts in the past, the community has
not yet agreed on a standardization for sparse linear algebra operations, for several
reasons. One is the fact that sparse linear algebra objects allow for many different
storage formats, and different hardware may favor different storage formats. This
makes the definition of a FORTRAN-style all-encompassing interface extremely
challenging. Another reason is that, as opposed to dense linear algebra functionality,
in sparse linear algebra the size of the sparse data structure holding the operation
result is not always known prior to the computation. Furthermore, as opposed to the
standardization effort for dense linear algebra, we are late in the technology readiness
cycle, and many production-ready software libraries using sparse linear algebra
routines have implemented and committed to their own sparse BLAS interface.
At the same time, there exists a demand for standardization that would improve
interoperability and sustainability, and allow for easier integration of building
blocks. In an inclusive, cross-institutional effort involving numerous academic
institutions, US National Labs, and industry, we spent two years designing a
hardware-portable interface for basic sparse linear algebra functionality that
serves the user needs and is compatible with the different interfaces currently
used by different vendors. In this paper, we present a C++ API for sparse
linear algebra functionality, discuss the design choices, and detail how software
developers retain considerable freedom in how they implement functionality
behind this API.

Contents
1 Introduction and Motivation

2 Related Efforts
  2.1 The (dense) BLAS standard
  2.2 The GraphBLAS standard
  2.3 Existing Sparse BLAS standards

3 Functionality Scope of the Sparse BLAS

4 Supported Sparse Matrix Storage Formats

5 API Design
  5.1 Ownership and Opaqueness
  5.2 Horizontal and Vertical Interoperability
  5.3 Review of existing API designs for SpMV
  5.4 API Design Considerations
  5.5 Single-Stage API for Operations with known Output
  5.6 Multi-Stage API for Operations with unknown Output

6 Execution Model

7 Numerical considerations
  7.1 Choices of numerical formats
  7.2 Error bounds to be satisfied
  7.3 Consistent exception handling from input to output
  7.4 Bit-wise reproducibility from run to run
  7.5 Test Suites

8 Future extensions

1 Introduction and Motivation


Despite the fact that sparse matrix computations are at the heart of many
science and engineering applications, the Sparse BLAS standard [25] proposed
decades ago has not gained enough traction in the community to become a
standard allowing for portability across different hardware architectures. A reason
for this may be that it was focused on CPU architectures, not considering the
plethora of optimization options relevant to today’s accelerator architectures,
such as GPUs.
In the meantime, many vendors already provide support for sparse ma-
trix computations in proprietary or open-source libraries, but due to diverging
architectural constraints and new functionality required, these libraries have
different execution models, APIs, and formats supported. At the same time,
the well-recognized and successful BLAS standard for dense linear algebra has
been modernized by the C++ committee by integrating the functionality into
std::linalg in the C++26 standard library.
Learning from these previous and ongoing efforts and building on top of
the C++ committee’s recent work, we propose a new C++-based Sparse BLAS
standard. Our main objectives are to enable portability across vendors, and
to provide high-performance, flexible and extensible interfaces for sparse linear
algebra functionality. Our API is designed to run on accelerators and
traditional CPUs alike. This is especially important given the pervasiveness
of accelerators in computing and the challenges in achieving high performance
from such hardware, which further increases the value of Sparse BLAS.
We recognize that linear algebra is a fundamental topic beyond classical
mathematical disciplines and physical sciences. Over the past decade, sparsity
has emerged as an important element of computation in AI. We argue that
a Sparse BLAS standard would provide a useful set of tools for modern AI
frameworks, potentially expanding to standardize new functionality such as low-
and mixed-precision computation and batch processing, and easing integration
with high-level deep learning frameworks.
While we may include more high-level routines in the future, such as solvers,
we initially restrict ourselves to basic functionality and also retain the familiar
name BLAS (Basic Linear Algebra Subroutines). Given the breadth and scope
of Sparse functionality (numerous matrix formats, complex algorithms, ...), we
expect the standardization effort to be an iterative process. We focus in the
current iteration on functionalities for which we found there is sufficient interest
and which are achievable for a first version of this standard. Therefore, the list
of features, kernels, and options may expand in the future. The current API design
proposal is the fruit of an open, collegial collaboration between industry and
university experts. To suggest functionality for inclusion, or to get involved in
the project, feel free to contact the coordinator of the effort as identified in the
author list.
Concerning interfaces, we currently focus on the C++ interface as the key
building block, but plan to also suggest interfaces for C and Fortran. We ac-
knowledge the use of high-level languages such as Python and Julia in the HPC
community and plan to provide interfaces also to high-level programming lan-
guages in a later iteration.

2 Related Efforts
2.1 The (dense) BLAS standard
The origins of the BLAS standard can be traced back to 1973 when Hanson,
Krogh, and Lawson wrote an article in the SIGNUM Newsletter (Vol. 8, no.
4, p. 16) describing the advantages of adopting a set of basic routines for prob-
lems in linear algebra. This led to the development of the original BLAS [41],
which indeed turned out to be advantageous and very successful. It was adopted
as a standard and used in a wide range of numerical software, including LIN-
PACK [24]. An extension, the Level 2 BLAS, was proposed for matrix-vector oper-
ations [23, 22]. At the time, these two levels were appropriate for vector proces-
sors and vector register chaining, respectively. Later in the 90s with the appear-
ance of cache-based machines, a strong need to restructure existing algorithms
to benefit from reusing data from cache emerged. The idea was to express algo-
rithms in terms of blocks that fit into cache memory, decreasing data movement
by providing a surface-to-volume effect for the ratio of O(n^2) data movement to
O(n^3) operations. Therefore, the Level 3 BLAS were proposed [20, 21], covering
matrix-matrix operations, and LINPACK was redesigned into LAPACK [3] to
use the new Level 3 BLAS where possible. With multi-core and GPU-accelerated
architectures, the BLAS have continued to be a valuable paradigm, including
in parallel task-based libraries such as PLASMA [1].
Several C++ interfaces have been proposed for the BLAS, ranging from
lightweight wrappers [29] to expression templates [46, 48, 30] that replace func-
tion calls with arithmetic expressions (C = A*B). Recently, the std::linalg
interface [35] was adopted as part of the C++26 standard, and has served as an
inspiration for this Sparse BLAS proposal. linalg is a templated, free-function
interface that leverages mdspan as an abstraction for dense matrices and ten-
sors. Matrix scaling and (conjugate) transposition are provided by lightweight
views of the matrix, which we have followed here. In a break from traditional
BLAS, functions use plain English names (multiply) instead of expert-focused
acronyms (gemm). Execution policies provide a mechanism for parallelization
and GPU acceleration.
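As an illustration of this style, the following sketch multiplies two dense mdspan
matrices through the std::linalg free-function interface. It assumes a conforming
C++26 standard library providing the <linalg> header; the function and view names
(matrix_product, scaled, transposed) are those of the P1673/std::linalg design.

#include <linalg>   // std::linalg (P1673), C++26
#include <mdspan>
#include <vector>

int main() {
  std::vector<double> a(4 * 3, 1.0), b(3 * 2, 1.0), c(4 * 2, 0.0);
  std::mdspan A(a.data(), 4, 3);
  std::mdspan B(b.data(), 3, 2);
  std::mdspan C(c.data(), 4, 2);

  // Plain-English name instead of gemm; scaling is expressed as a lightweight view.
  // C = 2.0 * A * B
  std::linalg::matrix_product(std::linalg::scaled(2.0, A), B, C);
}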

2.2 The GraphBLAS standard


Many graph algorithms can be expressed in the language of linear algebra, often
with no loss of performance. Expressing graph algorithms using linear algebra
exposes natural parallelism, allowing users to capitalize on optimized sparse lin-
ear algebra routines for multi-threaded, GPU, and multi-node execution. This,
together with the fact that most large graphs are sparse, positions sparse matrix
algebra as a viable abstraction for graph computations [37, 9].
The GraphBLAS API, which provides generalized sparse linear algebra prim-
itives for implementing graph algorithms, capitalizes on the duality between
sparse matrices and graphs [6]. The GraphBLAS interface introduces several
generalizations to traditional linear algebra, including semirings, which allow
operations to be performed using different sets of operators than the plus and
times used in the traditional arithmetic semiring. Different semirings are necessary
for different algorithms: for example, an implementation of single-source short-
est path (SSSP) might use a matrix-vector multiply with the plus-min semiring,
using plus to add the edge length to the current shortest path, followed by min to
update the path to the destination only if a shorter path has been found. Semir-
ings allow the same operation implementations to be reused across many different
graph algorithms. GraphBLAS also provides a few other features necessary for
graph computation, such as masked operations, permutations, and element-wise
operations.
The GraphBLAS C API Specification [6] defines a set of opaque containers
such as matrices and vectors along with generalized linear algebra operations
in C. SuiteSparse provides a high-performance, multi-threaded implementation
of the GraphBLAS C API, along with some support for GPU execution [11,
12]. There are also several GraphBLAS-like libraries in other programming
libraries [50, 53, 52]. The GraphBLAS C++ Specification provides a native C++
interface, providing better support for user-defined types through templates as
well as taking advantage of C++ concepts and views to simplify the interface [7,
8].
While GraphBLAS provides much of the functionality needed in
a Sparse BLAS interface, there are several issues that make it suboptimal for
adoption as a Sparse BLAS standard:
1. While GraphBLAS’ opaque objects provide a clean separation between
what data is owned by the user and what data is owned by the imple-
mentation, this separation generally requires data to be copied in and
out of GraphBLAS via its import/export API. For users who only want
to use a library to execute a few operations, this presents considerable
overhead. This issue can be resolved with the pack/unpack methods in
SuiteSparse:GraphBLAS, which take O(1) time and do not allocate
memory. However, these methods require the library and GraphBLAS to share
the same memory space, and they are extensions that are not part of the
GraphBLAS C API Specification.
2. GraphBLAS provides some complex functionality, such as masked oper-
ations and extract and assign routines, that are not needed by Sparse
BLAS, and that vendors may be unlikely to implement. The GraphBLAS
C API allows an unimplemented method to return an error code to indi-
cate that it is not implemented, so this issue could be resolved by creating
a well-defined Sparse BLAS subset of GraphBLAS.
3. Sparse BLAS has a few key routines, such as triangular solve, that are not
needed or included in GraphBLAS.
4. The GraphBLAS C++ API does not currently provide explicit support for
accelerators or explicitly control how and where code should be executed,
although the C++ specification does allow implementations to support ac-
celerators through C++ execution policies [7, 8]. Parallel and accelerated
execution should be a first-class consideration for a Sparse BLAS specifi-
cation.
5. Finally, there are currently no vendor-provided implementations of Graph-
BLAS, with vendors deferring to support open-source implementations
where necessary due to the breadth of the API. This Sparse BLAS ef-
fort includes math library developers from many vendors (Intel, NVIDIA,
AMD, Arm), and we anticipate working with vendors to ensure the inter-
face is suitable for them to produce high-performance implementations.

Nevertheless, we anticipate that the proposed standard and GraphBLAS will
overlap for a subset of functionalities, and the Sparse BLAS API will be capable
of efficiently expressing them. Several GraphBLAS innovations we may explore
utilizing in this iteration of the Sparse BLAS standard include: (1) operations
between two sparse objects, (2) masking, and (3) non-blocking or asynchronous
execution.

2.3 Existing Sparse BLAS standards


The first mention of a Sparse BLAS standard was in 1985, with Dodson et
al. [18] calling for Sparse extensions to the BLAS standard, recognizing the
growing importance of sparse solvers and preconditioners in research and indus-
try. At the time, the proposed extensions primarily focused on level-1 operations
(i.e., vector-vector operations) and the use of a compressed vector storage format.
Many important functions such as scatter, gather, and dot products are pro-
posed with interfaces in this short paper. The original SIGNUM newsletter
paper was extended in 1991 [17] but the focus on vector operations was kept.
Recognizing the lack of a matrix-focused Sparse BLAS proposal, in 1997,
Duff et al. [26] proposed a level-3 BLAS interface design (i.e., matrix-matrix
operations). In this paper, they propose 1) a sparse-matrix dense-matrix prod-
uct, 2) upper and lower triangular solves, 3) utility functions to convert between
matrices and scale matrices, and 4) routines to permute the rows and columns
of matrices. They also integrate workspace management into the interface with
appropriate error handling due to workspace allocations being a necessity in
several sparse-matrix operations. Thanks to Fortran 90, they also show how
to implement generic interfaces (e.g., called multiply for matrix products, with
a spmat type representing sparse matrices) that are simpler to use compared to
routines with letter codes and long argument lists.
In 2002, Duff et al. [25] grouped the previous proposals into a new standard-
ization proposal of operations on sparse matrices and vectors. This includes
operations equivalent to the dense BLAS level 1, level 2, and level 3 operations,
including sparse triangular solves, as well as operations for the creation of sparse
matrices and inserting nonzero elements into them. The proposed standard rec-
ognizes that almost every application area has a different way of storing and
accessing the nonzero entries of the sparse matrix, making it difficult to design
software applicable to all areas [25]. In response, the proposed interface takes
as input not the matrix entries themselves, but rather a pointer, or a handle
to a generic representation of a previously created sparse matrix object. This
allows algorithms to be coded using Sparse BLAS primitives without disclosing
the underlying matrix storage format. The authors do suggest a Fortran-style
interface for accessing Sparse BLAS functionality from C and Fortran. The
proposed 2002 SparseBLAS standard does not support operations between two
sparse objects (matrices or vectors) due to perceived complexity, which we plan
to address in this iteration.
In parallel with the previous efforts, several libraries with Sparse BLAS func-
tionality were developed. One such library that stands out is PSBLAS from
Filippone et al. [28]. The scope and breadth of the library make it akin to a
Sparse BLAS standard proposal, as most operations are supported and careful
attention is given to interface design. They also reuse the work of Duff et al. [26],
e.g., using the same matrix storage type and relying on the serial Sparse BLAS of the
previous proposals. Their focus is on proposing a parallel (node-distributed)
implementation of the algorithms and implementing the library using multiple
layers, with base functionality implemented in Fortran 77 and higher-level
interfaces in Fortran 90. In 2012, Filippone et al. [27] proposed an object-oriented
Fortran 2003 interface, leveraging recent programming language innovations that
greatly simplify Sparse BLAS usage and standardization. This design enables
the support of multiple formats through class inheritance while maintaining
an abstract matrix handle type. This handle type allows for conversions and
other optimizations while enabling simpler interfaces. Notably, the design also
considers move versus copy semantics for matrix conversions and user data.
We learn from and base our standard proposal upon the previous efforts to
provide an increasingly simple-to-use yet expressive Sparse BLAS set of func-
tionality. At the same time, we extend the scope by, e.g., introducing an inter-
face for the more complex sparse-matrix times sparse-matrix product deemed
complicated in the previous iterations. The hardware evolution, namely GPUs,
also requires us to introduce new functionality, such as user-provided allocators,
as memory management can be critical for these complex operations on such
complex hardware. Furthermore, we acknowledge in this proposal that for re-
peated application of a sparse operator, the initial investment in optimization –
analyzing sparsity patterns and characteristics, and fine-tuning format and exe-
cution policies – can quickly be amortized by faster execution. We thus propose
interfaces also allowing for an optimization phase prior to repeated execution of
functionality.

3 Functionality Scope of the Sparse BLAS


The goal of this effort is to define a common understanding of how to access a
set of standard data structures representing sparse linear algebra objects and
a common interface for basic functionality that operates on these objects. To
reduce complexity, we have omitted advanced functionalities that operate on
sparse objects such as iterative solvers, eigensolvers, and preconditioners. Even
so, the Sparse BLAS serves as an important foundation for building high-level
applications, facilitating integration between advanced algorithms and low-level
building blocks.
Our scope entails sparse storage format standardization as well as conver-
sion between formats, the multiplication and addition of matrices (including
operations involving both sparse and dense matrices), transposition and conju-
gation of sparse matrices, a solver for sparse triangular linear systems, scaling
of sparse matrices, norm calculations on sparse matrices, element-wise addition
and multiplication, and masking operations that extract a subset of the nonzero
entries of a sparse matrix as in the sampled dense times dense matrix multi-
plication (SDDMM) kernel common in machine learning frameworks. A list of
functionality falling into this scope is given in Table 1. For convenience, we
also summarize the API type for each functionality as detailed in Section 5.
Multi-stage APIs produce a Sparse Matrix as an output, whereas single-stage
APIs produce a dense object as an output.

Operation Notation API Type


Scaling A := αA Single-stage
Transpose and Conjugate Transpose A := A^T, A := A^H Multi-stage
Infinity Matrix Norm α := ‖A‖_inf Single-stage
Frobenius Matrix Norm α := ‖A‖_F Single-stage
Element-wise Multiplication C := A .* B Multi-stage
Sparse Matrix – Sparse Matrix Addition C := A + B Multi-stage
Sparse Matrix – Sparse Matrix Multiplication C := A · B + D Multi-stage
Sparse Matrix – Dense Matrix Multiplication Y′ := αA · X + βY Single-stage
Sparse Matrix – Dense Vector Multiplication y′ := αA · x + βy Single-stage
Triangular Solve Solve A · y = x for y Single-stage
Sparse Matrix Format Conversion B = sparse(A) Multi-stage
Predicate Selection B = A(predicate) Multi-stage
Sampled Dense – Dense Matrix Multiplication C⟨mask⟩ := X · Y Single-stage

Table 1: List of the Sparse BLAS functionalities. Uppercase letters represent
matrices; lowercase represent vectors. Letters from the start of the alphabet
are used for sparse matrices/vectors; letters from the end are used for dense
matrices/vectors. α is a scalar value. Note that scaling may be applied to any
input matrix/vector, and the transpose/conjugate transpose operation may be
applied to any input matrix.

4 Supported Sparse Matrix Storage Formats


Over the decades, a set of commonly used sparse matrix formats has emerged,
providing a versatile set of data structures that allow for productivity and per-
formance. The formats we identified as most common are Compressed Sparse
Row (CSR)1 , Compressed Sparse Column (CSC), and Coordinate (COO). These
formats are well-supported in existing libraries and widely used for storage and
exchange. We will consider these as input formats for the standard, and each
of these formats will be supported through a lightweight, non-owning view class
defined by the Sparse BLAS standard. We will also support dense vectors and
matrices, including row and column major versions, using C++23’s mdspan as
a standard view interface for dense vector and matrix formats. The proposed
set of user inputs necessary for defining a matrix stored in these formats as
well as constructing a new matrix type are given below. It is noteworthy that
the API can be extended to accommodate non-standard formats by introducing
additional views, removing the need to otherwise modify program logic.
1 Also known as Compressed Row Storage (CRS)

Several additional storage formats have been proposed in the literature
throughout the years [40], outlining trade-offs between performance, memory
footprint, and creation overhead. These formats can be considered as optimiza-
tion targets, rather than input/exchange formats. Possible future extension
formats include CSR(4), CSC(4), BSR / BSC, variable BSR / BSC, Ellpack
(including variants ESB/SELL-P/SELL-C-σ), and diagonal/tridiagonal/penta-
diagonal formats. We will not go into specific details about these here; for now,
these are considered to be optional extension formats.
Some applications require the structure of sparse matrices only (i.e., the
pattern of non-zero elements) and not the values themselves. Other applications
may use the same value in all nonzero locations; we may call these iso-valued
matrices. We do not provide a specific type for structure-only matrices but
instead define iso-valued matrices. We anticipate iso-valued matrix views being
created via CSR, CSC, or COO matrix views whose value arrays are replaced
by a symbolic “iso array” whose values are all the same.
Table 2 outlines the input parameters required to construct a CSR matrix.
These input parameters are defined by three important types: scalar_type,
which is the type of the scalar elements stored in the sparse matrix, index_type,
which is the type used to store explicit indices, and offset_type, which is the
type used to store offsets into other arrays. The index_type limits the maximum
dimensions of a matrix that can be stored, while offset_type limits the maximum
number of nonzeros that can be stored. This distinction is useful, for example,
when large matrices require int64_t for rowptr elements in order to avoid
integer overflow during calculations involving rowptr. However, using int64_t
for the large colindx array would be unnecessary unless the matrix dimensions are
larger than 2^32, and could result in lower performance than using a smaller
integer type.

Type Name Description


index_type nrows Number of rows of the matrix
index_type ncols Number of columns of the matrix
offset_type nnz Structural number of non-zeros
offset_type array rowptr Row pointer array (length: nrows+1)
index_type array colindx Column indices array (length: nnz)
scalar_type array values Structural values (length: nnz)
base_type index_base Base indexing

Table 2: CSR input parameters
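To make the role of these types concrete, the sketch below pairs 32-bit column
indices with 64-bit row offsets. Note that the index and offset template parameters
on csr_view, as well as the brace-initialized shape argument, are assumptions for
illustration only; the listings in this document spell out just the scalar type
(csr_view<float>).

#include <cstdint>
#include <vector>

using namespace sparseblas;

// index_type = int32_t: the matrix dimensions fit comfortably into 32 bits,
// keeping the large colindx array compact.
// offset_type = int64_t: rowptr entries count nonzeros and may exceed the
// 32-bit range for very large matrices, so a wider type is used.
std::size_t nrows = 100'000, ncols = 100'000, nnz = 5'000'000;

std::vector<std::int64_t> rowptr(nrows + 1);   // offset_type array
std::vector<std::int32_t> colindx(nnz);        // index_type array
std::vector<float>        values(nnz);         // scalar_type array

// Hypothetical template parameters <scalar_type, index_type, offset_type>.
csr_view<float, std::int32_t, std::int64_t>
    A(values.data(), rowptr.data(), colindx.data(), {nrows, ncols}, nnz);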

The parameter nnz might easily be deduced from the input arrays,

    nnz = rowptr[nrows] - rowptr[0],

but it can be convenient to store it explicitly in specific scenarios. Therefore,
the parameter nnz is part of the sparse data structures.
Parameters for CSC and COO are given in Tables 3 and 4, respectively.
Type Name Description

index_type nrows Number of rows of the matrix
index_type ncols Number of columns of the matrix
offset_type nnz Structural number of non-zeros
offset_type array colptr Column pointer array (length: ncols+1)
index_type array rowindx Row index array (length: nnz)
scalar_type array values Structural values (length: nnz)
base_type index_base Base indexing

Table 3: CSC input parameters

Type Name Description

index_type nrows Number of rows of the matrix
index_type ncols Number of columns of the matrix
offset_type nnz Structural number of non-zeros
index_type array rowindx Row index array (length: nnz)
index_type array colindx Column index array (length: nnz)
scalar_type array values Structural values (length: nnz)
base_type index_base Base indexing

Table 4: COO input parameters

Many sparse formats can be generalized to a block version, leading to more
suitable data structures when dealing with dense block structures typical of
matrices found in many simulation codes and obtained with the finite element
method (FEM) or the discontinuous Galerkin method (DG). These data struc-
tures reduce the memory footprint and provide suitable storage for using dense
linear algebra. For instance, the BSR (or BCSR, Block Compressed Sparse
Row) format extends the CSR format by using square block-dense matrices
rather than scalars for the values. Information describing the characteristics
of its blocks, such as the dimension bsize and the internal storage bstorage
(row-oriented or column-oriented layout), must be included. The array values
stores non-zero blocks contiguously.
While we acknowledge the widespread use of block formats, we exclude
them in this first version of an API design proposal. This is motivated by the
idea of keeping this document within reasonable length and complexity.
We anticipate using std::mdspan, introduced in C++23, to support dense
matrices in the same style as the recently standardized std::linalg C++
BLAS interface [34]. Users can create an mdspan view with a data handle
(a pointer is the most common example, although different data handles may
appear as defined by an accessor policy) along with matrix extents, represent-
ing the matrix’s dimensions. Once an mdspan view has been created, the user
can pass in a single argument to represent a dense matrix, instead of passing
in multiple arguments for the pointer, dimensions, and stride. std::mdspan
supports a large degree of compile-time customization, including compile-time
constant extents, customization of the index type of the extents, as well as custom
layout and accessor policies. Row- and column-major matrices are supported by
standard-defined layout policies (std::layout_right and std::layout_left).
We anticipate vendors restricting users to only a subset of the possible mdspans,
such as mdspans with standard-defined layout policies, default accessors, and an
order of one or two. These mdspans could be directly passed into most vendor-
provided implementations that expect a traditional dense matrix represented
by a pointer, dimensions, stride, and row or column major layout. Unsupported
mdspans would trigger a clear compile error message in a high-quality imple-
mentation.
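As a brief sketch (standard C++23, independent of the Sparse BLAS API), the two
standard-defined layouts can be requested as follows:

#include <mdspan>
#include <vector>

std::vector<double> xdata(100 * 50), ydata(100 * 50);

// Row-major 100x50 view; layout_right is the default layout policy.
auto X = std::mdspan(xdata.data(), 100, 50);

// Column-major view of the same extents, selected via an explicit layout policy.
std::mdspan<double, std::dextents<std::size_t, 2>, std::layout_left>
    Y(ydata.data(), 100, 50);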

Type Name Description

- T Type of values stored in mdspan, equivalent to scalar_type.
- Extents Type specifying the number of dimensions, their types, and
    whether they are known at compile time or runtime.
- LayoutPolicy A type defining how matrix elements are laid out
    (e.g. std::layout_left, std::layout_right).
- AccessorPolicy A type defining how matrix elements should be accessed.
data_handle_type p Handle to dense matrix values.
std::integral type exts... The matrix's extents, or dimensions.

Table 5: Key template parameters and constructor arguments for building an
std::mdspan. Template parameters, which provide types that customize the
mdspan, are shown with the "Type" field empty, while runtime arguments have
their type listed.

5 API Design
The sparse BLAS interface design has to consider a wide range of aspects,
ranging from opaqueness, memory allocation for sparse outputs, and data
ownership to multi-stage APIs and object- and function-specific execution
policies and hints for optimization. In the following sections, we address these
intertwined aspects individually.

5.1 Ownership and Opaqueness


There exist different strategies for handling the ownership of data in sparse
linear algebra operations.
• A fully owning, less transparent API based on opaque objects.
• A non-owning, fully transparent API based on the C++ views concept.
There are multiple reasons for having these two types of API: 1) transparent,
non-owning matrix objects can prevent some vendor optimizations (e.g., using
a better-suited matrix format), but on the other hand, owning matrix objects
come with an initial usage cost and require copies; 2) non-owning types require
multiple phases for their instantiation and allocation that must be explicitly
exposed to the user, which can be especially complex and require several steps
for more complex operations (e.g., Sparse-Matrix times Sparse-Matrix product
– SpGEMM); 3) with transparent, non-owning objects, the internal data can
always be inspected and accessed at little cost, whereas accessors are required
for opaque objects and can come with an extra cost.
Because each of these aspects can be important in different situations and
to different users, we believe that the two options are both needed and
complementary, and that providing these two types of API will satisfy most needs.
In this first version of an interface design, we focus on the second option, the
non-owning API for functionality. However, as we will detail next, we extend
this approach to allow for a moderate level of optimization. In this non-owning
API, the user owns the data and light-weight views are used to expose the
functionality to the library operations. In particular, the light-weight views
expose the raw pointers and size information of, for example, a CSR matrix,
dense matrix, or a COO matrix.
As discussed, this approach limits the optimization potential for developers.
In particular, the light-weight views do not allow storing additional information
that can be exploited by algorithms operating on the sparse matrices, like
symmetry, upper or lower triangular structure, matrix transpose, etc. Without
breaking the concept of light-weight views, we enable optimization and the storing
of additional information by wrapping the view into a matrix handle that
contains both the view of the user-owned raw data and a structure that can hold
additional data for optimization. Importantly, all the optimization data is
owned by the library, not by the user; it is stored in library-allocated memory
and is generally opaque to the user.
Some users want to keep control of all available memory, in particular to control
how data structures are allocated. For this reason, we propose a design where
the user can pass an allocator to the library that is then used by the library
to allocate additional data in the matrix handle. If no allocator is provided,
the library is permitted to use whichever allocator it deems appropriate.
In Listing 1 we give an example of how a CSR matrix structure is passed as
a light-weight view to a library and a matrix handle is wrapped around it to
allow for storing additional information.

using namespace sparseblas;

csr_view<float> A(values, rowptr, colind, shape, nnz);

// matrix_handle is opaque and may contain
// vendor optimization details
auto A_handle = matrix_handle(A, allocator);

Listing 1: Wrapping a CSR matrix structure composed of values, rowptr,
colind, shape, nnz into a matrix handle with a user-specified allocator.

We note that the API is expected to accept both plain views and matrix handles.
If plain views are passed in, the library has no optimization structure at hand
and cannot store any additional matrix information that persists beyond the
operation.
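The allocator concept itself is not fixed by this document; purely as an
illustration, a user could hand the library an allocator backed by a std::pmr
memory resource so that all handle-internal optimization data comes out of a
user-controlled arena (the pmr choice is an assumption, not a prescription):

#include <cstddef>
#include <memory_resource>

using namespace sparseblas;

// User-controlled arena; any optimization data the library attaches to the
// handle is allocated from here (assumption: the proposed matrix_handle
// accepts such an allocator, cf. Listing 1).
std::pmr::monotonic_buffer_resource arena(1 << 20);
std::pmr::polymorphic_allocator<std::byte> allocator(&arena);

csr_view<float> A(values, rowptr, colind, shape, nnz);
auto A_handle = matrix_handle(A, allocator);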

5.2 Horizontal and Vertical Interoperability
The interface and functionality standardization we propose must allow for hor-
izontal and vertical interoperability. Vertical interoperability includes both up-
ward and downward interoperability. Upward interoperability refers to usability
and productivity inside high-level algorithms like iterative solvers, precondition-
ers, and sparse neural networks. Only interfaces and functionality decisions al-
lowing for high productivity for high-level algorithms and applications justify
the design of a sparse BLAS standard. We assess upward interoperability by us-
ing the sparse BLAS functionality inside mini apps. Downward interoperability
refers to the possibility of wrapping already existing sparse BLAS functionality
developed by hardware vendors into the sparse BLAS interface. The intention
for the downward interoperability is not to create yet another layer of inter-
faces, but to ensure that design decisions taken by the vendors in the kernel
implementations do not hinder the adoption of a new interface in future ker-
nel development. Horizontal interoperability refers to compatibility with dense
BLAS functionality and the possibility to use the functionality in other pro-
gramming languages. The latter takes on a new significance in the context of
Sparse BLAS, as we aim to focus on building a high-level C++ API, as opposed
to the Fortran-based interface used in the dense BLAS to allow for ABI compat-
ibility. Instead, our goal is to ensure that basic functionality is accessible from
a wide range of languages, including Fortran, C, Julia, Python, and others. The
strategy we envision is to create C-bindings for the most central functionality.
Most vendors implement linear algebra functionality using the C++ program-
ming language. Likewise, most already implemented sparse BLAS functional-
ity, e.g. in NVIDIA’s, Arm’s, Intel’s, and AMD’s vendor libraries, are often
designed in C++. Also, for dense BLAS, the C++ standard officially adopted the
std::linalg API [34]. We adopt this strategy and build upon an interface in
C++. At the same time, we acknowledge that many users and application devel-
opers depend on Fortran, C, or Python interfaces as their application codes are
using the respective language. Designing an interface to sparse BLAS function-
ality in C or Fortran without the use of static and dynamic polymorphism is
almost impossible due to the explosion in variants, which would result in a pro-
hibitively long parameter list and a plethora of function variants. In response,
we will select the most common use cases and provide bindings in C, Fortran,
and Python to a subset of the sparse BLAS functionality.
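As a sketch of what such a C binding could look like, the wrapper below fixes
the value and index types and forwards to the C++ interface; all names and the
exact signature are hypothetical and serve only to illustrate the idea.

#include <cstdint>
#include <mdspan>

// Hypothetical C binding for double-precision CSR SpMV, y = A * x.
// The sparseblas calls mirror Listing 5; a default execution policy and a
// default-constructed state (library-chosen allocator) are assumed for brevity.
extern "C" int sparseblas_dcsrmv(std::int64_t nrows, std::int64_t ncols,
                                 std::int64_t nnz,
                                 const std::int64_t* rowptr,
                                 const std::int64_t* colindx,
                                 const double* values,
                                 const double* x, double* y) {
  using namespace sparseblas;
  csr_view<double> A(values, rowptr, colindx, {nrows, ncols}, nnz);
  auto xv = std::mdspan(x, ncols);
  auto yv = std::mdspan(y, nrows);
  multiply_state_t state{};     // assumed default-constructible state
  multiply(state, A, xv, yv);   // assumed default-policy overload, cf. Listing 4
  return 0;
}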

5.3 Review of existing API designs for SpMV


While the community has not followed a synchronized approach towards an
API for sparse linear algebra functionality, several groups from academia and
industry have developed and deployed sparse linear algebra operations along
with a user-facing interface. While the interfaces differ, they are all designed
to accommodate functionality- and hardware-specific needs. Most of the cur-
rently supported APIs follow the transparent, non-data-owning design. Existing
APIs are different but share certain design patterns, and we consider them as
blueprints guiding the design of the new API standard for sparse linear algebra
functionality. Thus, in the following, we provide a comparison of existing C
and C++ library APIs for sparse matrix-vector (SpMV) operations. Some APIs
are synchronous and blocking, while others extend to asynchronous non-blocking
execution models.

SpMV API comparisons between existing libraries:
IEEE754 Double Precision Sparse Matrix-Vector Multiplication (SpMV)

y = α · op(A) · x + β · y

(1) Arm Performance Libraries C API (has an optional hint/optimize stage)

info = armpl_spmv_exec_d(ARMPL_SPARSE_OPERATION_NOTRANS, alpha,
                         armpl_mat, x, beta, y);

(2) oneMKL C API (has an optional hint/optimize stage)

status = mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE,
                         alpha, csrA, descrA, x, beta, y);

(3) AOCL-sparse C API (has an optional hint/optimize stage)

status = aoclsparse_dmv(aoclsparse_operation_none,
                        p_alpha, A, descr_a, y, p_beta, x);

(4) GraphBLAS SuiteSparse C API (implements y⟨mask⟩ = accum(y, A · x)
in some semiring)

// y = alpha*y
GrB_apply(y, GrB_NULL, GrB_NULL, GrB_TIMES_FP64, alpha, y, GrB_NULL);
// y += beta*A*x
GrB_mxm(y, GrB_NULL, GrB_PLUS_FP64, GrB_PLUS_TIMES_SEMIRING_FP64,
        A, x, GrB_NULL);

(5) ALP/GraphBLAS C++ (minimum number of containers, optimised by
nonblocking execution)

grb::Semiring< grb::operators::add<double>, grb::operators::mul<double>,
               grb::identities::zero, grb::identities::one > plusTimes;
grb::Monoid< grb::operators::mul<double>, grb::identities::one > mul;
status = grb::foldr( beta / alpha, y, mul );
status = status ? status : grb::mxv< op >( y, A, x, plusTimes );
status = status ? status : grb::foldr( alpha, y, mul );

(6) cuSPARSE CUDA C, generic API (has multiple required stages)

status = cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                      p_alpha, matA, vecX, p_beta, vecY,
                      CUDA_R_64F, CUSPARSE_SPMV_ALG_DEFAULT,
                      externalBuffer);

(7) rocSPARSE ROCm C, generic API (has multiple required stages)

status = rocsparse_spmv(handle, rocsparse_operation_none,
                        p_alpha, matA, vecX, p_beta, vecY,
                        rocsparse_datatype_f64_r,
                        rocsparse_spmv_alg_csr_adaptive,
                        rocsparse_spmv_stage_compute,
                        p_buffer_size, temp_buffer);

(8) hipSPARSE ROCm/CUDA C API

status = hipsparseDcsrmv(handle,
                         HIPSPARSE_OPERATION_NON_TRANSPOSE,
                         m, n, nnz, p_alpha, descrA,
                         valA, rowptrA, colindA, x, p_beta, y);

(9) oneMKL SYCL C++ API with an out-of-order queue (has an optional
optimize stage)

ev_gemv = oneapi::mkl::sparse::gemv(
    queue, oneapi::mkl::transpose::nontrans,
    alpha, csrA, x, beta, y, {ev_opt});

(10) oneMKL SYCL C++ API with an in-order queue (has an optional optimize
stage)

oneapi::mkl::sparse::gemv(
    queue, oneapi::mkl::transpose::nontrans,
    alpha, csrA, x, beta, y);

(11) Kokkos Kernels C++ API (can set algorithm in controls argument)

KokkosSparse::spmv(controls, KokkosSparse::NoTranspose,
                   alpha, A, x, beta, y);

(12) Ginkgo C++ API

// cuda/rocm stream or sycl::queue is attached to A previously
// (through an "Executor" object)
// matrix::Csr *A;              // sparse matrix
// matrix::Dense *y, *x;        // vectors or tall-and-skinny matrices
// matrix::Dense *alpha, *beta; // scalars of dimension 1x1
A->apply(alpha, x, beta, y);

Note that the handle in the rocSPARSE, cuSPARSE, and hipSPARSE APIs is
an object that should be created with a call to XYZsparseCreate(p_handle)
before any other API is invoked, at which point it globally initializes the
respective Sparse BLAS library and creates the handle on the device context.
It also houses the appropriate cudaStream_t, rocStream_t, or hipStream_t
stream to be used in subsequent calls. The CPU APIs have no such wrapper,
and the SYCL APIs take a sycl::queue.

5.4 API Design Considerations
From a high-level point of view, there exist two classes of sparse BLAS opera-
tions:
1. Operations where the sparsity pattern of the output object is known a
priori, e.g. the multiplication of a sparse matrix with a vector or a dense
matrix, or the sampled multiplication of two dense matrices, or the scaling
of a matrix. These operations can be realized with a single kernel operating
on pre-allocated output data.
2. Operations where the sparsity pattern of the output object is not known
a priori, e.g. the multiplication of two sparse matrices, a filtering oper-
ation removing values below a pre-defined threshold, or the addition of
two sparse matrices (with – in general – different sparsity patterns). For
these operations, it is impossible to execute them using a single kernel as
the memory requirement for the output is not known a priori, and the
memory allocation can only happen after the number of nonzero elements
has been computed. For this reason, operations of this class are typically
implemented using the pattern compute, allocate, fill [26].
For both single-stage and multi-stage operations that produce sparse output,
the sparsity structure of the output is based on the symbolic operation, not on
the numeric operations. In other words, a numeric calculation resulting in a
numeric zero does not result in omitting this entry from the output sparsity
pattern.
While there exist strong arguments for unifying the interface for the differ-
ent functionality to always require the same number of stages (conceptually,
a matrix-matrix multiplication is the same operation, agnostic to the non-zero
structure and the storage format of the matrices), we decided to sacrifice consis-
tency in favor of reduced complexity. In particular, we use a one-stage API for
functionality that produces output of known sparsity pattern, and a multi-stage
API for functionality that produces output of unknown sparsity pattern.
Functionality operating on sparse objects differs also in another aspect from
functionality operating on dense objects: depending on the distribution of the
nonzero entries, different parameter choices may be optimal for fast execution
– up to the level where different algorithms may be chosen depending on the
nonzero distribution. To account for this, we consider an additional inspect
phase that is optional to all functionality (single-stage or multi-stage) and al-
lows for additional optimization with regard to the selection of the algorithm
and parameters for processing the operation and for storing some additional
information in the matrix handle. The optimization may in particular pay off
in case of repeated function invocation, e.g., the repeated invocation of a sparse
matrix vector product inside an iterative solver loop. If the user feeds plain
views into the operation, the inspect phase can invoke operation-specific
optimization but cannot access additional information in the optimization structure
of a matrix handle. If the optional inspect phase is not invoked prior to the op-
eration, a default algorithm and parameter selection is used. This concept of
an optional inspect phase is in line with the current design of vendor interfaces,
see Section 5.3.
In the next sections, we will discuss the single-stage API for functional-
ity producing output with a-priori known sparsity pattern and the multi-stage
API for functionality producing output with a-priori unknown sparsity pattern.
Note that all operations take policy and state objects as inputs. The policy
object is opaque to the user. It may contain any vendor-specific execution poli-
cies, similar to a stream or queue. The state object is operation-specific and
can be used to store any operation-specific data, e.g. a dependency graph in a
triangular solve or data in an operation-specific library-internal representation.
For the multi-stage functionality, it may also contain information about which
stages have already been completed to avoid re-calculation.

5.5 Single-Stage API for Operations with known Output


Operations with known output can be realized using a single routine. Function-
ality listed in Table 1 falling into this category includes: scaling, transpose and
conjugate transpose, norms, sparse matrix vector multiplication, sparse matrix
dense matrix multiplication, triangular solve, general and sampled dense matrix
multiplication. As previously mentioned, we allow for an optional inspect phase
for inspecting the sparse objects and passing information to the routines execut-
ing the operation. We here present the interface for some of the functionality,
noting that the functionality in general accepts both plain views on the objects
and matrix handles, with the latter allowing for additional optimization.

using namespace sparseblas;

csr_view<float> A(values, rowptr, colind, shape, nnz);
// auto A = matrix_handle(csr_view<float>(...), allocator);

// scale() overwrites the values of A
scale_state_t state(allocator);
scale(policy, state, 2.3, A);

Listing 2: Scaling, A := αA

using namespace sparseblas;

csr_view<float> A(values, rowptr, colind, shape, nnz);

// matrix_handle is opaque and may contain
// vendor optimization details
auto A_handle = matrix_handle(A, allocator);

// A is const; function returns Inf norm
matrix_inf_norm_state_t state(allocator);
auto inf_nrm = matrix_inf_norm(policy, state, A);

Listing 3: Inf Matrix Norm, α = ‖A‖_inf.

using namespace sparseblas;

csr_view<float> A(values, rowptr, colind, shape, nnz);
// auto A = matrix_handle(csr_view<float>(...), allocator);

matrix_frob_norm_state_t state(allocator);
// A is const; function returns Frobenius norm
auto frob_nrm = matrix_frob_norm(policy, state, A);
// or - without passing a policy, resulting in a default:
auto frob_nrm = matrix_frob_norm(state, A);

Listing 4: Frobenius Matrix Norm, α = ‖A‖_F

using namespace spblas;

csr_view<float> A(values, rowptr, colind, shape, nnz);

auto x = std::mdspan(raw_x.data(), 100);
auto y = std::mdspan(raw_y.data(), 100);

// matrix_handle is opaque and may contain
// vendor optimization details
auto A_handle = matrix_handle(A, allocator);

// create a state for the optimization of the routine
multiply_state_t state(allocator);

// optional inspect phase can add information
// to the matrix handle
multiply_inspect(policy, state, A_handle, x, y);

// actual multiplication y = A * x
multiply(policy, state, A_handle, x, y);

// alternatively, the multiplication can use the plain views
// limiting the optimization opportunities
multiply(policy, state, A, x, y);

Listing 5: Sparse matrix vector product y = A · x.
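When the product is applied repeatedly, e.g. inside an iterative solver, the
optional inspect phase is invoked once and only the multiply call remains in the
loop, so the optimization cost discussed in Section 5.4 is amortized. A sketch
reusing the names of Listing 5 follows; max_iters and the solver updates are
placeholders.

using namespace spblas;

csr_view<float> A(values, rowptr, colind, shape, nnz);
auto A_handle = matrix_handle(A, allocator);

auto x = std::mdspan(raw_x.data(), 100);
auto y = std::mdspan(raw_y.data(), 100);

multiply_state_t state(allocator);

// inspect once; analysis results are stored in the handle/state
multiply_inspect(policy, state, A_handle, x, y);

for (int iter = 0; iter < max_iters; ++iter) {
  // repeated multiplication y = A * x reuses the stored optimization data
  multiply(policy, state, A_handle, x, y);
  // ... remaining solver work: update x, check convergence ...
}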

using namespace spblas;

csr_view<float> A(values, rowptr, colind, shape, nnz);
float alpha = ..., beta = ...;

auto X = std::mdspan(raw_x.data(), 100, 100);
auto Y = std::mdspan(raw_y.data(), 100, 100);

// matrix_handle is opaque and may contain
// vendor optimization details
auto A_handle = matrix_handle(A, allocator);

// create a state for the optimization of the routine
multiply_state_t state(allocator);

// optional inspect phase can add information
// to the matrix handle
multiply_inspect(policy, state, scaled(alpha, A_handle), X, scaled(beta, Y), Y);

// actual multiplication Y = alpha * A * X + beta * Y
multiply(policy, state, scaled(alpha, A_handle), X, scaled(beta, Y), Y);

// alternatively, the multiplication can use the plain views
// limiting the optimization opportunities
multiply(policy, state, scaled(alpha, A), X, scaled(beta, Y), Y);

Listing 6: Sparse matrix dense matrix product, Y = α · A · X + β · Y, with X
and Y being dense matrices.

using namespace spblas;

// Matrix T data is already in lower or upper triangular form
// physically in the matrix arrays for the triangular solve
csr_view<float> T(values, rowptr, colind, shape, nnz);

auto x = std::mdspan(raw_x.data(), 100);
auto b = std::mdspan(raw_y.data(), 100);

// matrix_handle is opaque and may contain
// vendor optimization details
auto T_handle = matrix_handle(T, allocator);

// create a state for the optimization of the routine
triangular_solve_state_t state(allocator);

// optional inspect phase can add information
// to the matrix handle
triangular_solve_inspect(policy, state, T_handle, b, x);

// actual triangular solver x = T \ b
triangular_solve(policy, state, T_handle, b, x);

// alternatively, the triangular solver can use the plain views
// limiting the optimization opportunities
triangular_solve(policy, state, T, b, x);

Listing 7: Triangular solve: solve T · x = b for x.

using namespace sparseblas;

auto X = std::mdspan(raw_x.data(), m, k);
auto Y = std::mdspan(raw_b.data(), k, n);
// sample mask is the pattern of CSR matrix output
csr_view<float> C(values, rowptr, colind, shape, nnz);

sampled_multiply_state_t state(allocator);
sampled_multiply_inspect(policy, state, X, Y, C);

// C = C.*( X * Y ): vendor optimizations in state for reuse
sampled_multiply(policy, state, X, Y, C);

Listing 8: Sampled dense dense matrix multiplication (SDDMM), C⟨mask⟩ =
X · Y, where the C sparsity pattern encodes the mask.

5.6 Multi-Stage API for Operations with unknown Output
As mentioned above, a fundamental difference to dense BLAS functionality is
that for basic operations on sparse objects, the sparsity pattern (and therewith
the sparse data structure) is not always known beforehand. An example is the
multiplication of two sparse matrices A, B, resulting in another sparse matrix
C = A · B with a pattern that typically is unknown prior to the computation.
In consequence, functionality producing sparse output with the sparsity pattern
unknown prior to the computation typically splits into three phases (or four if
including the optional inspect phase):
1. inspect phase – (optional) prepare any potential optimizations for subse-
quent phases, or may do nothing
2. compute phase – computing the size of the sparse output structure, which
typically requires significant work
3. allocation phase – allocating the memory for the sparse output data struc-
ture and placing it in the output matrix object to be filled
4. fill phase – complete execution of the operation and filling the output
structure with the result.
An important design question for functionality producing sparse output is
whether the memory allocation is done by the user or the library. While there
exist strong arguments on both sides, and passing a memory allocator to the
library poses an elegant way to resolve memory location and ownership, we ini-
tially propose an interface that requires the library user to allocate the memory
for the sparse output data after the library provides the memory requirement
information. While we, in the following, focus on the multiplication of sparse matrices
(typically called SpGEMM in literature), the considerations and design aspects
hold also for other basic functionalities that are producing output with sparsity
pattern unknown prior to the computation. In Table 1 we list several: sparse
matrix – sparse matrix multiplication, sparse matrix format conversion, filter
operations removing elements below a pre-defined threshold, and the addition
of sparse matrices with generally differing sparsity patterns.
Acknowledging the previously motivated inspect phase, the multiplication of
two sparse matrices

C = α · op(A) · op(B) + β · D
can be realized using the following sequence of steps (using APIs that will
be defined subsequently):

using namespace spblas;

// csr_view<float> A,B,D filled elsewhere
// csr_view<float> C(c_nrows, c_ncols);

sparse_multiply_state_t state(allocator);
sparse_multiply_inspect(policy, state, A, B, C, D); // optional
sparse_multiply_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
// the user allocates the arrays for C
sparse_multiply_fill(policy, state, A, B, C, D);
// C structure and values are now ready to be used

We note that we change from the function name multiply (which we used for
the multiplication of sparse with dense matrices or vectors) to sparse_multiply
to indicate that this operation generates sparse output and that a fundamentally
different multi-stage API is needed. In consequence, it is not possible to just
call multiply(A,B,C) on sparse objects A,B,C.
We also note that this interface gives the software developers a significant
level of flexibility and room for optimization.
A standard way is to place the structural analysis in the multiply_compute
stage, then allocate the output data structure and fill it in the multiply_fill
stage. Internal buffers needed in the structural analysis can be allocated and
de-allocated by the library in the multiply_compute routine.
The only step that critically has to happen in the multiply_compute routine
is the computation of the number of nonzero entries in the output, so that the
output data structure can be allocated. The multiply_fill routine is required
to fill the user-provided arrays with the output data so that it is usable after
the multiply_fill stage. These are the only requirements on the
multiply_compute and multiply_fill stages, and thus the developers are left
with a lot of freedom to realize the functionality. For example, developers can
decide to already fill the CSR rowptr in the multiply_compute routine. Or they
can pack even more into this first stage: an approach allowing for more
library-internal optimization may be to pack both the structural analysis and the
multiplication into the multiply_compute routine, using an internal output data
structure that is part of the state object. The multiply_fill routine then only
extracts the result into the output data structure allocated by the user. Any
library-internal output data structures live in the state object and would be
deleted when the state is destroyed. Such an implementation does not break the
interface design.
A critical question is how the library allocates these internal buffers, via a
library-preferred allocator or an allocator passed to the library from the user.
This boils down to whether the user has buffer management on the application
level, i.e., manages all the device memory in the application. At the same time,
requiring the user to always provide an allocator to the library can become
cumbersome. In the current design proposal, we allow the user to pass in a
memory allocator (for the memory space where the computations are executed,
i.e. the “compute device”) that is then used by the library for the allocation of
buffers. If no allocator is passed, the library is free to use its own allocator.
Some scenarios additionally benefit from splitting the multiplication of sparse
matrices and other routines operating on sparse data structures into a symbolic
and a numeric stage. This, for example, can allow for additional optimization
in the repeated multiplication of matrices changing only the numeric values
but preserving the sparsity pattern. Hence, we propose a variant of the above
interface splitting the routines multiply_compute and multiply_fill into two:

using namespace spblas;

// csr_view<float> A,B,D filled elsewhere
// csr_view<float> C(c_nrows, c_ncols);

sparse_multiply_state_t state(allocator);
multiply_inspect(policy, state, A, B, C, D); // optional
multiply_symbolic_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
// allocate C arrays and put in C (possibly both structure
// and values or just structure arrays at this point)
multiply_symbolic_fill(policy, state, A, B, C, D);
// C structure is now able to be used

// allocate, if necessary, values arrays and place them in C
multiply_numeric_compute(policy, state, A, B, C, D);
multiply_numeric_fill(policy, state, A, B, C, D);
// C structure and values are now able to be used

Because we do not specify at which phase the output matrix is prepared
internally, the compute phase (e.g. multiply_numeric_compute) and the fill
phase (e.g. multiply_numeric_fill) should be regarded as a pair, and for
repeated execution, both phases should be called. It is up to implementations
to detect on repeated calls which internal data in the state object
is reusable and what must be recomputed.
We now provide some examples of how this API can be used.
1. Single use:

sparse_multiply_state_t state(allocator);
sparse_multiply_inspect(policy, state, A, B, C, D); // optional
sparse_multiply_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
// allocate C arrays
sparse_multiply_fill(policy, state, A, B, C, D);
// C structure is now able to be used
2. Repeated invocation of the functionality with numerically differing matri-
ces:
sparse_multiply_state_t state(allocator);
sparse_multiply_inspect(policy, state, A, B, C, D); // optional
for (int i = 0; i < n; i++) {
  // may do nothing if nnz is already computed
  // and information is stored in the state
  sparse_multiply_compute(policy, state, A, B, C, D);
  index_t nnz = state.get_result_nnz();
  if (i == 0) {
    // allocate C arrays
  }
  sparse_multiply_fill(policy, state, A, B, C, D);
  // C structure is now able to be used
}
3. Using the split into numeric and symbolic phase for repeated invocation
of the functionality with numerically differing matrices:

sparse_multiply_state_t state(allocator);
sparse_multiply_inspect(policy, state, A, B, C, D); // optional
sparse_multiply_symbolic_compute(policy, state, A, B, C, D);
index_t nnz = state.get_result_nnz();
// allocate C arrays
sparse_multiply_symbolic_fill(policy, state, A, B, C, D);
for (int i = 0; i < n; i++) {
  // only numeric computations inside the loop
  sparse_multiply_numeric_compute(policy, state, A, B, C, D);
  sparse_multiply_numeric_fill(policy, state, A, B, C, D);
  // C structure and values are now able to be used
}
Up to now, we have provided motivation and details for the design of the API
for the sparse matrix–sparse matrix multiply operation, but the same two-stage
interface also applies to other functionality whose nonzero structure is unknown
a priori, as Listings 9–12 illustrate.

using namespace sparseblas;
auto A = std::mdspan(raw_x.data(), m, k);
csr_view<float> B(m, n);

convert_state_t state(allocator);
convert_inspect(policy, state, A, B);
convert_compute(policy, state, A, B);
index_t nnz = state.get_result_nnz();
// allocate the arrays for B and place them in B
convert_fill(policy, state, A, B);

Listing 9: Sparse Matrix Format Conversion, B = sparse(A). Note that this will
not remove explicit zeros, except for conversion from dense. The convert_*
functions might additionally accept a predicate function to remove some entries
during the conversion, which can give a performance benefit by avoiding creating
the matrix twice.
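A possible shape of such a predicate-taking conversion, following the conventions of Listing 9 (the extra predicate parameter is a hypothetical extension, not part of the current proposal):

using namespace sparseblas;

auto A = std::mdspan(raw_x.data(), m, k);
csr_view<float> B(m, k);

// drop explicit zeros (or any other unwanted entries) during the conversion itself
auto keep = [](auto i, auto j, auto v) { return v != 0.0f; };

convert_state_t state(allocator);
convert_compute(policy, state, A, B, keep);
index_t nnz = state.get_result_nnz();
// allocate B's arrays for nnz entries and place them in B
convert_fill(policy, state, A, B, keep);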

using namespace sparseblas;


csr_view<float> A(values, rowptr, colind, shape, nnz);
// auto A = matrix_handle(csr_view<float>(...), allocator);
// ... likewise for B
csr_view<float> C(m, n);
// auto C = matrix_handle(csr_view<float>(...), allocator);

multiply_elementwise_state_t state(allocator);
multiply_elementwise_inspect(policy, state, A, B, C); // optional
multiply_elementwise_compute(policy, state, A, B, C);
index_t nnz = state.get_result_nnz();
// allocate C arrays and put in C
multiply_elementwise_fill(policy, state, A, B, C);
// C structure and values are now able to be used

Listing 10: Element-wise Multiplication, C = A .* B.

using namespace sparseblas;
csr_view<float> A(values, rowptr, colind, shape, nnz);
// auto A = matrix_handle(csr_view<float>(...), allocator);
// ... likewise for B
csr_view<float> C(m, n);
// auto C = matrix_handle(csr_view<float>(...), allocator);

add_state_t state(allocator);
add_inspect(policy, state, A, B, C); // optional
add_compute(policy, state, A, B, C);
index_t nnz = state.get_result_nnz();
// allocate C arrays and put in C
add_fill(policy, state, A, B, C);
// C structure and values are now able to be used

Listing 11: Sparse Matrix – Sparse Matrix Addition, C = A + B.

using namespace sparseblas;

csr_view<float> A_view(values, rowptr, colind, shape, nnz);
csr_view<float> B(m, n);

// predicate selecting entries by value
auto pred = [](auto i, auto j, auto v) {
  return v > 0;
};

// an alternative predicate selecting entries by position:
// auto pred = [](auto i, auto j, auto v) {
//   return (i < 10) && (j < 10);
// };

filter_state_t state(allocator);

// matrix_handle is opaque and may contain vendor optimization details
matrix_handle A(A_view, allocator);
filter_compute(policy, state, A, B, pred);
index_t nnz = state.get_result_nnz();
// the user can allocate the arrays for the output structure
// finally, the output structure can be filled
filter_fill(policy, state, A, B, pred);

Listing 12: Predicate Selection, B = A .* (A > 0).

6 Execution Model
The world of high-performance computing has recently transformed from a
CPU-centric model to a heterogeneous model in which computations must be
performed on CPUs, GPUs, and other accelerators. To support these heterogeneous
environments, computing languages and libraries have evolved from supporting
only synchronous execution models, where a kernel or function is submitted,
enqueued, and executed immediately, to asynchronous execution models, where a
kernel is submitted without waiting for its completion.
This separation between kernel submission and kernel execution has proved
beneficial for enabling performance on heterogeneous systems by both avoiding
unnecessary synchronization as well as allowing data movement, kernel launch,
and other overheads associated with heterogeneous execution to be hidden.
Standard C++ is transitioning toward a standard model for asynchronous
execution with std::execution (senders and receivers) [19, 42], which serves as
the glue that lets different libraries with asynchronous operations interoperate
with one another. Given the increased performance and ease of use associated
with standard tools for handling asynchrony instead of disparate vendor APIs, we
recognize the importance of using standard, interoperable tools for asynchrony.
In the future, we anticipate adopting the std::execution model, with a collec-
tion of asynchronous methods that extend the standard Sparse BLAS API using
std::execution concepts. However, given the early stage of std::execution
implementations, this full specification will likely have to wait until the ecosys-
tem is more mature.
With the current state of vendor support for asynchronous C++ program-
ming, we anticipate four guidelines at this stage for asynchronous support. First,
any Sparse BLAS API that supports asynchronous execution should be included
in the async namespace [42], to clearly denote expected behavior for users.
For example, spblas::async::multiply(). Second, all asynchronous and syn-
chronous APIs that inspect or execute shall take an execution policy as an
argument and respect it with regard to parallel execution. In addition
to the standard execution policies defined by C++, vendors may define their
own execution policies, which have implementation-defined behavior. Vendors
should also provide APIs that do not require an execution policy. These exe-
cution policy–free APIs may perform execution in parallel so long as they do
not violate the semantic requirements of their respective algorithms. Third,
any implementation-defined objects that control asynchronous behavior (e.g.
CUDA/ROCm streams or SYCL sycl::queue and sycl::event objects) can be
provided to the algorithm through an implementation-defined execution pol-
icy. Fourth, we observe that scaling values in Sparse BLAS operations (e.g. α
and β in sparse matrix-vector multiplication, y = αA · x + βy) may be repre-
sented as either arguments provided on the host or as device-side scalars that
will be dynamically computed by another asynchronous kernel. A high-quality
implementation should support both sources of scaling values.
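As an illustration of these guidelines, the following sketch assumes a SYCL-based implementation with an already constructed sparse matrix view A and dense vectors x and y; the vendor policy factory and the multiply argument order are assumptions, not part of the specification:

#include <sycl/sycl.hpp>

sycl::queue q;  // the queue (or a CUDA/ROCm stream) controlling asynchrony

// hypothetical implementation-defined execution policy wrapping the queue
auto policy = spblas::vendor::policy_on(q);

// synchronous call: y is ready when the call returns
spblas::multiply(policy, A, x, y);

// asynchronous call: submission returns immediately; completion follows the queue
spblas::async::multiply(policy, A, x, y);
q.wait();  // synchronize before reading y

// the scaling values alpha and beta could be passed either as host values or as
// pointers to device-side scalars produced by earlier kernels; a high-quality
// implementation should provide overloads accepting both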
Some libraries provide a mechanism for the user to select which of one or more
internal algorithms should be used in a particular Sparse BLAS operation. The
default behavior is for the library to make such algorithmic choices, perhaps
using information gained from analysis of the inputs in the optional “inspect”
phase. However, in the case that the user would like to select a particular
implementation-defined algorithm, this should be supported by vendors via their
own implementation-defined execution policies, as sketched below.
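Continuing the sketch above, a user-forced algorithm choice could then look as follows; every name under the vendor namespace is an illustrative assumption, since algorithm selection is deliberately left implementation-defined:

// default: the library chooses the algorithm, possibly guided by the inspect phase
auto auto_policy = spblas::vendor::policy_on(q);
spblas::multiply(auto_policy, A, x, y);

// user-forced choice through an implementation-defined policy (illustrative API)
auto merge_policy =
    spblas::vendor::policy_on(q).with_algorithm(spblas::vendor::spmv_alg::merge_based);
spblas::multiply(merge_policy, A, x, y);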

7 Numerical considerations
This section describes the design space impacting the numerical properties of the
Sparse BLAS, outlines the choices to be made, explains how these choices may
trade off performance and “consistency” (to be defined below), and proposes
which options should be required, recommended, or not recommended. The
design space is defined by the following, often intertwined considerations:

1. Which floating point formats are supported as input/output, and which
format is used internally for arithmetic operations? The growing variety
of short (16-bit and 8-bit) formats, block floating point, as well as mixed
precision (e.g. using a higher or lower precision internally) suggests a
design that allows for high flexibility. We describe this design space in
more detail below, but our initial design just proposes standard IEEE 754
32-bit and 64-bit formats, pending user feedback requesting more options.
2. What error bounds should be satisfied in the absence of arithmetic ex-
ceptions? This is again motivated by the growing variety of available
precisions. We outline the possible error bounds below, but our initial
design will again recommend just the standard error bounds associated
with IEEE 754 formats.
3. How should floating-point exceptions (Inf, NaN) be handled? Parts of
this design consideration are covered by the dense BLAS discussions [15].
But other aspects are unique to the Sparse BLAS because of the impact
of sparse data structures. For example, converting an implicit zero in a
sparse matrix to an explicit zero (e.g., as part of a SIMD optimization) may
change how exceptions are propagated, i.e., whether an implicit $A(i, j) = 0$
is multiplied by $x(j) = \mathrm{Inf}$ or not when computing
$y(i) = \sum_{k=1}^{n} A(i, k) \cdot x(k)$.
We discuss different levels of consistency with which exceptions
can be handled. Our initial design does not require any particular way
to handle sparsity, to allow implementations maximum opportunities for
performance optimization.
4. Do we expect bitwise reproducibility of results from run to run? Floating
point addition is not associative, so users cannot expect bitwise identi-
cal results from different implementations on different platforms, or even
from run to run of the same implementation on the same parallel plat-
form. Reproducibility has been identified as a need for various reasons,
such as debugging, and is provided by Intel MKL with CNR (Conditional
Numerical Reproducibility). This is a higher, and more expensive, level
of consistency than just exception handling. So again, our initial design
does not require reproducibility.
5. Do we need test suites? We describe how extensively test code should test
a proposed SparseBLAS implementation, given the design considerations
above.

7.1 Choices of numerical formats


We begin by listing the wide and growing variety of numerical formats that are
available or will be soon. This recent growth is motivated by modern AI/ML
algorithms that can use lower precision to get accurate answers much faster,
and with less memory and energy, than using traditional 32 and 64-bit floating
point formats [36]. We note that the IEEE 754 Floating Point Standard [36]
also defines 16 and 128-bit formats.
Industry and academia have introduced several low-precision data types
in recent years, including TensorFloat32 [39], 4-bit, 8-bit and 16-bit floating
point [43], Bfloat16 [49], micro-scaling (MX) variants [45], which are variations
on classical block floating point [51], and posits, which have variable precision
[31, 16]. Also, a new IEEE committee has recently convened to try to standard-
ize 8-bit floating point, since several companies have been building their own
separate, incompatible versions (https://github.com/P3109/Public). Vendors are currently prioritizing the accel-
eration of specific operations, notably general matrix multiplication (GEMM)
utilizing low- and mixed-precision cores that are now available on newer hard-
ware, including NVIDIA’s Tensor Cores [39], AMD’s Matrix Cores [2], IBM’s
MMA [47], and Arm’s SME [4], among other specialized architectures [44, 10].
Accelerators equipped with these lower- and mixed-precision cores outperform
traditional CPUs/GPUs in AI, ML, and HPC workloads and are more power
efficient [5]. These advantages are in turn motivating the scientific comput-
ing community to use them when possible to accelerate their higher precision
algorithms, by using mixed precision [33, 32].
It is worth distinguishing two kinds of numerical formats, those where all
the data needed to interpret a number is stored in a contiguous set of bits, and
those where a separate scaling factor is needed for a block of numbers. The
former includes the most commonly used floating point formats, and the latter
includes the traditional block floating point, micro-scaling variants, and the
possible new 8-bit IEEE standard mentioned above. Having separate scaling
factors, which may apply to different subsets of the input and output arguments,
will require more complicated data structures, whose portable treatment requires
further discussion.
For the Sparse BLAS API, we use template parameters for the input and
output formats, and anticipate the support for IEEE 754 32- and 64-bit formats,
for both real and complex data. We acknowledge that the format used for
the internal arithmetic operations might be different, i.e. of higher precision
to reduce accumulation errors. In particular, some hardware units already
implement a higher-precision accumulation format. In the Sparse BLAS API, we do
not specify the internal precision, but we anticipate it having at least the
accuracy of the input/output format. We expect any deviation from this to be
documented. We encourage the implementation of functionality variants, as long
as these, too, are well documented (including their numerical properties).
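A small sketch of the anticipated instantiations; the constructor arguments follow the listings above, and the value arrays are assumed to be filled elsewhere:

#include <complex>

using namespace spblas;

// 32-bit and 64-bit IEEE 754 real and complex value types as input/output formats
csr_view<float>                A_f(values_f, rowptr, colind, shape, nnz);
csr_view<double>               A_d(values_d, rowptr, colind, shape, nnz);
csr_view<std::complex<double>> A_z(values_z, rowptr, colind, shape, nnz);

// an implementation may, e.g., accumulate a float SpMV in double internally,
// as long as the deviation from the input/output precision is documented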

7.2 Error bounds to be satisfied


With the emergence of various arithmetic formats and the use of mixed precision,
it is becoming increasingly important for numerical software to carefully docu-
ment what accuracy and related properties users can expect (in the absence of
numerical exceptions, which are discussed in the next subsection).
Based on existing formats and algorithms, we propose a framework for de-
scribing error bounds which should cover a variety of the precisions described
in the last subsection, including mixed precision, and give several examples.
For simplicity, we consider just the error of computing a single dot product
$z = x^T \cdot y = \sum_{i=1}^{n} x_i \cdot y_i$.
There are three possible sources of error that an error bound needs to take
into account: converting the input format(s) of $x$ and $y$ to the internal
format(s) of $\hat{x}$ and $\hat{y}$ in which the arithmetic is done, computing
the dot product $\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i$ in the internal
format, and converting the result $\hat{z}$ to the output format, yielding the
final approximation $\hat{\hat{z}}$. It is most straightforward to bound the
maximum absolute error that each step can have, and sum them, yielding the bound
\[
\begin{aligned}
|z - \hat{\hat{z}}| &\le \Big|\sum_{i=1}^{n} x_i \cdot y_i - \sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i\Big| + \Big|\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i - \hat{z}\Big| + |\hat{z} - \hat{\hat{z}}| \\
&\le \Big|\sum_{i=1}^{n} (x_i - \hat{x}_i) \cdot y_i\Big| + \Big|\sum_{i=1}^{n} \hat{x}_i \cdot (y_i - \hat{y}_i)\Big| + \Big|\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i - \hat{z}\Big| + |\hat{z} - \hat{\hat{z}}|
\end{aligned}
\]

The first term, which is itself a dot product, can be bounded in various ways,
such as by
$\|x - \hat{x}\|_1 \cdot \|y\|_\infty = (\sum_{i=1}^{n} |x_i - \hat{x}_i|) \cdot \max_{1 \le i \le n} |y_i|$,
by $\|x - \hat{x}\|_2 \cdot \|y\|_2$ (Cauchy-Schwarz), or by
$\|x - \hat{x}\|_\infty \cdot \|y\|_1$; the second term is analogous. Thus it
suffices to bound the error of converting each $x_i$ to $\hat{x}_i$ (and
analogously $y_i$ to $\hat{y}_i$). If there is no conversion, or if the internal
format is of higher precision than the input precision, then $x = \hat{x}$ (and
$y = \hat{y}$). If the conversion is from floating point to block floating
point, then the error could be as large as
$|x_i - \hat{x}_i| \le \epsilon \|x\|_\infty$, where $\epsilon$ depends on the
length of the fixed point format used within the block floating point format.
If the conversion is from a higher to a lower precision floating point format,
then $|x_i - \hat{x}_i| \le \epsilon |x_i| + UN$, where $\epsilon$ is the machine
precision of the internal format, and $UN$ is the error from underflow (which
depends on whether flush-to-zero or gradual underflow is used).
The third term, $|\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i - \hat{z}|$, is the
error in actually computing the dot product. If this is done in floating point
with machine precision $\epsilon$ and underflow error $UN$, the error can be
bounded by
\[
\Big|\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i - \hat{z}\Big| \le f(n) \cdot \epsilon \cdot \sum_{i=1}^{n} |\hat{x}_i| \cdot |\hat{y}_i| + g(n) \cdot UN
\]

where f (n) and g(n) depend on the order of summation, ranging from f (n) = n
for conventional serial summation to f (n) = ⌈log2 n⌉ + 1 when using a binary
tree; g(n) ≤ 2n in both cases (see [13] for a more detailed analysis of underflow).
We note that if one knows that only m < n of the summands x̂i · ŷi are nonzero,
due to sparsity, then one can replace n by m in these bounds.
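As a concrete instance of these bounds, assume the input, internal, and output formats are all IEEE 754 binary64 and the summation is the conventional serial one; then the conversion terms vanish, $\epsilon = 2^{-53}$, and with only $m \le n$ structurally nonzero summands the bound specializes to

% worked instance: binary64 throughout, serial summation, m nonzero summands,
% so the conversion terms are zero, f(m) = m and g(m) <= 2m
\[
  |z - \hat{z}| \;\le\; m \cdot 2^{-53} \cdot \sum_{i=1}^{n} |x_i| \cdot |y_i| \;+\; 2m \cdot UN .
\]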
We also note that there is a large literature on using instructions like fused-
multiply-add, and floating-point tricks like two-sum, to compute dot products
with arbitrarily higher accuracy. These could possibly be of interest in using
low-precision floating point to simulate higher precision.
If the dot product is computed with block floating point using a fixed point
to represent each number, then the error bound depends on the width of the
fixed point accumulator used to sum the fixed point products. If it is wide
enough to compute the sum exactly, the error is zero. If it is necessary to right-
shift the integer parts of x̂i and ŷi to avoid integer overflow, then the error can
be bounded by
\[
\Big|\sum_{i=1}^{n} \hat{x}_i \cdot \hat{y}_i - \hat{z}\Big| \le n \cdot \epsilon \cdot \max_{1 \le i \le n} |\hat{x}_i| \cdot \max_{1 \le i \le n} |\hat{y}_i|
\]

Other approaches to avoiding overflow, with different error bounds, are imagin-
able.
In the case of posits, where the precision ǫ depends on the magnitude of the
numbers (numbers closer to 1 are more accurate than much larger or smaller
numbers), we refer to [16] and [14], the latter of which discusses an older pro-
posed format with similar properties. We also note that the designers of posits
propose including a ”quire”, a long accumulator capable of computing long dot
products exactly.
Long accumulators for exact dot products with standard floating point ar-
guments have also been proposed [38].
Finally, the last error term, $|\hat{z} - \hat{\hat{z}}|$, bounds the error when
converting the final result from the internal format $\hat{z}$ to the output
format $\hat{\hat{z}}$. If the output is of lower precision than the input, then
this can be bounded by $\epsilon |\hat{z}| + UN$, where $\epsilon$ and $UN$ are
those of the output format.

7.3 Consistent exception handling from input to output


This refers to propagating Infs and NaNs that appear in the input data of a
SparseBLAS operation in a “consistent” way to the output. Sparsity makes the
definition of “consistent” more subtle than the one proposed for the dense BLAS
in [15], which says that an Inf or NaN that is input or created during execution
should propagate to the output, unless there is a good mathematical reason it

should not. Sparsity makes this more subtle because a sparse matrix contains
implicit zeros, which are not stored and not expected to participate in arith-
metic operations, and so do not create NaNs (from 0*NaN=NaN or 0*Inf=NaN) to
propagate. For example in SpMV, computing y = A*x, if x(i) = NaN and the
i-th column of A is all implicit zeros, then x(i)=NaN will not cause any entry
of y to be a NaN, which is the user’s expectation.
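The following reference-style CSR SpMV (a sketch, not a required implementation) makes this expectation concrete: implicit zeros are simply never visited, so a NaN in x(j) for a column j consisting only of implicit zeros never enters any accumulation:

#include <cstddef>
#include <vector>

// y = A * x for an m-row CSR matrix; rowptr has m + 1 entries
void csr_spmv(std::size_t m,
              const std::vector<std::size_t>& rowptr,
              const std::vector<std::size_t>& colind,
              const std::vector<double>& values,
              const std::vector<double>& x,
              std::vector<double>& y) {
  for (std::size_t i = 0; i < m; ++i) {
    double acc = 0.0;
    // only stored (explicit) entries participate; implicit zeros are skipped
    for (std::size_t k = rowptr[i]; k < rowptr[i + 1]; ++k) {
      acc += values[k] * x[colind[k]];
    }
    y[i] = acc;
  }
}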
But in fact a sparse matrix may contain three kinds of zeros, which we may treat
differently:
1. An implicit zero. As described above, this is an entry A(i,j) that is not
stored in the user’s input data structure, and the user does not expect it
to participate in any arithmetic operations. We emphasize “user’s input
data structure” to distinguish from case (3) below.
2. An explicit zero. This is a zero stored in a data structure, and it is
expected to participate in the same arithmetic operations that a nonzero
value would. The data structure could be the user’s input data structure,
or one that is created by the system to optimize performance (examples
below).
3. An explicit masked zero. This refers to a zero that is explicitly stored in
a data structure, but another part of that data structure (a “mask”) indi-
cates that it is not supposed to participate in any arithmetic operations,
like an implicit zero.
Here are two examples of why the system may create a new data structure
with explicit zeros replacing some implicit zeros. First, being able to access and
operate on a segment of matrix entries of some fixed length makes it easier to
use SIMD instructions on some architectures. Second, it may only be necessary
to store one column index for such a segment instead of multiple indices (e.g.
blocked CSR, or BCSR), saving memory and reducing memory access time.
One example of a user’s input data structure that can contain explicit masked
zeros is ELLPACK, in which the nonzeros of an m x n sparse matrix are stored
in an m x k matrix A , and row i of A contains the nonzeros in row i of the
sparse matrix, where k is the maximum number of nonzeros in any row. This
means row i of A is padded with additional explicit zeros if it contains fewer
than k nonzeros. There is also an m x k Col matrix containing the column
indices of the matrix entries stored in A, with an index of -1 to mask out any
padded zeros.
To be “consistent” given these different kinds of zeros, we require both that
implicit zeros and explicit masked zeros do not create NaNs which would then
propagate to the output.
So far we have discussed the SpMV operation defined as y = A · x. If it
were defined as y = α · A · x + β · y, as in the dense BLAS, then we need to
consider the cases α = 0 and β = 0. In the dense BLAS, β = 0 is defined to
mean that the requested operation is y = α · A · x, i.e. the input y is never read,
only written, so any Infs or NaNs it may contain are irrelevant, and should not

propagate. Similarly, α = 0 is defined to mean y = β · y, so Infs and NaNs in
A or x are irrelevant, and never accessed. And if α = β = 0, the vector y is
simply initialized to 0.
We note that we are not trying to define all the cases in which an entry
of y must be an Inf or a NaN. To see why, suppose the result is the sum of four
numbers: [z, z, −z, −z] where z is a finite number greater than half the overflow
threshold. Then depending on the order of summation, the result could be NaN,
+Inf, -Inf or 0 (the correct answer). And if there is a fifth summand equal to
+Inf, the result could be +Inf or NaN. So consistency requires only that an Inf
input propagates to the output as an Inf or NaN; a NaN will naturally propagate
as a NaN. The issue of getting the bitwise identical result every time SpMV is
performed is the separate question of reproducibility, discussed below.
Finally, for the more complicated case of solving a triangular system of
equations, see [15] for a discussion of the dense case.

7.4 Bit-wise reproducibility from run to run


There are times when getting exactly the same result (bitwise identical output)
from multiple runs of a math routine is important, for instance in finance,
scientific research, cryptography, and code debugging, among many others. However, it
is also well known that in floating point arithmetic, addition and multiplication
are not always associative, so that the order in which array elements are multi-
plied and/or added together affects the truncation and rounding and ultimately
the final result, so bitwise identical results are often not guaranteed. The con-
cept of “conditional numerical reproducibility” (CNR) has been introduced in
some performance math libraries to address this desire to have algorithms or
functions with clear conditions for consistently obtaining reproducible floating
point results. Note that the choice to enable conditional numerical reproducibil-
ity may come with a sacrifice of performance. Those conditions under which
reproducibility is guaranteed have converged to four common scenarios, each
with a more stringent level of care required in the algorithm design:
Default (no reproducibility property). Some run to run variation in out-
put results may occur based on potentially different orders of floating
point arithmetic operations within the algorithm. These are often the
most performant algorithms.
CNR property. Reproducible results when calls occur within the same exe-
cutable where the number of computational threads the library uses re-
mains constant through the run. Can be less performant than the default
case.
Strict CNR property. Reproducible results when run within the same exe-
cutable where the number of computational threads used within the algo-
rithm can vary. Can be less performant than CNR property algorithms.
Full Numerical Reproducibility. Reproducible results when calls occur from
different executables on the same machine or on separate machines in-

cluding on separate computer architectures. Can be less performant than
Strict CNR property algorithms.
While the “Full Numerical Reproducibility” property might seem the most
desirable, it can be extremely difficult to achieve on modern architectures and
would likely sacrifice a great amount of performance, so it is unlikely to be
supported by any Sparse BLAS library. We will restrict the remaining discussion
to the support of the CNR and Strict CNR properties in the Sparse BLAS, which
means run-to-run reproducibility within a single executable, all things being
equal except possibly the number of computational threads.
While it is common for some sparse formats and algorithms to be naturally
reproducible due to the method of parallelizing the work (for instance,
row-based parallelism for SpMV in CSR format, where a single computational
thread processes the multiply-adds with the dense vector across a sparse matrix
row in the order the column indices are stored in that row), there are other
cases, involving reductions, atomic operations, or algorithms in which multiple
computational threads contribute to a floating point result, that require more
care to ensure the order of floating point operations is maintained from run to
run. We leave it to the Sparse BLAS implementer to devise algorithmic ways to
ensure such reproducible results and now turn our attention to how a user could
signal to the function that such care is required.
We propose to introduce a global CNR property for the library that can be set
via the following APIs. We also recommend that, when algorithms are available
for selection in an API, library implementers document which algorithms have
which CNR properties. This would allow users to apply a fine-grained approach as
well, without setting the global switch. If the global property and the local
algorithm selection do not align, the local algorithm selection is preferred.

namespace spblas {

// the default is no reproducibility guarantees
enum class cnrProperty {
  none = 0,
  cnr = 1,
  strict_cnr = 2
};

void set_cnr_property(cnrProperty prop);

cnrProperty get_cnr_property();

} // namespace spblas

Listing 13: Conditional Numerical Reproducibility properties in Sparse BLAS
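A usage sketch of the proposed global switch:

using namespace spblas;

set_cnr_property(cnrProperty::cnr);       // request run-to-run reproducibility
                                          // (fixed thread count) for subsequent calls

// ... Sparse BLAS calls here must honor the property, unless a locally
// selected algorithm (via an implementation-defined policy) overrides it ...

cnrProperty active = get_cnr_property();  // query the current global setting

set_cnr_property(cnrProperty::none);      // return to the (typically fastest) default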

Finally, we note that this notion of conditional numerical reproducibility is
orthogonal to the notion of consistent exception handling discussed previously.
Both can be set or not set independently. There is a small intersection,
however: the combination of “permissive” and “CNR” would mean that you always
get an Inf or a NaN in the same output location from run to run, but it may not
be “consistent” in the sense that we cannot guarantee which NaN payload is
propagated in an operation like NaN+NaN, just that one of the arguments is
propagated (see Sec. 6.2.3 of the IEEE 754 standard).

7.5 Test Suites


Non-associativity of floating-point arithmetic complicates the process of testing
numerical code. The most common approach is to check if a computed quantity
is within some theoretically informed error bound of a reference value. The par-
ticular error bound used and the way that the reference values are constructed
varies with the test suite implementation. Indeed, most developers of BLAS
libraries maintain their own test suites. The test suite for the reference BLAS
is available at www.netlib.org/lapack as part of the LAPACK release; if the
error bound is not satisfied for an individual test, the test is deemed a
failure. For the Sparse BLAS effort, we will deploy a unit test framework that
compares the accuracy of functionality implementations against a reference
implementation, typically a sequential CPU code relying on the CSR matrix
format. The test suite will then be composed of a set of sparse matrices
covering a wide range of sparsity structures and numerical properties.
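As an illustration of this kind of check, the sketch below compares an SpMV result against a reference within a bound of the form $f(n) \cdot \epsilon \cdot \sum_k |A(i,k)| \cdot |x(k)|$; the factor of two and the choice of $f$ are illustrative assumptions, and an actual suite would use the theoretically informed bounds of Section 7.2:

#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// returns true if every entry of y_test lies within the error bound of y_ref;
// row_abs_sum[i] holds sum_k |A(i,k)| * |x(k)| and max_row_nnz bounds the
// number of summands per row (the f(n) factor)
bool spmv_within_bound(const std::vector<double>& y_test,
                       const std::vector<double>& y_ref,
                       const std::vector<double>& row_abs_sum,
                       std::size_t max_row_nnz) {
  const double eps = std::numeric_limits<double>::epsilon();
  for (std::size_t i = 0; i < y_ref.size(); ++i) {
    // factor 2 leaves room for rounding in both the tested and the reference result
    const double bound =
        2.0 * static_cast<double>(max_row_nnz) * eps * row_abs_sum[i];
    if (std::abs(y_test[i] - y_ref[i]) > bound) return false;  // outside tolerance
  }
  return true;
}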

8 Future extensions
As hinted at in Section 4, we also recognize the value of many other sparse
formats that are not covered so far in this proposal, like the CSR-4 array
format, the block CSR formats, the ELLPACK format and its several variants, and
many others found in the literature and in existing sparse BLAS libraries.
Subsequent extensions of this proposal should address the addition of more
sparse matrix formats to the Sparse BLAS support. We do recommend that format
conversions between these additional formats should generally go back through
one of the original core CSR/CSC/COO sparse matrix formats, to limit the scope
and the potential need for conversion routines. Likewise, as the formats expand,
each implementation will need to define how it handles matrix products or matrix
additions between disparate input sparse formats, and denote which input format
pairs are actually supported.
Functionality-wise, we plan to expand the sparse triangular solves to also
accept dense matrices and sparse vectors, and extend the scope to more complex
algorithms like solvers and graph algorithms.
We also recognize the increasing use of tensors, which are higher-dimensional
analogs of matrices, in computing.

Sparse tensor computations such as tensor decompositions and contractions
present unique challenges, ranging from sparse tensor formats to functions such
as sparse tensor-times-tensor multiplication and the sparse matricized tensor
times Khatri-Rao product (MTTKRP).
Our current design is not optimal for representing and computing on tensors,
even when they are matricized, but we plan to provide extensions of our API
for sparse tensor computations.

References
[1] Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub
Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire To-
mov. Numerical linear algebra on emerging architectures: The PLASMA
and MAGMA projects. In Journal of Physics: Conference Series, volume
180, pages 012–037. IOP Publishing, 2009.
[2] AMD Corporation. AMD Instinct MI250X accelerator, 2021. [Online; accessed 30-May-2023].


[3] Ed Anderson, Z. Bai, C. Bischof, Susan L. Blackford, James W. Dem-
mel, Jack J. Dongarra, J. Du Croz, A. Greenbaum, Sven Hammarling,
A. McKenney, and Danny C. Sorensen. LAPACK User’s Guide. Society
for Industrial and Applied Mathematics, Philadelphia, Third edition, 1999.
[4] ARM Corporation. The Scalable Matrix Extension (SME), for Armv9-A. [Online; accessed 30-May-2023].


[5] Ehsan Atoofian. PTTS: power-aware tensor cores using two-sided sparsity.
J. Parallel Distributed Comput., 173:70–82, 2023.
[6] Benjamin Brock, Aydın Buluç, Timothy Mattson, Scott McMillan, and
José Moreira. The GraphBLAS C API specification, version 2.0.0. 2021.
[7] Benjamin Brock, Aydin Buluç, Timothy G. Mattson, Scott McMillan, and
José E. Moreira. A roadmap for the graphblas c++ api. In 2020 IEEE
International Parallel and Distributed Processing Symposium Workshops
(IPDPSW), pages 219–222, 2020.

[8] Benjamin Brock, Scott McMillan, Aydın Buluç, Timothy Mattson, and José Moreira. GraphBLAS C++ specification.

[9] Aydın Buluç and John R Gilbert. The Combinatorial BLAS: Design, im-
plementation, and applications. The International Journal of High Perfor-
mance Computing Applications, 25(4):496–509, 2011.

[10] Cerebras Corporation. Cerebras, 2022.
[11] Timothy A Davis. Algorithm 1000: SuiteSparse:GraphBLAS: Graph algorithms in the language of sparse linear algebra. ACM Transactions on Mathematical Software (TOMS), 45(4):1–25, 2019.
[12] Timothy A Davis. Algorithm 1037: SuiteSparse: GraphBLAS: Parallel
graph algorithms in the language of sparse linear algebra. ACM Transac-
tions on Mathematical Software, 49(3):1–30, 2023.
[13] J. Demmel. Underflow and the reliability of numerical software. SIAM J.
Sci. Stat. Comput., 5(4):887–919, Dec 1984.
[14] J. Demmel. On error analysis in arithmetic with varying relative precision.
In IEEE 8th Symp. on Computer Arithmetic (ARITH), pages 148–152,
1987.
[15] James Demmel, Jack Dongarra, Mark Gates, Greg Henry, Julien Langou,
Xiaoye Li, Piotr Luszczek, Weslley Pereira, Jason Riedy, and Cindy Rubio-
González. Proposed consistent exception handling for the blas and lapack,
2022.
[16] F. Dinechin, L. Forget, J.-M. Muller, and Y. Uguen. Posits: the good, the
bad and the ugly. In CoNGA (Conf. on Next Generation Arithmetic), Mar
2019. doi:10.1145/3316279.3316285.
[17] David S. Dodson, Roger G. Grimes, and John G. Lewis. Sparse extensions
to the FORTRAN basic linear algebra subprograms. ACM Trans. Math.
Softw., 17(2):253–263, jun 1991.
[18] David S. Dodson and John G. Lewis. Proposed sparse extensions to the
basic linear algebra subprograms. SIGNUM Newsl., 20(1):22–25, jan 1985.
[19] Michal Dominiak, Georgy Evtushenko, Lewis Baker, Lucian Radu
Teodorescu, Lee Howes, Kirk Shoop, Michael Garland, Eric
Niebler, and Bryce Adelstein Lelbach. P2300r7: std::execution.

[20] Jack J. Dongarra, J. Du Croz, Iain S. Duff, and Sven Hammarling. Al-
gorithm 679: A set of Level 3 Basic Linear Algebra Subprograms. ACM
Transactions on Mathematical Software, 16:1–17, March 1990.
[21] Jack J. Dongarra, J. Du Croz, Iain S. Duff, and Sven Hammarling. A
set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on
Mathematical Software, 16:18–28, March 1990.

[22] Jack J. Dongarra, J. Du Croz, Sven Hammarling, and R. Hanson. Al-
gorithm 656: An extended set of FORTRAN Basic Linear Algebra Sub-
programs. ACM Transactions on Mathematical Software, 14:18–32, March
1988.
[23] Jack J. Dongarra, J. Du Croz, Sven Hammarling, and R. Hanson. An ex-
tended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans-
actions on Mathematical Software, 14:1–17, March 1988.
[24] Jack J. Dongarra, Cleve B. Moler, J. R. Bunch, and G. W. Stewart. LIN-
PACK Users’ Guide. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 1979.
[25] Iain S. Duff, Michael A. Heroux, and Roldan Pozo. An overview of the
sparse basic linear algebra subprograms: The new standard from the BLAS
Technical Forum. ACM Trans. Math. Softw., 28(2):239–267, jun 2002.
[26] Iain S. Duff, Michele Marrone, Giuseppe Radicati, and Carlo Vittoli. Level
3 basic linear algebra subprograms for sparse matrices: A user-level inter-
face. ACM Trans. Math. Softw., 23(3):379–401, sep 1997.
[27] Salvatore Filippone and Alfredo Buttari. Object-oriented techniques for
sparse matrix computations in Fortran 2003. ACM Trans. Math. Softw.,
38(4), aug 2012.
[28] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel
linear algebra computation on sparse matrices. ACM Trans. Math. Softw.,
26(4):527–550, dec 2000.
[29] Mark Gates, Piotr Luszczek, Jakub Kurzak, Jack Dongarra, Konstantin
Arturov, Cris Cecka, and Chip Freitag. SLATE working note 2: C++
API for BLAS and LAPACK. Technical Report ICL-UT-17-03, Innovative
Computing Laboratory, University of Tennessee, June 2017. revision 06-
2017.
[30] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3.

[31] J. Gustafson and I. Yonemoto. Beating floating point at its own game:
Posit arithmetic. Supercomputing Frontiers and Innovations, 4:71–86, 2017.
https://posithub.org/docs/posit standard-2.pdf.
[32] Azzam Haidar, Harun Bayraktar, Stanimire Tomov, Jack Dongarra, and
Nicholas J Higham. Mixed-precision iterative refinement using tensor cores
on GPUs to accelerate solution of linear systems. Proceedings of the Royal
Society A, 476(2243):20200110, 2020.
[33] Nicholas J Higham and Theo Mary. Mixed precision algorithms in numer-
ical linear algebra. Acta Numerica, 31:347–414, 2022.

[34] Mark Hoemmen, Daisy Hollman, Christian Trott, Daniel Sunderland,
Nevin Liber, Alicia Klinvex, Li-Ta Lo, Lebrun-Grandie Damien, Graham
Lopez, Peter Caday, Sarah Knepper, Piotr Luszczek, and Timothy Costa.
P1673R13: A free function linear algebra interface based on the BLAS.

[35] Mark Hoemmen, Daisy Hollman, Christian Trott, Daniel Sunderland,

Nevin Liber, Li-Ta Lo, Damien Lebrun-Grandie, Graham Lopez, Peter Ca-
day, Sarah Knepper, et al. P1673: A free function linear algebra interface
based on the BLAS. Technical report, Open Standards, 2023.

[36] IEEE Microprocessor Standards Committee. IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008), pages 1–84, 2019.
[37] Jeremy Kepner and John Gilbert. Graph algorithms in the language of
linear algebra. SIAM, 2011.
[38] J. Koenig, D. Biancolin, J. Bachrach, and K. Asanovic. A hardware ac-
celerator for computing an exact dot product. In IEEE 24th Symp. on
Computer Arithmetic (ARITH), pages 114–121, 2017.
[39] Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam, and Srid-
har Ramaswamy. NVIDIA Ampere architecture in-depth, 2023. [Online;
accessed 30-May-2023].
[40] Daniel Langr and Pavel Tvrdik. Evaluation criteria for sparse matrix
storage formats. IEEE Transactions on parallel and distributed systems,
27(2):428–440, 2015.
[41] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh. Basic Lin-
ear Algebra Subprograms for FORTRAN usage. ACM Transactions on
Mathematical Software, 5:308–323, 1979.
[42] Bryce Adelstein Lelbach. P3300R0: C++ asynchronous parallel algorithms.
[43] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep
Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick
Judd, John Kamalu, et al. FP8 formats for deep learning. arXiv preprint
arXiv:2209.05433, 2022.
[44] Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li,
James Laudon, Cliff Young, Norman P. Jouppi, and David A. Patterson.
The design process for Google’s training chips: TPUv2 and TPUv3. IEEE
Micro, 41(2):56–63, 2021.

[45] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza
Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric
Dellinger, Kristof Denolf, et al. Microscaling data formats for deep learning.
arXiv preprint arXiv:2310.10537, 2023.
[46] Jeremy G Siek and Andrew Lumsdaine. The Matrix Template Library:
A generic programming approach to high performance numerical linear
algebra. In International Symposium on Computing in Object-Oriented
Parallel Environments, pages 59–70. Springer, 1998.
[47] William J Starke, Brian W Thompto, Jeff A Stuecheli, and José E Moreira.
IBM’s POWER10 processor. IEEE Micro, 41(2):7–14, 2021.
[48] Joerg Walter, Mathias Koch, et al. Boost uBLAS: Basic Linear Algebra
Library.
[49] Shibo Wang and Pankaj Kanwar. BFloat16: The secret to high performance
on cloud TPUs. Google Cloud Blog, 4, 2019.
[50] Erik Welch, Jim Kitchen, Sultan Orazbayev, ParticularMiner, Stan Seib-
ert, William Zijie Zhang, Adam Lugowski, and Paul Nguyen. python-
graphblas/python-graphblas: 2024.2.0, February 2024.
[51] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice Hall,
1963.
[52] Carl Yang, Aydın Buluç, and John D Owens. GraphBLAST: A high-
performance linear algebra-based graph framework on the GPU. ACM
Transactions on Mathematical Software (TOMS), 48(1):1–51, 2022.
[53] AN Yzelman, D Di Nardo, JM Nash, and WJ Suijlen. A C++ GraphBLAS:
specification, implementation, parallelisation, and evaluation. Preprint,
2020.
