Gundersen 2004
Gundersen 2004
SUMMARY
In this paper we show how to utilize Java’s native arrays for matrix computations. The disadvantages
of Java arrays used as a 2D array for dense matrix computation are discussed and ways to improve the
performance are examined. We show how to create efficient dynamic data structures for sparse matrix
computations using Java’s native arrays. This data structure is unique for Java and shown to be more
dynamic and efficient than the traditional storage schemes for large sparse matrices. Numerical testing
indicates that this new data structure, called Java Sparse Array, is competitive with the traditional
Compressed Row Storage scheme on matrix computation routines. Java gives increased flexibility without
losing efficiency. Compared with other object-oriented data structures Java Sparse Array is shown to have
the same flexibility. Copyright
c 2004 John Wiley & Sons, Ltd.
INTRODUCTION
Object-oriented programming has been favored in the last decade(s) and has an easy to understand
paradigm. It is straightforward to build large scale applications designed in an object-oriented manner.
Java’s widespread acceptance in commercial applications, and the fact that it is the main language
introduced to students, insure that it will increasingly be used for numerical computations [1,2].
Matrix computation is a large and important area in scientific computation. Developing efficient
algorithms for working with matrices is of considerable practical interest. Matrix multiplication is a
classic example of an operation, which is dependent on the details of the data structure. This operation
is used as an example and we discuss several different implementations using Java arrays as the
underlying data structure. We demonstrate the well-known row-wise layout of a 2D array and
∗ Correspondence to: Trond Steihaug, Department of Informatics, University of Bergen, Postbox 7800, N-5020 Bergen, Norway.
† E-mail: trond.steihaug@ii.uib.no
implement straightforward matrix multiplication algorithms that take this layout into consideration.
Implementation of dense matrix multiplication is also illustrated using the package JAMA [3]. Even in
such a simple computational task, no dense matrix multiplication algorithm can be expected to be
superior on all possible input parameters and on different computers, operating systems and Java
Development Kit (JDK)/Java Virtual Machines (JVMs). The testing environments used in this paper
are presented in Table I and the timings in the numerical testing are in milliseconds.
This paper focus on utilizing Java’s native arrays for numerical operations. Java’s native arrays have
not been regarded as efficient enough for high-performance computing. Replacing the native arrays
with a multidimensional array [4,5], and extensions like Spar [6] have been suggested. However,
packages like JAMA [3], JAMPACK [7] and Colt [8] use the native arrays as the underlying storage
format. Investigating how to utilize the flexibility of Java arrays for creating elegant and efficient
algorithms is of great practical interest.
A sparse matrix is usually described as a matrix where ‘many’ of its elements are equal to zero and
we benefit both in time and space by working only on the non-zero elements [9]. The difficulty is that
sparse data structures include more overhead since indexes as well as numerical values of non-zero
matrix entries are stored. There are several different storage schemes for large and unstructured sparse
matrices that are used in languages like FORTRAN, C and C++. These storage schemes have enjoyed
several decades of research and the most commonly used storage schemes for large sparse matrices are
the compressed row or column storage [9,10].
We introduce the use of Java arrays for storing sparse matrices and discuss different
implementations. We discuss the Compressed Row Storage (CRS) and two flexible implementations
made possible in Java: Java Sparse Array (JSA), and Sparse Matrix Concept (SMC). These storage
schemes are discussed on the basis of performance and flexibility. The compressed storage schemes
can be implemented in most languages, while the SMC is restricted to object-oriented languages. JSA is
new and unique for languages that allow jagged arrays like C# and Java (see [11] and the discussion
therein).
JAVA ARRAYS
Java’s native arrays are objects, inherit the Object class and are handled with references. However,
Java’s native arrays is not an extendable class, and adding self-defined behavior to an array we must
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
JAVA AND MATRIX COMPUTATIONS 801
use a wrapper/enclosing class. This is also an issue for primitive types like Double, Integer and
Boolean API classes that enclose the primitive types. Java’s native arrays can in some cases be seen
as a hybrid between an object and a primitive. The object nature of Java arrays imposes overhead on
Java applications compared with equivalent C and C++ programs. Creating an array is object creation.
When creating an array of primitive elements, the array holds the actual values for those elements.
An array of objects stores references to the actual objects. Since arrays are handled through references,
an array element may refer to another array, thus creating a multidimensional array. A rectangular array
of numbers as shown in Figure 1 is implemented as Figure 2. Since each element in the outermost
array of a multidimensional array is an object reference, arrays need not be rectangular and each inner
array can have its own size as in Figure 3.
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
802 G. GUNDERSEN AND T. STEIHAUG
We can expect elements of an array of primitive elements to be stored contiguously, but we cannot
expect the objects of an array of objects to be stored contiguously. For a rectangular array of primitive
elements, the elements of a row will be stored contiguously, but the rows may be scattered. A basic
observation is that accessing the consecutive elements in a row will be faster than accessing consecutive
elements in a column.
A matrix is a rectangular array of entries and the size is described in terms of the numbers of rows
and columns. The entry in row i and column j of matrix A is denoted Aij . To be consistent with Java
the first row and column index is 0, and element Aij will be referenced as A[i][j]. A vector is either
a matrix with only one column (column vector) or one row (row vector).
Consider computing the sum of all elements of the m × n matrix A
m−1 n−1
s= Aij (1)
i=0 j =0
n−1 2
This is the basic operation in computing the square of the Frobenius-norm A2F = m−1 i=0 j =0 Aij
of matrix A.
The code examples in Figures 4 and 5 show two implementations of (1) in Java. The only difference
between the two implementations is that the two for loops are interchanged. Loop-order (i, j ) implies
that the elements of the matrix are accessed row-by-row and loop-order (j, i) implies that the access
of the elements is column-by-column. Figure 6 shows that traversing columns is much less efficient
than traversing rows when the square matrix gets larger. This demonstrates the basic observation that
accessing the consecutive elements in a row is faster than accessing consecutive elements in a column.
Following the ‘do’s and don’ts for numerical computing in Java’ [12] and [13] the performance
improves if 1D arrays, double[], are traversed instead of 2D arrays, double[][].
double s = 0;
double[] array = new double[m][n];
for(int i = 0;i<m;i++){
double[] rowi = array[i];
for(int j = 0;j<n;j++){
s+=rowi[j];
}
}
Traversing a 1D array is an average of 30% more efficient than traversing a 2D array as shown in
Figure 6. This is referred to as upacked. If the dense m × n array is copied to an mn array [14] a minor
improvement is found using the Sun JDK, while using IBM JDK 1.3.1 the gain is an average factor of
three.
Differences between row and column traversing is also an issue in FORTRAN, C and C++.
For FORTRAN 77 and ANSI C the average factor is around two while for Java the average factor
is five. Since Java implements a 2D array with array of arrays, as shown in Figure 2, the elements of
the object double[][] are objects double[]. When an object is created and gets heap allocated,
the object can be placed anywhere in the memory. These issues partially explain the increase in time
differences in row- and column-wise loop order in Java compared with FORTRAN, C and C++.
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
JAVA AND MATRIX COMPUTATIONS 803
double s = 0;
double[] array = new double[m][n];
for(int i = 0;i<m;i++){
for(int j = 0;j<n;j++){
s+=array[i][j];’
}
}
Figure 4. Loop-order (i, j ), row wise.
double s = 0;
double[] array = new double[m][n];
for(int j = 0;j<n;j++){
for(int i = 0;i<m;i++){
s+=array[i][j];
}
}
Figure 5. Loop-order (j, i), column wise.
3000
2500
2D Column
2000
milliSeconds
1500
1000
500
2D Row
Unpacked
0
0 500 1000 1500 2000 2500
Matrix Dimension
Figure 6. Time accessing a 2D Java array row wise and column wise (m = n).
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
804 G. GUNDERSEN AND T. STEIHAUG
Table II. The SMM algorithm on input AB with different loop-orders. The time is in
milliseconds and the measuring is done on Red Hat Linux with Sun’s JDK 1.4.1.01.
A straightforward implementation of (2) using Java’s native arrays is like the following code.
for(int i = 0; i<m;i++){
for(int j = 0;j<p;j++){
for(int k = 0;k<n;k++){
C[i][j] += A[i][k]*B[k][j];
}
}
}
By interchanging the three for-loops there are six distinct ways of doing matrix multiplication.
These can be divided into three groups based on the innermost loop: inner row, mixed column and row,
and inner column. If for each row of A the elements of B are accessed row-by-row, the resulting matrix
C is constructed row-by-row. This is a pure row loop-order denoted (i, k, j ). In the implementation the
second and third for-loops are interchanged compared with the straightforward implementation above
which will be (i, j, k), where the inner loop is an innerproduct of a row and a column.
Table II shows the results of performing the six straightforward matrix multiplication algorithms
on AB. It is evident from the table that the inner column algorithms are the least efficient algorithms,
while the inner row algorithms are the most efficient implementations. This is due to accessing different
object arrays when traversing columns as opposed to accessing the same object array several times
(when traversing a row).
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
JAVA AND MATRIX COMPUTATIONS 805
The pure inner row-oriented algorithm with loop-order (i, k, j ) will be more efficient than the inner
row-oriented algorithm with loop-order (k, i, j ), since the latter traverses the columns of matrix A in
the loop-body of the second for-loop.
JAMA
JAMA [3] is a basic linear algebra package for Java. It provides user-level classes for constructing and
manipulating real dense matrices. It is meant to provide sufficient functionality for routine problems,
packaged in a way that is natural and understandable to non-experts. It is intended to serve as
the standard matrix class for Java. JAMA is composed of six Java classes where JAMA’s matrix
multiplication algorithm, the A.times(B) algorithm, is part of the Matrix class. In this algorithm
the resulting matrix is constructed column-by-column, loop-order (j, i, k), as shown in Figure 7.
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
806 G. GUNDERSEN AND T. STEIHAUG
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
JAVA AND MATRIX COMPUTATIONS 807
most efficient. The pure column algorithm accumulates the columns of matrix C into a 1D array and
updates the C matrix in the loop body of the first for-loop.
Extracting a row from a 2D array to a 1D array is a (1) operation (double[] Arowi = A[i],
where A is of type double[][]). But extracting a column is (m), since we must copy all the
elements from matrix A to a 1D array in a loop.
Matrix vector products are (n2 ) operations while the matrix–matrix products are (n3 ) operations.
On average we have a 50% gain in performance if we use the algorithm for matrix vector products
instead of the more general algorithms for matrix–matrix products that treats vectors as matrices.
SPARSE MATRICES
There is a released package implemented in Java for numerical computation on sparse matrices, called
Colt [8], and there are separate algorithms like [16] using a coordinate storage scheme. The coordinate
storage scheme is the most straightforward structure to represent a sparse matrix; it simply records
each non-zero entry together with its row and column index. In [17], the coordinate storage format as
implemented in C++ in [18] is used. The coordinate storage format is not an efficient storage format for
large sparse compared with compressed row format [15]. There are also some benchmark algorithms
like [19] that perform sparse matrix computations using a compressed row storage scheme. For matrices
with special structures like symmetry the storage schemes must be modified.
All the sparse matrices used as test matrices in this paper were taken from Matrix Market [20].
All the matrices are square and symmetry has not been utilized.
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
808 G. GUNDERSEN AND T. STEIHAUG
14000
JAMA
12000
10000
milliSeconds
8000
Row
6000
4000
2000
0
0 500 1000 1500
Matrix Dimension
1000
900
800
700
milliSeconds
600
JAMA
500
400
300
200
100
Row
0
0 500 1000 1500 2000 2500
Matrix Dimension
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
JAVA AND MATRIX COMPUTATIONS 809
14000
JAMA
12000
10000
milliSeconds
8000
Row
6000
4000
2000
0
0 500 1000 1500
Matrix Dimension
14000
12000
10000
milliSeconds
8000
Row
6000
JAMA
4000
2000
0
0 500 1000 1500
Matrix Dimension
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
810 G. GUNDERSEN AND T. STEIHAUG
The JSA format is a new concept for storing sparse matrices made possible with Java and C#, which
both support jagged arrays implementation. This concept is illustrated in Figure 13. This unique
concept uses an array of arrays. There are two arrays, one for storing the references to the value arrays
and one for storing the references to the corresponding index arrays (one for each row).
With the JSA format it is possible to manipulate the rows independently without updating the rest of
the structure, as would have been necessary with CRS. This means a row can be removed or inserted
into the JSA structure without creating a new large 1D array for values and indexes. Each row consists
of a value and a corresponding index array each with its own unique reference. JSA uses 2nnz + 2n
storage locations compared with 2nnz + n + 1 for the CRS format.
The non-zero structure of the matrix (3) is stored as follows in JSA.
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
JAVA AND MATRIX COMPUTATIONS 811
In this JavaSparseArray class, there are two instance variables, double[][] and int[][].
For each row these arrays are used to store the actual value and the column index. We will see that this
structure can compete with CRS when it comes to performance and memory use.
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
812 G. GUNDERSEN AND T. STEIHAUG
Table III. The matrix multiplication algorithms for C = AA. The time is in milliseconds
and the measuring is done on Red Hat Linux with Sun’s JDK 1.4.0.
Methods that work explicitly on the arrays (values and indexes) are placed in the Rows objects
and instances of the Rows object are accessed through method calls. Breaking the encapsulation and
storing the instances of the Rows object as local variables makes the SMC very similar to JSA. If all the
methods that work explicitly on the instances of Rows are in the Rows class the efficiency is improved.
The actual storage in terms of primitive elements is the same for SMC and JSA, but JSA does not
use the extra object layer for each row creating extra references (2m + 2 compared with 3m + 1).
The object creation profiler JProfiler [24] is used to analyze the memory usage of JSA and SMC for a
sparse matrix multiplication. For n = 17 281 the profiler indicates that in SMC the Rows object causes
2.3% of the overall memory use. SMC uses 1% more memory than JSA.
On performing a matrix multiplication algorithm on CRS, we do not know the actual size (nnz) or
structure of the resulting matrix. This structure can for general matrices be found by using the data
structures of A and B. The implementation used is based on FORTRAN routines [9,25] using Java’s
native arrays.
The CRS approach when we know the exact size (nnz) of C performs well. It does have, not
surprisingly, the best performance of all the matrix multiplication algorithms we present in this paper,
see Table III. However, it is not a practical solution, since a priori information is usually not available.
If no information is available on the number of non-zero elements in C, a ‘good guess’ or a symbolic
phase is needed to compute it.
JSA is in a row-oriented manner and the matrix multiplication algorithm is performed with the
loop-order (i, k, j ). The matrices involved are traversed row-wise and the resulting matrix is created
row-by-row. For each row of matrix A, those rows of B indexed by the index array of the sparse row of
A are traversed. An element in the value array of A is multiplied by the value array of B. The result is
accumulated in a row of C, indexed by the index value of the element in the B row. We can construct
double[m][1] and int[m][1] to contain the rows of the resulting elements of matrix C, before
we actually create the actual rows of the resulting matrix.
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
JAVA AND MATRIX COMPUTATIONS 813
When comparing CRS with a symbolic phase on AB with JSA, we see that JSA is slightly more
efficient than CRS with an average factor of 1.4. This CRS algorithm is the most realistic considering
a package implementation, and therefore the most realistic comparison to the JSA timings. The SMC
approach has to create a Rows object for each value and index array in the resulting matrix, in addition
to always accessing the value and index array from method calls. It is important to state that we cannot
draw too general conclusions on the performance of these algorithms on the basis of the test matrices
we used in this paper. But these results give a strong indication of their overall performance and that
matrix multiplication on JSA is both fast and reliable. The most natural notation of retrieving an
element from a matrix A is A[i][k]. This notation is possible with JSA, unlike SMC, which first
has to get the Rows object i, then the element k by A.rows[i].elementAt(k).
Java’s native arrays, whose elements are of primitive types, have better performance inserting and
retrieving elements than the utility classes java.util.Vector, java.util.ArrayList, and
java.util.LinkedList [13].
Consider the outer product abT of the two vectors a ∈ m and b ∈ n , where many of the elements
of the vectors are 0. The outer product will be a sparse matrix where the elements in some rows are 0,
and the corresponding sparse data structure will have rows without any elements. A typical operation
is a rank one update of an m × n matrix A.
where ai is element i in a and bj is element j in b. Thus only those rows of A where ai = 0 need
to be updated. The update method (sometimes referred to as additionEquals) updates row by row. For
JSA we only need to copy data for the row we are updating to a larger resulting row array. Using CRS
we must copy the whole non-zero structure over to a larger CRS structure (creating a new value and
columnindex array); this must be done for each row. This extensive copying creates the overhead
for update on CRS compared with update on JSA. This is clearly shown in Table IV where 10%
of the elements in a are non-zero. The JSA algorithm is an average factor of 78 times faster than
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
814 G. GUNDERSEN AND T. STEIHAUG
the CRS algorithm. The overhead in creating a native Java array is proportional to the number of
elements, thus accounting for the major difference between CRS and JSA on matrix updates.
We need to consider the row-wise layout when we use Java arrays as 2D arrays. This well-known fact is
easily illustrated with the matrix sum example. To eliminate this effect arrays must either be unpacked
or the algorithms must be row-oriented. For sparse matrices we have illustrated this manipulating only
the rows of the structure without updating the rest of the structure. This excludes CRS in favor of JSA
and SMC as the preferable sparse matrix data structures in Java. The SMC has a unique implementation
in Java, as JSA. We know that JSA could replace CRS for most algorithms. However, the benefits of
JSA compared with SMC need further studies. Block matrices are fundamental in numerical linear
algebra and will be an extension of this work. However, this will have to be a more object-oriented
approach.
REFERENCES
1. Bischof CH, Bücker HM, Henrichs J, Lang B. Hands-on training for undergraduates in high-performance computing using
Java. Applied Parallel Computing: New Paradigms for HPC in Industry and Academia, Sørevik T, Manne F, Moe R,
Gebremedhin AH (eds.), Proceedings of the 5th International Workshop, PARA 2000, Bergen, Norway, June 2000 (Lecture
Notes in Computer Science, vol. 1947). Springer: Berlin, 2001; 306–315.
2. Langtangen HP, Tveito A. How should we prepare the students of science and technology for a life in the computer age?
Mathematics Unlimited—2001 and Beyond, Enquist B, Schmid W (eds.). Springer: Berlin, 2000; 809–826.
3. Hicklin J, Moler C, Webb P, Boisvert RF, Miller B, Pozo R, Remington K. JAMA: A Java matrix package.
http://math.nist.gov/javanumerics/jama [5 May 2003].
4. Moreira JE, Midkiff SP, Gupta M. Supporting multidimensional arrays in Java. Concurrency and Computation: Practice
and Experience 2003; 15(3–5):317–340.
5. Gosling J. The evolution of numerical computing in Java. Technical Memorandum, Sun Microsystems Laboratories, 1997.
6. van Reeuwijk K, Kuijlman F, Sips HJ. Spar: A set of extensions to Java for scientific computation. Concurrency and
Computation: Practice and Experience 2003; 15(3–5):277–299.
7. Stewart GW. JAMPACK: A package for matrix computations.
ftp://math.nist.gov/pub/Jampack/Jampack/AboutJampack.html [5 May 2003].
8. The Colt Distribution. Open source libraries for high performance scientific and technical computing in Java.
http://hoschek.home.cern.ch/hoschek/colt/ [5 May 2003].
9. Pissanetsky S. Sparse Matrix Technology. Academic Press: Boston, MA, 1984.
10. Saad Y. Iterative Methods for Sparse Linear Systems. PWS Publishing Company: Boston, MA, 1996.
11. Thiruvathukal GK. Java at middle age: Enabling Java for computational science. Computing in Science and Engineering
2002; 4(1):74–84.
12. Boisvert RF, Moreira J, Philippsen M, Pozo R. Java and numerical computing. Computing in Science and Engineering
2001; 3(2):18–24.
13. Shirazi J. Java Performance Tuning. O’Reilly: Cambridge, MA, 2000.
14. Blount B, Chatterjee S. An evaluation of Java for numerical computing. Scientific Programming 1999; 7(2):97–110.
15. Gundersen G. The use of Java arrays in matrix computation. Candidatus Scientarium (Master in Science) Thesis,
Department of Informatics, University of Bergen, Bergen, Norway, 2002.
16. Java Numerical Toolkit. http://math.nist.gov/jnt [5 May 2003].
17. Chang R-G, Chen C-W, Chuang T-R, Lee JK. Towards automatic support of parallel sparse computation in Java with
continuous compilation. Concurrency: Practice and Experience 1997; 9:1101–1111.
18. Pozo R, Remington K, Lumsdaine A. SparseLib++ version 1.5 Sparse Matrix reference guide. Technical Report,
Mathematical and Computational Sciences Division, National Institute of Standards and Technology, MD, April 1996.
19. SciMark 2.0. http://math.nist.gov/scimark2/about.html [5 May 2003].
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815
JAVA AND MATRIX COMPUTATIONS 815
Copyright
c 2004 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2004; 16:799–815