MAGNUS ÅKERFELDT
Abstract
The Stream API was added in Java 8. With the help of lambda expres-
sions (anonymous functions), streams enable functional-style operations on
sequences of elements. In this project, we evaluate how streams perform
in comparison to imperative loops in terms of execution time, from the
perspective of how streams are commonly used in public GitHub repositories.
Additionally, two algorithms are implemented with and without streams, to
assess the impact of stream usage on algorithmic performance. Parallel
streams are only examined briefly due to their infrequent usage.
We find that sequential streams in general are slower than imperative loops.
However, stream performance heavily relies on how many elements are being
processed, which is referred to as input size. For input sizes smaller than
100, most stream pipelines are several times slower than imperative loops.
Meanwhile, for input sizes between 10 000 and 1 000 000, streams are on
average only 39% to 74% slower than loops, and in some cases, they even
slightly outperform them.
Additionally, we observe that using streams when implementing algo-
rithms in some cases leads to much slower execution times, while in other
cases, it barely affects the execution time at all. We conclude that stream
performance primarily depends on input size, presumably because of the
high abstraction overhead of creating streams, but their performance also
depends on other factors, such as operation type and pipeline length.
Keywords
Java streams, performance, benchmarking, loops, functional programming
Sammanfattning
Streams were introduced with Java 8. With the help of lambda expressions
(anonymous functions), streams enable functional-style operations on
sequences of elements. In this project, we measure how streams perform
compared to imperative loops in terms of execution time, from the perspective
of how streams are commonly used in public GitHub projects. Parallel streams
are only examined to a limited extent, due to how rarely they are used.
The results show that streams are generally slower than imperative loops.
The difference in performance, however, depends strongly on the input size,
i.e., how many elements the stream processes. For input sizes smaller than
100 elements, streams are often several times slower than their imperative
counterparts. Meanwhile, for input sizes between 10 000 and 1 000 000
elements, streams are on average only 39% to 74% slower than their
imperative counterparts, and in some cases they are even slightly faster than
imperative code.
Furthermore, we observe that using streams when implementing algorithms
in some cases leads to much longer execution times, while in other cases it
barely affects the execution time at all. We conclude that the performance of
streams primarily depends on input size, but also on other factors, such as
operation type and how many operations are used in a pipeline.
Nyckelord
Java streams, performance, loops, functional programming
Contents
1 Introduction
  1.1 Background
    1.1.1 Java Streams
    1.1.2 Performance of streams
  1.2 Problem
    1.2.1 Research questions
  1.3 Purpose
  1.4 Goals
  1.5 Research Methodology
  1.6 Delimitations
2 Background
  2.1 Streams
    2.1.1 Input, type & output
    2.1.2 Intermediate & terminal operations
    2.1.3 Operations
    2.1.4 Laziness
    2.1.5 Parallelization
    2.1.6 Encounter order
  2.2 Common stream usage
    2.2.1 Research on stream usage
    2.2.2 Common operations
    2.2.3 Common pipelines
    2.2.4 Input, type & output
    2.2.5 Pipeline length
    2.2.6 Input size
    2.2.7 Parallelization is rare
    2.2.8 Encounter order
    2.2.9 Lambda expressions
    2.2.10 Summary
  2.3 Low level Java
    2.3.1 Garbage collection
  2.4 Benchmarking Java
  2.5 Low level streams
    2.5.1 Stream speed
    2.5.2 Stream bytecode
  2.6 Algorithms
    2.6.1 FFT
    2.6.2 Dijkstra's algorithm
  2.7 Related Work
    2.7.1 Streams vs imperative Java
    2.7.2 Performance of streams
3 Method
  3.1 RQ 1: Average stream performance
  3.2 RQ 1.1: Intermediate operations
  3.3 RQ 1.2: Terminal operations
  3.4 RQ 1.3: Pipeline length
  3.5 RQ 1.4: Parallel pipelines
  3.6 RQ 2: Algorithms with streams
  3.7 JMH
  3.8 System specifications
4 Implementation
  4.1 Replacing streams with loops
  4.2 A sample benchmark
  4.3 Input for streams
  4.4 Algorithms with streams
5 Results and Validity
6 Discussion
  6.1 General results
  6.2 Smaller input sizes
  6.3 Garbage collection
  6.4 Errors & Larger input sizes
  6.5 Algorithms
  6.6 Performance trade-off
  6.7 Previous Research
References
A Slowdown factors
Glossary
Slowdown factor In the context of this thesis, the slowdown factor of a stream
pipeline is the runtime of that pipeline divided by the runtime of its
imperative counterpart.
Smaller input sizes Input sizes between 1 and 1 000 (in the context of this
thesis).
Stream source The data source from which a stream is created, e.g., an
arraylist.
Chapter 1
Introduction
1.1 Background
As of 2022, Java was the third most popular programming language on GitHub
[1]. Almost 30 years after it was initially released, it is still widely used,
and it is constantly being updated. Java is heavily object oriented, has high
portability, and is strongly typed. Furthermore, it is primarily an imperative
language, since Java programs often are written as a sequence of instructions
that specify how certain tasks are to be performed in a very precise manner,
step by step.
In contrast to imperative languages, functional/declarative programming
languages describe the desired state or output of a program, instead of
specifying the exact steps needed to reach that state. The programmer specifies
what needs to be solved, without having to specify exactly how that can
be achieved. These languages offer high-level functions, which can replace
control flow statements such as if-else and loops.
Functional programming has recently become more popular, as seen by the
popularity of languages such as Haskell, Scala, and Clojure [2]. Perhaps this
is the reason for why some functional programming elements were added to
Java in 2014, with the introduction of streams and lambda expressions, which
are often used together.
Lambda expressions are similar to anonymous functions, and are written
in the form (parameter -> expression). For example, the lambda
expression (x -> x + 2) takes an argument x and returns the value
x + 2.
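As a minimal illustrative sketch (the class and variable names are ours, not from the thesis), such a lambda expression can be stored in one of Java's built-in functional interfaces and invoked like an ordinary method:

```java
import java.util.function.IntUnaryOperator;

public class LambdaDemo {
    public static void main(String[] args) {
        // The lambda expression (x -> x + 2) stored in a functional interface
        IntUnaryOperator addTwo = x -> x + 2;
        System.out.println(addTwo.applyAsInt(5)); // prints 7
    }
}
```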
// Imperative version
int sumOfEvenSquares(int[] v) {
    int result = 0;
    for (int i = 0; i < v.length; i++) {
        if (v[i] % 2 == 0) {
            result += v[i] * v[i];
        }
    }
    return result;
}

// Stream version
int sumOfEvenSquares(int[] v) {
    return IntStream.of(v)
        .filter(x -> x % 2 == 0)
        .map(x -> x * x)
        .sum();
}
Figure 1.1: The same function implemented without and with Java streams.
• filter takes a lambda expression which should return true for all
of the elements which should remain in the stream. In this case, the
lambda expression (x -> x % 2 == 0) returns true for all even
elements, which means that all odd elements are removed.
Because these operations are stateless (a term which is further explained in the
next chapter), the sequence of elements will only be traversed a single time,
even though three operations are specified. This is a key feature of streams
which increases performance.
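The single-traversal behavior can be made visible with a small sketch (an illustrative example of ours, not code from the thesis): a side-effect counter inside filter shows that each element passes through the fused pipeline exactly once, even though several operations are chained.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class SinglePassDemo {
    public static void main(String[] args) {
        AtomicInteger visits = new AtomicInteger();
        int sum = IntStream.of(1, 2, 3, 4, 5)
                // the side effect counts how often filter sees an element
                .filter(x -> { visits.incrementAndGet(); return x % 2 == 0; })
                .map(x -> x * x)
                .sum();
        System.out.println(sum);          // 4 + 16 = 20
        System.out.println(visits.get()); // 5: one visit per element, one pass
    }
}
```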
1.2 Problem
The main objective of this project is to compare the runtime performance
of programs which use Java streams with programs which do not use Java
streams, and analyze how this performance difference depends on certain
factors. Specifically, the performance of Java streams as they are commonly
used in GitHub projects is examined.
Since streams can be used in many different ways, it is likely that stream
performance depends on several factors, such as input size, operation type,
and pipeline length. Thus, the main research question has been divided into
several subquestions, in order for this project to give a broader view of stream
performance, and because the subquestions give the thesis a clearer structure.
1.3 Purpose
As mentioned, the performance difference between imperative Java and Java
streams has only been examined by one previous study, in a limited and non-
representative manner, and previous researchers have pointed out the lack
of thorough stream benchmarking [6]. This project could help elucidate the
question of how much performance is sacrificed through the usage of streams.
The results of this project could be useful for programmers who want to
get the perks of functional programming through the use of Java streams, but
are unsure if the performance trade-off is worth it. If developers choose to use
Java streams because of the results of this study, then this could possibly help
them construct more easily maintainable programs with fewer bugs, which
could cut down on work hours and reduce stress.
This thesis also aims to provide specific information on how the 20 most
popular stream operations perform for different input sizes. This can help
developers make more informed choices with regard to when they should use
streams, and how much the runtime performance will be impacted by this.
1.4 Goals
The main goal of this project is to find out how streams, as they are
commonly used, generally perform compared to imperative Java, and how
this performance difference depends on different factors, such as input size
and pipeline length.
1.6 Delimitations
Some stream operations which are rarely used are not benchmarked in this
project [3]. However, the majority of them are tested.
Large programs using streams are not benchmarked, since Java
benchmarks are more accurate for smaller programs, and because it is easier
to isolate stream performance with smaller programs. However, individual
pipelines with multiple stream operations are examined. Furthermore, some
popular algorithms are implemented (and benchmarked) with and without Java
streams.
This project does not involve benchmarking real-world software projects.
Instead, imperative code which mimics commonly used stream pipelines is
constructed [7] [9]. The reasoning for this is that real-world projects often have
their most relevant function nested deep within several layers of abstraction,
which complicates benchmarking. Future studies could attempt to benchmark
real-world projects using streams, and then replace the streams with imperative
code, in order to compare the performance difference.
Parallel streams are only considered in a limited manner, since previous
studies show that parallel streams are very rarely used in GitHub projects [7]
[9].
This project focuses on runtime speed, since this is arguably the most
important performance metric. Previous studies which have benchmarked
Java and other languages which use the JVM have also looked into startup-
time and compile time, as well as the size of the compiled Java bytecode file [10]
[4].
Chapter 2
Background
This chapter presents relevant background theory for this thesis. This includes
how streams work and how they are commonly used, how Java (and Java
streams) are implemented on a low level, and how this is relevant for
benchmarking, and previous related research.
2.1 Streams
In this section, important stream terms, operations, and categories are
explained.
Some stateful operations may need to process the entire input before a
result can be produced, which can impact performance when multiple
intermediate operations are used. An example of this is the sorted
operation, for which
no results can be produced until all elements have been seen [3].
2.1.3 Operations
Figure 2.2 and Figure 2.3 explain the ten most popular intermediate and
terminal operations, all of which are examined in this project. The
intermediate operations are divided into the categories stateful and stateless,
and the terminal operations are divided into categories based on their return
type.
2.1.4 Laziness
All intermediate operations are lazy, i.e., they are not executed until a
terminal operation is invoked. Because of this, the stream pipeline source-
filter-map-sum, which does not have any stateful operations, only
requires one traversal, even though three different operations are specified.
Laziness also allows for avoiding unnecessary processing when not all data
from a source is required. For example, the operation “find the first integer that
is larger than 50” only needs to examine integers until the first integer larger
than 50 is found. For these reasons, laziness greatly increases the performance
of streams. In contrast, almost all terminal operations are eager, which means
that they trigger the traversal of the stream [3]. A stream pipeline can only be
traversed once; after that, it is considered consumed.
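The short-circuiting enabled by laziness can be sketched as follows (an illustrative example of ours, using the "first integer larger than 50" scenario from above): peek counts how many elements are examined before findFirst stops the traversal.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class LazinessDemo {
    public static void main(String[] args) {
        AtomicInteger examined = new AtomicInteger();
        int first = IntStream.of(10, 40, 60, 80, 90)
                .peek(x -> examined.incrementAndGet()) // count examined elements
                .filter(x -> x > 50)
                .findFirst()
                .getAsInt();
        System.out.println(first);          // 60
        System.out.println(examined.get()); // 3: the last two elements are never touched
    }
}
```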
2.1.5 Parallelization
Efficiently parallelizing imperative code can often be a complex and time-
consuming task, whereas simple parallelization is a key feature of streams.
A pipeline can be switched between parallel and sequential execution using
the operations parallel and sequential, although the execution mode
that is set last applies to the entire pipeline.
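A minimal sketch of running a pipeline in parallel (an illustrative example of ours, not from the thesis): the same sum is computed sequentially and in parallel, where the parallel version splits the work across the common fork-join pool.

```java
import java.util.stream.IntStream;

public class ParallelDemo {
    public static void main(String[] args) {
        // Sequential sum of 1..1000
        long seq = IntStream.rangeClosed(1, 1000).sum();
        // The same pipeline marked as parallel; the result is identical
        long par = IntStream.rangeClosed(1, 1000).parallel().sum();
        System.out.println(seq + " " + par); // prints: 500500 500500
    }
}
```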
Intermediate operations (Figure 2.2, partially recovered):
Stateless:
  filter() — removes the elements which do not satisfy the given predicate.
    Example: [3, 4, 5, 6, 7].filter(x -> x % 2 == 0) => [4, 6]
    (removes the odd elements).
  map() — applies the given function to each element.
    Example: [3, 4, 5, 6, 7].map(x -> x * x) => [9, 16, 25, 36, 49]
    (squares all elements).
Stateful:
  sorted() — sorts the stream.

Terminal operations (Figure 2.3, partially recovered):
boolean:
  anyMatch() — returns true if any element satisfies the given predicate.
    Example: [3, 4, 5, 6].anyMatch(x -> x % 2 == 0) => true.
numeric:
  sum() — sums the elements. Example: [2, 3, 6].sum() => 11.
  count() — counts the elements. Example output: 5.
element:
  findAny() — returns some element (sometimes faster than findFirst
    for parallel processing). Example: [5, 3, 2, 4].findAny() => 2.
Collection/array:
  collect() — turns the stream into a List.

Figure 2.3: The terminal operations which are examined in this project. The
examples have the form [inputStream].operation(arguments)
=> output. The sum example takes the stream [2, 3, 6] and produces
the value 11.
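The operations in the figure can be exercised with a small self-contained sketch (ours, for illustration; the inputs mirror the figure's examples):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class TerminalOpsDemo {
    public static void main(String[] args) {
        // anyMatch: true if any element satisfies the predicate
        boolean anyEven = Stream.of(3, 4, 5, 6).anyMatch(x -> x % 2 == 0);
        // count: the number of elements in the stream
        long count = Stream.of(1, 2, 3, 4, 5).count();
        // sum: the figure's example, [2, 3, 6] -> 11
        int sum = IntStream.of(2, 3, 6).sum();
        // collect: turns the stream into a List
        List<Integer> list = Stream.of(1, 2, 3).collect(Collectors.toList());
        System.out.println(anyEven + " " + count + " " + sum + " " + list);
        // prints: true 5 11 [1, 2, 3]
    }
}
```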
encountered in a defined order (such as arrays and lists, but not sets). Note that
encounter order is not the same as sort order. If functions such as sorted
are called on streams without order, then an order will be applied to the
resulting stream. Parallelization of ordered streams can sometimes result in
worse performance.
Certain aggregate operations, such as distinct, and operations which
are dependent on an order, such as limit, sometimes perform better without
an encounter order. Thus, if the ordering is not relevant, the performance of
stream pipelines can be improved if stateful and terminal operations are
de-ordered with unordered [3] [7].
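De-ordering can be sketched as follows (an illustrative example of ours): unordered relaxes the encounter-order constraint before the stateful distinct operation, which would otherwise be required to keep the first occurrence of each duplicate.

```java
import java.util.List;
import java.util.stream.Collectors;

public class UnorderedDemo {
    public static void main(String[] args) {
        List<Integer> input = List.of(3, 1, 2, 3, 1);
        // unordered() drops the encounter order; this can help stateful
        // operations such as distinct, particularly under parallel processing
        List<Integer> result = input.stream()
                .unordered()
                .distinct()
                .collect(Collectors.toList());
        System.out.println(result.size()); // 3 distinct values remain
    }
}
```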
Table 2.1: Most common intermediate operations. Table adapted from the
report "Characterizing Java Streams in the Wild" [7].
Table 2.2: Most common terminal operations. Table adapted from the report
"Characterizing Java Streams in the Wild" [7].
source-map-collect
source-filter-collect
source-collect
source-anyMatch
source-filter-findFirst
Figure 2.4: Most popular sequential stream pipelines [12]. The bar width
represents how many of the 610 analyzed projects included at least one
instance of the pipeline. This figure is adapted from the report "How do
developers use the Java Stream API?" [12].
the real input data used in the projects, which could be much larger. It is
common that real input data is not uploaded to GitHub, especially when it is
large. Furthermore, input data for unit tests tends to be relatively small,
since it is often manually constructed. However, their other findings are likely
to be more representative of stream usage, since they are not as dependent on
input data.
2.2.10 Summary
The most important findings regarding common stream usage are listed in
Table 2.5.
public interface Spliterator<T> {
    boolean tryAdvance(Consumer<? super T> action);
    void forEachRemaining(Consumer<? super T> action);
    Spliterator<T> trySplit();
    long estimateSize();
    long getExactSizeIfKnown();
    int characteristics();
    boolean hasCharacteristics(int characteristics);
    Comparator<? super T> getComparator();
}
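The interface above can be driven directly, which makes the division-of-work mechanism concrete (an illustrative sketch of ours; the list contents are arbitrary):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;

public class SpliteratorDemo {
    public static void main(String[] args) {
        List<String> names = new ArrayList<>(List.of("Ada", "Bo", "Cy", "Di"));
        Spliterator<String> sp = names.spliterator();
        System.out.println(sp.estimateSize());     // 4 elements remain
        // tryAdvance consumes one element and reports whether one existed
        sp.tryAdvance(s -> System.out.println(s)); // prints Ada
        // trySplit hands off roughly half of the remaining elements; this is
        // the mechanism parallel streams use to divide work between threads
        Spliterator<String> other = sp.trySplit();
        System.out.println(sp.estimateSize() + other.estimateSize()); // 3 total
    }
}
```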
OpenJDK [20]. JMH allows the user to specify a number of warmup iterations
before the proper iterations start, which makes sure that the JVM is properly
warmed up. During each proper iteration, the benchmarked function is run
many times, and an average runtime (and error) is calculated. Furthermore,
JMH starts a new VM (Virtual Machine) for each benchmarked function,
which ensures that optimizations which occurred during earlier benchmarks
do not interfere with the current benchmark [19].
// Summing an array imperatively
int[] nums = {4, 5, 6};
long sum = 0;
for (int i = 0; i < nums.length; i++) {
    sum += nums[i];
}

// Generated bytecode
// ...
17: lconst_0
18: lstore_2
19: iconst_0
20: istore 4
22: iload 4
24: aload_1
25: arraylength
26: if_icmpge 43
29: lload_2
30: aload_1
31: iload 4
33: iaload
34: i2l
35: ladd
36: lstore_2
37: iinc 4, 1
40: goto 22
43: return
Figure 2.6: Some imperative code and the bytecode generated from it.
call the sum method on that stream. Such method calls can result in abstraction
overhead performance costs.
In contrast, the for-loop version resulted in zero method calls, and instead
only used simple bytecode instructions such as ladd, which calculates the
sum of two long values on the stack, and goto, which is an unconditional
jump (in this example it used to makes the loop repeat). These types of
instructions generally result in less overhead than method calls, and are
therefore faster. Furthermore, stream operations which require the use of
lambda functions (such as map and filter) usually incur even more
overhead.
// Summing an array with streams
int[] nums = {4, 5, 6};
long sum = Arrays.stream(nums).sum();

// Generated bytecode
// ...
17: aload_1
18: invokestatic #2     // Creates IntStream
21: invokeinterface #3, 1  // Calls sum()
26: i2l
27: lstore_2
28: return
Figure 2.7: Some stream code and the bytecode generated from it.
2.6 Algorithms
In order to get a wider perspective on the performance of streams,
some popular algorithms are written with and without streams, and their
performances are compared. This specifically includes Dijkstra’s algorithm
and the FFT (Fast Fourier transform) algorithm.
2.6.1 FFT
The Fast Fourier transform is an algorithm that calculates the Discrete Fourier
Transform (DFT) of a sequence. In this project, an implementation of the
FFT algorithm from the book Algorithms, by Sedgewick and Wayne, is used
[24]. This Java implementation is available on the book’s complementary
GitHub page [25], and is available under the GNU General Public License.
According to the authors, the implementation is relatively bare-bones, and runs
in O(n log n) time.
The original non-stream implementation was written for clarity, not
necessarily performance. It calculates the FFT of the complex input data x, as
well as the inverse FFT, the circular convolution of x with itself, and the linear
convolution of x with itself, and prints all of these. The authors note that it
is not the most memory-efficient implementation, partly because it allocates
new subarrays instead of re-allocating in place, and because it uses object
types for representing complex numbers.
Chapter 3
Method
In this chapter, the method is described. Sections 3.1 - 3.5 detail the methods
which aim to answer the first research question and its subquestions, relating
to common pipelines and their imperative counterparts. Section 3.6 details
the method which aims to answer the second research question, relating
to algorithms implemented with and without streams. Finally, section 3.7
describes how the benchmarking tool JMH is used, and 3.8 details the
specifications of the computer used for benchmarking.
operations are used, and most of the pipelines are applied on arraylists.
Furthermore, the elements in these lists are primarily Person objects. The
Person class contains a string attribute called name and an integer attribute
called age. The input has been described in more detail in Section 4.3.
Finally, the pipelines contain between zero and two intermediate operations,
and almost all of them use sequential processing, as opposed to parallel
processing.
The pipelines which are benchmarked and compared to imperative Java
are divided into categories based on which subquestion of RQ 1 they aim to
answer. Each subquestion has its own section in this chapter, as listed below.
Ten pipelines are constructed to isolate the performance of the ten most
popular terminal operations. They are listed and explained in Table 3.2. Some
of the terminal operations are tested without an intermediate operation, i.e.,
in isolation, and some are tested with the intermediate operation filter,
depending on what is the most realistic† . For example, findFirst and
collect are tested with filter because that is more common than using
them in isolation [12] [11].
The lambda expression for anyMatch returns true if any person has an
age of -5. This is naturally never true, which means that all stream elements
∗ It would be possible to use mapToInt with collect by calling the boxed
operation, which converts primitive integers to their object representations;
however, this would defeat the purpose of using mapToInt instead of map.
† One likely reason why these operations are not commonly used in isolation
is that there already exist built-in functions in Java with similar behavior. count in
isolation simply returns the size of a list, which is equivalent to calling list.size().
Furthermore, findFirst in isolation returns the first element of a list, which is equivalent to
list.get(0). Finally, collect and toArray in isolation just convert a data structure
to another type of data structure, for which there also exist built-in functions in Java.
Only two parallel pipelines are benchmarked and compared with their
imperative counterparts. This is for two reasons. The main reason is that
parallel pipelines are incredibly rare, only making up 0.34% of all pipelines
found on GitHub [7]. Secondly, creating parallel imperative versions of
parallel pipelines takes much more effort than creating imperative version
of sequential pipelines. The tested pipelines are source-parallel-sum
and source-parallel-map(x -> x * x)-toArray. The former sums
all of the integers in the input array, and the latter squares all elements
in the input array.
Imperative parallel versions of streams are constructed using Java threads,
with each thread working on one part of the input array in parallel.
Additionally, imperative non-parallel versions of the tested pipelines are
also constructed and benchmarked, in order to give more insights into the
performance difference between parallel streams and sequential imperative
code.
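Such an imperative parallel sum might be sketched as follows (this is our illustration, not the thesis's actual benchmark code): each thread sums one contiguous slice of the array, and the partial sums are combined after join.

```java
import java.util.stream.IntStream;

public class ParallelSum {
    // Sums nums using nThreads threads, each handling one slice of the input
    static long parallelSum(int[] nums, int nThreads) throws InterruptedException {
        long[] partial = new long[nThreads];
        Thread[] threads = new Thread[nThreads];
        int chunk = (nums.length + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                int from = id * chunk;
                int to = Math.min(from + chunk, nums.length);
                long s = 0;
                for (int i = from; i < to; i++) s += nums[i];
                partial[id] = s; // each thread writes only its own slot
            });
            threads[t].start();
        }
        long total = 0;
        for (int t = 0; t < nThreads; t++) {
            threads[t].join(); // wait, then combine the partial sums
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] nums = IntStream.rangeClosed(1, 100).toArray();
        System.out.println(parallelSum(nums, 4)); // prints 5050
    }
}
```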
∗ Specifically, the eight terminal operations are anyMatch, allMatch, sum, count,
collect, toArray, reduce, and forEach. However, count is not tested with zero
intermediate operations, since that would be equivalent to simply calling list.size().
3.7 JMH
JMH, which is brought up in the background chapter in Section 2.4, is used
in order to benchmark all of the constructed pipelines and algorithms in this
project. The JMH settings are similar to those used in a previous study which
examined stream performance [5]. JMH is used together with Maven, which
is a popular build automation tool for Java. The reason why specifically
JMH is used is that it is the standard Java benchmarking tool [5], and it is
∗ A JMH object which ensures that computed values are not optimized away by the JVM.
Table 3.4: Specifications of the computer which the parallel benchmarks are
run on.
Chapter 4
Implementation
This chapter explains some aspects of how the benchmarks are implemented in
code. Section 4.1 brings up what type of for-loops are used to replace streams.
Section 4.2 shows how one of the pipelines is benchmarked. Section 4.3 gives
details on the input used for the pipelines. Finally, Section 4.4 explains how
two algorithms are adapted to use streams instead of for-loops.
// Stream benchmark
@Benchmark
public void sum_STR(Blackhole bh) {
    long sum = Arrays.stream(nums).sum();
    bh.consume(sum);
}

// Imperative benchmark
@Benchmark
public void sum_IMP(Blackhole bh) {
    long sum = 0;
    for (int i = 0; i < nums.length; i++) {
        sum += nums[i];
    }
    bh.consume(sum);
}
Figure 4.1: Code used to benchmark the pipeline source-sum and its
imperative counterpart.
// Imperative version
Complex[] even = new Complex[n/2];
for (int k = 0; k < n/2; k++) {
    even[k] = x[2*k];
}

// Stream version
Complex[] even = IntStream.range(0, x.length)
    .filter(i -> i % 2 == 0)
    .mapToObj(i -> x[i])
    .toArray(Complex[]::new);

Figure 4.2: An example of how a for-loop is replaced with a stream in the FFT
implementation.
Chapter 5
Results and Validity
In this chapter, all sections except the last one are dedicated to one research
question (or subquestion) each. The final section brings up the validity of the
data by showing the error rates.
Primarily, the runtime differences between stream pipelines (and algo-
rithms) which are implemented with and without streams are shown, for input
sizes between 1 and 1 000 000 elements. These differences are presented using
slowdown factors. The slowdown factor for a stream pipeline is its runtime
divided by the runtime of equivalent code without streams (i.e., imperative
Java). For example, if a stream pipeline has a slowdown factor of three, then
that means that it is three times slower than equivalent code written without
streams.
The reason why the results are shown using slowdown factors instead
of raw runtimes is that the primary interest in this project is how much
slower/faster streams are than imperative Java. This means that if a certain
operation is said to "perform worse" than another operation, then this does
not pertain to runtime, but to how much slower they are than their imperative
counterparts. Furthermore, smaller input sizes refers to input sizes between
1 and 1 000, and larger input sizes refers to input sizes between 10 000 and
1 000 000.
Figure 5.1: Slowdown factors for the worst, average, and best performing
pipelines (slowdown factor plotted against input size).
Almost all pipelines perform much worse for smaller input sizes than for
larger input sizes. For smaller input sizes, the pipelines are on average between
2.6 and 9.6 times slower than their imperative counterparts. Meanwhile, for
the larger input sizes, the pipelines are on average only 39% to 74% slower
than their imperative counterparts. It is important to note that the average
values are slightly skewed because of some outliers. Because of this, the
median slowdown factors are slightly lower than the average slowdown factors.
The performance range is rather large. The worst performing pipeline,
which is source-collect† for most input sizes, is much slower than
imperative Java for all input sizes. However, source-collect is an
outlier, and almost all other pipelines are less than four times slower than
their imperative counterparts for the larger input sizes. Furthermore, the best
performing pipeline is only slightly slower than its imperative counterpart
∗
There are multiple best performing pipelines, depending on which input size is being
examined. For example, at input size 100 the best performing pipeline is source-forEach,
with a slowdown factor of 1.17, but for input size 1 000 000 it is source-sum, with a
slowdown factor of 0.90.
†
This pipeline is brought up again, in Section 5.4.
than the smaller input sizes, and it is even slightly faster than its imperative
counterpart for the larger input sizes. The following sections look more
deeply into the performance of different types of pipelines.
Figure 5.2 shows the average slowdown factors for the pipelines which isolate
stateless and stateful intermediate operations. The median slowdown factors
closely align with the average for both stateless and stateful operations.
Figure 5.2: Average slowdown factors for the pipelines which isolate stateless
and stateful intermediate operations (slowdown factor plotted against input
size).
The general trend for all input sizes is that the stateful operations perform
slightly worse than the stateless operations. The biggest difference can be
seen for larger input sizes. At the largest input size, all of the intermediate
operations, except for distinct, are less than 30% slower than their
imperative counterparts.
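The distinction can be illustrated as follows: stateless operations such as filter and map can process each element independently, while stateful operations such as distinct and sorted must see other elements before they can emit results, which adds buffering overhead. The pipelines below are illustrative examples, not the benchmarked ones.

```java
import java.util.List;
import java.util.stream.Collectors;

public class IntermediateOps {
    // Stateless operations (filter, map) process each element independently.
    static List<Integer> stateless(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 == 0) // keep even values
                .map(v -> v * v)         // square them
                .collect(Collectors.toList());
    }

    // Stateful operations (distinct, sorted) must buffer elements
    // before they can emit a result.
    static List<Integer> stateful(List<Integer> values) {
        return values.stream()
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(stateless(List.of(1, 2, 3, 4))); // prints [4, 16]
        System.out.println(stateful(List.of(3, 1, 3, 2)));  // prints [1, 2, 3]
    }
}
```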
[Plot omitted: slowdown factor versus input size for the filter, flatMap, map, mapToInt, mapToLong, and mapToObj pipelines.]
Figure 5.3: Slowdown factors for the pipelines which isolate stateless
intermediate operations.
size reaches 100. Finally, skip and limit only manage to perform relatively
close to their imperative counterparts for the largest input size.
[Plot omitted: slowdown factor versus input size for the distinct, limit, skip, and sorted pipelines.]
Figure 5.4: Slowdown factors for the pipelines which isolate stateful
intermediate operations.
Figure 5.5 shows the average slowdown factors for the pipelines which isolate
terminal operations. Exact slowdown factors are shown in Appendix A. Just
like the intermediate operations, the terminal operations perform much better
with larger input sizes. However, the curve is steeper, with worse performance
at smaller input sizes, and slightly better performance at larger input sizes.
On average, terminal operations perform relatively well for larger input sizes,
being only 24% slower than their imperative counterparts.
Figure 5.6 shows the slowdown factors for the individual pipelines
which isolate terminal operations, divided into two graphs. Just as for the
intermediate operations, the terminal operations which perform the worst for
the smaller input sizes (count and sum) perform the best for the larger input
sizes. collect and forEach, the two most popular terminal operations,
Figure 5.5: Average and median slowdown factors for the pipelines which
isolate terminal operations.
perform quite well, and are less than 32% slower than their imperative counterparts for larger input sizes. forEach is sometimes slightly faster
than its imperative counterpart. The worst performing terminal operation is
reduce.
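As an illustration of the worst performing terminal operation, a reduce pipeline and the imperative accumulation it corresponds to can be sketched as follows (a minimal example, not the thesis's benchmark code):

```java
import java.util.List;

public class TerminalOps {
    // reduce: combines all elements with a binary operator,
    // starting from the identity value 0.
    static int streamSum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    // Imperative counterpart: accumulate in a local variable.
    static int loopSum(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(streamSum(List.of(1, 2, 3))); // prints 6
        System.out.println(loopSum(List.of(1, 2, 3)));   // prints 6
    }
}
```

The reduce version pays for a virtual call and boxing on every element, whereas the loop accumulates in a primitive local variable.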
Figure 5.7 shows how, on average, eight of the ten most popular terminal
operations perform with zero, one, or two intermediate operations. As
explained in the method chapter, the used intermediate operations are filter
and map. When the terminal operations are used with other intermediate
operations, the performance results could differ.
This figure seems to imply that adding one intermediate operation does
not change the performance, but that adding two intermediate operations
slightly decreases performance. However, the average values are somewhat
misleading, since different terminal operations show very different results,
with no overarching pattern. Graphs with slowdown factors for each individual terminal operation, used with zero, one, or two intermediate operations, are shown in Appendix B.

Figure 5.6: Slowdown factors for the pipelines which isolate terminal operations.

Figure 5.7: Average slowdown factors for pipelines with zero, one, and two intermediate operations.

The terminal operations can be put into four different groups, based on pipeline length performance.
• sum, count, and reduce perform best when used in isolation, or with
one intermediate operation.
• toArray and collect perform best when used with at least one
intermediate operation.
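The pipeline lengths discussed above can be illustrated with anyMatch, used with zero, one, and two intermediate operations (the predicates and values below are illustrative, not the benchmarked ones):

```java
import java.util.List;

public class PipelineLength {
    // Zero intermediate operations: terminal operation applied directly.
    static boolean zero(List<Integer> values) {
        return values.stream().anyMatch(v -> v > 10);
    }

    // One intermediate operation (filter).
    static boolean one(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 == 0)
                .anyMatch(v -> v > 10);
    }

    // Two intermediate operations (filter and map).
    static boolean two(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 == 0)
                .map(v -> v * 3)
                .anyMatch(v -> v > 10);
    }

    public static void main(String[] args) {
        System.out.println(zero(List.of(1, 12))); // prints true
        System.out.println(two(List.of(1, 4)));   // prints true
    }
}
```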
[Two plots omitted: runtime (ns) versus input size.]
Figure 5.10 shows the slowdown factors for the FFT algorithm, i.e., the
runtime of the stream version of FFT divided by the runtime of the imperative
version of FFT. For the larger input sizes, the stream version of FFT is 2 to 3 times slower than its imperative counterpart.
[Figure 5.10 plot omitted: slowdown factor versus input size for the stream version of FFT.]
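The stream version of FFT is not reproduced here, but the kind of translation it relies on, replacing an indexed for-loop with IntStream.range, can be sketched with a simple element-wise combine step. The arrays and the operation below are illustrative assumptions, not the actual FFT butterfly.

```java
import java.util.stream.IntStream;

public class IndexedLoops {
    // Imperative: element-wise combine with an indexed for-loop.
    static double[] combineLoop(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int k = 0; k < a.length; k++) {
            out[k] = a[k] + b[k];
        }
        return out;
    }

    // Stream style: the same combine expressed with IntStream.range,
    // which incurs a lambda invocation per index.
    static double[] combineStream(double[] a, double[] b) {
        return IntStream.range(0, a.length)
                .mapToDouble(k -> a[k] + b[k])
                .toArray();
    }

    public static void main(String[] args) {
        double[] r = combineStream(new double[]{1, 2}, new double[]{3, 4});
        System.out.println(java.util.Arrays.toString(r)); // prints [4.0, 6.0]
    }
}
```

In a recursive FFT, this per-index translation is applied at every level of the recursion, so the stream creation overhead is paid many times, often on small sub-arrays.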
Figure 5.11 shows the slowdown factor for Dijkstra's algorithm. The results differ from those of the FFT algorithm: the stream version of Dijkstra's algorithm generally performs much better at all input sizes than the stream version of the FFT algorithm. For all input sizes except the smallest one, the stream version of Dijkstra's algorithm performs almost identically to its imperative counterpart, i.e., no slowdown is seen for these input sizes, whereas the FFT algorithm is 2 to 6 times slower than its imperative counterpart for these input sizes.
5.7 Validity
JMH gives a 99.9% confidence interval for each benchmarked pipeline and input size, assuming a normal distribution. The error is half this interval. Dividing the runtime error (half the confidence interval) by the runtime gives an error in terms of percentages. These percentage errors are shown in Table 5.1, for different categories of pipelines and input sizes.
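The error computation can be expressed as a one-line helper (a sketch of the calculation described above, not code from the benchmark harness):

```java
public class ErrorRate {
    // Percentage error: half the 99.9% confidence interval divided by the
    // measured runtime, expressed as a percentage.
    static double percentError(double halfInterval, double runtime) {
        return halfInterval / runtime * 100.0;
    }

    public static void main(String[] args) {
        // A pipeline with a runtime of 100 ns and a half-interval of 2 ns
        // has a 2% error rate.
        System.out.println(percentError(2.0, 100.0));
    }
}
```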
[Figure 5.11 plot omitted: slowdown factor versus input size for the stream version of Dijkstra's algorithm.]
Table 5.1: Average runtime error rates of the tested pipelines. The error is half
the 99.9% confidence interval divided by the runtime.
In general, the error is higher for larger input sizes. This is presumably explained by the fact that larger input sizes result in more memory management, which can trigger the garbage collector relatively sporadically, which in turn gives less consistent results. Furthermore, pipelines with terminal operations which return arrays or collections (toArray and collect) have higher error rates than operations which return scalar values, which can also be explained by higher rates of required memory management.
The average error for the different categories of pipelines is between 1.31%
and 2.72% (the median error rates are slightly lower), and the pipelines with
one intermediate operation have the highest average error (2.72%). The
individual pipeline with the highest error rate is at 25.43%, but such high error rates are only reached in a few cases, and only for input sizes of 100 000 elements or more. This error rate occurred for the pipeline source-flatMap-map-collect at input size n = 100 000, and it is presumably this high due to the usage of flatMap, which requires a lot of memory management and garbage collection. If the benchmarks are run on a computer with more memory, or generally better hardware, then the error rates may be lower, since garbage collection may not be triggered as often.
The FFT benchmarks have an average error rate of 2.51%, and the benchmarks for Dijkstra's algorithm also have an average error rate of 2.51%.
Discussion | 53
Chapter 6
Discussion
This chapter discusses the results of this project, and gives general insights and
thoughts on stream performance and how it can be benchmarked correctly.
The largest input size gives relatively similar slowdown factors as the input sizes just below it, indicating that increasing the input size beyond 1 000 000 gives diminishing returns in terms of stream performance.
6.5 Algorithms
The stream version of the FFT algorithm performs worse than the individual
stream pipelines. For the larger input sizes, the stream version of FFT is 2 to
3 times slower than its imperative counterpart, whereas the average pipeline is
only around 1.5 to 1.7 times slower than its imperative counterpart. In contrast
to this, the stream version of Dijkstra’s algorithm performs very well, and is
only slightly slower than its imperative counterpart for almost all input sizes.
This can partly be explained by the fact that many of the calculations executed during Dijkstra's algorithm occur inside a priority queue, even though the functions which act on this queue are called using streams. Thus, a large part of the work is not done directly by streams; it instead happens inside certain data structures.
Furthermore, the stream version of Dijkstra's algorithm uses fewer, but larger, streams than the stream version of the FFT algorithm. Streams in general perform better with larger input sizes, which further explains why Dijkstra's algorithm performed better than the FFT algorithm when using streams. Another
explanation for why the stream version of Dijkstra’s algorithm performs
relatively well is that it frequently uses the forEach operation, which is
among the best performing terminal operations.
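A sketch of how a stream-based relaxation step in Dijkstra's algorithm might look is shown below. The graph representation (adjacency lists of {target, weight} pairs) is an illustrative assumption, not the thesis's implementation; note how the priority queue, rather than the stream itself, carries most of the work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class DijkstraSketch {
    // Computes shortest distances from source. Each edge is an
    // int[]{target, weight}; this representation is an assumption.
    static int[] shortestDistances(List<List<int[]>> adj, int source) {
        int n = adj.size();
        int[] dist = new int[n];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[source] = 0;
        PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt(e -> e[1]));
        pq.add(new int[]{source, 0});
        while (!pq.isEmpty()) {
            int[] cur = pq.poll();
            int u = cur[0];
            if (cur[1] > dist[u]) continue; // skip stale queue entries
            // Neighbor loop written as a stream with forEach; the heavy
            // lifting (ordering, polling) still happens in the queue.
            adj.get(u).stream()
                    .filter(e -> dist[u] + e[1] < dist[e[0]])
                    .forEach(e -> {
                        dist[e[0]] = dist[u] + e[1];
                        pq.add(new int[]{e[0], dist[e[0]]});
                    });
        }
        return dist;
    }

    public static void main(String[] args) {
        List<List<int[]>> adj = new ArrayList<>();
        for (int i = 0; i < 3; i++) adj.add(new ArrayList<>());
        adj.get(0).add(new int[]{1, 1});
        adj.get(1).add(new int[]{2, 2});
        adj.get(0).add(new int[]{2, 5});
        System.out.println(Arrays.toString(shortestDistances(adj, 0))); // prints [0, 1, 3]
    }
}
```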
Some stream pipelines are also hard to rewrite as an imperative for-loop, for example some of the more sophisticated use cases of flatMap. Ultimately, since the performance loss of using streams is relatively minimal for larger input sizes, especially for the most popular stream operations, developers might find that using streams is still worthwhile.
∗ Biboudis et al. used Java version 8, whereas this project uses Java version 11.
Conclusions and Future work | 59
Chapter 7
Conclusions and Future work
In this chapter, we present our conclusions and answer the specified research
questions. We also explain the limitations of this project, and what future
works can examine. Finally, we reflect on this project and what the results
mean for software developers.
7.1 Conclusions
In this project, the runtime performance of Java streams is compared to the runtime performance of equivalent Java code written without the usage of streams, referred to as imperative Java. The tested stream pipelines are constructed in a way that represents how streams are realistically used in GitHub projects [7]. The streams are mimicked with imperative code using for-loops and if-statements. Since there are different types of streams and stream operations, the answer to how they compare to imperative Java in terms of runtime performance depends on several factors, such as input size, operation type, and pipeline length. Listed below are the research questions which address this, along with the conclusions found in this project.
however, no tested pipeline is more than one order of magnitude slower than its imperative counterpart. In a few cases, (sequential) streams manage to slightly outperform imperative Java. Streams likely perform worse at smaller input sizes because of the high overhead cost of creating streams, which is the dominating performance factor at those sizes.
Previous research has found that most stream pipelines are only slightly
slower than their imperative counterparts [7]. This project shows that while
this is true for most streams at larger input sizes, it is clearly not true when
streams are used with smaller input sizes. Thus, if the performance of a
program is of importance, then it is recommended that streams should only
be used with large input sizes. If small input sizes are used, then it is
instead recommended to use imperative alternatives, such as for-loops. To our
knowledge, this project is the first to look into the correlation between input
size and stream performance.
RQ 1.1 How do intermediate operations perform, and how does their type
impact their performance?
When intermediate operations are paired with the most popular terminal
operation collect (or in some cases toArray), they are on average
between 12% and 65% slower than their imperative counterparts, for larger
input sizes. The most popular intermediate operations, filter and map, perform relatively well.
The pipelines which aim to isolate the performance of terminal operations are
on average between 14% to 23% slower than their imperative counterparts,
for the larger input sizes. Just like the intermediate operations, they perform
much worse at smaller input sizes. The two most popular terminal operations,
collect and forEach, perform relatively well.
In this project, eight of the most popular terminal operations are tested with
zero, one and two intermediate operations (map and filter). Pipeline
length does affect performance, however there is no clear pattern. Some
terminal operations, for example anyMatch, perform best when used in
isolation. Other operations, such as collect, perform best when used
in conjunction with at least one intermediate operation. We conclude that
they are commonly used in public GitHub projects. However, streams can
be used in many different ways, and in order to give a broader view of stream
performance, more benchmarking is required. This would not necessarily give
insights into how streams realistically perform, but it would show the potential
of stream performance, when they are used in unconventional ways. If future
studies find that streams can perform better under different circumstances, then
this could potentially influence how developers choose to use them. Listed
below are some examples of how streams can be benchmarked in more ways.
• More types of stream sources can be used (such as sets), and more
types of operations which create stream objects can be used (specifically
concat).
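For example, a benchmark for set sources and Stream.concat could be built around a pipeline like the following (the helper and values are illustrative assumptions):

```java
import java.util.Set;
import java.util.stream.Stream;

public class OtherSources {
    // A pipeline over two set sources joined with Stream.concat.
    // Note that concat does not remove duplicates across the two inputs.
    static long concatSize(Set<Integer> a, Set<Integer> b) {
        return Stream.concat(a.stream(), b.stream()).count();
    }

    public static void main(String[] args) {
        System.out.println(concatSize(Set.of(1, 2), Set.of(2, 3))); // prints 4
    }
}
```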
As mentioned, Rosales et al. found that input sizes for streams are
relatively small. However, this might not represent real stream usage, since
they only queried source code executed by unit tests, and because they may
not have had access to the real input data. Further studies could analyze more
deeply how large stream data sources commonly are, for example by analyzing
projects with real input data. This is of interest, since input size clearly has a
large impact on performance. If the findings of Rosales et al. are confirmed,
then this would indicate that streams, as they are commonly used, are much
slower than imperative Java.
As mentioned, the benchmarked streams in this project are supposed to represent real stream usage. This is why many of the tested pipelines use the most popular terminal operation collect. As detailed in
Section 2.3.1, using this operation often results in more garbage collection,
which in turn can make the benchmarking results slightly less consistent.
In future studies, terminal operations which return scalar values instead of
collections can be used, in order to reduce the amount of needed garbage
collection. This would result in less realistic usage of streams, however it
could show different performance results, since the error rates likely would be
lower.
Furthermore, stream performance can be examined when streams are used in real-world projects. This is more complicated than testing individual stream pipelines, because of how hard it is to accurately benchmark Java programs; however, if done correctly, it could give more insights into realistic
stream performance. Relating to this, the effect streams have on performance
when they are used in algorithms can be explored more extensively, since the
findings in this project suggest that different algorithms are affected differently
by the usage of streams.
Multiple studies have attempted (and succeeded) to improve stream
performance [29] [30] [22] [8] [31] [5]. It would be interesting to examine how
their improved streams compare to traditional imperative loops. Since many
stream pipelines already are close to being as fast as imperative loops, it is
likely that they could outcompete them with some improvements, for example
by reducing the number of required virtual calls per stream element [5].
Adjacent topics which do not relate to the performance of streams can
also be explored more deeply. Many advocates of functional programming
claim that functional programming results in fewer bugs, higher readability,
and more easily maintainable code (there is only limited research on the topic
[35] [36]). Whether or not this is true for Java streams could be interesting
to explore in future works. This is especially interesting since Java is known primarily as an object-oriented, imperative language.
7.3 Reflections
The results and conclusions of this thesis are presented with enough detail to be
useful for Java developers who need specific information on different aspects
of stream performance. For example, this thesis provides specific performance
information on the 20 most popular stream operations, for several input sizes.
This can help developers make more informed choices with regard to when they should use streams, and how much performance will be impacted.
Thus, the conclusions of this study are of interest to developers who write
programs where performance is important, especially if they are interested
in functional programming and/or Java. Developers may find that the perks
of using streams compensate for the potential loss of performance, especially
since the performance loss in many cases is relatively small. It ultimately
depends on whether or not the developer believes that the trade-off is worth it.
However, if performance is of importance, then we recommend that developers avoid using streams with small input sizes, and that they abstain from using certain pipelines, such as source-collect, which surprisingly is the third most popular pipeline [12]. Furthermore, at least some types of pipelines which use flatMap should also be avoided. In these cases, we recommend developers to instead use imperative alternatives, such as for-loops. Finally, if performance is the highest priority, then we recommend developers to not use (sequential) streams at all, since they only rarely result in performance gains.
These findings have environmental, societal, and economic impacts. Firstly, our findings can help developers create better performing Java programs, by showing them when streams should and should not be used in order to optimize for performance. Better performing programs require fewer CPU cycles, which in turn leads to less power being used, ultimately reducing the environmental impact of the programs. Thus, our findings can help guide developers toward more mindful decisions in relation to sustainable development.
As described in Section 6.6, stream usage has many possible perks, such
as more maintainable code, fewer bugs, and simpler parallelization (there
is only limited research on this topic [35] [36]). Since our findings show
that stream usage often only results in slightly slower programs, this might
encourage developers to use streams more frequently, thus possibly gaining
the aforementioned perks of functional programming. This could result in less stress, less time needed to complete software projects, and in general better workplace environments, which are important societal and ethical aspects to take into consideration. Furthermore, this could have economic benefits, as it is cheaper to maintain more stable code bases, since they require fewer work hours.
Java has been one of the most used programming languages for the last
few decades, and it is still widely popular [1]. Because of this, the results of
this project are likely to remain relevant for the foreseeable future. In relation
to this, projects that assess the performance of programming languages are
important, because they assist language developers in prioritizing their efforts
toward relevant aspects of the language.
For example, if Oracle (the developers of the Java language) chose to focus
their efforts on improving the performance of streams because of a project
such as this one, then this could have huge implications: even a tiny performance improvement could save copious amounts of clock cycles, since streams are used relatively frequently [7].
In summary, our findings can help developers create better performing,
and possibly more maintainable, programs. This can result in better work
environments and more sustainable products, which require less capital to
maintain.
References | 67
References
[15] D. Lea (with the help of B. Goetz, P. Sandoz, A. Shipilev, H. Kabutz, and J. Bowbeer). When to use parallel streams. [Online]. Available: https://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html [Page 16.]
[34] Link to input data used for Dijkstra's algorithm. [Online]. Available: https://algs4.cs.princeton.edu/44sp/ [Page 31.]
Appendix A
Slowdown factors
In this appendix, the exact slowdown factors are shown for the pipelines which
aim to isolate the performance of intermediate and terminal operations.
Appendix B
Pipeline length graphs
In this appendix, the slowdown factors for eight of the ten most popular terminal operations are shown, when they are used in a pipeline with zero, one, and two intermediate operations.
Figure B.1: Slowdown factors for the terminal operation allMatch with
zero, one and two intermediate operations.
74 | Appendix B: Pipeline length graphs
Figure B.2: Slowdown factors for the terminal operation anyMatch with
zero, one and two intermediate operations.
Figure B.3: Slowdown factors for the terminal operation sum with zero, one
and two intermediate operations.
Figure B.4: Slowdown factors for the terminal operation forEach with zero,
one and two intermediate operations.
Figure B.5: Slowdown factors for the terminal operation reduce with zero,
one and two intermediate operations.
Figure B.6: Slowdown factors for the terminal operation count with one and
two intermediate operations. It is not tested with zero intermediate operations,
for reasons explained in the method chapter, in Section 3.4.
Figure B.7: Slowdown factors for the terminal operation collect with zero,
one and two intermediate operations.
Figure B.8: Slowdown factors for the terminal operation toArray with zero,
one and two intermediate operations.
TRITA-EECS-EX-2023:495
www.kth.se