
Degree Project in Computer Science and Engineering

Second cycle, 30 credits

A Performance Comparison of Java


Streams and Imperative Loops
MAGNUS ÅKERFELDT

Stockholm, Sweden, 2023



Degree Programme in Computer Science and Engineering


Date: June 28, 2023

Supervisor: Per Austrin


Examiner: Johan Håstad
School of Electrical Engineering and Computer Science
Swedish title: En prestandajämförelse av Java streams och imperativa loopar
© 2023 Magnus Åkerfeldt

Abstract
The Stream API was added in Java 8. With the help of lambda expres-
sions (anonymous functions), streams enable functional-style operations on
sequences of elements. In this project, we evaluate how streams perform
in comparison to imperative loops in terms of execution time, from the
perspective of how streams are commonly used in public GitHub repositories.
Additionally, two algorithms are implemented with and without streams, to
assess the impact of stream usage on algorithmic performance. Parallel
streams are only examined briefly due to their infrequent usage.
We find that sequential streams are in general slower than imperative loops.
However, stream performance depends heavily on how many elements are being
processed, which is referred to as input size. For input sizes smaller than
100, most stream pipelines are several times slower than imperative loops.
Meanwhile, for input sizes between 10 000 and 1 000 000, streams are on
average only 39% to 74% slower than loops, and in some cases, they even
slightly outperform them.
Additionally, we observe that using streams when implementing algo-
rithms in some cases leads to much slower execution times, while in other
cases, it barely affects the execution time at all. We conclude that stream
performance primarily depends on input size, presumably because of the
high abstraction overhead of creating streams, but it also depends on other
factors, such as operation type and pipeline length.

Keywords
Java streams, performance, benchmarking, loops, functional programming

Sammanfattning (Swedish abstract)
Streams were introduced with Java 8. With the help of lambda expressions
(anonymous functions), streams enable the use of functional operations on
sequences of elements. In this project, we measure how streams perform in
comparison with imperative loops with respect to execution time, from the
perspective of how streams are commonly used in public GitHub projects.
Parallel streams are only examined to a limited extent, due to how rarely
they are used.
The results show that streams are overall slower than imperative loops.
However, the difference in performance depends strongly on the input size,
that is, the number of elements the stream processes. For input sizes smaller
than 100 elements, streams are often several times slower than their imperative
counterparts. Meanwhile, streams are on average only 39% to 74% slower than
their imperative counterparts for input sizes between 10 000 and 1 000 000
elements, and in some cases they are even slightly faster than imperative code.
Furthermore, we observe that using streams when implementing algorithms
in some cases leads to much longer execution times, while in other cases it
barely affects the execution time at all. We conclude that the performance
of streams depends primarily on input size, but also on other factors, such
as operation type and how many operations are used in a pipeline.

Nyckelord (Keywords)
Java streams, prestanda, loopar, funktionell programmering

Contents

1 Introduction
  1.1 Background
    1.1.1 Java Streams
    1.1.2 Performance of streams
  1.2 Problem
    1.2.1 Research questions
  1.3 Purpose
  1.4 Goals
  1.5 Research Methodology
  1.6 Delimitations

2 Background
  2.1 Streams
    2.1.1 Input, type & output
    2.1.2 Intermediate & terminal operations
    2.1.3 Operations
    2.1.4 Laziness
    2.1.5 Parallelization
    2.1.6 Encounter order
  2.2 Common stream usage
    2.2.1 Research on stream usage
    2.2.2 Common operations
    2.2.3 Common pipelines
    2.2.4 Input, type & output
    2.2.5 Pipeline length
    2.2.6 Input size
    2.2.7 Parallelization is rare
    2.2.8 Encounter order
    2.2.9 Lambda expressions
    2.2.10 Summary
  2.3 Low level Java
    2.3.1 Garbage collection
  2.4 Benchmarking Java
  2.5 Low level streams
    2.5.1 Stream speed
    2.5.2 Stream bytecode
  2.6 Algorithms
    2.6.1 FFT
    2.6.2 Dijkstra’s algorithm
  2.7 Related Work
    2.7.1 Streams vs imperative Java
    2.7.2 Performance of streams

3 Method
  3.1 RQ 1: Average stream performance
  3.2 RQ 1.1: Intermediate operations
  3.3 RQ 1.2: Terminal operations
  3.4 RQ 1.3: Pipeline length
  3.5 RQ 1.4: Parallel pipelines
  3.6 RQ 2: Algorithms with streams
  3.7 JMH
  3.8 System specifications

4 Implementation
  4.1 Replacing streams with loops
  4.2 A sample benchmark
  4.3 Input for streams
  4.4 Algorithms with streams

5 Results and Validity
  5.1 RQ 1: Average stream performance
  5.2 RQ 1.1: Intermediate operations
    5.2.1 Stateless operations
    5.2.2 Stateful operations
  5.3 RQ 1.2: Terminal operations
  5.4 RQ 1.3: Pipeline length
  5.5 RQ 1.4: Parallel pipelines
  5.6 RQ 2: Algorithms with streams
  5.7 Validity

6 Discussion
  6.1 General results
  6.2 Smaller input sizes
  6.3 Garbage collection
  6.4 Errors & Larger input sizes
  6.5 Algorithms
  6.6 Performance trade-off
  6.7 Previous Research

7 Conclusions and Future work
  7.1 Conclusions
  7.2 Limitations and future work
  7.3 Reflections

References

A Slowdown factors

B Pipeline length graphs



List of Acronyms

FFT Fast Fourier transform.

JIT Just-In-Time (compilation).

JMH Java Microbenchmark Harness.

JVM Java Virtual Machine.



Glossary

Benchmarking The act of measuring the average execution time (runtime) of
an operation.
Blackhole A JMH object which can ensure that certain values are calculated
by the JVM, which can give more realistic benchmark results.
Garbage collector Automatic memory manager which reclaims allocated
memory that can no longer be accessed.
Imperative counterpart The imperative counterpart of a stream pipeline is
non-stream code which calculates the exact same result as the stream
pipeline.
Imperative Java Java code without the usage of streams.
Input size In the context of this thesis, if a stream is created from a data source
which contains x elements, then the stream’s input size is x.
Intermediate operation(s) The middle operation(s) in stream pipelines.
They transform streams into new stream objects.
Lambda expression Anonymous function written in the form (parameter ->
expression).
Larger input sizes Input sizes between 10 000 and 1 000 000 (in the context
of this thesis).
Parallel stream A stream which executes in parallel.
Pipeline length The number of intermediate operations in a pipeline, plus the
terminal operation.
Pipeline A stream pipeline consists of a data source which is converted to a
stream, followed by zero or more intermediate operations, and finally
one terminal operation, e.g., source-map-collect.

Runtime The execution time of a certain operation.

Sequential stream A non-parallel stream.

Slowdown factor In the context of this thesis, the slowdown factor of a stream
pipeline is the runtime of that pipeline divided by the runtime of its
imperative counterpart.

Smaller input sizes Input sizes between 1 and 1 000 (in the context of this
thesis).

Stateful operation An intermediate operation which takes into account
previous elements during processing.

Stateless operation An intermediate operation which processes each element
independently.

Stream source The data source from which a stream is created, e.g., an
arraylist.

Terminal operation The final operation in a stream pipeline, which triggers
its execution. Usually returns a list or a value.

Chapter 1

Introduction

This chapter gives an introduction to this thesis.

1.1 Background
As of 2022, Java was the third most popular programming language on GitHub
[1]. Almost 30 years after it was initially released, it is still widely used,
and it is constantly being updated. Java is heavily object oriented, has high
portability, and is strongly typed. Furthermore, it is primarily an imperative
language, since Java programs often are written as a sequence of instructions
that specify how certain tasks are to be performed in a very precise manner,
step by step.
In contrast to imperative languages, functional/declarative programming
languages describe the desired state or output of a program, instead of
specifying the exact steps needed to reach that state. The programmer specifies
what needs to be solved, without having to specify exactly how that can
be achieved. These languages offer high-level functions, which can replace
control flow statements such as if-else and loops.
Functional programming has recently become more popular, as seen in the
popularity of languages such as Haskell, Scala, and Clojure [2]. Perhaps this
is why some functional programming elements were added to Java in 2014,
with the introduction of streams and lambda expressions, which are often
used together.
Lambda expressions are similar to anonymous functions, and are written
in the form (parameter -> expression). For example, the lambda
expression (x -> x + 2) takes an argument x, and returns the value
x + 2.

// Imperative version
int sumOfEvenSquares(int[] v) {
    int result = 0;
    for (int i = 0; i < v.length; i++) {
        if (v[i] % 2 == 0) {
            result += v[i] * v[i];
        }
    }
    return result;
}

// Stream version
int sumOfEvenSquares(int[] v) {
    return IntStream.of(v)
            .filter(x -> x % 2 == 0)
            .map(x -> x * x)
            .sum();
}

Figure 1.1: The same function implemented without and with Java streams.

1.1.1 Java Streams


A stream does not store any elements. Instead, it conveys elements from
a source, such as an array or a list, through a pipeline of computational
operations [3]. Figure 1.1 shows how a simple function can be implemented
with and without streams. The code which is written without streams is
referred to as imperative Java. The function takes an array of integers
as input, and calculates the sum of all of the even elements squared.
In the stream version, a stream of integers is created from the array
with IntStream.of(v), and then three stream operations are applied:
filter, map and sum. These operations are explained below.

• filter takes a lambda expression which should return true for all
of the elements which should remain in the stream. In this case, the
lambda expression (x -> x % 2 == 0) returns true for all even
elements, which means that all odd elements are removed.

• map applies a lambda expression to every element in the stream. In this


case, the lambda expression (x -> x * x) squares each element.

• sum sums all of the elements in the stream.



Because these operations are stateless (a term which is further explained in the
next chapter), the sequence of elements will only be traversed a single time,
even though three operations are specified. This is a key feature of streams
which increases performance.

1.1.2 Performance of streams


Because of the introduction of streams it is now possible to program Java
functionally. However, it is not clear how this impacts Java programs. A
common notion is that functional programming results in more maintainable
code with fewer bugs, but that it also results in slower programs. A previous
study found that the functional language Clojure often is multiple times slower
than Java [4]. Whether or not the usage of streams slows down Java programs
has to our knowledge only been explored by one previous study by Biboudis
et al. [5]. Schiavio et al. specifically point out the lack of stream benchmarks
[6].
Biboudis et al. only examined a few stream operations, and they did not
use streams in a manner that is representative of how streams are commonly
used [7]. For example, they primarily applied streams on arrays, whereas
lists are more commonly used, and these arrays primarily contained long
integers, although it is more common for them to contain objects [7]. Thus,
what remains unexplored is how streams perform in comparison to imperative
Java from the perspective of how streams are realistically used.
However, the answer to this question likely depends on multiple factors.
It is probable that the biggest factor which affects stream performance is
the size of the lists and arrays which streams are applied on, i.e., the number
of elements they contain, which is referred to as input size in this thesis. To
our knowledge, no previous study has examined the link between input size
and stream performance.
The reason for why input size is likely to affect stream performance is
that there is a large overhead runtime cost of creating streams [7] [8]. The
corresponding cost of creating a simple for-loop is generally much lower.
Thus, if a stream is used with small input sizes, then the overhead cost of
creating the stream is likely to be the dominating performance factor. This
cost might not be as noticeable for bigger input sizes. Consequently, streams
are likely to perform considerably worse than imperative Java for small input
sizes, but they may perform relatively similarly to imperative Java for larger
input sizes. One of the goals of this project is to explore this.

1.2 Problem
The main objective of this project is to compare the runtime performance
of programs which use Java streams with programs which do not use Java
streams, and analyze how this performance difference depends on certain
factors. Specifically, the performance of Java streams as they are commonly
used in GitHub projects is examined.
Since streams can be used in many different ways, it is likely that stream
performance depends on several factors, such as input size, operation type,
and pipeline length. Thus, the main research question has been divided into
several subquestions, in order for this project to give a broader view on stream
performance, and because this gives the thesis a clearer structure.

1.2.1 Research questions


The research questions are as follows.
RQ 1 Considering how Java streams are commonly used, how do stream
pipelines perform in comparison to imperative loops in terms of runtime,
and how does the performance vary with input size?
RQ 1.1 How do intermediate operations perform, and how does their type
impact their performance?
RQ 1.2 How do terminal operations perform?
RQ 1.3 How does pipeline length affect performance?
RQ 1.4 How do parallel pipelines perform?
RQ 2 How does the utilization of streams impact the performance of popular
algorithms?
For all of the listed research questions and subquestions, specifically the
runtime difference for streams and equivalent imperative code is examined,
for varying input sizes. This means that the runtimes themselves are not of
interest. Instead, it is the runtime ratio between streams and their imperative
counterparts that is examined, which is referred to as the slowdown factor in
the Results and Validity chapter. In regards to RQ 1.1, intermediate operations
can have the type stateful and stateless, as explained in the background chapter.
Furthermore, pipeline length (RQ 1.3) refers to the number of operations in
the pipeline. Finally, RQ 1 is the primary research question, and thus only two
algorithms are examined in regards to RQ 2, specifically Dijkstra’s algorithm
and the Fast Fourier Transform (FFT) algorithm.
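To make the slowdown factor definition concrete, here is a minimal sketch; the class and method names, as well as the runtime values, are illustrative only and do not come from the thesis:

```java
public class SlowdownFactorExample {
    // Slowdown factor = runtime of the stream pipeline divided by the
    // runtime of its imperative counterpart (both in the same time unit).
    static double slowdownFactor(double streamRuntimeNs, double imperativeRuntimeNs) {
        return streamRuntimeNs / imperativeRuntimeNs;
    }

    public static void main(String[] args) {
        // A pipeline taking 150 ns where its imperative counterpart takes
        // 100 ns has a slowdown factor of 1.5, i.e., it is 50% slower.
        System.out.println(slowdownFactor(150.0, 100.0)); // prints 1.5
    }
}
```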

1.3 Purpose
As mentioned, the performance difference between imperative Java and Java
streams has only been examined by one previous study, in a limited and non-
representative manner, and previous researchers have pointed out the lack
of thorough stream benchmarking [6]. This project could help elucidate the
question of how much performance is sacrificed through the usage of streams.
The results of this project could be useful for programmers who want to
get the perks of functional programming through the use of Java streams, but
are unsure if the performance trade-off is worth it. If developers choose to use
Java streams because of the results of this study, then this could possibly help
them construct more easily maintainable programs with fewer bugs, which
could cut down on work hours and reduce stress.
This thesis also aims to provide specific information on how the 20 most
popular stream operations perform for different input sizes. This can help
developers make more informed choices in regards to when they should use
streams, and how much the runtime performance will be impacted by this.

1.4 Goals
The main goal of this project is to find out how streams, as they are
commonly used, generally perform compared to imperative Java, and how
this performance difference depends on different factors, such as input size
and pipeline length.

1.5 Research Methodology


The research method used in this project primarily involves benchmarking
a large number of commonly used stream pipelines and comparing their
performance to that of their imperative counterparts. This method is chosen
because it allows us to find out how streams realistically perform, since we
are mimicking real stream usage. Furthermore, we benchmark many different
types of stream pipelines, with different input data sizes (and types), to get
a broader view on stream performance. Finally, we implement and benchmark
two popular algorithms, with and without the usage of streams.

1.6 Delimitations
Some stream operations which are rarely used are not benchmarked in this
project [3]. However, the majority of them are tested.
Large programs using streams are not benchmarked, since Java
benchmarks are more accurate for smaller programs, and because it is easier
to isolate stream performance with smaller programs. However, individual
pipelines with multiple stream operations are examined. Furthermore, some
popular algorithms are implemented (and benchmarked) with and without Java
streams.
This project does not involve benchmarking real-world software projects.
Instead, imperative code which mimics commonly used stream pipelines is
constructed [7] [9]. The reasoning for this is that real-world projects often have
their most relevant functions nested deep within several layers of abstraction,
which complicates benchmarking. Future studies could attempt to benchmark
real-world projects using streams, and then replace the streams with imperative
code, in order to compare the performance difference.
Parallel streams are only considered in a limited manner, since previous
studies show that parallel streams are very rarely used in GitHub projects [7]
[9].
This project focuses on runtime speed, since this is arguably the most
important performance metric. Previous studies which have benchmarked
Java and other languages which use the JVM have also looked into startup
time and compile time, as well as the size of the compiled Java bytecode file [10]
[4].

Chapter 2

Background

This chapter presents relevant background theory for this thesis: how streams
work and how they are commonly used, how Java (and Java streams) are
implemented at a low level and how this is relevant for benchmarking, and
previous related research.

2.1 Streams
In this section, important stream terms, operations, and categories are
explained.

2.1.1 Input, type & output


Stream operations can only be applied on stream objects. This means that in
order to apply stream operations on a data structure such as an array or a list, it
must first be converted into a stream. This is the first part of a stream pipeline.
For example, IntStream.of(v) converts the array v into a stream of
integers, called an IntStream. There are three other stream types:
LongStream, DoubleStream, and object-based streams. Streams can
also be created in other ways, e.g., from a list with list.stream(), or
from specific ranges of integers.
Streams usually produce some output. This can for example be a value, a
collection, or an array.
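The creation routes described above can be illustrated with a short sketch; the variable names and data are illustrative only:

```java
import java.util.List;
import java.util.stream.IntStream;

public class StreamSourceExamples {
    public static void main(String[] args) {
        // An IntStream created from an array of primitives.
        int[] v = {3, 4, 5};
        int sum = IntStream.of(v).sum(); // 12

        // An object-based Stream<String> created from a list.
        List<String> words = List.of("stream", "loop");
        long count = words.stream().count(); // 2

        // An IntStream created from a range of integers: 0, 1, ..., 4.
        int[] doubled = IntStream.range(0, 5).map(x -> x * 2).toArray();

        System.out.println(sum + " " + count + " " + doubled[4]);
    }
}
```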

2.1.2 Intermediate & terminal operations


Stream operations are divided into terminal and intermediate operations.
Such operations can only be used in stream pipelines, which must adhere
to a specific form which is explained in Figure 2.1. Each stream
pipeline consists of a source from which the stream is created, followed
by zero or more intermediate operations, and finally exactly one terminal
operation. The pipeline source-filter-map-sum has two intermediate
operations (filter and map) and one terminal operation (sum).

Figure 2.1: The structure of all stream pipelines.

Intermediate operations transform streams into new stream objects which


can be processed further, whereas terminal operations usually produce a result
such as a list or a value. It is important to understand that stream pipelines
must include a terminal operation in order for the stream to actually produce
something, since only a terminal operation can trigger the execution of a
stream.
Intermediate operations are further divided into stateless and stateful
operations [3]. Stateless operations, such as filter and map, process
each element independently from other elements in the stream. Stateful
operations, such as sorted, take into account previously seen elements when
processing new elements. Pipelines with exclusively stateless operations can
be processed very efficiently, with minimal buffering, whereas pipelines with
stateful operations generally are slower [3] [7].

Some stateful operations may need to process the entire input before a
result can be produced, which can impact performance when multiple
intermediate operations are used. An example of this is the sorted operation,
for which no results can be produced until all elements have been seen [3].
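The contrast between the two operation types can be sketched as follows; this is a minimal illustrative example, not one of the thesis's benchmarks:

```java
import java.util.stream.IntStream;

public class StatefulVsStateless {
    public static void main(String[] args) {
        // A pipeline with only stateless operations: each element flows
        // through filter and map independently, in a single traversal.
        int sumOfEvenSquares = IntStream.of(7, 3, 5, 6, 2)
                .filter(x -> x % 2 == 0)
                .map(x -> x * x)
                .sum(); // 6*6 + 2*2 = 40

        // Adding the stateful sorted() forces the stream to buffer all
        // elements before any element can be passed downstream.
        int smallest = IntStream.of(7, 3, 5, 6, 2)
                .sorted()
                .findFirst()
                .getAsInt(); // 2

        System.out.println(sumOfEvenSquares + " " + smallest);
    }
}
```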

2.1.3 Operations
Figure 2.2 and Figure 2.3 explain the ten most popular intermediate and
terminal operations, all of which are examined in this project. The
intermediate operations are divided into the categories stateful and stateless,
and the terminal operations are divided into categories based on their return
type.

2.1.4 Laziness
All intermediate operations are lazy, i.e., they are not executed until a
terminal operation is invoked. Because of this, the stream pipeline source-
filter-map-sum, which does not have any stateful operations, only
requires one traversal, even though three different operations are specified.
Laziness also allows for avoiding unnecessary processing when not all data
from a source is required. For example, the operation “find the first integer that
is larger than 50” only needs to examine integers until the first integer larger
than 50 is found. For these reasons, laziness greatly increases the performance
of streams. In contrast, almost all terminal operations are eager, which means
that they execute the traversal of the stream [3]. A stream pipeline can only be
traversed once, after which it is considered consumed.
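The short-circuiting behaviour described above can be observed directly. In this illustrative sketch (not from the thesis), peek is used only to count how many elements are actually pulled through the pipeline:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class LazinessExample {
    public static void main(String[] args) {
        AtomicInteger inspected = new AtomicInteger();

        // "Find the first integer larger than 50": because intermediate
        // operations are lazy and findFirst short-circuits, only elements
        // up to and including the first match are examined.
        int first = IntStream.of(10, 60, 70, 80, 90)
                .peek(x -> inspected.incrementAndGet())
                .filter(x -> x > 50)
                .findFirst()
                .getAsInt();

        // Only 2 elements (10 and 60) were inspected, not all 5.
        System.out.println(first + " after " + inspected.get() + " elements");
    }
}
```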

2.1.5 Parallelization
Efficiently parallelizing imperative code can often be a complex and time
consuming task, whereas simple parallelization is a key feature of streams. It is
possible to switch back and forth between parallel and sequential processing in
a single pipeline, using the operations parallel and sequential. This
could be useful when only some operations in a pipeline perform well with
parallel processing.
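A minimal sketch of this one-call switch between sequential and parallel execution (illustrative, not from the thesis):

```java
import java.util.stream.IntStream;

public class ParallelSwitch {
    public static void main(String[] args) {
        // A sequential pipeline can be made parallel with a single call.
        long sumSequential = IntStream.rangeClosed(1, 1000).sum();
        long sumParallel = IntStream.rangeClosed(1, 1000).parallel().sum();

        // Both variants compute the same result; only the execution
        // strategy differs.
        System.out.println(sumSequential == sumParallel); // prints true
    }
}
```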

2.1.6 Encounter order


Depending on the stream source and the intermediate operations, a stream may
have an encounter order, which means that the elements in the data source is
10 | Background

Intermediate operations
Type Name What does it do? Example
Removes the odd elements.
Removes the elements which
filter() do not satisfy the given [3, 4, 5, 6, 7].filter(x -> x % 2 == 0)
predicate. ↓
[4, 6]
Squares all elements.
Applies the given function [3, 4, 5, 6, 7].map(x -> x * x)
map()
to each element. ↓
Stateless

[9, 16, 25, 36, 49]


Returns an IntStream.
Turns the stream into an ["3", "4", "5", "6", "7"].
IntStream. Similar operations .mapToInt(
mapToInt() exist for Long, Double and num -> Integer.parseInt(num))
Object. ↓
[3, 4, 5, 6, 7]
Turns the stream of arrays into a
flattened stream of integers.
Flattens the data structure (for
flatMap() example, turns a 2D array into [[5, 3], [9, 4]].flatMap(Arrays::stream)
a 1D array). ↓
[5, 3, 9, 4]
Returns a stream with the first
two elements.
Returns a stream with
limit() [3, 4, 5, 6, 7].limit(2)
the first n elements.

[3, 4]
Discards the first two elements.

skip() Discards the first n elements. [3, 4, 5, 6, 7].skip(2)


Stateful


[5, 6, 7]
Sorts the stream.

sorted() Sorts the stream. [7, 3, 5, 6, 2].sorted()



[2, 3, 5, 6, 7]

Only keeps one instance of


each element.
Only keeps one instance of
distinct() [3, 5, 3, 1, 3, 2, 5].distinct()
each element.

[3, 5, 1, 2]

Figure 2.2: The intermediate operations which are


examined in this project. The examples have the
form [inputStream].operation(arguments) =>
[outputStream]. The filter example takes the stream [3, 4,
5, 6, 7] and produces the stream [4, 6].
Background | 11

Terminal operations
Return
type Name What does it do? Example
Returns true if any element is even.
Returns true if any
anyMatch() element satisfies the [3, 4, 5, 6].anyMatch(x -> x % 2 == 0)

given predicate.
boolean

true

Returns true if all elements are even.


Returns true if all
allMatch() elements satisfy the given [3, 4, 5, 6].allMatch(x -> x % 2 == 0)
predicate. ↓
false

Sums the elements.


[2, 3, 6].sum()
sum() Sums the elements.

integer

11
Counts the elements.

count() Counts the elements. [2, 3, 5].count()



3
Returns the first element

findFirst() Returns the first element. [5, 3, 2, 4].findFirst()


Optional


5
Returns some element Returns some element
(sometimes faster than [5, 3, 2, 4].findAny()
findAny()
findFirst for parallel ↓
processing) 2
Turns the stream into a List.
Collection/array

Turns the stream into a


collect() mutable container such [3, 4, 5, 6, 7].collect(Collectors.toList())

as a list or a set.
List[3, 4, 5, 6, 7]
Turns the stream into an array.
Turns the stream into an [3, 4, 5, 6, 7].toArray()
toArray()
array ↓
array[3, 4, 5, 6, 7]
Prints all elements
Performs an action on all [3, 4, 5, 6].forEach(System.out::println)
forEach() elements. ↓
3456
Other

Performs a reduction Sums the elements. Calculates [0+2], then


operation on the stream, [2+3], and finally [5+6]
reduce() with an initial value and a [2, 3, 6].reduce(0, Integer::sum)
binary operation, using ↓
accumulation. 11

Figure 2.3: The terminal operations which are examined in this project. The
examples have the form [inputStream].operation(arguments)
=> output. The sum example takes the stream [2, 3, 6] and produces
the value 11.
12 | Background

encountered in a defined order (such as arrays and lists, but not sets). Note that
encounter order is not the same as sort order. If functions such as sorted
are called on streams without order, then an order will be applied on the
resulting stream. Parallelization of ordered streams can sometimes result in
worse performance.
Certain aggregate operations, such as distinct, and operations which
are dependent on an order, such as limit, sometimes perform better without
an encounter order. Thus, if the ordering is not relevant, the performance of
stream pipelines can be improved if stateful and terminal operations are de-
ordered with unordered [3] [7].
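As an illustrative sketch of this, the following hypothetical example (class and method names are our own, not from the cited studies) applies unordered() before distinct(), so the stateful operation no longer has to preserve the encounter order:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class UnorderedExample {
    // Ordered pipeline: distinct() preserves the encounter order of the list.
    static List<Integer> distinctOrdered(List<Integer> input) {
        return input.stream().distinct().collect(Collectors.toList());
    }

    // De-ordered pipeline: unordered() tells distinct() that the encounter
    // order is irrelevant, which can improve parallel performance.
    static Set<Integer> distinctUnordered(List<Integer> input) {
        return input.parallelStream().unordered().distinct().collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(3, 1, 2, 3, 2, 1);
        System.out.println(distinctOrdered(input));   // [3, 1, 2]
        System.out.println(distinctUnordered(input)); // 1, 2 and 3 in some order
    }
}
```

Collecting the de-ordered result into a set (rather than a list) reflects that no particular order is promised.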

2.2 Common stream usage


This section provides a description of how developers commonly use streams
in GitHub projects. These findings influence how the benchmark programs
are constructed in this project, as they aim to reflect typical stream usage.

2.2.1 Research on stream usage


In the last few years, four studies have analyzed how developers use streams [7]
[11] [9] [12]. The most recent one, by Rosales et al. [7], was the largest one,
analyzing 1 808 415 projects on GitHub, of which 7% used the stream API,
totalling more than 620 million streams analyzed. Rather than relying on
manual inspection or static code analysis, they collected dynamic runtime
data for streams. They targeted projects using Maven, and dynamically
analyzed the code by downloading project dependencies, compiling the source
code, and running the available unit tests.
An important limitation of their project is that they only examined
source code executed by unit tests. However, the majority of the analyzed
code is not test code: because the analysis is performed at runtime, "real"
project code which is called by the unit tests is analyzed as well. The authors
point out that source code executed by unit tests may not be representative
of real usage scenarios, partly because many projects do not have complete
test coverage. However, they also note that previous works have found that
this type of analysis can "provide useful information, highlight patterns and
derive statistics" [13] [14]. Even so, the findings in regards to input size for
streams may not be representative, as explained in Section 2.2.6.

Intermediate operation Occurrences %


map 427 009 693 51.57
filter 227 938 406 27.53
mapToObj 77 898 391 9.41
flatMap 34 512 362 4.17
mapToInt 27 327 892 3.30
limit 12 697 146 1.53
mapToLong 6 264 703 0.76
sorted 5 275 399 0.64
skip 4 041 440 0.49
distinct 2 877 764 0.35

Table 2.1: Most common intermediate operations. Table adapted from the
report "Characterizing Java Streams in the Wild" [7].

2.2.2 Common operations


Table 2.1 shows the ten most popular intermediate operations used in the projects
analyzed by Rosales et al. [7]. map and filter are by far the most popular
intermediate operations, making up 79.1% of found occurrences. In general,
stateless operations are much more common than stateful operations. Table
2.2 similarly shows the ten most popular terminal operations, with collect
and forEach being the most commonly used ones. All of these 20 operations
are benchmarked in this project.

2.2.3 Common pipelines


To our knowledge, Nostas et al. are the only researchers who have looked into
which specific pipelines are the most popular [12]. They analyzed 610 Java
projects, and the five most popular pipelines are listed in Figure 2.4. However,
it is possible to identify other pipelines which are likely to be popular by
looking at common operation transitions. In Table 2.3, some of the most
common transitions are listed, according to Tanaka et al. [11]. They concluded
that map and filter are followed by a terminal operation in almost 50% of
cases, with collect being the most common terminal operation to follow
all evaluated intermediate operations.
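To make these pipeline shapes concrete, the most popular one, source-map-collect, might look as follows next to its imperative counterpart. This is a hypothetical example with our own names, not code from the analyzed projects:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class CommonPipelines {
    // source-map-collect: the most popular pipeline shape.
    static List<Integer> lengthsStream(List<String> words) {
        return words.stream()
                .map(String::length)
                .collect(Collectors.toList());
    }

    // The equivalent imperative loop.
    static List<Integer> lengthsLoop(List<String> words) {
        List<Integer> lengths = new ArrayList<>();
        for (String word : words) {
            lengths.add(word.length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        List<String> words = List.of("stream", "loop");
        System.out.println(lengthsStream(words)); // [6, 4]
        System.out.println(lengthsLoop(words));   // [6, 4]
    }
}
```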

Terminal operation Occurrences %


collect 197 338 242 31.83
forEach 145 774 791 23.51
allMatch 68 072 088 10.98
findFirst 51 274 578 8.27
anyMatch 43 892 114 7.08
sum 38 716 266 6.24
findAny 38 505 248 6.21
reduce 9 720 348 1.57
toArray 7 555 733 1.22
count 6 033 536 0.97

Table 2.2: Most common terminal operations. Table adapted from the report
"Characterizing Java Streams in the Wild" [7].

source-map-collect
source-filter-collect
source-collect
source-anyMatch
source-filter-findFirst

Figure 2.4: Most popular sequential stream pipelines [12]. The bar width
represents how many of the 610 analyzed projects included at least one
instance of the pipeline. This figure is adapted from the report "How do
developers use the Java Stream API?" [12].

Transition type              Transitions

intermediate-terminal        filter-findFirst
                             map-toArray

intermediate-intermediate    map-filter
                             filter-map
                             flatMap-map
                             flatMap-filter
                             distinct-filter
                             distinct-map

Table 2.3: Some common operation transitions [11].

2.2.4 Input, type & output


Streams are created from arraylists in more than 24% of cases, making it the
most common stream source. In 9% of cases they are created from arrays.
These data structures usually consist of objects, sometimes integers, but almost
never longs or doubles. Streams usually return arraylists, and sometimes
return void, a boolean, or an Optional object.

2.2.5 Pipeline length


Almost half of all stream pipelines only include a single intermediate
operation, e.g., stream-map-sum [7] [11]. Most pipelines (86%) include
zero, one or two intermediate operations.

2.2.6 Input size


Rosales et al. managed to query the data-source size for 63.89% of all detected
streams [7]. The data-source size is referred to as input size in this project. Of
the analyzed data sources, most had very few elements. 91% of them had less
than 10 elements. The most popular data-source size was three elements (n = 3), making up
almost 24% of all stream sources. That input size was both the median and
the mode. They found that sources containing more than 1 000 elements are
relatively rare.
Such small input sizes might seem unrealistic, since streams specifically
are made to process large sequences of elements. As mentioned earlier, these
findings may not be representative, since Rosales et al.
only examined source code executed by unit tests. A potential explanation
for these surprising findings is that the unit tests may not have had access to
the real input data used in the projects, which could be much larger. It is
common that real input data is not uploaded to GitHub, especially when it is
large. Furthermore, input data for unit tests tends to be relatively small,
since it often is manually constructed. However, their other findings are likely
to be more representative of stream usage, since they are not as dependent on
input data.

2.2.7 Parallelization is rare


Multiple studies have concluded that parallel streams are not commonly used,
and even in projects which use them, its usage is very infrequent [12] [7]
and sometimes not efficient [9]. Almost all streams (99.66%) examined
by Rosales et al. were sequential [7]. This could be seen as
surprising, since simple parallelization is a key feature of streams. Parallel
streams are usually used on simple forEach operations, whereas more
expressive features for parallelism are less frequent [12].
Stateless operations are much more common than stateful operations and
are usually parallelized more efficiently, which means that many opportunities
for speedup through parallelization are probably missed [7]. However, the
overhead cost of parallelization might still not be worth it, since most streams
operate on few elements, meaning they might not exceed the minimum
threshold [15] [7].

2.2.8 Encounter order


The vast majority of streams (82%) are created from sources with an encounter
order, which is reasonable since most streams are created from lists and arrays
[7]. Even though most streams are created from sources with an encounter
order, they usually do not have a sort order (85%).

2.2.9 Lambda expressions


As mentioned, some stream operations use lambda expressions, such as
filter and map. Lambda expressions can contain one or more statements,
just like any other function. Lambda expressions with one statement are the
most common [11].
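The difference between a single-statement (expression) lambda and a block lambda with multiple statements can be sketched as follows; both pipelines below are equivalent, and the names are our own illustrative choices:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LambdaForms {
    // Single-statement (expression) lambda: the most common form.
    static List<Integer> singleStatement(List<Integer> xs) {
        return xs.stream()
                .filter(x -> x % 2 == 0)
                .collect(Collectors.toList());
    }

    // Block lambda with several statements; less common, but equivalent here.
    static List<Integer> multiStatement(List<Integer> xs) {
        return xs.stream()
                .filter(x -> {
                    int remainder = x % 2;
                    return remainder == 0;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(singleStatement(List.of(3, 4, 5, 6))); // [4, 6]
        System.out.println(multiStatement(List.of(3, 4, 5, 6)));  // [4, 6]
    }
}
```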

Category Most common types


Stream source ArrayList (41%), int-range (14%), array (9%)
Stream type Object (81%), integer (19%)
Return types ArrayList (24%), void (24%), boolean (18%)
No. of intermediate ops. 0 - 2. Rarely more than 4.
Processing type Sequential (i.e., not parallel)

Table 2.5: Common stream usage [7].

2.2.10 Summary
The most important findings regarding common stream usage are listed in
Table 2.5.

2.3 Low level Java


In order to understand the potential performance of streams, and how to
correctly benchmark them, it is necessary to understand how Java programs
are compiled and run, which in turn requires familiarity with different types
of compilation. Static languages (such as C++ and Rust) are compiled before
runtime, which means that all machine code is generated before the program
is run. In contrast, interpreted languages (such as JavaScript and Python) are
compiled during runtime, which means that the machine code is generated
while the program is running. This allows for more dynamic and flexible
behavior; however, it usually results in slower programs.
Java can be considered both a compiled and an interpreted (or dynamic)
language [16]. This is because Java is compiled through a two-step process,
where the first step is static and the second step uses interpretation and Just-
In-Time (JIT) compilation. When a Java program is compiled, a platform-
independent file is created. This process is static, since it happens before the
program is run. The produced file contains Java bytecode. This bytecode is
similar to regular machine code, but it can not be natively executed on a CPU.
Some bytecode which is generated from streams is shown in Section 2.5.2.
Since Java bytecode is platform-independent, it can be run on any platform
which has the Java Runtime Environment (JRE).
During runtime, the Java Virtual Machine (JVM) loads and executes the
bytecode. JIT compilation is used whenever a method is called, by converting
the method bytecode into machine code just before it is executed [17]. This
means that the JVM can call the machine code of that method directly instead
of interpreting it, speeding up execution. It is possible to run a JVM without
JIT compilation and instead only rely on interpretation, which would improve
the performance at start-up, but reduce the overall performance [17].
During execution, the JVM can perform various optimizations, as it is an
adaptive virtual machine. For example, code which is executed often can
be dynamically optimized, and different types of memory strategies can be
used depending on how much memory remains. Furthermore, if the JVM
realizes that a certain value is not used in a meaningful way, it might choose
to not calculate it at all [18] [19]. When benchmarking Java programs, it is
important to take precautions in order to ensure that the JVM does not perform
any unrealistic optimizations.

2.3.1 Garbage collection


Something else which complicates JVM benchmarking is garbage collection.
Garbage collection is a feature of the JVM, which automatically deallocates
data which is no longer accessible. If this occurs during a benchmark, then the
measured runtime will be longer, since garbage collection is costly and slows
down execution speed. Because it is hard to predict when garbage collection
will be triggered, it can cause benchmarks to give unexpected and inconsistent
results. Thus, memory intensive programs are harder to accurately benchmark,
since they are more likely to result in more garbage collection.
Because of this, some benchmark authors attempt to avoid garbage
collection as much as possible, in order to decrease the error rate
[5]. However, others claim that this results in unrealistic (albeit more
consistent) measurements, since real programs do involve garbage collection.
Furthermore, the exclusion of garbage collection from benchmark programs
imposes limitations on which programs can be benchmarked.

2.4 Benchmarking Java


As explained in previous sections, benchmarking JVM languages correctly
is not as straightforward as it may seem, because of the adaptive nature of
the JVM. For example, the JVM should preferably be warmed up before the
benchmarking begins, in order for JIT-compilation to be properly carried out.
Furthermore, the JVM sometimes removes code if the computed results are
not used, which can lead to misleading performance results [18] [19].
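To make these pitfalls concrete, the warmup-then-measure pattern (together with keeping results observably live so the JVM cannot discard the workload) can be sketched by hand. This is a simplified toy harness of our own, not JMH, and not a substitute for it:

```java
import java.util.function.LongSupplier;

public class MiniBenchmark {
    // Runs the workload for a number of warmup iterations (letting the JIT
    // compiler optimize the hot code), then measures the average runtime of
    // the remaining iterations. A real harness such as JMH additionally forks
    // a fresh JVM per benchmark and guards against dead-code elimination.
    static double averageNanos(LongSupplier workload, int warmup, int measured) {
        long sink = 0; // consume results so the JVM cannot discard the workload
        for (int i = 0; i < warmup; i++) {
            sink += workload.getAsLong();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            sink += workload.getAsLong();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println("unlikely"); // keep 'sink' observably live
        return (double) elapsed / measured;
    }

    public static void main(String[] args) {
        LongSupplier sumLoop = () -> {
            long s = 0;
            for (int i = 0; i < 10_000; i++) s += i;
            return s;
        };
        System.out.printf("avg %.1f ns/op%n", averageNanos(sumLoop, 1_000, 1_000));
    }
}
```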
In order to avoid these common benchmarking pitfalls, the benchmarking
tool JMH (the Java Microbenchmark Harness) was developed as a part of
OpenJDK [20]. JMH allows the user to specify a number of warmup iterations
before the proper iterations start, which makes sure that the JVM is properly
warmed up. During each proper iteration, the benchmarked function is run
many times, and an average runtime (and error) is calculated. Furthermore,
JMH starts a new VM (Virtual Machine) for each benchmarked function,
which ensures that optimizations which occurred during earlier benchmarks
do not interfere with the current benchmark [19].

public interface Spliterator<T> {
    boolean tryAdvance(Consumer<? super T> action);
    void forEachRemaining(Consumer<? super T> action);
    Spliterator<T> trySplit();
    long estimateSize();
    long getExactSizeIfKnown();
    int characteristics();
    boolean hasCharacteristics(int characteristics);
    Comparator<? super T> getComparator();
}

Figure 2.5: Spliterator interface in Java 8 [5].

2.5 Low level streams


Streams use Spliterator(s) for sequential and parallel traversal of
elements [5] [21]. Spliterators can operate sequentially or in parallel,
and they can detect structural interference of the source while processing,
which means that they will work correctly even if the stream source is being
altered by an external source. The Spliterator interface is shown in
Figure 2.5, which includes functions which are called internally when a stream
is traversed. This layer of abstraction is hidden from developers who use
streams.
For most streams, the Spliterator uses the forEachRemaining
function in order to internally call hasNext and next to traverse the
sequence of elements, and accept in order to apply an operation to the
current element [5]. Thus, three virtual calls are required for each element.
These virtual calls are relatively slow to execute, and they are one of the
reasons for why streams may be slower than imperative Java, as detailed in
the next section. Some previous studies have changed the internal workings
of forEachRemaining in order to decrease the number of virtual calls,
which results in higher performance [5].
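The interplay of tryAdvance and forEachRemaining can be illustrated by driving a Spliterator manually, in roughly the way stream internals do. This is an illustrative sketch with our own names; real pipelines hide this machinery entirely:

```java
import java.util.List;
import java.util.Spliterator;
import java.util.concurrent.atomic.AtomicLong;

public class SpliteratorExample {
    // Manually drives a Spliterator: tryAdvance consumes one element,
    // forEachRemaining consumes the rest, applying an action to each.
    static long sumViaSpliterator(List<Integer> xs) {
        AtomicLong sum = new AtomicLong();
        Spliterator<Integer> sp = xs.spliterator();
        sp.tryAdvance(x -> sum.addAndGet(x));       // first element
        sp.forEachRemaining(x -> sum.addAndGet(x)); // remaining elements
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(sumViaSpliterator(List.of(4, 5, 6))); // 15
    }
}
```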

2.5.1 Stream speed


Creating a stream comes with an abstraction overhead cost that is much bigger
than the cost of creating a simple for-loop [7] [8]. For small input sizes, the
dominant factor affecting performance is likely to be the overhead cost of
abstraction. Therefore, streams are expected to perform worse than imperative
Java for small input sizes. As the input size increases, the performance
difference between streams and imperative Java may become less noticeable
and could eventually become negligible.
Something which impacts stream performance is that executing them
involves many virtual calls in the generated bytecode, which are slow to
execute [8]. A stream pipeline with N elements and length K can require up
to N * K virtual calls for the elements to be pushed through the pipeline [22]
[8].
Previous research has attempted to reduce the number of virtual calls by
inlining the consumer chain, in order to improve the performance of streams
[8]. Method inlining is a common compiler technique which replaces a method
call with the method’s body, effectively removing the abstraction cost of
calling a function. This requires the method parameters and return values to
be replaced, in order for the program flow to be preserved.
Furthermore, Java streams require the use of several different internal data
structures, such as spliterator objects and consumer objects [8]. Creating
and using these objects requires processing time, and they are usually placed
on the heap. However, since these objects only are used internally within
stream operations, it is possible to instead place them on the stack [8].
Previous researchers have done this, and they claim that this increases stream
performance, since it eliminates object allocations and reduces the need for
garbage collection [8]. The Java JIT compiler also does this, but only in a
limited manner [23] [8].

2.5.2 Stream bytecode


Figures 2.6 and 2.7 show how the values in an array can be summed in two
different ways; using a for-loop and using a stream, and the bytecode that these
two versions generate. The stream version results in two method calls. The
instruction invokestatic #2 is used in order to create an IntStream
from an array, and the instruction invokeinterface #3, 1 is used to
call the sum method on that stream. Such method calls can result in abstraction
overhead performance costs.
In contrast, the for-loop version resulted in zero method calls, and instead
only used simple bytecode instructions such as ladd, which calculates the
sum of two long values on the stack, and goto, which is an unconditional
jump (in this example it is used to make the loop repeat). These types of
instructions generally result in less overhead than method calls, and are
therefore faster. Furthermore, stream operations which require the use of
lambda functions (such as map and filter) usually incur even more
overhead cost.

// Summing an array imperatively
int[] nums = {4, 5, 6};
long sum = 0;
for (int i = 0; i < nums.length; i++) {
    sum += nums[i];
}

// Generated bytecode
// ...
17: lconst_0
18: lstore_2
19: iconst_0
20: istore 4
22: iload 4
24: aload_1
25: arraylength
26: if_icmpge 43
29: lload_2
30: aload_1
31: iload 4
33: iaload
34: i2l
35: ladd
36: lstore_2
37: iinc 4, 1
40: goto 22
43: return

Figure 2.6: Some imperative code and the bytecode generated from it.

// Summing an array with streams
int[] nums = {4, 5, 6};
long sum = Arrays.stream(nums).sum();

// Generated bytecode
// ...
17: aload_1
18: invokestatic #2        // Creates IntStream
21: invokeinterface #3, 1  // Calls sum()
26: i2l
27: lstore_2
28: return

Figure 2.7: Some stream code and the bytecode generated from it.

2.6 Algorithms
In order to get a wider perspective on the performance of streams,
some popular algorithms are written with and without streams, and their
performances are compared. This specifically includes Dijkstra’s algorithm
and the FFT (Fast Fourier transform) algorithm.

2.6.1 FFT
Fast Fourier transform is an algorithm that calculates the Discrete Fourier
Transform (DFT) of a sequence. In this project, an implementation of the
FFT algorithm from the book Algorithms, by Sedgewick and Wayne, is used
[24]. This Java implementation is available on the book’s complementary
GitHub page [25], and is available under the GNU General Public License.
According to the authors, the implementation is relatively bare-bones, and runs
in O(n log n) time.
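For reference, the DFT that the FFT computes is the standard transform of a length-$N$ complex sequence $x$:

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N},
\qquad k = 0, 1, \ldots, N-1,
```

which the FFT evaluates for all $N$ outputs in $O(n \log n)$ operations, rather than the $O(n^2)$ of the naive summation.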
The original non-stream algorithm was implemented for clarity, not
necessarily performance. It calculates FFT of the complex input data x, as
well as the inverse FFT, the circular convolution of x with itself, and the linear
convolution of x with itself, and prints all of these. The authors note that it
is not the most memory efficient algorithm, partly because it re-allocates for
subarrays instead of doing in-place re-allocation, and because it uses object
types for representing complex numbers.

2.6.2 Dijkstra’s algorithm


Dijkstra’s algorithm is an algorithm for finding the shortest paths between
vertices in an edge-weighted graph. It is a popular and useful algorithm,
often used for finding the shortest paths in road networks. Sedgewick
and Wayne, the authors of the FFT algorithm used in this project, also
provide a Java implementation of Dijkstra’s algorithm, available on the book’s
complementary GitHub page [26].
This implementation is the single-source shortest paths variant of Dijkstra,
which computes the shortest path tree for edge-weighted directed graphs with
non-negative edge weights. The authors claim that the constructor, which
calculates the shortest paths but does not print them, takes Θ(E log V ) time in
the worst case, where E is the number of edges and V is the number of vertices
in the graph [26]. The implementation uses multiple custom built Java classes.
For example, it uses an indexed priority queue of generic keys, which in turn
uses a binary heap with an array in order for keys to be associated with integers
within a certain range [27].

2.7 Related Work


Streams have been embraced by the industry at large [28]. Multiple studies
have tried to improve stream performance [29] [5] [30] [22] [8] [31]. However,
to our knowledge only a single study has looked into how streams perform in
comparison to imperative Java. This lack of thorough stream benchmarking
was pointed out by Schiavio et al. [6].

2.7.1 Streams vs imperative Java


In 2014, Biboudis et al. compared the performance of streams to equivalent
code which was written with loops instead of streams (i.e., imperative Java).
Their study is the only one which has explored this. They only tested five
stream operations: map, filter, flatMap, sum, and count, in the following pipelines:

• source-sum

• source-map-sum (calculates sum of squares)

• source-filter-map-sum (calculates even sum of squares)

• source-flatMap-sum (calculates cartesian product)



• source-filter-filter-count (counts certain objects)

The first two pipelines performed almost identically to their imperative
counterparts. The third pipeline was 50% slower, the fourth pipeline was 12.7
times slower, and the fifth pipeline was 2.4 times slower than imperative Java.
In summary, they found that streams sometimes are as fast as imperative Java,
but can be up to an order of magnitude slower.
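As a sketch, the third pipeline (the even sum of squares) and its imperative counterpart might look as follows; this is our own illustration, not Biboudis et al.'s exact benchmark code:

```java
import java.util.stream.LongStream;

public class FilterMapSum {
    // source-filter-map-sum: sum of the squares of the even values.
    static long streamVersion(long[] values) {
        return LongStream.of(values)
                .filter(x -> x % 2 == 0)
                .map(x -> x * x)
                .sum();
    }

    // The imperative counterpart: one fused loop, no intermediate objects.
    static long loopVersion(long[] values) {
        long sum = 0;
        for (long x : values) {
            if (x % 2 == 0) {
                sum += x * x;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] values = {1, 2, 3, 4};
        System.out.println(streamVersion(values)); // 20
        System.out.println(loopVersion(values));   // 20
    }
}
```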
However, their code did not reflect realistic stream usage, possibly because
there was no research available on how streams were commonly used at the
time, since they published their findings the same year that streams were added
to Java. For example, streams were only applied on arrays, however arraylists
are more common. Furthermore, the arrays contained long integers, whereas
it is much more common for streams to be used on data structures which
contain objects [7]. They also only applied stream pipelines on the input size
10 000 000, however it is unclear what the most common input sizes are (as
explained in Section 2.2.6).

2.7.2 Performance of streams


Two other studies have looked into the performance of lambdas, which are
often used in streams, but they have not specifically looked into stream performance
[32] [33].
One previous study [6] presented a benchmark generator for Java streams,
which can convert existing workloads designed for databases to stream-
based programs. The aforementioned project did not conduct a thorough
evaluation of stream performance; rather, it primarily focused on constructing
the benchmark generator. This program has not yet been used in any other
studies, although the creators of this tool claim that they plan to address this in
future work, by generating a large benchmark suite for the Java Stream API.
Such an endeavor would result in Java streams being benchmarked on a
broad scale, since the benchmark programs would not have to be handmade.
In contrast, the programs to be benchmarked in this project are specifically
designed to reflect how streams are used in public projects. Furthermore,
this project is focused on comparing Stream performance with imperative Java
performance, instead of solely looking at the performance of Java streams.

Chapter 3

Method

In this chapter, the method is described. Sections 3.1 - 3.5 detail the methods
which aim to answer the first research question and its subquestions, relating
to common pipelines and their imperative counterparts. Section 3.6 details
the method which aims to answer the second research question, relating
to algorithms implemented with and without streams. Finally, section 3.7
described how the benchmarking tool JMH is used, and 3.8 details the
specifications of the computer used for benchmarking.

3.1 RQ 1: Average stream performance


RQ 1 Considering how Java streams are commonly used, how do stream
pipelines perform in comparison to imperative loops in terms of runtime,
and how does the performance vary with input size?

In this project, a large number of commonly used pipelines are benchmarked
individually, and their runtimes are compared to equivalent code written
without streams (i.e., imperative Java). The specific pipelines are described
in the following sections. Since we are interested in how stream performance
varies with input size, the pipelines are applied on lists which contain between
1 and 1 000 000 objects, increasing exponentially. For each pipeline and input
size, a slowdown factor is calculated, which is simply the stream runtime
divided by the imperative runtime.
The pipelines are constructed to reflect common usage [7]. This motivates
most of the method choices, since the aim of the study is to provide a detailed
description of how streams realistically perform in comparison to imperative
alternatives. Thus, only the ten most popular intermediate and terminal
operations are used, and most of the pipelines are applied on arraylists.
Furthermore, the elements in these lists are primarily Person objects. The
Person class contains a string attribute called name and an integer attribute
called age. The input has been described in more detail in Section 4.3.
Finally, the pipelines contain between zero and two intermediate operations,
and almost all of them use sequential processing, as opposed to parallel
processing.
The pipelines which are benchmarked and compared to imperative Java
are divided into categories based on which subquestion of RQ 1 they aim to
answer. Each subquestion has its own section in this chapter, as listed below.

• Section RQ 1.1: Intermediate operations describes ten pipelines which
  aim to isolate the performance of intermediate operations.

• Section RQ 1.2: Terminal operations describes ten pipelines which aim
  to isolate the performance of terminal operations.

• Section RQ 1.3: Pipeline length describes 23 pipelines which aim to
  show how stream performance varies with pipeline length.

• Section RQ 1.4: Parallel pipelines describes two parallel pipelines.

Since the benchmarked pipelines reflect common usage, a general answer to
RQ 1 can be given by calculating the average, median, worst and best runtime
for all of the tested pipelines. However, it is more meaningful to look at the
different categories of pipelines separately.

3.2 RQ 1.1: Intermediate operations


RQ 1.1 How do intermediate operations perform, and how does their type
impact their performance?

For each of the ten most popular intermediate operations, a pipeline is
constructed which aims to isolate the operation's performance. They are listed
and explained in Table 3.1∗ . They are divided into two categories: stateless
and stateful.
∗ Note that the table shows pseudocode; however, it is very similar to actual
Java code. Certain details, such as punctuation symbols, have been removed,
and some dashes have been added, in order to increase readability.

Since all pipelines require a terminal operation, most of them are paired
with the most popular terminal operation collect, which is used to convert
the stream into a collection, such as an arraylist. However, mapToInt instead
uses the terminal operation toArray, which converts the stream to an array.
This is because mapToInt converts all stream elements to integers, which
are primitives; however, collections cannot contain primitives, only objects∗.
Arrays, on the other hand, can contain primitives. This also explains why
mapToLong uses toArray.
Perhaps surprisingly, these isolating pipelines, along with the pipelines
which isolate terminal operations, are the most representative of how streams
are commonly used, since most stream pipelines on GitHub include either zero
or one intermediate operation [7] [12].
Important to note is that the imperative counterpart of the sorted stream
operation does not involve the usage of loops. Instead, we use the popular
Java function Collections.sort. This choice was made since it is not
realistic that a developer would implement their own sorting algorithm using
loops, when there already exist much more accessible (and presumably faster)
alternatives built into Java. As a reminder, we are attempting to compare
realistic stream code with realistic imperative code.
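As an illustration, the two variants might look as follows (sketched here on plain integers rather than Person objects, for brevity; the class and method names are our own):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortedComparison {
    // Stream pipeline: source-sorted-collect.
    static List<Integer> sortedStream(List<Integer> input) {
        return input.stream()
                .sorted(Comparator.naturalOrder())
                .collect(Collectors.toList());
    }

    // Imperative counterpart: copy the list, then use Collections.sort
    // rather than a hand-written sorting loop.
    static List<Integer> sortedImperative(List<Integer> input) {
        List<Integer> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortedStream(List.of(3, 1, 2)));     // [1, 2, 3]
        System.out.println(sortedImperative(List.of(3, 1, 2))); // [1, 2, 3]
    }
}
```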

3.3 RQ 1.2: Terminal operations


RQ 1.2 How do terminal operations perform?

Ten pipelines are constructed to isolate the performance of the ten most
popular terminal operations. They are listed and explained in Table 3.2. Some
of the terminal operations are tested without an intermediate operation, i.e.,
in isolation, and some are tested with the intermediate operation filter,
depending on what is the most realistic† . For example, findFirst and
collect are tested with filter because that is more common than using
them in isolation [12] [11].
The lambda expression for anyMatch returns true if any person has an
age of -5. This is naturally never true, which means that all stream elements

It would be possible to use mapToInt with collect by calling the boxed operation,
which converts primitive integers to object representation of integers, however this would
defeat the purpose of using mapToInt instead of map.

† One likely reason for why these operations are not commonly used in isolation
is that there already exist built-in functions in Java with similar behavior. count in
isolation simply returns the size of a list, which is equivalent to calling list.size().
Furthermore, findFirst in isolation returns the first element of a list, which is equivalent to
list.get(0). Finally, collect and toArray in isolation just convert a data structure
to another type of data structure, for which there also exist built-in functions in Java.
Stateless:

  Removes all Person objects with an odd age.
  source-filter(x -> x.getAge() % 2 == 0)-collect

  Extracts the name from all Person objects.
  source-map(x -> x.getName())-collect

  Flattens the input stream.
  source-flatMap(List::stream)-collect

  Extracts the age from all Person objects.
  source-mapToInt(x -> x.getAge())-toArray

  Extracts the age from all Person objects.
  source-mapToLong(x -> x.getAge())-toArray

  Creates a Person object from an integer.
  source-mapToObj(x -> new Person(...))-collect

Stateful:

  Only keeps the first third of all Person objects.
  source-limit(inputSize / 3)-collect

  Only keeps the last third of all Person objects.
  source-skip(inputSize * 2 / 3)-collect

  Sorts the Person objects according to age.
  source-sorted(...)-collect

  Only keeps the unique numbers.
  source-distinct-collect

Table 3.1: Pipelines which isolate intermediate operations.


Returns true if any person has an age of -5.
source-anyMatch(x -> x.getAge() == -5)

Returns true if all people have an age greater than -5.
source-allMatch(x -> x.getAge() > -5)

Sums the integers.
source-sum

Counts how many people have an even age.
source-filter(x -> x.getAge() % 2 == 0)-count

Returns the first Person in the list who is older than a certain age.
source-filter(x -> x.getAge() >= 0.8 * maxAge)-findFirst

Returns some Person in the list who is older than a certain age.
source-filter(x -> x.getAge() >= 0.8 * maxAge)-findAny

Stores the people with an even age in a list.
source-filter(x -> x.getAge() % 2 == 0)-collect

Stores the people with an even age in an array.
source-filter(x -> x.getAge() % 2 == 0)-toArray

Adds people older than seven to a list.
source-forEach(p -> if (p.getAge() > 7) olderPeople.add(p))

Sums the age of all people using an accumulator.
source-reduce(0, (acc, x) -> acc + x.getAge(), Integer::sum)

Table 3.2: Pipelines which isolate terminal operations.

must be checked, which allows for more straightforward benchmarking.


Similarly, the allMatch operation must also check all stream elements.
Just as with all other pipelines, imperative versions are also constructed and
benchmarked.
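As a sketch of what one of these isolated terminal operations looks like in code, the reduce pipeline from Table 3.2 and a corresponding imperative loop could be written as follows (the Person class is a simplified stand-in for the benchmark's class):

```java
import java.util.List;

public class ReduceSketch {
    static class Person {
        private final int age;
        Person(int age) { this.age = age; }
        public int getAge() { return age; }
    }

    public static void main(String[] args) {
        List<Person> people = List.of(new Person(3), new Person(4), new Person(5));

        // source-reduce(0, (acc, x) -> acc + x.getAge(), Integer::sum)
        int streamSum = people.stream()
                .reduce(0, (acc, x) -> acc + x.getAge(), Integer::sum);

        // Imperative counterpart with a traditional for-loop.
        int loopSum = 0;
        for (int i = 0; i < people.size(); i++) {
            loopSum += people.get(i).getAge();
        }

        System.out.println(streamSum + " " + loopSum);
    }
}
```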

3.4 RQ 1.3: Pipeline length


RQ 1.3 How does pipeline length affect performance?

In order to examine how pipeline length, i.e., the number of intermediate



operations plus the terminal operation, affects stream performance, eight∗
of the most popular terminal operations are tested with zero, one, and two
intermediate operations. For example, the terminal operation anyMatch
is tested in the three pipelines source-anyMatch, source-filter-anyMatch,
and source-filter-map-anyMatch.
Similar pipelines are constructed for the other terminal operations. The
chosen intermediate operations are filter and map, since they are by far
the most popular [7]. For the operations which require a lambda expression,
lambda expressions similar to those in the pipelines described in previous
sections are used. Finally, as for all other tested pipelines, these pipelines are
benchmarked with the aforementioned input sizes, and corresponding imperative
versions are also constructed and benchmarked.

3.5 RQ 1.4: Parallel pipelines


RQ 1.4 How do parallel pipelines perform?

Only two parallel pipelines are benchmarked and compared with their
imperative counterparts, for two reasons. The main reason is that
parallel pipelines are incredibly rare, making up only 0.34% of all pipelines
found on GitHub [7]. Secondly, creating parallel imperative versions of
parallel pipelines takes much more effort than creating imperative versions
of sequential pipelines. The tested pipelines are source-parallel-sum
and source-parallel-map(x -> x * x)-toArray. The former sums
all of the integers in the input array, and the latter squares all elements
in the input array.
Imperative parallel versions of streams are constructed using Java threads,
with each thread working on one part of the input array in parallel.
Additionally, imperative non-parallel versions of the tested pipelines are
also constructed and benchmarked, in order to give more insights into the
performance difference between parallel streams and sequential imperative
code.
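The shapes of these pipelines, together with a thread-based imperative parallel sum, can be sketched as follows (the chunking scheme and thread count are illustrative assumptions, not the thesis's exact implementation):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelSketch {
    public static void main(String[] args) throws InterruptedException {
        int[] nums = IntStream.range(0, 1_000).toArray();

        // source-parallel-sum
        long streamSum = Arrays.stream(nums).parallel().sum();

        // source-parallel-map(x -> x * x)-toArray
        int[] squares = Arrays.stream(nums).parallel().map(x -> x * x).toArray();

        // Imperative parallel sum: each thread handles one chunk of the array.
        int threads = 4;
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        int chunk = (nums.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                int from = id * chunk;
                int to = Math.min(from + chunk, nums.length);
                for (int i = from; i < to; i++) {
                    partial[id] += nums[i];
                }
            });
            workers[t].start();
        }
        long threadSum = 0;
        for (int t = 0; t < threads; t++) {
            workers[t].join(); // join() establishes visibility of each partial sum
            threadSum += partial[t];
        }

        System.out.println(streamSum + " " + threadSum + " " + squares[3]);
    }
}
```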

∗ Specifically, the eight terminal operations are anyMatch, allMatch, sum, count,
collect, toArray, reduce, and forEach. However, count is not tested with zero
intermediate operations, since that would be equivalent to simply calling list.size().

3.6 RQ 2: Algorithms with streams


RQ 2 How does the utilization of streams impact the performance of popular
algorithms?

Sedgewick and Wayne's imperative Java implementations of Dijkstra's
algorithm and FFT are described in the background chapter, in Section 2.6.
The original implementations by Sedgewick and Wayne did not incorporate
any streams. In this project, we rewrite these imperative implementations by
replacing for-loops with streams. The original FFT implementation was more
suited for this purpose than Dijkstra's algorithm, since it uses many for-loops
with relatively simple array manipulation, which can easily be replaced by
streams. Almost all of the 23 for-loops found in these algorithms have been
replaced with streams.
The imperative version and the stream version of each algorithm are both
benchmarked, and runtimes are compared, for multiple input sizes. The
input data used for Dijkstra's algorithm is supplied by the webpage which
accompanies the book from which the algorithm originated. This input data
consists of graphs which contain between eight and one million nodes, and
between fifteen and fifteen million edges [34] [24]. The input data is read
before the benchmarking starts. The input data for the FFT algorithm consists
of complex numbers which are randomly generated, as supplied by the authors
of the algorithm. These complex numbers are put into an array. The chosen
array sizes are powers of 2, specifically between 2 and 1,048,576,
with steps increasing exponentially.
In order for these algorithms to be benchmarked in this project, we have
adapted them to work with JMH. Instead of letting the algorithms print the
output data, the output data is consumed by Blackhole objects (a feature of
JMH), which ensures that the JVM actually calculates these values.

3.7 JMH
JMH, which is brought up in the background chapter in Section 2.4, is used
in order to benchmark all of the constructed pipelines and algorithms in this
project. The JMH settings are similar to those used in a previous study which
examined stream performance [5]. JMH is used together with Maven, which
is a popular build automation tool for Java. The reason for why specifically
JMH is used is that it is the standard benchmarking Java tool [5], and it is
32 | Method

very well maintained, with a large amount of documentation and examples


available online.
As mentioned, JMH allows the user to specify a certain number of warmup
iterations for each function which should be benchmarked. Ten warmup
iterations and ten measured iterations are run for each pipeline in this project,
since the measured runtimes have stabilized by that point. Each iteration is
set up to run for at least 1 second, though it may run longer than
that if necessary. Furthermore, the input is generated/read before the actual
measurements begin, for example when initializing and filling lists and arrays.
Costa et al. have described how JMH is sometimes used incorrectly, which
can result in inaccurate benchmarks [19]. They have identified four of the most
common bad practices found when using JMH, and how they can be avoided.
These recommendations are followed by the benchmarks in this project. An
example of a bad JMH practice is not using the result returned by a method
in a benchmark. This can result in the method call being discarded, which
makes the code appear faster than it actually is.
Furthermore, when benchmarking a method by calling it multiple times
in a loop, Costa et al. recommend that a Blackhole object∗ should consume
each individual method call, in order to stop unrealistic loop optimizations
from occurring (for example, multiple method calls can be merged). However,
this recommendation likely does not apply to the imperative loops which
are benchmarked in this project, since we are measuring the performance
of the loops themselves, not the methods which are called inside the loops.
Therefore, this specific recommendation has been disregarded for this project.

3.8 System specifications


All of the benchmarks, except the parallel ones, are run on a computer running
macOS with the specifications listed in Table 3.3. The parallel pipelines are
benchmarked on a computer running Windows, with access to more cores,
which makes it better suited for parallel execution, as listed in Table 3.4. In
order to minimize interference, no other programs are run while benchmarking
is in progress. Each category of benchmarks takes between one and a few
hours to finish. Finally, Java 11 is used.


∗ A JMH object which can ensure that certain values are calculated by the JVM.

Operating System              macOS Big Sur (11.6.2)
Model Name                    MacBook
Model Identifier              MacBook8,1
Processor Name                Dual-Core Intel Core M
Processor Speed               1.3 GHz
Number of Processors          1
Total Number of Cores         2
L2 Cache (per Core)           256 KB
L3 Cache                      4 MB
Hyper-Threading Technology    Enabled
Memory                        8 GB

Table 3.3: Specifications of the computer which the non-parallel benchmarks
are run on.

Operating System    Windows 10 Pro
CPU                 Intel Core i7-11700K 3.6 GHz 8-Core
Motherboard         Gigabyte B560M AORUS PRO AX Micro ATX LGA1200
Memory              Corsair Vengeance LPX 16 GB (2 x 8 GB) DDR4-3600 CL18
Storage             Kingston A2000 1 TB M.2-2280 PCIe 3.0 X4 NVME SSD
Video Card          MSI GeForce RTX 3060 Ti 8 GB

Table 3.4: Specifications of the computer which the parallel benchmarks are
run on.

Chapter 4

Implementation

This chapter explains some aspects of how the benchmarks are implemented in
code. Section 4.1 brings up what type of for-loops are used to replace streams.
Section 4.2 shows how one of the pipelines is benchmarked. Section 4.3 gives
details on the input used for the pipelines. Finally, Section 4.4 explains how
two algorithms are adapted to use streams instead of for-loops.

4.1 Replacing streams with loops


As mentioned, the performance of each pipeline is compared to equivalent
Java code written without the usage of streams. This can usually be achieved
using for-loops and if-statements. This imperative code is benchmarked on the
exact same input as the stream code. The for-loops have traditional headers,
such as for (int i = 0; i < list.size(); i++). Enhanced
for-loops (e.g., for (dataType item : list)) are not used, since
this thesis aims to specifically compare traditional imperative loops with
streams. Using enhanced for-loops could give different performance results.

4.2 A sample benchmark


Figure 4.1 shows how the pipeline source-sum and its imperative
counterpart are benchmarked. This pipeline isolates the performance of the
terminal operation sum. Both versions calculate the same thing, i.e., the sum
of the integers in the input array. This benchmark is performed for multiple
different array sizes, increasing exponentially from 1 to 1 000 000. All other
functions which benchmark stream pipelines are also tested with those input
sizes, although most other functions use an ArrayList of Person objects

// Stream benchmark
@Benchmark
public void sum_STR(Blackhole bh) {
    long sum = Arrays.stream(nums).sum();
    bh.consume(sum);
}

// Imperative benchmark
@Benchmark
public void sum_IMP(Blackhole bh) {
    long sum = 0;
    for (int i = 0; i < nums.length; i++) {
        sum += nums[i];
    }
    bh.consume(sum);
}

Figure 4.1: Code used to benchmark the pipeline source-sum and its
imperative counterpart.

instead of an integer array as input. The structure of the input is explained
in Section 4.3.
The function takes a Blackhole object bh as an argument, a JMH
feature which helps avoid a common benchmarking pitfall. This
object consumes the calculated sum value with bh.consume(sum), to
ensure that the JVM actually calculates this value [19]. If a Blackhole
object is not used, and the JVM realizes that this value is not used in any
meaningful way, then it might optimize the code in such a way that the value
is not calculated at all [18]. This would render the benchmark useless, since
the operation to be benchmarked would not actually be performed.
In this case, the Blackhole object is not strictly necessary, since returning
a value and consuming it with a Blackhole are equivalent [18]. However, in
other benchmarks it is necessary, for example when multiple output values
are calculated.

4.3 Input for streams


For the tested pipelines (RQ 1), primarily two different data structures are
used as input for the benchmarked pipelines. The first data structure used as
input is an integer array containing the values 0, 1, 2, ..., 999, 0,

1, 2, ..., 999, repeating. No integer value is higher than 999. This
input is similar to the input used in a previous study which also benchmarked
Java streams [5]. Several different versions of it are constructed, with input
sizes ranging between 1 and 1 000 000.
The second data structure used as input is an ArrayList containing
Person objects. These Person objects follow the same repeating sequence
as the nums array, i.e., the first Person object has an age of 0, the second has
an age of 1, and Person object number 1 000 has an age of 999, repeating.
The name of each person is their age concatenated with Eric. Most of the
benchmark methods use this list as input, since streams are more often used on
lists of objects rather than arrays of integers [7]. However, the sum operation is
instead applied on the list of integers, since that operation can only be applied
to streams of type IntStream.
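A sketch of how these two inputs could plausibly be constructed (the exact benchmark code may differ, and the name order age + "Eric" is an assumption based on the description above):

```java
import java.util.ArrayList;
import java.util.List;

public class InputSketch {
    static class Person {
        private final String name;
        private final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public int getAge() { return age; }
    }

    public static void main(String[] args) {
        int inputSize = 5_000;

        // Integer array: 0, 1, ..., 999, repeating.
        int[] nums = new int[inputSize];
        for (int i = 0; i < inputSize; i++) {
            nums[i] = i % 1000;
        }

        // ArrayList of Person objects following the same repeating ages.
        List<Person> people = new ArrayList<>(inputSize);
        for (int i = 0; i < inputSize; i++) {
            people.add(new Person((i % 1000) + "Eric", i % 1000));
        }

        System.out.println(nums[1000] + " " + people.get(1001).getAge()
                + " " + people.get(3).getName());
    }
}
```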
To our knowledge, there is no data available on what size of objects streams
usually are applied to; however, it is clear that streams are more commonly
applied to objects rather than integers [7]. In this study we use relatively
small objects, which only contain two attributes and getter methods for these
attributes.
A few operations use slightly different input data which represent more
realistic use cases, but all operations are tested on the same input sizes.
For example, the two operations sorted and distinct are tested on
randomized data, where the age attributes of the Person objects are random
values between 1 and N , where N is the input size.
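The randomized ages for sorted and distinct might be generated along these lines (the seed and generator choice are illustrative assumptions):

```java
import java.util.Random;

public class RandomInputSketch {
    public static void main(String[] args) {
        int n = 1_000;                 // N = input size
        Random rnd = new Random(42);   // fixed seed here for reproducibility
        int[] ages = new int[n];
        for (int i = 0; i < n; i++) {
            ages[i] = 1 + rnd.nextInt(n); // random value between 1 and N
        }

        // Sanity check: every generated age lies in [1, N].
        boolean inRange = true;
        for (int age : ages) {
            inRange &= (age >= 1 && age <= n);
        }
        System.out.println(inRange);
    }
}
```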

4.4 Algorithms with streams


As mentioned in the method chapter, Dijkstra’s algorithm and FFT are
implemented with and without Java streams. Figure 4.2 shows an example of
how a for-loop in the FFT implementation is replaced with a stream pipeline.
In this example, an array even containing Complex objects is created, by
taking every other element from the array x. This is replicated using streams,
by first generating the indexes for each element in x, then filtering for only the
even indexes, mapping each index to a corresponding element in x, and finally
turning the stream of elements into an array of type Complex. In total, more
than twenty for-loops are replaced with streams.

// Imperative version
Complex[] even = new Complex[n / 2];
for (int k = 0; k < n / 2; k++) {
    even[k] = x[2 * k];
}

// Stream version
Complex[] even = IntStream.range(0, x.length)
        .filter(i -> i % 2 == 0)
        .mapToObj(i -> x[i])
        .toArray(Complex[]::new);

Figure 4.2: An example of how a for-loop is replaced with a stream in the FFT
implementation.

Chapter 5

Results and Validity

In this chapter, all sections except the last one are dedicated to one research
question (or subquestion) each. The final section brings up the validity of the
data by showing the error rates.
Primarily, the runtime differences between stream pipelines (and algorithms)
which are implemented with and without streams are shown, for input
sizes between 1 and 1 000 000 elements. These differences are presented using
slowdown factors. The slowdown factor for a stream pipeline is its runtime
divided by the runtime of equivalent code without streams (i.e., imperative
Java). For example, if a stream pipeline has a slowdown factor of three, then
that means that it is three times slower than equivalent code written without
streams.
The results are shown using slowdown factors instead of raw runtimes
because the primary interest in this project is how much
slower/faster streams are than imperative Java. This means that if a certain
operation is said to "perform worse" than another operation, then this does
not pertain to runtime, but to how much slower they are than their imperative
counterparts. Furthermore, smaller input sizes refer to input sizes between
1 and 1 000, and larger input sizes refer to input sizes between 10 000 and
1 000 000.
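As a minimal worked example of how a slowdown factor is computed (the runtimes below are invented, not measured values from this project):

```java
public class SlowdownExample {
    public static void main(String[] args) {
        double streamNs = 300.0;      // hypothetical stream pipeline runtime
        double imperativeNs = 100.0;  // hypothetical imperative loop runtime

        // Slowdown factor = stream runtime / imperative runtime.
        double slowdownFactor = streamNs / imperativeNs;

        // A factor of 3.0 means the stream is three times slower.
        System.out.println(slowdownFactor);
    }
}
```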

5.1 RQ 1: Average stream performance


RQ 1 Considering how Java streams are commonly used, how do stream
pipelines perform in comparison to imperative loops in terms of runtime,
and how does the performance vary with input size?
Figure 5.1 gives an overview of stream performance, by showing the average,
best performing∗, and worst performing pipelines, taking all benchmarked
sequential pipelines into consideration.

[Figure: slowdown factor vs. input size, log scale; series: Worst, Average, Best]

Figure 5.1: Slowdown factors for the worst, average, and best performing
pipelines.

Almost all pipelines perform much worse for smaller input sizes than for
larger input sizes. For smaller input sizes, the pipelines are on average between
2.6 and 9.6 times slower than their imperative counterparts. Meanwhile, for
the larger input sizes, the pipelines are on average only 39% to 74% slower
than their imperative counterparts. It is important to note that the average
values are slightly skewed because of some outliers. Because of this, the median
slowdown factors are slightly lower than the average slowdown factors.
The performance range is rather large. The worst performing pipeline,
which is source-collect† for most input sizes, is much slower than
imperative Java for all input sizes. However, source-collect is an
outlier, and almost all other pipelines are less than four times slower than
their imperative counterparts for the larger input sizes. Furthermore, the best
performing pipeline is only slightly slower than its imperative counterpart

∗ There are multiple best performing pipelines, depending on which input size is being
examined. For example, at input size 100 the best performing pipeline is source-forEach,
with a slowdown factor of 1.17, but for input size 1 000 000 it is source-sum, with a
slowdown factor of 0.90.

† This pipeline is brought up again in Section 5.4.

for the smaller input sizes, and it is even slightly faster than its imperative
counterpart for the larger input sizes. The following sections look more
deeply into the performance of different types of pipelines.

5.2 RQ 1.1: Intermediate operations


RQ 1.1 How do intermediate operations perform, and how does their type
impact their performance?

Figure 5.2 shows the average slowdown factors for the pipelines which isolate
stateless and stateful intermediate operations. The median slowdown factors
closely align with the average for both stateless and stateful operations.

[Figure: average slowdown factor vs. input size; series: Stateless operations, Stateful operations]

Figure 5.2: Average slowdown factors for the pipelines which isolate stateless
and stateful intermediate operations.

The general trend for all input sizes is that the stateful operations perform
slightly worse than the stateless operations. The biggest difference can be
seen for larger input sizes. At the largest input size, all of the intermediate
operations, except for distinct, are less than 30% slower than their
imperative counterparts.

5.2.1 Stateless operations


Figure 5.3 shows the slowdown factors for the pipelines which isolate one
stateless intermediate operation each. Exact slowdown factors are shown
in Appendix A. They all use the terminal operation collect. Among
the stateless operations, flatMap is the worst performing. At input size
10 000, it is twice as slow as its imperative counterpart, whereas all of the
other stateless operations are less than 40% slower than their imperative
counterparts. Interestingly, while mapToInt and mapToLong are the
worst performing stateless operations for small input sizes, they are the best
performing ones for larger input sizes. Furthermore, the two most popular
intermediate operations, filter and map, perform quite well, and are less
than 30% slower than their imperative counterparts for moderate to large input
sizes.

[Figure: slowdown factor vs. input size; series: filter, flatMap, map, mapToInt, mapToLong, mapToObj]

Figure 5.3: Slowdown factors for the pipelines which isolate stateless
intermediate operations.

5.2.2 Stateful operations


Figure 5.4 shows the slowdown factors for the pipelines which isolate one
stateful intermediate operation each. Exact slowdown factors are shown in
Appendix A. The curves for the stateful operations are less smooth than the
stateless curves are. The best performing stateful operation is sorted. The
worst performing operation is distinct, and it stops improving after the input

size reaches 100. Finally, skip and limit only manage to perform relatively
close to their imperative counterparts for the largest input size.

[Figure: slowdown factor vs. input size; series: distinct, limit, skip, sorted]

Figure 5.4: Slowdown factors for the pipelines which isolate stateful
intermediate operations.

5.3 RQ 1.2: Terminal operations


RQ 1.2 How do terminal operations perform?

Figure 5.5 shows the average slowdown factors for the pipelines which isolate
terminal operations. Exact slowdown factors are shown in Appendix A. Just
like the intermediate operations, the terminal operations perform much better
with larger input sizes. However, the curve is steeper, with worse performance
at smaller input sizes, and slightly better performance at larger input sizes.
On average, terminal operations perform relatively well for larger input sizes,
being only 24% slower than their imperative counterparts.
Figure 5.6 shows the slowdown factors for the individual pipelines
which isolate terminal operations, divided into two graphs. Just as for the
intermediate operations, the terminal operations which perform the worst for
the smaller input sizes (count and sum) perform the best for the larger input
sizes. collect and forEach, the two most popular terminal operations,

[Figure: average slowdown factor vs. input size (1 to 1 000 000) for the terminal operation pipelines]

Figure 5.5: Average and median slowdown factors for the pipelines which
isolate terminal operations.

perform quite well, and are less than 32% slower than their imperative
counterparts for larger input sizes. forEach is sometimes slightly faster
than its imperative counterpart. The worst performing terminal operation is
reduce.

5.4 RQ 1.3: Pipeline length


RQ 1.3 How does pipeline length affect performance?

Figure 5.7 shows how, on average, eight of the ten most popular terminal
operations perform with zero, one, or two intermediate operations. As
explained in the method chapter, the used intermediate operations are filter
and map. When the terminal operations are used with other intermediate
operations, the performance results could differ.
This figure seems to imply that adding one intermediate operation does
not change the performance, but that adding two intermediate operations
slightly decreases performance. However, the average values are somewhat
misleading, since different terminal operations show very different results,
with no overarching pattern. Graphs with slowdown factors for each individual
terminal operation, used with zero, one, or two intermediate operations, are

Figure 5.6: Slowdown factors for the pipelines which isolate terminal
operations.

[Figure: average slowdown factor vs. input size; series: 0, 1, and 2 intermediate operations]

Figure 5.7: Average slowdown factors for pipelines with zero, one and two
intermediate operations.

shown in Appendix B. The terminal operations can be put into four different
groups, based on pipeline length performance.

• anyMatch and allMatch perform best when used in isolation∗.

• sum, count, and reduce perform best when used in isolation, or with
one intermediate operation.

• toArray and collect perform best when used with at least one
intermediate operation.

• forEach performance is not impacted by the number of intermediate
operations used.

The biggest outlier is the pipeline source-collect, which is one of
the worst performing pipelines examined in this project. It is an order
of magnitude slower than its imperative counterpart for most input sizes.
However, when collect is used with at least one intermediate operation,
it performs relatively well, being less than 33% slower than its imperative
counterpart for the larger input sizes. Other terminal operations are not as
dramatically affected by pipeline length.

∗ "Used in isolation" means that no intermediate operation is used.

5.5 RQ 1.4: Parallel pipelines


RQ 1.4 How do parallel pipelines perform?

In this section, wallclock∗ runtimes are shown instead of slowdown factors,
since they give a better view of the relevant performance comparisons. Figure
5.8 shows the absolute runtimes for three different implementations of a
pipeline which isolates the sum operation. The "Sequential imperative"
version is the pipeline source-sum implemented with a sequential (i.e.,
non-parallel) for-loop. The "Parallel imperative" version is this pipeline
implemented with parallel Java threads, where each thread works on a separate
part of the input array. Finally, the "Parallel stream" version is simply the
parallel pipeline source-parallel-sum.
The parallel versions only manage to outperform the sequential versions for
input sizes above 1 000 000, usually being 2 to 3 times faster. For smaller input
sizes, the sequential code is dramatically faster than the parallel versions. In
other words, parallelizing this pipeline is only worthwhile for very large input
sizes. Furthermore, the parallel stream performs much better than the parallel
Java threads for all input sizes below 10 000 000, usually being 2 to 4 times
faster. For input sizes larger than that, both parallel versions perform
identically.
Figure 5.9 similarly shows the runtimes for two parallel versions and
one sequential version of the pipeline source-map-toArray. For this
pipeline, the parallel versions never manage to outperform the sequential
version. In other words, parallelizing this pipeline never gives any
performance gains, regardless of whether streams or Java threads are used.
Instead, all three versions perform identically for the larger input sizes. For
the smaller input sizes, the performance is very similar to the previously shown
parallel pipeline.

5.6 RQ 2: Algorithms with streams


RQ 2 How does the utilization of streams impact the performance of popular
algorithms?

∗ Exactly how the time is measured by JMH is seemingly not described in the
documentation. It seems that it is not technically wallclock time that is measured; however,
we use that word in order to communicate that what is measured is not the total CPU time
accumulated over all threads. Instead, it is something more akin to real time/wallclock time,
i.e., the time between when the program starts and when the program ends.

[Figure: runtime (ns) vs. input size, log scale; series: Sequential imperative, Parallel stream, Parallel imperative]

Figure 5.8: Runtimes for the parallel pipeline source-parallel-sum,
and imperative implementations of it (sequential and parallel).

[Figure: runtime (ns) vs. input size, log scale; series: Sequential imperative, Parallel stream, Parallel imperative]

Figure 5.9: Runtimes for the parallel pipeline source-parallel-map-toArray,
and imperative implementations of it (sequential and parallel).

Figure 5.10 shows the slowdown factors for the FFT algorithm, i.e., the
runtime of the stream version of FFT divided by the runtime of the imperative
version of FFT. For the larger input sizes, the stream version of FFT is 2 to 3
times slower than its imperative counterpart.

[Figure: slowdown factor vs. input size for the stream version of FFT]

Figure 5.10: Slowdown factors for the FFT algorithm.

Figure 5.11 shows the slowdown factor for Dijkstra's algorithm. The
results are different from those of the FFT algorithm, with the stream version of
Dijkstra's algorithm generally performing much better at all input sizes than
the stream version of the FFT algorithm. For all input sizes except the smallest
one, the stream version of Dijkstra's algorithm performs almost identically to
its imperative counterpart, i.e., no slowdown is seen for these input sizes,
whereas the FFT algorithm is 2 to 6 times slower than its imperative
counterpart for these input sizes.

5.7 Validity
JMH gives a 99.9% confidence interval for each benchmarked pipeline
and input size, assuming a normal distribution. The error is half
this interval. Dividing the runtime error (half the confidence interval) by the
runtime gives an error in terms of percentages. These percentage errors are
shown in Table 5.1, for different categories of pipelines and input sizes.

[Figure: slowdown factor vs. input size for the stream version of Dijkstra's algorithm]

Figure 5.11: Slowdown factors for Dijkstra’s algorithm.

Input size   Stateless  Stateful  Terminal  X-Terminal  X-Y-Terminal  Parallel
1            0.73%      1.04%     0.74%     2.78%       1.71%         2.80%
3            0.78%      1.72%     0.92%     0.94%       1.76%         1.51%
10           0.54%      1.56%     0.59%     0.75%       1.58%         1.06%
100          0.50%      1.27%     1.01%     1.32%       0.74%         2.24%
1 000        0.78%      0.89%     0.75%     4.01%       1.09%         4.23%
10 000       0.81%      0.60%     1.78%     6.18%       3.10%         1.86%
100 000      1.80%      0.74%     2.53%     3.57%       3.18%         2.06%
1 000 000    4.54%      8.14%     2.63%     2.26%       3.78%         4.13%

Table 5.1: Average runtime error rates of the tested pipelines. The error is half
the 99.9% confidence interval divided by the runtime.

In general, the error is higher for larger input sizes. This is presumably
explained by the fact that larger input sizes result in more memory
management, which can trigger the garbage collector relatively sporadically,
which in turn gives less consistent results. Furthermore, pipelines with terminal
operations which return arrays or collections (toArray and collect) have
higher error rates than operations which return scalar values, which can also
be explained by higher rates of required memory management.
The average error for the different categories of pipelines is between 1.31%
and 2.72% (the median error rates are slightly lower), and the pipelines with
one intermediate operation have the highest average error (2.72%). The
highest error rate for an individual pipeline is 25.43%, but such high error
rates are only reached in a few cases, and only when input sizes are 100 000
elements or larger. This error rate occurred for the pipeline source-
flatMap-map-collect at input size n = 100 000, and it is presumably
this high due to the usage of flatMap, which requires a lot of memory
management and garbage collection. If the benchmarks are run on a computer
with more memory, or better hardware in general, then the error rates may
be lower, since garbage collection may not be triggered as often.
The FFT benchmarks have an average error rate of 2.51%, and the
benchmarks for Dijkstra’s algorithm have an average error rate of 2.51%.

Chapter 6

Discussion

This chapter discusses the results of this project, and gives general insights and
thoughts on stream performance and how it can be benchmarked correctly.

6.1 General results


It is clear that the biggest factor which impacts (sequential) stream
performance is input size. Almost all stream pipelines perform much worse
for smaller input sizes, on average being multiple times slower than imperative
Java. For larger input sizes, they are on average only 39% to 74% slower than
imperative Java. Furthermore, the average is slightly skewed because of some
outliers, and thus the median pipeline performs slightly better.
However, the performance range for different stream pipelines is quite
large, with some pipelines performing much worse than imperative code,
and other pipelines performing very similarly to imperative Java. There
are multiple other factors which determine how well streams perform in
comparison to imperative Java. Stateful operations perform slightly worse
than stateless operations in comparison to imperative Java.
Another observation is that the stream operations which perform the worst
for smaller input sizes, sometimes also perform the best for larger input sizes,
when comparing their performance with imperative Java (however, this is
not a general rule). Examples of this include sum and mapToInt. These
operations have much lower runtimes than most other operations, which
explains why they perform badly at lower input sizes, since the overhead cost of
creating streams dominates more if the runtime is low. As the input size grows,
the overhead cost of creating the stream stops being as much of a dominating
factor, and thus the performance loss from using streams instead of imperative
Java is not as large.


The most popular operations, specifically filter, map, collect, and
forEach, all perform relatively well. For all input sizes larger than 100 they
are less than 50% slower than their imperative counterparts, at least for
the pipelines tested in this project. However, this is only true for collect
when it is not used in isolation. When it is used in isolation, it is simply
converting one data structure to another, and in this case it performs much
worse than when used with at least one intermediate operation.
Thus, a conclusion is that using collect without intermediate operations
should be avoided. Fortunately, there exist many non-stream functions in Java
which also simply convert one data structure to another, and they generally are
much faster than collect. However, even though collect performs best
when not used in isolation, the opposite is true for other functions, such as
anyMatch. Thus, there is no clear correlation between pipeline length and
performance. Instead, it depends on which terminal operation is used.
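To make the distinction concrete, the following minimal sketch (with hypothetical data) contrasts collect used in isolation with a non-stream conversion such as the ArrayList copy constructor:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class CollectInIsolation {
    public static void main(String[] args) {
        List<String> source = List.of("a", "b", "c");

        // source-collect in isolation: the stream only converts one
        // data structure to another.
        List<String> viaStream = source.stream().collect(Collectors.toList());

        // Equivalent non-stream conversion, which is generally much faster.
        List<String> viaConstructor = new ArrayList<>(source);

        System.out.println(viaStream.equals(viaConstructor)); // true
    }
}
```

Both calls produce the same list; only the second avoids the overhead of constructing a stream pipeline.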
Only two parallel pipelines are benchmarked in this project, and thus more
experimentation is required in order to draw conclusions in regards to parallel
stream performance. The results suggest that using parallel streams is
generally much faster than using parallel Java threads, for all input sizes
below 10 000 000. Above that size, parallel streams and parallel Java
threads give similar results.
Additionally, using parallel streams results in much faster code than
sequential alternatives, but only for input sizes larger than 1 000 000. For
smaller input sizes, performance is lost, not gained. Thus, parallel streams
should only be used for very large input sizes. However, the pipeline
source-map(x -> x * x)-toArray does not perform better when
run in parallel instead of sequentially. This is likely due to the fact that the
runtime is dominated by the sequential work of collecting the elements to a
single array. If a more complex map function would have been used (such as
calculating a crypto hash), then the parallel version(s) may have been faster
than the sequential version. More experimentation is required in regards to
how parallel pipelines perform in comparison to sequential pipelines, and how
this differs for different types of stream operations.
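The pipeline in question can be sketched as follows; switching between the sequential and parallel versions takes only a single parallel() call, while, as noted above, the final array assembly remains largely sequential work:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelMapToArray {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Sequential version of the source-map(x -> x * x)-toArray pipeline.
        long[] sequential = IntStream.range(0, n).mapToLong(x -> (long) x * x).toArray();

        // Parallel version: one extra call. The mapping runs in parallel, but
        // combining the partial results into one array limits the speed-up.
        long[] parallel = IntStream.range(0, n).parallel().mapToLong(x -> (long) x * x).toArray();

        System.out.println(Arrays.equals(sequential, parallel)); // true
    }
}
```

Because ordered parallel streams preserve encounter order, both versions produce identical arrays.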
Our results also indicate that creating parallel streams has a larger
overhead cost than creating sequential streams, though it is not as large as
the overhead cost of creating imperative Java threads. This creation cost is
likely the reason for why the runtimes of the parallel programs are relatively
constant up to a certain point. This suggests that when using parallel streams
(or threads), it is even more crucial to use larger input sizes, if performance is
of importance.

6.2 Smaller input sizes


As mentioned in the background chapter in Section 2.5.1, the overhead cost
of creating a stream is much bigger than that of creating a simple for-loop.
This is likely the reason why streams perform worse at smaller input sizes. For
smaller input sizes, the overhead cost of creating the stream is the dominating
performance factor, whereas for larger input sizes the actual processing of each
individual element is what takes most of the runtime. Thus, developers
should be mindful that stream usage comes at a much larger performance loss
for smaller input sizes.
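The effect can be illustrated with a small, hypothetical input: the stream version must construct a pipeline object before any element is processed, while the loop starts immediately.

```java
import java.util.stream.IntStream;

public class SmallInputOverhead {
    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5};

        // Stream version: pays the fixed cost of building a pipeline, which
        // dominates the total runtime when the input is this small.
        int streamSum = IntStream.of(data).sum();

        // Imperative version: a plain loop with no setup cost.
        int loopSum = 0;
        for (int x : data) {
            loopSum += x;
        }

        System.out.println(streamSum == loopSum); // true, both are 15
    }
}
```

For five elements the pipeline setup dwarfs the per-element work; for millions of elements the per-element work dominates instead.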

6.3 Garbage collection


As explained in the background chapter in Section 2.3.1, garbage collection
can be triggered relatively sporadically during benchmark runs, which can
increase the error rates of measured runtimes. Biboudis et al. measured stream
performance and generally attempted to avoid garbage collection in their
benchmarks [5]. They did so by forcing garbage collection before execution
using JMH, and by only using terminal operations which return scalar values
(instead of, for example, lists or arrays). Thus, they did not use the most
popular terminal operation collect, since it returns a collection [7].
In this project, a different approach towards garbage collection is taken.
The popular collect operation is used in many pipelines. This results in
some garbage collection, which can result in less consistent benchmarks and
higher error rates. However, allowing for some garbage collection could be
seen as more realistic, since Java programs generally do result in garbage
collection, and because it allows us to benchmark the most popular stream
operations.

6.4 Errors & Larger input sizes


The benchmarks were initially also run for the input size 10 000 000, however
those results are not presented in this thesis because of higher error rates. The
higher error rates for that input size are likely because that number of objects
can result in more garbage collection, which increases the error rate. However,
some benchmarks at that input size have smaller error rates, and they show
relatively similar slowdown factors as for smaller input sizes, indicating that
increasing the input size beyond 1 000 000 gives diminishing returns in terms
of stream performance.

6.5 Algorithms
The stream version of the FFT algorithm performs worse than the individual
stream pipelines. For the larger input sizes, the stream version of FFT is 2 to
3 times slower than its imperative counterpart, whereas the average pipeline is
only around 1.5 to 1.7 times slower than its imperative counterpart. In contrast
to this, the stream version of Dijkstra’s algorithm performs very well, and is
only slightly slower than its imperative counterpart for almost all input sizes.
This can partly be explained by the fact that many of the calculations
executed during Dijkstra's algorithm occur inside a priority queue, although
the functions which act on this queue are called using streams. Thus, a large
part of the work is not done directly by streams; it instead happens inside
certain data structures.
Furthermore, Dijkstra’s algorithm included larger, but fewer, streams than
the stream version of the FFT algorithm. Streams in general perform better
with larger input sizes, and thus this further explains why Dijkstra’s algorithm
performed better than the FFT algorithm when using streams. Another
explanation for why the stream version of Dijkstra’s algorithm performs
relatively well is that it frequently uses the forEach operation, which is
among the best performing terminal operations.
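The structure described above can be sketched as follows. This is not the benchmarked implementation (which is based on Sedgewick and Wayne's code [26]), but a minimal, hypothetical illustration of how a stream forEach can drive the relaxation step while a priority queue does the bulk of the work:

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class DijkstraStreamSketch {
    public static void main(String[] args) {
        // Hypothetical adjacency list: adj.get(v) holds {neighbor, weight} pairs.
        List<List<int[]>> adj = List.of(
                List.of(new int[]{1, 4}, new int[]{2, 1}), // edges from vertex 0
                List.<int[]>of(),                          // vertex 1 has no edges
                List.of(new int[]{1, 2}));                 // edges from vertex 2

        int[] dist = {0, Integer.MAX_VALUE, Integer.MAX_VALUE};

        // The priority queue does much of the work; streams only drive the
        // relaxation step, which helps keep the stream version competitive.
        PriorityQueue<Integer> pq = new PriorityQueue<>((a, b) -> dist[a] - dist[b]);
        pq.add(0);
        while (!pq.isEmpty()) {
            int v = pq.poll();
            // A stream forEach replaces the imperative for-loop over the edges.
            adj.get(v).stream().forEach(e -> {
                if (dist[v] + e[1] < dist[e[0]]) {
                    dist[e[0]] = dist[v] + e[1];
                    pq.add(e[0]);
                }
            });
        }
        System.out.println(Arrays.toString(dist)); // [0, 3, 1]
    }
}
```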

6.6 Performance trade-off


The results suggest that stream usage usually results in at least some
performance loss, however this does not necessarily mean that streams are not
worth using, since there are other possible perks to gain from using streams.
Streams are functional in nature, and it is often claimed that functional code
is more maintainable, more readable, has fewer bugs, is easier to test, and is
easier to use for parallelism. However, these claims are somewhat subjective,
and there is only limited research on the topic [35] [36]. However, it is clear
that streams are very simple to parallelize, since this is possible with a single
method call. In contrast, writing imperative parallel Java code often requires
much more code and deliberation.
Furthermore, certain behaviors are easier to express with streams instead of
for-loops, for example some of the more sophisticated use cases of flatMap.
Ultimately, since the performance loss of using streams is relatively minimal
for larger input sizes, especially for the most popular stream operations,
developers might find that using streams is still worthwhile.
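For example, flattening a nested list is a single pipeline stage with flatMap, whereas the imperative equivalent needs a nested loop (a minimal sketch with hypothetical data):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class FlatMapExpressiveness {
    public static void main(String[] args) {
        List<List<Integer>> nested = List.of(List.of(1, 2), List.of(3), List.of(4, 5));

        // flatMap expresses the flattening as a single pipeline stage...
        List<Integer> flatStream = nested.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());

        // ...whereas the imperative equivalent needs a nested loop.
        List<Integer> flatLoop = new ArrayList<>();
        for (List<Integer> inner : nested) {
            for (Integer x : inner) {
                flatLoop.add(x);
            }
        }

        System.out.println(flatStream.equals(flatLoop)); // true, both are [1, 2, 3, 4, 5]
    }
}
```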

6.7 Previous Research


As described in Section 2.7.1, Biboudis et al. compared the performance of
five stream pipelines and their imperative counterparts, with the input size
10 000 000 [5]. One of the pipelines they examined was 12.7 times slower than
its imperative counterpart. In contrast to this, most pipelines tested in this
project are only around 20%-40% slower than their imperative counterparts for
similar input sizes. One reason why this pipeline in particular was so much
slower is that it used the flatMap operation, which is among the worst
performing operations tested in this project.
The benchmarked code used by Biboudis et al. is available on GitHub [37],
and when that benchmarking code was run on the computers which are used
for benchmarking in this project, similar results to theirs were found. This
indicates that the (few) differences in performance results are likely because
they simply tested different stream pipelines, not because of external factors
such as hardware or which version of Java is used.∗
However, two of the five pipelines Biboudis et al. benchmarked performed
very similarly to their imperative counterparts, which is in line with the results
of this project. It is clear that the effect streams has on performance depends
not only on input size, but also on factors such as input type and which pipeline
is benchmarked.


∗ Biboudis et al. used Java version 8, whereas this project uses Java version 11.

Chapter 7

Conclusions and Future work

In this chapter, we present our conclusions and answer the specified research
questions. We also explain the limitations of this project, and what future
works can examine. Finally, we reflect on this project and what the results
mean for software developers.

7.1 Conclusions
In this project, the runtime performance of Java streams are compared to
the runtime performance of equivalent Java code written without the usage
of streams, referred to as imperative Java. The tested stream pipelines are
constructed in a way that represents how streams are realistically used in
GitHub projects [7]. The streams are mimicked by imperative code written with
for-loops and if-statements. Since there are different types of streams and
stream operations, the answer to how they compare to imperative Java in terms
of runtime performance depends on several factors, such as input size, operation
type, and pipeline length. Listed below are the research questions which
address this, along with conclusions found in this project.

RQ 1 Considering how Java streams are commonly used, how do stream
pipelines perform in comparison to imperative loops in terms of runtime,
and how does the performance vary with input size?

The conclusion for RQ 1 is that streams in general perform worse than
imperative loops, however how much worse heavily depends on input size.
Specifically, the average pipeline is several times slower than its imperative
counterpart for smaller input sizes, whereas it is only 40% to 70% slower for
larger input sizes. The performance range for different pipelines is quite large,
however no tested pipeline is more than one order of magnitude slower than its
imperative counterpart. In a few cases, (sequential) streams manage to slightly
outperform imperative Java. The reason for why streams perform worse at
smaller input sizes is likely the high overhead cost of creating streams, which
is the dominating performance factor at smaller input sizes.
Previous research has found that most stream pipelines are only slightly
slower than their imperative counterparts [7]. This project shows that while
this is true for most streams at larger input sizes, it is clearly not true when
streams are used with smaller input sizes. Thus, if the performance of a
program is of importance, then it is recommended that streams should only
be used with large input sizes. If small input sizes are used, then it is
instead recommended to use imperative alternatives, such as for-loops. To our
knowledge, this project is the first to look into the correlation between input
size and stream performance.

RQ 1.1 How do intermediate operations perform, and how does their type
impact their performance?

When intermediate operations are paired with the most popular terminal
operation collect (or in some cases toArray), they are on average
between 12% and 65% slower than their imperative counterparts, for larger
input sizes. The most popular intermediate operations, filter and map,
perform relatively well.
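As an illustration, a pipeline in the style of source-filter-map-collect and the kind of imperative counterpart used as a baseline in the benchmarks could look as follows (the data and lambdas here are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class FilterMapCollect {
    public static void main(String[] args) {
        List<Integer> source = List.of(1, 2, 3, 4, 5, 6);

        // Stream pipeline: filter and map paired with collect.
        List<Integer> viaStream = source.stream()
                .filter(x -> x % 2 == 0)
                .map(x -> x * x)
                .collect(Collectors.toList());

        // Imperative counterpart: a for-loop with an if-statement.
        List<Integer> viaLoop = new ArrayList<>();
        for (int x : source) {
            if (x % 2 == 0) {
                viaLoop.add(x * x);
            }
        }

        System.out.println(viaStream.equals(viaLoop)); // true, both are [4, 16, 36]
    }
}
```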

RQ 1.2 How do terminal operations perform?

The pipelines which aim to isolate the performance of terminal operations are
on average between 14% and 23% slower than their imperative counterparts,
for the larger input sizes. Just like the intermediate operations, they perform
much worse at smaller input sizes. The two most popular terminal operations,
collect and forEach, perform relatively well.

RQ 1.3 How does pipeline length affect performance?

In this project, eight of the most popular terminal operations are tested with
zero, one and two intermediate operations (map and filter). Pipeline
length does affect performance, however there is no clear pattern. Some
terminal operations, for example anyMatch, perform best when used in
isolation. Other operations, such as collect, perform best when used
in conjunction with at least one intermediate operation. We conclude that
collect should never be used in isolation, since there are equivalent
functions in Java which perform much better. Interestingly, the pipeline
source-collect, where collect is used in isolation, is the third most
popular pipeline [12].
RQ 1.4 How do parallel pipelines perform?
Since streams are only very rarely used for parallel execution [7], parallel
pipelines are only examined in a limited manner in this project. We find that
in general, parallel streams perform much better than parallel Java threads,
for input sizes below 10 000 000. For input sizes larger than that, they
perform similarly. Furthermore, parallel stream pipelines are only faster than
sequential alternatives for input sizes above 1 000 000. For one of the tested
pipelines, parallelization did not give any performance gains for any input
size. Additionally, we observe that the overhead cost of creating parallel
streams is much larger than that of creating sequential streams. However, we
acknowledge that more testing in relation to parallel streams is required for
strong conclusions to be drawn in relation to their performance.
RQ 2 How does the utilization of streams impact the performance of popular
algorithms?
In this project, two popular algorithms are implemented with and without Java
streams, and then benchmarked with different input sizes. Both algorithms
perform better for larger input sizes. The stream version of FFT is always at
least twice as slow as its imperative counterpart, whereas Dijkstra's algorithm
experienced almost no performance loss at all from the usage of streams.
We conclude that the usage of (sequential) streams in algorithms can affect the
performance differently depending on the implementation of the algorithm in
question.
However, it is unlikely that sequential streams will improve algorithmic
performance. Furthermore, algorithms which incorporate streams with small
input sizes are likely to be more adversely affected by streams in terms of
performance. Finally, we acknowledge that more testing in relation to how
streams affect algorithmic performance is required for strong conclusions to
be drawn.

7.2 Limitations and future work


The limitations of this project can be directly tied to areas which could be
explored more deeply in the future. In this project, streams are examined as
they are commonly used in public GitHub projects. However, streams can
be used in many different ways, and in order to give a broader view of stream
performance, more benchmarking is required. This would not necessarily give
insights into how streams realistically perform, but it would show the potential
of stream performance, when they are used in unconventional ways. If future
studies find that streams can perform better under different circumstances, then
this could potentially influence how developers choose to use them. Listed
below are some examples of how streams can be benchmarked in more ways.

• More types of stream sources can be used (such as sets), and more
types of operations which create stream objects can be used (specifically
concat).

• The stream sources can contain larger objects.

• When determining how pipeline length affects the performance of
streams, different intermediate operations can be used (i.e., not just the
two most popular ones, filter and map).

• Each operation can be tested with multiple different lambda expressions.
This could affect the performance of the operation.

• Specific operations can be used in different ways. Firstly, collect
can be used to convert streams into other types than just array lists (e.g.,
sets or queues). Secondly, sorted can be tested on different data sets,
for example on data that is mostly sorted, or descending data, which
arguably could be seen as more realistic than shuffled data. Thirdly,
reduce can be tested with more complex lambda expressions which
result in more sophisticated reductions. Finally, flatMap can be used
and tested in more sophisticated ways.
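As a sketch of the first point above, Collectors.toCollection lets collect target collection types other than array lists (the data here is hypothetical):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CollectToOtherTypes {
    public static void main(String[] args) {
        // Collecting into a set instead of an array list removes duplicates,
        // and its performance could be benchmarked separately.
        Set<String> set = Stream.of("a", "b", "a")
                .collect(Collectors.toCollection(LinkedHashSet::new));

        System.out.println(set); // [a, b]
    }
}
```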

Something which clearly requires more examination is the performance of
parallel streams. They are only very rarely used, however if it is found that
they can perform well, then developers might start using them more frequently.
This is especially true since streams were specifically implemented to be easily
parallelized.∗ Thus, performance gains can potentially be achieved with relative
ease. Specifically, algorithms could be implemented with parallel streams
instead of just sequential streams, which potentially could result in much better
performance, as indicated by the parallel benchmarks in this project.

∗ In order to parallelize a stream pipeline, all that is required is a single call to the method
parallel().

As mentioned, Rosales et al. found that input sizes for streams are
relatively small. However, this might not represent real stream usage, since
they only queried source code executed by unit tests, and because they may
not have had access to the real input data. Further studies could analyze more
deeply how large stream data sources commonly are, for example by analyzing
projects with real input data. This is of interest, since input size clearly has a
large impact on performance. If the findings of Rosales et al. are confirmed,
then this would indicate that streams, as they are commonly used, are much
slower than imperative Java.
As mentioned, the benchmarked streams in this project are supposed to
represent real stream usage. This is the reason for why many of the tested
pipelines use the most popular terminal operation collect. As detailed in
Section 2.3.1, using this operation often results in more garbage collection,
which in turn can make the benchmarking results slightly less consistent.
In future studies, terminal operations which return scalar values instead of
collections can be used, in order to reduce the amount of needed garbage
collection. This would result in less realistic usage of streams, however it
could show different performance results, since the error rates likely would be
lower.
Furthermore, stream performance can be examined when they are used
in real-world projects. This is more complicated than testing individual
stream pipelines, because of how hard it is to accurately benchmark Java
programs, however if done correctly it could give more insights into realistic
stream performance. Relating to this, the effect streams have on performance
when they are used in algorithms can be explored more extensively, since the
findings in this project suggest that different algorithms are affected differently
by the usage of streams.
Multiple studies have successfully improved stream performance
[29] [30] [22] [8] [31] [5]. It would be interesting to examine how
their improved streams compare to traditional imperative loops. Since many
stream pipelines already are close to being as fast as imperative loops, it is
likely that improved streams could outcompete loops, for example
by reducing the number of required virtual calls per stream element [5].
Adjacent topics which do not relate to the performance of streams can
also be explored more deeply. Many advocates of functional programming
claim that functional programming results in fewer bugs, higher readability,
and more easily maintainable code (there is only limited research on the topic
[35] [36]). Whether or not this is true for Java streams could be interesting
to explore by future works. This is especially interesting since Java is known
to be relatively verbose, which can reduce readability. It is clear that using
streams sometimes results in slower programs, however the trade-off might be
worth it, considering the other possible perks. Ultimately, it depends on how
much performance the developer is ready to sacrifice.

7.3 Reflections
The results and conclusions of this thesis are presented with enough detail to be
useful for Java developers who need specific information on different aspects
of stream performance. For example, this thesis provides specific performance
information on the 20 most popular stream operations, for several input sizes.
This can help developers make more specific choices in regards to when they
should use streams, and how much the performance will be impacted by this.
Thus, the conclusions of this study are of interest to developers who write
programs where performance is important, especially if they are interested
in functional programming and/or Java. Developers may find that the perks
of using streams compensate for the potential loss of performance, especially
since the performance loss in many cases is relatively small. It ultimately
depends on whether or not the developer believes that the trade-off is worth it.
However, if performance is of importance, then we recommend that
developers avoid using streams with small input sizes, and they should also
abstain from using certain pipelines, such as source-collect, which
surprisingly is the third most popular pipeline [12]. Furthermore, at least some
types of pipelines which use flatMap should also be avoided. In these cases,
we recommend developers to instead use imperative alternatives, such as for-
loops. Furthermore, if performance is the highest priority, then we recommend
developers to not use (sequential) streams at all, since they only rarely result
in performance gains.
These findings have environmental, societal, and economic impacts.
Firstly, our findings can help developers create better performing Java
programs, by showing them when streams should and should not be used in
order to optimize for performance. Consequently, better performing programs
require fewer CPU cycles, which in turn leads to less power used,
ultimately reducing the environmental impact of the programs. Thus, our
findings can help developers make more mindful decisions in relation
to sustainable development.
As described in Section 6.6, stream usage has many possible perks, such
as more maintainable code, fewer bugs, and simpler parallelization (there
is only limited research on this topic [35] [36]). Since our findings show
that stream usage often only results in slightly slower programs, this might
encourage developers to use streams more frequently, thus possibly gaining
the aforementioned perks of functional programming. This could result in less
stress, less time needed to complete software projects, and in general better
workplace environments, which are important societal and ethical aspects to
take into consideration. Furthermore, this could have economic benefits, as
it is cheaper to maintain more stable code bases, since doing so requires fewer
work hours.
Java has been one of the most used programming languages for the last
few decades, and it is still widely popular [1]. Because of this, the results of
this project are likely to remain relevant for the foreseeable future. In relation
to this, projects that assess the performance of programming languages are
important, because they assist language developers in prioritizing their efforts
toward relevant aspects of the language.
For example, if Oracle (the developers of the Java language) chose to focus
their efforts on improving the performance of streams because of a project
such as this one, then this could have huge implications, since even a tiny
performance improvement could save copious amounts of clock cycles, since
streams are used relatively frequently [7].
In summary, our findings can help developers create better performing,
and possibly more maintainable, programs. This can result in better work
environments and more sustainable products, which require less capital to
maintain.

References

[1] The top programming languages. [Online]. Available: https://octoverse.github.com/2022/top-programming-languages [Pages 1 and 65.]

[2] A small place to discover languages in github. [Online]. Available: https://madnight.github.io/githut/#/pull_requests/2022/4 [Page 1.]

[3] Java streams documentation. [Online]. Available: https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html [Pages 2, 6, 8, 9, and 12.]

[4] G. Krantz, “A performance comparison of clojure and java,” 2020. [Pages 3 and 6.]

[5] A. Biboudis, N. Palladinos, and Y. Smaragdakis, “Clash of the lambdas,” arXiv preprint arXiv:1406.6631, 2014. [Pages 3, 18, 19, 20, 23, 31, 37, 55, 57, and 63.]

[6] F. Schiavio, A. Rosà, and W. Binder, “Sql to stream with s2s: An automatic benchmark generator for the java stream api,” in Proceedings of the 21st ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, 2022, pp. 179–186. [Pages 3, 5, 23, and 24.]

[7] E. Rosales, A. Rosà, M. Basso, A. Villazón, A. Orellana, Á. Zenteno, J. Rivero, and W. Binder, “Characterizing java streams in the wild,” in 2022 26th International Conference on Engineering of Complex Computer Systems (ICECCS). IEEE, 2022, pp. 143–152. [Pages 3, 6, 8, 12, 13, 14, 15, 16, 17, 20, 24, 25, 27, 30, 37, 55, 59, 60, 61, and 65.]

[8] A. Møller and O. H. Veileborg, “Eliminating abstraction overhead of java stream pipelines using ahead-of-time program optimization,” Proceedings of the ACM on Programming Languages, vol. 4, no. OOPSLA, pp. 1–29, 2020. [Pages 3, 20, 23, and 63.]

[9] R. Khatchadourian, Y. Tang, M. Bagherzadeh, and B. Ray, “An empirical study on the use and misuse of java 8 streams,” in FASE, 2020, pp. 97–118. [Pages 6, 12, and 16.]

[10] R. Hundt, “Loop recognition in c++/java/go/scala,” 2011. [Page 6.]

[11] H. Tanaka, S. Matsumoto, and S. Kusumoto, “A study on the current status of functional idioms in java,” IEICE TRANSACTIONS on Information and Systems, vol. 102, no. 12, pp. 2414–2422, 2019. [Pages 12, 13, 15, 16, and 27.]

[12] J. Nostas, J. P. S. Alcocer, D. E. Costa, and A. Bergel, “How do developers use the java stream api?” in Computational Science and Its Applications–ICCSA 2021: 21st International Conference, Cagliari, Italy, September 13–16, 2021, Proceedings, Part VII 21. Springer, 2021, pp. 323–335. [Pages 12, 13, 14, 16, 27, 61, and 64.]

[13] A. Villazón, H. Sun, A. Rosà, E. Rosales, D. Bonetta, I. Defilippis, S. Oporto, and W. Binder, “Automated large-scale multi-language dynamic program analysis in the wild (tool insights paper),” in 33rd European Conference on Object-Oriented Programming (ECOOP 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019. [Page 12.]

[14] Y. Zheng, A. Rosà, L. Salucci, Y. Li, H. Sun, O. Javed, L. Bulej, L. Y. Chen, Z. Qi, and W. Binder, “Autobench: Finding workloads that you need using pluggable hybrid analyses,” in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 639–643. [Page 12.]

[15] D. Lea (with the help of B. Goetz, P. Sandoz, A. Shipilev, H. Kabutz, and J. Bowbeer). When to use parallel streams. [Online]. Available: https://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html [Page 16.]

[16] H. Austerlitz, Data acquisition techniques using PCs. Academic press, 2002. [Page 17.]

[17] Understanding jit compilation and optimizations. [Online]. Available: https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/underst_jit.html [Pages 17 and 18.]

[18] Avoiding benchmarking pitfalls on the jvm. [Online]. Available: https://www.oracle.com/technical-resources/articles/java/architect-benchmarking.html [Pages 18 and 36.]

[19] D. Costa, C.-P. Bezemer, P. Leitner, and A. Andrzejak, “What’s wrong with my benchmark results? studying bad practices in jmh benchmarks,” IEEE Transactions on Software Engineering, vol. 47, no. 7, pp. 1452–1467, 2021. doi: 10.1109/TSE.2019.2925345 [Pages 18, 19, 32, and 36.]

[20] Official page for jmh. [Online]. Available: https://openjdk.org/projects/code-tools/jmh/ [Page 19.]

[21] Spliterator official documentation. [Online]. Available: https://docs.oracle.com/javase/8/docs/api/java/util/Spliterator.html [Page 19.]

[22] O. Kiselyov, A. Biboudis, N. Palladinos, and Y. Smaragdakis, “Stream fusion, to completeness,” in Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, 2017, pp. 285–299. [Pages 20, 23, and 63.]

[23] Escape analysis. [Online]. Available: https://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html#escapeAnalysis [Page 20.]

[24] R. Sedgewick and K. D. Wayne, Algorithms. W. Ross MacDonald School Resource Services Library, 2017. [Pages 22 and 31.]

[25] The fft algorithm, implemented in java by sedgewick and wayne. [Online]. Available: https://github.com/kevin-wayne/algs4/blob/master/src/main/java/edu/princeton/cs/algs4/FFT.java [Page 22.]

[26] Dijkstra’s algorithm (single-source), implemented in java by sedgewick and wayne. [Online]. Available: https://github.com/kevin-wayne/algs4/blob/master/src/main/java/edu/princeton/cs/algs4/DijkstraSP.java [Page 23.]

[27] Java implementation of an indexed priority queue by sedgewick and wayne. [Online]. Available: https://github.com/kevin-wayne/algs4/blob/master/src/main/java/edu/princeton/cs/algs4/DijkstraAllPairsSP.java [Page 23.]

[28] V. Niculescu, D. Bufnea, and A. Sterca, “Enhancing java streams api with powerlist computation,” in 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2020, pp. 375–384. [Page 23.]

[29] M. Basso, F. Schiavio, A. Rosà, and W. Binder, “Optimizing parallel java streams,” in 2022 26th International Conference on Engineering of Complex Computer Systems (ICECCS). IEEE, 2022, pp. 23–32. [Pages 23 and 63.]

[30] R. Khatchadourian, Y. Tang, and M. Bagherzadeh, “Safe automated refactoring for intelligent parallelization of java 8 streams,” Science of Computer Programming, vol. 195, p. 102476, 2020. [Pages 23 and 63.]

[31] F. Ribeiro, J. Saraiva, and A. Pardo, “Java stream fusion: Adapting fp mechanisms for an oo setting,” in Proceedings of the XXIII Brazilian Symposium on Programming Languages, 2019, pp. 30–37. [Pages 23 and 63.]

[32] A. Ward and D. Deugo, “Performance of lambda expressions in java 8,” in Proceedings of the International Conference on Software Engineering Research and Practice (SERP). The Steering Committee of The World Congress in Computer Science, Computer . . . , 2015, p. 119. [Page 24.]

[33] J. Jurinová, “Performance improvement of using lambda expressions with new features of java 8 vs. other possible variants of iterating over arraylist in java,” Journal of Applied Mathematics, Statistics and Informatics, vol. 14, no. 1, pp. 103–131, 2018. [Page 24.]

[34] Link to input data used for dijkstra’s algorithm. [Online]. Available: https://algs4.cs.princeton.edu/44sp/ [Page 31.]

[35] A. Khanfor and Y. Yang, “An overview of practical impacts of functional programming,” in 2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW). IEEE, 2017, pp. 50–54. [Pages 56, 63, and 64.]

[36] M. Coquand, “Evaluating functional programming for software quality in rest apis,” 2019. [Pages 56, 63, and 64.]

[37] Benchmark code used by biboudis et al., which compares stream performance with imperative loops. [Online]. Available: https://github.com/biboudis/clashofthelambdas [Page 57.]

Appendix A

Slowdown factors

In this appendix, the exact slowdown factors are shown for the pipelines that isolate the performance of individual intermediate and terminal operations.
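As an illustration of the kind of pipeline these tables measure (a sketch with hypothetical names, not the thesis's actual benchmark code), an "isolated" intermediate operation such as map is paired with a cheap terminal operation and compared against an equivalent imperative loop:

```java
import java.util.stream.IntStream;

// Illustrative sketch: one intermediate operation (map) driven by a
// simple terminal operation (sum), next to the equivalent imperative
// loop. Class and method names are hypothetical.
public class IsolatedMapSketch {
    // Stream version: map is the operation under test; sum() consumes it.
    static long mapWithStream(int[] values) {
        return IntStream.of(values).map(v -> v * 2).sum();
    }

    // Imperative baseline doing the same work.
    static long mapWithLoop(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += v * 2;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3, 4, 5};
        System.out.println(mapWithStream(input)); // prints 30
        System.out.println(mapWithLoop(input));   // prints 30
    }
}
```

The slowdown factor for a given input size is then the stream version's execution time divided by the loop version's.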

Input size   filter  flatMap  map   mapToInt  mapToLong  mapToObj
1            3.35    4.12     3.39  6.36      6.33       3.00
3            2.20    3.17     2.62  4.69      4.18       1.89
10           1.77    3.05     1.89  3.16      3.07       1.45
100          1.41    2.70     1.39  1.80      1.74       1.20
1 000        1.36    2.48     1.24  1.34      1.29       1.08
10 000       1.31    1.90     1.07  1.10      1.10       1.19
100 000      1.17    1.62     1.09  1.04      1.02       1.11
1 000 000    1.03    1.30     1.14  0.99      0.97       1.31
Average      1.70    2.54     1.73  2.56      2.46       1.53

Table A.1: Slowdown factors for isolated stateless intermediate operations.



Input size   allMatch  anyMatch  sum    count
1            10.69     5.96      18.71  17.26
3            6.92      3.89      15.37  9.80
10           4.66      2.38      6.43   4.68
100          3.40      1.60      2.13   1.52
1 000        2.46      1.34      1.06   0.99
10 000       1.46      1.19      0.92   0.84
100 000      1.23      1.08      0.81   0.95
1 000 000    1.13      1.20      0.90   1.01
Average      3.99      2.33      5.79   4.63

Table A.2: Slowdown factors for isolated terminal operations.

Input size   findAny  findFirst  collect  toArray  forEach  reduce
1            10.83    10.34      3.41     5.14     4.56     14.06
3            7.37     6.70       2.71     2.89     3.00     9.09
10           5.50     5.02       1.96     1.55     1.14     8.98
100          3.92     3.72       1.43     1.29     1.17     7.16
1 000        2.61     2.53       1.25     1.12     1.20     4.65
10 000       1.77     1.85       1.32     0.80     0.92     2.38
100 000      1.23     1.17       1.20     1.14     0.95     1.80
1 000 000    1.34     1.17       1.03     0.97     1.07     1.79
Average      4.32     4.06       1.79     1.86     1.75     6.24

Table A.3: Slowdown factors for isolated terminal operations.

Input size   distinct  limit  skip  sorted
1            5.40      6.45   4.06  7.59
3            3.52      5.15   1.27  6.27
10           2.67      3.95   1.53  2.17
100          1.69      2.30   2.19  1.54
1 000        1.65      2.56   2.14  1.29
10 000       1.71      2.05   1.76  1.22
100 000      1.79      1.99   1.58  1.23
1 000 000    1.78      1.22   1.10  1.17
Average      2.53      3.21   1.95  2.81

Table A.4: Slowdown factors for isolated stateful intermediate operations.



Appendix B

Pipeline length graphs

In this appendix, the slowdown factors for eight of the ten most popular terminal operations are shown when they are used in pipelines with zero, one and two intermediate operations.
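As a sketch of what the three pipeline lengths look like in code (hypothetical names, not the thesis's benchmark source), the series allMatch, filter_allMatch and filter_map_allMatch correspond to a bare terminal operation and the same terminal operation behind one and two intermediate operations:

```java
import java.util.stream.IntStream;

// Sketch of the three pipeline lengths compared in the graphs below.
// Predicates and mapper are illustrative placeholders.
public class PipelineLengthSketch {
    // allMatch: zero intermediate operations.
    static boolean allMatch(int[] values) {
        return IntStream.of(values).allMatch(v -> v >= 0);
    }

    // filter_allMatch: one intermediate operation.
    static boolean filterAllMatch(int[] values) {
        return IntStream.of(values)
                .filter(v -> v % 2 == 0)
                .allMatch(v -> v >= 0);
    }

    // filter_map_allMatch: two intermediate operations.
    static boolean filterMapAllMatch(int[] values) {
        return IntStream.of(values)
                .filter(v -> v % 2 == 0)
                .map(v -> v + 1)
                .allMatch(v -> v >= 0);
    }

    public static void main(String[] args) {
        int[] input = {0, 1, 2, 3};
        System.out.println(allMatch(input));          // prints true
        System.out.println(filterAllMatch(input));    // prints true
        System.out.println(filterMapAllMatch(input)); // prints true
    }
}
```

Each added intermediate operation lengthens the pipeline without changing the terminal operation, so differences between the three series isolate the cost of pipeline length itself.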

[Figure: slowdown factor versus input size for the pipelines allMatch, filter_allMatch and filter_map_allMatch.]

Figure B.1: Slowdown factors for the terminal operation allMatch with
zero, one and two intermediate operations.

[Figure: slowdown factor versus input size for the pipelines anyMatch, filter_anyMatch and filter_map_anyMatch.]

Figure B.2: Slowdown factors for the terminal operation anyMatch with
zero, one and two intermediate operations.

[Figure: slowdown factor versus input size for the pipelines sum, filter_sum and filter_map_sum.]

Figure B.3: Slowdown factors for the terminal operation sum with zero, one
and two intermediate operations.

[Figure: slowdown factor versus input size for the pipelines forEach, filter_forEach and filter_map_forEach.]

Figure B.4: Slowdown factors for the terminal operation forEach with zero,
one and two intermediate operations.

[Figure: slowdown factor versus input size for the pipelines reduce, filter_reduce and filter_map_reduce.]

Figure B.5: Slowdown factors for the terminal operation reduce with zero,
one and two intermediate operations.

[Figure: slowdown factor versus input size for the pipelines filter_count and map_filter_count.]

Figure B.6: Slowdown factors for the terminal operation count with one and two intermediate operations. It is not tested with zero intermediate operations, for reasons explained in Section 3.4 of the method chapter.

[Figure: slowdown factor versus input size for the pipelines collect, filter_collect and map_filter_collect.]

Figure B.7: Slowdown factors for the terminal operation collect with zero,
one and two intermediate operations.

[Figure: slowdown factor versus input size for the pipelines toArray, filter_toArray and map_filter_toArray.]

Figure B.8: Slowdown factors for the terminal operation toArray with zero,
one and two intermediate operations.
TRITA-EECS-EX-2023:495

www.kth.se
