Mathematical Foundations of Data Science Using

The book 'Mathematical Foundations of Data Science Using R' aims to provide an introduction to the mathematical principles essential for data science, utilizing the R programming language. It is structured into three main parts: an introduction to R, graphics in R, and the mathematical basics of data science, with practical examples and theoretical foundations. The authors emphasize the importance of mathematical understanding and programming proficiency for effective data analysis and decision-making in various fields.

Frank Emmert-Streib, Salissou Moutari, Matthias Dehmer

Mathematical Foundations of Data Science Using R


ISBN 9783110564679
e-ISBN (PDF) 9783110564990
e-ISBN (EPUB) 9783110565027
Bibliographic information published by the Deutsche
Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the
Deutsche Nationalbibliografie; detailed bibliographic data are
available on the Internet at http://dnb.dnb.de.
© 2020 Walter de Gruyter GmbH, Berlin/Boston
Mathematics Subject Classification 2010: 05C05, 05C20, 05C09,
05C50, 05C80, 60-06, 60-08, 60A-05, 60B-10, 26C05, 26C10, 49K15,
62R07, 62-01, 62C10,
Contents
Preface
1 Introduction
1.1 Relationships between mathematical subjects and data science
1.2 Structure of the book
1.2.1 Part one
1.2.2 Part two
1.2.3 Part three
1.3 Our motivation for writing this book
1.4 Examples and listings
1.5 How to use this book

Part I Introduction to R
2 Overview of programming paradigms
2.1 Introduction
2.2 Imperative programming
2.3 Functional programming
2.4 Object-oriented programming
2.5 Logic programming
2.6 Other programming paradigms
2.7 Compiler versus interpreter languages
2.8 Semantics of programming languages
2.9 Further reading
2.10 Summary
3 Setting up and installing the R program
3.1 Installing R on Linux
3.2 Installing R on MAC OS X
3.3 Installing R on Windows
3.4 Using R
3.5 Summary
4 Installation of R packages
4.1 Installing packages from CRAN
4.2 Installing packages from Bioconductor
4.3 Installing packages from GitHub
4.4 Installing packages manually
4.5 Activation of a package in an R session
4.6 Summary
5 Introduction to programming in R
5.1 Basic elements of R
5.2 Basic programming
5.3 Data structures
5.4 Handling character strings
5.5 Sorting vectors
5.6 Writing functions
5.7 Writing and reading data
5.8 Useful commands
5.9 Practical usage of R
5.10 Summary
6 Creating R packages
6.1 Requirements
6.2 R code optimization
6.3 S3, S4, and RC object-oriented systems
6.4 Creating an R package based on the S3 class system
6.5 Checking the package
6.6 Installation and usage of the package
6.7 Loading and using a package
6.8 Summary

Part II Graphics in R
7 Basic plotting functions
7.1 Plot
7.2 Histograms
7.3 Bar plots
7.4 Pie charts
7.5 Dot plots
7.6 Strip and rug plots
7.7 Density plots
7.8 Combining a scatterplot with histograms: the layout function
7.9 Three-dimensional plots
7.10 Contour and image plots
7.11 Summary
8 Advanced plotting functions: ggplot2
8.1 Introduction
8.2 qplot()
8.3 ggplot()
8.4 Summary
9 Visualization of networks
9.1 Introduction
9.2 igraph
9.3 NetBioV
9.4 Summary

Part III Mathematical basics of data science


10 Mathematics as a language for science
10.1 Introduction
10.2 Numbers and number operations
10.3 Sets and set operations
10.4 Boolean logic
10.5 Sum, product, and Binomial coefficients
10.6 Further symbols
10.7 Importance of definitions and theorems
10.8 Summary
11 Computability and complexity
11.1 Introduction
11.2 A brief history of computer science
11.3 Turing machines
11.4 Computability
11.5 Complexity of algorithms
11.6 Summary
12 Linear algebra
12.1 Vectors and matrices
12.2 Operations with matrices
12.3 Special matrices
12.4 Trace and determinant of a matrix
12.5 Subspaces, dimension, and rank of a matrix
12.6 Eigenvalues and eigenvectors of a matrix
12.7 Matrix norms
12.8 Matrix factorization
12.9 Systems of linear equations
12.10 Exercises
13 Analysis
13.1 Introduction
13.2 Limiting values
13.3 Differentiation
13.4 Extrema of a function
13.5 Taylor series expansion
13.6 Integrals
13.7 Polynomial interpolation
13.8 Root finding methods
13.9 Further reading
13.10 Exercises
14 Differential equations
14.1 Ordinary differential equations (ODE)
14.2 Partial differential equations (PDE)
14.3 Exercises
15 Dynamical systems
15.1 Introduction
15.2 Population growth models
15.3 The Lotka–Volterra or predator–prey system
15.4 Cellular automata
15.5 Random Boolean networks
15.6 Case studies of dynamical system models with complex attractors
15.7 Fractals
15.8 Exercises
16 Graph theory and network analysis
16.1 Introduction
16.2 Basic types of networks
16.3 Quantitative network measures
16.4 Graph algorithms
16.5 Network models and graph classes
16.6 Further reading
16.7 Summary
16.8 Exercises
17 Probability theory
17.1 Events and sample space
17.2 Set theory
17.3 Definition of probability
17.4 Conditional probability
17.5 Conditional probability and independence
17.6 Random variables and their distribution function
17.7 Discrete and continuous distributions
17.8 Expectation values and moments
17.9 Bivariate distributions
17.10 Multivariate distributions
17.11 Important discrete distributions
17.12 Important continuous distributions
17.13 Bayes’ theorem
17.14 Information theory
17.15 Law of large numbers
17.16 Central limit theorem
17.17 Concentration inequalities
17.18 Further reading
17.19 Summary
17.20 Exercises
18 Optimization
18.1 Introduction
18.2 Formulation of an optimization problem
18.3 Unconstrained optimization problems
18.4 Constrained optimization problems
18.5 Some applications in statistical machine learning
18.6 Further reading
18.7 Summary
18.8 Exercises
Bibliography
Subject Index
Preface
In recent years, data science has gained considerable popularity
and established itself as a multidisciplinary field. The goal of data
science is to extract information from data and use this
information for decision making. One reason for the popularity
of the field is the availability of mass data in nearly all fields of
science, industry, and society. This has allowed a move away from
theoretical assumptions as the basis for analyzing a problem
and toward data-driven approaches that are
centered around these big data. However, to master data
science and to tackle real-world data-based problems, a high
level of mathematical understanding is required. Furthermore,
for a practical application, proficiency in programming is
indispensable. The purpose of this book is to provide an
introduction to the mathematical foundations of data science
using R.
The motivation for writing this book arose out of our teaching
and supervising experience over many years. We realized that
many students are struggling to understand methods from
machine learning, statistics, and data science due to their lack of
a thorough understanding of mathematics. Unfortunately,
without such a mathematical understanding, data analysis
methods, which are based on mathematics, can only be
understood superficially. For this reason, we present in this book
mathematical methods needed for understanding data science.
That means we are not aiming for a comprehensive coverage of,
e. g., analysis or probability theory, but we provide selected
topics from such subjects that are needed in every data
scientist’s mathematical toolbox. Furthermore, we combine this
with the algorithmic realization of mathematical methods
using the widely used programming language R.
The present book is intended for undergraduate and graduate
students in the interdisciplinary field of data science with a major
in computer science, statistics, applied mathematics, information
technology or engineering. The book is organized in three main
parts. Part 1: Introduction to R. Part 2: Graphics in R. Part 3:
Mathematical basics of data science. Each part consists of
chapters containing many practical examples and theoretical
basics that can be practiced side-by-side. This way, one can put
the learned theory into a practical application seamlessly.
Many colleagues have, directly or indirectly, provided us
with input, help, and support before and during the preparation
of the present book. In particular, we would like to thank Danail
Bonchev, Jiansheng Cai, Zengqiang Chen, Galina Glazko, Andreas
Holzinger, Des Higgins, Bo Hu, Boris Furtula, Ivan Gutman,
Markus Geuss, Lihua Feng, Juho Kanniainen, Urs-Martin Künzi,
James McCann, Abbe Mowshowitz, Aliyu Musa, Beatrice Paoli,
Ricardo de Matos Simoes, Arno Schmidhauser, Yongtang Shi,
John Storey, Simon Tavaré, Kurt Varmuza, Ari Visa, Olli Yli-Harja,
Shu-Dong Zhang, Yusen Zhang, and Chengyi Xia, and we apologize to all
who have mistakenly not been named. For proofreading and
help with various chapters, we would like to express our special
thanks to Shailesh Tripathi, Kalifa Manjan, and Nadeesha Perera.
We are particularly grateful to Shailesh Tripathi for helping us
prepare the R code. We would also like to thank our editors
Leonardo Milla, Ute Skambraks, and Andreas Brandmaier from
DeGruyter Press, who have always been available and helpful.
Matthias Dehmer also thanks the Austrian Science Fund (FWF)
for financial support (P 30031).
Finally, we hope this book helps to spread the enthusiasm and
joy we have for this field, and inspires students and scientists in
their studies and research questions.
Tampere and Brig and Belfast, March 2020
F. Emmert-Streib, M. Dehmer, and Salissou Moutari
1 Introduction
We live in a world surrounded by data. Whether a patient is visiting
a hospital for treatment, a stockbroker is looking for an investment
on the stock market, or an individual is buying a house or
apartment, data are involved in all of these decision processes.
The availability of such data results from technological progress
during the last three decades, which enabled the development of
novel data generation, data measurement, data storage, and data
analysis means. Despite the variety of data types stemming from
different application areas, for which specific data generating
devices have been developed, there is a common underlying
framework that unites the corresponding methodologies for
analyzing them. In recent years, the main toolbox or the process of
analyzing such data has come to be referred to as data science
[→73].
Despite its novelty, data science is not really a new field on its
own as it draws heavily on traditional disciplines [→61]. For
instance, machine learning, statistics, and pattern recognition are
playing key roles when dealing with data analysis problems of any
kind. For this reason, it is important for a data scientist to gain a
basic understanding of these traditional fields and how they fuel
the various analysis processes used in data science. Here, it is
important to realize that harnessing the aforementioned methods
requires a thorough understanding of mathematics and probability
theory. Without such an understanding, any method is applied
blindly, without deeper insight. This deficiency hinders the adequate
use of the methods
and the ability to develop novel methods. For this reason, this book
aims to introduce the mathematical foundations of data science.
Furthermore, in order to exploit machine learning and statistics
practically, a computational realization needs to be found. This
necessitates the writing of algorithms that can be executed by
computers. The advantage of such a computational approach is
that large amounts of data can be analyzed using many methods in
an efficient way. However, this requires proficiency in a
programming language. There are many programming languages,
but one of the most suitable programming languages for data
science is R [→154]. Therefore, we present in this book the
mathematical foundations of data science using the programming
language R.
That means we will start with the very basics that are needed to
become a data scientist who not only understands the methods
used, but also has the ability to develop novel methods. Due to
the difficulty of some of the mathematics involved, this journey may
take some time since there is a wide range of different subjects to
be learned. In the following sections, we will briefly summarize the
main subjects covered in this book as well as their relationship to
and importance for advanced topics in data science.
Figure 1.1 Visualization of the relationship between several
mathematical subjects and data science subjects. The mathematical
subjects shown in green are essentially needed for every topic in
data science, whereas graph theory is only used when the data
have an additional structural property. In contrast, differential
equations and dynamical systems play a special role: they are used to
gain insights into the data generation process itself.
1.1 Relationships between mathematical
subjects and data science
In →Fig. 1.1, we show an overview of the relationships between
several mathematical subjects on the left-hand side and data
science subjects on the right-hand side. First, we would like to
emphasize that there are different types of mathematical subjects.
In →Fig. 1.1, we distinguish three different types. The first type,
shown in green, indicates subjects that are essential for every topic
in data science, regardless of the specific purpose of the analysis.
For this reason, the subjects linear algebra, analysis, probability
theory and optimization can be considered the bare backbone to
understand data science. The second type, shown in orange,
indicates a subject that requires the data to have an additional
structural property. Specifically, graph theory is very useful when
discrete relationships between variables of the system are present.
An example of such an application is a random forest classifier, or a
decision tree that utilizes a tree-structure between variables for
their representation and classification [→26]. The third subject type,
shown in blue, is fundamentally different to the previous types. To
emphasize this, we drew the links with dashed lines. Dynamical
systems are used to gain insights into the data-generation process
itself rather than to analyze the data. In this way, it is possible to
gain a deeper understanding of the system to be analyzed. For
instance, one can simulate the regulation between genes that leads
to the expression of proteins in biological cells, or the trading
behavior of investors to learn about the evolution of the stock
market. Therefore, dynamical systems can also be used to generate
benchmark data to test analysis methods, which is very important
when either developing a new method or testing the influence of
different characteristics of data.
The diagram in →Fig. 1.1 shows the theoretical connection
between some mathematical subjects and data science. However, it
does not show how this connection is realized practically. This is
visualized in →Fig. 1.2. This figure shows that programming is
needed to utilize mathematical methods practically for data science.
That means programming, or computer science more generally, is a
glue skill/field that (1) enables the practical application of methods
from statistics, machine learning, and mathematics, (2) allows the
combination of different methods from different fields, and (3)
provides practical means for the development of novel computer-
based methods (using, e. g., Monte Carlo or resampling methods).
All of these points are of major importance in data science, and
without programming skills one cannot unlock the full potential
that data science offers. For the sake of clarity, we want to
emphasize that we mean scientific and statistical programming
rather than general purpose programming skills when we speak
about programming skills.
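As a small illustration of point (3), the following sketch shows one resampling method, the bootstrap, written in base R. It is only an example under stated assumptions: the data are simulated, and the names `x`, `B`, `boot_means`, `se_boot`, and `se_formula` are hypothetical choices made for this sketch.

```r
set.seed(1)                        # fixed seed so that the run is reproducible
x <- rnorm(50, mean = 10, sd = 2)  # simulated sample (illustrative data only)

# Resampling: draw B bootstrap samples with replacement and recompute the mean.
B <- 2000
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# Bootstrap estimate of the standard error of the mean,
# next to the textbook formula sd(x) / sqrt(n) for comparison.
se_boot    <- sd(boot_means)
se_formula <- sd(x) / sqrt(length(x))

print(c(bootstrap = se_boot, formula = se_formula))
```

The two estimates should be close to each other. The point of the sketch is that the bootstrap version needs no formula at all, only repeated computation, which is why such methods depend on programming.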
Figure 1.2 The practical connection between mathematics and
methods from data science is obtained by means of algorithms,
which require programming skills. Only in this way can mathematical
methods be utilized for specific data analysis problems.

Due to these connections, we present in this book the
mathematical foundations for data science along with an
introduction to programming. A pedagogical side-effect of
presenting programming and mathematics side-by-side is that one
learns naturally to think about a problem in an algorithmic or
computational way. In our opinion, this has a profound influence on
the reader’s analytical problem-solving capabilities because it
enables thinking in a guided way that can be computationally
realized in a practical manner. Hence, learning mathematics in
combination with a computational realization is more helpful than
learning only the methods, since it leads to a systematic approach
for problem solving. This is especially important for us since the end
goal of mastering the mathematical foundations is their
application to problems in data science.

1.2 Structure of the book


This book is structured into three main parts. In the following, we
discuss the content of each part briefly:

1.2.1 Part one


In Part one, we start with a general discussion on different
programming paradigms and programming languages as well as
their relationship. Then, we focus on the introduction of the
programming language R. We start by showing how to install R on
your computer and how to install additional packages from external
repositories. The programming language itself and the packages
are freely available and work for Windows, Mac, and Linux
computers. Afterward, we discuss key elements of the
programming language R in detail, including control and data
structures and important commands. As an advanced topic, we
discuss how to create an R package.

1.2.2 Part two


In Part two, we focus on the visualization of data by utilizing the
graphical capabilities of R. First, we discuss the basic plotting
functions provided by R and then present advanced plotting
functionalities that are based on external packages, e. g., ggplot2.
These sections are intended for data of an arbitrary structure. In
addition, we also present visualization functions that can be used
for network data.
The visualization of data is very important because it is
actually a form of data analysis. In the statistics community, such an
approach is termed exploratory data analysis (EDA). In the 1950s,
John Tukey widely advocated the idea of data visualization as a
means to generate novel hypotheses about the underlying data
structure, which would otherwise not come to the mind of the
analyst [→97], [→189]. Then, these hypotheses can be further
analyzed by means of quantitative analysis methods. EDA uses data
visualization techniques, e. g., box plots, scatter plots, and also
summary statistics, e. g., mean, variance, and quartiles to get either
an overview of the characteristics of the data or to generate a new
understanding. Therefore, a first step in formulating a question that
can be addressed by any quantitative data analysis method often
consists of visualizing the data. Chapters →7, →8, and →9
present various visualization methods that can be utilized for the
purpose of an exploratory data analysis.
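The summary statistics mentioned above can be sketched in a few lines of base R. This is a minimal example, assuming the `mtcars` data set that ships with R as stand-in data; the variable names `mpg`, `m`, `v`, `q`, and `five` are illustrative only.

```r
data(mtcars)        # example data set shipped with base R
mpg <- mtcars$mpg   # fuel consumption in miles per gallon

# Summary statistics named in the text
m <- mean(mpg)
v <- var(mpg)
q <- quantile(mpg, probs = c(0.25, 0.5, 0.75))

# The five numbers that a box plot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
five <- boxplot.stats(mpg)$stats

print(m)
print(v)
print(q)
print(five)
```

In an interactive session, `boxplot(mpg)` and `plot(mtcars$wt, mtcars$mpg)` would draw the corresponding box plot and scatter plot.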

1.2.3 Part three


In Part three, we start with a general introduction of mathematical
preliminaries and some motivation about the merit of mathematics
as the language for science. Then we provide the theoretical
underpinning for the programming in R, discussed in the first two
parts of the book, by introducing mathematical definitions of
computability, complexity of algorithms, and Turing machines.
Thereafter, we focus on the mathematical foundations of data
science. We discuss various methods from linear algebra, analysis,
differential equations, dynamical systems, graph theory, probability
theory, and optimization. For each of these subjects, we dedicate
one chapter. Our presentation will cover these topics on a basic and
intermediate level, which is enough to understand essentially all
methods in data science.
In addition to the discussion of the above topics, we also
present worked examples, including their corresponding scripts in
R. This will allow the reader to practice the mathematical methods
introduced. Furthermore, at the end of some chapters, we will
provide exercises that allow gaining deeper insights. Throughout
the chapters, we also discuss direct applications of the methods for
data science problems from, e. g., economy, biology, finance or
business. This will provide answers to questions of the kind, ‘what
are the methods for?’.

1.3 Our motivation for writing this book


From our experience in supervising BSc, MSc, and PhD students as
well as postdoctoral researchers, we learned over the years that a
common problem among many students is their lack of basic
knowledge and understanding of the mathematical foundations
underpinning methods used in data science. This causes difficulties
in particular when tackling more advanced data analysis problems.
Unfortunately, this deficiency is typically too significant to be easily
compensated for, e. g., by watching a few YouTube videos or attending
some of the many available online courses on platforms like
Coursera, EdX or Udacity. The problem is actually twofold. First, it is
the specific lack of knowledge and understanding of mathematical
methods and, second, it is an inability to ‘think mathematically’. As
already mentioned above, both problems are related to each
other, and the second one is even more severe because it has more
far-reaching consequences. From analyzing this situation, we
realized that there is no shortcut to learning data science other than
first gaining a deep understanding of its mathematical foundations.
Figure 1.3 A typical approach in data science in order to analyze a
‘big question’ is to reformulate this question in a way that makes it
practically analyzable. This requires an understanding of the
mathematical foundations and mathematical thinking (orange
links).

This problem is visualized in →Fig. 1.3. In order to answer a big
question of interest for an underlying problem, frequently, one
needs to reformulate the question in a way that the data available
can be analyzed in a problem-oriented manner. This also requires
either adapting existing methods to the data or inventing a new
method, which can then be used to analyze the data and obtain
results. The adaptation of methods requires a technical
understanding of the mathematical foundations of the methods,
whereas the reformulation of the big question requires some
mathematical thinking skills. This should clarify our approach, which
consists of starting to learn data science from its mathematical
foundations.
Lastly, our approach has the additional side-effect that the
learner has an immediate answer to the question, ‘what is this
method good for?’. Everything we discuss in this book aims to
prepare the reader for learning the purpose and application of data
science.

1.4 Examples and listings


In order to increase the practical usability of the book, we included
many examples throughout and at the end of the chapters. In
addition, we provided many worked examples by using R. These R
examples are placed into a dedicated environment indicated by a
green header, see, e. g., Listing 1.1.

In addition, we also provided an environment for pseudocode
in blue, see Pseudocode 1, and a bash environment in gray, see
example 1 below.

1.5 How to use this book


Our textbook can be used in many different areas related to data
science, including computer science, information technology, and
statistics. The following list gives some suggestions for different
courses at the undergraduate and graduate level, for which
selected chapters of our book could be used:
Artificial intelligence
Big data
Bioinformatics
Business analytics
Chemoinformatics
Computational finance
Computational social science
Data analytics
Data mining
Data science
Deep learning
Information retrieval
Machine learning
Natural language processing
Neuroinformatics
Signal processing
The target audience of the book is advanced undergraduate
students in computer science, information technology, applied
mathematics and statistics, but newcomers from other fields may
also benefit at a graduate student level, especially when their
educational focus is on nonquantitative subjects. By using individual
chapters of the book for the above courses, the students' lack of
mathematical understanding can be compensated for
so that the actual topic of the courses becomes clearer.
For beginners, we suggest, first, to start with learning the basics
of the programming language R from part one of this book. This will
equip them with a practical way to do something hands-on. In our
experience, it is very important to get involved right from the
beginning rather than first reading through only the theory and
then starting to practice. Readers who already have some programming
skills can proceed to part three of the book. The chapters on linear
algebra and analysis should be studied first; thereafter, the
remaining chapters can be studied in any order because
they are largely independent.
When studying the topics in part three of this book, it is
recommended to return to part two for the graphical visualization
of data. This will help to gain a visual understanding of the methods
and foster mathematical thinking in general.
For more advanced students, the book can be used as a lookup
reference for getting a quick refresher of important mathematical
concepts in data science and for the programming language R.
Finally, we would like to emphasize that this book is not
intended to replace dedicated textbooks for the subjects in part
three. The reason for this is two-fold. First, we do not aim for a
comprehensive coverage of the mathematical topics, but follow an
eclectic approach. This allows us to have a wide coverage that gives
broad insights over many fields. Second, the purpose for learning
the methods in the book is to establish a toolbox of mathematical
methods for data science. This motivates the selection of topics.
Part I Introduction to R
2 Overview of programming paradigms

2.1 Introduction
Programming paradigms form the conceptual foundations of practical programming
languages used to control computers [→79], [→122]. Before the 1940s, computers
were programmed by wiring several systems together [→122]. Moreover, the
programmer just operated switches to execute a program. In a modern sense, such a
procedure does not constitute a programming language [→122]. Afterwards, the von
Neumann computer architecture [→79], [→122], [→136] heavily influenced the
development of programming languages (especially those using imperative
programming, see Section →2.2). The von Neumann computer architecture is based
on the assumption that the machine’s memory contains both commands and data
[→110]. As a result of this development, languages that are strongly machine-
dependent, such as Assembler, have been introduced. Assembler belongs to the family
of so-called low-level programming languages [→122]. By contrast, modern
programming languages are high-level languages, which possess a higher level of
abstraction [→122]. Their functionality comprises simple, standard constructs, such
as loops, assignments, and conditional branches. Nowadays, modern programming
languages are often developed based on a much higher level of abstraction and novel
computer architectures. An example of such an architecture is parallel processing
[→122]. This development led to the insight that programming languages should not
be solely based on a particular machine or processing model, but rather describe the
processing steps in a general manner [→79], [→122].
The programming language concept has been defined as follows [→79], [→122]:

Definition 2.1.1.
A programming language is a notational system for communicating computations to
a machine.
Louden [→122] pointed out that the above definition evokes some important
concepts, which merit brief explanation here. Computation is usually described using
the concept of Turing machines, where such a machine must be powerful enough to
perform computations any real computer can do. This has been proven true and,
moreover, Church’s thesis claims that it is impossible to construct machines which are
more powerful than a Turing machine.
In this chapter, we examine the most widely used programming paradigms,
namely imperative programming, object-oriented programming, functional
programming, and logic programming. Note that “declarative”
programming is also often considered to be a programming paradigm.
The defining characteristic of an imperative program is that it expresses how
commands should be executed in the source code. In contrast, a declarative program
expresses what the program should do. In the following, we describe the most
important features of these programming paradigms and provide examples, as an
understanding of these paradigms will assist program designers. →Figure 2.1 shows
the classification of programming languages into the aforementioned paradigms.
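The contrast between "how" and "what" can be hinted at in R itself. The sketch below is only an approximation: R is not a declarative language, but a vectorized expression states the desired result without spelling out the iteration. The names `values`, `count`, and `count_declarative` are hypothetical.

```r
values <- c(3, 8, 1, 12, 7)

# "How": spell out each step of the computation explicitly.
count <- 0
for (v in values) {
  if (v > 5) {
    count <- count + 1
  }
}

# "What": state the desired result and leave the iteration to R.
count_declarative <- sum(values > 5)

print(c(count, count_declarative))
```

Both versions count the elements greater than 5 and give the same answer. They differ only in how much of the control flow the programmer writes by hand.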

Figure 2.1 Programming paradigms and some examples of typical programming
languages. R is a multiparadigm language because it contains aspects of several pure
paradigms.

2.2 Imperative programming


Many programming languages in current use belong to the imperative programming
paradigm. Examples (see →Figure 2.1) include Pascal, C, COBOL, and Fortran, see
[→110], [→122]. A programming language is called imperative [→122] if it meets the
following criteria:
1. Commands are evaluated sequentially.
2. Variables represent memory locations that store values.
3. Assignments exist to change the values of variables.
We emphasize that the term “imperative” stems from the fact that a sequence of
commands can modify the actual state when executed. A typical elementary
operation performed by an imperative program is the assignment of values. To explain
the above-mentioned variable concept in greater detail, let us consider the command
x:=x+1.
Now, one must distinguish two roles of x, see [→136]. The l-value relates to the
memory location, and the r-value relates to its value in the memory. Such variables
can then be used in other commands, such as control-flow structures. The most
important control-flow structures and other commands in the composition of
imperative programs are [→136], [→168]:
Command sequences (C1; C2; ...; Ck).
Conditional statements (if, else, switch, etc.).
Loop structures (while, while ... do, for, etc.).
Jumps and calling subprograms (goto, call).
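The constructs listed above can be sketched in R, which supports the imperative style. This is a minimal example and not taken from the book's listings; the names `x`, `parity`, `total`, `square`, and `result` are hypothetical. Note that R writes the assignment x:=x+1 as `x <- x + 1` and has no goto command, so only a subprogram call is shown.

```r
# Imperative style in R: commands run one after another and change program state.
x <- 5       # the variable x names a storage location holding the value 5
x <- x + 1   # the right-hand x is read (r-value), the left-hand x is overwritten (l-value)

# Conditional statement
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# Loop structure: accumulate 1 + 2 + ... + x in a running total
total <- 0
for (i in 1:x) {
  total <- total + i
}

# Calling a subprogram (here a user-defined function)
square <- function(v) v * v
result <- square(total)

print(c(x = x, total = total, result = result))
print(parity)
```

Each command changes the state left behind by the previous one, so reordering the commands would change the result.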
A considerable disadvantage of imperative programming languages is their strong
dependence on the von Neumann model (i. e., the sequence of commands operates
on a single data item at a time and, hence, parallel computing becomes difficult [→122]).
Thus, other nonimperative programming paradigms (that are less dependent on the
von Neumann model) may be useful in program design. Direct alternatives include
functional and logic programming languages, which are rooted in mathematics. They
will be discussed in the sections that follow.
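The imperative features listed above can be sketched in R, which supports this style. The variable names x and total are chosen for illustration only; the example shows an assignment in which x plays both roles (r-value on the right, l-value on the left), followed by a loop and a conditional that are executed sequentially.

```r
# Imperative style: a variable names a storage location whose value changes.
x <- 5        # store the value 5 under the name x
x <- x + 1    # right-hand x is read (r-value), left-hand x is overwritten (l-value)

# Control flow: a loop containing a conditional statement.
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    total <- total + i   # state change for every even i
  }
}

print(x)      # 6
print(total)  # 30, i.e., 2 + 4 + 6 + 8 + 10
```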

2.3 Functional programming


The basic mechanism of a functional programming language is the application and
evaluation of functions [→110], [→122]. This means that functions are evaluated and
the resulting value serves as a parameter for calling another function.
The key feature that distinguishes functional from imperative programming
languages is that variables, variable assignments, and loops (which require control
variables to terminate) are not available in functional programming [→110], [→122].
For instance, the command x:=x+1 is invalid in functional programming, as the
above-mentioned terms memory location, l-value, and r-value (see Section →2.2) do not
carry any meaning in mathematics. Consequently, variables in functional
programming are merely identifiers that are bound to values. An example that
illustrates this concept is the command x=5: in functional programming (as in
mathematics), variables stand for actual values only, and in this example the actual
value equals 5. A command x=6 would bind x to the value 6 instead of 5. In summary,
variables (as explained in Section →2.2) do not exist in functional programming;
instead, constants, parameters (which can be functions), and values [→122] are used.
The exclusive application of this concept is also referred to as pure functional
programming.
We emphasize that loop structures, which are a distinctive feature of imperative
programming, are here replaced by recursive functions. A weak point thereof is that
recursive programs may be less efficient than imperative programs. However, they
also have clear advantages, as functional programming languages are less machine-
dependent and functional programs may be easier to analyze due to their declarative
character (see Section →2.1). It is important to mention, however, that some
functional programming languages, such as LISP or Scheme, nonetheless allow
variables and assignments. These languages are called multiparadigm
programming languages as they support multiple paradigms. In the case of LISP or
Scheme, programs may be written in either a purely functional or an exclusively
imperative style.
The following examples further illustrate the main features of functional
programming in comparison with equivalent imperative programs. The first example
shows two programs for adding two integer numbers.
Using (pseudo-)imperative programming, this may be expressed as follows:

In LISP or Scheme, the program is simply (+ a b), but a and b need to be predefined,
e. g., as

The first program declares two variables a and b, and stores the result of the
computation in a new variable sum. The second program first defines two constants,
a and b, and binds them to certain values. Next, we call the function (+ ) (a function
call is always indicated by an opening and a closing parenthesis) and provide two input
parameters for this function, namely a and b. Note also that define is already a
function, as we write (define ... ). In summary, the functional character of the
program is reflected by calling the function (+ ) instead of storing the sum of the two
integer numbers in a new variable using the elementary operation “+”. In this case,
the result of the purely functional program is (+ 4 5)=9.
Another example is the square function for real values expressed by

and

The imperative program square works similarly to sum. We declare a real variable a,
and store the result of the calculation in the variable square. In contrast to this, the
functional program written using Scheme defines the function (square ) with a
parameter x using the elementary function (* ). If we bind x to the value 4, e. g., via
(define x 4), we obtain (square x)=(square 4)=16.
A more advanced example to illustrate the distinction between imperative and
functional programming is the calculation of the factorial, n! := n ⋅ (n − 1) ⋯ 2 ⋅ 1 .
The corresponding imperative program in pseudocode is given by

This program is typically imperative, as we use a loop structure (while ... do) and
the variables b and n change their values to finally compute n! (state change). As loop
structures do not exist in functional programming, the corresponding program must
be recursive. In purely mathematical terms, this can be expressed as follows:
n! = f(n) = n ⋅ f(n − 1) if n > 0, else f(n) = 1 if n = 0. The implementation of
n! using Scheme is written as follows [→1]:

Calling the function (factorial n) (see f(n)) can be interpreted as a process of
expansion followed by contraction (see [→1]). While the expansion is being executed, we
observe the creation of a sequence of so-called deferred operations [→1]. In this
case, the deferred operations are multiplications. This process is called a linear
recursion as it is characterized by a sequence of deferred operations whose length
grows linearly with n. Therefore, this recursive version is relatively
inefficient when the function is called with large values of n.
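To make the contrast between the two styles concrete, the following is a minimal sketch in R rather than in pseudocode or Scheme. The function names factorial_iterative and factorial_recursive are chosen for illustration, and the convention 0! = 1 is assumed. The iterative version changes the state of b and n in a loop, whereas the recursive version defers one multiplication per call.

```r
# Imperative version: a loop in which the variables b and n change their values.
factorial_iterative <- function(n) {
  b <- 1
  while (n > 1) {
    b <- b * n
    n <- n - 1
  }
  b
}

# Functional version: linear recursion, following f(n) = n * f(n - 1) and f(0) = 1.
factorial_recursive <- function(n) {
  if (n == 0) {
    return(1)
  }
  # The multiplication is deferred until the recursive call returns.
  n * factorial_recursive(n - 1)
}

print(factorial_iterative(5))   # 120
print(factorial_recursive(5))   # 120
```

Both functions agree with the built-in function factorial() for small nonnegative integers. The recursive version needs one pending call per deferred multiplication, so its depth grows linearly with n.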

2.4 Object-oriented programming


The development of object-oriented programming languages began during the
1960s, with Simula among the first to be developed. The basic idea in developing such
a language was to establish the term object as an entity that has certain properties
and can react to events [→122]. This programming style has been developed to
model real-world processes, as real-world objects must interact with one another.
This is exemplified in →Figure 2.2. Important properties of object-oriented programs
include the reusability of software components and their independence during the
design process. Classical and purely object-oriented programming languages that
realize the above-mentioned ideas include, for example, Simula67, Smalltalk, and
Eiffel (see [→122]). Other examples include the programming languages C++ and
Modula 2. We emphasize that they can be used in a purely imperative or a purely
object-oriented manner and, hence, they also support multiple paradigms. As mentioned
above (in Section →2.3), LISP and Scheme also support multiple paradigms.

Figure 2.2 Interaction between objects.

When explaining object-oriented programming languages, concepts such as
objects, classes to describe objects, and inheritance must be discussed (see [→47],
[→111]). Objects that contain both data (relating to their properties) and functions
(relating to their abilities) can be described as entities (see →Figure 2.2). A class
defines a certain type of object while also containing information about these objects’
properties and abilities. Finally, such objects can communicate with each other by
exchanging messages [→47], [→111]. We only explain the idea of inheritance in brief
[→47], [→111]. To create new classes, the basic idea is to inherit data and methods
from an already existing class. Advantages of this concept include the abstraction of
data (e. g., properties can be subsumed under more abstract topics) and the
reusability of classes (i. e., existing classes can be used in further programs and easily
modified), see [→47], [→111].
The above-mentioned concepts are in contrast to imperative programming
because the latter classifies data and functions into two separate units. Moreover,
imperative programming requires that the user of the program must define the data
before it is used (e. g., by defining starting values) and that they ensure that the
functions receive the correct data when they are called. The advantages
of object-oriented programming can be summarized as follows [→47], [→111]:
The error rate is lower than that of, for example, imperative languages, as the
object itself controls the data access.
The maintenance effort can be reduced compared to other programming
paradigms as the objects can modify data for new requirements.
This programming paradigm offers high reusability as the objects are
self-contained.

2.5 Logic programming


Before sketching the basic principles of logic programming, a short history of this
programming paradigm will be provided here. We note that formal logic [→175]
serves as a basis for developing logic programming. Methods and results from formal
logic have been used extensively in computer science. In particular, formal logic has
been used to design computer systems and programming languages. Examples
include the design of computer circuits and control-flow structures, which are often
embedded in programming languages using Boolean expressions [→122].
Another area in which the application of formal logic has proven useful is the
description of the semantics of programming languages (see Section →2.8; [→122],
[→126]). Using formal logic together with axiomatic semantics (see Section →2.8) has
been also essential in proving the correctness of program fragments [→122].
Moreover, seminal work has been conducted in theoretical computer science
when implementing methods and rules from formal logic using real computers. An
example thereof is automated theorem proving [→175] which places the emphasis on
proving mathematical theorems using computer programs. Interestingly, extensive
research in this area led to the opposite insight, that the results of computation can
be interpreted as proof of a particular problem [→122]. This has triggered the
development of programming languages that are based on logical expressions. A
prominent example of a logic and declarative programming language is Prolog,
developed in the 1970s [→95]. A concrete example is provided in the paragraph
that follows.
To express logical statements formally, the so-called first-order predicate calculus
[→175], a mathematical apparatus that is embedded in mathematical or formal logic,
is required. For in-depth analysis of the first-order predicate calculus and its method,
see, for example, [→35], [→175]. A simple example of the application of the first-order
predicate calculus is to prove the statement naturalnumber(3). We assume the logical
statements

to be true (see also the Peano axioms [→12]). Here, the symbol → stands for the
logical implication. Informally speaking, this means that if 1 is a natural number and
if, for all n, the successor of a natural number n is also a natural
number, then 3 is a natural number. To prove the statement, we apply the last two
logical statements as so-called axioms [→35], [→175], and obtain

Logic programming languages often use so-called Horn clauses to implement and
evaluate logical statements [→35], [→175]. Using Prolog, the evaluation of these
statements is given by the following:

Further details of Prolog and the theoretical background of programming
languages in general can be found in [→34], [→95], [→122], [→152].

2.6 Other programming paradigms


In this section, we sketch other programming paradigms and identify some typical
examples thereof. To classify them correctly, we emphasize that they form
subparadigms of those already discussed.
We begin by mentioning so-called languages for distributed and parallel
programming, which are often used to simplify the design of distributed or parallel
programs. An example thereof is the language OCCAM, which is imperative [→5]. The
unique feature of so-called script languages, such as Python and Perl, is their simplicity
and compactness. Both languages support imperative, object-oriented, and
functional programming. Furthermore, programs written using script languages are
often embedded into other programs implemented using structure-oriented
languages. Examples of the latter include HTML and XML, see [→60]. Statistical
programming languages, such as S and R, have been developed to perform statistical
calculations and data analysis on a large scale [→14], [→153]. A successor of S, called
the New S language, has strong similarities to S-PLUS and R, while R supports
imperative, object-oriented, and functional programming. The last paradigm we want
to mention is that of declarative query languages, which have been used extensively for
database programming [→166]. A prominent example thereof is SQL [→166].

2.6.1 The multiparadigm language R


As already discussed in the last section, R is a so-called statistical programming
language that supports multiple paradigms. In fact, it incorporates features from
imperative, functional, and object-oriented programming. In the following, we give
some code examples to demonstrate this. The R statements in Listing 2.8
demonstrate the existence of variables (see Section →2.2). That is, this part is
imperative. Another way to demonstrate this is the procedure in Listing 2.9.

The function fun_sum_imperative computes the sum of the first n natural numbers,
and is here written in a procedural way. By inspecting the code, one sees that there is
a state change of the declared variable, again underpinning its imperative character.
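A procedural implementation along these lines might look as follows. Only the function name fun_sum_imperative is taken from the text; the body and the local variable names s and i are assumptions made for illustration.

```r
# Sum of the first n natural numbers, written imperatively.
fun_sum_imperative <- function(n) {
  s <- 0                    # declared variable holding the running sum
  for (i in seq_len(n)) {   # seq_len(0) is empty, so n = 0 yields 0
    s <- s + i              # state change in every iteration
  }
  s
}

print(fun_sum_imperative(10))  # 55
```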
In contrast, the functional approach (see Section →2.3) to express this problem is
shown in Listing 2.10.

This version uses the concept of recursion for representing the following formula:
sum(n)=n+sum(n-1) (see also Section →2.3). As mentioned in Section →2.3, recursive
solutions may be less efficient than iterative ones that use variables in the sense of
imperative programming, especially when a function is called with large values.
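The recursive formulation can be sketched in R as shown below. The function name fun_sum_functional and the base case sum(0) = 0 are assumptions for illustration. No variable changes its value; the result is built up solely from nested function calls.

```r
# Sum of the first n natural numbers, written recursively: sum(n) = n + sum(n - 1).
fun_sum_functional <- function(n) {
  if (n <= 0) {
    return(0)                      # base case terminates the recursion
  }
  n + fun_sum_functional(n - 1)    # addition deferred until the inner call returns
}

print(fun_sum_functional(10))  # 55
```

For very large n, such a function can exceed the limit R places on nested calls, which is one practical reason why the iterative version may be preferable.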
To conclude this section, we demonstrate the object-oriented programming
paradigm in R. For this, we employ the object-oriented programming system S4
[→129] and implement the same problem as above. The result is shown in Listing
2.11.
First, we use the predefined class series_operation with a predefined data type.
Then, we define a prototype of the method fun_sum_object_oriented using the standard
class series_operation. Using the setMethod command, we define the method
fun_sum_object_oriented concretely, and also create a new object from
series_operation with a concrete value. Finally, calling the method gives the desired
result.
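The steps just described can be sketched with S4 as follows. The class name series_operation and the method name fun_sum_object_oriented come from the text, whereas the slot name n, its type, and the method body are assumptions for illustration, so the code may differ from the listing referred to above.

```r
library(methods)  # S4 system; attached explicitly so the sketch is self-contained

# Class with a single numeric slot n (the slot name is an assumption).
setClass("series_operation", slots = c(n = "numeric"))

# Prototype (generic function) of the method.
setGeneric("fun_sum_object_oriented",
           function(object) standardGeneric("fun_sum_object_oriented"))

# Concrete method for objects of class series_operation.
setMethod("fun_sum_object_oriented", "series_operation",
          function(object) sum(seq_len(object@n)))

# Create an object with a concrete value and call the method.
obj <- new("series_operation", n = 10)
result <- fun_sum_object_oriented(obj)
print(result)  # 55
```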

2.7 Compiler versus interpreter languages


In the preceding sections, we discussed programming paradigms and distinct
features thereof. We also demonstrated these paradigms by giving some examples
using programming languages such as Scheme or Prolog.
In general, the question as to how a programming language can be implemented
concretely on a machine arises. Two main approaches exist to tackle this problem,
namely by means of an interpreter or compiler (see [→122], [→167], [→202]). We start
by explaining the basic principle of an interpreter. Informally speaking, an interpreter
J receives a program and input symbols as its input and computes an output. In
mathematical terms, this can be expressed by a mapping J : L × I ⟶ O, where L
is an arbitrary programming language and I and O are sets corresponding to the
input and output, respectively [→202]. →Figure 2.3 shows the principle of an
interpreter schematically.

Figure 2.3 The basic principle of an interpreter (left) and compiler (right) [→122].
A distinct property of an interpreter is that the program p ∈ L and the sequence
of input symbols are executed simultaneously without using any prior information
[→202]. Typical interpreter languages include functional programming languages,
such as Lisp and Miranda, but other examples include Python and Java (see [→122],
[→168]). A key advantage of interpreter languages is that the debugging process is
often more efficient compared to that of a compiler, as the code is executed at
runtime only. The frequent inefficiency of interpreted programs may be identified as a
weakness because all fragments of the program, such as loops, must be translated
anew each time the program is executed.
Next, we sketch the compiler approach to translate computer programs. A
compiler translates an input program as a preprocessing step into another form,
which can then be executed more efficiently. This preprocessing step can be
understood as follows: A program written in a programming language (source
language) is translated into machine language (target language) [→202]. In
mathematical terms, this equals a mapping C : L₁ ⟶ L₂ that maps programs of a
programming language L₁ to other programs of a programming language L₂. After
this process, a target program can then be executed directly. Typical compiler
languages include C, Pascal, and Fortran (see [→122], [→168]). We emphasize that
compiled programs typically run much more efficiently than interpreted ones.
However, when changing the source code, the program must be compiled again,
which can be time consuming and resource intensive. →Figure 2.3 shows the
principle of a compiler schematically.

2.8 Semantics of programming languages


Much research has been conducted exploring the effect and behavior of programs
using programming languages [→122], [→126], [→168]. For example, a program’s
correctness can be proven mathematically using methods from so-called formal
semantics [→126]. Besides their extensive use in theory [→122], [→126], they have
also proven useful in practice as they have positively stimulated the development of
modern programming languages. Moreover, such methods have been developed because
it is necessary to describe the effect of programming languages independently of
concrete machines.
Below, we briefly sketch the three main approaches to formally describe the
semantics of programming languages [→122], [→126], [→168]:
Operational semantics describes a programming language by operations of a
concrete or hypothetical machine.
Denotational semantics is based on the use of semantical functions to describe
the effect of programs. This can be done by defining functions that assign
semantical values to the syntactical constructions of a programming language.
Axiomatic semantics uses logical axioms to describe the semantics of
programming languages’ phrases.
2.9 Further reading
For readers interested in more details about the topics presented in this chapter, we
recommend [→122], [→126], [→168].

2.10 Summary
The study of programming paradigms has a long history and is relatively complex.
Nevertheless, we considered it important to introduce this fundamental aspect to
show that programming is much more than writing code. Indeed, although
programming is generally perceived as practical, it has a well-defined mathematical
foundation. As such, programming is less practical than it may initially appear, and
this knowledge can be utilized by programmers in their efforts to enhance their
coding skills.
3 Setting up and installing the R program

In this chapter, we show how to install R on three major operating systems that are
widely used: Linux, MAC OS X, and Windows. As a note, we would like to remark that
this order reflects our personal preference of the operating systems based on the
experience we gained over the years making maximum use of computers.
From our experience, Linux is the most stable and reliable operating system of
these three and is also freely available. An example of such a Linux-operating system
is Ubuntu, which can be obtained from the web page →http://www.ubuntu.com/. We
have been using Ubuntu for many years and can recommend it to anyone, no matter
whether it is for professional or private use. Linux is in many ways similar to the
famous operating system Unix, developed by the AT&T Bell Laboratories and released
in 1969, but it can be used without acquiring a license. Typically, the research
environment of a professional laboratory has a computer infrastructure consisting of
Linux computers, because of the above-mentioned advantages in addition to the free
availability of all major programming languages (e. g., C/C++, Python, Perl, and Java)
and development tools. This makes Linux an optimal tool for developers.
Interestingly, the MAC OS X system is Unix-based like Linux, and hence, shares
some of the same features with Linux. However, a crucial difference is that one
requires a license for many programs because it is a commercial operating system.
Fortunately, R is freely available for all operating systems.

3.1 Installing R on Linux


Most Linux distributions have a centralized package repository and a package
manager. The easiest installation for Ubuntu is to open a terminal and type:

Alternatively, one can install R by using the Ubuntu software center, which is
similar to an App store. For other Linux distributions the installation is similar, but
details change. For instance, for Fedora, the installation via terminal uses the
command:

3.2 Installing R on MAC OS X


There are two packages available to install R on a MAC OS X operating system. The
first is a binary package and the second contains all source files. In the following, we
focus on the first type, because the second is needed only by developers and not by
regular users.
From the R web page (or CRAN), locate the newest version of R for your MAC OS X
operating system. It is important to decide if you need a 64-bit or a 32-bit version. At
the time of writing this book, the current R version is R 3.6.1. The installation is
straightforward: simply double-click the installer package.

3.3 Installing R on Windows


The installation for Windows is very similar to MAC OS X as described above and all
files can also be found at CRAN.

3.4 Using R
The above installation, regardless of the operating system, allows you to execute R
in a terminal. This is the most basic way to use the programming language. It means
that one needs, in addition, an editor for writing the code. For Linux, we
recommend emacs and for MAC OS X Sublime (which is similar to emacs). Both are
freely available. However, there are many other editors that can be used. Just try to
find the best editor for your needs (e. g., nice command highlighting or additional
tools for writing or debugging the code) that allows you to comfortably write code.
Some people like this vi-feeling of programming; others, however, prefer to have
a graphical user interface that offers some utilities. In this case, RStudio
(→https://www.rstudio.com/) might be the right choice for you. In →Fig. 3.1, we show
an example of what an RStudio session looks like. Essentially, the window is split into
four parts: a terminal for executing commands (bottom-left), an editor to write
scripts (top-left), a help window showing information about R or displaying
plots (bottom-right), and a part displaying variables available in the working space
(top-right).
Figure 3.1 Window of an RStudio session.

3.5 Summary
For using the base functionality of R, the installation shown in this chapter is
sufficient. That means essentially everything we will discuss in Chapter →5 regarding
the introduction to programming can be done with this installation. For this reason,
we suggest skipping the next chapter, which discusses the installation of external
packages, and coming back to it when such packages need to be installed.
4 Installation of R packages

After installing the base version of R, the program is fully functional. However, one of
the advantages of using R is that we are not limited to the functionality that comes
with the base installation, but we can extend it easily by installing additional
packages. There are two major sources from which such packages are available. One
is the COMPREHENSIVE R ARCHIVE NETWORK (CRAN) and the other is BIOCONDUCTOR. Recently,
GitHub has been emerging as a third major repository. In what follows, we explain
how to install packages from these and other sources.

4.1 Installing packages from CRAN


The simplest way to install packages from CRAN is via the install.packages function.

Here, package.name is the name of the package of interest. In order to find the
name of a package we want to install, one can go to the CRAN web page
(→http://cran.r-project.org/) and browse or search the list of available packages. If
such a package is found, then we just need to execute the above command within an
R session and the package will be automatically installed. Clearly, for
this to work properly, we need to have an internet connection.
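In scripts, it can be convenient to call install.packages only when a package is not yet present. The following is a minimal sketch of this idea; the helper name install_if_missing is hypothetical, and the CRAN mirror passed via the repos argument is an assumption that can be replaced by any other mirror.

```r
# Install a package from CRAN only if it is not already installed.
# Returns TRUE (invisibly) if an installation was attempted, FALSE otherwise.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))           # package already available, nothing to do
  }
  install.packages(pkg, repos = repos) # requires an internet connection
  invisible(TRUE)
}

# The stats package ships with R, so no download is triggered here.
installed_now <- install_if_missing("stats")
print(installed_now)  # FALSE
```

Calling install_if_missing with the name of a package that is not yet installed would download and install that package from the given mirror.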
As an example, we install the bc3net package, which enables inferring networks from
gene expression data [→45].

At the time of writing this book CRAN provided 14435 available packages. This is
an astonishing number, and one of the reasons for the widespread use of R since all
of these packages are freely available.

4.2 Installing packages from Bioconductor


Installing packages from Bioconductor is also straightforward. We just need to go to
the web page of Bioconductor (→http://www.bioconductor.org/) and search for the
packages of interest. Then, on each package page, there is some information about how
to install the package. For example, if we want to install the package graph, which
provides functions to manipulate networks, one needs to execute the following commands:
The first command sets the source from where to download the package, and the
second command downloads the package of interest.

4.3 Installing packages from GitHub


GitHub is a website providing a home for git repositories, where git is a freely
available and open source distributed version control system that supports software
development. The general way to install a package is

That means, first, the package devtools from CRAN needs to be installed and then
a package from GitHub with the name ID/packagename can be installed. For instance, in
order to install ggplot2 one uses the command

4.4 Installing packages manually

4.4.1 Terminal and unix commands


For the manual installation of packages, described below, on a Windows system, it is
necessary to install first the freely available program Cygwin. Cygwin is an interface
enabling a Unix-like environment similar to a terminal for Linux or MAC, which has
this available right out of the box. In general, a terminal enables entering commands
textually via the keyboard and provides the most basic means to communicate with
your computer. It is interesting to note that when using your mouse and clicking a
button, this action is internally converted into commands that are fed to the
computer processor for execution. The problem is that, for what we describe below,
there are no buttons available that could be clicked with your mouse. The good thing
is that this is not really a problem as long as we have a terminal that allows us to
enter the required commands directly.
Before we proceed, we would like to encourage the reader to get at least a basic
understanding of unix commands because this gives you a much better
understanding about the internal organization of a computer and its directory
structure. In →Table 4.1, we provide a list of the most basic and essential unix
commands that can be entered via a terminal.
Table 4.1 Essential unix commands that can be entered via a terminal.
Unix command                      Description
cd path                           change to directory
clear                             clear the terminal screen
cp source destination             copy files and directories
df                                display used and available disk space
du                                show how much space each file uses
file filename                     determine what type of data is within a file
head filename                     display the beginning of a file
history                           display the last commands typed
kill pid                          stop a process
ls                                list directory contents
man [command]                     display help information for the specified command
mkdir directory                   create a new directory
mv [options] source destination   rename or move file(s) or directories
ps [options]                      display a snapshot of the currently running processes
pwd                               display the pathname for the current directory
rm [options] directory            remove file(s) and/or directories
rmdir [options] directory         delete only empty directories
top                               display the resources being used (press q to exit)
wc file                           count number of words in file

4.4.2 Package installation


The most basic method to install packages is to download a package to your local
hard drive and then install it. Suppose that you downloaded such a package to the
directory “home/new.files”. Then you need to execute the following command within
a terminal (and not within an R session!) from the home directory:
R CMD INSTALL package.name

4.5 Activation of a package in an R session


In order to use the functions provided by a package in R, first, one needs to activate a
package. This is done by the library function:

Only after the execution of the above command is the content of the package
package.name available. For instance, we activate the package bc3net as follows:
To see what functions are provided by a package we can use the function help:
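As a minimal sketch of activating a package and inspecting its contents, the following uses the tools package. It is chosen here only because it ships with R and requires no download; a package such as bc3net would be handled the same way once installed. Listing the exported names with ls() is used in place of the help overview, which is intended for interactive sessions.

```r
library(tools)                    # attach the package to the search path

# After activation, the package appears on the search path.
print("package:tools" %in% search())

# List the names of the objects the package provides.
fns <- ls("package:tools")
print(head(fns))

# In an interactive session, an overview page is available via:
# help(package = "tools")
```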

4.6 Summary
In this chapter, we showed how to install external packages from different package
repositories. Such packages are optional and are not needed for utilizing the base
functionality of R. However, there are many useful packages available that make
programming more convenient and efficient. For instance, in an academic
environment it is common to provide an R package when publishing a scientific article
that allows reproducing the conducted analysis. This makes the replication of such an
analysis very easy because one does not need to rewrite such scripts.
5 Introduction to programming in R

This chapter will provide an introduction to programming in R. We will discuss key
commands, data structures, and basic functionalities for writing scripts. R has been
specifically developed for the statistical analysis of data. However, here we focus on
its general purpose functionalities that are common to many other programming
languages. Knowledge of these functionalities is necessary for utilizing its advanced
capabilities discussed in later chapters.

5.1 Basic elements of R


In R, a value is assigned to a variable by the “<-” operator:

In principle, the symbol “=” can also be used for an assignment, but there are
cases where this leads to problems, and for this reason we suggest always using the
“<-” operator, because it can be used in all cases.
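One case in which the two symbols behave differently is inside a function call. The following sketch illustrates this; the variable names x, y, m1, m2, and x_after_eq are chosen for illustration only.

```r
x <- 5    # assignment with "<-"
y = 7     # at the top level, "=" also performs an assignment

# Inside a function call, "=" names an argument and does not assign:
m1 <- mean(x = c(1, 2, 3))
x_after_eq <- x               # x in the workspace is still 5

# In contrast, "<-" inside a call assigns in the workspace and passes the value on:
m2 <- mean(x <- c(1, 2, 3))

print(x_after_eq)  # 5
print(x)           # 1 2 3
print(m1 == m2)    # TRUE, both means equal 2
```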
The basic elements of R, to which different values can be assigned, are called
objects. There are different types of objects and some of them are listed in →Table 5.1.

Table 5.1 Information about basic object types in R.


Object type   Example                           Description
NULL          NULL                              placeholder (initialized but empty)
NA            NA                                missing value (not available)
environment   a <- new.env(hash=TRUE)           an environment
logical       TRUE, FALSE                       logical values
integer       1L, 2L, 3L, …                     integer values
double        1.3452                            real values
character     'hallo'                           a string of character values
expression    exp <- parse(text=c("x <- 2"))    an expression object
list          x <- list(3, c(5,3))              a list

The command typeof() provides information about the type of an object. An
interesting type is NULL, which is not an actual object type, but serves more as a
placeholder allowing an empty initialization. In Section →5.3.1, we will show how this
can be useful. Another interesting type is NA, indicating missing values.
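The object types discussed above can be checked directly with typeof(). The following sketch collects the results in a named character vector; the vector name types and its element names are chosen for illustration. Note that a plain number such as 1 is stored as a double, whereas the suffix L yields an integer, and that a plain NA is of type logical.

```r
types <- c(
  null        = typeof(NULL),
  na          = typeof(NA),
  environment = typeof(new.env(hash = TRUE)),
  logical     = typeof(TRUE),
  integer     = typeof(1L),
  plain_one   = typeof(1),
  double      = typeof(1.3452),
  character   = typeof("hallo"),
  expression  = typeof(parse(text = "x <- 2")),
  list        = typeof(list(3, c(5, 3)))
)
print(types)

# Dedicated test functions exist for the two special values.
print(is.null(NULL))  # TRUE
print(is.na(NA))      # TRUE
```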

5.1.1 Navigating directories


When opening an R session, it may be unclear in which directory we actually are. In
order to find this out one can use the get working directory function getwd():

This will result in a character string showing the full path to the current working
directory of the R session. In case one would like to change the directory, one can use
the set working directory function setwd():

Here, new.dir is a character string containing a valid name of a directory you
would like to set as your current working directory.
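A short sketch of how the two functions work together is given below. The temporary directory returned by tempdir() serves as new.dir only because it is guaranteed to exist in every R session; the variable names old.dir and current are chosen for illustration.

```r
old.dir <- getwd()     # remember the current working directory
new.dir <- tempdir()   # an existing directory, used here as an example

setwd(new.dir)         # change the working directory
current <- getwd()
print(current)

setwd(old.dir)         # change back to the original directory
```

Storing the original directory before calling setwd() makes it easy to return to it afterwards.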

5.1.2 System functions


An important part of R and any programming language is a system function. A system
function has a name followed by a list of arguments in parentheses. A simple
example for such a function is sqrt(). In order to use such a function appropriately,
one needs to know:
the name of the function
the arguments of the function
the meaning of the function and its arguments
Let us assume that we know the name of the function, but not its arguments. In R,
there are two ways to find this out. First, one can use the function args() and, as an
argument for this function, the name of the function with unknown arguments:

The output of args() is usually only informative if one is already
familiar with the function of interest, but has just forgotten details about its
arguments. For more information, we need to use the function help(), which is
described in detail in the next section.
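
As a short illustration, args() can be applied to sqrt() and, as a second example chosen by us, to paste():

```r
args(sqrt)     # shows that sqrt() has a single argument x
args(paste)    # shows the arguments of paste(), including sep and collapse
```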
In the following, we use the terms “function” and “command” interchangeably,
although a command has a more general meaning than a function.

5.1.3 Getting help


When there is a command that we want to use, but we are unfamiliar with its syntax,
e. g., sqrt(), R provides a help function, which is invoked by either help(sqrt) or ?sqrt:
The output of either of these commands is a text that describes
this function (see →Figure 5.1).

Figure 5.1 Help information provided by R for the sqrt() function.

At this early stage in the book, we would like to highlight the fact that R provides
helpful information about functions, but this does not necessarily mean that this
information will be as extensive as you expect or would wish for. Instead, the
provided help information is usually rather short and not sufficient (or intended) to
fully understand all details of the function of interest.
However, most help information comes with R examples at the end of the help
file. This allows you to reproduce at least parts of the capabilities of the described
functions by using the provided example code. It is not necessary to type these
examples manually, because there is a useful function available, called example(), that
executes the provided example code automatically:
That means you do not need to manually copy-and-paste (or type) the example
code, but just apply the example() command to the function you wish to learn more
about.
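
The three commands discussed in this section can be tried as follows; this is a minimal sketch using sqrt() as the function of interest.

```r
help(sqrt)      # opens the help page for sqrt()
?sqrt           # shorthand for the same request
example(sqrt)   # runs the example code from the end of that help page
```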

5.2 Basic programming

5.2.1 If-clause
A basic element of every programming language is an if-clause. An if-clause can be
used to test the truth of a logical statement. For instance, the logical statement in the
example below is: a > 2 . If the variable a is larger than 2, then this statement is true,
and the code in the first {} brackets is executed. However, if this statement is false,
then the code that follows the brackets {} after else will be executed.

The general form of an if-clause is given by the following structure. Here, a
general logical statement is the argument of the if-clause. Depending on whether this
statement is true or false, the commands in part 1 or 2 are executed. That means, the
outcome of the test of the provided logical statement selects the commands to be
executed.

The usage of an if-clause is very flexible: the else clause can be omitted,
and further conditional statements can be included by means of the else if command.
We would like to note that, e. g., the statement “a=4” is not a logical statement,
but an assignment, and will for this reason not work as an argument for an if-clause.
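
The following sketch illustrates the if-clause with the logical statement a > 2, followed by a variant using else if. The value of a, the printed messages, and the variable size are assumptions made for illustration.

```r
a <- 3    # hypothetical value

if (a > 2) {
  print("a is larger than 2")
} else {
  print("a is not larger than 2")
}

# further tests can be added with else if
if (a > 5) {
  size <- "large"
} else if (a > 2) {
  size <- "medium"
} else {
  size <- "small"
}
print(size)
```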

5.2.2 Switch
The switch() command is conceptually similar to an if-clause. However, the difference
is that one can test more than one condition at the same time. For instance, in the
example below, the switch() command tests 3 conditions, because it has 3 executable
components, indicated by the “{ }” blocks. If the variable “a” is 1, the first
commands are executed, if “a” is 2 the second, and so on.

For all other values of “a”, there will be no true condition and, hence, none of the
above commands will be executed.
For better readability, we split the switch() command in the above example into
three different lines. This way one can see that it consists of 3 executable
components. When you write your own programs, you will see that such a formatting
is in general very helpful to get a quick overview of a program.
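
A minimal sketch of a switch() command with 3 executable components could look as follows. The value of a and the returned strings are assumptions made for illustration.

```r
a <- 2    # hypothetical value

res <- switch(a,
  { "first component" },
  { "second component" },
  { "third component" }
)
print(res)

# a value without a matching component selects nothing; switch() returns NULL
res.none <- switch(4, { "first" }, { "second" }, { "third" })
is.null(res.none)
```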

5.2.3 Loops
In R, there are two different ways to realize a looping behavior. The first is by using a
for-loop, and the second by using a while-loop. A looping behavior means the
consecutive execution of the same procedure for a number of steps. The number of
steps can be fixed, or variable.

5.2.4 For-loop
A for-loop repeats a statement for a predefined number of times. In the following
example, i is successively assigned the values 1 to 3, and the command print(i) is
executed 3 times, for three different values of i:
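
The for-loop described above can be written as:

```r
for (i in 1:3) {
  print(i)    # prints 1, 2 and 3 in turn
}
```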

5.2.5 While-loop
Another looping function is while(). Its syntax is

In contrast with a for-loop, which executes a loop a predefined number of times, a
while-loop repeats a statement as long as the argument of while() is logically true:

One needs to make sure that the argument of the while() function becomes
logically false at some point during the looping process, because otherwise the loop
is iterated infinitely. This is a frequent programming bug.
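
A minimal sketch of a while-loop that terminates, assuming a counter variable i chosen for illustration:

```r
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1    # without this update the condition would stay TRUE forever
}
```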

5.2.6 Logic behind a For-loop


Before we continue, we want to take a closer look at the logic behind
a for-loop. The reader who already has some familiarity with programming can skip
this section, but in our experience the following information is helpful for beginners
to see in detail how a for-loop works.
The general form of a for-loop consists of a for() function that executes the body
of the for-loop comprised of individual R commands, depending on the argument of
the for-loop.
In the following, we will discuss each of the three components of the for-loop and
their connections.

5.2.6.1 Argument of the For-loop

In order to keep the logic simple, let us assume that the argument of the for-loop is
argument = i in 1:N (5.1)

This argument contains one variable and one parameter. Here i is the variable,
because its value changes systematically with every loop that is executed, and N is a
parameter, because its value is fixed throughout the whole loop. The values that i
can assume are determined by 1:N, because the argument says i in 1:N. If you
define N=4 and execute 1:N in an R session, you get the output [1] 1 2 3 4.

That means 1:N is a vector of integers of length N. To see this, you can define a <-
1:N and access the components of vector a by a[1], e. g., for the first component.

The values that i can assume are systematically assigned according to the order of the
vector 1:N, i. e., in loop 1, i is equal to 1; in loop 2, i is equal to 2; until finally in loop N,
i is equal to N. For this reason, the “argument” of a for-loop is—in our example—
dependent on the variable i, i. e.,
argument(i), (5.2)

and in general it is dependent on the number of the loop step, i. e.,


argument(loop step). (5.3)

Note that this is just a symbolic notation to emphasize that the argument of a loop is
connected to the step of the loop. Here, it is important to realize that the variable of
the “argument” changes its value in every loop step.

5.2.6.2 Body of the For-loop


The body of the for-loop is just a list of commands provided between the curly “{}”
brackets. In principle, the body can consist of one or more commands that are
executed consecutively (see the next section); zero commands are allowed as well,
but this does not result in a meaningful action. In order to be precise, one needs to
realize that the body is a function of the argument, i. e., symbolically we can write,
body(argument). (5.4)

Due to the fact that the argument itself is a function of the loop step, we have the
following dependency chain:
body(argument(loop step)). (5.5)

5.2.6.3 For-function

The third part is an actual R function. In R, you can always recognize a function by its
name, followed by round brackets “()” containing, optionally, an argument. In the
case of the for-function, it contains an argument, as discussed above. The purpose of
the for-function is to execute the body consecutively.
To make this clear, especially with respect to the argument of the body, which
depends on the number of the loop step, let us consider the following example:

The for-function converts this into the following consecutive execution of the
body, as a function of the argument:

First, the value of the variable i changes with every loop, according to the
argument. In our case “i” just assumes the values 1, 2, 3. Then the concrete value of i
is used in every loop, leading to different values of a. From a more general point of
view, this means that the for-function does not only execute the body of the function
consecutively, but it also changes the content of the workspace, which is the memory
of an R session, with every loop step. This is the exact meaning of body(argument(loop
step)).
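
The unrolling of a for-loop can be sketched as follows. The body a <- 2 * i is an assumption made for illustration; any body that depends on i would serve the same purpose.

```r
# for-loop form
for (i in 1:3) {
  a <- 2 * i
  print(a)
}

# the same computation, unrolled step by step
i <- 1; a <- 2 * i; print(a)
i <- 2; a <- 2 * i; print(a)
i <- 3; a <- 2 * i; print(a)
```

Both forms leave the workspace in the same state: i is 3 and a is 6.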

Figure 5.2 Unrolling the functional working mechanism in time of a for-loop.

In →Fig. 5.2, we visualize the general working mechanism of a for-loop by
unrolling it in time. Overall, if one wants to understand what a particular for-loop
does, one just needs to unroll its “body” in the way depicted in →Fig. 5.2, considering
the influence of the “argument” on it.
This discussion demonstrates that a for-loop, or any other R function, can be quite
complicated if we want to understand in more detail how it works. However, once we
understand its principal working mechanism, we can set these details aside and focus
on the key factors only. For the for-loop this is the systematic modification of the
variable in the argument of the loop, and the consecutive execution of its body.

5.2.7 Break
Both loop functions can be interrupted at any time during the execution of the loop
using the break statement. Frequently, this is used in combination with an if-clause
within a loop to test for a specific condition that shall lead to the interruption of the
loop.

Combining loops with if-clauses and the break statement allows creating very
flexible constructs that can exhibit a rich behavior.
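
A minimal sketch of such a construct, with a loop range and a stopping condition chosen for illustration:

```r
for (i in 1:10) {
  if (i > 3) {
    break    # leave the loop as soon as i exceeds 3
  }
  print(i)
}
```

Only the values 1, 2 and 3 are printed, although the loop was set up for 10 steps.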

5.2.8 Repeat-loop
For completeness, we want to mention that there is actually a third type of loop in R,
the repeat-loop. However, in contrast with a for-loop and a while-loop, this does not
come with an interruption condition, but is in fact an infinite loop that never
stops on its own. For this reason, the repeat statement always needs to be used in
combination with the break statement:
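
A minimal sketch of a repeat-loop, assuming a counter i and a stopping value of 3 for illustration:

```r
i <- 0
repeat {
  i <- i + 1
  print(i)
  if (i >= 3) {
    break    # the only way out of a repeat-loop
  }
}
```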

5.3 Data structures

5.3.1 Vector
A vector is a 1-dimensional data structure. As the example below shows, a vector can
be easily defined by using the combine function c(). This function concatenates its
arguments, forming a vector. Individual elements can be accessed in various ways,
using square brackets.
There are many functions to obtain properties of a vector, e. g., its length or the
sum of its elements. In order to make sure that the sum of its elements can be
computed, the function mode() or typeof() allows determining the data-type of the
elements. Examples of different types are character, double, logical, or NULL.
Accessing elements of a vector can be done either individually (a[3] gives the
third element of vector a) or collectively by specifying the indices of the elements
(a[c(2,5)] gives the second and fifth element).
One can also assign names to elements of a vector using the command names().
When accessing an element by its name, one needs to make sure to pass the name
as a character string. In the example below, one needs to use "C" and not C, because
the latter indicates a variable rather than the capital letter itself.
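
The operations described so far can be sketched as follows. The values and names of the vector are assumptions made for illustration.

```r
a <- c(3, 7, 1, 9, 4)    # hypothetical values
length(a)                # number of elements: 5
sum(a)                   # sum of the elements: 24
typeof(a)                # "double"

a[3]                     # third element
a[c(2, 5)]               # second and fifth element

names(a) <- c("A", "B", "C", "D", "E")
a["C"]                   # access by name; the quotes are required
```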

There are also several functions and constants available to generate vectors, e. g.,
a sequence of numbers (seq()) or the lower case letters (letters). A general
characteristic of a vector is that whatever the type of its elements, they all need to be
of the same type. This is in contrast with lists, discussed in Section →5.3.3.
It is also possible to define a vector of a given length and mode initialized with zeros.
For example, vector(mode = "numeric", length = 10) results in a numeric vector of
length 10, where each element is initialized with a 0.
It is also possible to apply a function element-wise to a vector without the need to
access its elements, e. g., in a for-loop.
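
The following sketch illustrates the generation of vectors and the element-wise application of a function; the concrete values are chosen for illustration.

```r
s <- seq(from = 2, to = 10, by = 2)           # 2 4 6 8 10
l <- letters[1:3]                             # "a" "b" "c"
z <- vector(mode = "numeric", length = 10)    # ten zeros

x <- c(1, 4, 9)
sqrt(x)      # applied to every element at once: 1 2 3
```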

Other useful functions that can be either applied to a vector or used to generate
vectors are provided in →Table 5.2.
If we want to add an element to a vector, we can use the command append():

Here the option after allows specifying a subscript, after which the values are to be
appended.
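
A short sketch of append() with and without the option after, using a vector chosen for illustration:

```r
a <- c(1, 2, 3)
append(a, 10)               # appended at the end: 1 2 3 10
append(a, 10, after = 1)    # inserted after the first element: 1 10 2 3
```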
Table 5.2 Examples of functions that can be applied to vectors or can be used to
generate vectors.
Command name            Description
LETTERS                 capital letters
letters                 lower case letters
month.name              month names
rep("hello", times=3)   repeats the first argument n-times
sum                     sum of all elements in the vector
length                  length of vector
rev                     reverse order of elements

This command allows us also to demonstrate the usefulness of the NULL object
type, introduced in Section →5.1.

Although the variable a does not contain an element with a value, it contains one
initialized element as a place holder of type NULL. Repeating the above example with
an uninitialized object would result in an error message.
A simplified form of the above can be written as follows:
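
Since the original listings are not reproduced here, the following is only a sketch of the idea: a variable is initialized with NULL and then filled step by step, first with append() and then, as one possible shorter form, with c(). The values are assumptions made for illustration.

```r
a <- NULL              # empty initialization
a <- append(a, 5)
a <- append(a, 8)
print(a)               # 5 8

b <- NULL
b <- c(b, 5)           # c() can be used in the same way
```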

5.3.2 Matrix
A matrix is a 2-dimensional data structure. It can be constructed with the command
matrix(), see Listing 5.28.
Here, the option byrow allows controlling how a matrix is filled. Specifically, by
setting it to “FALSE” (default), the matrix is filled by columns, otherwise the matrix is
filled by rows. Accessing the elements of a matrix is similar to a vector by using
square brackets. Again, this can be done either individually (a[1,2] giving the
element in row 1 and column 2) or collectively by specifying the indices of the
elements (a[c(1,3),] gives all the elements of rows 1 and 3). It is interesting to note
that by leaving an index empty, i. e., by using only “,”, all elements along that
dimension are selected.
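
A minimal sketch of the construction of a matrix and the access to its elements; the dimensions and values are assumptions made for illustration.

```r
a <- matrix(1:6, nrow = 3, ncol = 2, byrow = FALSE)   # filled column by column
b <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)    # filled row by row

a[1, 2]          # element in row 1, column 2 of a: 4
b[1, 2]          # element in row 1, column 2 of b: 2
a[c(1, 3), ]     # all elements of rows 1 and 3
dim(a)           # 3 2
```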
There are several commands available to obtain the properties of a matrix. Some
of these commands are provided in →Table 5.3.
Sometimes, it is useful to assign names to rows and columns. This can be
achieved by the commands rownames() and colnames().

Table 5.3 Examples of functions that can be applied to matrices.


Command name   Description
dim            dimension of a matrix: c(nrow, ncol)
ncol           number of columns
nrow           number of rows
length         total number of elements
Once these attributes are set, they can be retrieved by using the same command,
e. g., rownames(a) would give you a vector of length nrow, including the names of the
rows. The names of the rows or columns can also be used to access the rows and
columns:
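
The use of row and column names can be sketched as follows; the names r1, r2, r3, c1 and c2 are assumptions made for illustration.

```r
a <- matrix(1:6, nrow = 3, ncol = 2)
rownames(a) <- c("r1", "r2", "r3")
colnames(a) <- c("c1", "c2")

rownames(a)      # "r1" "r2" "r3"
a["r2", ]        # second row, selected by name
a[, "c2"]        # second column, selected by name
a["r2", "c2"]    # single element: 5
```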

There are alternative ways to create a matrix. For instance, by using the
commands cbind(), rbind(), or dim():

Also, functions can be applied to a matrix element-wise. For example, sqrt()
calculates the square root of each element.
All basic operations known from linear algebra can be performed with a matrix,
e. g., addition, element-wise multiplication, or matrix multiplication.
For mathematical calculations, it is frequently necessary to “swap” the rows and
the columns of a matrix. This can be conveniently achieved by the transpose function
t().
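
The alternative ways to create a matrix and the basic matrix operations can be sketched as follows, using small 2 x 2 matrices chosen for illustration.

```r
m1 <- cbind(1:2, 3:4)    # columns (1,2) and (3,4)
m2 <- rbind(1:2, 3:4)    # rows (1,2) and (3,4)
m3 <- 1:4
dim(m3) <- c(2, 2)       # reshape a vector into a 2 x 2 matrix

sqrt(m1)                 # element-wise square root
m1 + m2                  # element-wise addition
m1 * m2                  # element-wise multiplication
m1 %*% m2                # matrix multiplication
t(m1)                    # transpose; here equal to m2
```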

5.3.3 List
A list is a more complex data structure than the previous ones, because it can contain
elements of different types. This was not allowed for either of the previous data
structures. Formally, a list is defined using the function list(), see Listing 5.33.

In the above example, the list b consists of 4 elements, which are different data
structures. In order to access an element of a list, double square brackets can be
used, e. g., b[[2]], to access the second element. This appears similar to a vector,
discussed in Section →5.3.1. In fact, there are many commands for vectors that can
also be applied to lists, e. g., length() or names(). If name attributes are assigned to the
elements of a list, then these can be accessed by the “$” operator.
In this case, the usage of double square brackets or the “$” operator provides the
same results. It is also possible to assign names to the elements when defining a list.
The following example shows that even a partial assignment is possible. In this case,
the first two elements could be accessed by their name, whereas the latter two can
only be accessed by using indices, e. g., b[[3]] for the third element.
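
A sketch of a list with 4 elements of different types is given here. The content of the lists and the names num, vec, chr and mat are assumptions made for illustration.

```r
b <- list(3, c(5, 3), "text", matrix(1:4, nrow = 2))   # hypothetical content
length(b)        # 4
b[[2]]           # second element: the vector 5 3

names(b) <- c("num", "vec", "chr", "mat")
b$vec            # same result as b[[2]]

# names can also be given, fully or partially, at definition time
d <- list(num = 3, vec = c(5, 3), "text", TRUE)
d$num            # 3
d[[3]]           # unnamed elements are reachable by index only
```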

5.3.4 Array
Arrays are a generalization of vectors, matrices, and lists in the sense that they can be
of arbitrary dimension and type.
Elements of an array can be accessed using square brackets, and the number of
indices corresponds to the number of dimensions.

The components of an array can assume any type. Note, however, that all components
share the same type, as for a vector, unless the array is built from a list.
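
A minimal sketch of a 3-dimensional array, with dimensions chosen for illustration:

```r
a <- array(1:24, dim = c(2, 3, 4))   # a 3-dimensional array
dim(a)          # 2 3 4
a[2, 3, 4]      # one index per dimension: 24
a[, , 1]        # the first 2 x 3 slice
```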

5.3.5 Data frame


A data frame is a generalization of a matrix and a list, inheriting some of
their properties. Briefly, a data frame is a rectangular data structure whose
columns can be of different data types.

Again the command names() can be used to identify the names of the elements in
a data frame. Interestingly, the “$” operator can be used with "x" as well as x to
access elements. →Table 5.4 provides an overview of further commands for data
frames.
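
The following sketch defines a small data frame with columns of different types; the column names x, y, z and their content are assumptions made for illustration.

```r
df <- data.frame(x = c(1, 2, 3),
                 y = c("a", "b", "c"),
                 z = c(TRUE, FALSE, TRUE))   # hypothetical content
names(df)       # "x" "y" "z"
df$x            # column x
df$"x"          # the same column
dim(df)         # 3 3
```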

Table 5.4 Some examples of commands that can be used with data frames.
Command name   Description
dim            dimension of a data frame: c(nrow, ncol)
ncol           number of columns
nrow           number of rows
length         number of columns (the elements of a data frame are its columns)
names          names of the elements (columns)

5.3.6 Environment
An environment is similar to a list, however, it needs named elements. That means,
the name of an element needs to be a character string. The command ls() provides a
list of the names of all elements in an environment.

Alternatively, one can use the function assign() to assign a new element to an
environment:

Correspondingly, the function get() can be used to obtain elements of an
environment:
When we are unsure about the names of elements, we can use the function
exists() to perform a logical test, resulting in either a true or a false depending on
whether the element exists in the environment or not:

In order to delete elements of an environment the command remove() can be
used:
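
The functions of this section can be sketched together as follows. The environment e and its elements a and b are assumptions made for illustration; the option inherits = FALSE restricts the test to the environment e itself.

```r
e <- new.env()
e$a <- 5                         # add an element with the "$" operator
assign("b", "text", envir = e)   # add an element with assign()

ls(e)                            # "a" "b"
get("a", envir = e)              # 5
exists("b", envir = e)           # TRUE
exists("zz", envir = e, inherits = FALSE)   # FALSE

remove("a", envir = e)           # delete the element a
ls(e)                            # "b"
```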

5.3.7 Removing variables from the workspace


Sometimes it is necessary to delete variables that have been defined. In contrast to
the above example for an environment, this will delete the variable from the
workspace of R. This can be done by using the remove function rm():

If we want to delete many variables, we need to specify the “list” argument of the
command rm() providing a character vector naming the objects to be removed. We
can also delete all variables in the current working space in the following way:
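
A minimal sketch with hypothetical variables a, b and d. The last command is shown as a comment only, because it would clear the complete workspace.

```r
a <- 1; b <- 2; d <- 3     # hypothetical variables
rm(a)                      # remove a single variable
rm(list = c("b", "d"))     # remove several variables by name

# rm(list = ls())          # would remove all variables in the workspace
```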

5.3.8 Factor
For analyzing data containing categorical variables, a data structure called factor is
frequently encountered. A factor is like a label or a tag that is assigned to a certain
category to represent it. In principle, one could define a list containing the same
information, however, the R implementation of the data structure factor is more
efficient. An example for defining a factor is given below.
Here, we assign 4 factor entries to height, but only three “values” are different.
In the case of a factor, different values are actually called levels. The different levels of
a factor can also be obtained with the command levels().
In the above example, the factors were categorical variables, meaning that the
levels have no particular ordering. An extension to this is to define such an ordering
between the levels. This can be done implicitly or explicitly.

The first example above defines ordered factors by setting “ordered=T”. As a
result, there is an ordering between the three levels despite the fact that we did not
specify this order explicitly. However, this order is not due to any semantic meaning
of these words, but this is just an alphabetic ordering of the words.
If we would like to define a different order between the levels, we can include the
levels option. Then, the resulting order will follow the order of the levels specified by
this option.
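
The three variants of a factor can be sketched as follows. The entries of height are assumptions made for illustration, chosen such that 4 entries lead to 3 levels.

```r
height <- factor(c("small", "tall", "medium", "small"))   # hypothetical data
levels(height)    # "medium" "small" "tall"

# implicit ordering: alphabetical
h1 <- factor(c("small", "tall", "medium", "small"), ordered = TRUE)
h1[3] < h1[1]     # TRUE, because "medium" precedes "small" alphabetically

# explicit ordering via the levels option
h2 <- factor(c("small", "tall", "medium", "small"),
             levels = c("small", "medium", "tall"), ordered = TRUE)
h2[3] < h2[1]     # FALSE, because "medium" now ranks above "small"
```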

5.3.9 Date and Time


For accessing the date and time of the computer system, we can use the following
functions:

Each of these functions results in an R object of a specific type. The first function
returns an object of class POSIXct and the second of class Date. The reason for this is
that objects of the same type can be manipulated in a convenient way, e. g., using
subtraction, we can get the time difference between two time points or dates.
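
Assuming that the two functions meant here are Sys.time() and Sys.Date(), which return objects of the classes named above, their usage can be sketched as follows. The reference date is chosen for illustration.

```r
t1 <- Sys.time()    # current date and time
d1 <- Sys.Date()    # current date
class(t1)           # "POSIXct" "POSIXt"
class(d1)           # "Date"

d2 <- as.Date("2020-01-01")    # hypothetical reference date
d1 - d2                        # time difference in days
```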
5.3.10 Information about R objects
In the above sections, we showed how to define basic R objects of different types. In
all these cases, we knew the type of these objects, because we defined them explicitly.
However, when using packages, we may not always have this information. For such
cases, R provides various commands to get information about the types of objects.

5.3.10.1 The functions attributes() and class()

The function attributes() gives information about different attributes an R object can
have, including information about class, dim, dimnames, names, row.names, or levels. In
case the attributes() function does not provide information about the class of an R
object, one can obtain this information with the command class().

5.3.10.2 The functions summary() and str()

In Chapter →2, we discussed programming paradigms in detail, but we want to repeat
here that every R object is a member of a certain class. Classes are powerful data
structures that do not only have attributes, but also come with specific functions.
Here, it is only important to know that this implies that behind a simple variable there
can be a complex structure that is not recognizable without help. R provides the
functions summary() and str() to get information about an object, where the
former summarizes its content and the latter compactly displays its internal structure.

5.3.10.3 The function typeof()

Sometimes it is important to know how an R object is stored internally, because this
gives information about the number of bytes that are required. The function typeof()
gives this information; its possible outputs include logical, integer, double, complex,
character, list, and S4.
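
The functions of this section can be compared on a single object. The matrix m is an assumption made for illustration.

```r
m <- matrix(1:6, nrow = 2)    # hypothetical object
attributes(m)    # $dim with the values 2 3
class(m)         # class of the object
typeof(m)        # "integer"
str(m)           # compact display of the internal structure
summary(m)       # summary statistics for each column
```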

5.4 Handling character strings

5.4.1 The function nchar()


There is a variety of commands available in R to manipulate character strings. One of
the simplest functions is nchar(), which returns the length of a string:
If we use the function length() instead, it would not return the number of
characters, but count the whole string as 1.
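
The difference between nchar() and length() can be seen with a string chosen for illustration:

```r
s <- "data science"    # hypothetical string
nchar(s)     # 12 characters, including the blank
length(s)    # 1, because s is a vector with a single string
```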

5.4.2 The function paste()


For concatenating strings together one can use the function paste():

The sep option allows specifying what separator is used for concatenating the
strings; the default introduces a blank between two strings.
It is also possible to include a variable to form a new string.

This is useful if we want to read many files from a directory within a loop and their
names vary in a systematic way, e. g., by an enumeration. It can also be used to create
names for an environment (see Sec. →5.3.6), because an environment needs strings
as indices for elements.
Furthermore, the function paste() can be used to connect more than just two
strings:
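
The following sketch shows paste() with the default separator, with a user-defined separator, with a loop variable, and with more than two strings. The strings and the file name pattern are assumptions made for illustration.

```r
paste("data", "science")               # "data science"
paste("data", "science", sep = "_")    # "data_science"

# build file names that vary systematically
for (i in 1:3) {
  file.name <- paste("file", i, ".txt", sep = "")
  print(file.name)                     # "file1.txt", "file2.txt", "file3.txt"
}

paste("a", "b", "c", "d", sep = "-")   # "a-b-c-d"
```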

5.4.3 The function substr()


A substring of a certain length can be extracted from a string by the command
substr(x, start, stop):

If we want to overwrite parts of a string s with another string, we can use the
replacement form of the function substring() with start, specifying where to start
overwriting. In case we just want to insert a new string without overwriting parts of
the string s, we can use the function substr() to cut s at the insertion point and join
the pieces with the new string, e. g., by using paste():
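
Since the original listings are not reproduced here, the following is only a sketch of extracting, overwriting, and inserting. The string s and the positions are assumptions made for illustration; the insertion is realized by combining substr() with paste().

```r
s <- "abcdefgh"             # hypothetical string
substr(s, 3, 5)             # "cde"

s2 <- s
substring(s2, 3) <- "XY"    # overwrite starting at position 3
s2                          # "abXYefgh"

# insert without overwriting: split s with substr() and rejoin the pieces
s3 <- paste(substr(s, 1, 2), "XY", substr(s, 3, nchar(s)), sep = "")
s3                          # "abXYcdefgh"
```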

5.4.4 The function strsplit()


Splitting a string into one or more substrings can be done using the function strsplit():

The command is a bit tricky if one uses certain symbols as a split:

The reason why a "." does not work as a split symbol, but "[.]" does, is due to the
fact that the argument split is a regular expression (see Section →5.4.5).
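
The effect can be reproduced with a short sketch; the strings are chosen for illustration. The option fixed = TRUE, which switches off the interpretation of split as a regular expression, is shown as an additional alternative.

```r
strsplit("a b c", split = " ")                  # list with "a" "b" "c"
strsplit("a.b.c", split = ".")                  # "." matches any character: only empty strings
strsplit("a.b.c", split = "[.]")                # "a" "b" "c"
strsplit("a.b.c", split = ".", fixed = TRUE)    # "a" "b" "c", no regular expression
```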

5.4.5 Regular expressions


R provides very powerful functions to search strings for matching patterns:

The first argument of the above functions characterizes the pattern we try to find
and text is the string to be searched.
Both functions result in similar outputs, but displayed in different ways. While
regexpr() returns an integer vector of the same length as text, whose components
provide the position of the first match, or -1 if there is no match,
gregexpr() returns a list with this information for all matches. Furthermore, the
results of both functions have the attribute match.length that indicates the number
of characters that are actually matched. One may wonder how it is possible that the
length of a match does not correspond to the length of pattern. This is where
(nontrivial) regular expressions come into play.
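
The difference between regexpr() and gregexpr() can be sketched with a simple pattern; the string and the pattern are assumptions made for illustration.

```r
text <- "abcabc"                 # hypothetical string
r1 <- regexpr("bc", text)        # first match only
r2 <- gregexpr("bc", text)       # all matches, returned as a list

as.integer(r1)                   # 2
attr(r1, "match.length")         # 2
as.integer(r2[[1]])              # 2 5
as.integer(regexpr("xyz", text)) # -1 signals no match
```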
A regular expression is a pattern that can include special symbols, as listed in
→Table 5.5 below.
Table 5.5 Some special symbols that can be used in regular expressions.
Symbols       Meaning of the symbols
*             an asterisk matches zero or more of the preceding character
.             a dot matches any single character
+             a plus sign matches one or more of the preceding character
[...]         the square brackets enclose a list of characters that can be matched alternatively
{min,max}     the preceding element is matched between min and max times
\s            matches any single whitespace character
|             the vertical bar separates two or more alternatives
\t            matches a tab
\r            matches a carriage return
\n            matches a linefeed
0-9           matches any digit between 0 and 9 (used inside square brackets)
A-Z           matches any upper case letter between A and Z (used inside square brackets)
a-z           matches any lower case letter between a and z (used inside square brackets)

For example, the regular expression x+ matches any of the following within a
string: "x", "xx", "xxx", etc. This means that the length of the regular expression is not
equal to the length of the matched pattern. By using special symbols, it is possible to
generate quite flexible search patterns, and the resulting patterns are not necessarily
easy to recognize from the regular expression.
To demonstrate the complexity of regular expressions, let us consider the
following example. Suppose that we want to identify a pattern in a string, of which we
do not know the exact composition. However, we know certain components. For
example, we know that it starts with a “G” and is followed by zero or more
letters or numbers, but we do not know by how many. After this, there is a sequence
of numbers, which is between 1 and 4 elements long:

The above code realizes such a search and it finds at position 4 of txt a match that
is 7 elements long.
In order to extract the matched substring of txt, the function regmatches() can be
used. It expects as arguments the original string used to match a pattern and the
result from the function regexpr():
For our above example the matched substring is “GGA0234”:

This example demonstrates that with regular expressions it is not only possible to
match substrings that are exactly known, but also to match substrings that are only
partially known. This flexibility is very powerful.
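
A search of this kind can be sketched as follows. The string txt and the pattern are assumptions made for illustration; they are chosen such that the match starts at position 4, is 7 elements long, and corresponds to the substring "GGA0234".

```r
txt <- "abcGGA0234xyz"                     # hypothetical string
pattern <- "G[A-Za-z0-9]*[0-9]{1,4}"       # assumed pattern

m <- regexpr(pattern, txt)
as.integer(m)               # 4: start of the match
attr(m, "match.length")     # 7: length of the match
regmatches(txt, m)          # "GGA0234"
```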

5.5 Sorting vectors


The elements of numerical vectors can be sorted according to their size using the
function sort():

The option decreasing enables specifying whether the sorting should be in
decreasing (TRUE) or nondecreasing (FALSE, the default) order. It is important to note
that the result of sort(x) does not directly affect the input vector x. For this reason, we
need to assign the result of sort(x) to x, if we want to overwrite the input vector.

If we are interested in the positions of the sorted elements in the original vector x
we can get these indices by using the function order().

A somewhat related function to order() is rank(). However, rank(x) gives the rank
numbers (in increasing order) of the elements of the input vector x:

In the case of ties, there are several options available to handle the situation,
which can be selected with the argument ties.method.
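
The three functions can be compared on one vector. The values are assumptions made for illustration and contain a tie on purpose.

```r
x <- c(30, 10, 20, 10)             # hypothetical values with a tie
sort(x)                            # 10 10 20 30
sort(x, decreasing = TRUE)         # 30 20 10 10
x                                  # unchanged: 30 10 20 10

order(x)                           # positions of the sorted elements: 2 4 3 1
rank(x)                            # ties share the average rank: 4.0 1.5 3.0 1.5
rank(x, ties.method = "min")       # ties get the smallest rank: 4 1 3 1
```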

5.6 Writing functions


So far, we learned how to use R functions either provided in the base package or
in additional packages, loaded by the command library(). In this section, we will see
how to write our own functions. First, we will focus on the definition of a function with
exactly one argument and one return value. Later, we will extend this to more
arguments and return values.

5.6.1 One input argument and one output value


To write a new function, one needs to define the name, the argument, and the
content of the function. The general syntax is shown in Listing 5.67.

Here, fct.name is the name of the new function you want to define, argument is the
argument you submit to this function, and body is a list of commands that are
executed, applied to argument.
The definition of a new function itself uses the R keyword function. If the
body of the new function consists merely of one command, one can use the simplified
syntax:

However, for reasons of clarity and readability of the code, we recommend always
enclosing the body of the function in braces, starting with a “{” and ending with a “}”.
Let us consider an example defining a new function that adds 1 to a real number
given by the argument x:

In this example, the name of the new function is add.one(). One should always pay
attention to the name to not accidentally mask an existing function. For
instance, if we called the new function sqrt(), then the square root function, part
of the R base package, would be hidden behind the new definition.
It is good practice to finish the body with the command return() that contains as
its argument the variable we would like to get as a result from the application of the
new function. However, the following will result in the exact same behavior as the
function add.one():
Here, it is important not to write y <- x + 1, but instead x+1, without assignment
to a variable. We do not recommend this syntax, especially not for beginners, because
it is less explicit in its meaning.
We would like to note that the above-defined function is just a simple example
that does not include checks in order to avoid errors. For instance, one would like to
ensure that the argument of the function, x, is actually a number, because otherwise
operations in the body of the function may result in errors. This can be done, for
instance, using the command is.numeric().
The usage of such a self-defined function is the same as for a system function,
namely fct.name(x). The following is an example:

In general, the choice of the name of a function is arbitrary as long as it consists
of alphanumeric symbols, starting with a letter. Even the name of an existing function
can be chosen. However, as mentioned above, in this case, the existing function is
hidden by the new definition and no longer directly available in the current R session.
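
The function add.one() discussed in this section can be sketched as follows. The check with is.numeric() and the name add.one.short for the simplified form are assumptions made for illustration.

```r
add.one <- function(x) {
  # basic check that the argument is a number
  if (!is.numeric(x)) {
    stop("x needs to be numeric")
  }
  y <- x + 1
  return(y)
}

add.one(2.5)          # 3.5

# simplified form without braces and return(): the last value is returned
add.one.short <- function(x) x + 1
add.one.short(2.5)    # 3.5
```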

5.6.1.1 Merits of writing functions

Some reasons for writing your own functions are to help you to
organize your programs
make your programs more readable
limit the scope of variables
The last point is very important and shall be visualized with the following example.
Start a new R session (this is important!) and copy the following code into the R
workspace:
What will be the output of print(yzxv)? It will result in an error message, because
the variable yzxv is defined within the scope of the function fct.test(), and as such, it is
not directly accessible from outside the function. This is actually the reason why we
need to specify with the return() function the variables we want to return from the
function. If we could just access all variables defined within the body of a function,
there would be no need to do this.
The rationale behind our recommendation to start a new R session is to make sure
that no variable named yzxv is already defined in the session, since, in that case,
print(yzxv) would output that existing variable rather than the value calculated inside
the function fct.test(). For the specific choice of our variable name, this may be
unlikely (that is why we used yzxv), but for more common variable names, such as a,
i or m, there is a real possibility that this could happen.
In general, functions allow us to separate our R workspace into different parts,
each containing their own variables. For this reason, it is also possible to reuse the
same variable name in different functions without the danger of collisions.
This point addresses the so-called scope of a variable, which is an important issue,
because it is the source of common bugs in programs.

5.6.2 Scope of variables


In order to understand the full complexity of the scope of variables, let us consider
the following situation. Suppose that we have just one function, then we can have
three different scopes of variables depending on where and how they have been
defined.
First, all variables that are defined outside the function are global variables. This
means that the value of these variables is accessible inside the function and outside
the function. Second, all variables that are defined inside the function are local
variables, because they are only accessible inside the function, but not outside.
Finally, all variables that are defined inside a function by using the super-assignment
operator “<<-” are also global variables.
The following script provides an example:
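A minimal sketch of the three cases. The variable and function names (g.var, l.var, s.var, fct.scope) are hypothetical:

```r
g.var <- 1                 # defined outside the function: global

fct.scope <- function() {
  l.var <- 2               # defined inside the function: local
  s.var <<- 3              # super-assignment: becomes a global variable
  return(g.var + l.var)    # the global g.var is readable inside the function
}

fct.scope()                # 3
exists("l.var")            # FALSE: local variables are not visible outside
s.var                      # 3: created by the super-assignment
```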
5.6.3 One input argument, many output values
In order to return more than one output variable, we need to apply a little trick,
because an R function does not directly permit returning more than one variable with
the return command. Instead, we need to define a single variable, which contains all
the variables we want to return. The script below shows an example, utilizing a list.
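One possible version of such a script. The function name fct.stats and the chosen summary values are assumptions for illustration:

```r
# Hypothetical function that returns three results packed into one list
fct.stats <- function(x) {
  y <- list(mean = mean(x), sd = sd(x), range = range(x))
  return(y)
}

y <- fct.stats(c(1, 2, 3, 4))
y$mean     # 2.5
y[[3]]     # third component: 1 4
```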

In this case, the list variable y serves as a container to transmit all desired
variables. That means, formally, one has just one output variable, but this variable
contains additional output variables that can be accessed via the components of the
list. For example, we can access its third component by y[[3]].

5.6.4 Many input arguments, many output values


The case of multiple input arguments of a function is considerably easier, because R
allows calling a function with more than one argument. Consider the following
example:
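A sketch with two input arguments and a list as output. The function fct.power and its default value are hypothetical:

```r
# Hypothetical function: two inputs, several outputs returned as a list
fct.power <- function(x, p = 2) {
  return(list(input = x, exponent = p, result = x^p))
}

fct.power(3, 3)$result          # 27
fct.power(p = 2, x = 4)$result  # 16, arguments matched by name
fct.power(5)$result             # 25, p falls back to its default
```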

R provides the useful command args(), which gives some information on the input
arguments of a function. Try, for example, args(matrix).

5.7 Writing and reading data


Writing data to files and reading data from files is important in order to populate variables
and data structures with, e. g., information from experiments, and to store the
results obtained from applying a program to such data. In
some sense, this completes a programming language by providing an interface to the
outside world, where the “world” is represented by the data.
In general, writing data to a file is much easier than reading data from a file. The
reason for this asymmetry is that when writing data to a file, we have full
control over the format in which the data are saved. In contrast, when reading data from
an existing file, we need to deal with the given data format as it is, which can be very
laborious and frustrating, as we will illustrate below. For reasons of simplicity, we
start by discussing functions to write data to a file.

5.7.1 Writing data to a file


The easiest way to save one or more R objects from the workspace to a file is to
use the function save():
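A minimal sketch, assuming a small vector x and a file in the temporary directory of the session (both chosen for illustration):

```r
x <- c(1, 2, 3)

# The file location is an assumption; any path can be used
out.file <- file.path(tempdir(), "filename.RData")
save(x, file = out.file)
```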

Here, the option file defines the name of the file in which we want to save the
data. In principle, any name is allowed, with or without extension. However, it is
helpful to name this file filename.RData, where the extension RData indicates that it is a
binary R data file. Here, binary file means that if we open this file in a text
editor, its content is not human-readable because of its encoding. Hence, in order to view
its content, we need to load this file again into an R workspace.
If we want to save more than one R object, two different syntax variations exist
that can be used. The first way to save more than one R object is to just name these
objects, separated by a comma:

The second way is to use the option list, which takes a character vector
containing the names of the variables:
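Both variants can be sketched as follows, assuming two example objects x and y and temporary output files:

```r
x <- 1:3
y <- c("a", "b", "c")

f1 <- tempfile(fileext = ".RData")
f2 <- tempfile(fileext = ".RData")

# Variant 1: name the objects, separated by commas
save(x, y, file = f1)

# Variant 2: pass the object names as character strings via the option list
save(list = c("x", "y"), file = f2)
```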

If we want to save all the variables in the current workspace and not just the
selected ones, we can use the function save.image():

This function is a short cut for the following script, which accomplishes the same
task:
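A sketch of both forms. The temporary file names are assumptions; without the option file, save.image() writes to the file .RData in the working directory:

```r
a <- 1
b <- "text"

img1 <- tempfile(fileext = ".RData")
img2 <- tempfile(fileext = ".RData")

# Save the whole workspace
save.image(file = img1)

# The same task written out with save()
save(list = ls(all.names = TRUE), file = img2, envir = .GlobalEnv)
```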

For the above examples, we did not need to care about the formatting of the file
to which we save the data, because R essentially makes a copy of the workspace, either for
selected variables or for all variables. This is a very convenient and fast way to save
variables to a file. One disadvantage is that these files are intended to be loaded
with R itself, not with other programs or programming languages. This is a
problem if we plan to exchange data with other people, friends, or collaborators and
we are unsure whether they have access to R or whether they want to use it.
Therefore, R provides additional functions that are more generic in this
respect. In the following, we discuss three of them in detail.
There are three functions, namely, write.table(), write.csv(), and
write.csv2(), that allow saving tables as a text file. They belong to the package
utils, which is loaded by default. All of these functions have the
following syntax:
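A minimal sketch of the three calls, assuming a small data frame M and temporary output files. Note that write.csv() and write.csv2() fix the separator themselves (comma and semicolon, respectively):

```r
M <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))

txt.file  <- tempfile(fileext = ".txt")
csv.file  <- tempfile(fileext = ".csv")
csv2.file <- tempfile(fileext = ".csv")

write.table(M, file = txt.file, sep = "\t")  # separator chosen by the user
write.csv(M, file = csv.file)                # "," separator, "." decimal mark
write.csv2(M, file = csv2.file)              # ";" separator, "," decimal mark

readLines(csv.file)                          # the file is plain text
```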

Here, M is a matrix or a data frame, and the option sep specifies the symbol used
to separate elements in M from each other. As a result, the data saved in file can be
viewed by any text editor, because the information is saved as a text rather than a
binary file as the one generated, for example, by the function save(). The functions
write.csv() and write.csv2() provide a convenient interface to Microsoft EXCEL, because
the resulting file format is directly recognized by this program. This means we can
load these files directly in EXCEL.
A potential disadvantage of these 3 functions appears when the output of an R
program is not just one table, but several tables of different size and additional data
structures in the form of, e. g., lists, environments, or scalar variables. In such cases, a
function like write.table() would not suffice, because you can only save one table. On
the other hand, the functions save() or save.image() can be used without the need to
combine all data structures into just one table.

5.7.2 Reading data from a file


At the beginning of this section we said that, in general, it is more difficult to read
data than to save them. This is true with the exception of binary files saved with the
functions save() or save.image(), because in this case, the counterpart for reading data
from a file is the function load():
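A short round trip as a sketch, assuming an example object x and a temporary file:

```r
x <- 1:5
f <- tempfile(fileext = ".RData")
save(x, file = f)

rm(x)               # remove the object from the workspace
loaded <- load(f)   # restores x and returns the names of the loaded objects
loaded              # "x"
x                   # 1 2 3 4 5
```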

Since the function save() essentially makes a copy of the workspace, or parts of it,
and saves it to a file, the function load() just pastes it back into the workspace.
Hence, there are no formatting problems that we need to take care of.
In contrast, if tabular data are provided in a text file, we need to read this file
differently. R provides 5 functions to read such data, namely, read.table(), read.csv(),
read.csv2(), read.delim(), and read.delim2(). For example, the function read.table() has
the following syntax:

The option header is a logical value that indicates whether the file contains the
names of the variables as its first line. The option skip is an integer value indicating
the number of lines that should be skipped when we start reading the file. This is
useful when the file contains at the beginning some explanations about its content or
general comments.
Let us consider an example:
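A sketch of such a call. The file content written here is a hypothetical stand-in for the file in the figure; only the variable names infile and dat, the column name names, and the options skip, header, and sep are taken from the text:

```r
# Create a small example file: one comment line, one header line, two data rows
infile <- tempfile(fileext = ".txt")
writeLines(c("Results of experiment 1",
             "names,height,weight",
             "A,1.7,65",
             "B,1.8,80"), infile)

dat <- read.table(file = infile, header = TRUE, sep = ",", skip = 1)

colnames(dat)   # "names" "height" "weight"
dat[[1]]        # first column
dat$names       # the same column, accessed by name
```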

The content of the file infile is shown in →Fig. 5.3. This file contains a comment at its
beginning spanning one row. For this reason, we skip this line with the option skip=1.
Furthermore, this file contains a header giving information about the columns it
contains. By using “header=TRUE” this information is converted into the column names
of the table we are creating using the function read.table(). Using colnames(dat) will
give us this information. Most importantly, we need to specify the symbol that is used
to separate the numbers in the input file. This is accomplished by setting “sep=","”. As
a result, the variable dat will be a data frame containing the tabular data in the input
file having the information about the corresponding columns as column names. We
can access the information in the individual columns by using either dat[[1]], e. g., for
the first column or dat$names. Try to access the information in the second column. Is
there a problem?

Figure 5.3 File content of infile and the effect the options in the command
read.table() have on its content.

5.7.3 Low level reading functions


All the functions discussed so far, for reading data from a file, can be considered as
high-level functions, because they assume a certain structural organization of a file
that makes it relatively easy for a user to read these data into an R session. That
means, the structural organization can be captured by the supplied options of these
functions, e. g., by setting sep, skip etc. appropriately.
In case there is a text file that has a more complex format that cannot be read
with one of the above functions, R provides a very powerful, low-level reading function
called readLines(). This function allows reading a specified number of lines, n, from a
given file:

If n is a negative value, the whole file will be read. Otherwise, at most n
lines will be read. The advantage of this way of reading a file is that the formatting of
the file may vary from line to line and does not need to be fixed.
For text files with a complex, irregular formatting it is necessary to read these files
line-by-line in order to adapt to the formatting of each line separately. This can be done
in the following way:
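A minimal sketch, assuming a small three-line example file created on the fly:

```r
infile <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), infile)

# Reading the whole file at once: n = -1
all.lines <- readLines(infile, n = -1)

# Reading line by line via an open connection
con <- file(description = infile, open = "r")
for (i in 1:3) {
  line <- readLines(con, n = 1)   # each call returns the next line
  print(line)
}
close(con)                        # release the connection when done
```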

The function file() opens a connection to the file specified by the option
description; the option open specifies the mode of the connection, e. g., “r” for reading. Then
calling readLines() with n = 1 reads exactly one line from the file. That means, if called repeatedly,
for example within a for-loop, it gives one line after the other, and these can be
processed individually. In this way, arbitrarily formatted files can be read and stored
in variables so that the information provided by the file can be used in an R session. If
we want to restart reading from this file, we just need to apply the function file()
again.

Figure 5.4 File content of infile and the effect the options in the command
read.table() have on its content.
To demonstrate the usage of the function readLines(), let us consider the following
example reading data from the file shown in →Fig. 5.4. In this case, our file contains
some irregular rows, and we would either like to entirely omit some of them, such as
row 5, or only use them partially, e. g., row 4. The following code reads the file and
accomplishes this task:
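A sketch of how such code could look. The file content below is hypothetical and only mimics the situation described (an unwanted second element in row 4, an unusable row 5); it is not the file from the figure:

```r
infile <- tempfile(fileext = ".txt")
writeLines(c("1,10,100",
             "2,20,200",
             "3,30,300",
             "4,note,40,400",          # row 4: second element is unwanted
             "this row is not data",   # row 5: to be skipped entirely
             "6,60,600"), infile)

con <- file(description = infile, open = "r")
rows <- list()
for (i in 1:6) {
  fields <- strsplit(readLines(con, n = 1), split = ",")[[1]]
  if (i == 5) next                     # omit row 5
  if (i == 4) fields <- fields[-2]     # drop the second element of row 4
  rows[[length(rows) + 1]] <- as.numeric(fields)
}
close(con)

# Combine the processed rows into a data frame
dat <- as.data.frame(do.call(rbind, rows))
dat
```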

As a result, we receive a data frame “dat” containing the following information:

This corresponds to the information in the input file, skipping row 5 and omitting
the second element in row 4. From this example, we can see that “low-level
functions” offer a high degree of flexibility, which comes at the price of a considerable
amount of additional coding that we need to do to process an input file.
There is another function similar to readLines(), called scan(). The function scan()
does not result in a data frame, but a list or vector object. Another difference with
readLines() is that it allows specifying the data-types to be read by setting the what
option. Possible values of this option are, e. g., double, integer, numeric, character, or
raw. The following code shows an example for its usage:
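A minimal sketch, assuming a small comma-separated example file:

```r
infile <- tempfile(fileext = ".txt")
writeLines(c("1,2,3", "4,5,6"), infile)

# what = numeric() declares that all entries are read as numbers
dat <- scan(file = infile, what = numeric(), sep = ",")
dat              # 1 2 3 4 5 6
is.vector(dat)   # TRUE
```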
As one can see, the object dat is a vector and the components of the input file,
separated according to sep, form the components of this vector. In our experience,
the function readLines() is the better choice for complex data files.
We just would like to mention without discussion that the function writeLines()
provides similar functionality and flexibility for writing data to a file in a line-by-line
manner.

5.7.4 Summary of writing and reading functions


In →Table 5.6, we provide a brief overview of some R functions discussed in the
previous sections. Column three indicates the difficulty level in using these functions,
which is directly proportional to the flexibility of the corresponding functions.
Table 5.6 Brief overview of R functions to read and write data.
Command name    File type      Usage level
save            binary file    easy
save.image      binary file    easy
write.table     text file      intermediate
write.csv       text file      intermediate
writeLines      text file      difficult
load            binary file    easy
read.table      text file      intermediate
read.csv        text file      intermediate
read.delim      text file      intermediate
readLines       text file      difficult

5.7.5 Other data formats


In addition to the R functions discussed above, which are included in the base
package, there are some additional packages available that allow importing data files
from other programs. In →Table 5.7, we list some of the most common formats
provided by other software and the corresponding package name, where one can
find the functions. This list is not intended to be exhaustive and it is recommended, if
one has a data file from a well-established software or program, to search first if
there is an R package available to read such data before one tries to implement a
program from scratch.

Table 5.7 Reading data files from other programs.


Command name    File type      Package name
read.spss       SPSS file      foreign
read.dta        Stata file     foreign
read.systat     Systat file    foreign
sasxport.get    SAS file       Hmisc
readMat         Matlab file    R.matlab
read.octave     Octave file    foreign

5.8 Useful commands


In this section, we discuss some commands that we find particularly useful for the
day-to-day applications of R.

5.8.1 The function which()


For identifying the indices, in a vector or a matrix, whose components have certain
values, we can use the function which(). This function expects a logical vector or a
matrix and returns the indices of the TRUE elements. A logical vector from a numerical
vector v can be, e. g., obtained by an expression like v==3. This results in a logical
vector that has the same length as the vector v, but its components are either TRUE or
FALSE depending on whether the component equals 3 or not.

When we are interested in identifying the indices of a matrix that have a certain
value, one can use the option arr.ind=TRUE to get the matrix indices:
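A short sketch for a vector and a matrix (the example values are chosen for illustration):

```r
v <- c(3, 1, 3, 2)
v == 3                          # TRUE FALSE TRUE FALSE
which(v == 3)                   # 1 3

A <- matrix(c(1, 3, 3, 4), nrow = 2)   # filled column by column
idx <- which(A == 3, arr.ind = TRUE)   # one row per match: (row, col)
idx                                    # matches at (2, 1) and (1, 2)
which(A == 3)                          # 2 3
```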

If we set this option to FALSE (which is the default value), the result consists of
the linear indices of the TRUE elements, counted column by column, instead of their
row and column indices.

5.8.2 The function apply()


The function apply() enables applying a certain function to a matrix or array along the
provided dimension. Its syntax is:

Here, X corresponds to a matrix or array, FUN is the function that should be applied
to X, and MARGIN indicates the dimension of X to which FUN will be applied. The
following example calculates the row sums of a matrix:
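A minimal sketch, assuming a small 2 x 3 example matrix A. The general form of the call is apply(X, MARGIN, FUN):

```r
A <- matrix(1:6, nrow = 2)            # rows: (1, 3, 5) and (2, 4, 6)

apply(X = A, MARGIN = 1, FUN = sum)   # row sums: 9 12
apply(X = A, MARGIN = 2, FUN = sum)   # column sums: 3 7 11
```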

A similar result could be obtained by using a for-loop over the rows of the matrix
A.
In the case where the variable X is a vector, there exists a similar function called
sapply(). This function has the following syntax:
There are two differences compared to apply(). First, no MARGIN argument is
needed, because the function FUN will be applied to each component of the vector X.
Second, there is an option called simplify resulting in a simplified output of the
function sapply(). If set to TRUE (the default), the result will have the form of a
vector where possible, whereas if set to FALSE the result will be a list. Which form
one prefers depends on the intended usage, but a vector is usually most suitable
for visual inspection. The list form can also be obtained with the command lapply().
The next example results in a vector, where each element is the third power of the
corresponding component of the vector X.
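A sketch of this example, assuming the vector X = (1, 2, 3, 4). The general form of the call is sapply(X, FUN, simplify = TRUE):

```r
X <- 1:4

cubes <- sapply(X, FUN = function(x) x^3)
cubes                                                  # 1 8 27 64

# With simplify = FALSE the result is a list, as returned by lapply()
cubes.list <- sapply(X, function(x) x^3, simplify = FALSE)
is.list(cubes.list)                                    # TRUE
```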

5.8.3 Set commands


A (mathematical) set is a collection of elements without duplications. This is different
from a vector, which may contain duplicated elements:

The function union() results in a set containing all elements, without duplication,
provided in the two sets of its argument.
Other commands for sets include intersect(), which returns only the elements that are
in both sets, and setdiff(), which returns only the elements that are in the first, but not in the
second set, i. e., if X = setdiff(Y, Z), then all the elements in the set X are also in the
set Y, but not in the set Z. →Table 5.8 provides an overview of set operations.
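The set commands can be tried out with two small vectors containing duplicates (the example values are chosen for illustration):

```r
x <- c(1, 2, 2, 3)
y <- c(3, 4, 4)

union(x, y)               # 1 2 3 4
intersect(x, y)           # 3
setdiff(x, y)             # 1 2
setequal(x, c(3, 2, 1))   # TRUE: duplicates and order are ignored
is.element(2, x)          # TRUE
```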

Table 5.8 Each of these commands will discard any duplicated values in its
arguments.
Command name        Description
union(x,y)          combines the values in x and y
intersect(x,y)      finds the common elements in x and y
setdiff(x,y)        returns the elements in x that are not in y
setequal(x,y)       returns the logical value TRUE if x is equal to y and FALSE otherwise
is.element(x, y)    returns the logical value TRUE if x is an element in y and FALSE otherwise

5.8.4 The function unique()


When we have a vector x that may contain multiple duplications of several elements,
we can use the function unique() to remove all such duplications:
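A one-line sketch with an example vector:

```r
x <- c(2, 5, 2, 7, 5)
unique(x)   # 2 5 7, in the order of first occurrence
```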

This can be useful if we want to use the values in the vector x as indices, and we
want to use each index only once.

5.8.5 Testing arguments and converting variables


When discussing the definition of functions in Section →5.6, we mentioned the
importance of making sure that the provided arguments are of the required type. In
general, R provides several useful commands for testing the nature of arguments. In
→Table 5.9, we give an overview of the most useful ones.

Table 5.9 Each of these commands allows testing its argument and returns a logical
value.
Command name Description
is.numeric(x) returns TRUE if argument is numerical value (double or integer)
is.character(x) returns TRUE if argument is of character type
is.logical(x) returns TRUE if argument is of logical type
is.list(x) returns TRUE if argument is a list
is.matrix(x) returns TRUE if argument is a matrix
is.environment(x) returns TRUE if argument is an environment
is.na(x) returns TRUE if argument is NA
is.null(x) returns TRUE if argument is of null type (NULL)

The above commands are complemented by conversion functions that transform
arguments into a specific type. Some examples are given in →Table 5.10.
Table 5.10 Each of these commands converts its argument to a specific type.
Command name       Description
as.numeric(x)      converts the argument into a numerical value
as.character(x)    converts the argument into a character type
as.logical(x)      converts the argument into a logical type
as.list(x)         converts the argument into a list
as.matrix(x)       converts the argument into a matrix
is.na(x) <- TRUE   sets the elements of the argument to NA (base R has no function as.na())
as.null(x)         ignores the argument and returns NULL
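Testing and conversion are often used together. A brief sketch with illustrative values:

```r
x <- "3.14"
is.numeric(x)         # FALSE: x is a character string
is.character(x)       # TRUE

y <- as.numeric(x)    # convert the string into a number
is.numeric(y)         # TRUE

as.character(42)      # "42"
as.logical("TRUE")    # TRUE
```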

5.8.6 The function sample()


In order to sample elements from a given vector x, we can use the function sample().
To sample just means that the vector x contains a certain number of elements, i. e., its
components, from which we can draw a certain number according to some rules.
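The general form of the call is sample(x, size, replace = FALSE, prob = NULL). A minimal sketch, assuming the vector x = (1, ..., 5) and a fixed seed so that the draws are repeatable:

```r
set.seed(1)   # only for repeatability of the random draws
x <- 1:5

s.without <- sample(x, size = 3, replace = FALSE)  # no element twice
s.with    <- sample(x, size = 10, replace = TRUE)  # repeats are possible
```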

Here, x is a vector, from which elements will be sampled. The option size indicates
the number of elements that will be sampled, and replace indicates if the sampling is
with (TRUE), or without (FALSE) replacement. In the case replace = FALSE, the option
size must not exceed the number of elements (length of the vector) in vector
x.
In →Figure 5.5, we visualize the two different sampling strategies. The column x
(before) indicates the possible values that can be sampled, and the column x (after)
contains the elements that are “left” after drawing a certain number of elements
from it. In the case of sampling with replacement, there is no difference since each
element that is “removed” from x is replaced with the same element. However, for
sampling without replacement, the number of elements in x decreases. It is important
to note that in the case of sampling with replacement, we can sample the same
element multiple times (see green ball in →Figure 5.5). This is not possible without
replacement.
Figure 5.5 Visualization of different sampling strategies. A: Sampling with
replacement. B: Sampling without replacement.

The option prob allows assigning a probability distribution to the elements of the
vector x. By default, a uniform distribution is assumed, i. e., selecting all elements in x
with the same probability.

In the case where we just want to sample all integer values from 1 to n, the
following version of the function sample() can be used:
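Both the option prob and the short form can be sketched as follows. The probabilities are arbitrary example values:

```r
set.seed(1)   # only for repeatability of the random draws
x <- 1:5

# Non-uniform sampling: the first element is drawn with probability 0.6
s.prob <- sample(x, size = 10, replace = TRUE,
                 prob = c(0.6, 0.1, 0.1, 0.1, 0.1))

# Short form: a random permutation of the integers 1, ..., n
n <- 5
s.n <- sample(n)
```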

For n=5, this realizes the same sampling function as above.


In summary, the function sample() allows sampling from a one-dimensional
distribution prob with elements in x.
5.8.7 The function try()
In some cases, it may be possible that there is a command, whose execution might
cause an error leading to the interruption of a program. If such a command is used
within a larger program, this will of course result in the crash of the whole program.
To prevent this, there is the command try(), which is a wrapper function to run an
expression in a protected manner. That means, an expression will be evaluated and in
case it would result in an error, it will capture this error, but without leading to a
formal error causing the crash of a program. For example, executing sqrt("two")
results in an error, because the function sqrt() expects a numeric argument and not a
character string. However, wrapping this call in try() with the option silent=TRUE does not generate
a formal error, but captures it in the object ms:

In order to get the error message, one can execute either of the following
commands:
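A sketch of the protected call and the two ways of retrieving the message:

```r
# Protected evaluation: the error is captured instead of stopping the program
ms <- try(sqrt("two"), silent = TRUE)
class(ms)           # "try-error"

# Two ways to obtain the error message
cat(ms)             # message stored in the object ms
geterrmessage()     # last error message of the session
```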

The difference between both commands is that the function geterrmessage() gives
only the last error message in the current R session. That means, if you execute
further commands that also result in an error, you cannot go back in the history of
crashed functions.
In order to use the functionality of the function try() within a program, one can
test if the output of try() is as expected or not. For our example above, this can be
done as follows:
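One possible form of such a test. The fallback value NA and the variable name result are assumptions for illustration:

```r
ms <- try(sqrt("two"), silent = TRUE)

if (is.numeric(ms)) {
  result <- ms                  # expected case: use the numeric output
} else {
  result <- NA                  # error case: handle it differently
  message("sqrt() failed, using NA instead")
}
result
```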

In this way, a numeric output can be used in some way, whereas an error
message, resulting in a FALSE for this test, can be handled in a different manner.
One may wonder how could it be possible that a command within a “functional”
program can result in an error. The answer is that before a program is functional, it
needs to be tested. And during the testing stage, there may be some irregularities,
and using the function try() may help to find these. Aside from this, R may use an
external input, e. g., provided by an input file, containing information that is outside
the definition of the program. Hence, it may contain information that is not as
expected in a certain context.
In addition to the function try(), R provides the function tryCatch(), which is a more
advanced version for handling errors and warning events.
5.8.8 The function system()
There is an easy way to invoke operating system (OS) specific commands by using the
command system(). This command allows the execution of OS commands like pwd or ls
as if they were executed from a terminal. However, the real utility of the function
system() is that it can also be used to execute scripts.
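A brief sketch, assuming a Unix-like operating system on which the commands pwd and echo are available:

```r
# Run an OS command; its output is printed to the console
system("pwd")

# With intern = TRUE the output is returned as a character vector
out <- system("echo hello", intern = TRUE)
out
```

A script could be started in the same way, e.g., by passing a command line such as "Rscript myscript.R" (a hypothetical file name) to system().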

5.9 Practical usage of R


In the previous sections, we discussed many important base functions of R. All of
these can be directly executed within an R session. This works fine for exploring these
functions and for playing around with their options in order to get to know them.
However, this is not a good way to use R when doing serious work. Instead, it is
recommended to write all the functions within a script, and then either execute the
whole script, or copy-and-paste parts of the script into an R session for its execution.
In order to execute a script containing an R program, we can use the function
source() as follows:
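A self-contained sketch that first writes a small script to a temporary file and then executes it. The script content is hypothetical:

```r
# Create a small script file for demonstration purposes
script <- tempfile(fileext = ".R")
writeLines(c("z <- 1:10",
             "z.sum <- sum(z)"), script)

# Execute all commands contained in the file
source(script)
z.sum   # 55
```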

The input of the function source() is a character string containing the name of the
file.
The advantage of writing an R program in a file and then executing it is that the
results are easily reproducible in the future. This is particularly important if we are
writing a scientific paper or a report and we would like to make sure that no detail
about the generation of the results is lost. In this respect, it can be considered a good
practice to store all of our programs in files.
Aside from this, it is also very helpful since we do not need to remember every
detail of a program, which is anyway hardly possible if a program is getting more and
more complex and lengthy. In this way, we can create over time our own library of
programs, which we can use to look up how we solved certain problems, in case we
cannot remember.

5.9.1 Advantage over GUI software


The script-wise execution of programs is actually a very important advantage of R,
and any other programming language, over software that is based on graphical
user interfaces (GUIs). In order to understand this argument, which may even seem
counterintuitive at first, let us remember how such GUI-based software, e. g., Excel or
Partek, work. Usually, one selects sequentially commands from a menu and executes
them. This can be seen as the sequential execution of commands directly written in
an R session with exactly the same disadvantages. That means, if one would like to
execute the same sequence of commands again, one needs to select them again
manually from the menu. However, in contrast to R, it is not possible to save the
sequence of commands so that it can be invoked automatically in an iterative manner
for a future application. On the other hand, one can easily convert an R script into an R
function to make it executable in any R program.
This is one argument that demonstrates the advantage of R over general GUI-
based software. For fairness, we would like to add that this is only an advantage if you
make use of this capability, for example when you are a developer for designing new
data analysis software. If you are only interested in the application of “standard”
solution methods for problems, then the usage of a GUI-based software can be very
well justified.
In Chapter →2, more details about this have been presented when we introduced
different programming paradigms.

5.10 Summary
In this chapter, we provided an introduction to programming with R that covered all
base elements of programming. This is sufficient for the remainder of the book and
should also allow you to write your own programs for a large number of different
problems. A very good free online resource for getting more details about functions,
options, and packages is STHDA →http://www.sthda.com/english developed by
Alboukadel Kassambara. For unlocking advanced features of R, we recommend the
book by [→46]. This is not a cookbook, but provides in-depth explanations and
discussions.
6 Creating R packages

R is an open-source interpreted language designed for statistical
analysis. Nowadays it is widely used for statistical software development, data
analysis and machine learning applications in multiple scientific areas. R is easy to
use and has a large number of packages available, which allows users to
extend their code easily and efficiently.
In this chapter, we show how you can create your own R package for functions
you implemented. This makes your code reusable and portable. An R package is not
only the most appropriate way to achieve this, but it also enables a convenient use of
these functions and ensures the reproducibility of results. Furthermore, R packages
enable us to easily integrate our code with other R packages.

6.1 Requirements

6.1.1 R base packages


Installation of the R environment: the first step for programming in R and developing
R packages is the installation of the R software environment itself. R is an open-source
programming environment, which can be downloaded free from the following
address →https://cran.r-project.org/.
The basic R environment provides the following core packages: base, stats, utils,
and graphics.
base: This package contains all the basic functions (including instructions and
syntax), which allow a user to write code in R. It contains functions, e. g., for
basic arithmetic operations, matrix operations, data structure, input/output,
and for programming instructions.
utils: This package contains all utility functions for creating, installing, and
maintaining packages and many other useful functions.
stats: This package contains the most basic functions for statistical analysis.
graphics: This package provides functions to visualize different types of data in
R.

A user can utilize the functions available in the base R environment to create their
own packages rather than creating functions or objects from scratch. Below are examples
of how to get a list of all functions in these packages.
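A minimal sketch using ls() on the attached packages. It assumes a default R session, in which these four packages are attached:

```r
# Names of all objects exported by the core packages
base.fcts     <- ls("package:base")
utils.fcts    <- ls("package:utils")
stats.fcts    <- ls("package:stats")
graphics.fcts <- ls("package:graphics")

head(base.fcts)      # first few entries
length(stats.fcts)   # number of objects in stats
```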
6.1.2 R repositories
R repositories provide a large number of packages for statistical analysis, machine
learning, modeling, visualization, web mining, and web applications. A list of currently
available R repositories is shown in →Table 6.1.

Table 6.1 A table of repositories in R.


Repository: CRAN
URL: →https://cran.r-project.org/
Description: R packages for all purposes
Installation: install.packages("[package name]")

Repository: Bioconductor
URL: →https://bioconductor.org/packages/
Description: Bioconductor provides a large number of packages for high-throughput genomic data analysis
Installation: Installation details are at →https://bioconductor.org/install

Repository: Neuroconductor
URL: →https://neuroconductor.org
Description: Provides packages for image analysis
Installation: source("https://neuroconductor.org/neurocLite.R") neuro_install("aal")

Repository: GitHub
URL: →https://github.com/trending/r
Description: GitHub also provides a large number of packages. Additionally, it is an alternative source of R packages available in other R repositories.
Installation: install_github("github package url")

Repository: Omegahat
URL: →http://www.omegahat.net/
Description: Provides different packages for statistical analysis
Installation: install.packages(packageName, repos = "http://www.omegahat.net/R")

6.1.3 Rtools

Rtools is required for building R packages. On Linux and macOS, the required tools are
installed with R-base, but on Windows, Rtools needs to be installed separately. The ".exe"
file of Rtools for installation can be obtained at the following address:
→http://cran.r-project.org/bin/windows/Rtools/.
6.2 R code optimization
For R programs to run efficiently, a developer should provide code that is efficient and
fast. Developers are always advised to profile their code in order to check
its memory usage, its execution time, and the performance of each
instruction, and to identify possible performance improvements. The function
debug() in R-base allows a user to step through the execution of a function, line by line.
Furthermore, the function traceback() prints the sequence of calls that led to the last
error, which helps a user to find where the code crashed.

6.2.1 Profiling an R script


The following Listing provides an illustration of script profiling in R:
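A minimal sketch using the sampling profiler Rprof() from the package utils. The function slow.sum and its workload are hypothetical and only serve to give the profiler something to measure:

```r
# Hypothetical workload: a deliberately slow loop
slow.sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + sqrt(i)
  return(s)
}

prof.file <- tempfile(fileext = ".out")

Rprof(prof.file)             # start profiling, samples are written to the file
res <- slow.sum(1e7)
Rprof(NULL)                  # stop profiling

prof.summary <- summaryRprof(prof.file)
prof.summary$by.self         # time spent in each function

# Total run time of a single call can be measured with system.time()
system.time(slow.sum(1e6))
```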

6.2.2 Byte code compilation


From version 2.13.0, R includes a byte code compiler, which allows users to speed up
their code. In order to use the byte code compiler, the user needs to load the
package compiler, which is distributed with R. For the byte compilation of the whole
package during the installation a user must add ByteCompile: true to the DESCRIPTION
file of the package. This avoids the need to call cmpfun() for each function.
The following listing provides an illustration of the byte code compilation in R:
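A minimal sketch using cmpfun() from the package compiler. The function f is a hypothetical example; since recent R versions compile functions automatically on repeated use, the measured difference may be small:

```r
library(compiler)

# Hypothetical function with an explicit loop
f <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  return(s)
}

fc <- cmpfun(f)          # byte-compiled version of f

f(100) == fc(100)        # TRUE: both versions give the same result

system.time(f(1e6))      # timing of the original function
system.time(fc(1e6))     # timing of the compiled function
```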
6.2.3 GPU library, code, and others
GPU libraries, such as gpuR, h2o4gpu, gmatrix, provide an R interface to use GPU
devices for computationally expensive analysis. Users can also parallelize their R code
using either of the following packages: parallel, foreach, and doParallel. Furthermore,
users can write scripts in C or C++ and run them in R using the package Rcpp for a
faster execution of their overall code.

6.2.4 Exception handling


For package development in R, it is essential to use error handling. R provides various
error handling functions to deal with unusual conditions, errors, and
warnings that can occur during the execution of a function or a package. Error
handling provides clean and efficient code execution, which has the following
advantages:
Separating the main code from error-handling routines.
Allowing the complete execution of the code when the exceptions are identified
and handled.
Describing relevant errors, error types, and warnings when an error occurs.
Preventing code or a package from crashing and recovering from errors when
an unexpected error occurs.
This makes the debugging and profiling of complex code and packages easier. R
provides two types of error handling mechanisms: the first one is try() or tryCatch(),
and the second is withCallingHandlers(). The command tryCatch() registers exiting
handlers. When the condition is handled, the control returns to the context where
tryCatch() was called. Thus, it causes the code to exit when a condition is signaled.
The tryCatch() command is suitable to handle error conditions. The command
withCallingHandlers() defines local handlers, which are called in the same context
where the condition is signaled, and the control returns to the same context where
the condition was signaled. Hence, it resumes the execution of the code after
handling the condition. It maintains a full call stack to the code line or the segment
that signals the condition. The command withCallingHandlers() is especially useful to
handle non-error conditions [→201].
An example of exception handling is shown in the following Listing.
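A minimal sketch of both mechanisms, assuming only base R; the function safe_log() is a hypothetical example. With tryCatch(), the evaluation is left as soon as a condition is caught, whereas with withCallingHandlers() the evaluation resumes after the handler has run.

```r
# tryCatch(): exiting handlers, the evaluation of log(x) is abandoned
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      message("warning caught: ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {
      message("error caught: ", conditionMessage(e))
      NA_real_
    },
    finally = message("safe_log() finished")
  )
}
print(safe_log(10))    # regular result
print(safe_log(-1))    # log(-1) signals a warning, NA is returned
print(safe_log("a"))   # a non-numeric argument signals an error, NA is returned

# withCallingHandlers(): calling handler, the evaluation is resumed
res <- withCallingHandlers(
  {
    warning("first warning")
    "execution resumed"
  },
  warning = function(w) {
    message("logged: ", conditionMessage(w))
    invokeRestart("muffleWarning")   # suppress the warning and continue
  }
)
print(res)
```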

6.3 S3, S4, and RC object-oriented systems


R utilizes functional and object-oriented features of programming. R has multiple
object-oriented programming (OOP) systems, which are S3, S4, RC, and R6. However,
most packages in R have been developed using the S3 OOP system.
6.3.1 The S3 class
In the S3 system, methods belong to generic functions instead of the objects of a class.
A generic function checks the class of its first argument and then dispatches the
method defined for that class. For example, plot() is a generic function to visualize
data. Different packages build on plot() and define their own methods for it. For
instance, plot.hclust() is a method that visualizes dendrograms for objects of the
class hclust. In the example given below, we create a generic trigonometric value
function, trgval(), with a default definition described by trgval.default(), which is
used when no method exists for the class of the object. To provide trgval() for the
class cosx, we can define a corresponding method for this generic function.
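A minimal sketch of this dispatch mechanism. The bodies of the functions are assumptions made for illustration (angles in degrees, a sine value by default), and the method name trgval.cosx() simply follows the S3 naming convention generic.class.

```r
# Generic function: dispatches on the class of its first argument
trgval <- function(x, ...) UseMethod("trgval")

# Default method (assumed here to return the sine of an angle in degrees)
trgval.default <- function(x, ...) {
  structure(list(angle = x, value = sin(x * pi / 180)), class = "trg")
}

# Method for objects of the class cosx (assumed to return the cosine)
trgval.cosx <- function(x, ...) {
  ang <- unclass(x)
  structure(list(angle = ang, value = cos(ang * pi / 180)), class = "trg")
}

a <- trgval(90)                             # uses trgval.default()
b <- trgval(structure(60, class = "cosx"))  # uses trgval.cosx()
print(a$value)
print(b$value)
```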
6.3.2 The S4 class
S4 works similarly to S3, but it provides a stricter definition of the object-oriented
concept of programming. Hence, S4 classes allow complex data to be represented in a
simpler way. Below, we provide an example of class inheritance with the use of a
constructor and a method call:
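A minimal sketch, assuming the package methods; the classes Shape and Circle and the generic function area() are hypothetical names chosen for illustration. The validity function shows the stricter class definition: an object that violates it cannot be created.

```r
library(methods)

# Base class and a derived class with a validity check
setClass("Shape", representation(name = "character"))
setClass("Circle", contains = "Shape",
         representation(r = "numeric"),
         validity = function(object) {
           if (length(object@r) != 1 || object@r < 0) {
             "r must be a single non-negative number"
           } else {
             TRUE
           }
         })

# Generic function and a method for the class Circle
setGeneric("area", function(obj, ...) standardGeneric("area"))
setMethod("area", "Circle", function(obj, ...) pi * obj@r^2)

c1 <- new("Circle", name = "unit circle", r = 1)  # constructor
print(is(c1, "Shape"))   # Circle inherits from Shape
print(c1@name)           # slots are accessed with @
print(area(c1))          # method dispatch on the class of obj

# An invalid object is rejected by the validity check
bad <- tryCatch(new("Circle", name = "bad", r = -1),
                error = function(e) conditionMessage(e))
print(bad)
```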
6.3.3 Reference class (RC) system
The classes in the RC system provide reference semantics and support public and
private methods, active bindings, and inheritance. In this system, methods belong to
objects, not to generic functions. The objects are mutable. Creating RC objects is
similar to creating S4 objects. The package methods, which is distributed with R,
implements RC-based OOP. The package R6 also provides functionalities to implement
OOP with reference semantics. An example of object mutability in the RC system is
shown below.
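A minimal sketch, assuming setRefClass() from the package methods; the class Student, its fields, and the method add_points() are hypothetical names chosen for illustration.

```r
library(methods)

# Reference class with two fields and one method that modifies a field
Student <- setRefClass("Student",
  fields  = list(name = "character", score = "numeric"),
  methods = list(
    add_points = function(p) {
      score <<- score + p   # modifies the object in place
      invisible(.self)
    }
  )
)

S1 <- Student$new(name = "Ann", score = 10)
S2 <- S1                # S2 refers to the same object as S1
S2$add_points(5)
print(S1$score)         # the change made through S2 is visible in S1

S1_copy <- S1$copy()    # an explicit copy is independent of S1
S1_copy$score <- 0
print(S1$score)         # unchanged by the modification of the copy
```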
In the above example, S2 is not a copy of S1, but a reference to S1, so
any changes to S2 will be reflected in S1, and vice versa.

6.4 Creating an R package based on the S3 class system


In order to start a package generation, we first write our R program in a file with .R
extension. In the next step, we call the function package.skeleton() with different
arguments required for the package creation. Different files and folders in a package
skeleton are shown in →Figure 6.1.
Figure 6.1 Schematic view of the file hierarchy in an R package.

6.4.1 R program file


We create an R program file named trg.R with the S3 system, as shown in the example
below. The function package.skeleton() then creates a basic package skeleton with all
the necessary folders and files required for building a package. In our example, we
create a method for values of the sin() function. We start with our main function with
the generic name trgval(), which calls the function UseMethod(). In the next step, we
create a default function trgval.default() as well as the function plot.trg().

6.4.1.1 The file trg.R
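A minimal sketch of what such a file could contain. The function names and the arguments of plot.trg() follow the documentation files in Section 6.7.1; the function bodies are assumptions made for illustration (angles given in degrees) and not necessarily the original implementation.

```r
# Generic function
trgval <- function(x, ...) UseMethod("trgval")

# Default method: returns an object of the class "trg"
trgval.default <- function(x, ...) {
  obj <- list(angle = x, value = sin(x * pi / 180))
  class(obj) <- "trg"
  obj
}

# Plot method for objects of the class "trg"
plot.trg <- function(trg, wave = TRUE, minang = -450, maxang = 450, ...) {
  if (wave) {
    # Draw the sine wave and mark the input values on it
    ang <- seq(minang, maxang, by = 1)
    plot(ang, sin(ang * pi / 180), type = "l",
         xlab = "angle (degrees)", ylab = "sin(angle)", ...)
    abline(h = 0, lty = 2)
    points(trg$angle, trg$value, pch = 19, col = "red")
  } else {
    # Show the input values only
    plot(trg$angle, trg$value,
         xlab = "angle (degrees)", ylab = "sin(angle)", ...)
  }
  invisible(trg)
}

zz <- trgval(c(30, 60, 90))
plot(zz)
plot(zz, wave = FALSE)
```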


6.4.1.2 Package skeleton
Once the package skeleton is created, we need to edit the DESCRIPTION and NAMESPACE
files and the package documentation files in the folder man. The content of the edited
files is shown in Section →6.7.1. The folder trgpkg contains the following files and
folders:
DESCRIPTION: This file contains some basic information about the package, for
example: version, author, description, and package dependencies.
man: This folder contains the documentation of the functions and data of the
package in .Rd format; the developer needs to edit these files to
describe the inputs, outputs, and examples of their functions.
data: If a package contains any data sets, those files should be kept in the
folder data. These data files should be R objects, such as a matrix,
vector, or data frame.
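A minimal sketch of the skeleton creation, assuming the argument code_files of package.skeleton(). To keep the example self-contained, a shortened, hypothetical version of trg.R is first written to a temporary folder; in practice, the existing file would be used.

```r
# Write a shortened trg.R to a temporary folder (hypothetical content)
work_dir <- file.path(tempdir(), "pkgdemo")
dir.create(work_dir, showWarnings = FALSE)
code_file <- file.path(work_dir, "trg.R")
writeLines(c(
  'trgval <- function(x, ...) UseMethod("trgval")',
  'trgval.default <- function(x, ...) {',
  '  structure(list(angle = x, value = sin(x * pi / 180)), class = "trg")',
  '}'
), code_file)

# Create the package skeleton around the code file
package.skeleton(name = "trgpkg", code_files = code_file,
                 path = work_dir, force = TRUE)

# Inspect the generated files and folders
print(list.files(file.path(work_dir, "trgpkg"), recursive = TRUE))
```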

6.4.2 Building an R package


Now we need to check, compile, and build the package. First, we go to the command
prompt, change to the directory where the package is kept, and run the command
R CMD build [package name] to build the package tarball. We can also run the same
command from within R by passing it to the function system(). Below we provide an
example.
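A minimal sketch, assuming that the package folder trgpkg is located in the current working directory; if the folder is not found, nothing is built.

```r
pkg_dir <- "trgpkg"                      # folder created by package.skeleton()
r_bin <- file.path(R.home("bin"), "R")   # path to the R executable

# Equivalent to typing "R CMD build trgpkg" at the command prompt
build_cmd <- paste(shQuote(r_bin), "CMD build", shQuote(pkg_dir))
if (dir.exists(pkg_dir)) {
  system(build_cmd)
} else {
  message("Folder '", pkg_dir, "' not found; nothing was built.")
}
```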

Now we get the tarball of the package, which is named trgpkg_1.0.tar.gz.

6.5 Checking the package


Before a package is installed, it should be checked for errors and warnings so that
it can be debugged properly. A package is checked with the command R CMD check
followed by the name of the package tarball, e. g., R CMD check trgpkg_1.0.tar.gz.

The check command creates a folder with the package name and .Rcheck
extension. All the error logs and warning files are created inside this folder. The user
can check all these files to evaluate the package.
6.6 Installation and usage of the package
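When a package is built and checked properly, and all errors and warnings have been
addressed, then it is ready for installation in R. A package is installed from its
tarball with the command R CMD INSTALL at the command prompt, e. g.,
R CMD INSTALL trgpkg_1.0.tar.gz, or from within R with install.packages() using the
options repos = NULL and type = "source".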

6.7 Loading and using a package


When the package is installed, we can easily load it, call its functions and load data
associated with the package.
Below, we provide some examples of using functions in the package built in the
previous section (i. e., the package trgpkg).

6.7.1 Content of the files edited when generating the package


Content of the file ”DESCRIPTION”

Package: trgpkg
Type: Package
Title: Example package for package creation
Version: 1.0
Date: 2019-01-20
Author: Shailesh Tripathi and Frank Emmert-Streib
Maintainer: Shailesh Tripathi <shailesh.tripathy@gmail.com>
Description: This provides a simple example for package creation in R.
License: GPL (>= 2)
LazyLoad: yes

Content of the file ”NAMESPACE”

exportPattern("^[[:alpha:]]+")
export(plot.trg, trgval)
importFrom("graphics", "abline", "axis", "plot",
"points", "segments", "text")
importFrom("stats", "runif")

Content of the file ”man/plot.trg.Rd”

\name{plot.trg}
\alias{plot.trg}
\title{
Plots the sin function and the input value of "trg" class.
}
\description{
Plots the sin function and the input value of "trg" class.
}
\usage{
plot.trg(trg, wave = T, minang = -450,
maxang = 450, ...)
}
\arguments{
\item{trg}{
this is a "trg" class object generated using "trgval" function.
}
\item{wave}{
It is a logical value. If true gives a wave plot of sin
function.
}
\item{minang}{
the minimum value of the domain of sin function for visualizing
sin function on the x-axis.
}
\item{maxang}{
the maximum value of the domain of sin function for visualizing
sin function on the x-axis.
}
\item{\dots}{
all other input types as available in the "plot" function
}
}
\value{
Provides a graphic view of the sin function
}

\author{
Shailesh Tripathi and Frank Emmert-Streib}

\seealso{
\code{\link{plot}}
}

\examples{
zz <- trgval(90)
plot(zz)
plot(zz, wave=FALSE)
}

Content of the file ”man/trgval.Rd”

\name{trgval}
\alias{trgval}
%- Also NEED an ’\alias’ for EACH other topic documented here.
\title{
A generic function which is used to calculate
trigonometric values.
}
\description{
A generic function which is used to calculate
trigonometric values.}
\usage{
trgval(x, ...)
}
\arguments{
\item{x}{
is a numeric value or vector}
\item{\dots}{
}
}

\value{
returns a "trg" class object
}

\author{
Shailesh Tripathi and Frank Emmert-Streib}

\seealso{
plot.trg, trgval.default}

\examples{
zz <- trgval(c(30, 60, 90))
plot(zz)
plot(zz, wave=FALSE)
}

6.8 Summary
In this chapter, we provided a brief introduction to the creation of R packages. This
topic can be considered advanced, and it is not required for the remainder of this
book. However, in a professional context, the creation of R packages is necessary
for simplifying the usage and exchange of a large number of individually created
functions.
Nowadays, many published scientific articles provide accompanying R packages to
ensure that all obtained results can be reproduced. Despite the intuitive clarity of this,
the reproducibility of results has recently sparked heated discussions, especially
regarding the provisioning of the underlying data [→70].
Part II Graphics in R
7 Basic plotting functions
In this chapter, we introduce plotting capabilities of R that are part
of the base installation. We will see that there is a large number of
different plotting functions that allow a multitude of different
visualizations.

7.1 Plot
The most basic plotting tool in R is provided by the plot()
function, which allows visualizing y as a function of x. The following
script gives two simple examples (see →Figure 7.1 (A) and (B)):
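A minimal sketch of such a script, assuming the sine function as the example and base R only:

```r
x <- seq(0, 2 * pi, length.out = 50)   # 50 equally spaced points
y <- sin(x)

plot(x, y)               # (A) points
plot(x, y, type = "l")   # (B) line
```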

In this example, we first define the elements of vector x as a
sequence of 50 equally spaced points ranging from 0 to 2π. The
elements of vector y correspond to the sine function evaluated at
the 50 points provided by x.
The first example, shown in →Figure 7.1 (A), plots each element
in the vector x against the corresponding element in the vector y.
That means, the plot function always visualizes pairs of elements in
x and y, i. e., $(x_i, y_i)$ for all $i \in \{1, \dots, 50\}$. In
contrast, →Figure 7.1 (B) shows the same result, but with the line
option for the type of the visualization. The difference is that in
this case, the 50 pairs of points are connected by straight line
segments, which results in a line that appears smooth.
At first glance, →Figure 7.1 (B) may appear as the natural
visualization of a sine function, because we know it is a smooth
function. However, it is important to realize that computer
graphics are always pixel-based, i. e., a line is always a sequence of
points. But what is then the difference to point-based graphics? It
is the spacing between consecutive points (and their size). In this
sense, →Figure 7.1 (B) is realized, internally by R, as a sequence of
points that are very close to each other so that the resulting plot
appears as a continuous line.
To see this, change the value of the option length.out in the
seq command and observe what consequence this has on the resulting
plot.
Figure 7.1 Examples for the basic plotting function plot().

The two examples shown in →Figure 7.1 (A) and (B)
demonstrate just a quick visualization of the functional relation
between x and y. However, in order to improve the visual
appearance of these plots, it is usually advised to utilize additional
options. Below we show two further examples, see →Figure 7.1 (C)
and (D), that use different options.
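A minimal sketch of a plot that uses such additional options; the particular values for the margins, font sizes, line width, and color are assumptions chosen for illustration.

```r
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x)

op <- par(mar = c(5, 5, 2, 1))   # margins: bottom, left, top, right
plot(x, y, type = "b", col = "blue", lwd = 2,
     cex.axis = 1.4, cex.lab = 1.6, font.lab = 2)
par(op)                           # restore the previous settings
```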
The type option allows specifying whether points (p), lines (l),
or both (b) are drawn. The options cex.axis and cex.lab allow
changing the font size of the axis annotation and of the axis
labels, respectively, and font.lab leads to a bold face of the
labels. Finally, the line width can be adjusted by setting lwd to a
positive numerical value, and col specifies the color of the lines
or points.
There is one further command in the above examples (par) that
appears unimpressive at first. However, it allows adjusting the
margins of the figure by setting mar. Specifically, we need to
provide a vector of four values to set the margins for (bottom,
left, top, right) (in this order). This command is important,
because when setting the font size of the labels larger than a
certain value, it can happen that the labels are cut off. To prevent
this, the mar option needs to be set appropriately.
In the following, we will always modify a basic plot by setting
additional options to improve its visual appearance.

7.1.1 Adding multiple curves in one plot


There are two functions available that enable adding multiple lines
or points into the same figure, namely lines and points. Two
examples for the script below are shown in →Figure 7.2 (A) and (B).
In order to distinguish different lines or points from each other, we
can specify the line-type (lty) or point-type (pch) option.
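A minimal sketch, assuming a sine and a cosine curve as the two data sets:

```r
x <- seq(0, 2 * pi, length.out = 50)

# (A) two lines, distinguished by the line type
plot(x, sin(x), type = "l", lty = 1)
lines(x, cos(x), lty = 2)

# (B) two point sets, distinguished by the point symbol
plot(x, sin(x), type = "p", pch = 1)
points(x, cos(x), pch = 4)
```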
Figure 7.2 Examples for multiple plots in one figure.

This can be extended to multiple lines or points commands, as
shown for the next examples; see →Figure 7.2 (C) and (D). Here, we
add a legend to the figures that allows a better identification of
the different lines with the parameters that have been used. The
legend command allows specifying the position of the legend within
the figure. Here, we used bottomright so that the legend does not
overlap with the lines or points. Further, we need to specify the text
that shall appear in the legend and the symbol, e. g., lty or pch.
There are more options available, but these provide the basic
functionality of a legend.
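A minimal sketch with three lines and a legend; the parameter values in a and the axis labels are assumptions chosen for illustration.

```r
x <- seq(0, 2 * pi, length.out = 50)
a <- c(1, 2, 3)    # hypothetical parameter values

plot(x, sin(a[1] * x), type = "l", lty = 1,
     xlab = "x", ylab = "sin(a x)")
lines(x, sin(a[2] * x), lty = 2)
lines(x, sin(a[3] * x), lty = 3)

# Position, text, and line types of the legend
legend("bottomright", legend = paste("a =", a), lty = 1:3)
```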
For the above example, we used the xlab and ylab options to
change the appearance of the labels of the x- and y-axis. In the
previous examples, we did not specify them explicitly. In that
case, R uses the names of the variables passed to the function
plot() as default labels.

7.1.2 Adding horizontal and vertical lines


We can also add straight horizontal and vertical lines to the graph
using the function abline(). Depending on the option used, i. e., h or
v, horizontal or vertical lines are added to a figure at the provided
values. This command also allows changing the line-type (lty) or
color (col). In →Figure 7.3, we show an example that includes one
horizontal and one vertical line.
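A minimal sketch, assuming a sine curve as the underlying plot; the positions and colors of the lines are chosen for illustration.

```r
x <- seq(0, 2 * pi, length.out = 50)
plot(x, sin(x), type = "l")
abline(h = 0, lty = 2, col = "red")     # horizontal line at y = 0
abline(v = pi, lty = 3, col = "blue")   # vertical line at x = pi
```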

Figure 7.3 Example for adding horizontal and vertical lines to a plot
using the function abline().
7.1.3 Opening a new figure window
In order to plot a function in a new figure while keeping a figure
that is already created, one needs to open a new plotting window
using one of the following commands:
X11(), for Linux and Mac if using R within a terminal
macintosh(), for a Mac operating system (quartz() in current
versions of R)
windows(), for a Windows operating system

If these commands are not executed, then every new plot()
command executed will overwrite the old figure created so far.

7.2 Histograms
An important graphical function to visualize the distribution of
data is hist(). The command hist() shows the histogram of a data set.
For instance, we draw n = 200 samples from a normal
distribution with a mean of zero and a standard deviation of one,
and save the resulting values in a vector called x; see the code
below. →Figure 7.4 (Left) shows a histogram of the data with 25 bars
of equal width, set by the option breaks. Here, it is important to
realize that the data in x are raw data. That means, the vector x does
not provide the information displayed in →Figure 7.4 (Left) directly,
but only indirectly. For this reason, the number of occurrences of
values in x, e. g., in the interval 0.5 ≤ x ≤ 0.6, needs to be
calculated by the hist() function. However, in order to do that, one
needs to specify the boundaries of the intervals for such
calculations. The function hist() supports two different ways to do
that. The first one is to just set the total number of bars the
histogram should contain. The second one is to provide a vector
containing the boundary values explicitly.

An example for the second way is shown in →Figure 7.4 (Right).
Here, the boundary values are provided by the vector b. Also, in this
case we set the option freq to FALSE, which results in a histogram of
densities. In contrast, →Figure 7.4 (Left) shows the frequencies for
each bar, corresponding to the number of x-values that fall within the
boundaries of one bar.
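A minimal sketch of both ways of specifying the bars. The boundary vector b with a step size of 0.5 is an assumption chosen for illustration. Note that hist() treats a single number given to breaks as a suggestion, so the number of bars drawn may differ slightly from 25.

```r
set.seed(1)                       # for reproducibility
x <- rnorm(200, mean = 0, sd = 1)

# Left: request about 25 bars; frequencies (counts) on the y-axis
h1 <- hist(x, breaks = 25)

# Right: explicit boundary values; densities on the y-axis
b <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
h2 <- hist(x, breaks = b, freq = FALSE)

print(sum(h1$counts))                      # number of samples
print(sum(h2$density * diff(h2$breaks)))   # total area of the bars
```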

Figure 7.4 Examples for histograms. Left: Providing the total
number of bars. Right: Providing the boundary values.
7.3 Bar plots
If the heights of the individual bars are already available,
one can use the command barplot(). In the following example, we
assign to each of the n = 10 bars a color randomly. We do that by
using the command colors(), which provides a vector of the 657
color names defined in R. In order to select the colors randomly,
we use the sample() command to sample n integer values between 1
and 657 randomly without replacement. Here, randomly means that
each element has the same probability to be selected, namely
p = 1/657.
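A minimal sketch of this example; the bar heights are hypothetical values generated at random. The number of available colors is taken from length(colors()) instead of being hard-coded.

```r
set.seed(1)
n <- 10
heights <- sample(1:20, n, replace = TRUE)   # hypothetical bar heights

all_cols <- colors()                  # vector of the defined color names
idx <- sample(length(all_cols), n)    # n indices, without replacement
barplot(heights, col = all_cols[idx])
```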

In →Figure 7.5 (Right), we show a second example for a bar plot
that splits each bar into individual contributing components. Such
plots are called stacked bar charts. For instance, suppose we have 3
factors that contribute to the outcome of a variable that is
measured for 5 different conditions indexed by the letters A–E.
Then, for each condition, the outcome of a variable can be broken
down into its 3 constituting values.
Figure 7.5 Examples for a normal bar chart (Left) and a stacked bar
chart (Right).

In the example below, we use two new options. The first one is
space, which allows adjusting the spatial distance between adjacent
bars. The second one is names.arg, which allows specifying the
labels that appear below each bar. For specifying the labels, we use
the built-in vector LETTERS to conveniently assign the first 5
capital letters of the alphabet to names.arg. Alternatively, the
built-in vector letters can be used to obtain lowercase letters.
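A minimal sketch of a stacked bar chart; the matrix m of contributions contains hypothetical values, and the colors are chosen for illustration.

```r
set.seed(1)
# 3 factors (rows) measured for 5 conditions (columns); hypothetical values
m <- matrix(sample(1:10, 15, replace = TRUE), nrow = 3)

barplot(m, space = 0.5, names.arg = LETTERS[1:5],
        col = c("steelblue", "orange", "darkgreen"))
```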

7.4 Pie charts


An alternative visualization to a bar plot is a pie chart, also
called circle chart. In a pie chart, the arc length of each slice is
proportional to the quantity which it represents.
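A minimal sketch of a pie chart; the quantities in shares are hypothetical values.

```r
shares <- c(A = 10, B = 20, C = 30, D = 40)   # hypothetical quantities
pie(shares, col = c("steelblue", "orange", "darkgreen", "grey"))

# Fraction of the full circle taken by each slice
print(shares / sum(shares))
```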

Figure 7.6 A simple pie chart.


7.5 Dot plots
For the visualization of a large number of single-valued entities,
a dot plot can be useful. A dot plot is like a graphical version of a
table that makes it easier to recognize the relative differences
between the values of the different entities.
In →Figure 7.7, we show an example visualizing the mortality of
cancer in European countries normalized per 100,000 people. The
information is taken from the World Health Organization (WHO)
data for the year 2013. These data are provided in the file WHO.RData.
Figure 7.7 Information about cancer mortality in Europe taken
from the World Health Organization (WHO) data for the year 2013.

In →Figure 7.7, the average number of cancer deaths (averaged
over gender) is ordered alphabetically according to the country
names. Overall, the information about the 53 countries and the
range of all possible values is easy to grasp, making a dot plot an
attractive graphical alternative to a table.
In the following code, we first identify all European countries in
the data frame dat.who.ave, because it also contains information
about non-European countries. Then, we use the dotchart() function,
specifying the entries which shall be visualized
(dat.who.ave$deaths[ind]) and the labels (dat.who.ave[ind,1]) that
should be used to identify the corresponding rows in the dot plot.
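A minimal sketch of these steps. Since the file WHO.RData is not assumed to be available here, the data frame dat.who.ave is replaced by a hypothetical stand-in with made-up values; the column names country and region are assumptions as well.

```r
set.seed(1)
# Hypothetical stand-in for the data frame dat.who.ave (made-up values)
dat.who.ave <- data.frame(
  country = paste("Country", LETTERS[1:8]),
  region  = rep(c("Europe", "Other"), times = c(5, 3)),
  deaths  = round(runif(8, 80, 250))
)

# Identify the rows of the European countries
ind <- which(dat.who.ave$region == "Europe")

dotchart(dat.who.ave$deaths[ind], labels = dat.who.ave[ind, 1],
         xlab = "Deaths per 100,000 (made-up values)")
```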

As usual, there is more than one way to visualize a data set in
a meaningful way, depending on the perspective. In the following,
we present just one alternative representation of the same data set
by sorting the mortality values of the countries.
Figure 7.8 Ordered information about the cancer mortality in
Europe. Same data as in →Figure 7.7.

In →Figure 7.8, we ordered the mortality values and grouped
the countries into three categories. Each of these categories is
highlighted in a different color by specifying the option color. The
category of a country is specified with the groups option by
providing a vector of factors. If visualized in this way, subgroups
within the data set can be highlighted additionally. Of course, there
are further modifications one could make, e. g., the alphabetic
organization of the countries within the subgroups or subdividing
the subgroups, e. g., highlighted by specifying different symbols
using the gpch option.
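A minimal sketch of an ordered and grouped dot plot, again with made-up values instead of the WHO data; the three categories and their colors are assumptions chosen for illustration.

```r
set.seed(1)
deaths <- round(runif(9, 80, 250))                # made-up values
names(deaths) <- paste("Country", LETTERS[1:9])

d <- deaths[order(deaths)]                        # sort by mortality

# Three categories of equal size, given as a factor
grp <- factor(rep(c("low", "medium", "high"), each = 3),
              levels = c("low", "medium", "high"))
cols <- c("darkgreen", "orange", "red")[as.integer(grp)]

dotchart(d, labels = names(d), groups = grp, color = cols,
         xlab = "Deaths per 100,000 (made-up values)")
```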

Finally, we would like to note that a dot plot is also called a
Cleveland dot plot, because William Cleveland pioneered this kind of
visualization.

7.6 Strip and rug plots


Another plotting style, which is also due to Cleveland, is called a
strip plot. It can be used to visualize one-dimensional data along a
line, plotting each data point to its corresponding spot. In R, the
stripchart() function allows the application of this style, and in
→Figure 7.9 we show three examples (corresponding to the green,
red, and blue data points).
Figure 7.9 Examples for strip plots (corresponding to the green,
red, and blue data points) and a rug plot.

For these examples, we generated 100 integer values from the
interval 1 to 100. The example shown by the red data points
corresponds to the base form of this plot function, in which
data points with the same value are plotted on top of each other, as
specified by the option method. Alternatively, the
data points can be stacked leading to a kind of histogram (in blue),
although the height does not exactly reflect the count values, but is
merely proportional to the relative density of the x values within a
certain region. Finally, the data points can be jittered by applying a
small offset between data points of the same x-value.

In addition to a strip plot, there is the rug() function available in
R, which we also included in →Figure 7.9. The function rug() is
similar to the function stripchart(); however, the difference is that
the rug() function does not provide an option similar to at for
stripchart(), which allows shifting the data points along the y-axis.
Also, there is no option to specify the symbol to be displayed as a
data point; instead, data points are shown as vertical lines.
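A minimal sketch of the three strip plots and the rug plot in one figure; the vertical positions given by at and the plot symbols are assumptions chosen for illustration.

```r
set.seed(1)
x <- sample(1:100, 100, replace = TRUE)   # 100 integers between 1 and 100

# Base form: identical values are plotted on top of each other
stripchart(x, method = "overplot", at = 1, ylim = c(0, 4),
           col = "red", pch = 1)
# Identical values are stacked
stripchart(x, method = "stack", at = 2, add = TRUE, col = "blue", pch = 1)
# Values are shifted by a small random offset
stripchart(x, method = "jitter", at = 3, add = TRUE, col = "green", pch = 1)

rug(x)   # vertical lines along the x-axis
```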

7.7 Density plots


The next visualization style we will discuss is the density plot. A
density plot can be seen as an extension of strip plots and
histograms. The idea of a density plot is to convert the density of
the data points within certain sensible regions into relative height
values so that a summation over all density values adds up to one.
The example in the previous section for stripchart(), with the
option method="stack", is almost a density plot, according to the
above description. However, instead of converting regions of x-
values into relative height values, individual data points are stacked
if they are identical. Furthermore, height values are merely the
number of identical data points and, hence, these values would not
sum up to one. However, summation over all data points and
division of the stacked heights normalizes the resulting values
giving, in principle, a valid density plot. Also, if we are creating a
histogram with a certain bin size and normalize the resulting height
values by the total sum over all bins, we obtain another valid
density plot.
Despite the fact that these simple modifications of a strip plot
and a histogram lead to density plots, there are two characteristics
missing that enable general density plots. These characteristics are
(1) averaging over overlapping regions, and (2) weighting the data
points within these regions by applying a sliding window.
In order to understand these two characteristics, we show in
→Figure 7.10 some examples. The data in these examples are again
the average cancer mortalities from the WHO, however, this time
also for non-European countries. In the top row, we show an
example that averages over overlapping regions, but does not
apply a weighting to these regions. This corresponds to a sliding
window of a fixed size that counts, for each position of the window,
the number of data points within this window, and discards all
other data points outside. In R, this can be obtained by using the
function density() and specifying rectangular for the option kernel.
The size of the window is specified by the option bw (bandwidth).
One can see that for bw=0.5, the resulting density plot is much more
rugged than for bw=5; hence, this option allows a smoothing of the
plot.
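A minimal sketch of the effect of the bandwidth for a rectangular kernel. Since the WHO data are not assumed to be available here, made-up, normally distributed values are used as a stand-in.

```r
set.seed(1)
x <- rnorm(150, mean = 150, sd = 30)   # made-up stand-in data

d_rough  <- density(x, bw = 0.5, kernel = "rectangular")
d_smooth <- density(x, bw = 5,   kernel = "rectangular")

op <- par(mfrow = c(1, 2))             # two plots side by side
plot(d_rough,  main = "bw = 0.5")
plot(d_smooth, main = "bw = 5")
par(op)
```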
Figure 7.10 Density plots for cancer mortality worldwide. Data are
from the WHO. Top row: Averaged over overlapping regions.
Bottom row: Averaged and weighted over overlapping regions.

In the bottom row in →Figure 7.10, we show examples, which
additionally invoke a weighting of the data points. For these
examples, we used a normal distribution (kernel="gaussian"). In
→Figure 7.11, we depict the principal idea that underlies the
weighted averaging. In this figure, a normal distribution with a
mean value of μ = m.k = 175 and a standard deviation of
σ = sd.k = 10 is shown. It is given formally by

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty \le x \le \infty, \tag{7.1}$$

and discussed in detail in Chapter →17. The mean value m.k is
highlighted by the red vertical line to indicate the current
position of the averaging. The averaging itself involves all data
points, however, the weight is proportional to the density of the
normal distribution for the given values of m.k and sd.k.
Figure 7.11 Principle idea of the sliding window assigning weights
to data points.

In order to emphasize that different data points contribute
differently, we added vertical lines of different length, according to
the density values above these data points. Hence, the value for
m.k is proportional to the sum of all values weighted by the
density at the data points, resulting in

$$v(\mathrm{m.k}) = \sum_{i=1}^{L} f(\mathrm{num}(i); \mathrm{m.k}, \mathrm{sd.k}). \tag{7.2}$$

Here, L is the total number of data points and num(i) is the
number of data points in window i. In this way, for each position
along the x-axis, the value v(m.k) is evaluated by changing the
values of m.k.
The corresponding results are shown in →Figure 7.10
(bottom row). Again, depending on the value of the bandwidth (bw),
the obtained density plots can be smoothed. For the normal
distribution, the parameter bw changes the standard deviation,
making the normal distribution broader for larger values.
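The weighted averaging can be sketched by hand. The example below is an illustration under stated assumptions: it uses made-up data, places a normal density with standard deviation sd.k at each position m.k, sums its values at the individual data points, and divides the sum by the number of data points L. For comparison, the result of density() with a Gaussian kernel is overlaid.

```r
set.seed(1)
x <- rnorm(150, mean = 150, sd = 30)   # made-up stand-in data
sd.k <- 10                             # standard deviation of the kernel

# Sum of the normal densities, centered at m.k, over all data points
v <- function(m.k) sum(dnorm(x, mean = m.k, sd = sd.k))

# Evaluate v for positions m.k along the x-axis
grid <- seq(min(x) - 4 * sd.k, max(x) + 4 * sd.k, length.out = 400)
dens <- sapply(grid, v) / length(x)    # divide by L to normalize

plot(grid, dens, type = "l", xlab = "x", ylab = "density")
lines(density(x, bw = sd.k, kernel = "gaussian"), lty = 2, col = "red")
rug(x)
```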
We would like to finish this section by mentioning that the
above discussion focused on the graphical meaning of density plots
and their underlying idea. However, it is important to note that the
quantitative estimation of the probability density of a given data set
is an important statistical problem in its own right.

7.8 Combining a scatterplot with histograms: the layout function
R provides a very powerful command that allows combining
different plot functions with each other. This command is the
layout() function. Basically, the function layout() enables the partition
of a figure into different sections where different plots can be
placed. The option mat allows defining such a separation by
choosing an integer number for each section in which we want to
place a plot. In the example below, we split the whole figure into
4 regions, but we would like to put plots in only 3 of them. The
integer number corresponds to the order in which these regions
are plotted. In our example, the first plot is placed in the bottom-
left region.
In addition to the partitioning of the figure itself, we also need
to specify what width and height these regions should have. For
instance, in the example below, we use a relative width of 3 to 1 for
regions (2,1) to (0,3). By using the function layout.show(lla), we can
display the defined regions explicitly in order to see if the ratios are
as desired.
Finally, we need to make sure that the different plots we use
actually cover the same range of the x-axis. Here, we
guarantee this by providing the break points of the histograms
explicitly.
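A minimal sketch of such a layout, assuming normally distributed example data. The relative heights, the margins, and the use of barplot() for drawing the histogram counts are assumptions chosen for illustration; the bars are therefore only roughly aligned with the axes of the scatterplot.

```r
set.seed(1)
x <- rnorm(200)
y <- rnorm(200)

# 4 regions; region 0 (top right) remains empty
lla <- layout(mat = matrix(c(2, 0,
                             1, 3), nrow = 2, byrow = TRUE),
              widths = c(3, 1), heights = c(1, 3))
layout.show(lla)   # display the defined regions

# Explicit break points, so that all plots cover the same ranges
brk_x <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
brk_y <- seq(floor(min(y)), ceiling(max(y)), by = 0.5)
hx <- hist(x, breaks = brk_x, plot = FALSE)
hy <- hist(y, breaks = brk_y, plot = FALSE)

op <- par(mar = c(4, 4, 1, 1))
plot(x, y, xlim = range(brk_x), ylim = range(brk_y))        # region 1
par(mar = c(0, 4, 1, 1))
barplot(hx$counts, space = 0, axes = FALSE)                 # region 2
par(mar = c(4, 0, 1, 1))
barplot(hy$counts, space = 0, horiz = TRUE, axes = FALSE)   # region 3
par(op)
```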
Figure 7.12 Combining a scatterplot with histograms of its x- and y-
components.
In this way, we can create very complex figures that carry a lot
of information.

7.9 Three-dimensional plots


The visualization capability of R is not limited to one- and two-
dimensional plots. It is also possible to create three-dimensional
visualizations. The example below shows an application of the
command persp() for visualizing the density of a two-dimensional
normal distribution.
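A minimal sketch, assuming two independent standard normal variables, so that the two-dimensional density is the product of two one-dimensional densities; the viewing angles theta and phi are chosen for illustration.

```r
x <- seq(-3, 3, length.out = 41)
y <- seq(-3, 3, length.out = 41)

# Density of two independent standard normal variables on a grid
z <- outer(x, y, function(a, b) dnorm(a) * dnorm(b))

persp(x, y, z, theta = 30, phi = 30, expand = 0.6, col = "lightblue",
      xlab = "x", ylab = "y", zlab = "density")
```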
Figure 7.13 A three-dimensional visualization of the two-
dimensional normal distribution.

7.10 Contour and image plots


Alternatives to three-dimensional plots, available in R, include
contour maps and image plots. Such plots are projections of
three-dimensional plots onto a two-dimensional canvas.
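A minimal sketch of both plot types, assuming the density of two independent standard normal variables evaluated on a grid:

```r
x <- seq(-3, 3, length.out = 41)
y <- seq(-3, 3, length.out = 41)
z <- outer(x, y, function(a, b) dnorm(a) * dnorm(b))

op <- par(mfrow = c(1, 2))                 # two plots side by side
contour(x, y, z, xlab = "x", ylab = "y")   # lines of equal density
image(x, y, z, xlab = "x", ylab = "y")     # density encoded by color
par(op)
```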
Figure 7.14 Examples for a contour (Left) and an image plot (Right)
of a normal distribution.

7.11 Summary
Despite the fact that all of the commands discussed in this chapter
are part of the base installation of R, they provide a vast variety of
options for the visualization of data, as we have seen in the last
sections. Extension packages either address specific problems,
e. g., the visualization of networks, or provide different
visual aesthetics.
8 Advanced plotting functions: ggplot2

8.1 Introduction
The package ggplot2 was introduced by Hadley Wickham [→200].
What distinguishes this package from many others is that it does not
only provide a set of commands for the visualization of data, but
also implements Leland Wilkinson’s idea of the Grammar of Graphics
[→204]. This makes it more flexible, allowing the creation of many
different kinds of visualizations that can be tailored in a problem-
specific manner. In addition, its aesthetic realizations are superb.
The ggplot2 package is available from the CRAN repository. It
can be installed with install.packages("ggplot2") and loaded into an
R session with library(ggplot2).

There are two main plotting functions provided by the ggplot2
package:
qplot(): for quick plots
ggplot(): allows the control of everything (grammar of
graphics)
In this chapter, we discuss both of these plotting functions.

8.2 qplot()
The function qplot() is similar to the basic plot function in R. The q in
front of plot stands for “quick”, in the sense that it does not give
access to the full potential provided by the package ggplot2.
The full potential is accessible via ggplot(), discussed in Section →8.3.
To demonstrate the functionality of qplot(), we use the penguin
data provided in the package FlexParamCurve.

The penguin.data data frame has 2244 rows and 11 columns of
the measured masses for little penguin chicks between 13 and 74
days of age (see [→33]).
In →Figure 8.1, we show the basic functionality of qplot(),
generating a scatter plot using the following script:
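A minimal sketch of such a script. The column names ckage (age of the chick) and weight are assumptions about the data frame penguin.data; the column year is taken from the text. The code is only executed if the packages ggplot2 and FlexParamCurve are installed. Note that recent versions of ggplot2 mark qplot() as deprecated, so a corresponding warning may appear.

```r
have_pkgs <- requireNamespace("ggplot2", quietly = TRUE) &&
  requireNamespace("FlexParamCurve", quietly = TRUE)

if (have_pkgs) {
  library(ggplot2)
  data("penguin.data", package = "FlexParamCurve")

  # Scatter plot; the color depends on the year of the observation
  p <- qplot(ckage, weight, data = penguin.data, geom = "point",
             color = factor(year))
  print(p)
}
```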
Figure 8.1 Example of a scatter plot for multiple data sets using
qplot().
Table 8.1 Examples for different options for qplot().
Option for geom    Description
point              scatterplot
line               connects ordered data points by a line
smooth             smoothed line between data points
path               connects data points by a line in the order provided by data
step               step function
histogram          histogram
boxplot            boxplot

Similar to the base plot function in R, we first specify the values
for the x and y coordinates using the column names in the data frame
penguin.data. The geom option defines the geometry of the object.
Here, we chose “point” to produce a scatter plot for all data points
$(x_i, y_i)$. Other options are given in →Table 8.1. The last option
we use is color to allow different colors for the observation points
depending on the year in which the observation was made. We use the
function “factor” to indicate that the values of the variable “year”
are used only as a categorical variable. The aesthetics command I()
can be used to set the color of the data points manually.
In order to further distinguish data points from each other, one
can use the shape option with the factor ck, which encodes the
hatching order. Because this can lead to a crowded visualization,
qplot() offers the additional option facets. The effect of this option
is shown in →Figure 8.2.
Figure 8.2 An example for the usage of facets.

The two columns in →Figure 8.2 are labeled A and B,
corresponding to the levels of the factor ck, which indicate first
hatched (A) and second hatched (B).
Next, we visualize the effect of the option value smooth for geom.
For this, we use only the first 10 observation points. As we can see
in →Figure 8.3, in addition to these 10 data points, a smooth
curve is added as a result of the smoothing function. Note that
here we used a vector to define the option geom, because we wanted
to show the data points in addition to the smooth curve.
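The facets option and the vector-valued geom described above can be sketched as follows. As before, penguins is a synthetic stand-in for penguin.data, and the column names (ckage, weight, year, ck) are assumptions made for illustration only.

```r
library(ggplot2)

# Synthetic stand-in for penguin.data (hypothetical values and column names)
set.seed(2)
penguins <- data.frame(
  ckage = rep(seq(15, 60, by = 5), times = 4),
  year  = rep(c(2000, 2001), each = 20),
  ck    = rep(c("A", "B"), times = 20)           # hatching order
)
penguins$weight <- 100 + 12 * penguins$ckage +
  rnorm(nrow(penguins), sd = 60)

# One panel per level of ck instead of one crowded plot
p_facets <- qplot(ckage, weight, data = penguins,
                  color = factor(year), facets = . ~ ck)

# Points plus a smoothed curve for the first 10 observations;
# geom is a vector because both geometries are wanted
p_smooth <- qplot(ckage, weight, data = penguins[1:10, ],
                  geom = c("point", "smooth"))

print(p_facets)
```

The two-sided formula . ~ ck arranges the panels in columns; ck ~ . would arrange them in rows.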
Figure 8.3 An example for smoothing data with qplot().

Similar to the base plot function in R, there are options available
to enhance the visual appearance of a plot. →Table 8.2 shows some
additional options to enhance plots.
Table 8.2 Further options to improve the visual appearance of a plot
for qplot().
Option        Description
xlim, ylim    limits for the axes, e. g., xlim=c(-5, 5)
log           character vector indicating logged axes, e. g., log="x" or log="xy"
main          main title for the plot
xlab, ylab    labels for the x- and y-axes

8.3 ggplot()
The underlying idea of the function ggplot() is to construct a figure
according to a certain grammar that allows adding the desired
components, features, and aspects to a figure and then generating
the final plot. Each such component is added as a layer to the
plot.
The base function ggplot() requires two input arguments:
data: a data frame of the data set to be visualized
aes(): a function containing aesthetic settings of the plot

8.3.1 Simple examples


In the following, we study some simple examples by using the
Orange data set containing data about the growth of orange trees.
To get an overview of these data, we show the first lines.
> head(Orange)
Grouped Data: circumference ~ age | Tree
  Tree  age circumference
1    1  118            30
2    1  484            58
3    1  664            87
4    1 1004           115
5    1 1231           120
6    1 1372           142

The data set contains only three variables (Tree, age, and
circumference), where the variable Tree is an indicator variable
that identifies a particular tree.

In order to plot any figure, we need to use the ggplot()
command, and specify how we want to plot these data by providing
information about the geometry. In the above case, we just want to
plot the circumference of the trees as a function of their age by
means of points, see →Figure 8.4 (Left). The same result can be
obtained by splitting the whole command into separate parts as
follows:
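A minimal sketch of the two equivalent forms, using the Orange data set that ships with R; the object names p1, base_plot, and p2 are illustrative and not taken from the book's listing.

```r
library(ggplot2)

# One-step form: data, aesthetic mapping, and geometry in one expression
p1 <- ggplot(data = Orange, aes(x = age, y = circumference)) +
  geom_point()

# Split form: create the base object first, then add the point layer
base_plot <- ggplot(data = Orange, aes(x = age, y = circumference))
p2 <- base_plot + geom_point()

print(p2)
```

The base object alone contains no layer, so printing base_plot would show only empty axes; the plot becomes visible once a geom layer is added.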

In order to demonstrate the working principle of the different
layers, we improve the visual appearance of the above plot by
setting a variety of different options.
Figure 8.4 Examples for point plots. Left: Base functionality without
setting options. Right: Modified point size.

The corresponding result is shown in →Figure 8.4 (Right). For this
plot we involved three different layers, namely
geoms: controls the geometrical objects
scales: controls the mapping between data and aesthetics
themes: controls nondata components of the plot
by setting options for the following functions:
geom_point()
scale_x_continuous()
theme()
The meaning of the individual options is rather intuitive once one
knows which options are available. This information can be acquired
from the manual of ggplot(), which is quite extensive.
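The three layers named above can be sketched as follows. The specific option values (point size, colors, axis breaks, font size) are illustrative choices, not the settings used for Figure 8.4.

```r
library(ggplot2)

p <- ggplot(Orange, aes(x = age, y = circumference)) +
  # geoms: geometrical objects, here enlarged colored points
  geom_point(size = 3, color = "darkgreen") +
  # scales: mapping between data and aesthetics, here the x-axis
  scale_x_continuous(name = "Age (days)",
                     breaks = seq(0, 1600, by = 400)) +
  # themes: nondata components such as fonts and grid lines
  theme(axis.title = element_text(size = 14),
        panel.grid.minor = element_blank())

print(p)
```

Each call joined by + contributes one layer, so options can be changed or removed independently of the others.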

8.3.2 Multiple data sets


Beyond the simple usage of ggplot() demonstrated above, the
combination of many options within different layers quickly becomes
involved. In →Figure 8.5, we show two additional examples
that highlight the presence of multiple data sets.

By adding the option color to the function aes() in ggplot(),
different colors will be assigned to different factors, as given by
Orange$Tree, and a figure legend will be automatically generated.
Specifying the types of the shape for the data points will, in addition,
assign different point shapes corresponding to the different trees.
Figure 8.5 Examples of multiple data sets plotted using ggplot().

In →Figure 8.5 (Right), we provide an example using geom_line()
instead of geom_point(). This displays the connected points in the
form of different lines for the different trees.

Furthermore, we rearranged the 5 trees in the legend in their
numerical order. This is a bit tricky, because there is no option
available that would allow doing this directly. Instead, it is necessary
to provide this information for the different factors, by generating a
new factor Tree2 that contains this information.
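A minimal sketch of the color and shape mappings and of the reordered legend. The way Tree2 is constructed here is one possible approach and is an assumption; the book's own listing may differ.

```r
library(ggplot2)

orange <- as.data.frame(Orange)

# Tree is an ordered factor whose levels are not in numerical order.
# Tree2 redeclares the levels as 1, 2, ..., 5 so the legend is sorted.
orange$Tree2 <- factor(as.character(orange$Tree),
                       levels = as.character(1:5))

# Different colors and point shapes for the different trees
p_points <- ggplot(orange, aes(age, circumference,
                               color = Tree2, shape = Tree2)) +
  geom_point()

# Connected points: one line per tree
p_lines <- ggplot(orange, aes(age, circumference, color = Tree2)) +
  geom_line()

print(p_lines)
```

Comparing levels(orange$Tree) with levels(orange$Tree2) shows the difference in ordering that the legend inherits.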
Before we continue, we would like to comment on the logic
behind ggplot() used to add either multiple points or lines to a plot.
In contrast to the basic plotting function plot() discussed in Section
→7.1.1, which adds multiple data sets successively by, e. g., using
the lines() command, ggplot() can accomplish this by setting an
option (shape). However, this requires the data frame to contain
information about this in the form of an indicator variable (in our
case “Tree”). Hence, the simplification in the commands for
multiple lines needs to be compensated by a more complex data
frame. This can, in fact, be nontrivial.
The good news is that it is possible to use ggplot() in the same
logical way as the basic plotting function. An example for this is
shown in Listing 8.11.

For the example shown in Listing 8.11, there is certainly no
advantage in using ggplot() in this way, because a data frame with
the required information exists already. However, if one has two
separate pairs of data in the form $D_i = \{(x_i, y_i)\}$ available, the
advantage becomes apparent.
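The successive way of adding data sets can be sketched as follows for two separate data sets. The data frames d1 and d2 are synthetic and purely illustrative; this is not a reproduction of Listing 8.11.

```r
library(ggplot2)

# Two independent data sets D1 and D2 (synthetic, for illustration)
d1 <- data.frame(x = 1:10, y = (1:10)^2)
d2 <- data.frame(x = 1:10, y = 10 * (1:10))

# Start from an empty plot and add one layer per data set,
# analogous to plot() followed by lines() in base graphics
p <- ggplot() +
  geom_line(data = d1, aes(x = x, y = y), color = "blue") +
  geom_line(data = d2, aes(x = x, y = y),
            color = "red", linetype = "dashed")

print(p)
```

Because every layer carries its own data argument, no indicator variable and no merged data frame are needed; the price is that no legend is generated automatically.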

8.3.3 geoms()
There is a total of 37 different geom_*() functions available to specify
the geometry of plotted data. So far, we used only geom_point() and
geom_line(). In the following, we will discuss some additional
functions.
In →Table 8.3, we list geom_*() functions with their counterparts in
the R base package.

Table 8.3 Functions associated with ggplot() and geom_*() and their
corresponding counterparts in the R base package.
ggplot function               Base plot function
geom_point()                  points()
geom_line()                   lines()
geom_curve()                  curve()
geom_hline()                  abline(h = y0)
geom_vline()                  abline(v = x0)
geom_rug()                    rug()
geom_text()                   text()
geom_smooth(method = "lm")    abline(lm(y ~ x))
geom_density()                lines(density(x))
geom_smooth()                 lines(loess.smooth(x, y))
geom_boxplot()                boxplot()

Additional geom functions include:


geom_bar()
geom_dotplot()
geom_errorbar()
geom_jitter()
geom_raster()
geom_step()
geom_tile()
For adding straight lines to a plot, we can use the functions
geom_abline() and geom_vline(), see →Figure 8.6 (Left). Because a
straight line is fully specified by an intercept and a slope, these two
options need to be set for geom_abline(). If we use a zero slope, we
obtain a horizontal line. For adding vertical lines, the function
geom_vline() can be used, specifying the option xintercept. In
addition, both functions allow setting a variety of additional options
to change the visual appearance of the lines. For example, valid
linetype values include solid, dashed, dotted, dotdash, longdash, and
twodash.
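A minimal sketch of these two functions on the Orange data; the intercepts, slopes, colors, and line types are illustrative values, not those used for Figure 8.6.

```r
library(ggplot2)

p <- ggplot(Orange, aes(age, circumference)) +
  geom_point() +
  # sloped line, specified by intercept and slope
  geom_abline(intercept = 20, slope = 0.1, linetype = "dashed") +
  # zero slope gives a horizontal line at y = 100
  geom_abline(intercept = 100, slope = 0, color = "blue") +
  # vertical line at x = 1000
  geom_vline(xintercept = 1000, linetype = "dotdash", color = "red")

print(p)
```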

Figure 8.6 Examples using geom_abline() and geom_vline() (Left) and
geom_step() (Right).

In →Figure 8.6 (Right), we show an example for the geom_step()
function. This function connects the data points by horizontal and
vertical lines, making it easier to recognize horizontal and vertical
jumps.

In →Figure 8.7, we show examples for boxplots using the
function geom_boxplot(). For these examples, we do not distinguish
the different trees, but we are rather interested in the distribution
of the circumferences of the trees at the 7 different time points of
their measurement.
In →Figure 8.7 (Top row, Right), we added the original data
points from which the boxplots are computed. In order to avoid a
potential overlap between the data points, the function geom_jitter()
can be used to introduce a slight horizontal shift to the data points.
These shifts are randomly generated and, hence, different
executions of this function lead to different visual arrangements of
the data points.
Figure 8.7 Different examples for boxplots.
Furthermore, it can be informative to add projections of the
data points next to the coordinate axes. By using the function
geom_rug(), this information can be added. Specifying the option
sides allows including these projections in the form of short dashes at
the left (l), right (r), bottom (b), or top (t) of the plot; see
→Figure 8.7 (bottom-left). Also, it is possible to color these dashes
according to the color of the boxplots; see →Figure 8.7 (bottom-right).
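The combination of boxplots, jittered data points, and rug projections can be sketched as follows. The jitter width and the choice of sides = "l" are illustrative assumptions.

```r
library(ggplot2)

# age is converted to a factor so that each of the 7 measurement
# time points gets its own boxplot
p <- ggplot(Orange, aes(x = factor(age), y = circumference)) +
  geom_boxplot(outlier.shape = NA) +        # raw points are drawn below
  geom_jitter(width = 0.15, height = 0) +   # random horizontal shift only
  geom_rug(sides = "l")                     # projections on the left axis

print(p)
```

Setting height = 0 keeps the circumference values unchanged, so only the horizontal position is randomized; calling set.seed() beforehand makes the arrangement reproducible.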
8.3.4 Smoothing
In this section, we demonstrate the application of statistical
functions for data smoothing. In →Figure 8.8 (Top), we show two
examples using the function stat_smooth(). For both figures, we
used “loess” as the smoothing method. This method fits local
regressions within a sliding window along the x-axis to obtain
smoothed values for the outcome variable, depicted by the blue line.
The purpose of a smoothing function is to provide a graphical
regression that summarizes the data points. The option se controls
whether the standard error is displayed; we disable it in the left
figure by setting se=F.
In the right figure, we add the information about the standard
error in the form of a gray band that underlies the loess curve.
Furthermore, we set the color of this band with the fill option. We
want to point out that when this option is not specified, the default
is to use a transparent background. Unfortunately, our
experience is that this can cause problems, depending on the
operating system. For this reason, setting this option explicitly is a
trick to circumvent potential problems.
Figure 8.8 Examples for smoothing functions.
Next, in →Figure 8.8 (bottom-left), we add explicit error bars to the
color band by using the geom option for stat_smooth(). Since
we have a continuous x-axis, we need to specify the number n of
error bars we want to add to the smoothed curve.
Finally, in →Figure 8.8 (bottom-right), we show an example for a
different smoothing function. In ggplot2, the available options include
lm (linear model), glm (generalized linear model), gam (generalized
additive model), loess, and rlm (robust linear model); in this
figure we use lm. A linear model means that the resulting curve will
be restricted to a straight line obtained from a least-squares fit.
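The four smoothing variants discussed in this section can be sketched as follows. The fill color and the value n = 10 are illustrative assumptions, and the explicit formula = y ~ x only states the default to avoid an informational message.

```r
library(ggplot2)

base_plot <- ggplot(Orange, aes(age, circumference)) + geom_point()

# loess curve without the standard error band
p_no_se <- base_plot +
  stat_smooth(method = "loess", formula = y ~ x, se = FALSE)

# loess curve with a band whose color is set explicitly
p_band <- base_plot +
  stat_smooth(method = "loess", formula = y ~ x, fill = "gray70")

# explicit error bars at n = 10 positions along the x-axis
p_bars <- base_plot +
  stat_smooth(method = "loess", formula = y ~ x, fill = "gray70") +
  stat_smooth(method = "loess", formula = y ~ x,
              geom = "errorbar", n = 10)

# straight line from a least-squares fit
p_lm <- base_plot + stat_smooth(method = "lm", formula = y ~ x)

print(p_lm)
```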
8.4 Summary
The purpose of this chapter was to introduce the base capabilities
offered by ggplot2 and to highlight some aesthetic extensions it
offers over the basic R plotting functions. The Grammar of Graphics
offers a very rich framework with many aspects, and it is continuously
evolving. For this reason, the best way to learn further capabilities
is by following online resources, e. g.,
→https://ggplot2.tidyverse.org/ or
→http://moderngraphics11.pbworks.com/f/ggplot2-Book09hWickham.pdf.
9 Visualization of networks

9.1 Introduction
In this chapter, we discuss two R packages, igraph and NetBioV
[→42], [→187]. Both have been specifically designed to visualize
networks. Nowadays, network visualization plays an important role
in many fields, as networks can be used to represent complex
relationships between a large number of entities. For instance, in
the life sciences, various types of biological, medical, and gene
networks, e. g., ecological networks, food networks, protein
networks, or metabolic networks serve as a mathematical
representation of ecological, molecular, and disease processes
[→9], [→71]. Furthermore, in the social sciences and economics,
networks are used to represent, e. g., acquaintance networks,
consumer networks, transportation networks, or financial networks
[→74]. Finally, in chemistry and physics, networks are used to
encode molecules, rational drugs, and complex systems [→20],
[→55].
All these fields, and many more, benefit from a sensible
visualization of networks, which enables gaining an intuitive
understanding of the meaning of structural relationships between
the entities within the network. Generally, such a visualization
precedes a quantitative analysis and informs further research
hypotheses.

9.2 igraph

In Chapter →16, we will provide a detailed introduction to networks,
their definition, and their analysis. Here, we will only restate that a
network consists of two basic elements, nodes and edges, and the
structure of a network can be defined in two ways, by means of:
an edge list or
an adjacency matrix
An edge list is a two-dimensional matrix that provides, in each row,
information about the connection of nodes. Specifically, an edge list
contains exactly two columns and its elements correspond to the
labels of the nodes. The following script provides an example:

The second command defines an edge list (el). The function
graph.edgelist() converts the matrix el into an igraph object g
representing a graph. Calling the functions V() and E(), with g as
argument, provides information about the vertices and the edges in
the graph g. This is an example of a simple graph consisting of
merely three nodes, labeled as 1, 2, and 3. This graph contains only
two edges: one between nodes 1 and 2, and one between nodes 2 and 3.
By using the function plot(), the igraph object g can be visualized.

In →Figure 9.1 (Top-left), the output of the above plot function
is shown. In order to understand the effect of the option directed in
the function graph.edgelist(), we show in →Figure 9.1 (Top-right) an
example for setting this option to TRUE.
Figure 9.1 Some examples for network visualizations with igraph.
The result is that the edges now have arrows, pointing from one
node to another. Specifically, for directed=T, the first column of the
edge list contains the nodes from which an edge
points toward the nodes contained in the second column.
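The edge-list construction described above can be sketched as follows. The sketch uses graph_from_edgelist(), the current igraph name for graph.edgelist(); the object name g_dir is illustrative.

```r
library(igraph)

# Edge list: each row is one edge, given by the labels of its two nodes
el <- matrix(c(1, 2,
               2, 3), ncol = 2, byrow = TRUE)

# Undirected graph with nodes 1, 2, 3 and edges 1--2 and 2--3
g <- graph_from_edgelist(el, directed = FALSE)
V(g)   # vertices of g
E(g)   # edges of g
plot(g)

# Directed version: arrows point from column 1 to column 2 of el
g_dir <- graph_from_edgelist(el, directed = TRUE)
plot(g_dir)
```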
An alternative definition of a graph can be obtained by an
adjacency matrix. The following script produces exactly the same
result as in →Figure 9.1 (Top-left):

By setting the option mode="directed", we obtain the graph in
→Figure 9.1 (Top-right).
Here, an adjacency matrix is a binary matrix containing only
zeros and ones. If node i is connected with node j, then the
corresponding element (i, j) is one, otherwise it is zero. The
adjacency matrix is a square matrix, meaning that the number of
rows is the same as the number of columns. The number of rows
corresponds to the number of nodes of the graph.
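A minimal sketch of the adjacency-matrix construction for the same three-node graph, using graph_from_adjacency_matrix(), the current igraph name for graph.adjacency(). The matrix names A and A_dir are illustrative; for the directed case a nonsymmetric matrix is assumed, so that only the edges 1 to 2 and 2 to 3 are created.

```r
library(igraph)

# Symmetric binary matrix: element (i, j) is 1 if i and j are connected
A <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)
g_undir <- graph_from_adjacency_matrix(A, mode = "undirected")

# Directed case: element (i, j) = 1 means an edge from i to j
A_dir <- matrix(c(0, 1, 0,
                  0, 0, 1,
                  0, 0, 0), nrow = 3, byrow = TRUE)
g_dir <- graph_from_adjacency_matrix(A_dir, mode = "directed")

plot(g_undir)
```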

9.2.1 Generation of regular and complex networks


For the examples above, we defined the structure of a graph
manually, either by defining an edge list or its adjacency matrix.
However, the igraph package also provides a large number of
functions to generate networks with certain structural properties. In
→Tables 9.1 and →9.2, we list some of these.
Table 9.1 Examples of different regular network types provided by
igraph.
Type      Syntax
star      graph.star(n, mode = c("in", "out", "mutual", "undirected"))
lattice   graph.lattice(length, dim, nei = 1, directed = F, mutual = F, circular = F)
ring      graph.ring(n, directed = F, mutual = F, circular = T)
tree      graph.tree(n, children = 2, mode = "out")

In →Figure 9.1 (Bottom), we show two such examples.
This functionality for generating such networks is very
convenient because implementing network generation algorithms
can be tedious.

Table 9.2 Examples of different complex network types provided by
igraph.
Type                       Syntax
random network             erdos.renyi.game(n, p.or.m, type = c("gnp", "gnm"), directed = F, loops = F)
scale-free network         barabasi.game(n, power = 1, m)
small-world network        watts.strogatz.game(dim, size, nei, p, loops = F, multiple = F)
geometric random network   grg.game(nodes, radius, torus = F, coords = F)
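The generators in the two tables can be sketched as follows. The sketch uses the current igraph names (make_star(), make_ring(), make_tree(), make_lattice(), sample_gnp(), sample_pa(), sample_smallworld(), sample_grg()) in place of the dotted names listed above; all sizes and parameter values are illustrative.

```r
library(igraph)
set.seed(42)

# Regular networks
g_star    <- make_star(10, mode = "undirected")
g_ring    <- make_ring(10)
g_tree    <- make_tree(15, children = 2, mode = "undirected")
g_lattice <- make_lattice(dimvector = c(4, 4))

# Complex networks
g_random <- sample_gnp(n = 100, p = 0.05)                     # random
g_sf     <- sample_pa(n = 100, power = 1, m = 2,
                      directed = FALSE)                       # scale-free
g_sw     <- sample_smallworld(dim = 1, size = 100,
                              nei = 2, p = 0.1)               # small-world
g_geo    <- sample_grg(nodes = 100, radius = 0.15)            # geometric

plot(g_tree)
```

Setting a seed makes the random generators reproducible; the regular generators are deterministic.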

9.2.2 Basic network attributes


There is a variety of options to change the attributes of vertices and
edges. In →Tables 9.3 and →9.4, we list the most important ones.
For most options, the appearance of each vertex and edge can be set
independently. This gives great flexibility with
respect to the graphical appearance of networks, allowing each
network to be adjusted individually.
Table 9.3 Basic vertex attributes that can be modified.
Option              Data structure     Description
vertex.size         numeric vector     components set the size of each vertex
vertex.label        character vector   components set the label of each vertex
vertex.label.dist   numeric vector     components set the distance of the labels from the vertex
vertex.color        character vector   components set the color of each vertex
vertex.shape        character vector   components set the shape of each vertex

→Figure 9.2 illustrates the outputs from network visualization
when modifying the vertex and edge attributes.
Table 9.4 Basic edge attributes that can be modified.
Option             Data structure              Description
edge.color         character vector            components set the color of each edge
edge.width         numeric value               same width for every edge
edge.lty           numeric vector              0 (“no line”), 1 (“solid”), 2 (“dashed”), 3 (“dotted”), 4 (“dotdash”), 5 (“longdash”), 6 (“twodash”)
edge.label         character vector            components set the label of each edge
edge.label.cex     numeric value               size of the edge labels
edge.label.color   character vector            components set the color of each edge label
edge.curved        logical or numeric vector   if logical, TRUE draws curved edges; if numeric, the values specify the curvature of the edge: zero curvature means straight edges, negative values mean the edge bends clockwise, whereas positive values mean the opposite
edge.arrow.mode    numeric vector              0 (no arrow), 1 (backward arrow), 2 (forward arrow), 3 (both)

Although, in principle, the attributes for each vertex and edge
can be set independently by specifying a numeric or character
vector, this is not necessary in the case where the attributes have to
be identical for all vertices or edges. Then, it is sufficient to provide
a scalar numeric or character value to set the option throughout the
network.
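The difference between vector-valued and scalar attributes can be sketched on a small ring graph; all attribute values below are illustrative and not those used for Figure 9.2.

```r
library(igraph)

g <- make_ring(6)

plot(g,
     vertex.size  = c(10, 15, 20, 25, 30, 35),     # one size per vertex
     vertex.label = LETTERS[1:6],                  # one label per vertex
     vertex.color = "lightblue",                   # scalar: all vertices
     vertex.shape = "circle",
     edge.color   = rep(c("black", "red"), 3),     # one color per edge
     edge.width   = 2,                             # scalar: all edges
     edge.lty     = 2,                             # dashed lines
     edge.curved  = 0.3)                           # curvature of the edges
```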
Figure 9.2 Examples demonstrating vertex and edge attributes.

9.2.3 Layout styles


It is important to realize that a network is a topological object, and
not a geometric one. That means, by defining an edge list or an
adjacency matrix of a network, its structure is defined. However,
this does not provide any information regarding the graphical
visualization of the network. Therefore, for a given adjacency
matrix, the spatial coordinates of the nodes of a network are not
part of the definition of a network, but they are part of its graphical
visualization. In order to make this important point clearer, we
provide, in →Figure 9.3, four different graphical visualizations of the
same network.
Figure 9.3 Effect of different layout functions to generate the
Cartesian coordinates of the nodes of a network. Importantly, for all
cases the same network is used.

Specifically, we generate a scale-free network with n = 500
nodes and use four different layout functions to generate the
Cartesian coordinates for the nodes of this network.
It is clear from →Figure 9.3 that, depending on the algorithm
used to generate the Cartesian coordinates, the same network
“looks” quite different. The coordinates, contained in la1 to la4, are
represented in the form of a matrix with n rows (the number of
nodes) and 2 columns (corresponding to the x and y coordinates of
a node). That means, the x- and y-coordinates of the nodes are used
to place the nodes of the network onto a 2-dimensional plane, as
shown in →Figure 9.3.
The reason why all four layout styles result in different
coordinates is that most layout styles are in fact optimization
algorithms, and each of these optimization algorithms uses a
different objective function. For example,
layout.fruchterman.reingold and layout.kamada.kawai are two force-
based algorithms proposed by Fruchterman & Reingold and
Kamada & Kawai [→82], [→106] that optimize the distance between
the nodes in a way similar to spring forces. In contrast,
layout.random chooses random positions for the x- and y-
coordinates. Hence, it is the only layout style among those
illustrated that is not based on an optimization algorithm.
An important lesson from the above examples is that for a given
network, the graphical visualization is not trivial, but requires
additional work to select a layout style that corresponds best to the
intended expectations of the user.
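The effect of the layout function can be sketched as follows, using the current igraph names layout_with_fr(), layout_with_kk(), and layout_randomly() for the functions named above. The fourth layout, layout_in_circle(), is an assumption added for illustration, as are the plotting options.

```r
library(igraph)
set.seed(1)

# Scale-free network with n = 500 nodes
g <- sample_pa(n = 500, m = 1, directed = FALSE)

la1 <- layout_with_fr(g)     # Fruchterman-Reingold (force-based)
la2 <- layout_with_kk(g)     # Kamada-Kawai (force-based)
la3 <- layout_randomly(g)    # random positions, no optimization
la4 <- layout_in_circle(g)   # assumed fourth layout, for illustration

dim(la1)                     # n rows, 2 columns (x and y coordinates)

# The same network drawn with two different coordinate matrices
plot(g, layout = la1, vertex.size = 2, vertex.label = NA)
plot(g, layout = la3, vertex.size = 2, vertex.label = NA)
```

Since the graph object g is identical in all calls, any visual difference is caused by the coordinate matrix alone.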
9.2.4 Plotting networks
There are two possible ways to plot an igraph object g representing
a graph. The first option is to use the function plot(). This option has
been used in the previous examples. The second option is to use
the function tkplot(). In contrast with the function plot(), the function
tkplot() allows the user to change the position of the vertices
interactively by providing a graphical user interface (GUI). Hence,
this can be done by means of the computer mouse.

At first glance, the function tkplot() may appear
superior because of its interactive capability. However, for large
networks, i. e., networks with more than 50 vertices, it is hardly
possible to adjust the position of each vertex manually. That
means, practically, the utility of tkplot() is rather limited because
only small networks can be adjusted. A second argument against
the usage of tkplot() is that, due to the involvement of a graphical
user interface, there may be operating-system-specific problems
caused by the usage of Tk libraries. Such Tk libraries are
freely available for all common operating systems; however, on some
systems they are not preinstalled and need to be installed
separately.

9.2.5 Analyzing and manipulating networks


In addition to the visualization of networks, the igraph package
offers a variety of functions to analyze and manipulate networks
quantitatively. For instance, one can easily find shortest paths, the
minimum spanning tree or study the modularity of a community
structure of a graph. In Chapter →16, we will discuss some of these
methods, e. g., finding shortest paths or the depth-first search, in
more detail.
9.3 NetBioV

NetBioV is another package for visualizing networks. It provides
three main layout architectures, namely global, modular, and
layered layouts [→187]. These layouts can be used either separately
or in combination with each other. The rationale behind this
functionality is that a network should be
visualized not only using one layout, but from many
perspectives. Furthermore, since many real-world networks are
generally acknowledged to have a scale-free, modular, and
hierarchical structure, these three categories of layouts enable the
highlighting of, e. g., specific biological aspects of the network.
Moreover, NetBioV includes an additional layout category, which
enables a spiral view of the network. In the spiral view, the nodes
can be placed using a force-based algorithm or using
network measures for nodes. Overall, this provides a more abstract
view on networks.
The NetBioV package has been implemented in the R
programming environment and is based on the igraph library. For
this reason, it can be seen as complementing igraph by providing
more advanced visualization capabilities.
A list of its main functions used for different layout architectures
available in NetBioV is shown in →Table 9.5. In the subsequent
sections, we provide a brief description of the different types of
layouts offered by NetBioV.
Table 9.5 An overview of different layouts provided by NetBioV.
Layout categories   Functions in R
Global layout       mst.plot, mst.plot.mod
Modular layout      plot.modules, plot.abstract.modules, plot.abstract.nodes, splitg.mst
Layered layout      level.plot
Spiral layout       plot.spiral.graph

9.3.1 Global network layout


Real-world networks, e. g., biological or social networks, are usually
not planar. That means they have edges crossing each other if the
graph is displayed in a two-dimensional plane. However, for a more
effective visualization, the crossing of edges should be minimized.
The global network layouts of NetBioV aim to minimize such
crossings.
The most important features that can be highlighted via a
global layout include the backbone of the network, the spread of
information within the network, and the properties of the nodes,
e. g., using various network measures. For instance, for highlighting
the backbone structure of a network, NetBioV applies the following
strategy. In the first step, we define the backbone of a network. For
this, we use the minimum spanning tree (MST) algorithm to extract
a subnetwork from a given network. In the second step, we obtain
the coordinates for the nodes by applying a force-based algorithm
to the subnetwork consisting of the MST. In the third step, we
assign a unique color to the MST edges, whereas the remaining
edges are colored according to the distance between the nodes.
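The three steps above can be sketched with igraph alone. This is not NetBioV's implementation: the example network, the helper edge attribute id, and the use of a single gray color for all non-MST edges (instead of a distance-based color range) are simplifying assumptions.

```r
library(igraph)
set.seed(1)

g <- make_lattice(dimvector = c(4, 4))   # small connected example network
E(g)$id <- seq_len(ecount(g))            # tag edges to match them later

# Step 1: backbone = minimum spanning tree of the network
backbone <- mst(g)

# Step 2: node coordinates from a force-based layout of the backbone
coords <- layout_with_fr(backbone)

# Step 3: one color for MST edges, another for the remaining edges
in_mst <- E(g)$id %in% E(backbone)$id
plot(g, layout = coords,
     edge.color   = ifelse(in_mst, "green", "gray80"),
     vertex.size  = 8,
     vertex.label = NA)
```

Because the coordinates are computed from the tree only, the backbone is drawn with few crossings and the remaining edges are overlaid on it.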

9.3.2 Modular network layout


Most networks have a modular structure, that means there are
groups of nodes that are more strongly connected with each other
than with the rest of the nodes. Depending on the origin of the
network, such modules serve a different purpose. For instance, in
biology, these modules can be thought of as performing a specific
biological function for the organism.
The modular network layouts in NetBioV allow highlighting the
individual modules by using standard graph-layout algorithms. The
principal approach works as follows. In the first step, we
determine the relative coordinates of the nodes within each
module. In the second step, we optimize the coordinates for each
module using standard layout algorithms, and then we place the
modules according to these positions. In general, modules can be
identified with module detection algorithms. However, in specific
application areas, other approaches are also possible. For instance,
in biology, modules can be defined by gene sets obtained from
biological databases, such as gene ontology [→6] or KEGG [→107].
As an additional feature, the nodes in a module can be colored
based on the rank assigned to the nodes, e. g., ranks assigned
according to their node degree. Alternatively, for biological
networks one could utilize, e. g., gene expression values.

9.3.3 Layered network (multiroot) layout


In order to emphasize the hierarchy in a network, NetBioV provides a
layered network layout. This algorithm organizes the nodes by
hierarchy levels that are directly obtained from the distance
between nodes. The layered network layout function assumes an
initial subset, N, of nodes in a graph G. That means the resulting
hierarchical graph does not need to have a unique root node but
can have multiple roots. Starting from this initial set, the distances
to all other nodes are determined and the nodes are plotted on
their corresponding hierarchy level.
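The idea of deriving hierarchy levels from distances to an initial node set can be sketched with igraph. This is an illustration of the principle under stated assumptions, not NetBioV's level.plot: the example graph, the initial set roots, and the simple horizontal placement within each level are all hypothetical choices.

```r
library(igraph)

g <- make_tree(15, children = 2, mode = "undirected")
roots <- c(1, 2)                 # initial node subset N (illustrative)

# Level of a node = shortest-path distance to the nearest initial node
d <- distances(g, v = roots)     # one row per initial node
level <- apply(d, 2, min)

# Coordinates: y is the negative level, x centers nodes within a level
x <- ave(as.numeric(seq_along(level)), level,
         FUN = function(i) seq_along(i) - mean(seq_along(i)))
coords <- cbind(x, -level)

plot(g, layout = coords, vertex.label = NA, vertex.size = 8)
```

Both initial nodes are placed on level 0, so the resulting drawing has two roots.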
9.3.4 Further features

9.3.4.1 Information flow

For visualizing the spread of information within a network, NetBioV
provides an algorithm which highlights either the shortest paths
between modules or the nodes in the modular and layered network
layouts. Highlighting such information is useful for visualizing key
connections between nodes or modules that may play an important
role in exchanging information. More specific interpretations
depend on the nature of the underlying network.

9.3.4.2 Spiral view

The spiral layout included in the NetBioV package provides the user
with some options to visualize networks in different spiral forms.
The aesthetics of the spirals can be influenced by setting a tuning
parameter for the angle of the spiral. In addition, a wide range of
color options is provided as an input to highlight, e. g., the degrees
of nodes. Furthermore, the placement of nodes can be
determined either by standard layout functions or by a user-defined
function.

9.3.4.3 Color schemes, node labeling

NetBioV provides many options to color edges, vertices, and
modules either based on different properties of the network, or
based on user input. For the global network layouts, the edges
corresponding to the backbone of the network (MST) are shown in
one color, and the remaining edges are colored according to a
range of colors reflecting the distance between nodes. The vertices
or nodes of the network can be highlighted using a range of colors
and sizes. For instance, the expression values of nodes representing
genes or proteins can be shown with shades of colors from high- to
low-expression values or vice versa. One can also assign ranks to
the nodes based on network-related measures, which are visualized
by the size of the nodes.
Also for the modular graph layout functions a variety of color
options are available. The default color scheme for modules is a
heat map of colors, where nodes with a high degree are assigned
dark colors, whereas low-degree nodes are represented by light
colors in a module. The nodes in a graph can also be colored
individually in two ways. The first coloring option is based on the
global rank in the network, whereas the second coloring option is
based on local ranks in the modules. The ranks are determined by
the different properties of nodes, such as the degree or expression
value observed from, e. g., experimental data. The global rank
describes the rank of an individual node with respect to all other
nodes in the network, whereas the local rank of a node in a module
is obtained with respect to the nodes from the same module. Edges
for different modules can be colored differently so that the
connectivity of individual modules can be highlighted. Additionally,
the node-size can be used to highlight the rank of the nodes in the
network. Moreover, an individual graph layout can be defined for
each module by passing a parameter vector as an argument to a
modular layout function.
For the layered network layout, the color scheme is defined as
follows. For a directed network, the levels of the network are
divided into three sections, namely the lower, the initial, and the
upper section. Importantly, only the initial section and the upper
section are used for undirected networks. A user can assign
different colors to different levels. For a directed network, if edges
connect nodes with a level difference greater than one, then edges
are colored using two colors for two opposite directions (up and
down). If edges connect nodes on the same level, then the edges
are shown in a curved shape and in a unique color.
9.3.4.4 Interface to R and customization

The availability of NetBioV in R enables it to make use of various
additional packages to enhance a visualization. For instance,
various biological packages related to gene ontology (GO, TopGO)
can be utilized to include information about the enrichment of
biological pathways. Such information is particularly useful for the
visualization of modules.
Furthermore, information obtained about genes, proteins, and
their interactions, as well as network measures from many external
R libraries, e. g., available in CRAN and Bioconductor, can be used as
a part of the visualization of a network.

9.3.5 Examples: Visualization of networks using NetBioV


In this section, we demonstrate the capabilities of NetBioV by
visualizing various networks with different layout and plotting
options. The applications of the NetBioV functions are provided for
some example networks. Some details about the investigated
networks are shown in →Table 9.6.

Table 9.6 Examples of networks available in the NetBioV package.
Networks                      Number of vertices   Number of edges
Artificial network            5000                 23878
B-Cell lymphoma network       2498                 2654
PPI (Arabidopsis thaliana)    1212                 2574
Figure 9.4 Global network layouts using different options available
in NetBioV. Left: Coloring vertices of the B-cell lymphoma network
based on external information, such as expression value (red to
blue corresponds to lower to higher expression values). Right: Edges
of the MST are shown in "green" and the remaining edges in "blue".
Figure 9.5 Modular layouts using different options available in
NetBioV. Left: Abstract modular view of A. thaliana; each module is
labeled with the most significantly enriched GO pathway. Edge width
is proportional to the number of connections between modules.
Right: Information flow in the A. thaliana network, highlighting
shortest paths between nodes of modules 1, 5, 17, and 21.
Figure 9.6 Layered network layouts. Left: The B-cell lymphoma
network is shown. Right: The protein-protein interaction (PPI)
network of Arabidopsis thaliana is shown.
9.4 Summary
Networks from biology, chemistry, economics, or the social sciences
can be seen as a data type. For the visualization of such networks,
we provided in this chapter an introduction to igraph and NetBioV.
Overall, igraph provides many helpful base commands for the
generation, manipulation, and visualization of graphs, whereas
NetBioV focuses on high-level visualizations from a global, modular,
and layered perspective.
In contrast to conventional data types from measurements, e.
g., from sensors that provide direct numerical data, network data
are considerably different. For this reason, dedicated plots for their
visualization have been developed that allow one to gain a more
intuitive understanding of the meaning of the provided networks.
Part III Mathematical basics of data science
10 Mathematics as a language for science

10.1 Introduction
In data science, all problems will be approached computationally. For this reason, we started this
book with an introduction to the programming language R. The next step consists in the
understanding of mathematical methods needed for the data analysis models, because all
analysis models are based on mathematics and statistics. However, before we present in the
subsequent chapters the mathematical basis of data science, we want to emphasize in this
chapter a more general point concerning the mathematical language itself. This point refers to
the abstract nature of data science.
In →Figure 10.1, we show a very general visualization that holds for every data analysis problem.
The key point here is that every data analysis is conducted via a computer program that
represents methodological ideas from statistics and machine learning, and every computer
program consists of instructions (commands) that enable the communication with the processor
of a computer to perform computations electronically. Because every data analysis is
conducted via a computer program that contains instructions in a programming language, a good
data scientist needs to “speak” a programming language fluently. However, the base of any
programming language for data analysis is mathematics, and its key characteristic is
abstractness. For this reason, a simplified message from the above discussion can be summarized
as follows:

Thinking in abstract mathematical terms makes you a better programmer and, hence, a better data scientist.

This is also the reason why mathematics is sometimes called the language of science [→185],
[→188] (as already proclaimed by Galileo).
Before we proceed, we would like to add a few notes for clarification. First, by a programmer
we actually mean a scientific programmer who is concerned with the conversion of statistical and
machine learning ideas into a computer program rather than a general programmer who
implements graphical user interfaces (GUIs) or web sites. The crucial difference is that the level of
mathematics needed for, e. g., the implementation of a GUI is minimal compared to the
implementation of a data analysis method. Also, such a way of programming is usually purely
deterministic and not probabilistic. However, the nature of a data analysis is to deal with
measurement errors and other imperfections of the data. Hence, probabilistic and statistical
methods cannot be avoided in data science but are integral pillars.
Figure 10.1 Generic visualization of any data analysis problem. Data analysis is conducted via a
computer program that has been written based on statistical- and machine-learning methods
informed with domain-specific knowledge, e. g., from biology, medicine, or the social sciences.

Second, it is certainly not necessary to implement every method for conducting a data analysis;
however, a good data scientist could implement every method. Third, the natural language we are
speaking, e. g., English, does not translate equally well into a computer language like R, but there
are certain terms and structures that translate better. For instance, when we speak about a
“vector” and its components, we will not have a problem capturing this in R, because in Chapter
→5 we have seen how to define a vector. Furthermore, in Chapter →12, we will learn much more
about vectors in the context of linear algebra. This is not a coincidence; rather, the meaning of a
vector is informed by its mathematical concept. Hence, whenever we use this term in our natural
language, we have an immediate correspondence to its mathematical concept. This implies that
the more we know about mathematics, the more we become familiar with terms that are well
defined mathematically, and such terms can be swiftly translated into a computer program for
data analysis.
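As a small illustration of how directly the term “vector” translates into R, consider the following sketch. The name v and the numerical values are our own choices and serve only as an example.

```r
# Illustrative values; the names v and w are our own choice.
v <- c(3, -1.5, 2)        # a vector with three components
v[2]                      # second component: -1.5
length(v)                 # number of components: 3
w <- 2 * v + c(1, 1, 1)   # scalar multiplication and component-wise addition
w                         # 7 -2 5
```

Each spoken statement about the vector (its components, its length, a linear combination) corresponds to exactly one short R expression.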
We would like to finish by adding one more example that demonstrates the importance of
“language” and its influence on the way humans think. Suppose, you have a twin sibling and you
both are separated right after birth. You grow up in the way you did, and your twin grows up on a
deserted island without civilization. Then, let us say after 20 years, you both are independently
asked a series of questions and given tasks to solve. Given that you both share the same DNA, one
would expect that both of you have the same potential in answering these questions. However,
practically it is unlikely that your twin will perform well, because of basic communication problems
in the first place. In our opinion the language of “mathematics” plays a similar role with respect to
“questions” and “tasks” from a data analysis perspective.
In the remainder of this chapter, we provide a discussion of some basic abstract mathematical
symbols and operations we consider very important to (A) help formulating concise mathematical
statements, and (B) shape the way of thinking.

10.2 Numbers and number operations


In mathematics, we distinguish five main number systems from each other:
natural numbers: N
integers: Z
rational numbers: Q
real numbers: R
complex numbers: C
Each of the above symbols represents the set of all numbers that belong to the corresponding
number system. For instance, N represents all natural numbers, i. e., 1, 2, 3, … ; Z represents all
integers, i. e., … , −2, −1, 0, +1, +2, … ; Q represents all rational numbers $\frac{a}{b}$ with a
and b being integers and b ≠ 0; and R is the set of all real numbers, e. g., 1.4271.
There is a natural connection between these number systems in the way that
N ⊂ Z ⊂ Q ⊂ R ⊂ C. (10.1)

That means, e. g., that every integer number is also a real number, but not every integer number
is a natural number. Furthermore, the special sets $Z^+$ and $R^+$ denote the sets of all positive
integers and positive reals, respectively.

Intervals
When defining functions, it is common to limit the value of numbers to specific intervals. One
distinguishes finite from infinite intervals. Specifically, finite intervals can be defined in four
different ways:
[a, b] = {x ∣ a ≤ x ≤ b} closed interval, (10.2)

[a, b) = {x ∣ a ≤ x < b} half-closed interval, (10.3)

(a, b] = {x ∣ a < x ≤ b} half-closed interval, (10.4)

(a, b) = {x ∣ a < x < b} open interval. (10.5)

Similarly, for infinite intervals one defines the following:


[a, ∞)= {x ∣ a ≤ x < ∞}, (10.6)

(a, ∞)= {x ∣ a < x < ∞}, (10.7)

(−∞, b]= {x ∣ −∞ < x ≤ b}, (10.8)

(−∞, b)= {x ∣ −∞ < x < b}, (10.9)

(−∞, ∞)= R. (10.10)


The difference between a closed interval and an open interval is that for a closed interval the
end point(s) belong to the interval, whereas this is not the case for an open interval.
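
The interval definitions above can be turned into a membership test. The following is a minimal sketch; the function in_interval and its arguments closed_left and closed_right are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: test whether x lies in an interval with end points a and b.
# closed_left / closed_right decide whether the end points belong to the interval.
in_interval <- function(x, a, b, closed_left = TRUE, closed_right = TRUE) {
  left  <- if (closed_left)  x >= a else x > a
  right <- if (closed_right) x <= b else x < b
  left & right
}

x <- c(0, 0.5, 1)
in_interval(x, 0, 1)                                              # [0, 1]
in_interval(x, 0, 1, closed_left = FALSE, closed_right = FALSE)   # (0, 1)
in_interval(x, 0, Inf, closed_right = FALSE)                      # [0, Inf)
```

For the closed interval all three values are members, whereas for the open interval the end points 0 and 1 are excluded.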

Modulo operation
The modulo operation gives the remainder of a division of two positive numbers a and b. It is
defined for $a \in R^+$ and $b \in R^+ \setminus \{0\}$ by
a mod b = modulo(a, b) = a − n ⋅ b = r. (10.11)

Here, n is the largest nonnegative integer with n ⋅ b ≤ a, and r, with 0 ≤ r < b, is the remainder
of the division of a by b. For
programming, the modulo operation is frequently used for integer numbers a and b, because a
cyclic mapping can be easily realized, i. e., N + 1 → 1 can be obtained by
modulo(N + 1, N ). (10.12)

In R, the modulo operation is obtained with the operator %%.
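
A minimal sketch of the modulo operator in R, applied to the numbers of the example that follows and to the cyclic mapping of equation (10.12). The value N <- 12 is an arbitrary choice of ours.

```r
17 %% 4         # remainder of 17 divided by 4: 1
3 %% 7          # 3
17 %/% 4        # integer quotient: 4
N <- 12         # arbitrary illustrative value
(N + 1) %% N    # cyclic mapping N + 1 -> 1
5.5 %% 2        # the operator is also defined for non-integers: 1.5
```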

Example 10.2.1.

We calculate 17 mod 4 = modulo(17, 4) and 3 mod 7 = modulo(3, 7). In these examples,
a and b are integers. Therefore, we use the fact that when we determine $\frac{a}{b}$, we always find
$q, r \in Z$ with $0 \le r < b$ such that $a = b \cdot q + r$, see [→199].

We start with 17 mod 4 and see that 17 = q ⋅ 4 + r. Hence, q = 4, and r = 1. Thus,
17 mod 4 = 1. If we consider 3 mod 7, we find 3 = q ⋅ 7 + r. Thus, q = 0, and r = 3. This
yields 3 mod 7 = 3.

Rounding operations
The floor and ceiling operations round a real number down or up, respectively, to the nearest
integer value. The corresponding functions are denoted by
⌊x⌋ floor function, (10.13)

⌈x⌉ ceiling function. (10.14)

As an example, the value of x = 1.9 results in ⌊x⌋ = 1 and ⌈x⌉ = 2.

In contrast, the command round(x) rounds the value of the real number x to its nearest
integer value. For instance, round(0.51) = 1. In R, a value that lies exactly halfway between two
integers is rounded to the even integer, e. g., round(0.5) = 0 and round(−3.5) = −4.
Finally, the truncation function, trunc(x), of a real number x is just the integer part of the
number x without the digits after the decimal point. For instance, trunc(3.91) = 3.
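
The rounding operations discussed above are available as base R functions. The following sketch uses the values from the text plus a few additional ones of our own choosing.

```r
x <- 1.9
floor(x)                        # 1
ceiling(x)                      # 2
round(0.51)                     # 1
round(c(0.5, 1.5, 2.5, -3.5))   # 0 2 2 -4 (halfway cases go to the even integer)
trunc(c(3.91, -3.91))           # 3 -3 (digits after the decimal point are dropped)
```

Note that trunc and floor agree for positive numbers but differ for negative ones, e. g., floor(-3.91) gives -4.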
Sign function
For any real number x ∈ R, the sign function, sign(x), gives
$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x > 0; \\ 0 & \text{if } x = 0; \\ -1 & \text{if } x < 0. \end{cases}$ (10.15)

Absolute value
The absolute value of a real number x ∈ R is
$\mathrm{abs}(x) = |x| = \begin{cases} +x & \text{if } x \ge 0; \\ -x & \text{if } x < 0. \end{cases}$ (10.16)
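
Both functions exist under the same names in R and operate component-wise on vectors. A short sketch with values of our own choosing:

```r
x <- c(-2.5, 0, 7)
sign(x)                      # -1 0 1
abs(x)                       # 2.5 0.0 7.0
all(abs(x) == sign(x) * x)   # TRUE: |x| = sign(x) * x
```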

10.3 Sets and set operations


In the following, we introduce sets and some of their basic operations. In general, a set is a well-
defined collection of objects. For instance, A = {1, 2, 3} contains the three natural numbers 1, 2,
and 3, B = {△, ∘} is a set consisting of two geometric objects,
C = {K, Q, N, B, p} (10.17)

is the set containing chess pieces, and D = {A, C} is a set of sets. From these examples, one
can see that an object is something very generic, and a set is just a container for objects. Usually,
the objects of a set are enclosed by the curly brackets “{” and “}”.

The symbol ∈ denotes the membership relation to indicate that an object is contained in a set.
For instance, 2 ∈ A, and ∘ ∈ B. Here, the objects 2 and ∘ are also called elements of their
corresponding sets. The symbol ∈ is a relation, because it establishes a connection between an
object and a set and, hence, relates both with each other.
If we have two sets, $A_1$ and $A_2$, and every element in $A_1$ is also contained in $A_2$, we write
$A_1 \subseteq A_2$. In this case, $A_1$ is called a subset of $A_2$. If, in addition, there are no further
elements in $A_2$, we write $A_1 = A_2$, because both sets contain the same elements. In contrast, if
there is at least one element in $A_2$ that is not contained in $A_1$, we write $A_1 \subset A_2$. In this
case, $A_1$ is a proper subset of $A_2$.


A special set is the empty set, denoted by ∅, which does not contain any element. |A| is the
cardinality of A, i. e., the number of its elements. It is possible that a set contains a finite or infinite
number of elements. For instance, for the above set B, we have |B| = 2, and for the set of
natural numbers |N| = ∞.
The set $A_1 \cup A_2 = \{x : x \in A_1 \vee x \in A_2\}$ is called the union of $A_1$ and $A_2$. The set
$A_1 \cap A_2 = \{x : x \in A_1 \wedge x \in A_2\}$ is called the intersection (or cut set) of $A_1$ and $A_2$.
For the definition of these sets, we used the colon symbol “:” within the curly brackets. This symbol
means “with the property”. Hence, the set $\{x : x \in A_1 \vee x \in A_2\}$ can be read explicitly as:
every x that is an element of $A_1$ or an element of $A_2$ is a member of the set $A_1 \cup A_2$.
Alternatively, sometimes the symbol “|” is used instead of “:”.


In →Figure 10.2, we show a visualization of the union and the cut set. It is important to realize
that both operations create new sets, i. e., $B = A_1 \cup A_2$ and $C = A_1 \cap A_2$ are two new sets. If
$A_1 \cap A_2 = \emptyset$, then $A_1$ and $A_2$ are called disjoint sets.

Figure 10.2 Visualization of set operations. Left: The union of two sets. Right: The cut set of $A_1$
and $A_2$.
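
In R, finite sets can be represented by vectors, and the operations discussed above are available as base functions. The following is a minimal sketch with example sets of our own choosing.

```r
A1 <- c(1, 2, 3)
A2 <- c(3, 4)

2 %in% A1                             # membership relation: TRUE
union(A1, A2)                         # 1 2 3 4
intersect(A1, A2)                     # 3
length(A1)                            # cardinality: 3
all(c(1, 2) %in% A1)                  # {1, 2} is a subset of A1: TRUE
setequal(c(3, 2, 1), A1)              # equality of sets ignores the order: TRUE
length(intersect(A1, c(7, 8))) == 0   # disjoint sets: TRUE
```

An empty result of intersect corresponds to the empty set, which is how disjointness is tested in the last line.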

An alphabet Σ is a finite set of atomic symbols, e. g., Σ = {a, b, c}. That means, Σ contains all
elements for a given setting. No other elements can exist.
$\Sigma^\star$ is the set of all words over Σ. For example, if Σ = {b}, then $\Sigma^\star = \{\epsilon, b, bb, bbb, bbbb, \ldots\}$.
Here ϵ is the empty word.


There are three quantifiers from predicate logic that allow a concise description of properties
of elements of sets.

Definition 10.3.1.

The expression ∀ means for all. For example, if $A = \{a_1, a_2, a_3\}$, then by ∀ x ∈ A, we mean all
elements in set A, i. e., $a_1, a_2, a_3$.

Definition 10.3.2.

The expression ∃ means there exists. For example, if B = {−1, 2, 3}, then by ∃ x ∈ B : x < 3, we
mean that in set B there exists an element that is less than 3. Possibly, there is more than one
such element, as is the case for B.

Definition 10.3.3.

The expression ∃! means there exists exactly one. For example, ∃! x ∈ B : x < 2 means that in the
set B there exists exactly one element that is less than 2.
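
For finite sets, the three quantifiers have simple counterparts in R. The following sketch uses the set B from the definitions above; the comparison with 5 in the first line is an additional example of ours.

```r
B <- c(-1, 2, 3)

all(B < 5)        # "for all x in B: x < 5" is TRUE
any(B < 3)        # "there exists x in B: x < 3" is TRUE
sum(B < 3)        # number of such elements: 2
sum(B < 2) == 1   # "there exists exactly one x in B: x < 2" is TRUE
```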
10.4 Boolean logic
The operators ∧ and ∨ are the logical and and or, respectively. They form logical operators to
combine logical variables v, q ∈ {1, 0}. Sometimes the logical variables are expressed as
{true, false}. By using operations from the set

O := {¬, ∧, ∨, ()}, (10.18)

of logical operators, we can easily construct logical formulas. For instance, the formulas
v ∨ q, (v ∨ q), (v ∨ q) ∧ ¬(v ∨ q) (10.19)

represent valid logical formulas as they are derived by using the operators in (→10.18). However,
according to this definition, the formulas
(v ∨ q)qq, (v q) (10.20)

are not valid (their meaning is undefined).


Suppose that $S_1, S_2, S_3$ are logical expressions (statements) derived by using the elements of
the set of operators O, similar to the ones given in equation (→10.19). The following statements
about logical formulas hold:
Theorem 10.4.1 (Commutative laws [→98]).
S1 ∧ S2 ⟺ S2 ∧ S1 (10.21)

S1 ∨ S2 ⟺ S2 ∨ S1 (10.22)

Theorem 10.4.2 (Associative laws [→98]).


(S1 ∧ S2 ) ∧ S3 ⟺ S1 ∧ (S2 ∧ S3 ) (10.23)

(S1 ∨ S2 ) ∨ S3 ⟺ S1 ∨ (S2 ∨ S3 ) (10.24)

Theorem 10.4.3 (Distributive laws [→98]).


S1 ∨ (S2 ∧ S3 ) ⟺ (S1 ∨ S2 ) ∧ (S1 ∨ S3 ) (10.25)

S1 ∧ (S2 ∨ S3 ) ⟺ (S1 ∧ S2 ) ∨ (S1 ∧ S3 ) (10.26)

Theorem 10.4.4 (Rules of de Morgan [→98]).


¬(S1 ∨ S2 ) ⟺ ¬S1 ∧ ¬S2 (10.27)

¬(S1 ∧ S2 ) ⟺ ¬S1 ∨ ¬S2 (10.28)

Theorem →10.4.1 says that the logical arguments can be switched for the logical operators
and and or. Theorem →10.4.2 says that we may successively shift the brackets to the right.
Similar to expanding expressions over the reals, for instance $x(x + 1) = x^2 + x$, Theorem
→10.4.3 gives a rule for expanding logical expressions.


The rules of de Morgan given by Theorem →10.4.4 state that a negation applied to the single
expressions flips the logical operator. Note that these rules can be formulated for sets
accordingly.
Theorem 10.4.5 (Rules of de Morgan for sets [→100]).
$\overline{A \cup B} = \overline{A} \cap \overline{B}$ (10.29)

$\overline{A \cap B} = \overline{A} \cup \overline{B}$ (10.30)
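
Since logical variables take only two values, the rules of de Morgan can be checked exhaustively. The following sketch does this in R for equations (10.27) and (10.28), and then checks the set versions (10.29) and (10.30) for one example. The universe U, the sets A and B, and the helper complement are our own illustrative choices.

```r
# All combinations of truth values for two logical statements.
tt <- expand.grid(S1 = c(TRUE, FALSE), S2 = c(TRUE, FALSE))
S1 <- tt$S1
S2 <- tt$S2
identical(!(S1 | S2), !S1 & !S2)   # equation (10.27): TRUE
identical(!(S1 & S2), !S1 | !S2)   # equation (10.28): TRUE

# Set version; the complement is taken with respect to an assumed universe U.
U <- 1:10
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)
complement <- function(X) setdiff(U, X)
setequal(complement(union(A, B)), intersect(complement(A), complement(B)))   # (10.29)
setequal(complement(intersect(A, B)), union(complement(A), complement(B)))   # (10.30)
```

The logical check covers all four cases and is therefore complete, whereas the set check only confirms the rules for the chosen example.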

By applying the rules above, logical expressions can be transformed into standardized forms,
called normal forms. Important examples thereof are the disjunctive normal form and the
conjunctive normal form of logical expressions, see [→98].

Definition 10.4.1 (Disjunctive normal form (DNF) [→98]).


A logical expression S is given in disjunctive normal form if
$S = S_1 \vee S_2 \vee \cdots \vee S_k,$ (10.31)

where
$S_i = S_{i1} \wedge S_{i2} \wedge \cdots \wedge S_{i k_i}.$ (10.32)

The terms $S_{ij}$ are literals, i. e., logical variables or the negation thereof.

Two examples for logical formulas given in disjunctive normal form are
(v ∧ q) ∨ (¬v ∧ ¬q) (10.33)

or
v ∨ (v ∧ q). (10.34)

Here we denote the literals by using the notations v and q for logical variables.

Definition 10.4.2 (Conjunctive normal form (CNF) [→98]).

A logical expression S is given in conjunctive normal form if
$S = S_1 \wedge S_2 \wedge \cdots \wedge S_k,$ (10.35)

where
$S_i = S_{i1} \vee S_{i2} \vee \cdots \vee S_{i k_i}.$ (10.36)

The terms $S_{ij}$ are literals.

Examples for logical formulas given in conjunctive normal form are


(v ∨ q) ∧ (¬v ∨ ¬q) (10.37)

or
v ∧ (v ∨ q). (10.38)

In practice, the application of Boolean functions [→98] has been important to develop
electronic chips for computers, mobile phones, etc. A logic gate [→98] represents an electronic
component that realizes (computes) a Boolean function $f(v_1, \ldots, v_n) \in \{0, 1\}$; $v_i$ are logical
variables. These logic gates use the logical operators ∧, ∨, ¬ and transform input signals into
output signals. →Figure 10.3 shows the elementary logic gates and their corresponding truth
tables.

Figure 10.3 Elementary logic gates of Boolean functions and their corresponding truth table. The
top symbol corresponds to the IEC, and the bottom to the US standard symbols.


We see in →Figure 10.3 that the OR-gate is based on the functionality of the operator ∨. That
means, the output signal of the OR-gate equals 1 as soon as one of its input signals is 1.
The output signal of the AND-gate equals 1 if and only if all input signals equal 1. As soon as
one input signal equals 0, the value of the Boolean function computed by this gate is 0.
The NOT-gate computes the logical negation of the input signal. If the input signal is 1, the
NOT-gate gives 0, and vice versa.
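
The truth tables of the three elementary gates can be reproduced in R with the logical operators |, &, and !. The following is a minimal sketch for two input signals; the names v1, v2, and truth are our own.

```r
# All combinations of two binary input signals.
inputs <- expand.grid(v1 = c(0, 1), v2 = c(0, 1))
v1 <- as.logical(inputs$v1)
v2 <- as.logical(inputs$v2)

# Output signals of the OR-, AND-, and NOT-gate (NOT applied to v1 only).
truth <- data.frame(inputs,
                    OR     = as.integer(v1 | v2),
                    AND    = as.integer(v1 & v2),
                    NOT_v1 = as.integer(!v1))
print(truth)
```

Each row of the resulting data frame corresponds to one row of a truth table: the OR column is 1 as soon as one input is 1, and the AND column is 1 only in the row where both inputs are 1.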

10.5 Sum, product, and Binomial coefficients


For a given set $A = \{a_1, \ldots, a_n\}$, the sum and product of its components can be conveniently
summarized by the sum operation (∑) and the product operation (∏).

Sum
The sum, ∑, is defined for numbers $a_i$ involving all integer indices $i_l, i_l + 1, \ldots, i_u$ with
$i_l, i_u \in N$, i. e.,
$\sum_{i=i_l}^{i_u} a_i = a_{i_l} + a_{i_l+1} + \cdots + a_{i_u}.$ (10.39)

Here “l” indicates “lower”, whereas “u” means “upper”, to denote the beginning and ending of
the indices. For $i_l = 1$ and $i_u = n$, we obtain the sum over all elements of A,
$\sum_{i=1}^{n} a_i = a_1 + \cdots + a_n$. Alternatively, the sum can also be written by a different notation for
the index of the sum symbol,
$\sum_{i \in \{i_l, i_l+1, \ldots, i_u\}} a_i = a_{i_l} + a_{i_l+1} + \cdots + a_{i_u}.$ (10.40)

The latter form needs to be used if only selected indices should be used for the summation. For
instance, suppose $I = \{2, 4, 5\}$ is an index set containing the desired indices for the summation;
then
$\sum_{i \in I} a_i = \sum_{i \in \{2,4,5\}} a_i = a_2 + a_4 + a_5.$ (10.41)

Product
Similar to the sum, the product, ∏, is also defined for numbers $a_i$ involving all integer indices
$i_l, i_l + 1, \ldots, i_u$ with $i_l, i_u \in N$, i. e.,
$\prod_{i=i_l}^{i_u} a_i = a_{i_l} \cdot a_{i_l+1} \cdot \cdots \cdot a_{i_u};$ (10.42)

$\prod_{i \in \{i_l, i_l+1, \ldots, i_u\}} a_i = a_{i_l} \cdot a_{i_l+1} \cdot \cdots \cdot a_{i_u}.$ (10.43)

Remark 10.5.1.

In the above discussions of the sum and product, we assumed integer indices for the
identification of the numbers $a_i$, i. e., $i \in N$. However, we would like to remark that, in principle,
this can be generalized to arbitrary “labels”. For instance, for the set
$A = \{a_\triangle, a_\circ, a_\otimes\}$, we can define the sum and product over its elements as
$\sum_{i \in \{\triangle, \circ, \otimes\}} a_i = a_\triangle + a_\circ + a_\otimes;$ (10.44)

$\prod_{i \in \{\triangle, \circ, \otimes\}} a_i = a_\triangle \cdot a_\circ \cdot a_\otimes.$ (10.45)

Hence, from a mathematical point of view, the nature of the indices is flexible. However, whenever
we implement a sum or a product with a programming language, integer values for the indices
are advantageous, because, e. g., the indexing of vectors or matrices is accomplished via integer
indices.
In R, the most flexible way to realize sums and products is via loops. However, if one just wants a
sum or a product over all elements in a vector A, from $i_l = 1$ to $i_u = N$, one can use the
commands sum(A) and prod(A).
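
The following sketch illustrates these possibilities in R. The vector A, the index set I, and the named vector a with the labels triangle, circle, and otimes are illustrative choices of ours.

```r
A <- c(2, 4, 6, 8, 10)
sum(A)      # sum over all elements: 30
prod(A)     # product over all elements: 3840

# Sum over selected indices, as in equation (10.41).
I <- c(2, 4, 5)
sum(A[I])   # a_2 + a_4 + a_5 = 4 + 8 + 10 = 22

# The same sum realized with a loop.
s <- 0
for (i in I) {
  s <- s + A[i]
}
s           # 22

# Arbitrary "labels" as indices, realized via a named vector.
a <- c(triangle = 1.5, circle = 2, otimes = 4)
sum(a[c("triangle", "circle", "otimes")])   # 7.5
prod(a)                                     # 12
```

The loop is more verbose, but it is the form that generalizes to cases where each term requires additional computation.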

Binomial coefficients
For all natural numbers k, n ∈ N with 0 ≤ k ≤ n, the binomial coefficient, denoted C(n, k), is
defined by
$C(n, k) = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$ (10.46)

It is interesting to note that a binomial coefficient is a natural number itself, i. e., C(n, k) ∈ N.
For the definition of a binomial coefficient the factorial “!” of a natural number is used. The
factorial of n is just the product of the numbers from 1 to n, i. e.,
$n! = \prod_{i=1}^{n} i = 1 \cdot 2 \cdot \cdots \cdot n,$ (10.47)

with the convention 0! = 1.

The binomial coefficient has the combinatorial meaning that from n objects, there are C(n, k)
ways to select k objects without considering the order in which the objects have been selected. In
→Figure 10.4, we show an urn with n = 4 objects. From this urn, we can draw k = 2 objects in 6
different ways.
Also, the factorial n! has a combinatorial meaning. It gives the number of different arrangements
of n objects by considering the order. For instance, the objects {1,2,3} can be arranged in 3!=6
different ways:
(1, 2, 3) − (1, 3, 2) − (2, 3, 1) − (2, 1, 3) − (3, 1, 2) − (3, 2, 1). (10.48)
Figure 10.4 Visualization of the meaning of the Binomial coefficient C(4, 2) .
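
In R, binomial coefficients and factorials are provided by the base functions choose and factorial, and combn enumerates the selections themselves. A short sketch for the urn example with n = 4 and k = 2:

```r
choose(4, 2)                                   # 6
factorial(4) / (factorial(2) * factorial(2))   # 6, via equation (10.46)
combn(4, 2)                                    # each column is one selection of 2 objects
ncol(combn(4, 2))                              # 6 selections
factorial(3)                                   # 6 arrangements of three objects
```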

Properties of Binomial coefficients


The binomial coefficients have interesting properties. Some of these are listed below.
$C(n, 0) = \binom{n}{0} = 1,$ (10.49)

$C(n, n) = \binom{n}{n} = 1,$ (10.50)

$\binom{n}{k} = \binom{n}{n-k},$ (10.51)

∀n ∈ N, and 0 ≤ k ≤ n.
The following recurrence relation for binomial coefficients is called Pascal’s rule:
$\binom{n+1}{k+1} = \binom{n}{k} + \binom{n}{k+1}.$ (10.52)

In →Figure 10.5, we visualize the result of Pascal’s rule for n ∈ {0, … , 6} . The resulting object is
called Pascal’s triangle.
Figure 10.5 Pascal’s triangle for Binomial coefficients. Visualized is the recurrence relation for
Binomial coefficients in equation (→10.52).
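
The rows of Pascal’s triangle and the properties listed above can be reproduced numerically. The following sketch prints the rows for n = 0, … , 6 and checks Pascal’s rule and the symmetry property for one value of n chosen by us.

```r
# Rows of Pascal's triangle for n = 0, ..., 6.
for (n in 0:6) {
  cat(choose(n, 0:n), "\n")
}

# Pascal's rule, equation (10.52), for n = 5 and k = 0, ..., n - 1.
n <- 5
k <- 0:(n - 1)
all(choose(n + 1, k + 1) == choose(n, k) + choose(n, k + 1))   # TRUE

# Symmetry, equation (10.51).
all(choose(n, 0:n) == choose(n, n - 0:n))                      # TRUE
```

Such a check confirms the relations only for the tested values; it does not replace a proof.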

10.6 Further symbols


Let us again assume we have a given set $A = \{a_1, \ldots, a_n\}$, where its elements $a_i$ are numbers.

Minimum and maximum


The minimum and the maximum of the set A are defined by
$a_{\min} = \min_{i=1,\ldots,n}\{A\} = \{a_i \mid a_i \in A \text{ and } a_i \le a_j \ \forall j \ne i\};$ (10.53)

$a_{\max} = \max_{i=1,\ldots,n}\{A\} = \{a_i \mid a_i \in A \text{ and } a_i \ge a_j \ \forall j \ne i\}.$ (10.54)

If there is more than one element that is minimum or maximum, then the corresponding sets
$a_{\min}$ and $a_{\max}$ contain more than one element.

Argmin and Argmax


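There are two related functions to the minimum and maximum that return the indices of the
minimal/maximal elements instead of their values:
$i_{\min} = \underset{i=1,\ldots,n}{\operatorname{argmin}}\{A\} = \{i \mid a_i \in A \text{ and } a_i \le a_j \ \forall j \ne i\};$ (10.55)

$i_{\max} = \underset{i=1,\ldots,n}{\operatorname{argmax}}\{A\} = \{i \mid a_i \in A \text{ and } a_i \ge a_j \ \forall j \ne i\}.$ (10.56)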
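
In R, the values are obtained with min and max, and the indices with which.min and which.max. Note that the latter two return only the first index if the extremum is attained several times; all indices can be obtained with which. The vector A below is an example of our own.

```r
A <- c(4, 1, 7, 1, 7)

min(A)               # 1
max(A)               # 7
which.min(A)         # 2 (first index of the minimum only)
which.max(A)         # 3 (first index of the maximum only)
which(A == min(A))   # 2 4 (all indices of the minimum)
which(A == max(A))   # 3 5 (all indices of the maximum)
```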

Logical statements
A logical statement may be defined verbally or mathematically, and has the values true or false.
For simplicity, we define the Boolean value 1 for true, and 0 for false. One can show that the set
{true, false} is isomorphic to the set {0,1}.

The Boolean value of the statement “The next autumn comes for sure” equals 1 and, hence, the
statement is true. From a probabilistic point of view, this event is certain and its probability equals
one. Therefore, we may conclude that this statement does not contain any information, see also
[→169]. The following inequalities and equations
i = −5, (10.57)

100 = 50 + 20 + 30, (10.58)

−1 ≥ 5, (10.59)

1 < 2, (10.60)

$\sum_{j=1}^{n} j = \frac{n(n+1)}{2}, \quad n \in N,$ (10.61)

are mathematical statements, which are true or false. The first equation is false, as $i = \sqrt{-1}$,
where i is the imaginary unit of a complex number z = a + ib. The second equation is obviously
true, as 50+20+30 equals 100. The third statement is false, because a negative number cannot be
greater than or equal to a positive number. The fourth statement
represents an inequality too, and is true. Strictly speaking, the fifth equation is a statement form
(Sf) over the natural numbers, as it contains the variable n ∈ N.
In general, statement forms contain variables and are true or false. In case of equation (→10.61),
we can write $\langle Sf(n) \rangle = \langle \sum_{j=1}^{n} j = \frac{n(n+1)}{2} \rangle$. This statement form is true for all n ∈ N and can be
proven by induction. Another example of a statement form is
⟨Sf (x)⟩ = ⟨x + 5 = 15⟩. (10.62)

For x = 10, ⟨Sf (x)⟩ is true. For x ≠ 10, ⟨Sf (x)⟩ is false.
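
Statements and statement forms map naturally to logical expressions and functions in R. In the following sketch, the function names Sf_sum and Sf_x are hypothetical and chosen only for this illustration.

```r
# Statements with a fixed truth value.
100 == 50 + 20 + 30   # TRUE
-1 >= 5               # FALSE
1 < 2                 # TRUE

# Statement form of equation (10.61): its truth value depends on n.
Sf_sum <- function(n) sum(1:n) == n * (n + 1) / 2
all(sapply(1:100, Sf_sum))   # TRUE for n = 1, ..., 100 (a check, not a proof)

# Statement form of equation (10.62).
Sf_x <- function(x) x + 5 == 15
Sf_x(10)   # TRUE
Sf_x(3)    # FALSE
```

The numerical check of Sf_sum covers only finitely many values of n; the statement for all n ∈ N still requires the induction proof mentioned above.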


Generally, we can see that the statement changes if the variable of the statement form (Sf)
changes. Once we define statements (or statement forms), they can be combined by using logical
operations. We demonstrate these operations by first assuming that $S_1$ and $S_2$ are logical
statements. The statement $S_1 \wedge S_2$ means that $S_1$ and $S_2$ hold. This statement may have the
value true or false, see →Fig. 10.3. For instance, $S_1 := 2 + 2 = 4 \wedge S_2 := 3 + 3 = 6$ is true, but
$S_1 := 2 + 2 = 4 \wedge S_3 := 3 + 3 = 9$ is false. Similarly, $S_1 \vee S_2$ means that $S_1$ or $S_2$ holds.
Here, $S_1 := 2 + 2 = 4 \vee S_2 := 3 + 3 = 6$ is true, and $S_1 := 2 + 2 = 4 \vee S_3 := 3 + 3 = 9$ is
true as well. The logical negation of the statement S is usually denoted by ¬S. The well-known
triangle inequality,
$|x_1 + x_2| \le |x_1| + |x_2|, \quad x_1, x_2 \in R,$ (10.63)

holds true, but not
$\neg(|x_1 + x_2| \le |x_1| + |x_2|).$ (10.64)

This means
$|x_1 + x_2| > |x_1| + |x_2|$ (10.65)

is generally false.

Statement ⇒
The logical implication $S_1 \Longrightarrow S_2$ means that $S_1$ implies $S_2$. Verbally, one can say $S_1$ “logically
implies” $S_2$, or if $S_1$ holds, then $S_2$ follows.

Statement ⇔
The statement $S_1 \Longleftrightarrow S_2$ is stronger, because $S_1$ holds if and only if $S_2$ holds.

For the above statements, it is important to note that to go from the left statement to the right
one, or vice versa, one needs to apply logical operators (¬, ∧, ∨) or algebraic operations (+, −, /,
etc.). For instance, by assuming the true statement $n^2 \ge 2n$, n > 1, we obtain the implications
$n^2 \ge 2n \Longrightarrow n^2 - 2n \ge 0 \Longrightarrow n^2 - 2n + 1 = (n - 1)^2 \ge 0.$ (10.66)

Finally, we want to remark that a false statement may imply a true statement; $i^2 = 1$ (false as
$i^2 = -1$) implies $0 \cdot i^2 = 0 \cdot 1$ (true).
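
The behavior of ⇒ and ⇔ can be tabulated in R. The sketch below assumes the usual truth-functional reading of the implication, i. e., $S_1 \Longrightarrow S_2$ is evaluated as $\neg S_1 \vee S_2$; this reading and the function names implies and iff are our own additions for illustration.

```r
# Assumed truth-functional definitions (hypothetical helper names).
implies <- function(S1, S2) !S1 | S2
iff     <- function(S1, S2) S1 == S2

# Truth table over all combinations of S1 and S2.
tt <- expand.grid(S1 = c(TRUE, FALSE), S2 = c(TRUE, FALSE))
tt$implies <- implies(tt$S1, tt$S2)
tt$iff     <- iff(tt$S1, tt$S2)
print(tt)

implies(FALSE, TRUE)   # a false statement may imply a true one: TRUE
```

The only row in which the implication is false is the one with a true $S_1$ and a false $S_2$, whereas the equivalence is true exactly when both statements have the same truth value.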

10.7 Importance of definitions and theorems


In order to develop and formulate mathematical concepts and ideas precisely, we need a concise
language. For instance, if we want to define a mathematical term, we first need to understand
what a mathematical definition is. A definition is a concept formation of a mathematical term that
is (possibly) based on other (mathematical) terms, which are either immediately clear or which
have already been defined. It is important not to confuse the terms definition and theorem. As
mentioned above, a definition is just a concept formation, and not a statement and, therefore, it
cannot be proven, but it is assumed to be true. In contrast, a theorem is a mathematical
statement that needs to be proven by using other statements. In the following, we give some
examples of definitions:
Definition 10.7.1.
Let a, b ∈ R. The sum of these two real numbers is defined by
sum(a, b) := a + b. (10.67)

Definition →10.7.1 defines the sum of two real numbers based on the trivial definition of the
symbol “+”.

Definition 10.7.2.
Let a, b ∈ R. The function $f_L : R \longrightarrow R$, given by
$f_L(x) := ax + b,$ (10.68)

defines a linear function or a linear mapping.


The next statement can be formulated as a theorem based on the previous definition.
Theorem 10.7.1.
The unique solution of the equation
$f_L(x) = 0$ (10.69)

is given by $x = -\frac{b}{a}$, provided that a ≠ 0.

The proof of Theorem →10.7.1 is very simple, as $f_L(x) := ax + b = 0$ leads directly to
$x = -\frac{b}{a}$ by performing elementary calculations. Specifically, the first elementary calculation is
subtracting b from both sides of ax + b = 0. Second, we divide the resulting equation by a and obtain the
result.
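
The theorem can be illustrated numerically in R. In the following sketch, the function name f_L mirrors the definition above, whereas the parameter values a = 2 and b = −3 and the search interval are arbitrary choices of ours.

```r
f_L <- function(x, a, b) a * x + b

a <- 2
b <- -3
x0 <- -b / a     # solution according to the theorem: 1.5
f_L(x0, a, b)    # 0

# Numerical root search as an independent check (result is approximate).
root <- uniroot(f_L, interval = c(-10, 10), a = a, b = b)$root
abs(root - x0) < 1e-3   # TRUE
```

The closed-form solution is exact, whereas uniroot returns only an approximation within its tolerance.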
Another example is the famous binomial theorem.
Theorem 10.7.2.
Let a, b ∈ R and n ≥ 1. Then,
$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k.$ (10.70)

Theorem →10.7.2 can be proven by induction over n.


Sometimes, one uses the term lemma instead of theorem. A lemma is also a statement that
needs to be proven; however, it is not as important as a theorem. An example of an important
theorem is the well-known fundamental theorem of algebra [→127], stating that any complex-
valued polynomial with degree n ≥ 1 has exactly n zeros, counted with multiplicity. To give a
function-theoretic proof, one needs
several lemmas to conclude this theorem, see, e. g., [→49].
Another type of statement is a corollary. A corollary is also a theorem (statement), but it follows
immediately from a theorem proven before. The following corollary follows from Theorem
→10.7.2 straightforwardly:

Corollary 10.7.1.
$(a + b)^2 = a^2 + 2ab + b^2.$ (10.71)
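
The binomial theorem and its corollary can be checked numerically for given values. In the following sketch, the function binom_expand is a hypothetical helper, and the values of a, b, and n are arbitrary choices of ours.

```r
# Right-hand side of the binomial theorem, summing over k = 0, ..., n.
binom_expand <- function(a, b, n) {
  k <- 0:n
  sum(choose(n, k) * a^(n - k) * b^k)
}

a <- 1.5
b <- -0.5
n <- 6
all.equal(binom_expand(a, b, n), (a + b)^n)               # TRUE
all.equal(binom_expand(a, b, 2), a^2 + 2 * a * b + b^2)   # TRUE (the corollary)
```

The comparison uses all.equal instead of == because floating-point arithmetic may introduce small rounding differences.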

10.8 Summary
In general, the mathematical language is meant to help with the precise formulation of problems.
If one is new to the field, such formulations can be intimidating at first, and verbal formulations
may appear sufficient. However, with a bit of practice, one realizes quickly that this is not the
case, and one starts to appreciate and to benefit from the power of mathematical symbols.
Importantly, the mathematical language has a profound implication on the general mathematical
thinking capabilities, which translate directly to analytical problem-solving strategies. The latter
skills are key for working successfully on data science projects, e. g., in business analytics, because
the process of analyzing data requires a full comprehension of all involved aspects and the often
abstract relations.
11 Computability and complexity
This chapter provides a theoretical underpinning for the programming in R that we introduced in
the first two parts of this book. Specifically, we introduced R practically by discussing various
commands for computing solutions to certain problems. However, computability can be defined
mathematically in a generic way that is independent of a programming language. This paves the
way for determining the complexity of algorithms. Furthermore, we provide a mathematical
definition of a Turing machine, which is a mathematical model for an electronic computer. To
place this in its wider context, this chapter also provides a brief overview of several major
milestones in the history of computer science.

11.1 Introduction
Nowadays, the use of information technologies and the application of computers are ubiquitous.
Almost everyone uses computer applications to store, retrieve, and process data from various
sources. A simple example is a relational database system for querying financial data from stock
markets, or finding companies’ telephone numbers. More advanced examples include programs
that facilitate risk management in life insurance companies or the identification of chemical
molecules that share similar structural properties in pharmaceutical databases [→54], [→170].
The foundation of computer science is based on theoretical computer science [→163], [→164].
Theoretical computer science is a relatively young discipline that, put simply, deals with the
development and analysis of abstract models for information processing. Core topics in
theoretical computer science include formal language theory and compilers [→121], [→160],
computability [→22], complexity [→37], and semantics of programming languages [→122],
[→126], [→167] (see also Section →2.8). More recent topics include the analysis of algorithms
[→37], the theory of information and communication [→40], and database theory [→124]. In
particular, the mathematical foundations of theoretical computer science have influenced modern
applications tremendously. For example, results from formal language theory [→160] have
influenced the construction of modern compilers [→121]. Formal languages have been used for
the analysis of automata. The automata model of a Turing machine has been used to formalize
the term algorithm, which plays a central role in computer science. When dealing with algorithms,
an important question is whether they are computable (see Section →2.2). Another crucial issue
relates to the analysis of algorithms’ complexity, which provides upper and lower bounds on their
time complexity (see Section →11.5.1). Both topics will be addressed in this chapter.

11.2 A brief history of computer science


In this section, we briefly sketch the history of computer science with respect to the most
important milestones for its theoretical foundations [→32], [→89], [→146]. As early as 1100 BC,
the first mechanical calculators were constructed. The abacus, for example, is over 3000 years old.
In approximately 300 BC, Euclid contributed to the development of computational methods by
calculating the greatest common divisor (GCD). Another milestone was achieved in approximately
820 AD by Al-Khwarizmi, who explored the fundamental aspects of computing methods. The term
algorithm is derived from the Latinization of his name: Algorithmi. From about 1518, the scientist
Adam Riese developed algorithms with the aim of establishing the decimal system.
Further milestones in computer science were achieved in the seventeenth century (see [→32],
[→89]). Pascal (approx. 1641) developed a patent for his calculator, Pascaline, which was used for
accounting and tax calculations. Leibniz (approx. 1673) developed a calculating machine to
perform the four fundamental arithmetic operations. In 1679, Leibniz was also the first to develop
the dual (i. e., binary) system, which uses only the digits 0 and 1. Its development had a fundamental influence
on modern computers, as well as processors.
The development of mechanical calculating machines controlled by programs was advanced in
the nineteenth century [→32], [→89]. The idea was to use control-based programming to perform
more complex calculations than were possible using the simple machines described above. A
highlight was the seminal work by Babbage (1822), who developed the concept of a computer
called the analytical engine. A contribution with significant impact on modern computer science
was achieved by Boole in 1854. He developed the mathematical foundations of so-called Boolean
logic, based on logic operations. Another breakthrough, attributable to Hollerith in 1886, was the
development of a system for data processing using card-to-tape calculations. This system was
used until the second half of the twentieth century, and contributed greatly to modern
information processing.
Turing developed the concept of the so-called Turing machine in the 1930s [→32], [→89]. This
automaton-based model has had a considerable influence on modern (theoretical) computer
science, and nowadays serves as a theoretical foundation for computers. Zuse, in 1941, was
among the pioneers who contributed to the development of electronic calculating machines. He
developed the program-controlled computer Z3 together with a programming language called
Plankalkül. The first fully electronic computer, developed by Eckert and Mauchly (1946), was called
ENIAC (for electronic numerical integrator and computer), and industrial production of computers
started in the 1950s.
Another computer science pioneer was John von Neumann, who developed the so-called Von
Neumann architecture published in 1945, as a basis for computing machines that are
programmable from memory. We wish to emphasize that, besides the above-mentioned
developments and findings, mathematical principles from information theory, signal processing,
computer linguistics, and cybernetics have also influenced the development of modern electronic
computers.

11.3 Turing machines


The search for a precise definition of an algorithm has challenged mathematicians for several
decades [→37]. In fact, the quest to resolve this problem began at the beginning of the twentieth
century during the search for solutions to complex computational problems. Hilbert’s tenth
problem, which addressed the question of whether an arbitrary diophantine equation [→30]
possesses a solution, is an example of this. Various methods were developed in the attempt to
solve this problem; however, in 1970 it was shown that no general algorithm for it can exist. Interestingly, the quest to solve such
computational problems also led to questions about the computability of algorithms (or
functions). This will be discussed in greater detail in Section →11.4.
Figure 11.1 Illustration of the principle behind a Turing machine.

Turing machines constituted an important contribution to the above-mentioned developments. A
Turing machine is a mathematical machine with relatively primitive operations and constraints that
mimics a real computer. Since a Turing machine is a mathematical model, its memory can be
infinite (e. g., given by an infinite strip or tape that is sub-divided into fields; see →Figure 11.1).
Formally, a Turing machine is defined as follows:

Definition 11.3.1 (Turing machine).


A deterministic Turing machine is a tuple TM = (S, Σ, Γ, δ, s0, $, F) consisting of the following:
S, |S| < ∞ is a set of states. (11.1)

Σ ⊂ Γ is the input alphabet. (11.2)

Γ is the alphabet of the strip (tape). (11.3)

δ : S × Γ ⟶ S × Γ × {l, r} is called the transition function, where l and r denote left shift and right shift, respectively. (11.4)

s0 ∈ S is the initial state. (11.5)

$ ∈ Γ − Σ is the blank symbol. (11.6)

F ⊂ S is the set of final states. (11.7)

Given an alphabet Σ, one can print only one character c ∈ Γ in each field. A special character (e.
g., $) is used to fill the empty fields (blank symbol).
The transition function δ is crucial for the control unit, and encodes the program of the Turing
machine (see →Figure 11.1). The Turing table conveys information about the current and
subsequent states of the machine after it reads a character c ∈ Γ. This initiates certain actions of
the read/write head, namely
l: moving the head exactly one field to the left.
r: moving the head exactly one field to the right.
x: overwriting the content of a field with x ∈ Γ ∪ {$} without moving the head.
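
To make Definition 11.3.1 more concrete, the following R sketch simulates a deterministic Turing machine. It is an illustration under stated assumptions rather than a definitive implementation: the function name `run_tm`, the encoding of δ as a named list with keys of the form "state|symbol", the step limit `max_steps`, and the example machine (which flips every bit and halts at the first blank) are all hypothetical choices made here.

```r
# Minimal sketch of a deterministic Turing machine simulator (illustrative only).
# delta: named list; key "state|symbol" -> c(new_state, symbol_to_write, move),
#        where move is "l" or "r" (an assumed encoding of the transition function).
run_tm <- function(delta, tape, s0, final, blank = "$", max_steps = 1000) {
  state <- s0
  pos <- 1
  for (step in seq_len(max_steps)) {
    if (state %in% final) break                         # stop in a final state
    if (pos > length(tape)) tape <- c(tape, blank)      # extend tape to the right
    if (pos < 1) { tape <- c(blank, tape); pos <- 1 }   # extend tape to the left
    key <- paste(state, tape[pos], sep = "|")
    rule <- delta[[key]]
    if (is.null(rule)) stop("no transition defined for ", key)
    state <- rule[1]                                    # new state
    tape[pos] <- rule[2]                                # write symbol
    pos <- pos + if (rule[3] == "r") 1 else -1          # move the head
  }
  list(state = state, tape = tape, halted = state %in% final)
}

# Hypothetical example machine: replace 0 by 1 and 1 by 0, halt on the blank symbol.
delta <- list(
  "q0|0" = c("q0", "1", "r"),
  "q0|1" = c("q0", "0", "r"),
  "q0|$" = c("qf", "$", "r")
)
result <- run_tm(delta, tape = c("1", "0", "1", "1"), s0 = "q0", final = "qf")
print(paste(result$tape, collapse = ""))  # "0100$"
```

The step limit is a practical safeguard only; a Turing machine as defined above has no such bound and may run forever.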
A fundamental question of theoretical computer science concerns the types of functions that are
computable using Turing machines. For example, it emerged that functions defined on words (e.
g., $f : \Sigma^{\star} \longrightarrow \Sigma^{\star}$) are Turing-computable if there is at least one Turing machine that computes them and stops after
a finite number of steps in a final state. We wish to emphasize that this also holds for other
functions (e. g., multivariate functions over several variables).
We conclude this section with an important observation regarding Turing completeness. This
term is relevant for basic paradigms of programming languages (see Chapter →2). A
programming language is deemed Turing-complete if every function that can be computed by a
Turing machine can also be computed with this language. For example, most modern
programming languages (from different paradigms), such as Java, C++, and Scheme, are Turing-
complete [→122].

11.4 Computability
We now turn to a fundamental problem in theoretical computer science: the determination as to
whether or not a function is computable [→164]. This problem can be discussed intuitively as well
as mathematically. We begin with the intuitive discussion, and then provide its mathematical
formulation. It is generally accepted that function f : N ⟶ N is computable if an algorithm to
compute f exists. Therefore, assuming an arbitrary n ∈ N as input, the algorithm should stop
after a finite number of computation steps with output f (n) . When discussing this simple model,
we did not take into account any considerations regarding a particular processor or memory.
Evidently, however, it is necessary to specify such steps to implement an algorithm. In practical
terms, this is complex, and can only be accomplished by a general mathematical definition to
decide whether a function f : N ⟶ N is computable.
A related problem is whether any arbitrary problem can be solved using an algorithm, and, if
not, whether the algorithm can identify the problem as noncomputable. This is known as the
decision problem formulated by Hilbert, which turned out to have a negative answer [→36]. A closely related result
is Gödel’s well-known incompleteness theorem [→36]. Put simply, it states that no algorithm exists
that can verify whether an arbitrary statement over N is true or false. To explore Gödel’s
statement in depth, several formulations of the term algorithm as a computational procedure
have been proposed. A prominent example thereof was proposed by Church, who explored the
well-known Lambda calculus, which can be understood as a mathematical programming
language (see [→122]). It was in this context also that Turing developed the concept of a Turing
machine [→36], [→122] (see Section →11.3). Another contribution by Gödel is an alternative
computational procedure based on the definition of complex mathematical functions composed
of simple functions. The result of all these developments was the proof that all the above-mentioned
computational procedures (algorithms) are equivalent. This equivalence supports the Church-Turing thesis,
according to which these procedures capture the intuitive notion of an algorithm.
Furthermore, it has been proven that computability does not depend on a specific
programming language (see [→122]). In other words, most programming languages are
equipotent [→122]. For example, suppose that we solve a problem by using an imperative
programming language, such as Fortran (see Section →2.2). Then, an equivalent algorithm exists
that can be implemented using a functional language, such as Scheme (see Section →2.3).
A mathematical definition of computable can be formulated as follows:

Definition 11.4.1 (Computable function).


A function $f : N^k \longrightarrow N$ is called computable if an algorithm exists that computes
$f(n_1, n_2, \ldots, n_k)$. That means $n_1, n_2, \ldots, n_k$ is the input of f such that the algorithm stops after
a finite number of computation steps in the case where f is defined for $n_1, n_2, \ldots, n_k$. In the case
where f is not defined on $n_1, n_2, \ldots, n_k$, the algorithm does not terminate.

We wish to note that a similar definition can be given for functions defined on words (e. g.,
$f : \Sigma^{\star} \longrightarrow \Sigma^{\star}$, see [→36], [→164]). Examples of computable functions include the following:

The functions $f_1 : N^2 \longrightarrow N$, $f_1(a, b) := a \cdot b$ and $f_2 : N^2 \longrightarrow N$, $f_2(a, b) := a + b$.

The (successor) function $f : N \longrightarrow N$, $f(n) := n + 1$.

The recursive function $\mathrm{sum} : N \longrightarrow N$ defined by $\mathrm{sum}(n) := n + \mathrm{sum}(n - 1)$,
$\mathrm{sum}(0) := 0$.
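
The computable functions listed above can be written down directly in R. The following is a minimal sketch; the names `f1`, `f2`, `succ`, and `rec_sum` are chosen here for illustration (`rec_sum` avoids masking R's built-in `sum`).

```r
# Multiplication and addition on pairs of natural numbers
f1 <- function(a, b) a * b
f2 <- function(a, b) a + b

# Successor function: f(n) = n + 1
succ <- function(n) n + 1

# Recursive sum: sum(n) = n + sum(n - 1), with sum(0) = 0.
# The input check is an added safeguard: for inputs outside the natural
# numbers the recursion would never reach the base case.
rec_sum <- function(n) {
  stopifnot(n >= 0, n == floor(n))
  if (n == 0) return(0)
  n + rec_sum(n - 1)
}

print(f1(3, 4))     # 12
print(f2(3, 4))     # 7
print(succ(0))      # 1
print(rec_sum(10))  # 55
```

Each function terminates after finitely many steps for every natural-number input, which is what makes it computable in the sense of Definition 11.4.1.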

11.5 Complexity of algorithms


Algorithms play a central role not only in mathematics and computer science, but also in machine
learning and data science [→37]. The term algorithm may be understood intuitively as a
description of a general method for solving a class of problems. Mathematically speaking, an
algorithm is defined by a set of rules that are executed sequentially, some of which may be
repeated under certain conditions. Programming languages are excellent tools for implementing
algorithms. For example, the class of recursive algorithms has often been used to solve
problems that are defined recursively. A prominent example is the well-known Ackermann function [→122], which
can easily be coded using functional programming languages (see Section →2.3). By contrast,
iterative algorithms have been used to solve problems efficiently. The imperative
implementation (see Section →2.2) of the shortest path problem, proposed by Dijkstra [→58], is a
standard case study in computer science, which is frequently used to illustrate iterative
algorithms. Examples of typical algorithms in mathematics include the GCD-algorithm proposed
by Euclid [→122] and the Gaussian elimination method for solving linear equation systems [→27].
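
As a small illustration of the recursive and iterative styles mentioned above, the following sketch implements Euclid's GCD algorithm in both forms. The function names `gcd_recursive` and `gcd_iterative` are assumptions made for this example, and non-negative integer inputs are assumed.

```r
# Recursive formulation: gcd(a, b) = gcd(b, a mod b), gcd(a, 0) = a
gcd_recursive <- function(a, b) {
  if (b == 0) return(a)
  gcd_recursive(b, a %% b)
}

# Iterative formulation of the same rule using a while-loop
gcd_iterative <- function(a, b) {
  while (b != 0) {
    r <- a %% b
    a <- b
    b <- r
  }
  a
}

print(gcd_recursive(48, 18))  # 6
print(gcd_iterative(48, 18))  # 6
```

Both versions perform the same sequence of remainder computations; they differ only in how the repetition is expressed.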
It seems plausible that many algorithms exist to address a particular problem. For example,
the square of a real number can be computed using either a functional or an imperative algorithm
(see Sections →2.2 and →2.3). However, this raises the question as to what type of algorithm is
most suited to solving a given problem.
Listing important properties/questions in the context of algorithm design offers insight into the
complexity of the latter problem. Such properties and questions include
What level of effort is required to implement a particular algorithm?
How can the algorithm be simplified as far as possible?
How much memory is required?
What is the time complexity (i. e., execution time) of an algorithm?
Is the algorithm correct?
Does the algorithm terminate?
In attempting to define a reasonable measure for assessing algorithms, time complexity has
emerged as crucial. This measure should be rather abstract and general, as an algorithm’s
execution time depends on several factors, including (i) the style of programming, (ii) the
particular programming language used, (iii) processor speed, and (iv) whether a compiler or
interpreter is used. It would be unreasonable to define a cost function to judge each algorithm’s
time complexity separately, as it is not clear what kind of data structure should be used. For
example, a list depends on the number of elements, whereas a matrix depends on the number of
rows and columns. Hence, it is impossible to estimate the parameters of an algorithm in advance
and, consequently, the evaluation of its time complexity is an intricate process.

11.5.1 Bounds
Let n be the input size of an algorithm (i. e., the number of data elements to be processed). The
time complexity of an algorithm is determined by the maximal number of steps (e. g., value
assignments, arithmetic operations, memory allocations, etc.) in relation to input size required to
obtain a specific result.
In the following, we describe how to measure the time complexity of an algorithm asymptotically,
and describe several forms thereof. First, we state an upper bound for the time complexity that
will be attained in the worst case (O-notation). To begin, we provide a definition of real
polynomials, as they play a crucial role in the asymptotic measurement of algorithms’ time
complexity.

Definition 11.5.1 ([→151]).

The function $f : R \longrightarrow R$, defined by

$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0, \quad a_n \neq 0, \; a_k \in R, \; k = 0, 1, \ldots, n, \tag{11.8}$$

is a real polynomial of degree n.


By definition, the input variable x and the value of the function f(x) are real numbers. By taking
only real coefficients $a_k$ into account, the polynomial $f : N \longrightarrow N$ (see equation (→11.8)) is also a
real polynomial. Generally speaking, a polynomial is called real if its coefficients are real.
To define an asymptotic upper bound for the time complexity of an algorithm, the O-notation is
required.

Definition 11.5.2 (O-notation [→37]).

Let $f, g : N \longrightarrow N$ be two polynomials. We define

$$f(n) = O(g(n)) \iff \exists\, c \in R,\ c > 0,\ n_0 \in N : f(n) \le c \cdot g(n) \quad \forall n \ge n_0. \tag{11.9}$$

Definition →11.5.2 means that g(n) is an asymptotic upper bound of f(n) if a constant c > 0
and a natural number $n_0$ exist such that f(n) is less than or equal to $c \cdot g(n)$ for $n \ge n_0$.

In contrast to the worst case, described by the O-notation, we now define an asymptotic lower
bound that describes the “least” complexity. This is provided by the Ω-notation.

Definition 11.5.3 (Ω-notation [→37]).

Let $f, g : N \longrightarrow R_+$ be two polynomials. We define

$$f(n) = \Omega(g(n)) \iff \exists\, c \in R,\ c > 0,\ n_0 \in N : f(n) \ge c \cdot g(n) \quad \forall n \ge n_0. \tag{11.10}$$

According to Definition →11.5.3, g(n) is an asymptotic lower bound of f(n) if a constant c > 0
and a natural number $n_0$ exist such that $f(n) \ge c \cdot g(n)$ for $n \ge n_0$.

To simultaneously define upper and lower bounds for the time complexity, the Θ-notation is used.

Definition 11.5.4 (Θ-notation [→37]).

Let $f, g : N \longrightarrow R_+$ be two polynomials. We define

$$f(n) = \Theta(g(n)) \iff \exists\, c_1, c_2 \in R_+,\ c_1, c_2 > 0,\ n_0 \in N : c_1 \cdot g(n) \le f(n) \le c_2 \cdot g(n) \quad \forall n \ge n_0. \tag{11.11}$$

According to Definition →11.5.4, g(n) is an exact asymptotic bound of f(n) if two constants
$c_1, c_2 > 0$ and a natural number $n_0$ exist such that f(n) lies between $c_1 \cdot g(n)$ and $c_2 \cdot g(n)$
if $n \ge n_0$.

11.5.2 Examples
In this section, some examples are given to illustrate the definitions of the asymptotic bounds. In
practice, the O-notation is the most important and widely used. Hence, the following examples will
focus on it.
To simplify the notation, we denote the number of calculation steps in an algorithm by f(n). Let
$f(n) := n^2 + 3n$. To determine the complexity class $O(n^k)$, $k \in N$, the constants c and $n_0$ must
be determined. Using Definition →11.5.2, setting c = 4 and $g(n) = n^2$, the following inequalities
can be verified:

$$n^2 + 3n \le 4n^2 \quad \text{or} \quad 3n \le 3n^2. \tag{11.12}$$

This gives $1 \le n$. That means, with c = 4 and $n_0 = 1$, we have

$$n^2 + 3n \le c \cdot n^2. \tag{11.13}$$

Thus, we obtain $n^2 + 3n \in O(n^2)$.
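
The bound derived in this example, namely that n^2 + 3n is at most 4n^2 for all n ≥ 1, can also be checked numerically. The following sketch does so for a finite range of n only, so it illustrates the inequality and does not prove it; the variable names are chosen here for illustration.

```r
f <- function(n) n^2 + 3 * n   # number of steps in the example
g <- function(n) n^2           # candidate upper bound
c_const <- 4                   # constant c from the example
n0 <- 1                        # threshold n_0 from the example

n <- n0:1000
bound_holds <- all(f(n) <= c_const * g(n))
print(bound_holds)  # TRUE

# The ratio f(n) / g(n) = 1 + 3 / n decreases towards 1 as n grows
print(f(c(1, 10, 100, 1000)) / g(c(1, 10, 100, 1000)))
```

At n = 1 the two sides are equal, which shows that the constant c = 4 cannot be lowered while keeping the threshold at 1.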
A second example is the function

$$f(n) := c_5 n^5 + c_4 n^4 + c_3 n^3 + c_2 n^2 + c_1 n + c_0. \tag{11.14}$$

If n goes to infinity, we can disregard the terms $c_4 n^4 + c_3 n^3 + c_2 n^2 + c_1 n + c_0$ as well as the
constant $c_5$, and obtain $f(n) \in O(n^5)$. We see that the O-notation always emphasizes the
dominating power of the polynomial (here $n^5$).

To demonstrate the general case, we use the polynomial

$$f(n) = a_k n^k + a_{k-1} n^{k-1} + \cdots + a_1 n + a_0, \quad a_k \neq 0, \tag{11.15}$$

and obtain

$$\begin{aligned}
f(n) &= |a_k n^k + a_{k-1} n^{k-1} + \cdots + a_1 n + a_0| \\
&= n^k \left| a_k + \frac{a_{k-1}}{n} + \frac{a_{k-2}}{n^2} + \cdots + \frac{a_0}{n^k} \right| \\
&\le n^k \left( |a_k| + \frac{|a_{k-1}|}{n} + \frac{|a_{k-2}|}{n^2} + \cdots + \frac{|a_0|}{n^k} \right) \\
&\le n^k \left( |a_k| + |a_{k-1}| + |a_{k-2}| + \cdots + |a_0| \right) = c\, n^k.
\end{aligned} \tag{11.16}$$

Inequality (→11.16) has been obtained using the triangle inequality [→178]. By setting
$c := |a_k| + |a_{k-1}| + |a_{k-2}| + \cdots + |a_0|$, inequality (→11.16) is satisfied for $n \ge 1$. That means,
$f(n) \le c\, n^j$ for $j \ge k$ and $n \in N$. Finally, we obtain $f(n) \in O(n^j)$ for $j \ge k$.
In the final example, we use a simple imperative program (see Section →2.2) to calculate the sum
of the first n natural numbers ($sum = 1 + 2 + \cdots + n$). Basically, the pseudocode of this program
consists of the initialization step, sum = 0, and a for-loop with variable i and body sum = sum + i
for $1 \le i \le n$. The first value assignment requires constant costs, say, $c_1$. In each step of the for-
loop to increment the value of the variable sum, constant costs $c_2$ are required. Then we obtain
the upper bound for the time complexity

$$f(n) = c_1 + c_2 \cdot n, \tag{11.17}$$

and finally, $f(n) \in O(n)$.

11.5.3 Important properties of the O-notation

In view of the importance of the O-notation for practical use, several of its properties are listed
below:

$$c = O(1), \tag{11.18}$$

$$c \cdot O(f(n)) = O(f(n)), \tag{11.19}$$

$$O(f(n)) + O(f(n)) = O(f(n)), \tag{11.20}$$

$$O(\log_b(n)) = O(\log(n)), \tag{11.21}$$

$$O(f(n) + g(n)) = O(\max\{f(n), g(n)\}), \tag{11.22}$$

$$O(f(n)) \cdot O(g(n)) = O(f(n) \cdot g(n)). \tag{11.23}$$

An algorithm with a constant number of steps has time complexity O(1) (see equation (→11.18)).
The second rule given by equation (→11.19) means that constant factors can be neglected. If we
execute programs with time complexity O(f(n)) one after another, the resulting program will have the
same complexity (see equation (→11.20)). According to equation (→11.21), the logarithmic
complexity does not depend on the base b. Moreover, the sequential execution of two programs
with different time complexities has the complexity of the program with higher time complexity
(see equation (→11.22)). Finally, the overall complexity of a nested program (for example, two
nested loops) is the product of the individual complexities (see equation (→11.23)).

11.5.4 Known complexity classes

Finally, we list some examples of algorithm complexity classes:
$O(1)$ consists of programs with constant time complexity (e. g., value assignments,
arithmetic operations, hashing).
$O(n)$ consists of programs with linear time complexity (e. g., calculating sums and linear
searching procedures).
$O(n^2)$ consists of programs with quadratic time complexity (e. g., a simple sorting
algorithm, such as Bubblesort [→37]).
$O(n^3)$ consists of programs with cubic time complexity (e. g., a simple algorithm to solve
the shortest path problem proposed by Dijkstra [→58], where n is the number of vertices in
a network).
$O(n^k)$ generally consists of programs with polynomial time complexity. Obviously, $O(n)$,
$O(n^2)$ and $O(n^3)$ are also polynomial.
$O(\log(n))$ consists of programs with logarithmic time complexity (e. g., binary searching
[→37]).
$O(2^n)$ consists of programs with exponential time complexity (e. g., enumeration problems
and recursive functions [→37]).



Algorithms with complexity $O(1)$ are highly desirable in practice. Logarithmic and linear time
complexity are also favorable for practical applications as long as $\log(n) < n$, $n > 1$. Quadratic
and cubic time complexity remain sufficient when n is relatively small. Algorithms with complexity
$O(2^n)$ can only be used under certain constraints, since $2^n$ grows much faster than $n^k$.
Such algorithms could possibly be used, for example, when searching for graph isomorphisms or cycles in
graphs with bounded vertex degrees (see [→130]).
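
To give a feeling for how differently these classes grow, the following sketch counts the basic steps of three small R programs: a loop that sums the first n natural numbers (linear), two nested loops (quadratic), and a binary search in a sorted vector (logarithmic). The function names and the idea of counting loop iterations as "steps" are assumptions made for this illustration; actual running times depend on the factors discussed earlier in this section.

```r
# Linear: summing the first n natural numbers takes n loop iterations
sum_loop <- function(n) {
  total <- 0
  steps <- 0
  for (i in seq_len(n)) {
    total <- total + i
    steps <- steps + 1
  }
  list(value = total, steps = steps)
}

# Quadratic: two nested loops over n elements take n * n iterations
count_pairs <- function(n) {
  steps <- 0
  for (i in seq_len(n)) for (j in seq_len(n)) steps <- steps + 1
  steps
}

# Logarithmic: binary search halves the search interval in every iteration
binary_search <- function(x, target) {
  lo <- 1; hi <- length(x); steps <- 0
  while (lo <= hi) {
    steps <- steps + 1
    mid <- (lo + hi) %/% 2
    if (x[mid] == target) return(list(index = mid, steps = steps))
    if (x[mid] < target) lo <- mid + 1 else hi <- mid - 1
  }
  list(index = NA, steps = steps)
}

# Step counts for growing input sizes: n, linear, quadratic, logarithmic
for (n in c(10, 100, 1000)) {
  cat(n, sum_loop(n)$steps, count_pairs(n),
      binary_search(seq_len(n), n)$steps, "\n")
}
```

When n grows by a factor of 10, the linear count grows by the same factor, the quadratic count by a factor of 100, and the binary search needs only a few additional iterations.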

11.6 Summary
At this juncture, it is worth reiterating that, despite the apparent novelty of the term data science,
the fields on which it is based have long histories, among them theoretical computer science
[→61]. The purpose of this chapter has been to show that computability, complexity, and the
computer, in the form of a Turing machine, are mathematically defined. This aspect can easily be
overlooked in these terms’ practical usage.
The salient point is that data scientists should recognize that all these concepts possess
mathematical definitions which are neither heuristic nor ad-hoc. As such, they may be revisited if
necessary (e. g., to analyze an algorithm’s runtime). Our second point is that not every detail
about these entities must be known. Given the intellectual complexity of these topics, this is
encouraging, because acquiring an in-depth understanding of these is a long-term endeavor.
However, even a basic understanding is valuable and helps in improving practical programming
and data analysis skills.
12 Linear algebra
One of the most important and widely used subjects of mathematics is linear algebra [→27]. For
this reason, we begin this part of the book with this topic. Furthermore, linear algebra plays a
pivotal role for the mathematical basics of data science.
This chapter opens with a brief introduction to some basic elements of linear algebra, e. g.,
vectors and matrices, before discussing advanced operations, transformations, and matrix
decompositions, including Cholesky factorization, QR factorization, and singular value
decomposition [→27].

12.1 Vectors and matrices


Vectors and matrices are fundamental objects that are used to study problems in many fields,
including mathematics, physics, engineering, and biology (see e. g. [→21], [→27], [→186]).

12.1.1 Vectors
Vectors define quantities, which require both a magnitude, i. e., a length, and a direction to be
fully characterized. Examples of vectors in physics are velocity or force. Hence, a vector extends a
scalar, which defines a quantity fully described by its magnitude alone. From an algebraic point of
view, a vector, in an n-dimensional real space, is defined by an ordered list of n real scalars,
$x_1, x_2, \ldots, x_n$, arranged in an array.

Definition 12.1.1.

A vector is said to be a row vector if its associated array is arranged horizontally, i. e.,

$$(x_1, x_2, \ldots, x_n),$$

whereas a vector is called a column vector when its array is arranged vertically, i. e.,

$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.$$

Geometrically, a vector can be regarded as a displacement between two points in space, and it
is often denoted using a symbol surmounted by an arrow, e. g., $\vec{V}$.

Definition 12.1.2.

Let $\vec{V} = (x_1, x_2, \ldots, x_n)$ be an n-dimensional real vector. Then, the p-norm of $\vec{V}$, denoted
$\|\vec{V}\|_p$, is defined by the following quantity

$$\|\vec{V}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{\frac{1}{p}}, \tag{12.1}$$

where $|x_i|$ denotes the modulus or the absolute value of $x_i$.

In particular,
1. the 1-norm of the vector $\vec{V}$ is defined by

$$\|\vec{V}\|_1 = \sum_{i=1}^{n} |x_i|,$$

2. the 2-norm of the vector $\vec{V}$ is defined by

$$\|\vec{V}\|_2 = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{\frac{1}{2}},$$

3. when p = ∞, the p-norm, also called the maximum norm, is defined by

$$\|\vec{V}\|_\infty = \max_{i \in \{1, \ldots, n\}} |x_i|.$$

Definition 12.1.3.

Let $E^n$ denote an n-dimensional space. Let d be a function defined by

$$d : E^n \times E^n \longrightarrow R,$$

such that for any $x, y, z \in E^n$ the following relations hold:
1. d(x, y) ≥ 0 ;
2. d(x, y) = 0 ⟺ x = y ;
3. d(x, y) = d(y, x) ;
4. d(x, z) ≤ d(x, y) + d(y, z).

Such a function, d, is called a metric, and the pair $(E^n, d)$ is called a metric space.
When $E^n = R^n$ and $d(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}$, then the pair $(E^n, d) = (R^n, d)$ is called
an n-dimensional Euclidean space, also referred to as an n-dimensional Cartesian space.


From Definition →12.1.3, it is clear that R, the set of real numbers (or the real line), is a 1-
dimensional Euclidean space, whereas $R^2$, the real plane, is a 2-dimensional Euclidean space.

Remark 12.1.1.

The p-norm, with p = 2, of a vector in an n-dimensional Euclidean space is referred to as the
Euclidean norm.
Let $A = \begin{pmatrix} x_A \\ y_A \end{pmatrix}$ and $B = \begin{pmatrix} x_B \\ y_B \end{pmatrix}$ be two points in a 2-dimensional Euclidean space. Then, the vector
$\overrightarrow{AB} = \vec{V}$ is the displacement from the point A to the point B, which can be specified in a
Cartesian coordinate system by

$$\overrightarrow{AB} = \vec{V} = \begin{pmatrix} x_B - x_A \\ y_B - y_A \end{pmatrix}.$$

This is illustrated in →Figure 12.1 (a).

Figure 12.1 (a) Vector representation in a two-dimensional space. (b) Decomposition of a
standard vector in a 2-dimensional space.

Definition 12.1.4.

The magnitude of a vector $\overrightarrow{AB}$ is defined by the non-negative scalar given by its Euclidean norm,
denoted $\|\overrightarrow{AB}\|_2$ or simply $\|\overrightarrow{AB}\|$.

Specifically, the magnitude of a 2-dimensional vector $\overrightarrow{AB}$ is given by

$$\|\overrightarrow{AB}\| = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2}.$$

In applications, the Euclidean norm is sometimes also referred to as the Euclidean distance.
Using R, the norm of a vector can be computed as illustrated in Listing 12.1.
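
Listing 12.1 is not reproduced in this text. As a stand-in, the following minimal sketch shows one way the norms defined above might be computed in R; it is not the original listing. The helper name `p_norm` is an assumption made here, whereas `norm()` is the base R function, which expects a matrix argument.

```r
# p-norm of a numeric vector following equation (12.1);
# p = Inf gives the maximum norm
p_norm <- function(v, p = 2) {
  if (is.infinite(p)) return(max(abs(v)))
  sum(abs(v)^p)^(1 / p)
}

v <- c(3, -4)
print(p_norm(v, 1))    # 7
print(p_norm(v, 2))    # 5
print(p_norm(v, Inf))  # 4

# Base R's norm() works on matrices, so the vector is converted first;
# type = "F" (Frobenius) coincides with the Euclidean norm for a single column
print(norm(as.matrix(v), type = "F"))  # 5

# Magnitude of the vector AB for two points A and B
A <- c(1, 2)
B <- c(4, 6)
print(p_norm(B - A))   # 5
```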
Definition 12.1.5.

Two n-dimensional vectors $\vec{V}$ and $\vec{W}$ are said to be parallel if they have the same direction.

Definition 12.1.6.

Two n-dimensional vectors $\vec{V}$ and $\vec{W}$ are said to be equal if they have the same direction and the
same magnitude.
Various transformations and operations can be performed on vectors, and some of the most
important will be presented in the following sections.
Figure 12.2 An example where the Euclidean distance $\|x - x_i\|$ is used for the classifier k-NN. A
point x is assigned the label i based on a majority vote, considering its nearest k neighbors. In this
example, k = 4.

Example 12.1.1.

For supervised learning, k-NN (k nearest neighbors) [→96] is a simple yet efficient way to classify
data. Suppose that we have a high-dimensional data set with two classes, whose data points
represent vectors. Let x be a point that we wish to assign to one of these two classes. To predict
the class label of a point x, we calculate the Euclidean distance, introduced above (see Remark
→12.1.1), between x and all other points $x_i$, i. e., $d_i = \|x - x_i\|$. Then, we order these distances
$d_i$ in increasing order. The k-NN classifier now uses the nearest k distances to obtain a majority
vote for the prediction of the label for the point x. For instance, in →Figure 12.2, a two-
dimensional example is shown for k = 4. Among the four nearest neighbors of x are three red
points and one blue point. This means the predicted class label of x would be “red”. In the
extreme case k = 1 , the point x would be assigned to the class with the single nearest neighbor.
The k-NN method is an example of an instance-based learning algorithm. There are many
variations of the k-NN approach presented here, e. g., considering weighted voting to overcome
the limitations of majority voting in case of ties.
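
The k-NN procedure described in this example can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `knn_predict` and the toy data are invented here, and ties in the vote are resolved by taking the first label in alphabetical order.

```r
# Minimal k-NN classifier sketch using the Euclidean distance.
# train:  numeric matrix with one training point per row
# labels: class label of each training point
# x:      point to classify
knn_predict <- function(train, labels, x, k = 3) {
  # Euclidean distance between x and every row of train
  d <- sqrt(rowSums(sweep(train, 2, x)^2))
  nearest <- order(d)[seq_len(k)]       # indices of the k smallest distances
  votes <- table(labels[nearest])       # majority vote among the neighbors
  names(votes)[which.max(votes)]
}

# Hypothetical two-dimensional toy data with two classes
train <- rbind(c(1, 1), c(1, 2), c(2, 1), c(5, 5), c(6, 5), c(5, 6))
labels <- c("red", "red", "red", "blue", "blue", "blue")

print(knn_predict(train, labels, x = c(1.5, 1.5), k = 4))  # "red"
```

With k = 4, the four nearest neighbors of the query point are three red points and one blue point, so the majority vote yields "red", mirroring the situation described in the example.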
12.1.1.1 Vector translation


Let $\vec{V} = \overrightarrow{AB}$ denote the displacement between two points A and B. The same displacement $\vec{V}$,
starting from a point $A'$ to another point $B'$, defines a vector $\overrightarrow{A'B'}$.
The vector $\overrightarrow{A'B'}$ is called a translation of the vector $\overrightarrow{AB}$, and the two vectors $\overrightarrow{AB}$ and $\overrightarrow{A'B'}$ are
equal, when they have the same direction, and $\|\overrightarrow{AB}\| = \|\vec{V}\| = \|\overrightarrow{A'B'}\|$. Hence, the translation
of a vector $\overrightarrow{AB}$ is a transformation that maps a pair of points A and B to another pair of points $A'$
and $B'$, such that the following relations hold:
1. $\overrightarrow{AA'} = \overrightarrow{BB'}$;
2. $\overrightarrow{AA'}$ and $\overrightarrow{BB'}$ are parallel.
In a two-dimensional space, the vector $\overrightarrow{AB}$ and its translation $\overrightarrow{A'B'}$ form opposite sides of a
parallelogram, as illustrated in →Figure 12.3 (a). Thus,

$$A' = \begin{pmatrix} x_{A'} \\ y_{A'} \end{pmatrix} = \begin{pmatrix} x_A + k_x \\ y_A + k_y \end{pmatrix} \quad \text{and} \quad B' = \begin{pmatrix} x_{B'} \\ y_{B'} \end{pmatrix} = \begin{pmatrix} x_B + k_x \\ y_B + k_y \end{pmatrix},$$

where $k_x$ and $k_y$ are scalars.

Definition 12.1.7.


A vector $\vec{V} = \overrightarrow{AB}$ is called a standard vector if its initial point (i. e., the point A) coincides with
the origin of the coordinate system. Hence, using vector translation, any given vector can be
transformed into a standard vector, as illustrated in →Figure 12.3 (a).

12.1.1.2 Vector rotation

A vector transformation, which changes the direction of a vector while its initial point remains
unchanged, is called a rotation. This results in an angle between the original vector and its rotated
counterpart, called the rotation angle. Let $\vec{V} = (x_A, y_A)$ be a 2-dimensional vector, and
$\vec{V}' = (x_{A'}, y_{A'})$ its rotation by an angle θ (see →Figure 12.3 (b)). Then, the following properties
of the vector rotation should be noted:

$$\begin{aligned}
x_{A'} &= x_A \cos(\theta) - y_A \sin(\theta); \\
y_{A'} &= x_A \sin(\theta) + y_A \cos(\theta); \\
\|\vec{V}\| &= \|\vec{V}'\|.
\end{aligned} \tag{12.2}$$
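
The translation and rotation of a two-dimensional vector can be tried out with a short R sketch. The function name `rotate2d`, the sample points, and the shift `k` are assumptions chosen for illustration; the rotation matrix simply collects the coefficients of the two rotation equations above.

```r
# Rotation of a 2-dimensional vector by an angle theta (in radians).
# matrix() fills column by column, so R is [[cos, -sin], [sin, cos]].
rotate2d <- function(v, theta) {
  R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
  as.vector(R %*% v)
}

v <- c(1, 0)
v_rot <- rotate2d(v, pi / 2)
print(round(v_rot, 10))      # 0 1
print(sqrt(sum(v_rot^2)))    # 1, the magnitude is unchanged

# Translation: shifting both end points by the same amounts (k_x, k_y)
A <- c(1, 1)
B <- c(3, 2)
k <- c(2, -1)                # hypothetical shift (k_x, k_y)
A_shift <- A + k
B_shift <- B + k
print(B - A)                 # 2 1
print(B_shift - A_shift)     # 2 1, the same vector
```

The second part shows why a translated vector is regarded as equal to the original: its components, and hence its direction and magnitude, do not change.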
Various operations can be carried out on vectors, including the product of a vector by a scalar,
the sum, the difference, the scalar or dot product, the cross product, and the mixed product. In
the following sections, we will discuss such operations.

Figure 12.3 Vector transformation in a 2-dimensional space: (a) Translation of a vector. (b)
Rotation of a vector.

12.1.1.3 Vector scaling


Let $\vec{V} = (v_1, v_2, \ldots, v_n)$ be an n-dimensional vector, and let k be a scalar. Then, the product of k
with $\vec{V}$, denoted $k \times \vec{V}$, is a vector $\vec{U}$ defined as follows:

$$\vec{U} = (k \times v_1, k \times v_2, \ldots, k \times v_n).$$

Geometrically, the vector $\vec{U}$ is aligned with $\vec{V}$, but k times longer or shorter.

Definition 12.1.8.

If $\vec{V}$ is non-null (i. e., not all of its components are zero), then $\vec{V}$ and $\vec{U} = k\vec{V}$ are said to be
parallel if k > 0, and anti-parallel if k < 0. In the particular case where k = −1, then $\vec{V}$ and $\vec{U}$
are said to be opposite vectors (see →Figure 12.4 (b) for an illustration).
Figure 12.4 Vector transformation in a 2-dimensional space: (a) Orthogonal projection of a
vector. (b) Vector scaling.

Figure 12.5 Vector operations in a two-dimensional space: (a) Sum of two vectors. (b) Difference
between two vectors.

Definition 12.1.9.

For any scalar k, the vectors $\vec{V}$ and $\vec{U} = k \times \vec{V}$ are said to be collinear.
12.1.1.4 Vector sum

Let $\vec{V} = (v_1, v_2, \ldots, v_n)$ and $\vec{W} = (w_1, w_2, \ldots, w_n)$ be two n-dimensional vectors. Then, the
sum of $\vec{V}$ and $\vec{W}$, denoted $\vec{V} + \vec{W}$, is a vector $\vec{S}$ defined as follows:

$$\vec{S} = \vec{V} + \vec{W} = (v_1 + w_1, v_2 + w_2, \ldots, v_n + w_n).$$

In a two-dimensional space, the vector sum $\vec{S}$ can be obtained geometrically, as illustrated in
→Figure 12.5 (a); i. e., we translate the vector $\vec{W}$ until its initial point coincides with the terminal
point of $\vec{V}$. Since translation does not change a vector, the translated vector is identical to $\vec{W}$.
Then, the vector $\vec{S}$ is given by the displacement from the initial point of $\vec{V}$ to the terminal point
of the translation of $\vec{W}$. Note that the sum of vectors is commutative, i. e.,

$$\vec{V} + \vec{W} = \vec{W} + \vec{V}.$$

This means that if $\vec{V}$ has been translated instead of $\vec{W}$, the result will be the same sum
vector $\vec{S}$. This is illustrated in →Figure 12.5 (a).

12.1.1.5 Vector difference

Let $\vec{V} = (v_1, v_2, \ldots, v_n)$ and $\vec{W} = (w_1, w_2, \ldots, w_n)$ be two n-dimensional vectors. Then, the
difference between $\vec{V}$ and $\vec{W}$, denoted $\vec{V} - \vec{W}$, is a vector $\vec{D}$ defined by the sum of $\vec{V}$
and the opposite of $\vec{W}$, i. e.,

$$\vec{D} = \vec{V} + (-\vec{W}) = (v_1 - w_1, v_2 - w_2, \ldots, v_n - w_n).$$

This is illustrated geometrically in a 2-dimensional space in →Figure 12.5 (b). Note that, in contrast
with the sum, the difference between two vectors is not commutative, i. e., $\vec{V} - \vec{W} \neq \vec{W} - \vec{V}$.
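
Since R applies arithmetic operators to vectors element by element, scaling, sum, and difference map directly onto `*`, `+`, and `-`. The following sketch uses sample vectors chosen here for illustration.

```r
v <- c(2, 1)
w <- c(1, 3)

# Scaling: k = -1 yields the opposite vector
k <- -1
u <- k * v
print(u)      # -2 -1

# Sum and difference are computed component by component
s <- v + w
d <- v - w
print(s)      # 3 4
print(d)      # 1 -2

# The sum is commutative, the difference is not
print(all(v + w == w + v))  # TRUE
print(all(v - w == w - v))  # FALSE
```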

12.1.1.6 Vector decomposition

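It is often convenient to decompose a vector $\vec{V}$ into the vector components $\vec{V}_{\parallel}$ and $\vec{V}_{\perp}$, which are respectively parallel and perpendicular to the direction of another vector $\vec{W}$, and such that
\[ \vec{V} = \vec{V}_{\parallel} + \vec{V}_{\perp}. \]
In this case, the vector components of $\vec{V}$ are given by
\[ \vec{V}_{\parallel} = \frac{\vec{V} \cdot \vec{W}}{\vec{W} \cdot \vec{W}} \, \vec{W}, \qquad \vec{V}_{\perp} = \vec{V} - \vec{V}_{\parallel}. \]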


In a two-dimensional orthonormal space, the standard components of a vector $\vec{V} = \overrightarrow{OA}$, where $O$ denotes the origin of the coordinate system, with respect to the x-axis and the y-axis, are simply the coordinates of the point $A$, i.e.,
\[ \vec{V} = \overrightarrow{OA} = \begin{pmatrix} x_A - x_O \\ y_A - y_O \end{pmatrix} = \begin{pmatrix} x_A - 0 \\ y_A - 0 \end{pmatrix} = \begin{pmatrix} x_A \\ y_A \end{pmatrix}. \]


If $\alpha$ denotes the angle between the vector $\vec{V}$ and the x-axis, as illustrated in →Figure 12.1 (b), then we have the following relationships:
\[ x_A = \cos(\alpha) \times \|\vec{V}\|, \qquad y_A = \sin(\alpha) \times \|\vec{V}\|, \qquad \frac{y_A}{x_A} = \tan(\alpha). \tag{12.3} \]

12.1.1.7 Vector projection

The projection of an n-dimensional vector $\vec{V}$ onto the direction of a vector $\vec{W}$, in an m-dimensional space, is a transformation that maps the terminal point of the vector $\vec{V}$ to a point in the space associated with the direction of $\vec{W}$. This results in a vector $\vec{P}$ that is collinear to $\vec{W}$.

Definition 12.1.10.

Let $\theta$ denote the angle between $\vec{V}$ and $\vec{W}$. If the magnitude of the vector $\vec{P}$ is given by
\[ \|\vec{P}\| = \cos(\theta) \times \|\vec{V}\|, \]
then the projection of $\vec{V}$ onto the direction of $\vec{W}$ is said to be orthogonal.

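Clearly, in a two-dimensional space, as depicted in →Figure 12.4 (a), the vector $\vec{P}$, the orthogonal projection of $\vec{V}$ onto the direction of $\vec{W}$, is nothing but the vector component of $\vec{V}$ parallel to $\vec{W}$. Thus,
\[ \vec{P} = \begin{pmatrix} x_P \\ y_P \end{pmatrix} = \frac{\vec{V} \cdot \vec{W}}{\vec{W} \cdot \vec{W}} \begin{pmatrix} x_W \\ y_W \end{pmatrix}. \]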

12.1.1.8 Vector reflection

The reflection of an n-dimensional vector $\vec{V}$ with respect to the direction of a vector $\vec{W}$, in an m-dimensional space, is a transformation that maps the vector $\vec{V}$ to an n-dimensional vector $\vec{U}$, such that (see →Figure 12.6)
\[ \vec{U} = \vec{V} - 2 \, \frac{\vec{V} \cdot \vec{W}}{\|\vec{W}\|^2} \, \vec{W}. \]


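In a two-dimensional orthonormal space, the components of the vector $\vec{U}$ are given by
\[ \vec{U} = \begin{pmatrix} x_U \\ y_U \end{pmatrix} = \begin{pmatrix} x_V \\ y_V \end{pmatrix} - k \begin{pmatrix} x_W \\ y_W \end{pmatrix}, \quad \text{with } k = 2 \, \frac{\vec{V} \cdot \vec{W}}{\|\vec{W}\|^2}. \]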
Figure 12.6 Vector reflection in a two-dimensional space.

12.1.1.9 Dot product or scalar product

Let $\vec{V} = (v_1, v_2, \ldots, v_n)$ and $\vec{W} = (w_1, w_2, \ldots, w_n)$ be two n-dimensional vectors. The dot product, also called the scalar product, of $\vec{V}$ and $\vec{W}$, denoted $\vec{V} \cdot \vec{W}$, is a scalar $p$, defined as follows:
\[ p = \vec{V} \cdot \vec{W} = v_1 w_1 + v_2 w_2 + \cdots + v_n w_n. \]

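Geometrically, the dot product can be defined through the orthogonal projection of one vector onto another. Let $\alpha$ be the angle between two vectors $\vec{V}$ and $\vec{W}$. Then,
\[ \vec{V} \cdot \vec{W} = \cos(\alpha) \times \|\vec{V}\| \times \|\vec{W}\|, \quad \text{with } \cos(\alpha) = \frac{\vec{V} \cdot \vec{W}}{\|\vec{V}\| \times \|\vec{W}\|}. \]
In →Figure 12.4 (a), the norm of the projected vector $\vec{P}$ can be interpreted as the absolute value of the dot product between $\vec{V}$ and the unit vector in the direction of $\vec{W}$, i.e., $\|\vec{P}\| = |\vec{V} \cdot \vec{W}| / \|\vec{W}\|$.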

Definition 12.1.11.

When the angle $\alpha$ between two vectors, $\vec{V}$ and $\vec{W}$, is $\frac{\pi}{2} + k\pi$, where $k$ is an integer, then the two vectors are said to be perpendicular or orthogonal to each other, and their dot product is given by
\[ \cos\left(\frac{\pi}{2} + k\pi\right) \times \|\vec{V}\| \times \|\vec{W}\| = 0 \times \|\vec{V}\| \times \|\vec{W}\| = 0. \]

Furthermore, the dot product has the following properties:


1. For any vector $\vec{V}$: $\vec{V} \cdot \vec{V} = \|\vec{V}\|^2$;
2. For any two vectors $\vec{V}$ and $\vec{W}$: $\vec{V} \cdot \vec{W} = \vec{W} \cdot \vec{V}$;
3. For any two vectors $\vec{V}$ and $\vec{W}$: $(\vec{V} \cdot \vec{W})^2 \leq (\vec{V} \cdot \vec{V})(\vec{W} \cdot \vec{W})$. (This is referred to as the Cauchy–Schwarz inequality [→27]);
4. For any two vectors $\vec{V}$ and $\vec{W}$ and a scalar $k$: $k \times (\vec{V} \cdot \vec{W}) = (k \times \vec{V}) \cdot \vec{W}$;
5. For any three vectors $\vec{U}$, $\vec{V}$, and $\vec{W}$: $(\vec{U} + \vec{V}) \cdot \vec{W} = (\vec{U} \cdot \vec{W}) + (\vec{V} \cdot \vec{W})$.

12.1.1.10 Cross product

The cross product is applicable to vectors in an n-dimensional space, with n ≥ 3 . To illustrate this,
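let $\vec{V}$ and $\vec{W}$ be two three-dimensional standard vectors defined as follows:
\[ \vec{V} = \overrightarrow{OA} = \begin{pmatrix} x_A \\ y_A \\ z_A \end{pmatrix} \quad \text{and} \quad \vec{W} = \overrightarrow{OB} = \begin{pmatrix} x_B \\ y_B \\ z_B \end{pmatrix}. \]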

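Then, the cross product of the vector $\vec{V}$ by the vector $\vec{W}$, denoted $\vec{V} \times \vec{W}$, is a vector $\vec{C}$ perpendicular to both $\vec{V}$ and $\vec{W}$, defined by
\[ \vec{C} = \begin{pmatrix} x_C \\ y_C \\ z_C \end{pmatrix} = \begin{pmatrix} y_A \times z_B - y_B \times z_A \\ -x_A \times z_B + x_B \times z_A \\ x_A \times y_B - x_B \times y_A \end{pmatrix}, \]
or
\[ \vec{C} = \|\vec{V}\| \times \|\vec{W}\| \times \sin(\theta) \times \vec{u}, \]
where $\vec{u}$ is the unit vector$^{1}$ normal to both $\vec{V}$ and $\vec{W}$, and $\theta$ is the angle between $\vec{V}$ and $\vec{W}$. Thus,
\[ \|\vec{C}\| = \|\vec{V}\| \times \|\vec{W}\| \times \sin(\theta) = \mathcal{A}, \]
where $\mathcal{A}$ denotes the area of the parallelogram spanned by $\vec{V}$ and $\vec{W}$, as illustrated in →Figure 12.7.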

Figure 12.7 Cross product of two vectors in a three-dimensional space.

The cross product has the following properties:


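1. For any vector $\vec{V}$, we have: $\vec{V} \times \vec{V} = \vec{0}$;
2. For any two vectors $\vec{V}$ and $\vec{W}$, we have: $\vec{V} \times \vec{W} = -(\vec{W} \times \vec{V})$;
3. For any two vectors $\vec{V}$ and $\vec{W}$ and a scalar $k$, we have: $(k \times \vec{V}) \times \vec{W} = \vec{V} \times (k \times \vec{W}) = k \times (\vec{V} \times \vec{W})$;
4. For any three vectors $\vec{U}$, $\vec{V}$, and $\vec{W}$, we have: $\vec{V} \times (\vec{U} + \vec{W}) = \vec{V} \times \vec{U} + \vec{V} \times \vec{W}$;
5. In general, the cross product is not associative: $\vec{V} \times (\vec{U} \times \vec{W}) \neq (\vec{V} \times \vec{U}) \times \vec{W}$.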

12.1.1.11 Mixed product

This is an operation on vectors, which involves both a cross and a scalar product. To illustrate this,
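let $\vec{V}$, $\vec{U}$, and $\vec{W}$ denote three three-dimensional vectors. Then, the mixed product between $\vec{V}$, $\vec{U}$, and $\vec{W}$ is a scalar $p$, defined by
\[
\begin{aligned}
p &= (\vec{V} \times \vec{U}) \cdot \vec{W} = (\vec{U} \times \vec{W}) \cdot \vec{V} = (\vec{W} \times \vec{V}) \cdot \vec{U} \\
  &= \vec{V} \cdot (\vec{U} \times \vec{W}) = \vec{U} \cdot (\vec{W} \times \vec{V}) = \vec{W} \cdot (\vec{V} \times \vec{U}) \\
  &= \pm \mathcal{V},
\end{aligned}
\tag{12.4}
\]
where $\mathcal{V}$ denotes the volume of the parallelepiped spanned by $\vec{V}$, $\vec{U}$, and $\vec{W}$.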
In R, the above operations can be carried out using the scripts in Listing 12.2.
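The listing itself is not reproduced here, so the following is only a minimal sketch of how the operations of this section might be written in base R. The helper names `dot`, `cross`, `proj`, `reflect`, and `mixed`, as well as the example vectors, are assumptions made for illustration and are not taken from the book.

```r
# Example vectors (values chosen for illustration only)
V <- c(1, 2, 3)
W <- c(4, 5, 6)

# Dot product: sum of the element-wise products
dot <- function(a, b) sum(a * b)

# Cross product of two three-dimensional vectors
cross <- function(a, b) {
  c(a[2] * b[3] - a[3] * b[2],
    a[3] * b[1] - a[1] * b[3],
    a[1] * b[2] - a[2] * b[1])
}

# Orthogonal projection of a onto the direction of b (the parallel component)
proj <- function(a, b) (dot(a, b) / dot(b, b)) * b

# Reflection, using the formula of Section 12.1.1.8
reflect <- function(a, b) a - 2 * (dot(a, b) / dot(b, b)) * b

# Mixed product (u x v) . w
mixed <- function(u, v, w) dot(cross(u, v), w)

V + W            # vector sum
V - W            # vector difference
dot(V, W)        # 32
cross(V, W)      # -3  6 -3
proj(V, W)       # component of V parallel to W
V - proj(V, W)   # component of V perpendicular to W
reflect(V, W)
mixed(V, W, c(1, 0, 0))
```

The cross product returned by `cross` is perpendicular to both arguments, which can be checked by verifying that `dot(cross(V, W), V)` and `dot(cross(V, W), W)` are zero.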
12.1.2 Vector representations in other coordinates systems
For various problems, the quantities characterized by vectors must be described in different
coordinate systems [→27]. Depending on the dimension of its space, a vector can be represented

in different ways. For instance, in a two-dimensional space, a standard vector $\vec{V} = \overrightarrow{OA}$, where $O$ denotes the origin point, can be specified either by:
1. The pair $(x_A, y_A)$, where $x_A$ and $y_A$ denote the coordinates of the point $A$, the terminal point of $\vec{V}$, in a two-dimensional Euclidean space. The pair $(x_A, y_A)$ defines the representation of the vector $\vec{V}$ in cartesian coordinates.
2. The pair $(r, \theta)$, where $r = \|\vec{V}\|$ is the magnitude of $\vec{V}$ and $\theta$ is the angle between the vector $\vec{V}$ and a reference axis in a cartesian system, e.g., the x-axis. The pair $(r, \theta)$ defines the representation of the vector $\vec{V}$ in polar coordinates.
The polar coordinates can be recovered from cartesian coordinates, and vice versa. Let
\[ \vec{V} = \overrightarrow{OA} = \begin{pmatrix} x_A \\ y_A \end{pmatrix} \]
be a standard vector in a two-dimensional cartesian space, as depicted in →Figure 12.8. Then, the polar coordinates of $\vec{V}$ can be obtained as follows:
\[ r = \sqrt{x_A^2 + y_A^2}, \qquad \theta = \tan^{-1}\left(\frac{y_A}{x_A}\right). \tag{12.5} \]
Conversely, the cartesian coordinates can be recovered as follows:
\[ x_A = r \cos(\theta), \qquad y_A = r \sin(\theta). \tag{12.6} \]
Figure 12.8 Representation of a 2-dimensional vector in a polar coordinates system.

In R, the above coordinate transformations can be carried out using the commands in Listing
12.3.
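Since the listing is not reproduced here, the following is a minimal sketch of the transformations (12.5) and (12.6). The function names `to_polar` and `to_cartesian` are hypothetical. Note one deliberate substitution: the sketch uses `atan2(y, x)` rather than `atan(y / x)`, because `atan2` also handles $x_A = 0$ and returns the angle in the correct quadrant.

```r
# Cartesian -> polar, following (12.5); atan2 is used in place of atan(y / x)
to_polar <- function(x, y) {
  c(r = sqrt(x^2 + y^2), theta = atan2(y, x))
}

# Polar -> cartesian, following (12.6)
to_cartesian <- function(r, theta) {
  c(x = r * cos(theta), y = r * sin(theta))
}

p <- to_polar(1, 1)
p                                      # r = sqrt(2), theta = pi / 4
to_cartesian(p[["r"]], p[["theta"]])   # recovers (1, 1) up to rounding
```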

In a three-dimensional space, a standard vector $\vec{V} = \overrightarrow{OA}$, where $O$ denotes the origin point, can be specified by one of the following:
1. The triplet $(x_A, y_A, z_A)$, where $x_A$, $y_A$ and $z_A$ denote the coordinates of the point $A$, the terminal point of $\vec{V}$, in a three-dimensional Euclidean space. The triplet $(x_A, y_A, z_A)$ defines the representation of the vector $\vec{V}$ in cartesian coordinates (see →Figure 12.9 (a) for illustration).
2. The triplet $(\rho, \theta, z_A)$, where $\rho$ is the magnitude of the projection of $\vec{V}$ on the x-y plane, $\theta$ is the angle between the projection of the vector $\vec{V}$ on the x-y plane and the x-axis, and $z_A$ is the third coordinate of $A$ in a cartesian system. The triplet $(\rho, \theta, z_A)$ defines the representation of the vector $\vec{V}$ in cylindrical coordinates (see →Figure 12.9 (b) for illustration).
3. The triplet $(r, \theta, \varphi)$, where $r = \|\vec{V}\|$ is the magnitude of $\vec{V}$, $\theta$ is the angle between the projection of the vector $\vec{V}$ on the x-y plane and the x-axis, and $\varphi$ is the angle between the vector $\vec{V}$ and the z-axis, so that $z_A = r \cos(\varphi)$. The triplet $(r, \theta, \varphi)$ defines the representation of the vector $\vec{V}$ in spherical coordinates (see →Figure 12.9 (c) for illustration).
Figure 12.9 Representation of a point in a three-dimensional space in different coordinate systems: (a) cartesian coordinate system; (b) cylindrical coordinate system; (c) spherical coordinate system.
Mutual relationships exist between cartesian, cylindrical and spherical coordinates. Let
\[ \vec{V} = \overrightarrow{OA} = \begin{pmatrix} x_A \\ y_A \\ z_A \end{pmatrix} \]
be a standard vector in a three-dimensional cartesian space. Then we have the following relationships between the different coordinate systems:
1. The cylindrical coordinates of $\vec{V}$ can be obtained as follows:
\[ \rho = \sqrt{x_A^2 + y_A^2}, \qquad \theta = \tan^{-1}\left(\frac{y_A}{x_A}\right), \qquad z_A = z_A. \tag{12.7} \]
Conversely, the cartesian coordinates can be recovered as follows:
\[ x_A = \rho \cos(\theta), \qquad y_A = \rho \sin(\theta), \qquad z_A = z_A. \tag{12.8} \]
2. The spherical coordinates of $\vec{V}$ can be obtained as follows:
\[ r = \sqrt{x_A^2 + y_A^2 + z_A^2}, \qquad \theta = \tan^{-1}\left(\frac{y_A}{x_A}\right), \qquad \varphi = \cos^{-1}\left(\frac{z_A}{r}\right). \tag{12.9} \]
Conversely, the cartesian coordinates can be recovered as follows:
\[ x_A = r \sin(\varphi) \cos(\theta), \qquad y_A = r \sin(\varphi) \sin(\theta), \qquad z_A = r \cos(\varphi). \tag{12.10} \]
Relationships between cylindrical and spherical coordinates also exist. From cylindrical coordinates, spherical coordinates can be obtained as follows:
\[ r = \sqrt{\rho^2 + z_A^2}, \qquad \theta = \theta, \qquad \varphi = \tan^{-1}\left(\frac{\rho}{z_A}\right). \tag{12.11} \]
Conversely, the cylindrical coordinates can be recovered as follows:
\[ \rho = \sqrt{r^2 - z_A^2}, \qquad \theta = \theta, \qquad z_A = r \cos(\varphi). \tag{12.12} \]

In R, the above coordinate system transformations can be carried out using the scripts in
Listing 12.4.
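As the listing is not reproduced here, the following is a minimal sketch of the cartesian, cylindrical and spherical transformations given above. All function names are hypothetical. As an assumption of this sketch, `atan2(y, x)` replaces $\tan^{-1}(y_A / x_A)$ so that the angle $\theta$ lands in the correct quadrant, and the null vector ($r = 0$) is not handled.

```r
# Cartesian -> cylindrical and back
cart_to_cyl <- function(x, y, z) {
  c(rho = sqrt(x^2 + y^2), theta = atan2(y, x), z = z)
}
cyl_to_cart <- function(rho, theta, z) {
  c(x = rho * cos(theta), y = rho * sin(theta), z = z)
}

# Cartesian -> spherical and back (phi is measured from the z-axis)
cart_to_sph <- function(x, y, z) {
  r <- sqrt(x^2 + y^2 + z^2)
  c(r = r, theta = atan2(y, x), phi = acos(z / r))
}
sph_to_cart <- function(r, theta, phi) {
  c(x = r * sin(phi) * cos(theta),
    y = r * sin(phi) * sin(theta),
    z = r * cos(phi))
}

# Cylindrical -> spherical
cyl_to_sph <- function(rho, theta, z) {
  c(r = sqrt(rho^2 + z^2), theta = theta, phi = atan2(rho, z))
}

cyl <- cart_to_cyl(1, 2, 3)
sph <- cart_to_sph(1, 2, 3)
cyl_to_cart(cyl[["rho"]], cyl[["theta"]], cyl[["z"]])   # recovers (1, 2, 3)
sph_to_cart(sph[["r"]], sph[["theta"]], sph[["phi"]])   # recovers (1, 2, 3)
```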

Example 12.1.2.

Classification methods are used extensively in data science [→41], [→64]. An important
classification technique for high-dimensional data is referred to as support vector machine (SVM)
classification [→41], [→64] (see also Section →18.5.2).
For high-dimensional data, the problem of interest is to classify the (labeled) data by determining a separating hyperplane. When using linear classifiers, it is necessary to construct a hyperplane for optimal separation of the data points. To this end, it is necessary to determine the distance between a point representing a vector and the hyperplane.
Let $H : \delta_1 \cdot x + \delta_2 \cdot y + \delta_3 \cdot z - a = 0$ be a three-dimensional hyperplane, and let
\[ \delta = \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \end{pmatrix} \]
be a three-dimensional point with
\[ \|\delta\| = \sqrt{(\delta_1)^2 + (\delta_2)^2 + (\delta_3)^2}. \tag{12.13} \]
The distance between $\delta$ and $H$ is given by
\[ d_{a,H} = \frac{\delta_1 \cdot x + \delta_2 \cdot y + \delta_3 \cdot z - a}{\|\delta\|}. \tag{12.14} \]
A two-dimensional hyperplane is shown in →Figure 12.10 to illustrate the SVM concept for a two-class classification problem. The data points represented by rectangles and circles represent the two classes, respectively. →Figure 12.10 illustrates a case of a two-class problem, where a linear classifier, represented by a hyperplane, can be used to separate the two classes. The optimal hyperplane is the one whose distance from the points representing the support vectors (SV) is maximal.
Figure 12.10 Constructing a two-dimensional hyperplane for SVM-classification.
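The distance formula (12.14) can be sketched in R as follows. This is an illustration under a stated assumption: the sketch reads $(x, y, z)$ as the coordinates of the data point and $\delta$ as the coefficient (normal) vector of the hyperplane. The function name `dist_to_hyperplane` and the numerical values are hypothetical.

```r
# Signed distance of a point from the hyperplane delta . point - a = 0,
# following (12.14); the sign tells on which side of H the point lies
dist_to_hyperplane <- function(point, delta, a) {
  (sum(delta * point) - a) / sqrt(sum(delta^2))
}

# Hyperplane z = 1 (delta = (0, 0, 1), a = 1) and the point (1, 2, 3)
dist_to_hyperplane(c(1, 2, 3), delta = c(0, 0, 1), a = 1)   # 2

# A point on the other side of the hyperplane yields a negative value
dist_to_hyperplane(c(1, 2, 0), delta = c(0, 0, 1), a = 1)   # -1
```

Taking the absolute value of the result gives the unsigned geometric distance.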

12.1.2.1 Complex vectors

A complex number is a number of the form $x + iy$, where $x$ and $y$ are real numbers and $i$ is the imaginary unit, such that $i = \sqrt{-1}$, see [→158]. Specifically, the number $x$ is called the real part of the complex number $x + iy$, whereas the part $iy$ is called the imaginary part. The set of complex numbers is commonly denoted by $\mathbb{C}$; $\mathbb{R}$ is a subset of $\mathbb{C}$, since any real number can be viewed as a complex number for which the imaginary part is zero, i.e., $y = 0$. Any complex number $z = x_z + iy_z$ can be represented by the pair of reals $(x_z, y_z)$; thus, a complex number can be viewed as a particular two-dimensional standard real vector. Let $\theta_z$ denote the angle between the vector $z = (x_z, y_z)$ and the x-axis. Then, using the vector decomposition in a two-dimensional space, we have
\[ x_z = r_z \cos(\theta_z), \qquad y_z = r_z \sin(\theta_z), \qquad \text{where } r_z = \sqrt{x_z^2 + y_z^2}. \tag{12.15} \]
The number $r_z$ is called the modulus or the absolute value of $z$, whereas $\theta_z$ is called the argument of $z$.
From (→12.15), we can deduce the following alternative description of a complex number $z = x_z + iy_z$:
\[
\begin{aligned}
z &= x_z + iy_z \\
  &= r_z \cos(\theta_z) + i r_z \sin(\theta_z) \\
  &= r_z [\cos(\theta_z) + i \sin(\theta_z)] = r_z e^{i\theta_z}.
\end{aligned}
\tag{12.16}
\]
This representation of a complex number is referred to as the Euler formula [→158].
A complex number $z = x_z + iy_z$ can be represented either by the pair $(x_z, y_z)$ or the pair $(r_z, \theta_z)$, as illustrated in →Figure 12.11.
Figure 12.11 Vector representation of a complex number.

Let $z = x_z + iy_z$ and $w = x_w + iy_w$ be two complex numbers, and let $n$ be an integer. Then, the following elementary operations can be performed on these complex numbers [→158]:
Complex conjugate: The complex number $\bar{z} = x_z - iy_z$ is called the complex conjugate of $z$.
Power of a complex number: $z^n = r_z^n [\cos(n\theta_z) + i \sin(n\theta_z)]$.
Complex addition: $z + w = (x_z + iy_z) + (x_w + iy_w) = (x_z + x_w) + i(y_z + y_w)$.
Complex subtraction: $z - w = (x_z + iy_z) - (x_w + iy_w) = (x_z - x_w) + i(y_z - y_w)$.
Complex multiplication: $z \times w = (x_z + iy_z) \times (x_w + iy_w) = (x_z x_w - y_z y_w) + i(x_z y_w + y_z x_w)$.
Complex division: $\dfrac{z}{w} = \dfrac{x_z + iy_z}{x_w + iy_w} = \dfrac{(x_z x_w + y_z y_w) + i(y_z x_w - x_z y_w)}{x_w^2 + y_w^2}$.
Complex exponentiation: $z^w = (x_z + iy_z)^{x_w + iy_w} = (x_z^2 + y_z^2)^{\frac{x_w + iy_w}{2}} \, e^{i\theta_z (x_w + iy_w)}$, where $\theta_z$ is the argument of $z$ and $x_z^2 + y_z^2 = r_z^2$ is the square of the complex modulus of $z$.
In R, the above basic operations on complex numbers can be performed using the script in Listing
12.5.
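The listing is not reproduced here. As a minimal sketch, the elementary operations above map onto base R's built-in complex type as follows; the example values of `z` and `w` are chosen for illustration only.

```r
# Two complex numbers (illustrative values)
z <- complex(real = 3, imaginary = 4)
w <- 1 - 2i

Re(z); Im(z)   # real and imaginary parts: 3 and 4
Mod(z)         # modulus r_z: 5
Arg(z)         # argument theta_z, in radians
Conj(z)        # complex conjugate: 3 - 4i

z + w          # addition
z - w          # subtraction
z * w          # multiplication: 11 - 2i
z / w          # division
z^2            # integer power
z^w            # complex exponentiation (principal value)

# Polar (Euler) form r_z * exp(i * theta_z) recovers z
Mod(z) * exp(1i * Arg(z))
```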


An n-dimensional complex vector is a vector of the form $\vec{V} = (x_1, x_2, \ldots, x_n)$, whose components $x_1, x_2, \ldots, x_n$ can be complex numbers. The concepts of vector operations and vector transformations, previously introduced in relation to real vectors, can be generalized to complex vectors using the elementary operations on complex numbers.

12.1.3 Matrices
In the two foregoing sections, we have presented some basic concepts of vector analysis. In this
section, we will discuss a generalization of vectors also known as matrices.
Let $m$ and $n$ be two positive integers. We call $A$ an $m \times n$ real matrix if it consists of an ordered set of $m$ vectors in an n-dimensional space. In other words, $A$ is defined by a set of $m \times n$ scalars $a_{ij} \in \mathbb{R}$, with $i = 1, \ldots, m$ and $j = 1, \ldots, n$, represented in the following rectangular array:
\[ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}. \tag{12.17} \]
The matrix $A$, defined by (→12.17), has $m$ rows and $n$ columns. For any entry $a_{ij}$, with $i = 1, \ldots, m$ and $j = 1, \ldots, n$, of the matrix $A$, the index $i$ is called the row index, whereas the index $j$ is the column index. The set of entries $(a_{i1}, a_{i2}, \ldots, a_{in})$ is called the $i$th row of $A$, and the set $(a_{1j}, a_{2j}, \ldots, a_{mj})$ is the $j$th column of $A$. When $n = m$, $A$ is called a square matrix, and the set of entries $(a_{11}, a_{22}, \ldots, a_{nn})$ is called its main diagonal.
In the case $m = 1$ and $n > 1$, the matrix $A$ is reduced to one row, and it is called a row vector; likewise, when $m > 1$ and $n = 1$, the matrix $A$ is reduced to a single column, and it is called a column vector. In the case $m = n = 1$, the matrix $A$ is reduced to a single value, i.e., a real scalar. An $m \times n$ matrix $A$ can be viewed as a list of $m$ n-dimensional row vectors or a list of $n$ m-dimensional column vectors. If the entries $a_{ij}$ are complex numbers, then $A$ is called a complex matrix.

The R programming environment provides a wide range of functions for matrix manipulation
and matrix operations. Several of these are illustrated in the three listings below.
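The listings are not reproduced here. The following minimal sketch shows a few base R functions for creating and inspecting matrices; the matrix entries are arbitrary illustrative values.

```r
# A 2 x 3 matrix filled row by row
A <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
A

dim(A)      # number of rows and columns: 2 3
A[1, ]      # first row
A[, 2]      # second column
A[2, 3]     # entry a_23: 6

# Building a matrix from row vectors or from column vectors
rbind(c(1, 2, 3), c(4, 5, 6))
cbind(c(1, 4), c(2, 5), c(3, 6))

# Main diagonal of a square matrix, and the 3 x 3 identity matrix
S <- matrix(1:9, nrow = 3, byrow = TRUE)
diag(S)     # 1 5 9
diag(3)
```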
Example 12.1.3.

In this example, we demonstrate the utility of computing powers of adjacency matrices of networks, see Definition →16.2.2 in Section →16.2.1. They can be used to determine the number of walks with a certain length from $v_i$ to $v_j$, where $v_i$ and $v_j$ are two vertices in $V$, and $G = (V, E)$ is a connected network. The application of Definition →16.2.2 in Section →16.2.1 to the graph shown in →Figure 12.12 yields the following matrix:
\[ A(G) = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}. \tag{12.18} \]
If we square this matrix, we obtain
\[ A^2(G) = A(G) \, A(G). \tag{12.19} \]
The power of the adjacency matrix (→12.18) (here 2) gives the length of the walk. The entry $a_{ij}$ of $A^2(G)$ gives the number of walks of length 2 from $v_i$ to $v_j$. For instance, $a_{11} = 2$ means there exist two walks of length 2 from vertex 1 to vertex 1. Moreover, $a_{14} = 1$ means there exists only one walk of length 2 from vertex 1 to vertex 4. These numbers can be understood by inspection of the network, $G$, shown in →Figure 12.12.
Figure 12.12 Walks in an example graph G for $A^2(G)$.

12.2 Operations with matrices

Let $A$ be an $m \times n$ matrix. Then, $A^T$, the transpose of $A$, is the matrix obtained by interchanging the rows and columns of $A$. Consequently, $A^T$ is an $n \times m$ matrix. A matrix $A$ is called symmetric if $A = A^T$. The transposition may be regarded as a reflection of the matrix across its main diagonal. Using R, the transposition of a matrix $A$ can be performed as follows:
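The listing is not reproduced here. A minimal sketch using the base R function `t()` is given below; the example matrices are illustrative.

```r
# A 2 x 3 matrix and its 3 x 2 transpose
A <- matrix(1:6, nrow = 2, byrow = TRUE)
At <- t(A)
dim(At)            # 3 2

# Transposing twice returns the original matrix
identical(t(At), A)

# A symmetric matrix equals its own transpose
S <- matrix(c(2, 1, 1, 3), nrow = 2)
isSymmetric(S)     # TRUE
```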
Let A and B be two m × n real matrices. Some mathematical operations, which require the
matrices to have the same dimensions (i. e., having the same number of rows and the same
number of columns) include:
Matrix addition: The sum of the matrices $A$ and $B$ is an $m \times n$ real matrix, $C$, whose entries are $c_{ij} = a_{ij} + b_{ij}$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$. In R, this operation is carried out using the command C <- A+B.
Matrix subtraction: The difference $A - B$ is an $m \times n$ real matrix, $C$, whose entries are $c_{ij} = a_{ij} - b_{ij}$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$. In R, this operation is carried out using the command C <- A-B.
Matrix element-wise multiplication: The element-wise product of the matrices $A$ and $B$ is an $m \times n$ real matrix, $C$, whose entries are $c_{ij} = a_{ij} \times b_{ij}$ for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. In R, this operation is performed using the command C <- A*B.
Matrix comparison: The matrix $A$ is equal to $B$ if $a_{ij} = b_{ij}$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$.
An important operation on matrices is matrix multiplication. Let $A$ and $B$ be two matrices; the matrix multiplication, $A \times B$, requires the number of columns of the matrix $A$ to be equal to the number of rows of the matrix $B$, i.e., if $A$ is an $m \times n$ real matrix, then $B$ must be an $n \times l$ real matrix. The result of this operation is an $m \times l$ real matrix, $C$, whose entries $c_{ik}$ for $i = 1, \ldots, m$, $k = 1, \ldots, l$, are obtained as follows:
\[ c_{ik} = \sum_{j=1}^{n} a_{ij} \times b_{jk}. \tag{12.20} \]
When $m = 1$, the result is a product between a row vector and a matrix.
Note that, even if both products $A \times B$ and $B \times A$ are defined, i.e., if $l = m$, $A \times B$ generally differs from $B \times A$.

Definition 12.2.1.
Let $A$ be an $n \times n$ matrix, and let $0_n$ denote the n-dimensional null vector.
$A$ is said to be positive definite if and only if $x^T A x > 0$, for all $x \in \mathbb{R}^n$, $x \neq 0_n$.
$A$ is said to be positive semidefinite if and only if $x^T A x \geq 0$, for all $x \in \mathbb{R}^n$.
$A$ is said to be negative definite if and only if $x^T A x < 0$, for all $x \in \mathbb{R}^n$, $x \neq 0_n$.
$A$ is said to be negative semidefinite if and only if $x^T A x \leq 0$, for all $x \in \mathbb{R}^n$.
$A$ is said to be indefinite if and only if there exist $x$ and $y \in \mathbb{R}^n$ such that $x^T A x > 0$ and $y^T A y < 0$.

Using R, matrix multiplications can be carried out as follows:
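The listing is not reproduced here. The following minimal sketch uses the base R operator `%*%` for the matrix product (12.20) and contrasts it with the element-wise product `*`. The second part squares the adjacency matrix of Example 12.1.3, entered exactly as printed in (12.18); the variable names are illustrative.

```r
A <- matrix(c(1, 2,
              3, 4), nrow = 2, byrow = TRUE)
B <- matrix(c(0, 1,
              1, 0), nrow = 2, byrow = TRUE)

A %*% B    # matrix product: rows (2, 1) and (4, 3)
B %*% A    # differs from A %*% B: rows (3, 4) and (1, 2)
A * B      # element-wise product, not the matrix product

# Number of walks of length 2: square of the adjacency matrix (12.18)
G <- matrix(c(0, 1, 1, 0, 0,
              1, 0, 0, 0, 0,
              1, 0, 0, 1, 1,
              0, 0, 1, 0, 0,
              0, 0, 0, 0, 0), nrow = 5, byrow = TRUE)
G2 <- G %*% G
G2[1, 1]   # 2 walks of length 2 from vertex 1 to vertex 1
G2[1, 4]   # 1 walk of length 2 from vertex 1 to vertex 4
```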

12.3 Special matrices


Special matrices are characterized by their patterns, which can be used to define matrix
classes. Examples of such special matrices include [→27]:
Diagonal matrix: A matrix $A$ is called diagonal if and only if its entries $a_{ij} = 0$ for all $i \neq j$.
Identity matrix: If the nonzero entries of a diagonal matrix are all equal to unity, then the matrix is called an identity matrix, and it is often denoted $I$.
Upper trapezoidal matrix: An $m \times n$ matrix, $A$, is said to be upper trapezoidal if and only if its entries $a_{ij} = 0$ for all $i > j$. If $m = n$, i.e., $A$ is a square matrix, then an upper trapezoidal matrix is referred to as an upper triangular matrix.
Lower trapezoidal matrix: An $m \times n$ matrix, $A$, is said to be lower trapezoidal if and only if its entries $a_{ij} = 0$ for all $j > i$. If $m = n$, i.e., $A$ is a square matrix, then a lower trapezoidal matrix is referred to as a lower triangular matrix.
Tridiagonal matrix: A matrix, $A$, is said to be tridiagonal if and only if its entries $a_{ij} = 0$ for all $i, j$ such that $|i - j| > 1$.
Orthogonal matrix: A matrix, $A$, is said to be orthogonal if and only if $A^T A = I$.
Symmetric matrix: A matrix, $A$, is said to be symmetric if and only if $A = A^T$.
Skew-symmetric matrix: A matrix, $A$, is said to be skew-symmetric if and only if $A^T = -A$.
Sparse matrix: A matrix is called sparse if it has relatively few nonzero entries. The sparsity of an $n \times m$ matrix, generally expressed in %, is given by $\frac{r}{nm}\,\%$, where $r$ is the number of nonzero entries in $A$.

Diagonal and tridiagonal matrices are typical examples of sparse matrices.

Definition 12.3.1.
Let $A$ be an $n \times n$ square matrix, and let $I_n$ be the $n \times n$ identity matrix. If there exists an $n \times n$ matrix $B$ such that
\[ AB = I_n = BA, \tag{12.21} \]
then $B$ is called the inverse of $A$. If a square matrix, $A$, has an inverse, then $A$ is called an invertible or nonsingular matrix; otherwise, $A$ is called a singular matrix.
The inverse of the identity matrix is the identity matrix, whereas the inverse of a lower (respectively, upper) triangular matrix is also a lower (respectively, upper) triangular matrix.
Note that for a matrix to be invertible, it must be a square matrix.
Using R, the inverse of a square matrix, $A$, can be computed as follows:
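The listing is not reproduced here. A minimal sketch using the base R function `solve()`, which returns the inverse when called with a single nonsingular matrix, is given below; the example matrix is illustrative.

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)

# Inverse of A
B <- solve(A)
B

# Check of (12.21): both products give the identity, up to rounding
all.equal(A %*% B, diag(2))
all.equal(B %*% A, diag(2))
```

For a singular matrix, `solve()` stops with an error, which is consistent with the definition above.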
12.4 Trace and determinant of a matrix
Let $A$ be an $n \times n$ real matrix. The trace of $A$, denoted $\mathrm{tr}(A)$, is the sum of the diagonal entries of $A$, that is,
\[ \mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}. \]
The determinant of $A$, denoted $\det(A)$, can be computed using the following recursive relation [→27], in which $j$ is any fixed column index:
\[ \det(A) = \begin{cases} a_{11}, & \text{if } n = 1, \\ \sum_{i=1}^{n} (-1)^{i+j} a_{ij} \det(M_{ij}), & \text{if } n > 1, \end{cases} \]
where $M_{ij}$ is the $(n - 1) \times (n - 1)$ matrix obtained by removing the $i$th row and the $j$th column of $A$.
Let $A$ and $B$ be two $n \times n$ matrices and $k$ a real scalar. Some useful properties of the determinant for $A$ and $B$ include the following:
1. $\det(AB) = \det(A) \det(B)$;
2. $\det(A^T) = \det(A)$;
3. $\det(kA) = k^n \det(A)$;
4. $\det(A) \neq 0$ if and only if $A$ is nonsingular.

Remark 12.4.1.
If an $n \times n$ matrix, $A$, is a diagonal, upper triangular, or lower triangular matrix, then
\[ \det(A) = \prod_{i=1}^{n} a_{ii}, \]
i.e., the determinant of a triangular matrix is the product of the diagonal entries. Therefore, the most practical means of computing a determinant of a matrix is to decompose it into a product of lower and upper triangular matrices.
Using R, the trace and the determinant of a matrix are computed as follows:
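The listing is not reproduced here. In base R there is no dedicated trace function, so a minimal sketch computes the trace as `sum(diag(A))` and the determinant with `det()`; the example matrix is illustrative.

```r
A <- matrix(c(2, 0, 1,
              1, 3, 2,
              1, 1, 4), nrow = 3, byrow = TRUE)

# Trace: sum of the diagonal entries
sum(diag(A))    # 9

# Determinant
det(A)          # 18, up to rounding

# Two of the properties listed above, checked numerically
all.equal(det(t(A)), det(A))
all.equal(det(2 * A), 2^3 * det(A))
```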

12.5 Subspaces, dimension, and rank of a matrix


The dimension and rank of a matrix are essential properties for solving systems of linear
equations.

Definition 12.5.1.
Consider $m$ vectors $\{A_i, i = 1, 2, \ldots, m\}$, where $A_i \in \mathbb{R}^n$. If the only set of scalars $\lambda_i$ for which
\[ \sum_{i=1}^{m} \lambda_i A_i = 0_n \]
is $\lambda_1 = \lambda_2 = \ldots = \lambda_m = 0$, then the vectors $A_i$, $i = 1, 2, \ldots, m$, are said to be linearly independent.
Otherwise, the vectors are said to be linearly dependent.

Definition 12.5.2.
A subspace of $\mathbb{R}^n$ is a nonempty subset of $\mathbb{R}^n$, which is also a vector space.

Definition 12.5.3.
The set of all linear combinations of a set of $m$ vectors $\{A_i, i = 1, 2, \ldots, m\}$ in $\mathbb{R}^n$ is a subspace called the span of $\{A_i, i = 1, 2, \ldots, m\}$, and defined as follows:
\[ \mathrm{span}\{A_i, i = 1, 2, \ldots, m\} = \left\{ V \in \mathbb{R}^n \text{ such that } V = \sum_{i=1}^{m} \lambda_i A_i \text{ with } \lambda_i \in \mathbb{R} \;\; \forall \, i = 1, \ldots, m \right\}. \]
If the vectors $\{A_i, i = 1, 2, \ldots, m\}$ are linearly independent, then any vector $V \in \mathrm{span}\{A_i, i = 1, 2, \ldots, m\}$ is a unique linear combination of the vectors $\{A_i, i = 1, 2, \ldots, m\}$.

Definition 12.5.4.
A linearly independent set of vectors, which spans a subspace $S$, is called a basis of $S$.
All the bases of a subspace $S$ have the same number of components, and this number is called the dimension of $S$, denoted $\dim(S)$.
Two key subspaces are associated with any $m \times n$ matrix $A$, i.e., $A \in \mathbb{R}^{m \times n}$:
1. the subspace
\[ \mathrm{im}(A) = \{ b \in \mathbb{R}^m : b = Ax \text{ for some } x \in \mathbb{R}^n \}, \]
referred to as the range or the image of $A$; and
2. the subspace
\[ \ker(A) = \{ x \in \mathbb{R}^n : Ax = 0 \}, \]
called the null space or the kernel of $A$.
If the matrix, $A$, is defined by a set of $n$ vectors $\{A_i, i = 1, 2, \ldots, n\}$, where $A_i \in \mathbb{R}^m$, then
\[ \mathrm{im}(A) = \mathrm{span}\{A_i, i = 1, 2, \ldots, n\}. \]

Definition 12.5.5.
The rank of a matrix, $A$, denoted $\mathrm{rank}(A)$, is the maximum number of linearly independent rows or columns of the matrix $A$, and it is defined as follows:
\[ \mathrm{rank}(A) = \dim(\mathrm{im}(A)). \]
Let $A$ be an $m \times n$ real matrix. Then, the following properties should be noted:
If $m = n$, then
1. if $\mathrm{rank}(A) = n$, then $A$ is said to have a full rank;
2. $A$ is a nonsingular matrix if and only if it has a full rank, i.e., $\mathrm{rank}(A) = n$;
If $m \leq n$, then $A$ has a full rank if the rows of $A$ are linearly independent.
Let $A$ be an $m \times n$ real matrix and $B$ an $n \times p$ real matrix. Then the following relationships hold:
$\mathrm{rank}(A^T A) = \mathrm{rank}(A A^T) = \mathrm{rank}(A) = \mathrm{rank}(A^T)$;
$\mathrm{rank}(AB) \leq \min(\mathrm{rank}(A), \mathrm{rank}(B))$;
If $\mathrm{rank}(A) = n$, then $\mathrm{rank}(AB) = \mathrm{rank}(B)$;
$\mathrm{rank}(A) + \mathrm{rank}(B) - n \leq \mathrm{rank}(AB)$ (this is known as Sylvester's rank inequality).

Definition 12.5.6.
Let $A$ be an $m \times n$ real matrix, i.e., $A \in \mathbb{R}^{m \times n}$, and $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$. Then, the matrix $B \in \mathbb{R}^{m \times n}$ such that
\[ B = A + uv^T \tag{12.22} \]
is called the rank-1 modification of $A$.
Let $A \in \mathbb{R}^{n \times n}$, and $U, V \in \mathbb{R}^{n \times p}$ such that
1. $A$ is nonsingular, and
2. $(I_p + V^T A^{-1} U)$, where $I_p$ denotes the $p \times p$ identity matrix, is nonsingular.
Then,
\[ (A + UV^T)^{-1} = A^{-1} - A^{-1} U (I_p + V^T A^{-1} U)^{-1} V^T A^{-1}. \tag{12.23} \]
The above equation (→12.23) is referred to as the Sherman–Morrison–Woodbury formula.
If $p = 1$, then the above matrices $U$ and $V$ reduce to two vectors $u, v \in \mathbb{R}^n$. Then, the Sherman–Morrison–Woodbury formula simplifies to
\[ (A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}. \tag{12.24} \]
Since $B = A + uv^T$, due to (→12.22), if the inverse of $A$ is known, the Sherman–Morrison–Woodbury formula provides an inexpensive way to compute the inverse of the matrix $B$, the rank-1 change of $A$. Thus,
\[ B^{-1} = \left( I_n - \frac{1}{1 + v^T (A^{-1} u)} (A^{-1} u) v^T \right) A^{-1}. \]
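The rank-1 update (12.24) can be checked numerically. The following is a minimal sketch in base R under illustrative assumptions: the matrix `A` and the vectors `u` and `v` are arbitrary example values, chosen so that $1 + v^T A^{-1} u \neq 0$.

```r
A <- diag(c(2, 3, 4))
u <- c(1, 0, 1)
v <- c(0, 1, 1)

Ainv <- solve(A)             # inverse of A, assumed to be already known
Au   <- Ainv %*% u           # A^{-1} u, a 3 x 1 matrix
den  <- 1 + sum(v * Au)      # scalar 1 + v^T A^{-1} u

# Sherman-Morrison update of the inverse, following (12.24)
Binv <- Ainv - (Au %*% t(v) %*% Ainv) / den

# Comparison with the direct inverse of B = A + u v^T
B <- A + outer(u, v)
all.equal(Binv, solve(B))
```

The update only needs matrix-vector products with the known $A^{-1}$, whereas `solve(B)` factorizes $B$ from scratch.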

Using R, the rank of a matrix can be determined as follows:
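The listing is not reproduced here. One common way to obtain the rank in base R, shown as a minimal sketch, is through the QR decomposition: `qr(A)$rank` returns a numerical estimate of the rank. The example matrices are illustrative.

```r
# The second row is twice the first, so the rank is 2
A <- matrix(c(1, 2, 3,
              2, 4, 6,
              1, 0, 1), nrow = 3, byrow = TRUE)
qr(A)$rank          # 2

# The identity matrix has full rank
qr(diag(3))$rank    # 3

# rank(A^T A) equals rank(A)
qr(t(A) %*% A)$rank
```

Because the rank is determined numerically with a tolerance, the result for nearly rank-deficient matrices should be interpreted with care.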


12.6 Eigenvalues and eigenvectors of a matrix
In this section, we introduce the eigenvalues of a matrix as the zeros of its characteristic polynomial. Eigenvalues have various applications in many scientific disciplines. For instance, eigenvalues are used extensively in mathematical chemistry [→90], [→186] and computer science [→28]. Let $A$ be an $n \times n$ matrix. The eigenvalues of $A$, denoted $\lambda_i(A)$, $i = 1, 2, \ldots, n$, or $\lambda = \{\lambda_i, i = 1, 2, \ldots, n\}$, are the $n$ zeros of the polynomial in $\lambda$ of degree $n$ defined by $\det(A - \lambda I)$, i.e., the eigenvalues are solutions to the following equation [→27]:
\[ \det(A - \lambda I) = 0. \]
The polynomial $\det(A - \lambda I)$ is called the characteristic polynomial.
If $A$ is a real matrix, then the eigenvalues of $A$ are either real or pairs of complex conjugates. If $A$ is a symmetric matrix, then all its eigenvalues are real.
The following properties hold for eigenvalues:
1. $\det(A) = \prod_{i=1}^{n} \lambda_i(A)$;
2. $\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} = \sum_{i=1}^{n} \lambda_i(A)$.
A square matrix, $A$, is called nonsingular if and only if all its eigenvalues are nonzero.

Definition 12.6.1.
The spectral radius of a square matrix $A$, denoted $\rho(A)$, is given by
\[ \rho(A) = \max_{i=1,2,\ldots,n} |\lambda_i(A)|. \]

Definition 12.6.2.
A non-null vector $x$, such that
\[ Ax = \lambda_i(A) x, \]
is called the right eigenvector associated with the eigenvalue $\lambda_i(A)$.
For each eigenvalue $\lambda_i(A)$, its right eigenvector $x$ is found by solving the system
\[ (A - \lambda_i(A) I) x = 0_n. \]
Let $A$ be an $n \times n$ real matrix. The following properties hold:
If $A$ is diagonal, upper triangular or lower triangular, then its eigenvalues are given by its diagonal entries, i.e.,
\[ \lambda_i(A) = a_{ii}, \quad \text{for } i = 1, 2, \ldots, n. \]
If $A$ is orthogonal, then $|\lambda_i(A)| = 1$ for all $i = 1, 2, \ldots, n$.
If $A$ is symmetric, then there exists an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ such that
\[ Q^T A Q = D, \tag{12.25} \]
where $D$ is an $n \times n$ diagonal matrix, whose diagonal entries are $\lambda_1(A), \lambda_2(A), \ldots, \lambda_n(A)$.
If $A$ is nonsingular, i.e., $\lambda_i(A) \neq 0$ for $i = 1, \ldots, n$, then
\[ \lambda_i(A^{-1}) = \frac{1}{\lambda_i(A)}, \quad \text{for } i = 1, \ldots, n. \]
Eigenvalues can be used to determine the definiteness of symmetric matrices. Let $A$ be an $n \times n$ symmetric real matrix and $\lambda_i(A)$, $i = 1, 2, \ldots, n$, its eigenvalues. Then we have the following relationships:
$A$ is said to be positive definite if and only if $\lambda_i(A) > 0$ for all $i = 1, \ldots, n$.
$A$ is said to be positive semidefinite if and only if $\lambda_i(A) \geq 0$ for all $i = 1, \ldots, n$.
$A$ is said to be negative definite if and only if $\lambda_i(A) < 0$ for all $i = 1, \ldots, n$.
$A$ is said to be negative semidefinite if and only if $\lambda_i(A) \leq 0$ for all $i = 1, \ldots, n$.
$A$ is said to be indefinite if and only if $\lambda_i(A) > 0$ for some $i$ and $\lambda_j(A) < 0$ for some $j$.

Remark 12.6.1.
If a nonsingular $n \times n$ symmetric matrix, $A$, is positive semidefinite (respectively negative semidefinite), then $A$ is positive definite (respectively negative definite).
Using R, the eigenvalues for a matrix, A, and their associated eigenvectors, as well as the
spectral radius of the matrix A, can be computed as follows:
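The listing is not reproduced here. A minimal sketch using the base R function `eigen()` follows; the symmetric example matrix is illustrative, and the spectral radius is computed as the largest modulus of the eigenvalues, as in Definition 12.6.1.

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(A)
e$values     # eigenvalues, in decreasing order: 3 1
e$vectors    # columns are the associated right eigenvectors

# Check A x = lambda x for the first eigenpair
x1 <- e$vectors[, 1]
all.equal(as.vector(A %*% x1), e$values[1] * x1)

# Spectral radius; Mod() also covers complex eigenvalues
rho <- max(Mod(e$values))
rho          # 3

# Both eigenvalues are positive, so this symmetric matrix is positive definite
all(e$values > 0)
```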

12.7 Matrix norms


The concept of a vector norm discussed earlier can be generalized to a matrix [→86], [→134].

Definition 12.7.1.
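A matrix norm, denoted by $\|\cdot\|$, is a scalar function defined from $\mathbb{R}^{m \times n}$ to $\mathbb{R}$, such that
1. $\|A\| \geq 0$ for all $A \in \mathbb{R}^{m \times n}$;
2. $\|A\| = 0 \iff A = 0_{m \times n}$, where $0_{m \times n}$ denotes an $m \times n$ null matrix;
3. $\|A + B\| \leq \|A\| + \|B\|$ for all $A, B \in \mathbb{R}^{m \times n}$;
4. $\|kA\| = |k| \times \|A\|$, for all $k \in \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$.
Furthermore, if
\[ \|AB\| \leq \|A\| \times \|B\|, \quad \text{for all } A \in \mathbb{R}^{m \times n} \text{ and } B \in \mathbb{R}^{n \times q}, \]
then the matrix norm is said to be consistent.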

Definition 12.7.2.
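Let $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. Then, the subordinate matrix p-norm of $A$, denoted $\|A\|_p$, is defined in terms of vector norms as follows:
\[ \|A\|_p = \max_{x \neq 0_n} \frac{\|Ax\|_p}{\|x\|_p}. \]
In particular,
1. the subordinate matrix 1-norm of $A$ is defined by
\[ \|A\|_1 = \max_{x \neq 0_n} \frac{\|Ax\|_1}{\|x\|_1} = \max_{j=1,2,\ldots,n} \sum_{i=1}^{m} |a_{ij}|. \]
2. the subordinate matrix 2-norm of $A$ is defined by
\[ \|A\|_2 = \max_{x \neq 0_n} \frac{\|Ax\|_2}{\|x\|_2}; \]
if $m = n$ and $A$ is symmetric, then
\[ \|A\|_2 = \rho(A) = \max_{i=1,2,\ldots,n} |\lambda_i(A)|, \]
where $\rho(A)$ denotes the spectral radius of $A$.
3. in the case where $p = \infty$, the subordinate matrix $\infty$-norm of $A$ is defined by
\[ \|A\|_\infty = \max_{x \neq 0_n} \frac{\|Ax\|_\infty}{\|x\|_\infty} = \max_{i=1,2,\ldots,m} \sum_{j=1}^{n} |a_{ij}|. \]
Furthermore, if $m = n$, we have
\[ \|A\|_2 \leq \sqrt{\|A\|_1 \times \|A\|_\infty} \leq \sqrt{n} \times \|A\|_2. \]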

Remark 12.7.1.
The subordinate matrix p-norm is consistent and, for any $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$,
\[ \|Ax\|_p \leq \|A\|_p \, \|x\|_p. \]

Definition 12.7.3.
The Frobenius norm of a matrix $A$ is defined by
\[ \|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 \right)^{\frac{1}{2}} = \sqrt{\mathrm{tr}(A A^T)}. \]
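The norms defined above can be evaluated with the base R function `norm()`, whose `type` argument selects the norm. The following minimal sketch uses an arbitrary example matrix.

```r
A <- matrix(c(1, -2,
              3,  4), nrow = 2, byrow = TRUE)

norm(A, type = "1")    # maximum absolute column sum: 6
norm(A, type = "I")    # maximum absolute row sum (infinity norm): 7
norm(A, type = "2")    # 2-norm (largest singular value)
norm(A, type = "F")    # Frobenius norm: sqrt(30)

# The Frobenius norm written out from its definition
sqrt(sum(A^2))
sqrt(sum(diag(A %*% t(A))))

# Bound relating the 1-, 2- and infinity norms for a square matrix
norm(A, "2") <= sqrt(norm(A, "1") * norm(A, "I"))
```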

12.8 Matrix factorization


Matrix factorization is an essential tool for solving systems of linear equations [→27]. The most commonly used factorization methods are the LU factorization for a square matrix, and the QR factorization for a rectangular matrix, i.e., square or not.

12.8.1 LU factorization
Let $A \in \mathbb{R}^{n \times n}$. The LU factorization of $A$ consists of the decomposition of $A$ into a product of a unit lower triangular matrix $L$ and an upper triangular matrix $U$, that is,
\[ A = LU, \]
where
\[ L = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & l_{n3} & \cdots & 1 \end{bmatrix} \quad \text{and} \quad U = \begin{bmatrix} u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\ 0 & u_{22} & u_{23} & \cdots & u_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & u_{nn} \end{bmatrix}. \]
Therefore, the determinants of $L$ and $U$ are $\det(L) = 1$ and $\det(U) = \prod_{i=1}^{n} u_{ii}$, respectively. Consequently,
\[ \det(A) = \det(LU) = \det(L) \times \det(U) = \prod_{i=1}^{n} u_{ii}. \]
However, when a principal submatrix of $A$ is singular, then a permutation, i.e., a reordering of the rows of $A$, is required. If $A$ is nonsingular, then there exists a permutation matrix $P \in \mathbb{R}^{n \times n}$ such that
\[ PA = LU. \tag{12.26} \]
An equivalent formulation of the LU factorization (→12.26) is
\[ PA = LD\hat{U}, \]
where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix, whose diagonal entries are $u_{ii}$, and $\hat{U} \in \mathbb{R}^{n \times n}$ is a unit upper triangular matrix; i.e., $U = D\hat{U}$.
Computing the LU factorization of $A$ is formally equivalent to solving the following nonlinear system of $n^2$ equations where the unknowns are the $n^2 + n$ coefficients of the triangular matrices $L$ and $U$:
\[ a_{ij} = \sum_{k=1}^{\min(i,j)} l_{ik} u_{kj}. \]

Using R, the LU factorization of a matrix, A, is performed using the command expand(lu(A)), which
outputs three matrices L, U, and P. The matrices L and U are the lower and upper triangular
matrices we are looking for, whereas the matrix P contains all the row permutation operations
that have been carried out on the original matrix A for the purpose of obtaining L and U.
Therefore, the product LU gives a row-permuted version of A, whereas the product P LU
enables the recovery of the original matrix A.
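To make the role of the permutation concrete, here is a minimal sketch of an LU factorization with row pivoting written in base R. The function name lu_pivot and the example matrix are assumptions made for illustration; this is independent of the expand(lu(A)) command mentioned above, and it returns P in the convention PA = LU of (→12.26), so the original matrix is recovered as $P^T L U$.

```r
# Hypothetical helper: LU factorization with partial (row) pivoting, P A = L U.
# Assumes A is square and nonsingular.
lu_pivot <- function(A) {
  n <- nrow(A)
  P <- diag(n); L <- diag(n); U <- A
  for (k in seq_len(n - 1)) {
    # pick the row with the largest pivot candidate in column k
    p <- which.max(abs(U[k:n, k])) + k - 1
    if (p != k) {
      U[c(k, p), ] <- U[c(p, k), ]
      P[c(k, p), ] <- P[c(p, k), ]
      if (k > 1) {
        L[c(k, p), seq_len(k - 1)] <- L[c(p, k), seq_len(k - 1)]
      }
    }
    # eliminate the entries below the pivot
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]
      U[i, ]  <- U[i, ] - L[i, k] * U[k, ]
    }
  }
  list(P = P, L = L, U = U)
}

A <- matrix(c(3,  2, -5,
              1, -3,  2,
              5, -1,  4), nrow = 3, byrow = TRUE)
f <- lu_pivot(A)

max(abs(f$P %*% A - f$L %*% f$U))      # residual of P A = L U, close to 0
max(abs(A - t(f$P) %*% f$L %*% f$U))   # recovery of A, close to 0
det(f$P) * prod(diag(f$U))             # agrees with det(A)
```

Because det(P) is either 1 or −1, the determinant of A equals the product of the diagonal entries of U up to sign.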

Let A be an n × n symmetric matrix. If A has an LU factorization, then there exists a unit lower
triangular matrix $L \in \mathbb{R}^{n \times n}$, and a diagonal matrix $D \in \mathbb{R}^{n \times n}$ such that

$$A = LDL^T. \tag{12.27}$$

If a principal submatrix of A is singular, then a permutation, i. e., a reordering of both rows and
columns of A is required, and this results in the following factorization:

$$PAP^T = LDL^T, \tag{12.28}$$

where $P \in \mathbb{R}^{n \times n}$ is a permutation matrix.

12.8.2 Cholesky factorization


If an n × n real matrix, A, is positive definite, then there exists a unit lower triangular matrix
$L \in \mathbb{R}^{n \times n}$ and a diagonal matrix $D \in \mathbb{R}^{n \times n}$ with $d_{ii} > 0$ for i = 1, 2, …, n such that

$$A = LDL^T = \tilde{L}\tilde{L}^T, \tag{12.29}$$

where $\tilde{L} = LD^{\frac{1}{2}}$, with $D^{\frac{1}{2}}$ a diagonal matrix, whose diagonal entries are $\sqrt{d_{ii}}$ for
i = 1, 2, …, n. The factorization (→12.29) is referred to as the Cholesky factorization.

An illustration of Cholesky factorization, using R, is provided in Listing 12.18.
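Independently of that listing, the relation between the two forms in (→12.29) can be sketched in base R as follows. The matrix values are an assumption chosen to be symmetric positive definite. Note that chol() returns the upper triangular factor, so $\tilde{L}$ is obtained by transposition.

```r
# Illustrative symmetric positive definite matrix (assumed values)
A <- matrix(c(4, 2, 2,
              2, 5, 3,
              2, 3, 6), nrow = 3, byrow = TRUE)

R_up    <- chol(A)     # upper triangular factor, A = R_up^T R_up
L_tilde <- t(R_up)     # lower triangular factor, A = L_tilde L_tilde^T

d <- diag(L_tilde)^2                   # diagonal entries d_ii of D
L <- L_tilde %*% diag(1 / sqrt(d))     # unit lower triangular, L = L_tilde D^(-1/2)
D <- diag(d)

max(abs(A - L_tilde %*% t(L_tilde)))   # close to 0
max(abs(A - L %*% D %*% t(L)))         # close to 0
diag(L)                                # all ones
```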

12.8.3 QR factorization
Let $A \in \mathbb{R}^{m \times n}$. Then,
1. if m = n, and A is nonsingular, then there exists an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$
and a nonsingular upper triangular matrix R such that the QR factorization of A is
defined by

$$A = QR \iff Q^T A = R,$$

since Q is orthogonal, i. e., $Q^T Q = I_n$;

2. if m > n and rank(A) = n, then there exists an orthogonal matrix $Q \in \mathbb{R}^{m \times m}$
and a nonsingular upper triangular matrix $R \in \mathbb{R}^{n \times n}$ such that the QR
factorization of A is defined by

$$A = Q \begin{bmatrix} R \\ 0_{m-n,n} \end{bmatrix} \iff Q^T A = \begin{bmatrix} R \\ 0_{m-n,n} \end{bmatrix}, \tag{12.30}$$

since Q is orthogonal, i. e., $Q^T Q = I_m$. Here, $0_{m-n,n}$ denotes the (m − n) × n
matrix of zeros.
When rank(A) < n, a permutation, i. e., a reordering of the columns of A, is
introduced, and the QR factorization of A is defined by

$$Q^T A P = \begin{bmatrix} R \\ 0_{m-n,n} \end{bmatrix},$$

where $P \in \mathbb{R}^{n \times n}$ is a permutation matrix for reordering the columns of A.

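Let $V \in \mathbb{R}^{m \times n}$ and $W \in \mathbb{R}^{m \times (m-n)}$ denote the n first columns and (m − n) last columns of the
orthogonal matrix $Q \in \mathbb{R}^{m \times m}$, respectively; that is, $Q = [V, W]$. Then, the submatrices V and W
are also orthogonal. Indeed,

$$Q^T Q = \begin{bmatrix} V^T \\ W^T \end{bmatrix} \begin{bmatrix} V & W \end{bmatrix} = \begin{bmatrix} V^T V & V^T W \\ W^T V & W^T W \end{bmatrix}.$$

Since Q is orthogonal, then $Q^T Q = I_m = \begin{bmatrix} I_n & 0 \\ 0 & I_{m-n} \end{bmatrix}$, and therefore

$$\begin{bmatrix} V^T V & V^T W \\ W^T V & W^T W \end{bmatrix} = \begin{bmatrix} I_n & 0 \\ 0 & I_{m-n} \end{bmatrix},$$

i. e., $V^T V = I_n$, $W^T W = I_{m-n}$, $V^T W = 0_{n,m-n}$ and $W^T V = 0_{m-n,n}$ (that is, V and W are
orthogonal). Substituting Q with [V W] in (→12.30) gives

$$Q^T A = \begin{bmatrix} R \\ 0_{m-n,n} \end{bmatrix} \iff \begin{bmatrix} V^T \\ W^T \end{bmatrix} A = \begin{bmatrix} R \\ 0_{m-n,n} \end{bmatrix},$$

and therefore

$$V^T A = R \iff A = VR, \tag{12.31}$$

$$W^T A = 0_{m-n,n}. \tag{12.32}$$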

Equations (→12.31) and (→12.32) yield several important results, which link the QR factorization
of a matrix, A, to its subspaces im(A) (i. e., the range of A) and $\ker(A^T)$ (i. e., the kernel or
null space of $A^T$, which is the orthogonal complement of im(A)). In particular,
1. since V has orthonormal columns, then, thanks to (→12.31), the columns of V form
an orthonormal basis for the subspace im(A), that is, every column of A is a
linear combination of the columns of V through A = VR. Consequently, the
matrix $VV^T$ provides an orthogonal projection onto the subspace im(A).
2. also, since W has orthonormal columns, then, thanks to (→12.32), the columns of W
form an orthonormal basis for the subspace $\ker(A^T)$, and the matrix $WW^T$
provides an orthogonal projection onto the subspace $\ker(A^T)$.
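The partition Q = [V, W] and the identities (→12.31) and (→12.32) can be verified numerically. The sketch below uses base R's qr(), qr.Q() and qr.R(); the 4 × 2 full-rank matrix is an assumed example.

```r
# Illustrative 4 x 2 matrix with rank 2 (assumed values)
A <- cbind(c(1, 1, 1, 1), c(0, 1, 2, 3))
m <- nrow(A); n <- ncol(A)

qrA <- qr(A)
Q <- qr.Q(qrA, complete = TRUE)   # full m x m orthogonal factor
R <- qr.R(qrA)                    # n x n upper triangular factor

V <- Q[, 1:n, drop = FALSE]       # first n columns of Q
W <- Q[, (n + 1):m, drop = FALSE] # last m - n columns of Q

max(abs(A - V %*% R))             # A = V R, close to 0
max(abs(t(W) %*% A))              # W^T A = 0, close to 0

P_V <- V %*% t(V)                 # projector built from V
max(abs(P_V %*% A - A))           # columns of A are left unchanged
max(abs(P_V %*% P_V - P_V))       # idempotent, as a projector should be
```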


12.8.4 Singular value decomposition
The singular value decomposition (SVD) is another type of matrix factorization that generalizes
the eigendecomposition of a square normal matrix to any matrix. It is a popular method because
it has widespread applications, e. g., for recommender systems and the determination of a
pseudoinverse. Let $A \in \mathbb{R}^{m \times n}$. Then,

if m ≥ n, then the SVD of A is given by

$$A = U \begin{bmatrix} D \\ 0_{m-n,n} \end{bmatrix} V^T,$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, i. e., $U^T U = I_m$ and
$V^T V = I_n$, and $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose diagonal entries, $d_{ii}$ or simply
denoted $d_i$ for i = 1, 2, …, n, are ordered in descending order, that is,

$$d_1 \ge d_2 \ge \ldots \ge d_n \ge 0.$$

if n > m, then the SVD of A is given by

$$A = U \begin{bmatrix} D & 0_{m,n-m} \end{bmatrix} V^T,$$

where $D \in \mathbb{R}^{m \times m}$ is a diagonal matrix, whose diagonal entries $d_{ii} = d_i$ for
i = 1, 2, …, m, are arranged in descending order.

Definition 12.8.1.
The singular values of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\sigma_i(A)$, are given by the diagonal entries of
the matrix D, that is,

$$\sigma_i(A) = d_i, \quad \text{for } i = 1, 2, \ldots, p,$$

where p = min(m, n).


The columns of U are called the left singular vectors, whereas the columns of V are called the right
singular vectors.
The following relationships should be noted:
The singular values of a matrix $A \in \mathbb{R}^{m \times n}$ are given by

$$\sigma_i(A) = \sqrt{\lambda_i(A^T A)} \quad \text{for } i = 1, 2, \ldots, p,$$

where p = min(m, n) and $\lambda_i(A^T A)$, i = 1, 2, …, p, are the p largest eigenvalues of the
matrix $A^T A$, arranged in descending order.

Let r denote the rank of a matrix $A \in \mathbb{R}^{m \times n}$, then

$$\sigma_1(A) \ge \cdots \ge \sigma_r(A) > \sigma_{r+1}(A) = \sigma_{r+2}(A) = \cdots = \sigma_p(A) = 0, \quad p = \min(m, n);$$

that is, the rank of the matrix is the number of its nonzero singular values. However, due to
rounding errors, this approach to determine the rank is not straightforward in practice, as it
is unclear how small the singular value should be to be considered as zero.
Furthermore, if A has a full rank, i. e., r = min(m, n), then the condition number of A,
denoted κ(A), is given by

$$\kappa(A) = \frac{\sigma_1(A)}{\sigma_r(A)}.$$
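These relationships can be illustrated with base R's svd(), which returns the reduced form of the decomposition (U has min(m, n) columns rather than m). The example matrix and the rank tolerance below are assumptions; the tolerance is one common heuristic, not a definitive rule.

```r
# Illustrative 4 x 3 matrix of full column rank (assumed values)
A <- cbind(c(3, 1, 5, 0), c(2, -3, -1, 1), c(-5, 2, 4, 2))

s     <- svd(A)
sigma <- s$d                                   # singular values, descending

max(abs(A - s$u %*% diag(sigma) %*% t(s$v)))   # A = U D V^T, close to 0

# Singular values as square roots of the eigenvalues of A^T A
lambda <- eigen(crossprod(A), symmetric = TRUE)$values
max(abs(sigma - sqrt(lambda)))                 # close to 0

# Numerical rank: count singular values above a small tolerance (heuristic choice)
tol <- max(dim(A)) * max(sigma) * .Machine$double.eps
r   <- sum(sigma > tol)

# Condition number as ratio of largest to smallest nonzero singular value
kappa_2 <- sigma[1] / sigma[r]
c(rank = r, condition_number = kappa_2)
```

The tolerance makes explicit the practical difficulty noted above: the computed rank depends on where the cut-off between "small" and "zero" is placed.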
12.9 Systems of linear equations
A system of m linear equations in n unknowns consists of a set of algebraic relationships of the
form

$$\sum_{j=1}^{n} a_{ij} x_j = b_i, \quad i = 1, \ldots, m, \tag{12.33}$$

where $x_j$ are the unknowns, whereas $a_{ij}$, the coefficients of the system, and $b_i$, the entries of the
right-hand side, are assumed to be known constants.

The system (→12.33) can be rewritten in the following matrix form:

$$Ax = b, \tag{12.34}$$

where A is an m × n matrix with entries $a_{ij}$, i = 1, …, m, j = 1, …, n, b is a column vector of
size m with entries $b_i$, i = 1, …, m, and x is a column vector of size n with entries $x_j$,
j = 1, …, n.

Theoretically, the system (→12.34) has a solution if and only if b ∈ im(A). If, in addition,
ker(A) = {0}, then the solution is unique. When A is square and nonsingular, the solution
is given by Cramer's method, as follows:

$$x_j = \frac{\det(M_j)}{\det(A)}, \quad j = 1, \ldots, n, \tag{12.35}$$

where $M_j$ is the matrix obtained by substituting the j-th column of A with the right-hand side
term b.
However, when the size of the matrix A is large, Cramer's method is not practical, and
computing the solution, x, requires more efficient numerical methods. Often, the efficiency
of these methods depends on the pattern or structure of the matrix A. Depending
on the form of the matrix A, the systems of the form (→12.34) can be categorized as follows:
1. Triangular linear systems: If $A \in \mathbb{R}^{n \times n}$ is either a nonsingular lower or an upper
triangular matrix, then the system (→12.34) can be solved efficiently.

For instance, if A is a nonsingular lower triangular matrix, i. e., $A = L \in \mathbb{R}^{n \times n}$,
then the solution to the system (→12.34) can be readily obtained using the
following method, known as forward substitution:

$$x_1 = \frac{b_1}{l_{11}}, \tag{12.36}$$

$$x_i = \frac{b_i - \sum_{j=1}^{i-1} l_{ij} x_j}{l_{ii}} \quad \text{for } i = 2, 3, \ldots, n. \tag{12.37}$$

If A = L is a unit lower triangular matrix, then $l_{ii} = 1$ for all i = 1, 2, …, n.
Therefore, the denominators in (→12.36) and (→12.37) simplify.

However, if A is a nonsingular upper triangular matrix, i. e., $A = U \in \mathbb{R}^{n \times n}$, then
the solution to the system (→12.34) can easily be obtained using the following
method, known as backward substitution:

$$x_n = \frac{b_n}{u_{nn}}, \tag{12.38}$$

$$x_i = \frac{b_i - \sum_{j=i+1}^{n} u_{ij} x_j}{u_{ii}} \quad \text{for } i = n - 1, n - 2, \ldots, 1. \tag{12.39}$$

2. Well-determined linear systems: If $A \in \mathbb{R}^{n \times n}$, then the system (→12.34) has n linear
equations in n variables, and such a system is said to be well-determined. If A is
nonsingular, then the solution to the system is given by

$$x = A^{-1} b.$$

Since factorization methods, such as the LU factorization, provide an efficient
alternative to computing the inverse of a square matrix explicitly, they can be used here. Let L
be a unit lower triangular matrix, U an upper triangular matrix, and P a
permutation matrix such that PA = LU. Then $A = P^{-1}LU$, and the system
(→12.34) can be rewritten as follows

$$P^{-1}LUx = b \iff LUx = Pb, \tag{12.40}$$

where the right-hand side term Pb is a permutation of b, i. e., the reordering of
the entries of b.
The system (→12.40) can be solved in two stages, as follows: First, solve for y the
system Ly = Pb using forward substitution, and then use the obtained values of
y to solve for x the system Ux = y using backward substitution.
Furthermore, if A is symmetric, then it is more convenient to use the $LDL^T$
factorization (→12.27)–(→12.28) to solve the system (→12.34). If A is symmetric
positive definite, then the Cholesky factorization (→12.29), i. e., $A = \tilde{L}\tilde{L}^T$, is likely
to be the most efficient method to solve the system (→12.34).
3. Over-determined linear systems: If $A \in \mathbb{R}^{m \times n}$ with m > n, then the system
(→12.34) has more equations than variables, and such a system is said to be
over-determined. When b ∈ im(A), then the system (→12.34) has a solution, and so
does

$$A^T A x = A^T b.$$

If A is a full rank matrix, that is, rank(A) = n, then the matrix $A^T A$ is
nonsingular and therefore invertible. Hence,

$$x = (A^T A)^{-1} A^T b.$$

The matrix $A^+ = (A^T A)^{-1} A^T$ is referred to as the pseudoinverse of A or the
Moore–Penrose generalized inverse of A. Note that when A is a square matrix, that
is, m = n, then $A^+ = A^{-1}$.
If b ∉ im(A), then the system (→12.34) has no solution.


4. Under-determined linear systems: If $A \in \mathbb{R}^{m \times n}$ with m < n, then the system
(→12.34) has fewer equations than variables, and such a system is said to be
under-determined. If the equations in (→12.34) are consistent, then the system has
an infinite number of solutions. If A is a full rank matrix, that is, rank(A) = m,
then the matrix $AA^T$ is a nonsingular matrix. Thus, one of the solutions to the
system (→12.34) is given by

$$\bar{x} = A^T (AA^T)^{-1} b.$$
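The four cases above can be sketched in base R as follows. The helper names forward_sub and backward_sub and all matrices are assumptions made for illustration; the two helpers transcribe (→12.36)–(→12.39) directly and can be compared with the built-in forwardsolve() and backsolve().

```r
# Forward substitution for a nonsingular lower triangular L, see (12.36)-(12.37)
forward_sub <- function(L, b) {
  n <- length(b)
  x <- numeric(n)
  x[1] <- b[1] / L[1, 1]
  for (i in seq_len(n)[-1]) {
    x[i] <- (b[i] - sum(L[i, 1:(i - 1)] * x[1:(i - 1)])) / L[i, i]
  }
  x
}

# Backward substitution for a nonsingular upper triangular U, see (12.38)-(12.39)
backward_sub <- function(U, b) {
  n <- length(b)
  x <- numeric(n)
  x[n] <- b[n] / U[n, n]
  for (i in rev(seq_len(n - 1))) {
    x[i] <- (b[i] - sum(U[i, (i + 1):n] * x[(i + 1):n])) / U[i, i]
  }
  x
}

# 1. Triangular systems (assumed example values)
L <- matrix(c(2, 1, 4,  0, 3, -1,  0, 0, 5), nrow = 3)   # lower triangular
U <- t(L)                                                 # upper triangular
b <- c(2, 7, 3)
x_fwd <- forward_sub(L, b)
x_bwd <- backward_sub(U, b)

# 3. Over-determined, full column rank: x = (A^T A)^{-1} A^T b
A_over <- cbind(1, c(0, 1, 2, 3))
b_over <- c(1, 3, 5, 7)                 # chosen to lie in im(A_over)
x_over <- solve(crossprod(A_over), crossprod(A_over, b_over))

# 4. Under-determined, full row rank: one solution is x = A^T (A A^T)^{-1} b
A_under <- matrix(c(1, 0,  1, 1,  2, 0), nrow = 2)
b_under <- c(4, 2)
x_under <- t(A_under) %*% solve(tcrossprod(A_under), b_under)

max(abs(A_over %*% x_over - b_over))     # close to 0
max(abs(A_under %*% x_under - b_under))  # close to 0
```

In both rectangular cases, solve() is applied to the small square system rather than forming an inverse explicitly, which is usually the cheaper and more stable choice.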

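Consider the following linear system:

$$\begin{cases} 3x + 2y - 5z = 12 \\ x - 3y + 2z = -13 \\ 5x - y + 4z = 10 \end{cases} \tag{12.41}$$

Thus,

$$A = \begin{pmatrix} 3 & 2 & -5 \\ 1 & -3 & 2 \\ 5 & -1 & 4 \end{pmatrix} \quad \text{and} \quad b = \begin{pmatrix} 12 \\ -13 \\ 10 \end{pmatrix}.$$

Using R, the linear system (→12.41) can be solved as follows: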
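A minimal sketch of one way to do this, assuming base R's solve() is an acceptable choice for this well-determined system:

```r
# Coefficient matrix and right-hand side of system (12.41)
A <- matrix(c(3,  2, -5,
              1, -3,  2,
              5, -1,  4), nrow = 3, byrow = TRUE)
b <- c(12, -13, 10)

x <- solve(A, b)     # solves A x = b without forming the inverse of A
names(x) <- c("x", "y", "z")
x

A %*% x              # reproduces the right-hand side b
```

Solving by hand gives x = 191/88, y = 519/88 and z = 111/88, which the numerical solution should match up to rounding.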

12.10 Exercises
1. Let U = (−1, −2, 4), and let V = (2, 3, 5) be two 3-dimensional vectors:
(a) Compute the Euclidean norms of U and V.
(b) Compute the dot product of U and V.
(c) Compute the orthogonal projection of U onto the direction of V.
(d) Compute the reflection of V with respect to the direction of U.
(e) Compute the cross product of U and V.
(f) Compute the components of U and V in cylindrical and spherical coordinates.

2. Let z = 5 + 2i and w = 7 + 4i be two complex numbers.
(a) Compute the conjugates of z and w, the sum of z and w, and the difference z − w.
(b) Compute the product z × w and the divisions z/w and w/z.

3. Let A and B be two 3 by 3 matrices, and let V = (17, 9, 6).
[Matrix entries only partly recoverable. Surviving entries of A: 1, −4, 3, −2, 1, 0, −1. Surviving entries of B: 7, −3, 2, 4, 1, −1, −2.]
(a) Compute the sum of A and B.
(b) Compute the difference B − A.
(c) Compute the transpose and the inverse matrices of A and B.
(d) Compute the following matrix products: A × B, V × A and V × B.
(e) Compute the trace and determinant of A and B.
(f) Find the rank of A and B.
(g) Compute the eigenvalues and eigenvectors of A and B.
(h) Compute the spectral radius of A and B.
(i) Compute the subordinate matrix 1-norm of A and B.
(j) Compute the subordinate matrix 2-norm of A and B.
(k) Compute the subordinate matrix ∞-norm of A and B.
(l) Compute the Frobenius norm of A and B.
(m) Compute the LU, Cholesky and QR factorization of A and B.

4. Use the appropriate R functions to solve the following linear systems:
[Two systems in x, y, z, only partly recoverable. Surviving fragments: 2x, 4x, 8x; 2x, x, 3x; 3y, 9y, 2y; 3z, 7z, 15z, 3z, z; right-hand sides 8, 10, 9 and 3, 3, 5.]

13 Analysis
Like linear algebra, analysis [→158] is omnipresent in nearly all applications of
mathematics. In general, analysis deals with examining convergence, limits of functions,
differentiation, integration as well as metrics. In this chapter, we provide an introduction to these
topics and demonstrate how to conduct a numerical analysis using R.

13.1 Introduction
Differentiation and integration are fundamental mathematical concepts, having a wide range of
applications in many areas of science, particularly in physics, chemistry, and engineering [→158].
Both concepts are intimately connected, as integration is the inverse process of differentiation,
and vice versa. These concepts are especially important for descriptive models, e. g., providing
information about the position of an object in space and time (physics) or the temporal evolution
of the price of a stock (finance). Such models require the precise definition of the functional,
describing the system of interest, and related mathematical objects defining the dynamics of the
system.

13.2 Limiting values


The concept of limiting values forms the basis of many areas in mathematics [→158]. For instance,
limiting values are important for investigating the values of functions. For a given function, f (x) ,
we can distinguish two types of limiting values:
If x goes to ±∞
If x goes to a finite value $x_0$

Before we begin investigating limiting values, we introduce a class of functions, namely, real
sequences. For limiting values of complex sequences or functions, we refer the reader to
[→158].

Definition 13.2.1.
A real sequence is a function $a_n : \mathbb{N} \to \mathbb{R}$.
We also write $(a_n)_{n \in \mathbb{N}} = (a_1, a_2, a_3, \ldots, a_n, \ldots)$. Typical examples of sequences include:

$$(a_n)_{n \in \mathbb{N}} = \left( \frac{1}{2}, \frac{1}{4}, \frac{1}{6}, \ldots \right), \tag{13.1}$$

$$(b_n)_{n \in \mathbb{N}} = (1^3, 2^3, 3^3, \ldots). \tag{13.2}$$

From the above sequences, it can be observed that $a_n$ and $b_n$ have the closed forms $a_n = \frac{1}{2n}$
and $b_n = n^3$, respectively. Now, we are ready to define the limiting value of a real sequence.

Definition 13.2.2.
A number l is called the limiting value or limes of a given real sequence $(a_n)_{n \in \mathbb{N}}$, if for all ε > 0
there exists $N_0(\varepsilon)$ such that $|a_n - l| < \varepsilon$, for all $n > N_0(\varepsilon)$. In this case, the sequence, $(a_n)_{n \in \mathbb{N}}$, is
said to converge to l, and the following short-hand notation is used to summarize the previous
statement: $\lim_{n \to \infty} a_n = l$.

Example 13.2.1.
Let us consider the sequence $a_n = 1 - \frac{1}{n}$. We want to show that $(a_n)_{n \in \mathbb{N}}$ converges to 1. By
setting l = 1 in Definition →13.2.2, we obtain

$$|a_n - 1| = \left| \left(1 - \frac{1}{n}\right) - 1 \right| = \left| -\frac{1}{n} \right| = \frac{1}{n} < \varepsilon. \tag{13.3}$$

Thus, we find $n > \frac{1}{\varepsilon} =: N_0(\varepsilon)$. In summary, for all ε > 0, there exists a number $N_0(\varepsilon) := \frac{1}{\varepsilon}$
such that $|a_n - 1| < \varepsilon$ for all $n > N_0(\varepsilon)$. For example, if we set $\varepsilon = \frac{1}{10}$, then n ≥ 11.
This means that for all elements of the given sequence $a_n = 1 - \frac{1}{n}$, starting from n = 11,
$|a_n - 1| < \frac{1}{10}$ holds.

In →Figure 13.1, we visualize the concept introduced in Definition →13.2.2.

Figure 13.1 Elements of the sequence $a_n = 1 - \frac{1}{n}$ are shown as points. The elements that lie in
the ε-strip are indicated by green points.
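The ε-criterion of the example can also be checked numerically. The sketch below is only an illustration on finitely many terms, not a proof; the helper name check_strip and the cut-off n_max are assumptions.

```r
a <- function(n) 1 - 1 / n           # the sequence a_n = 1 - 1/n

# Hypothetical helper: are all terms with N0 < n <= n_max inside the eps-strip around l?
check_strip <- function(eps, l = 1, n_max = 1e5) {
  N0 <- 1 / eps                      # threshold derived in the example
  n  <- seq_len(n_max)
  all(abs(a(n[n > N0]) - l) < eps)
}

check_strip(1 / 10)     # terms from n = 11 onward
check_strip(1 / 100)    # terms from n = 101 onward
check_strip(1 / 1000)   # terms from n = 1001 onward
```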

Before we give some examples of basic limiting values of sequences, we provide the following
proposition, which is necessary for the calculations that follow [→158].

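Proposition 13.2.1.
Let $(a_n)_{n \in \mathbb{N}}$ and $(b_n)_{n \in \mathbb{N}}$ be two convergent sequences with $\lim_{n \to \infty} a_n = a$ and
$\lim_{n \to \infty} b_n = b$. Then, the following relationships hold:

$$\lim_{n \to \infty} (a_n + b_n) = \lim_{n \to \infty} a_n + \lim_{n \to \infty} b_n = a + b, \tag{13.4}$$

$$\lim_{n \to \infty} (a_n \cdot b_n) = \lim_{n \to \infty} a_n \cdot \lim_{n \to \infty} b_n = a \cdot b, \tag{13.5}$$

$$\lim_{n \to \infty} \frac{a_n}{b_n} = \frac{\lim_{n \to \infty} a_n}{\lim_{n \to \infty} b_n} = \frac{a}{b}, \quad \text{provided } b \neq 0. \tag{13.6}$$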

Example 13.2.2.
Let us examine the convergence of the following two sequences:

$$a_n = \frac{3n + 1}{n + 5}, \tag{13.7}$$

$$b_n = (-1)^n. \tag{13.8}$$

For $a_n$, we have

$$\lim_{n \to \infty} \frac{3n + 1}{n + 5} = \lim_{n \to \infty} \frac{n\left(3 + \frac{1}{n}\right)}{n\left(1 + \frac{5}{n}\right)} = \frac{\lim_{n \to \infty} \left(3 + \frac{1}{n}\right)}{\lim_{n \to \infty} \left(1 + \frac{5}{n}\right)} = \frac{3 + \lim_{n \to \infty} \left(\frac{1}{n}\right)}{1 + \lim_{n \to \infty} \left(\frac{5}{n}\right)} = \frac{3 + 0}{1 + 0} = 3. \tag{13.9}$$

Note that the sequences $\left(\frac{1}{n}\right)$ and $\left(\frac{5}{n}\right)$ converge to 0. By examining the values
of $b_n$, we observe that its values alternate, i. e., they always flip between −1 and 1. According to
Definition →13.2.2, the sequence $b_n$ is not convergent and, hence, does not have a limiting value.
n

In the following, we generalize the concept of limiting values of sequences to functions.

Definition 13.2.3.
Let f(x) be a real function and $x_n$ a sequence that belongs to the domain of f(x). If, for all
such sequences with $x_n \to \pm\infty$, the values $f(x_n)$ converge to l, then l is called the limiting value for x → ±∞, and
we write $\lim_{x \to \pm\infty} f(x) = l$.

For a general function, f : X → Y, we call the set X the domain and Y the co-domain of the function f.

Example 13.2.3.
Let us determine the limiting value of the function $f(x) = \frac{2x-1}{x}$ for large and positive x. This
means, we examine $\lim_{x \to \infty} \frac{2x-1}{x}$ and find

$$\lim_{x \to \infty} \frac{2x - 1}{x} = \lim_{x \to \infty} \left(2 - \frac{1}{x}\right) = 2 - \lim_{x \to \infty} \left(\frac{1}{x}\right) = 2. \tag{13.10}$$

Here, we used Proposition →13.2.1 for functions as it can be formulated accordingly (see [→158]).
The limiting value of $f(x) = \frac{2x-1}{x}$ for large x can be seen in →Figure 13.2.

We conclude this section by stating the definition for the convergence of a function, f(x), if x
tends to a finite value $x_0$.

Definition 13.2.4.
Let f(x) be a real function defined in a neighborhood of $x_0$. If for all sequences $x_n$ in the domain
of f(x) with $x_n \to x_0$, $x_n \neq x_0$ the equation $\lim_{n \to \infty} f(x_n) = l$ holds, we call l the limiting
value of f(x) at $x_0$. Symbolically, we write $\lim_{x \to x_0} f(x) = l$.
x→x0

Example 13.2.4.
Let us calculate the limiting value of $\lim_{x \to -2} \frac{2x^3 - 8x}{x + 2}$. Note that the function $f(x) = \frac{2x^3 - 8x}{x + 2}$ is
not defined at x = −2. However, if we factorize $2x^3 - 8x$, we obtain
$2x^3 - 8x = 2x(x + 2)(x - 2)$. Hence, we obtain

$$\lim_{x \to -2} \frac{2x^3 - 8x}{x + 2} = \lim_{x \to -2} \frac{2x(x + 2)(x - 2)}{x + 2} = \lim_{x \to -2} 2x(x - 2) = 16. \tag{13.11}$$
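Both limiting values computed in this section can be made plausible numerically by evaluating the functions ever closer to the point of interest. The following base R sketch is only an illustration and does not replace the analytical argument.

```r
# Limit at a finite point: f is undefined at x = -2, but nearby values approach 16
f <- function(x) (2 * x^3 - 8 * x) / (x + 2)
h <- 10^-(1:6)
cbind(h, from_left = f(-2 - h), from_right = f(-2 + h))

# Limit for large x: values of (2x - 1)/x approach 2
g <- function(x) (2 * x - 1) / x
cbind(x = 10^(1:6), g = g(10^(1:6)))
```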
x→−2 x + 2 x→−2 x + 2 x→−2

The following sections will utilize the concepts introduced here to define differentiation and
integration.

Figure 13.2 Limiting value for $f(x) = \frac{2x-1}{x}$ with l = 2.

13.3 Differentiation
Let $f : \mathbb{R} \to \mathbb{R}$ be a given continuous function. Then, f is called differentiable at the point $x_0$ if
the following limit exists:

$$\lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}. \tag{13.12}$$

If f is differentiable at the point $x_0$, the derivative of f, denoted $\frac{df(x)}{dx}$ or $f'(x)$, is finite at $x_0$, and
can be approximated by

$$f'(x_0) = \frac{df(x_0)}{dx} \approx \frac{f(x_0 + h) - f(x_0)}{h}, \quad \text{for } h \to 0. \tag{13.13}$$

Therefore, the derivative of a function f at a point $x_0$ can be viewed as the slope of the tangent
line of f(x) at the point $x_0$, as illustrated geometrically in →Figure 13.3. The tangent line in
→Figure 13.3 (left) corresponds to the limit of the displacement of the secant line in →Figure 13.3
(left) when h tends to zero, i. e., when $x_0 + h$ is getting closer to $x_0$. The intermediate dashed
lines correspond to the different positions of the secant line as h decreases to 0. →Figure 13.3
(right) shows the tangent line when h ≈ 0, i. e., $x_0 \approx x_0 + h$, which corresponds approximately
to equation (→13.13).

Figure 13.3 Geometric interpretation of the derivative.
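The approximation (→13.13) can be tried out directly. The sketch below, in base R, uses f(x) = sin(x) at $x_0 = 1$ as an assumed example, since its exact derivative cos(1) is known and the error of the difference quotient can be tabulated.

```r
f  <- function(x) sin(x)
x0 <- 1
h  <- 10^-(1:8)

slope <- (f(x0 + h) - f(x0)) / h         # slope of the secant line for each h
cbind(h, slope, error = abs(slope - cos(x0)))
```

The error first shrinks roughly in proportion to h; for very small h, rounding errors in the subtraction start to dominate, so smaller is not always better.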

The above approximation can be extended to multivariate real functions as follows: Let f be a
scalar valued multivariable real function, i. e., $f : \mathbb{R}^n \to \mathbb{R}$. Then, the first-order partial
derivative of f at a point $x = (x_1, x_2, \ldots, x_n)$ with respect to the variable $x_i$, generally denoted
$\frac{\partial f(x)}{\partial x_i}$ or $\partial_{x_i} f(x)$, is defined as follows:

$$\frac{\partial f(x)}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, x_2, \ldots, x_i + h, \ldots, x_n) - f(x_1, x_2, \ldots, x_i, \ldots, x_n)}{h}. \tag{13.14}$$
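Definition 13.3.1.
The differential of f is given by

$$df = \frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2 + \cdots + \frac{\partial f}{\partial x_i} dx_i + \cdots + \frac{\partial f}{\partial x_n} dx_n.$$

Definition 13.3.2.
The gradient of a function f, denoted ∇f, is defined, in Cartesian coordinates, as follows:

$$\nabla f = \frac{\partial f}{\partial x_1} e_1 + \frac{\partial f}{\partial x_2} e_2 + \cdots + \frac{\partial f}{\partial x_i} e_i + \cdots + \frac{\partial f}{\partial x_n} e_n,$$

where the $e_k$, k = 1, …, n, are the orthogonal unit vectors pointing in the coordinate directions.
Thus, in $(\mathbb{R}^n, \|\cdot\|_2)$, where $\|x\|_2 = \sqrt{\langle x, x \rangle}$ is the Euclidean norm, the gradient of f can be
rewritten as follows:

$$\nabla f = df^T = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)^T.$$

When f is a function of a single variable, i. e., n = 1, then $\nabla f = f'$.
The orthogonal unit vectors $e_i$ are explicitly given by

$$e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T \tag{13.15}$$

with a 1 at the i-th component and zeros otherwise.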

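Definition 13.3.3.
The Hessian of f, denoted $\nabla^2 f$, is an n × n matrix of second-order partial derivatives of f, if these
exist, organized as follows:

$$\nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{bmatrix}.$$

Thus, the Hessian matrix describes the local curvature of the function f.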

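Example 13.3.1.
Let $f : \mathbb{R}^3 \to \mathbb{R}$ be defined by $f(x) = f(x_1, x_2, x_3) = x_1^3 + x_2^2 + \log(x_3)$. Then,

$$\nabla f(x) = \left( 3x_1^2, \; 2x_2, \; \frac{1}{x_3} \right)^T,$$

and

$$\nabla^2 f(x) = \begin{pmatrix} 6x_1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -\frac{1}{x_3^2} \end{pmatrix}.$$

At the point $\bar{x} = (1, 1/2, 1)$, we have $\nabla f(\bar{x}) = (3, 1, 1)^T$ and

$$\nabla^2 f(\bar{x}) = \begin{pmatrix} 6 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -1 \end{pmatrix}.$$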
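The values in this example can be cross-checked with finite differences. The sketch below uses base R only; the helper names num_grad and num_hessian and the step sizes are assumptions, and central differences are used as one simple choice among several possible schemes.

```r
f <- function(x) x[1]^3 + x[2]^2 + log(x[3])

# Hypothetical helper: central-difference approximation of the gradient
num_grad <- function(f, x, h = 1e-5) {
  sapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, 1)     # i-th unit vector
    (f(x + h * e) - f(x - h * e)) / (2 * h)
  })
}

# Hypothetical helper: central-difference approximation of the Hessian
num_hessian <- function(f, x, h = 1e-4) {
  n <- length(x)
  H <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) {
    ei <- replace(numeric(n), i, 1)
    ej <- replace(numeric(n), j, 1)
    H[i, j] <- (f(x + h * ei + h * ej) - f(x + h * ei - h * ej) -
                f(x - h * ei + h * ej) + f(x - h * ei - h * ej)) / (4 * h^2)
  }
  H
}

x_bar <- c(1, 1/2, 1)
g_num <- num_grad(f, x_bar)
H_num <- num_hessian(f, x_bar)

round(g_num, 6)   # compare with (3, 1, 1)
round(H_num, 4)   # compare with diag(6, 2, -1)
```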

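Definition 13.3.4.
Let f be a vector-valued function, i. e., $f : \mathbb{R}^n \to \mathbb{R}^m$. Then, the Jacobian of f, denoted $J_f$, is an
m × n matrix of the first-order partial derivatives, if they exist, of the m real-valued component
functions $(f_1, f_2, \ldots, f_m)$ of f, organized as follows:

$$J_f = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.$$

The Jacobian generalizes the gradient of a scalar-valued function of several variables to m
real-valued component functions. Therefore, the Jacobian for a scalar-valued multivariable function, i.
e., when m = 1, is the (transposed) gradient.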

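Example 13.3.2.
Let $f : \mathbb{R}^3 \to \mathbb{R}^2$ be defined by

$$f(x) = f(x_1, x_2, x_3) = \begin{pmatrix} f_1(x_1, x_2, x_3) \\ f_2(x_1, x_2, x_3) \end{pmatrix} = \begin{pmatrix} x_1 x_2^2 \\ x_3^2 + 2x_1 x_2 \end{pmatrix}.$$

Then,

$$J_f(x) = \begin{bmatrix} x_2^2 & 2x_1 x_2 & 0 \\ 2x_2 & 2x_1 & 2x_3 \end{bmatrix}.$$

At the point $\bar{x} = (1, 1/2, 1)$, we have

$$J_f(\bar{x}) = \begin{bmatrix} 1/4 & 1 & 0 \\ 1 & 2 & 2 \end{bmatrix}.$$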
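As a cross-check of this example, the Jacobian can be approximated column by column with central differences. The helper name num_jacobian and the step size are assumptions; only base R is used.

```r
# Component functions f1 = x1 * x2^2 and f2 = x3^2 + 2 * x1 * x2
f <- function(x) c(x[1] * x[2]^2, x[3]^2 + 2 * x[1] * x[2])

# Hypothetical helper: column j holds the partial derivatives with respect to x_j
num_jacobian <- function(f, x, h = 1e-6) {
  sapply(seq_along(x), function(j) {
    e <- replace(numeric(length(x)), j, 1)     # j-th unit vector
    (f(x + h * e) - f(x - h * e)) / (2 * h)
  })
}

x_bar <- c(1, 1/2, 1)
J_num <- num_jacobian(f, x_bar)
round(J_num, 6)    # compare with rows (1/4, 1, 0) and (1, 2, 2)
```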

Using R, the gradient of a function f at a point x is computed using the command grad(f, x).
Since the gradient of a function of a single variable is nothing but the first derivative of the
function, then the same command is used to compute the derivative of a function of one variable
at a given point. By contrast, the Hessian and the Jacobian of a function f at a point x are
computed using the command hessian(f, x) and jacobian(f, x), respectively.
For example, the gradient, the Hessian, and the Jacobian of the following function
f (x, y, z) = x y+ sin (z) at the point (x = 2, y = 2, z = 5) can be computed using R as
2

follows:

Let us consider the following example, from economics, for determining extreme values of
economic functions [→182] (see also Example →13.8.1). In this example, the economic functions
of interest are real polynomials [→135].

Example 13.3.3.
Let

$$C(x) = \frac{1}{3} x^3 - \frac{3}{2} x^2 + 7 \tag{13.16}$$

be an economic cost function [→182] describing the costs depending on a quantity unit x. To find
the minima of C(x), we use its derivative, i. e.,

$$C'(x) = x^2 - 3x. \tag{13.17}$$

By solving

$$C'(x) = x^2 - 3x = 0, \tag{13.18}$$

we find x = 0, and x = 3. To check whether x = 3 corresponds to a minimum or maximum, we
determine

$$C''(x) = 2x - 3. \tag{13.19}$$

This yields $C''(3) = 3 > 0$. Hence, we found a minimum of C(x) at x = 3. This can also be
observed, graphically, in →Figure 13.4.
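The same result can be reproduced numerically. The sketch below uses base R's optimize(); the search interval [1, 5] is an assumption, chosen so that it contains the critical point x = 3 and excludes x = 0.

```r
C   <- function(x) x^3 / 3 - 3 * x^2 / 2 + 7   # cost function (13.16)
dC  <- function(x) x^2 - 3 * x                 # first derivative (13.17)
d2C <- function(x) 2 * x - 3                   # second derivative (13.19)

crit <- c(0, 3)      # roots of C'(x) = x (x - 3)
dC(crit)             # both values are 0
d2C(crit)            # negative at x = 0 (maximum), positive at x = 3 (minimum)

opt <- optimize(C, interval = c(1, 5))
opt$minimum          # close to 3
opt$objective        # close to C(3) = 2.5
```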


In the following section, we provide some formal definitions of extrema of a function.

Figure 13.4 An example of an economic cost function C(x) with its minimum located at x = 3 .

13.4 Extrema of a function


The extrema of a function refer to the largest (i. e., maximum) and smallest (i. e., minimum) values
of a function, either on its entire domain (global or absolute extrema) or within a given range
(local extrema). Therefore, we can distinguish four types of extrema: global maxima, global
minima, local maxima, and local minima. These are illustrated in →Figure 13.5.
In the following, we provide mathematical definitions for the four extrema:
Figure 13.5 The four different types of extrema of a function.

Definition 13.4.1.
Let D denote the domain of a function f. A point $x^* \in D$ is called a global maximum of f if
$f(x^*) \ge f(x)$ for all x ∈ D. Then $f(x^*)$ is referred to as the maximum value of f in D.

Definition 13.4.2.
Let D denote the domain of a function f. A point $x^* \in D$ is called a global minimum of f if
$f(x^*) \le f(x)$ for all x ∈ D. Then $f(x^*)$ is referred to as the minimum value of f in D.

Definition 13.4.3.
Let D denote the domain of a function f, and let J ⊂ D. A point $x^* \in J$ is called a local
maximum of f if $f(x^*) \ge f(x)$ for all x ∈ J. Then $f(x^*)$ is referred to as the maximum value of
f in J.

Definition 13.4.4.
Let D denote the domain of a function f, and let J ⊂ D. A point $x^* \in J$ is called a local
minimum of f if $f(x^*) \le f(x)$ for all x ∈ J. Then $f(x^*)$ is referred to as the minimum value of f
in J.
To characterize extrema of continuous functions, we invoke the well-known Weierstrass
extreme value theorem.
Theorem 13.4.1 (Weierstrass extreme value theorem).
Let D denote the domain of a function f, and let J = [a, b] ⊂ D. If f is continuous on J, then f
achieves both its maximum value, denoted M, and its minimum value, denoted m. In other words, there
exist $x_M$ and $x_m$ in J such that $f(x_M) = M$ and $f(x_m) = m$, and
m ≤ f(x) ≤ M for all x ∈ J.

We want to emphasize that extrema of differentiable functions in the interior of the domain possess
horizontal tangents. These extrema can be calculated using basic calculus. Suppose that we have a
real, twice differentiable function f on a domain D. The points x ∈ D, satisfying the equation
$f'(x) = 0$, are candidates for extrema (maximum or minimum). In Section →13.3, we explained that
the first derivative of a function at a point x corresponds to the slope of the tangent at x.
Therefore, after solving the equation $f'(x) = 0$ for a candidate $x_0$ and after calculating $f''(x_0)$,
it is necessary to distinguish the following cases:

$f''(x_0) > 0 \implies$ f(x) has a minimum at $x_0 \in D$
$f''(x_0) < 0 \implies$ f(x) has a maximum at $x_0 \in D$
$f'(x_0) = 0$, $f''(x_0) = 0$ and $f'''(x_0) \neq 0 \implies$ f(x) has a saddle point at $x_0 \in D$

For a numerical solution to this problem, the package ggpmisc in R can be used to find extrema of a
function, as illustrated in Listing 13.2. The plot from the output of the script is shown in →Figure
13.6, where the colored dots correspond to the different extrema of the function
f (x) = 23.279 − 29.3598 exp (−0.00093393x) sin (0.00917552x + 20.515),

for x ∈ [0, 1500] .
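Without relying on that package, a rough grid-based search for the local extrema of this function can be sketched in base R. The grid spacing of 0.5 is an assumption, and the located positions are accurate only up to that spacing.

```r
f <- function(x) {
  23.279 - 29.3598 * exp(-0.00093393 * x) * sin(0.00917552 * x + 20.515)
}

x <- seq(0, 1500, by = 0.5)
y <- f(x)

s      <- sign(diff(y))            # sign of the slope between neighboring grid points
change <- which(diff(s) != 0)      # positions where the slope changes sign
idx    <- change + 1               # grid index of the corresponding extremum

extrema <- data.frame(
  x    = x[idx],
  y    = y[idx],
  type = ifelse(diff(s)[change] < 0, "maximum", "minimum")
)
extrema
```

A slope that switches from positive to negative marks a local maximum, and the reverse marks a local minimum, which mirrors the second-derivative criterion above.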


Figure 13.6 Finding extrema of a function using R, see Listing 13.2.

13.5 Taylor series expansion


A Taylor series expansion is an expression that approximates a smooth function f (x) in the
neighborhood of a certain point x = x . In simple terms, this approximation breaks the
0

nonlinearity of a function down into its polynomial components. This yields a function that is more
linear than f (x) . The simplest, yet most frequently used approximation, is the linearization of a
function. Taylor series expansions have many applications in mathematics, physics, and
engineering. For instance, they are used to approximate solutions to differential equations, which
are otherwise difficult to solve.

Definition 13.5.1.
A one-dimensional Taylor series expansion of an infinitely differentiable function f(x), at a point
$x = x_0$, is given by

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n = f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \frac{f^{(3)}(x_0)}{3!}(x - x_0)^3 + \cdots$$

where $f^{(n)}$ denotes the n-th derivative of f.
If $x_0 = 0$, then the expansion may also be called a Maclaurin series.
Below, we provide examples of Taylor series expansions for some common functions, at a point
$x = x_0$:

$$\exp(x) = \exp(x_0) \left[ 1 + (x - x_0) + \frac{1}{2}(x - x_0)^2 + \frac{1}{6}(x - x_0)^3 + \cdots \right]$$

$$\ln(x) = \ln(x_0) + \frac{x - x_0}{x_0} - \frac{(x - x_0)^2}{2x_0^2} + \frac{(x - x_0)^3}{3x_0^3} - \cdots$$

$$\cos(x) = \cos(x_0) - \sin(x_0)(x - x_0) - \frac{1}{2}\cos(x_0)(x - x_0)^2 + \frac{1}{6}\sin(x_0)(x - x_0)^3 + \cdots$$

$$\sin(x) = \sin(x_0) + \cos(x_0)(x - x_0) - \frac{1}{2}\sin(x_0)(x - x_0)^2 - \frac{1}{6}\cos(x_0)(x - x_0)^3 + \cdots$$

The accuracy of a Taylor series expansion depends on the function to be approximated,
the point at which the approximation is made, and the number of terms used in the
approximation, as illustrated in →Figure 13.7 and →Figure 13.8.
Several packages in R can be used to obtain the Taylor series expansion of a function. For
instance, the library Ryacas can be used to obtain the expression of the Taylor series expansion of
a function, which can then be evaluated. The library pracma, on the other hand, provides an
approximation of the function at a given point using its corresponding Taylor series expansion.
The scripts below illustrate the usage of these two packages.
→Figure 13.7, produced using Listing 13.4, shows the graph of the function f(x) = exp(x)
alongside its corresponding Taylor approximation of order n = 5, for x ∈ [−1, 1]. It is clear that
this Taylor series approximation of the function f(x) is quite accurate for x ∈ [−1, 1], since the
graphs of both functions match in this interval.
Figure 13.7 Taylor series approximation of the function f(x) = exp(x). The approximation
order is n = 5.

On the other hand, →Figure 13.8, produced using Listing 13.5, shows the graph of the function
$f(x) = \frac{1}{1-x}$ alongside its corresponding Taylor approximation of order n = 5, for x ∈ [−1, 1].
The Taylor series approximation of the function f(x) is accurate on most of the interval
x ∈ [−1, 1], except near 1, where the function f(x) and its Taylor approximation diverge. In
fact, when x tends to 1, f(x) tends to ∞, and the corresponding Taylor series approximation
cannot keep pace with the growth of the function f(x).
Figure 13.8 Taylor series approximation of the function $f(x) = \frac{1}{1-x}$. The approximation order is
n = 5.
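The behavior described for both functions can be reproduced without additional packages by summing the Taylor polynomials directly. In the base R sketch below, the helper names taylor_exp and taylor_geom are assumptions, and both expansions are taken at $x_0 = 0$.

```r
# Hypothetical helper: Taylor polynomial of exp(x) of order n around x0
taylor_exp <- function(x, x0 = 0, n = 5) {
  k <- 0:n
  sapply(x, function(xi) sum(exp(x0) * (xi - x0)^k / factorial(k)))
}

# Hypothetical helper: Maclaurin polynomial of 1/(1 - x), i.e., 1 + x + ... + x^n
taylor_geom <- function(x, n = 5) {
  sapply(x, function(xi) sum(xi^(0:n)))
}

x <- seq(-1, 0.9, by = 0.1)

err_exp  <- abs(exp(x) - taylor_exp(x))
err_geom <- abs(1 / (1 - x) - taylor_geom(x))

max(err_exp)                                   # small on the whole grid
cbind(x, err_geom)[x > 0.5, ]                  # error grows quickly as x approaches 1
```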
13.6 Integrals
The integral of a function f(x) over the interval [a, b], denoted $\int_a^b f(x)\,dx$, is given by the area
between the graph of f(x) and the line f(x) = 0, where a ≤ x ≤ b.

Definition 13.6.1 (Definite integral).
The definite integral of a function f(x) from a to b is denoted

$$\int_a^b f(x)\,dx.$$

Definition 13.6.2 (Indefinite integral).
The indefinite integral of a function f(x) is a function F(x) such that its derivative is f(x), i. e.,
$F'(x) = f(x)$, and it is denoted

$$\int f(x)\,dx = F(x) + C.$$

The function F is also referred to as the antiderivative of f, whereas C is called the integration
constant.
Theorem 13.6.1 (Uniqueness Theorem).
If two functions, F and G, are antiderivatives of a function f on an interval I, then there exists a constant
C such that

$$F(x) = G(x) + C.$$

This result justifies the integration constant C for the indefinite integral.
Theorem 13.6.2 (First fundamental theorem of calculus).
Let f be a bounded function on the interval [a, b] and continuous on (a, b). Then, the function

$$F(x) = \int_a^x f(z)\,dz, \quad a \le x \le b,$$

has a derivative at each point in (a, b) and

$$F'(x) = f(x), \quad a < x < b.$$

Theorem 13.6.3 (Second fundamental theorem of calculus).
Let f be a bounded function on the interval [a, b] and continuous on (a, b). Let F be a continuous
function on [a, b] such that $F'(x) = f(x)$ on (a, b). Then,

$$\int_a^b f(x)\,dx = F(b) - F(a).$$

The results from the above theorems demonstrate that differentiation is simply the
inverse of integration.

13.6.1 Properties of definite integrals


The following properties are useful for evaluating integrals.
1. Order of integration: $\int_a^b f(x)\,dx = -\int_b^a f(x)\,dx$.
2. Zero width interval: $\int_a^a f(x)\,dx = 0$.
3. Constant multiple: $\int_a^b k f(x)\,dx = k \int_a^b f(x)\,dx$.
4. Sum and difference: $\int_a^b (f(x) \pm g(x))\,dx = \int_a^b f(x)\,dx \pm \int_a^b g(x)\,dx$.
5. Additivity: $\int_a^b f(x)\,dx + \int_b^c f(x)\,dx = \int_a^c f(x)\,dx$.

13.6.2 Numerical integration


The antiderivative, F (x) , is not always easy to obtain analytically. Therefore, the integral is often
approximated numerically. The numerical estimation is generally carried out as follows: The
interval $[a, b]$ is subdivided into $n \in \mathbb{N}$ subintervals. Let $\Delta x_i = x_{i+1} - x_i$ denote the length of
the $i$th subinterval, $i = 1, 2, 3, \ldots, n$, and let $\tilde{x}_i$ be a value in the subinterval $[x_i, x_{i+1}]$. Then,

$$\int_a^b f(x)\,dx \approx \sum_{i=1}^{n} f(\tilde{x}_i)\,\Delta x_i. \tag{13.20}$$

The right-hand side of equation (→13.20) is known as the Riemann sum. When n tends to ∞, then $\Delta x_i$
tends to 0, for all $i = 1, 2, 3, \ldots, n$, and, consequently, the Riemann sum tends toward the exact
value of the integral of $f(x)$ over the interval $[a, b]$, as illustrated in →Figure 13.9.
Figure 13.9 Geometric interpretation of the integral. Left: Exact form of an integral. Right:
Numerical approximation.
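The Riemann sum in equation (→13.20) can be sketched in a few lines of base R. The function name riemann_sum, the choice of the midpoint as the evaluation point in each subinterval, and the test function f(x) = x² on [0, 1] are assumptions made here for illustration.

```r
# Midpoint Riemann sum of f over [a, b] with n equal subintervals.
# The midpoint is one admissible choice of evaluation point in each subinterval.
riemann_sum <- function(f, a, b, n) {
  x  <- seq(a, b, length.out = n + 1)   # grid points x_1, ..., x_{n+1}
  dx <- diff(x)                         # subinterval lengths
  xm <- (x[-1] + x[-(n + 1)]) / 2       # midpoints of the subintervals
  sum(f(xm) * dx)
}

f <- function(x) x^2                    # illustrative integrand
exact       <- 1 / 3                    # exact value of the integral on [0, 1]
approx_10   <- riemann_sum(f, 0, 1, 10)
approx_1000 <- riemann_sum(f, 0, 1, 1000)

# The error shrinks as the number of subintervals grows
cat("error with n = 10:  ", abs(approx_10 - exact), "\n")
cat("error with n = 1000:", abs(approx_1000 - exact), "\n")
```

Increasing n reduces the error, which mirrors the limit argument in the text.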

Using R, a one-dimensional integral over a finite or infinite interval is computed using the
command integrate(f, lowerLimit, upperLimit), where f is the function to be integrated,
lowerLimit and upperLimit are the lower and upper limits of the integral, respectively.
The integral $\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-5)^2}{2}}\,dx$ can be computed as follows:
Using R, an n-fold integral over a finite or infinite interval is computed using the command
adaptIntegrate(f, lowerLimit, upperLimit) from the package cubature.
The integral $\int_0^3 \int_1^5 \int_{-2}^{-1} \frac{5}{2} \sin(x) \cos(yz)\,dx\,dy\,dz$ can be computed as follows:
13.7 Polynomial interpolation
In many applications, results of experimental measurements are available in the form of
discrete data sets. However, efficient exploitation of these data requires their synthetic
representation by means of elementary (continuous) functions. Such an approximation, also
termed data fitting, is the process of finding a function, generally a polynomial, whose graph will
pass through a given set of data points.
Let $(x_i, y_i)$, $i = 0, \ldots, m$, be $m + 1$ given pairs of data. Then, the problem of interest is to find a
polynomial function of degree n, $P_n(x)$, such that $P_n(x_i) = y_i$ for $i = 0, \ldots, m$, i. e.,

$$P_n(x_i) = a_n x_i^n + a_{n-1} x_i^{n-1} + \cdots + a_1 x_i + a_0 = y_i, \quad i = 0, \ldots, m. \tag{13.21}$$

Note that this approach was developed by Lagrange, and the resulting interpolation polynomial is
referred to as the Lagrange polynomial [→99], [→131]. When n = 1 and n = 2 , the process is
called a linear interpolation and quadratic interpolation, respectively.
Let us consider the following data points:

xi 1 2 3 4 5 6 7 8 9 10
yi −1.05 0.25 1.08 −0.02 −0.27 0.79 −1.02 −0.17 0.97 2.06

Using R, the Lagrange polynomial interpolation for the above pairs of data points (x, y) can
be carried out using Listing 13.8. In →Figure 13.10 (left), which is an output of Listing 13.8, the
interpolation points are shown as dots, whereas the corresponding Lagrange polynomial is
represented by the solid line.
Figure 13.10 Left: Polynomial interpolation of the data points in blue. Right: Roots of the
interpolation polynomial.
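The interpolation of the data points above can be sketched in base R by evaluating the Lagrange form of the polynomial directly. The helper lagrange_eval is a hypothetical name introduced here, and this direct evaluation is not necessarily the approach used in Listing 13.8.

```r
x <- 1:10
y <- c(-1.05, 0.25, 1.08, -0.02, -0.27, 0.79, -1.02, -0.17, 0.97, 2.06)

# Evaluate the Lagrange interpolation polynomial at the points in xnew:
# P(t) = sum_i y_i * prod_{j != i} (t - x_j) / (x_i - x_j)
lagrange_eval <- function(xnew, x, y) {
  sapply(xnew, function(t) {
    terms <- sapply(seq_along(x), function(i) {
      y[i] * prod((t - x[-i]) / (x[i] - x[-i]))
    })
    sum(terms)
  })
}

xs <- seq(1, 10, by = 0.05)             # fine grid for plotting
ps <- lagrange_eval(xs, x, y)

plot(xs, ps, type = "l", xlab = "x", ylab = "P(x)")
points(x, y, pch = 19)                  # interpolation points
```

By construction, the polynomial passes through every data point, which can be checked with lagrange_eval(x, x, y).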

13.8 Root finding methods


One of the fundamental problems in applied mathematics concerns the identification of roots of
complex and real functions [→135]. Given a function f, a root of f is a value of x such that
f (x) = 0 . In this case, x is also called a zero of f. In cases where f is considered to be an algebraic

polynomial with real or complex-valued coefficients, established results are available to determine
the roots analytically by closed expressions. Let

$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0 \tag{13.22}$$

be a real polynomial, i. e., its coefficients are real numbers, and n is the degree of this polynomial.
Then we write $\deg(f(x)) = n$.
If $n = 2$,

$$f(x) = a_2 x^2 + a_1 x + a_0 = 0 \tag{13.23}$$

yields the following (see [→135] for more detail):

$$x_{1,2} = \frac{-a_1 \pm \sqrt{a_1^2 - 4 a_2 a_0}}{2 a_2}. \tag{13.24}$$

For $n = 3$,

$$f(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0 = 0 \tag{13.25}$$

leads to the formulas due to Cardano [→135]. For $n = 4$, analytical
expressions are also known. In general, the well-known theorem due to Abel and Ruffini [→184]
states that general polynomials with $\deg(f(x)) \ge 5$ are not solvable by radicals. Radicals are nth-root
expressions that depend on the polynomial coefficients. Another classical theorem proves
the existence of a zero of a continuous function.
Theorem 13.8.1 (Intermediate value theorem).
Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function, and let $a, b \in \mathbb{R}$ with $a < b$ be such that $f(a)$ and
$f(b)$ are nonzero and of opposite signs. Then, there exists $x^*$ with $a < x^* < b$ such that $f(x^*) = 0$.

Using R, the root(s) of a function, within a specified interval, can be obtained via the package
rootSolve.

Let us consider the following function:

$$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7 + a_8 x^8 + a_9 x^9, \tag{13.26}$$

where $a_0 = -229$, $a_1 = 641.943$, $a_2 = -728.7627$, $a_3 = 445.0133$, $a_4 = -162.3738$,
$a_5 = 36.9856$, $a_6 = -5.29601$, $a_7 = 0.4626149$, $a_8 = -0.02249529$, $a_9 = 0.0004662423$.

Using the function uniroot.all from the package rootSolve, the root(s) of the function (→13.26)
within the interval [1,10] can be obtained using Listing 13.9. In →Figure 13.10 (right), which is an
output of Listing 13.9, the function f (x) is represented by the solid line, whereas its
corresponding roots in the interval [0,10] are shown as dots. Obviously, all the roots lie on the
horizontal line f (x) = 0 .
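The idea of locating all roots in an interval can be sketched with base R alone: scan a grid for sign changes, which by the intermediate value theorem bracket a root, and refine each bracket with uniroot(). The helper find_roots and the test function cos(x) on [0, 10] are assumptions made here; this is a simplified stand-in for illustration, not the implementation of uniroot.all.

```r
# Scan [lower, upper] on a grid, keep the subintervals on which f changes
# sign, and refine each bracket with uniroot().
find_roots <- function(f, lower, upper, n = 200) {
  grid <- seq(lower, upper, length.out = n + 1)
  vals <- f(grid)
  idx  <- which(vals[-1] * vals[-(n + 1)] < 0)   # sign change on [grid[i], grid[i+1]]
  sapply(idx, function(i) {
    uniroot(f, c(grid[i], grid[i + 1]), tol = 1e-10)$root
  })
}

roots <- find_roots(cos, 0, 10)   # zeros of cos(x) in [0, 10]
print(roots)
print(cos(roots))                 # function values at the roots, close to 0
```

A grid scan of this kind can miss roots of even multiplicity and pairs of roots closer together than the grid spacing.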

Example 13.8.1.
In economics, for example, root-finding methods and basic derivatives find frequent application
[→182]. For instance, these are used to explore profit and revenue functions (see [→182]).
Generally, the revenue function R(x) and the profit function P (x) are defined by
R(x) = px, (13.27)

and
P (x) = R(x) − C(x), (13.28)

respectively [→182]. Here, x is a unit of quantity, p is the sales price per unit of quantity, and C(x)
is a cost function. The unit of quantity x is the variable and p is a parameter, i. e., a fixed number.
Suppose that we have a specific profit function defined by

$$P(x) = -\frac{1}{10} x^2 + 50x - 480. \tag{13.29}$$

This profit function $P(x)$ is shown in →Figure 13.11, and to find its maximum, we need to find the
zeros of its derivative:

$$P'(x) = -\frac{2}{10} x + 50 = 0. \tag{13.30}$$

From this, we find $x = 250$. This value is the maximizing unit of quantity for $P(x)$, and the
corresponding maximum profit is $P(250) = 5770$. To find the so-called break-even points, it is necessary to identify the zeros
of $P(x)$, i. e.,

$$P(x) = -\frac{1}{10} x^2 + 50x - 480 = 0. \tag{13.31}$$

Between the two zeros of $P(x)$, we make a profit. Outside this interval, we make a loss.
Multiplying equation (→13.31) by −10 gives

$$x^2 - 500x + 4800 = 0, \tag{13.32}$$

which yields

$$x_{1,2} = \frac{500}{2} \pm \sqrt{\left(\frac{500}{2}\right)^2 - 4800}. \tag{13.33}$$

Specifically, $x_1 = 490.21$ and $x_2 = 9.79$. This means that the profit interval of $P(x)$
corresponds to [9.79, 490.21]. Graphically, this is also evident in →Figure 13.11.
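The computations of this example can be reproduced with base R, as sketched below. The search interval [0, 500] passed to optimize() and the variable names are assumptions made here for illustration.

```r
# Profit function from equation (13.29)
P <- function(x) -x^2 / 10 + 50 * x - 480

# Maximizer of P on a search interval (the interval [0, 500] is an assumption)
opt <- optimize(P, interval = c(0, 500), maximum = TRUE)

# Break-even points: zeros of P. polyroot() expects the coefficients in
# increasing order of the power of x.
be <- sort(Re(polyroot(c(-480, 50, -1 / 10))))

cat("maximizing quantity:", opt$maximum, "\n")
cat("maximum profit:     ", opt$objective, "\n")
cat("break-even points:  ", be, "\n")
```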


When the zeros of polynomials cannot be calculated explicitly, techniques for estimating bounds
are required. For example, various bounds have been proven for real zeros (positive and
negative), as well as for complex zeros [→50], [→135], [→155]. For the latter case, bounds for the
moduli of the zeros of a given polynomial are sought. Zero bounds have proven useful in cases where the
degree of a given polynomial is large and, therefore, numerical techniques may fail.
Figure 13.11 An example of a profit function. The profit interval of the profit function is its
positive part between the zeros 9.79 and 490.21.

The following well-known theorems are attributed to Cauchy [→155]:


Theorem 13.8.2.
Let

$$f(z) = a_n z^n + a_{n-1} z^{n-1} + \cdots + a_0, \quad a_n \ne 0, \quad a_k \in \mathbb{C}, \quad k = 0, 1, \ldots, n, \tag{13.34}$$

be a complex polynomial. All the zeros of $f(z)$ lie in the closed disk $|z| \le \rho$. Here, ρ is the positive root
of another equation, namely

$$|a_0| + |a_1| z + \cdots + |a_{n-1}| z^{n-1} - |a_n| z^n = 0. \tag{13.35}$$

Theorem 13.8.3.
Let $f(z)$ be a complex polynomial given by equation (→13.34). All the zeros of $f(z)$ lie in the closed
disk

$$|z| \le 1 + \max_{0 \le j \le n-1} \left| \frac{a_j}{a_n} \right|. \tag{13.36}$$
Below, we provide some examples, which illustrate the results from these two theorems.

Example 13.8.2.
Let $f(z) := z^3 + 4z^2 + 1000z + 99$ be a polynomial, whose real and complex zeros are as
follows:
z1 = −0.099 , (13.37)

z2 = −1.950 − 31.556i , (13.38)

z3 = −1.950 + 31.556i , (13.39)

|z1 | = 0.099, (13.40)

|z2 | = |z3 | = 31.616. (13.41)

Using Theorem →13.8.2 and Theorem →13.8.3 gives the bounds ρ ≈ 33.73 and 1001,
respectively. Considering that the largest modulus of the zeros of $f(z)$ is $\max_i |z_i| = 31.616$, Theorem
→13.8.2 gives a good result. The bound given by Theorem →13.8.3 is far too loose to be useful for $f(z)$. This
example demonstrates the complexity of the problem of determining zero bounds efficiently (see
[→50], [→135], [→155]).
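The two bounds of this example can be checked numerically with base R, as in the following sketch. The variable names are chosen here for illustration; polyroot() returns the zeros, and uniroot() finds the positive root of equation (→13.35).

```r
a <- c(99, 1000, 4, 1)        # coefficients a_0, ..., a_n in increasing order
n <- length(a) - 1

zeros   <- polyroot(a)        # all complex zeros of f(z)
max_mod <- max(Mod(zeros))    # largest modulus among the zeros

# Bound of Theorem 13.8.3: 1 + max_j |a_j / a_n|
upper <- 1 + max(abs(a[1:n] / a[n + 1]))

# Bound of Theorem 13.8.2: positive root of
# |a_0| + |a_1| z + ... + |a_{n-1}| z^{n-1} - |a_n| z^n = 0
g   <- function(z) sum(abs(a[1:n]) * z^(0:(n - 1))) - abs(a[n + 1]) * z^n
rho <- uniroot(g, c(0, upper), tol = 1e-10)$root

cat("largest modulus:", max_mod, "\n")
cat("bound of Theorem 13.8.2:", rho, "\n")
cat("bound of Theorem 13.8.3:", upper, "\n")
```

The bracket c(0, upper) works here because g(0) > 0 and g is negative at the looser bound.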

13.9 Further reading


One of the best, most thorough, and yet very readable introductions to analysis, at the time of
writing, is [→78]. Another excellent, but more practical textbook is [→148]. Unfortunately, both
books are only available in German or Russian.

13.10 Exercises
1. Evaluate the gradient, the Hessian, and the Jacobian matrix of the following
functions using R:
2 2
f (x, y) = y cos (x ) + √ xy at the point (x = π, y = 5)

sin(x∗y) 2 3
g(x, y, z) = e + z ∗ y cos (x ∗ y) + x ∗ y + √z at the point

(x = 5, y = 3, z = 21)

2. Use R to find the extremum of the function f (x) = 3x , and determine whether it
x

is a minimum or a maximum.
3. Use R to find the global extrema of the function $g(y) = y^3 - 6y^2 - 15y + 100$ in
the interval [−3,6].
4. Use R to find the points, where the function $f(y) = 2y^3 - 3y^2$ achieves its global
minimum and global maximum and the corresponding extreme values.
5. Use R to find the global maximum and minimum of the function
$f(x) = 2x^2 - 4x + 6$ in the interval [−3,6]. Calculate the difference between the
maximal and minimal values of $f(x)$.


6. Use R to find the extrema of the function f (y) = y (y + 1) on the interval
2

3
2

[−1,1].
Find the critical numbers of the function f.
7. Use R to find the Taylor series expansion of the function f (x) =ln (1 + x) . Plot
the graph of the function f and the corresponding Taylor series approximation for
x ∈ [−1; 1] .

8. Evaluate the following integral using R:

$$I_1 = \int_{-\pi}^{\pi} \frac{\sin^3 x}{\cos^2 x + 1}\,dx.$$

9. Use R to find the polynomial interpolation of the following pairs of data points:

xi 1 2 3 4 5 6 7 8 9 10
yi −2.05 0.75 1.8 −0.02 −0.75 1.71 −2.12 −0.25 1.70 3.55

10. Find the real roots of the following functions using R:

$f(x) = 2x - 1$

$g(x) = 23x^2 - 3x - 1$

$h(x) = 23x^8 - 3x^7 + x^4 - x^2 - 20$
14 Differential equations
Differential equations can be seen as applications of the methods from analysis, discussed in the
previous chapter. The general aim of differential equations is to describe the dynamical behavior
of functions [→78]. This dynamical behavior is the result of an equation that contains derivatives
of a function. In this chapter, we introduce ordinary differential equations and partial differential
equations [→3]. We discuss the general properties of such equations and demonstrate specific
solution techniques for selected differential equations, including the heat equation and the wave
equation. Due to the descriptive nature of differential equations, physical laws as well as biological
and economic models are often formulated with such equations.

14.1 Ordinary differential equations (ODE)


Problems from many fields, such as physics, biology, engineering, and economics can be modeled
using ordinary differential equations (ODEs) or systems of ODEs [→3], [→78]. A general
formulation of a first-order ODE problem is given by
$$\frac{dy(t)}{dt} = y'(t) = f(y(t), t, k), \tag{14.1}$$

where t is the independent variable, $y : \mathbb{R} \to \mathbb{R}^n$ is called the state vector, $f : \mathbb{R}^{n+1} \to \mathbb{R}^n$ is
referred to as the vector-valued function, which controls how y changes over t, and k is a vector of
parameters. When $n = 1$, the problem is called a single scalar ODE.
By itself, the ODE problem (→14.1) does not provide a unique solution function $y(t)$. If, in
addition to equation (→14.1), the initial state at $t = t_0$, $y(t_0)$, is known, then the problem is called
an initial value ODE problem. On the other hand, if some conditions are specified at the extremes
(“boundaries”) of the independent variable t, e. g., $y(t_0) = C_1$ and $y(t_{\max}) = C_2$ with $C_1$ and
$C_2$ given, then the problem is called a boundary value ODE problem.


Figure 14.1 Examples of initial value ODE problems. Left: Solutions of the ODE
$y'(t) = \frac{2t}{1+t^2}(y(t) + 1)$. Right: Solutions of the ODE $y'(t) = (y(t))^2 + t^2 - 1$.

14.1.1 Initial value ODE problems


Initial value ODE problems govern the evolution of a system from its initial state $y(t_0) = C$ at $t_0$
onward, and we are seeking a function $y(t)$, which describes the state of the system as a function
of t. Thus, a general formulation of a first-order initial value ODE problem can be written as
follows:

$$\frac{dy(t)}{dt} = y'(t) = f(y(t), t, k) \quad \text{for } t > t_0, \tag{14.2}$$

$$y(t_0) = C, \tag{14.3}$$

where C is given.
Some examples of initial value ODE problems, depicted in →Figure 14.1, illustrate the
evolution of the ODE’s solution, depending on its initial condition.
Equation (→14.2) may represent a system of ODEs, where

$$y(t) = (y_1(t), \ldots, y_n(t))^T \quad \text{and} \quad f(y(t), t, k) = (f_1(y(t), t, k), \ldots, f_n(y(t), t, k))^T,$$

and each entry of $f(y(t), t, k)$ can be a nonlinear function of all the entries of y.
The system (→14.2) is called linear if the function $f(y(t), t, k)$ can be written as follows:

$$f(y(t), t, k) = G(t, k)\,y + h(t, k), \tag{14.4}$$

where $G(t, k) \in \mathbb{R}^{n \times n}$, and $h(t, k) \in \mathbb{R}^n$.
If $G(t, k)$ is constant and $h(t, k) \equiv 0$, then the system (→14.4) is called homogeneous. The
solution to the homogeneous system, $y'(t) = Gy(t)$ with data $y(t_0) = C$, is given by
$y(t) = e^{G(t - t_0)}\,C$, where $e^{G(t - t_0)}$ denotes the matrix exponential.

An ODE’s order is determined by the highest-order derivative of the solution function y(t)
appearing in the ODEs or the systems of ODEs. Higher-order ODEs or systems of ODEs can be
transformed into equivalent first-order systems of ODEs.
Let

$$y^{(n)} = f(t, y, y', y'', \ldots, y^{(n-1)}) \tag{14.5}$$

be an ODE of order n. Then, by making the following substitutions:

$$y_1(t) = y(t), \quad y_2(t) = y'(t), \quad \ldots, \quad y_n(t) = y^{(n-1)}(t), \tag{14.6}$$

Equation (→14.5) can be rewritten in the form of a system of n first-order ODEs as follows:

$$\begin{aligned}
y_1'(t) &= y_2(t), \\
y_2'(t) &= y_3(t), \\
y_3'(t) &= y_4(t), \\
&\;\;\vdots \\
y_n'(t) &= f(t, y_1(t), y_2(t), \ldots, y_n(t)).
\end{aligned}$$
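As a worked illustration of the substitution (→14.6), consider the second-order equation $y''(t) = -\omega^2 y(t)$. This equation and the constant ω are chosen here as an example and are not taken from the text. With n = 2, the substitution gives a system of two first-order ODEs:

```latex
\begin{aligned}
y''(t) &= -\omega^2 y(t), \\
y_1(t) &= y(t), \qquad y_2(t) = y'(t), \\
y_1'(t) &= y_2(t), \\
y_2'(t) &= -\omega^2 y_1(t).
\end{aligned}
```

The state vector $(y_1, y_2)^T$ then satisfies a linear system of the form (→14.4) with a constant matrix G and $h \equiv 0$.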

Analytical solutions to ODEs consist of closed-form formulas, which can be evaluated at any
point t. However, the derivation of such closed-form formulas is generally nontrivial. Thus,
numerical methods are generally used to approximate values of the solution function at a discrete
set of points. Since higher-order ODEs can be reduced to a system of first-order ODEs, most
numerical methods for solving ODEs are designed to solve first-order ODEs.
In R, numerical methods for solving ODE problems are implemented within the package
deSolve, and the function ode() from the package deSolve is dedicated to solving initial value ODE
problems. Further details about solving differential equations using R can be found in [→177] and
[→176].
Let us use the function ode() to solve the following ODE problem:
$$y' = k\,t\,y \tag{14.7}$$

with the initial condition $y(0) = 10$, and where $k = 1/5$.


In R, this problem can be solved using Listing 14.1. →Figure 14.2 (left), which is an output of Listing
14.1, shows the evolution of the solution y(t) to the problem (→14.7), for t ∈ [0, 4] .
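A minimal sketch of how the problem (→14.7) might be set up with ode() from the package deSolve is shown below. The names rhs, times, and out are chosen here for illustration, and the closed-form solution $y(t) = 10\,e^{k t^2 / 2}$, obtained by separation of variables, is used as a check.

```r
library(deSolve)

# Right-hand side of y' = k * t * y in the form expected by ode():
# a function of (t, y, parms) returning a list with the derivative
rhs <- function(t, y, parms) {
  k <- parms[["k"]]
  list(k * t * y)
}

times <- seq(0, 4, by = 0.1)
out <- ode(y = c(y = 10), times = times, func = rhs, parms = c(k = 1 / 5))

# Closed-form solution, used here only to check the numerical result
exact   <- 10 * exp((1 / 5) * times^2 / 2)
max_err <- max(abs(out[, "y"] - exact))
cat("maximum absolute error:", max_err, "\n")

plot(out[, "time"], out[, "y"], type = "l", xlab = "t", ylab = "y(t)")
```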
Let us consider the following system of ODEs:
$$\frac{dy_1}{dt} = k_1 y_2 y_3, \tag{14.8}$$

$$\frac{dy_2}{dt} = k_2 y_1 y_3, \tag{14.9}$$

$$\frac{dy_3}{dt} = k_3 y_1 y_2, \tag{14.10}$$

with the initial conditions $y_1(0) = -1$, $y_2(0) = 0$, $y_3(0) = 1$, and where $k_1$, $k_2$, and $k_3$ are
parameters, with values of 1, −1, and −1/2, respectively.

The system (→14.8)–(→14.10), known as the Euler equations, can be solved in R using Listing
14.2. →Figure 14.2 (center), which is an output of Listing 14.2, shows the evolution of the solution
$(y_1(t), y_2(t), y_3(t))$ to the problem (→14.8)–(→14.10), for $t \in [0, 15]$.
Figure 14.2 Left: Numerical solution of the ODE (→14.7) with the initial condition $y(0) = 10$ and
the parameter $k = 1/5$. Center: Numerical solution of the Euler equations (→14.8)–(→14.10) with
initial conditions $y_1(0) = 1$, $y_2(0) = y_3(0) = 1$. Right: Numerical solution of the boundary value ODE problem
(→14.13) with the boundary conditions $y(-1) = 1/4$, $y(1) = 1/3$.
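The system (→14.8)–(→14.10) might be set up with ode() from deSolve as sketched below. The function name euler_rhs is hypothetical, and the initial conditions are those stated in the text, $y_1(0) = -1$, $y_2(0) = 0$, $y_3(0) = 1$. For $k_1 = 1$ and $k_2 = -1$, the quantity $y_1^2 + y_2^2$ is constant along exact solutions, which gives a simple accuracy check.

```r
library(deSolve)

# Right-hand side of the Euler equations (14.8)-(14.10)
euler_rhs <- function(t, y, parms) {
  y <- unname(y)
  k <- parms
  list(c(k[1] * y[2] * y[3],
         k[2] * y[1] * y[3],
         k[3] * y[1] * y[2]))
}

y0    <- c(y1 = -1, y2 = 0, y3 = 1)     # initial conditions from the text
times <- seq(0, 15, by = 0.05)
out   <- ode(y = y0, times = times, func = euler_rhs, parms = c(1, -1, -0.5))

# d/dt (y1^2 + y2^2) = 2 (k1 + k2) y1 y2 y3 = 0 for k1 = 1, k2 = -1
invariant <- out[, "y1"]^2 + out[, "y2"]^2
cat("largest drift of the invariant:", max(abs(invariant - 1)), "\n")

matplot(out[, "time"], out[, c("y1", "y2", "y3")], type = "l", lty = 1,
        xlab = "t", ylab = "y")
```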

14.1.2 Boundary value ODE problems

A simple formulation of a boundary value ODE problem can be written as follows:

$$\frac{dy(t)}{dt} = y'(t) = f(y(t), t, k) \quad \text{for } t > t_0, \tag{14.11}$$

$$y(t_0) = C_1, \quad y(t_{\max}) = C_2, \tag{14.12}$$

where $C_1$ and $C_2$ are given constants or functions.

In R, the function bvpshoot() from the package bvpSolve is dedicated to solving boundary value
ODE problems.
Let us use the function bvpshoot() to solve the following boundary value ODE problem:

$$y''(t) - 2y^2(t) - 4t\,y(t)\,y'(t) = 0 \quad \text{with } y(-1) = 1/4, \; y(1) = 1/3. \tag{14.13}$$

Since the problem (→14.13) is a second-order ODE problem, it is necessary to write its equivalent
first-order ODE system. Using the substitution (→14.6), the second-order ODE (→14.13) can be
rewritten in the following form:

$$y_1'(t) = y_2(t), \tag{14.14}$$

$$y_2'(t) = 2y_1^2(t) + 4t\,y_1(t)\,y_2(t), \tag{14.15}$$

with the boundary conditions $y_1(-1) = 1/4$ and $y_1(1) = 1/3$.

Then, the problem (→14.14)–(→14.15) can be solved in R using Listing 14.3. →Figure 14.2
(right), which is an output of Listing 14.3, shows the evolution of the solution $(y_1(t), y_2(t))$ to the
problem (→14.14)–(→14.15), for $t \in [-1, 1]$.


14.2 Partial differential equations (PDE)
Partial differential equations (PDEs) arise in many fields of engineering and science. In general,
most physical processes are governed by PDEs [→102]. In many cases, simplifying approximations
are made to reduce the governing PDEs to ODEs or even to algebraic equations. However,
because of the ever-increasing requirement for more accurate modeling of physical processes,
engineers and scientists are increasingly required to solve the actual PDEs that govern the
physical problem being investigated. A PDE is an equation stating a relationship between a
function of two or more independent variables, and the partial derivatives of this function with
respect to these independent variables. For most problems in engineering and science, the
independent variables are either space (x, y, z) or space and time (x, y, z, t) . The dependent
variable, i. e., the function f, depends on the physical problem being modeled.

14.2.1 First-order PDE


A general formulation of a first-order PDE with m independent variables can be written as follows:
$$F(x, u(x), \nabla u(x)) = 0, \tag{14.16}$$

where $x \in \mathbb{R}^m$, $u(x) = u(x_1, x_2, \ldots, x_m)$ is the unknown function, and F is a given function.
A first-order PDE, with two independent variables, $x, y \in \mathbb{R}$ and the dependent variable $u(x, y)$,
is called a first-order quasilinear PDE if it can be written in the following form:

$$f(x, y, u(x, y))\,\frac{\partial}{\partial x} u(x, y) + g(x, y, u(x, y))\,\frac{\partial}{\partial y} u(x, y) = h(x, y, u(x, y)), \tag{14.17}$$

where f, g, and h are given functions.


The equation (→14.17) is said to be
linear, if the functions f, g, and h are independent of the unknown u;
nonlinear, if the functions f, g, and h depend further on the derivatives of the unknown u.

14.2.2 Second-order PDE


The general formulation of a linear second-order PDE, in two independent variables x, y and the
dependent variable u(x, y) , can be written as follows:
$$A \frac{\partial^2 u}{\partial x^2} + B \frac{\partial^2 u}{\partial x \partial y} + C \frac{\partial^2 u}{\partial y^2} + D \frac{\partial u}{\partial x} + E \frac{\partial u}{\partial y} + F u + G = 0, \tag{14.18}$$

where A, B, C, D, E, F, and G are functions of x, y.

If $G = 0$, the equation (→14.18) is said to be homogeneous, and it is nonhomogeneous if
$G \ne 0$.

The PDE (→14.18) can be classified according to the values assumed by A, B, and C at a given point
$(x, y)$. The PDE (→14.18) is called

an elliptic PDE, if $B^2 - 4AC < 0$,

a parabolic PDE, if $B^2 - 4AC = 0$,

a hyperbolic PDE, if $B^2 - 4AC > 0$.
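The classification rule can be written as a small R function. The name classify_pde is hypothetical, and the three calls use constant coefficients that correspond to the prototype equations treated later in this chapter, with the second independent variable playing the role of time where applicable.

```r
# Classify the second-order PDE (14.18) at a point from the values of
# A, B, and C at that point, using the sign of B^2 - 4AC
classify_pde <- function(A, B, C) {
  disc <- B^2 - 4 * A * C
  if (disc < 0) {
    "elliptic"
  } else if (disc == 0) {
    "parabolic"
  } else {
    "hyperbolic"
  }
}

classify_pde(1, 0, 1)     # u_xx + u_yy (Poisson-type operator)
classify_pde(-1, 0, 0)    # u_t - u_xx (heat-type: only one second derivative)
classify_pde(-1, 0, 1)    # u_tt - u_xx (wave-type operator)
```

Since A, B, and C may depend on (x, y), the type of a PDE can change from point to point; the function above classifies a single point.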

14.2.3 Boundary and initial conditions


There are three types of boundary conditions for PDEs. Let R denote a domain and ∂R its
boundary. Furthermore, let n and s denote the coordinates normal (outward) and along the
boundary ∂R , respectively, and let f, g be some functions on the boundary ∂R . Then, the three
boundary conditions, for PDEs, are:
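Dirichlet conditions, when $u = f$ on the boundary $\partial R$,
Neumann conditions, when $\frac{\partial u}{\partial n} = f$ or $\frac{\partial u}{\partial s} = g$ on the boundary $\partial R$,
Mixed (Robin) conditions, when $\frac{\partial u}{\partial n} + ku = f$, $k > 0$, on the boundary $\partial R$.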

Dirichlet conditions can only be applied if the solution is known on the boundary and if the
function f is analytic. These are frequently used for the flow (velocity) into a domain. Neumann
conditions occur more frequently [→102].
14.2.4 Well-posed PDE problems
A mathematical PDE problem is considered well-posed, in the sense of Hadamard, if
the solution exists,
the solution is unique,
the solution depends continuously on the auxiliary data (e. g., boundary and initial
conditions).

Parabolic PDE

In this section, we will illustrate the solution to the heat equation, which is a prototype parabolic
PDE. The heat equation, in a one-dimensional space with zero production and consumption, can
be written as follows:

$$\frac{\partial u(x, t)}{\partial t} - D \frac{\partial^2 u(x, t)}{\partial x^2} = 0, \quad x \in (a, b). \tag{14.19}$$

Let us use R to solve the equation (→14.19) with $a = 0$, $b = 1$, i. e., $x \in [0, 1]$, and the following
boundary and initial conditions:

$$u(x, 0) = \cos\left(\frac{\pi}{2} x\right), \quad u(0, t) = \sin(t), \quad u(1, t) = 0. \tag{14.20}$$

The heat equation (→14.19)–(→14.20) can be solved using Listing 14.4. The corresponding
solution, u(x, t) , is depicted in →Figure 14.3 for color levels (left) and a contour plot (right).
Figure 14.3 Solution to the heat equation in equation (→14.19) with the boundary and initial
conditions provided in equation (→14.20).
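To make the structure of the problem (→14.19)–(→14.20) concrete, the following base R sketch uses an explicit finite-difference scheme (forward in time, centered in space). This scheme is a technique chosen here for illustration and is not necessarily the method used in Listing 14.4. The value D = 0.01, the grid sizes, and the number of time steps are assumptions.

```r
D  <- 0.01                      # diffusion coefficient (assumed value)
nx <- 51                        # number of grid points in x
x  <- seq(0, 1, length.out = nx)
dx <- x[2] - x[1]
dt <- 0.4 * dx^2 / D            # respects the stability limit dt <= dx^2 / (2 D)
nt <- 500                       # number of time steps

u <- cos(pi * x / 2)            # initial condition u(x, 0)
i <- 2:(nx - 1)                 # interior grid points

for (j in seq_len(nt)) {
  u_new <- u
  # explicit update of the interior points
  u_new[i] <- u[i] + D * dt / dx^2 * (u[i + 1] - 2 * u[i] + u[i - 1])
  u_new[1]  <- sin(j * dt)      # boundary condition at x = 0
  u_new[nx] <- 0                # boundary condition at x = 1
  u <- u_new
}

t_end <- nt * dt
plot(x, u, type = "l", xlab = "x", ylab = "u(x, t_end)")
```

The explicit scheme is simple but requires small time steps; implicit schemes or dedicated packages remove this restriction.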
Hyperbolic PDE

A prototype of hyperbolic PDEs is the wave equation, defined as follows:

$$\frac{\partial^2 u}{\partial t^2} = \nabla \cdot (c^2 \nabla u). \tag{14.21}$$

Let us consider the following wave equation in a two-dimensional space:

$$\frac{\partial^2 u(t, x, y)}{\partial t^2} = \gamma_1 \frac{\partial^2 u(t, x, y)}{\partial x^2} + \gamma_2 \frac{\partial^2 u(t, x, y)}{\partial y^2}, \quad x \in (a, b), \; y \in (c, d). \tag{14.22}$$

We use R to solve Equation (→14.22) with $a = c = -4$, $b = d = 4$, $\gamma_1 = \gamma_2 = 1$, and the
following boundary and initial conditions:

$$u(t, x = -4, y) = u(t, x = 4, y) = u(t, x, y = -4) = u(t, x, y = 4) = 0, \tag{14.23}$$

$$\frac{\partial}{\partial t} u(t = 0, x, y) = 0,$$

$$u(t = 0, x, y) = e^{-(x^2 + y^2)}.$$

The wave equation (→14.22)–(→14.23) can be solved using Listing 14.5. The corresponding
solution, u(t, x, y) , is depicted in →Figure 14.4, for t = 0 , t = 1 , t = 2 and t = 3 , respectively.
Figure 14.4 Solution to the wave equation in equation (→14.22) with the boundary and initial
conditions provided in equation (→14.23).

Elliptic PDE

A prototype of elliptic PDEs is Poisson’s equation. Let us use R to solve the following Poisson
equation in a two-dimensional space:

$$\gamma_1 \frac{\partial^2 u(x, y)}{\partial x^2} + \gamma_2 \frac{\partial^2 u(x, y)}{\partial y^2} = x^2 + y^2, \quad x \in (a, b), \; y \in (c, d), \tag{14.24}$$

with $a = c = 0$, $b = d = 2$, $\gamma_1 = \gamma_2 = 1$ and the following boundary conditions:

$$u(x = 0, y) = \sin(y), \quad u(x = 2, y) = 1, \tag{14.25}$$

$$u(x, y = 0) = \cos(x), \quad u(x, y = 2) = 1.$$


Poisson’s equation (→14.24)–(→14.25) can be solved using Listing 14.6. The corresponding
solution, $u(x, y)$, is depicted in →Figure 14.5 for color levels (left) and a contour plot (right).

Figure 14.5 Solution to Poisson’s equation in equation (→14.24) with the boundary
conditions provided in equation (→14.25).
14.3 Exercises
Use R to solve the following differential equations:
1. Solve the heat equation (→14.19) with $a = 0$, $b = 1$, i. e., $x \in [0, 1]$, and the
following boundary and initial conditions:
(a) $u(x, 0) = 6 \sin\left(\frac{\pi x}{2}\right)$, $u(0, t) = \cos(t)$, $u(1, t) = 0$.
(b) $u(x, 0) = 12 \sin\left(\frac{9 \pi x}{5}\right) - 7 \sin\left(\frac{4 \pi x}{3}\right)$, $u(0, t) = \cos(\pi t)$, $u(1, t) = 0$.

2. Solve the wave equation (→14.22) with $a = c = -4$, $b = d = 4$, $\gamma_1 = \gamma_2 = 1$,
and the following boundary and initial conditions:
(a) $u(t, x = -4, y) = u(t, x = 4, y) = u(t, x, y = -4) = u(t, x, y = 4) = 0$,
$\frac{\partial}{\partial t} u(t = 0, x, y) = 0$,
$u(t = 0, x, y) = e^{-(2x^3 + 3y^2)}$.
(b) $u(t, x = -4, y) = u(t, x = 4, y) = u(t, x, y = -4) = u(t, x, y = 4) = 0$,
$\frac{\partial}{\partial t} u(t = 0, x, y) = 0$,
$u(t = 0, x, y) = e^{-(5x^2 + 7y^3)}$.

3. Solve Poisson’s equation (→14.24) with $a = c = 0$, $b = d = 2$,
$\gamma_1 = \gamma_2 = 1$, and the following boundary conditions:
(a) $u(x = 0, y) = \cos(y)$; $u(x = 2, y) = 1$,
$u(x, y = 0) = \sin(x)$; $u(x, y = 2) = 1$.
(b) $u(x = 0, y) = \cos(y) \sin(y)$; $u(x = 2, y) = 1$,
$u(x, y = 0) = \sin(x) \cos(x)$; $u(x, y = 2) = 1$.


15 Dynamical systems
Dynamical systems [→179] may be regarded as specific types of the differential equations
discussed in the previous chapter. Generally speaking, a dynamical system is a model in which a
function describes the time evolution of a point in space. This evolution can be continuous or
discrete, and it can be linear or nonlinear. In this chapter, we discuss general dynamical systems
and define their key characteristics. Then, we discuss some of the most important dynamical
systems that find widespread applications in physics, chemistry, biology, economics and medicine,
including the logistic map, cellular automata or random Boolean networks [→108], [→181]. We
will conclude this chapter with case studies of dynamical system models with complex attractors
[→165].

15.1 Introduction
The theory of dynamical systems can be viewed as the most natural way of describing the
behavior of an integrated system over time [→56], [→109]. In other words, a dynamical system
can be cast as the process by which a sequence of states is generated on the basis of certain
dynamical laws. Generally, this behavior is described through a system of differential equations
describing the rate of change of each variable as a function of the current values of the other
variables influencing the one under consideration. Thus, the system states form a continuous
sequence, which can be formulated as follows. Let $x = (x_1, x_2, \ldots, x_n)$ be a point in $\mathbb{C}^n$ that
defines a curve through time, i. e.,

$$x = x(t) = (x_1(t), x_2(t), \ldots, x_n(t)), \quad -\infty < t < \infty.$$

Suppose that the laws, which describe the rate and direction of the change of $x(t)$, are known
and defined by the following equations:

$$\frac{dx(t)}{dt} = f(x(t)), \quad t \in \mathbb{R}, \; x \in \mathbb{C}^n, \; x(t_0) = x_0, \tag{15.1}$$

where $f(\cdot) = (f_1(x), \ldots, f_n(x))^T$ is a differentiable vector function.

However, when those states form a discrete sequence, a discrete time formulation of the system
(→15.1) can be written as follows:

$$x(k + 1) = f(x(k)), \quad k \in \mathbb{Z}, \; x(k) \in \mathbb{C}^n \; \forall k, \; x(0) = x_0. \tag{15.2}$$

Definition 15.1.1.
A sequence, $x(t)$, is called a dynamical system if it satisfies the set of ordinary differential
equations (→15.1) (respectively (→15.2)) for a given time interval $[t_0, t]$.

Definition 15.1.2.
A curve $C = \{x(t)\}$, which satisfies the equations (→15.1) (respectively (→15.2)), is called the
orbit of the dynamical system $x(t)$.

Definition 15.1.3.
A point $x^* \in \mathbb{C}^n$ is said to be a fixed point, also called a critical point, or a stationary point, if it
satisfies $f(x^*) = 0$.

Definition 15.1.4.
A critical point $x^*$ is said to be stable if every orbit, originating near $x^*$, remains near $x^*$, i. e.,
$\forall\, \varepsilon > 0$, $\exists\, \xi > 0$ such that

$$\|x(0) - x^*\| < \xi \implies \|x(t) - x^*\| \le \varepsilon, \quad \forall\, t > 0.$$

A critical point $x^*$ is said to be asymptotically stable if every orbit, originating sufficiently near $x^*$,
converges to $x^*$ when $t \to +\infty$, i. e., if for some $\xi > 0$, $\|x(0) - x^*\| < \xi$, then
$\|x(t) - x^*\| \to 0$ as $t \to +\infty$.

Definition 15.1.5.
A point $\bar{x} \in \mathbb{C}^n$ is said to be a periodic point for a dynamical system $x(t)$ if $\exists\, k \in \mathbb{N}$ such that
$f^k(\bar{x}) = \bar{x}$ and $f^j(\bar{x}) \ne \bar{x}$ for $j = 1, \ldots, k - 1$. The integer k is called the period of the point $\bar{x}$.

Definition 15.1.6.
An attractor is a minimal set of points $A \subset \mathbb{C}^n$ such that every orbit originating within its
neighborhood converges asymptotically towards the set A. A stable fixed point is an attractor
known as a map sink. A dynamical system may have more than one attractor. The set of states
that lead to an attractor is called the basin of the attractor.

Depending on the form of the functions $f_i$ and the initial conditions $x_0$, in (→15.1) (respectively
(→15.2)), the evolution of a dynamical system can lead to one of the following regimes:
1. steady state: In such a regime, in response to any change in the initial condition,
the dynamical system restores itself and resumes its original course again,
leading to the formation of relatively stable patterns; thus, the system is wholly or
largely insensitive to the alteration of its initial conditions.
2. periodic: In this regime, in response to any change in the initial condition, the
trajectory of the system will eventually stabilize and alternate periodically
between relatively stable patterns.
3. chaotic: In such a regime, in response to any change in the initial condition, the
dynamical system generates a totally different orbit, i. e., any small perturbations
can lead to different trajectories. Hence, the system is highly sensitive to the
alteration of its initial conditions.
In the subsequent sections, we will illustrate the use of R to simulate and visualize some basic
dynamical systems, including population growth models, cellular automata, Boolean networks,
and other “abstract” dynamical systems, such as strange attractors and fractal geometries. These
dynamical systems are well known for their sensitivity to initial conditions, which is the defining
feature of chaotic systems.

15.2 Population growth models


Population growth models are among the simplest dynamical system models used to describe the
evolution of a population in a specified environment.

15.2.1 Exponential population growth model
The exponential growth model describes the evolution of a population or the concentration
(number of organisms per area unit) of an organism living in an environment, whose resources
and conditions allow them to grow indefinitely. Supposing that the growth rate of the organism is
r, then the evolution of the population number of organisms x(t) over time is governed by the
following equation:
$$\frac{dx}{dt} = rx. \tag{15.3}$$

If r is constant, then the solution of (→15.3) is given by

$$x(t) = x(0)\,e^{rt}. \tag{15.4}$$

The solution (→15.4) can be plotted in R using Listing 15.1, and the corresponding output,
which shows the evolution of the population for x(0) = 2 , r = 0.03 and t ∈ [0, 100] , is depicted
in →Figure 15.1 (left).
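A minimal base R sketch for evaluating and plotting the solution (→15.4) with the parameter values stated above follows. The variable names are chosen here for illustration, and the doubling time $\ln(2)/r$ is added as a derived quantity.

```r
x0 <- 2                          # initial population x(0)
r  <- 0.03                       # growth rate
t  <- seq(0, 100, by = 1)        # time grid on [0, 100]

x <- x0 * exp(r * t)             # solution (15.4)

# Time needed for the population to double: x0 * exp(r * T) = 2 * x0
doubling_time <- log(2) / r
cat("doubling time:", doubling_time, "\n")

plot(t, x, type = "l", xlab = "t", ylab = "x(t)")
```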

Figure 15.1 Left: Exponential population growth for $r = 0.03$ and $x_0 = 2$. Center: Logistic
population growth model for $r = 0.1$, $x_0 = 0.1 < K = 10$. Right: Logistic population growth
model for $r = 0.1$, $x_0 = 20 > K = 10$.

15.2.2 Logistic population growth model


In contrast with the exponential growth model, the logistic population model assumes that the
availability of resources restricts the population growth. Let K be the “carrying capacity” of the
living environment of the population, i. e., the population number or the concentration (number
of organisms per area unit) such that the growth rate of the organism population is zero. In this
situation, a larger population results in fewer resources, and this leads to a smaller growth rate.
Hence, the growth rate is no longer constant. When the growth rate is assumed to be a linearly
decreasing function of x of the form
$$r\left(1 - \frac{x(t)}{K}\right),$$

with positive K and r, we obtain the following logistic equation

$$\frac{dx}{dt} = rx\left(1 - \frac{x}{K}\right), \tag{15.5}$$

where the expression $\frac{dx}{dt}$ represents the growth rate of the organism’s population over time.

The growth rate $\frac{dx}{dt}$ is zero if $x = 0$ or $x = K$. The solution to the equation (→15.5) is given
by

$$x(t) = \frac{K x(0)\,e^{rt}}{K + x(0)(e^{rt} - 1)}. \tag{15.6}$$
The solution (→15.6) can be plotted in R using Listing 15.2. The corresponding output, which
shows the evolution of the population over the time interval [0,100], is depicted in →Figure 15.1
(center) for x(0) = 0.1 , r = 0.1 , K = 10 , and in →Figure 15.1 (right) for x(0) = 20 , r = 0.1 ,
K = 10 .
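
Listing 15.2 is not reproduced here. As a minimal sketch (not the book's listing), the closed-form solution (→15.6) can be evaluated and plotted for the two parameter sets quoted above. The function name `logistic_solution` and the time grid are illustrative assumptions.

```r
# Closed-form logistic solution (15.6); x0, r and K follow the text's notation.
logistic_solution <- function(t, x0, r, K) {
  K * x0 * exp(r * t) / (K + x0 * (exp(r * t) - 1))
}

t_grid  <- seq(0, 100, by = 0.5)
x_below <- logistic_solution(t_grid, x0 = 0.1, r = 0.1, K = 10)  # x(0) < K
x_above <- logistic_solution(t_grid, x0 = 20,  r = 0.1, K = 10)  # x(0) > K

# Both curves approach the carrying capacity K = 10 (grey line).
plot(t_grid, x_above, type = "l", lty = 2, ylim = c(0, 20),
     xlab = "t", ylab = "x(t)")
lines(t_grid, x_below)
abline(h = 10, col = "grey")
```

When x(0) < K the curve rises towards K, and when x(0) > K it decays towards K, which matches the center and right panels described for →Figure 15.1.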

15.2.3 Logistic map


The logistic map is a variant of the logistic population growth model (→15.5) with nonoverlapping
generations. Let $y_n$ denote the population number of the current generation and $y_{n+1}$ denote the population number of the next generation. When the growth rate is assumed to be a linearly decreasing function of $y_n$, then we get the following logistic equation:

$$y_{n+1} = r y_n\left(1 - \frac{y_n}{K}\right). \tag{15.7}$$

Substituting $y_n = K x_n$ and $y_{n+1} = K x_{n+1}$ in Equation (→15.7) gives the following recurrence relationship, also known as the logistic map:

$$x_{n+1} = r x_n (1 - x_n), \tag{15.8}$$

where $x_{n+1}$ denotes the population size of the next generation, whereas $x_n$ is the population size of the current generation; and r is a positive constant denoting the growth rate of the population between generations.
The graph of $x_n$ versus $x_{n+1}$ is called the cobweb graph of the logistic map.

For any initial condition, over time, the population $x_n$ will settle into one of the following types of behavior:
1. fixed, i. e., the population approaches a stable value
2. periodic, i. e., the population alternates between two or more fixed values
3. chaotic, i. e., the population will eventually visit any neighborhood in a subinterval
of (0,1).
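
The three types of behavior can be observed by iterating the map (→15.8) directly. The sketch below is an illustration rather than one of the book's listings; the function name `logistic_map_orbit`, the initial value 0.2, and the three values of r are assumptions chosen to show one case of each type.

```r
# Iterate x_{n+1} = r * x_n * (1 - x_n) and return the whole orbit.
logistic_map_orbit <- function(r, x0 = 0.2, n_steps = 500) {
  x <- numeric(n_steps + 1)
  x[1] <- x0
  for (n in seq_len(n_steps)) {
    x[n + 1] <- r * x[n] * (1 - x[n])
  }
  x
}

# Print the last few iterates for a fixed, a periodic and a chaotic regime.
for (r in c(2.5, 3.2, 3.9)) {
  orbit <- logistic_map_orbit(r)
  cat(sprintf("r = %.1f, last iterates: %s\n", r,
              paste(round(tail(orbit, 4), 4), collapse = " ")))
}
```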

15.2.3.1 Stable and unstable fixed points

When 0 ≤ r ≤ 4 , the map


x ↦ f (x) = rx(1 − x) (15.9)

defines a dynamical system on the interval [0,1].


The point x = 0 is a trivial fixed point of the dynamical system defined by (→15.9).
Furthermore, when r ≤ 1 , we have f (x) < x for all x ∈ (0, 1) ; thus, the system converges to
the fixed point x = 0 . However, when r > 1 , the graph of the function f (x) is a parabola
achieving its maximum at x = 1/2 and f (0) = f (1) = 0 .
The intersection between the graph of f and the straight line of equation y = x defines a point S, whose abscissa, $x^*$, satisfies $x^* = r x^*(1 - x^*)$.
Hence, the point $x^* = \frac{r-1}{r}$ is another fixed point of the system.
When 1 < r < 3, the point $x^*$ is asymptotically stable, i. e., for any x in the neighborhood of $x^*$, the sequence generated by the map (→15.9), i. e., the orbit of x, remains close to or converges to $x^*$. In R, such dynamics of the system can be illustrated using the scripts provided in Listing 15.3 and Listing 15.4.
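
The range 1 < r < 3 can be checked with the standard linearization criterion for one-dimensional maps, which the text does not spell out: a fixed point $x^*$ of f is asymptotically stable when $|f'(x^*)| < 1$. A short worked derivation under that criterion, for the nontrivial fixed point $x^* = (r-1)/r$:

```latex
\begin{aligned}
f(x) &= r x (1 - x), \qquad f'(x) = r(1 - 2x),\\
f'(x^*) &= r\left(1 - \frac{2(r-1)}{r}\right) = 2 - r,\\
|f'(x^*)| < 1 \;&\Longleftrightarrow\; |2 - r| < 1 \;\Longleftrightarrow\; 1 < r < 3 .
\end{aligned}
```

The same computation at the trivial fixed point gives $f'(0) = r$, so $x = 0$ loses its stability once r exceeds 1.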


→Figure 15.2 (left), produced using Listing 15.4, shows the cobweb graph of the logistic map
for r = 2.5 , which corresponds to a stable fixed point. When r = 3 the logistic map has an
asymptotically stable fixed point, and the corresponding cobweb graph and the graph of the
population dynamics are depicted in →Figure 15.2 (center) (produced using Listing 15.4) and
→Figure 15.2 (right) (produced using Listing 15.3), respectively.
Figure 15.2 Left: Cobweb graph of a stable fixed point for r = 2.5 . Center: Cobweb graph of an
asymptotically stable fixed point for r = 3 ; Right: Population number dynamics over time.

15.2.3.2 Periodic fixed points: bifurcation

Due to its discrete nature, regulation of the growth rate in the logistic map (→15.8) operates with
a one period delay, leading to overshooting of the dynamical system. Beyond the value r = 3 , the
dynamical system (→15.8) is no longer asymptotically stable, but exhibits some periodic behavior.
The parameter value r = 3 is known as a bifurcation point. This behavior can be illustrated, in R,
using Listing 15.5.

Figure 15.3 Left: Cobweb graph of periodic fixed points for r = 3.2 . Center: Cobweb graph of
periodic fixed points for r = 3.4 ; Right: Dynamics of the population number over time.

→Figure 15.3 (left) and →Figure 15.3 (center), produced using Listing 15.4, show the cobweb
graphs of the logistic map for r = 3.2 and r = 3.4 , which both correspond to periodic fixed
points. →Figure 15.3 (right), produced using Listing 15.3, illustrates the dynamics of the
populations over time for both cases.

15.2.3.3 Chaotic motion

For larger values of r in the logistic map (→15.8), further bifurcations occur, and the number of periodic points explodes. For instance, for most values of r beyond approximately 3.57, the structure of the orbits of the dynamical system becomes complex and, hence, chaotic behavior ensues. Such behavior can be illustrated in R, using the scripts provided in Listing 15.3 and Listing 15.4.
→Figure 15.4 (left) and →Figure 15.4 (center), produced using Listing 15.4, show the cobweb
graphs of the logistic map for r = 3.8 and r = 3.9 , which both correspond to chaotic motions.
→Figure 15.4 (right), produced using Listing 15.3, illustrates the dynamics of the populations over
time for both cases, where the chaotic evolution of the populations can be clearly observed.

Figure 15.4 Left: Cobweb graph of a chaotic motion for r = 3.8 . Center: Cobweb graph of a
chaotic motion for r = 3.9 ; Right: Dynamics of the population number over time.

→Figure 15.5 (left), (center), and (right), produced using Listing 15.5, illustrates the bifurcation
phenomenon, which can be visualized through the graph of the growth rate, r, versus the
population size, x. Such a graph is also known as the bifurcation diagram of a logistic map model.
→Figure 15.5 (left) depicts the bifurcation diagram for 0 ≤ r ≤ 4 , whereas →Figure 15.5 (center)
and →Figure 15.5 (right) show the zoom corresponding to the ranges 3 ≤ r ≤ 4 and
3.52 ≤ r ≤ 3.92 , respectively.
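
Listing 15.5 is not reproduced here. A bifurcation diagram of this kind can be sketched by iterating the map (→15.8) for each value of r, discarding a transient, and plotting the iterates that remain. The function name `bifurcation_points`, the transient length, and the starting value 0.2 are illustrative assumptions.

```r
# For each r: run a transient, then keep n_keep iterates of the logistic map.
bifurcation_points <- function(r_values, n_transient = 500, n_keep = 100,
                               x0 = 0.2) {
  do.call(rbind, lapply(r_values, function(r) {
    x <- x0
    for (i in seq_len(n_transient)) x <- r * x * (1 - x)  # discard transient
    kept <- numeric(n_keep)
    for (i in seq_len(n_keep)) {
      x <- r * x * (1 - x)
      kept[i] <- x
    }
    data.frame(r = r, x = kept)
  }))
}

pts <- bifurcation_points(seq(2.5, 4, length.out = 400))
plot(pts$r, pts$x, pch = ".", xlab = "growth rate r", ylab = "population size x")
```

A single branch for a given r corresponds to a stable fixed point, two branches to a period-2 orbit, and a filled vertical band to chaotic motion.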
Figure 15.5 Bifurcation diagram for the logistic map model—growth rate r versus population size
x: Left 0 ≤ r ≤ 4 . Center: zoom for 3 ≤ r ≤ 4 . Right: zoom for 3.52 ≤ r ≤ 3.92 .

15.3 The Lotka–Volterra or predator–prey system


The Lotka–Volterra equations, also known as the predator–prey system, are among the earliest
dynamical system models in mathematical ecology, and were derived independently by Vito
Volterra [→193], and Alfred Lotka [→120]. The model involves two species: the first (the prey), whose population number or concentration at time t is $x_1(t)$, and the second (the predator), which feeds on the prey, and whose population number or concentration at time t is $x_2(t)$.
Furthermore, the model is based on the following assumptions about the environment, as well as
the evolution of the populations of the two species:
1. The prey population has an unlimited food supply, and it grows exponentially in
the absence of interaction with the predator species.
2. The rate of predation upon the prey species is proportional to the rate at which
the predator species and the prey meet.
The model describes the evolution of the population numbers $x_1$ and $x_2$ over time through the following relationships:

$$\begin{cases}
\dfrac{dx_1}{dt} = x_1(\alpha - \beta x_2),\\[2mm]
\dfrac{dx_2}{dt} = -x_2(\gamma - \delta x_1),
\end{cases} \tag{15.10}$$

where $\frac{dx_1}{dt}$ and $\frac{dx_2}{dt}$ denote the growth rates of the two populations over time; α is the growth rate of the prey population in the absence of interaction with the predator species; β is the death rate of the prey species caused by the predator species; γ is the death (or emigration) rate of the predator species in the absence of interaction with the prey species; and δ is the growth rate of the predator population.
The predator–prey model (→15.10) is a system of ODEs. Thus, it can be solved using the
function ode() in R. When the parameters α, β, γ, and δ are set to 0.2, 0.002, 0.1, and 0.001,
respectively, the system (→15.10) can be solved in R, using the scripts provided in Listing 15.6 and
Listing 15.7.
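
Listings 15.6 and 15.7 are not reproduced here. The sketch below integrates (→15.10) with a hand-written classical fourth-order Runge–Kutta scheme in base R, so that it runs without additional packages; it is a stand-in for, not a copy of, the `ode()`-based approach mentioned above. The function names, the step size `dt = 0.05`, and the end time are assumptions.

```r
# Right-hand side of (15.10); state = c(prey x1, predator x2).
lotka_volterra <- function(state, alpha, beta, gamma, delta) {
  x1 <- state[1]; x2 <- state[2]
  c(x1 * (alpha - beta * x2), -x2 * (gamma - delta * x1))
}

# Fixed-step RK4 integrator; returns one row per time point.
rk4_solve <- function(f, state0, t_end, dt, ...) {
  n_steps <- round(t_end / dt)
  out <- matrix(NA_real_, nrow = n_steps + 1, ncol = length(state0))
  out[1, ] <- state0
  for (i in seq_len(n_steps)) {
    s  <- out[i, ]
    k1 <- f(s, ...)
    k2 <- f(s + dt / 2 * k1, ...)
    k3 <- f(s + dt / 2 * k2, ...)
    k4 <- f(s + dt * k3, ...)
    out[i + 1, ] <- s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
  }
  out
}

sol <- rk4_solve(lotka_volterra, state0 = c(100, 25), t_end = 200, dt = 0.05,
                 alpha = 0.2, beta = 0.002, gamma = 0.1, delta = 0.001)
plot(sol[, 1], sol[, 2], type = "l", xlab = "prey x1", ylab = "predator x2")
```

The phase-plane curve closes on itself because the quantity δx₁ − γ ln x₁ + βx₂ − α ln x₂ is constant along solutions of (→15.10), which also gives a convenient accuracy check for the integrator.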
The corresponding outputs are shown in →Figure 15.6, where the solution in the phase plane $(x_1, x_2)$ for $x_2(0) = 25$, the evolution of the population of the species over time for $x_2(0) = 25$, and the solution in the phase plane $(x_1, x_2)$ for $10 \le x_1(0) \le 150$ are depicted in →Figure 15.6 (left), →Figure 15.6 (center), and →Figure 15.6 (right), respectively.


Figure 15.6 Solutions of the system (→15.10) with α = 0.2, β = 0.002, γ = 0.1, δ = 0.001, and the initial conditions $x_1(0) = 100$. Left: Solution in the phase plane $(x_1, x_2)$ for $x_2(0) = 25$. Center: evolution of the population of the species over time for $x_2(0) = 25$. Right: solution in the phase plane $(x_1, x_2)$ for $10 \le x_1(0) \le 150$.

15.4 Cellular automata


A cellular automaton (CA) is a model used to describe the behaviors and the physics of discrete
dynamical systems [→62], [→194], [→205], [→206]. A CA is characterized by the following features:
An n-dimensional grid of cells;
Each cell has a state, which represents its current status;
Each cell has a neighborhood, which consists of the cell itself and all its immediate
surroundings.
The most elementary and yet interesting cellular automaton consists of a one-dimensional grid of cells, where the set of states for the cells is 0 or 1, and the neighborhood of a cell is the cell itself, as well as its immediate successor and predecessor, as illustrated below:
A one-dimensional CA 0 1 1 0 1 0 0 0 0 1

At each time point, the state of each cell of the grid is updated according to a specified rule, so
that the new state of a given cell depends on the state of its neighborhood, namely the current
state of the cell under consideration and its adjacent cells, as illustrated below:

A cell (in red) and its neighborhood


Rule for updating the cell in red

The cells at the boundaries do not have two neighbors, and thus require special treatment. The way in which these cells are handled is called the boundary condition, and several choices are possible:
The cells can be kept with their initial condition, i. e., they will not be updated at all during the simulation process.
The cells can be updated in a periodic way, i. e., the first cell on the left is a neighbor of the last cell on the right, and vice versa.
The cells can be updated using a desired rule.
Depending on the rule specified for updating the cell and the initial conditions, the evolution of
elementary cellular automata can lead to the following system states:
Steady state: The system will remain in its initial configuration, i. e., the initial spatiotemporal
pattern can be a final configuration of the system elements.
Periodic cycle: The system will alternate between coherent periodic stable patterns.
Self-organization: The system will always converge towards a coherent stable pattern.
Chaos: The system will exhibit some chaotic patterns.
For a finite number of cells N, the number of possible configurations for the system is also finite and is given by $2^N$. Hence, after at most $2^N$ time steps, a deterministic CA must revisit a configuration that it has already been in, and from then on it will enter a periodic cycle by repeating itself indefinitely. Such a cycle corresponds to an attractor of the system for the given initial conditions. When a cellular automaton models an orderly system, then the corresponding attractor is generally small, i. e., it has a cycle with a small period.
Using the R Listing 15.8, we illustrate some spatiotemporal evolutions of an elementary
cellular automaton using both deterministic and random initial conditions, whereby the cells at
the boundaries are kept to their initial conditions during the simulation process.
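
Listing 15.8 is not reproduced here. As a minimal sketch, an elementary cellular automaton with fixed boundary cells can be simulated as follows. It assumes the usual Wolfram numbering, in which bit k of the rule number gives the new state for the neighborhood whose three cells, read as a binary number (left, center, right), equal k. The function name `eca_run` and the grid size are illustrative.

```r
# Run an elementary CA; row t of the result is the configuration at time t - 1.
eca_run <- function(rule, init, n_steps) {
  # rule_bits[k + 1] is the new state for neighbourhood code k = 4*l + 2*c + r.
  rule_bits <- as.integer(intToBits(rule))[1:8]
  n <- length(init)
  grid <- matrix(0L, nrow = n_steps + 1, ncol = n)
  grid[1, ] <- as.integer(init)
  idx <- 2:(n - 1)                       # interior cells only
  for (t in seq_len(n_steps)) {
    cur <- grid[t, ]
    nxt <- cur                           # boundary cells keep their initial state
    code <- 4L * cur[idx - 1] + 2L * cur[idx] + cur[idx + 1]
    nxt[idx] <- rule_bits[code + 1L]
    grid[t + 1, ] <- nxt
  }
  grid
}

# Deterministic initial condition: all cells 0 except the middle one.
init <- integer(101)
init[51] <- 1L
g <- eca_run(rule = 182, init = init, n_steps = 50)

# Time runs from top to bottom; black cells are in state 1.
image(t(g[nrow(g):1, ]), col = c("white", "black"), axes = FALSE)
```

A random initial condition can be obtained by replacing `init` with, for example, `sample(0:1, 101, replace = TRUE)`.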
→Figure 15.7 shows the spatiotemporal patterns of an elementary cellular automaton with a
simple deterministic initial condition, i. e., all the cells are set to 0, except the middle one, which is
set to 1. Complex localized stable structures (using Rule 182), self-organization (using Rule 210)
and chaotic patterns (using Rule 89) are depicted in →Figure 15.7 (left), →Figure 15.7 (center), and
→Figure 15.7 (right), respectively.
→Figure 15.8 shows spatiotemporal patterns of an elementary cellular automaton with a
random initial condition, i. e., the states of the cells are allocated randomly. Complex localized
stable structures (using Rule 182), self-organization (using Rule 210) and chaotic patterns (using
Rule 89) are depicted in →Figure 15.8 (left), →Figure 15.8 (center), and →Figure 15.8 (right),
respectively.
Figure 15.7 Spatiotemporal patterns of an elementary cellular automaton with a simple
deterministic initial condition, i. e., all the cells are set to 0 except the middle one, which is set to 1.
Left: complex localized stable structures (Rule 182). Center: self-organization (Rule 210). Right:
chaotic patterns (Rule 89).
Figure 15.8 Spatiotemporal patterns of an elementary cellular automaton with a random initial
condition, i. e., the states of the cells are allocated randomly. Left: complex localized stable
structures (Rule 182). Center: self-organization (Rule 210). Right: chaotic patterns (Rule 89).

15.5 Random Boolean networks


Random Boolean networks (RBNs) were first introduced in the late 1960s to model the genetic
regulation in biological cells [→109], and since then have been widely used as a mathematical
approach for modeling complex adaptive and nonlinear biological systems. A Random Boolean
network is usually represented as a directed graph, defined by a pair $(X, F)$, where $X = \{x_1, \ldots, x_N\}$ is a finite set of nodes, and $F = \{f_1, \ldots, f_N\}$ is a corresponding set of Boolean functions, called transition or regulation functions. Let $x_i(t)$ represent the state of the node $x_i$ at time t, which takes the value of either 1 (on) or 0 (off). Then, the vector $x(t) = (x_1(t), \ldots, x_N(t))$ represents the state of all the nodes in X at the time step t. The total number of possible states for each time step is $2^N$. The state of a node $x_i$ at the next time step t + 1 is determined by $x_i(t+1) = f_i(x_j(t), \ldots, x_k(t))$, where $\{x_j, \ldots, x_k\}$ is the set of the immediate predecessors (or input nodes) of $x_i$. If all the N nodes have the same number of input nodes, K, then the RBN is referred to as an NK network, and K is also called the number of connections of the network. Like most dynamical systems, RBNs exhibit three main regimes which, for an NK network, are correlated with the number of connections K [→109]. In particular,
if K < 2 the evolution of the RBN leads to stable (ordered) dynamics,
if K = 2 the evolution of the RBN leads to periodic (critical) dynamics,
if K ≥ 3 the evolution of the RBN leads to a chaotic regime.

RBNs can be viewed as a generalization of cellular automata, in the sense that, in Boolean
networks,
a cell neighborhood is not necessarily restricted to its immediate adjacent cells,
the size of the neighborhood of a cell and the position of the cells within the neighborhood
are not necessarily the same for every cell of the grid,
the state transition rules are not necessarily identical or unique for every cell of the grid,
the updating process of the cells is not necessarily synchronous.
The updating process of the nodes in a Boolean network can be synchronous or asynchronous,
deterministic or nondeterministic. According to the specified update process, Boolean networks
can be cast in different categories [→85], including the following:
1. Classical random Boolean networks (CRBNs): In RBNs of this type, at each discrete
time step, all the nodes in the network are updated synchronously in a
deterministic manner, i. e., the nodes are updated at time t + 1 , taking into
account the state of the network at time t.
2. Asynchronous random Boolean networks (ARBNs): In RBNs of this type, at each time
step, a single node is chosen at random and updated, and thus the update
process is asynchronous and nondeterministic.
3. Deterministic asynchronous random Boolean networks (DARBNs): For this class of
Boolean networks, each node is labeled with two integers u, v ∈ N ( u < v ). Let m
denote the number of time steps from the beginning of the simulation to the
current time. Then, the only nodes to be updated during the current time step are
those such that u = (m mod v) . If several nodes have to be updated at the
same time step, then the changes, made in the network by updating one node,
are taken into account during the updating process of the next node. Hence, the
update process is asynchronous and deterministic.
4. Generalized asynchronous random Boolean networks (GARBNs): For this class of
Boolean networks, at each time step, a random number of nodes are selected and
updated synchronously; i. e., if several nodes have to be updated at the same time
step, then the changes, made in the node-states by updating one node, are not
taken into account during the updating process of the next node. Thus, the
update process is semi-synchronous and nondeterministic.
5. Deterministic generalized asynchronous random Boolean networks (DGARBNs): This
type of Boolean networks is similar to the DARBN, except that, in this case, if
several nodes have to be updated at the same time step, the changes, made in
the node-states by updating one node, are not taken into account during the
updating process of the next node. Thus, the update process is semi-synchronous
and deterministic.
In the context of genomics, a gene regulatory network (GRN) can be modeled as a Boolean
network, where the status of a given gene (active/expressed or inactive/not expressed) is
represented as a Boolean variable, whereas the interactions/dependencies between genes are
described through the transition functions, and the input nodes for a gene $x_i$ consist of genes regulating $x_i$. Let us consider the following simple GRN with three genes A, B, C, i. e., $X = \{x_1, x_2, x_3\}$, where $A = x_1$, $B = x_2$, and $C = x_3$, and $F = \{f_1, f_2, f_3\}$ with

$$\begin{cases}
f_1 = f_1(x_1, x_3) = x_1 \lor x_3,\\
f_2 = f_2(x_1, x_3) = x_1 \land x_3,\\
f_3 = f_3(x_1, x_2) = \lnot x_1 \lor x_2,
\end{cases}$$

where ∨, ∧, and ¬ are the logical disjunction (OR), conjunction (AND), and negation (NOT), respectively.
At a given time point t, the state-vector is $x(t) = (x_1(t), x_2(t), x_3(t))$ and the state evolution at the time point t + 1 is given by

$$\begin{cases}
x_1(t+1) = f_1(x_1(t), x_3(t)) = x_1(t) \lor x_3(t),\\
x_2(t+1) = f_2(x_1(t), x_3(t)) = x_1(t) \land x_3(t),\\
x_3(t+1) = f_3(x_1(t), x_2(t)) = \lnot x_1(t) \lor x_2(t).
\end{cases} \tag{15.11}$$

The corresponding truth table, i. e., the nodes-state at time t + 1 for any given configuration of
the state vector x at time t, is as follows:
x(t) = (x1 (t), x2 (t), x3 (t)) 000 001 010 011 100 101 110 111
x(t + 1) = (x1 (t + 1), x2 (t + 1), x3 (t + 1)) 001 101 001 101 100 110 101 111
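
The truth table can be reproduced by applying the three transition functions of (→15.11) to all eight states with a synchronous update. This is a base R sketch that does not rely on any Boolean network package; the names `grn_step` and `truth_table` are illustrative.

```r
# One synchronous update of the three-gene network (15.11).
grn_step <- function(x) {
  # x is a logical vector (x1, x2, x3) = (A, B, C); returns the state at t + 1.
  c(x[1] | x[3],     # f1 = x1 OR x3
    x[1] & x[3],     # f2 = x1 AND x3
    !x[1] | x[2])    # f3 = (NOT x1) OR x2
}

# Enumerate the states in the order 000, 001, ..., 111 (x3 varies fastest).
states <- as.matrix(expand.grid(x3 = 0:1, x2 = 0:1, x1 = 0:1))[, 3:1] == 1
next_states <- t(apply(states, 1, grn_step))

as_label <- function(s) paste(as.integer(s), collapse = "")
truth_table <- data.frame(
  current = apply(states, 1, as_label),
  nxt     = apply(next_states, 1, as_label)
)
print(truth_table)
```

States that map to themselves, here 100 and 111, are fixed-point attractors of the synchronous dynamics.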

An RBN with N nodes can be represented by an N by N matrix, known as the adjacency matrix, for
which the value of the component (i, j) is 1 if there is an edge from node i to node j, and 0
otherwise. If we substitute the nodes $x_1$, $x_2$, $x_3$ with their associated gene labels A, B, and C, respectively, then the corresponding adjacency matrix is written as follows:

A B C
A 1 1 1
B 0 0 1
C 1 1 0

To draw the corresponding network using the package igraph in R, we can save the adjacency
matrix as a csv (comma separated values) or a text file and then load the file in R. The
corresponding text or csv file, which we will call here “ExampleBN1.txt”, will be in the following
format:

Using the R package BoolNet [→140], we can also draw a given Boolean network, generate an
RBN and analyze it, e. g., find the associated attractors and plot them. However, the dependency
relations of the network must be written into a text file using an appropriate format. For instance,
the dependency relations (→15.11) can be written in a textual format as follows:
Here, the symbols |, & and ! respectively denote the logical disjunction (OR), conjunction
(AND) and negation (NOT). Let us call the corresponding text file “ExampleBN1p.txt”, and this
must be in the current working R directory.
→Figure 15.9, produced using Listing 15.10, shows the visualization and analysis of the
Boolean network represented in the text file “ExampleBN1p.txt”. The network graph, the state
transition graph as well as attractor basins, and the state transition table when the initial state is
(010) i. e., (A = 0, B = 1, C = 0) , are depicted in →Figure 15.9 (top), →Figure 15.9 (bottom left)
and →Figure 15.9 (bottom right), respectively.
Figure 15.9 Visualization and analysis of a Boolean network—Example 1. Top: network graph.
Bottom left: state transition graph and attractor basins. Bottom right: state transition table when
the initial state is (010), i. e., (A = 0, B = 1, C = 0) .
→Figure 15.10, produced using Listing 15.11, shows the visualization and analysis of an RBN
generated within the listing. The network graph, the state transition graph, as well as attractor
basins, and the state transition table when the initial state is (11111111) are depicted in →Figure
15.10 (top), →Figure 15.10 (bottom left), and →Figure 15.10 (bottom right), respectively.
Figure 15.10 Visualization and analysis of a Boolean network—Example 2. Top: network graph.
Bottom left: state transition graph and attractor basins. Bottom right: state transition table when
the initial state is (11111111).
Figure 15.11 Spatiotemporal patterns of RBNs with N = 1000 . Left: critical dynamics (K = 2) .
Right: chaotic patterns (K = 7) .

→Figure 15.11, produced using Listing 15.12, shows spatiotemporal patterns of RBNs with
N = 1000 . Critical dynamics (for K = 2 ) and chaotic patterns (for K = 7 ) are shown in →Figure
15.11 (left) and →Figure 15.11 (right), respectively.
15.6 Case studies of dynamical system models with complex attractors
In this section, we will provide implementations, in R, for some exemplary dynamical system
models, which are known for their complex attractors.

15.6.1 The Lorenz attractor


The Lorenz attractor is a seminal dynamical system model due to Edward Lorenz [→119], a meteorologist who was interested in modeling weather and the motion of air as it heats up. The state variable in the system, $x(t)$, is in $\mathbb{R}^3$, i. e., $x(t) = (x_1(t), x_2(t), x_3(t))$, and the system is written as:

$$\begin{cases}
\dfrac{dx_1}{dt} = a(x_2 - x_1),\\[2mm]
\dfrac{dx_2}{dt} = r x_1 - x_2 - x_1 x_3,\\[2mm]
\dfrac{dx_3}{dt} = x_1 x_2 - b x_3,
\end{cases} \tag{15.12}$$

where a, r, and b are constants.


The chaotic behavior of the Lorenz system (→15.12) is often termed the Lorenz butterfly. In R,
the Lorenz attractor can be simulated using Listing 15.13.
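
Listing 15.13 is not reproduced here. The sketch below integrates (→15.12) with a classical fourth-order Runge–Kutta step in base R. The step size `dt = 0.01`, the number of steps, and the function names are assumptions made for this illustration and differ from the settings quoted for →Figure 15.12.

```r
# Right-hand side of the Lorenz system (15.12); s = c(x1, x2, x3).
lorenz_rhs <- function(s, a = 10, r = 28, b = 8 / 3) {
  c(a * (s[2] - s[1]),
    r * s[1] - s[2] - s[1] * s[3],
    s[1] * s[2] - b * s[3])
}

# Fixed-step RK4 integration starting from s0.
simulate_lorenz <- function(s0 = c(0.01, 0.01, 0.01), dt = 0.01,
                            n_steps = 20000) {
  traj <- matrix(NA_real_, nrow = n_steps + 1, ncol = 3,
                 dimnames = list(NULL, c("x1", "x2", "x3")))
  traj[1, ] <- s0
  for (i in seq_len(n_steps)) {
    s  <- traj[i, ]
    k1 <- lorenz_rhs(s)
    k2 <- lorenz_rhs(s + dt / 2 * k1)
    k3 <- lorenz_rhs(s + dt / 2 * k2)
    k4 <- lorenz_rhs(s + dt * k3)
    traj[i + 1, ] <- s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
  }
  traj
}

traj <- simulate_lorenz()
# Projection onto the (x1, x3) plane shows the two lobes of the "butterfly".
plot(traj[, "x1"], traj[, "x3"], type = "l", xlab = "x1", ylab = "x3")
```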
→Figure 15.12, produced using Listing 15.13, shows some visualizations of the Lorenz attractor for a = 10, r = 28, b = 8/3, $(x_0, y_0, z_0) = (0.01, 0.01, 0.01)$, dt = 0.02 after $10^6$ iterations. Representations of the attractor in the plane (x, y), in the space (x, y, z) and in the plane (x, z) are given in →Figure 15.12 (left), →Figure 15.12 (center), and →Figure 15.12 (right), respectively.
Figure 15.12 Lorenz attractor for a = 10, r = 28, b = 8/3, $(x_0, y_0, z_0) = (0.01, 0.01, 0.01)$, dt = 0.02 after $10^6$ iterations: Left in the plane (x, y). Center in the space (x, y, z). Right in the plane (x, z).

15.6.2 Clifford attractor


The Clifford attractor is defined by the following recurrence equations:
$$\begin{cases}
x_{n+1} = \sin(a y_n) + c \cos(a x_n),\\
y_{n+1} = \sin(b x_n) + d \cos(b y_n),
\end{cases} \tag{15.13}$$

where a, b, c, and d are the parameters of the attractor.


In R, the system (→15.13) can be solved using Listing 15.14.
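
Listing 15.14 is not reproduced here. A minimal sketch of the iteration (→15.13) is given below, using the parameter values quoted for the center panel of →Figure 15.13 but fewer iterations to keep the run short. The function name `clifford_orbit` is illustrative.

```r
# Iterate the Clifford map (15.13) and return the orbit as a two-column matrix.
clifford_orbit <- function(a, b, c, d, x0 = pi / 2, y0 = pi / 2, n_iter = 1e5) {
  x <- numeric(n_iter + 1)
  y <- numeric(n_iter + 1)
  x[1] <- x0
  y[1] <- y0
  for (n in seq_len(n_iter)) {
    x[n + 1] <- sin(a * y[n]) + c * cos(a * x[n])
    y[n + 1] <- sin(b * x[n]) + d * cos(b * y[n])
  }
  cbind(x = x, y = y)
}

orbit <- clifford_orbit(a = -1.4, b = 1.6, c = 1, d = 0.7)
plot(orbit, pch = ".", asp = 1, xlab = "x", ylab = "y")
```

Since sine and cosine are bounded by 1, every iterate after the first satisfies |x| ≤ 1 + |c| and |y| ≤ 1 + |d|, so the plotting window never needs to be larger than that.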
→Figure 15.13 shows some visualizations of the Clifford attractor for different values of the parameters and initial conditions: →Figure 15.13 (left) displays the output of Listing 15.14 when a = −1.4, b = 1.6, c = 1, d = 0.3, $(x_0, y_0) = (\pi/2, \pi/2)$ after $1.5 \times 10^6$ iterations; →Figure 15.13 (center) shows the output of Listing 15.14 when a = −1.4, b = 1.6, c = 1, d = 0.7, $(x_0, y_0) = (\pi/2, \pi/2)$ after $1.5 \times 10^6$ iterations; →Figure 15.13 (right) shows the output of Listing 15.14 when a = −1.4, b = 1.6, c = 1, d = 0. −1, $(x_0, y_0) = (\pi/2, \pi/2)$ after $2 \times 10^6$ iterations.
Figure 15.13 Clifford attractor. Left: a = −1.4, b = 1.6, c = 1, d = 0.3, $(x_0, y_0) = (\pi/2, \pi/2)$ after $1.5 \times 10^6$ iterations. Center: a = −1.4, b = 1.6, c = 1, d = 0.7, $(x_0, y_0) = (\pi/2, \pi/2)$ after $1.5 \times 10^6$ iterations. Right: a = −1.4, b = 1.6, c = 1, d = 0. −1, $(x_0, y_0) = (\pi/2, \pi/2)$ after $2 \times 10^6$ iterations.

15.6.3 Ikeda attractor


The Ikeda attractor is a dynamical system model that is used to describe a mapping in the
complex plane, corresponding to the plane-wave interactivity in an optical ring laser. Its discrete-time version is defined by the following complex map:

$$z_{n+1} = a + b\, z_n \exp\!\left( i \left( k - \frac{p}{1 + \lVert z_n \rVert^2} \right) \right), \tag{15.14}$$

where $z_k = x_k + i y_k$.
The resulting orbit of the map (→15.14) is generally visualized by plotting z in the real-
imaginary plane (x, y) , also called the phase-plot. In R, the orbit of the Ikeda attractor can be
obtained using Listing 15.15. →Figure 15.14 (left), produced using Listing 15.15, shows a
representation of the Ikeda’s attractor in the plane (x, y) .
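
Listing 15.15 is not reproduced here. A minimal sketch of the map (→15.14) using R's built-in complex arithmetic is shown below, with the parameter values given in the caption of →Figure 15.14 (left) and fewer iterations. The function name `ikeda_orbit` is illustrative.

```r
# Iterate the Ikeda map (15.14); returns a complex vector z_0, z_1, ..., z_n.
ikeda_orbit <- function(a = 0.85, b = 0.9, k = 0.4, p = 7.7,
                        z0 = complex(real = 0, imaginary = 0), n_iter = 1e5) {
  z <- complex(n_iter + 1)
  z[1] <- z0
  for (n in seq_len(n_iter)) {
    phase <- k - p / (1 + Mod(z[n])^2)   # rotation angle depends on |z_n|
    z[n + 1] <- a + b * z[n] * exp(1i * phase)
  }
  z
}

z <- ikeda_orbit()
plot(Re(z), Im(z), pch = ".", asp = 1, xlab = "x", ylab = "y")
```

Because the exponential factor has modulus 1, each step satisfies |z_{n+1}| ≤ a + b|z_n|, so for b < 1 an orbit started at 0 stays within the disc of radius a/(1 − b).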

15.6.4 The Peter de Jong attractor


The Peter de Jong attractor is a well-known strange attractor, and its time-discrete version is
defined by the following system:
$$\begin{cases}
x_{n+1} = \sin(a y_n) - \cos(b x_n),\\
y_{n+1} = \sin(c x_n) - \cos(d y_n),
\end{cases} \tag{15.15}$$

where a, b, c, and d are the parameters of the attractor.

Figure 15.14 Left: Ikeda attractor for a = 0.85, b = 0.9, k = 0.4, p = 7.7, $z_0 = 0$ after $1.5 \times 10^6$ iterations. Center: de Jong attractor (→15.16) for a = 1.4, b = 1.56, c = 1.4, d = −6.56, $(x_0, y_0) = (0, 0)$ after $1.5 \times 10^6$ iterations. Right: de Jong attractor (→15.15) for a = 2.01, b = −2, c = 2, d = −2, $(x_0, y_0) = (0, 0)$ after $1.5 \times 10^6$ iterations.

A variant of Peter de Jong attractor is given by


$$\begin{cases}
x_{n+1} = d \sin(a x_n) - \sin(b y_n),\\
y_{n+1} = c \cos(a x_n) + \cos(b y_n).
\end{cases} \tag{15.16}$$

In R, the orbit of the Peter de Jong attractor can be obtained using Listing 15.16. →Figure 15.14 (center), produced using Listing 15.16, shows a representation of the de Jong attractor (→15.16), in the plane (x, y), for a = 1.4, b = 1.56, c = 1.4, d = −6.56, $(x_0, y_0) = (0, 0)$ after $1.5 \times 10^6$ iterations. →Figure 15.14 (right), also produced using Listing 15.16, shows a representation of the de Jong attractor (→15.15) for a = 2.01, b = −2, c = 2, d = −2, $(x_0, y_0) = (0, 0)$ after $1.5 \times 10^6$ iterations.
15.6.5 Rössler attractor
The Rössler attractor [→157] is a dynamical system that has some applications in the field of
electrical engineering [→113]. It is defined by the following equations:


$$\begin{cases}
\dfrac{dx}{dt} = -y - z,\\[2mm]
\dfrac{dy}{dt} = x + a y,\\[2mm]
\dfrac{dz}{dt} = b + z(x - c),
\end{cases} \tag{15.17}$$

where, a, b, and c are the parameters of the attractor. This attractor is known to have some
chaotic behavior for certain values of the parameters.
In R, the system (→15.17) can be solved and its results visualized using Listing 15.17. →Figure 15.15, produced using Listing 15.17, shows some visualizations of the Rössler attractor for different values of its parameters and initial conditions. →Figure 15.15 (left) shows the output of Listing 15.17 when a = 0.5, b = 2, c = 4, $(x_0, y_0, z_0) = (0.3, 0.4, 0.5)$, dt = 0.03 after $2 \times 10^6$ iterations. →Figure 15.15 (center) shows the output of Listing 15.17 when a = 0.5, b = 2, c = 4, $(x_0, y_0, z_0) = (0.03, 0.04, 0.04)$, dt = 0.03 after $2 \times 10^6$ iterations. →Figure 15.15 (right) shows the output of Listing 15.17 when a = 0.2, b = 0.2, c = 5.7, $(x_0, y_0, z_0) = (0.03, 0.04, 0.04)$, dt = 0.08 after $2 \times 10^6$ iterations.

Figure 15.15 Rössler’s attractor. Left: a = 0.5, b = 2, c = 4, $(x_0, y_0, z_0) = (0.3, 0.4, 0.5)$, dt = 0.03 after $2 \times 10^6$ iterations. Center: a = 0.5, b = 2, c = 4, $(x_0, y_0, z_0) = (0.03, 0.04, 0.04)$, dt = 0.03 after $2 \times 10^6$ iterations. Right: a = 0.2, b = 0.2, c = 5.7, $(x_0, y_0, z_0) = (0.03, 0.04, 0.04)$, dt = 0.08 after $2 \times 10^6$ iterations.

15.7 Fractals
There exist various definitions of the word “fractal”, and the simplest of these is the one
suggested by Benoit Mandelbrot [→125], who refers to a “fractal” as an object, which possesses
self-similarity. In this section, we will provide examples of implementations for some classical
fractal objects, using R.

15.7.1 The Sierpińsky carpet and triangle


The Sierpińsky carpet and triangle are geometric fractals named after Wacław Sierpińsky, who introduced them in the early twentieth century [→172]. The Sierpińsky carpet can be constructed
using the following iterative steps:
Step 1: Set $x_0 = 0$, n = 1, and choose the number of iterations N.
Step 2: If n ≤ N, then the following applies:
Set

$$x_n = \begin{pmatrix}
x_{n-1} & x_{n-1} & x_{n-1}\\
x_{n-1} & I_{n-1} & x_{n-1}\\
x_{n-1} & x_{n-1} & x_{n-1}
\end{pmatrix},$$

where $I_{n-1}$ is a $3^{n-1}$ by $3^{n-1}$ matrix of unity elements.
Set n = n + 1 and go to Step 2.
Otherwise, go to Step 3.
Step 3: Plot the points in the final matrix $x_N$.

The construction and visualization of the Sierpińsky carpet can be carried out, in R, using Listing
15.18. →Figure 15.16 (left), which is an output of Listing 15.18, shows the visualization of the
Sierpińsky carpet after six iterations.
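
Listing 15.18 is not reproduced here. The block-matrix recursion in the steps above can be sketched with `rbind()` and `cbind()`. The function name `sierpinski_carpet` and the choice of four iterations are illustrative.

```r
# Build x_N by repeatedly tiling x_{n-1} around a central block of ones.
sierpinski_carpet <- function(n_iter) {
  x <- matrix(0L, 1, 1)                      # x_0 = 0
  for (n in seq_len(n_iter)) {
    ones <- matrix(1L, nrow(x), ncol(x))     # I_{n-1}: 3^(n-1) by 3^(n-1) ones
    x <- rbind(cbind(x, x,    x),
               cbind(x, ones, x),
               cbind(x, x,    x))
  }
  x
}

carpet <- sierpinski_carpet(4)
# Entries equal to 0 (black) form the carpet; entries equal to 1 are the holes.
image(carpet, col = c("black", "white"), axes = FALSE, asp = 1)
```

After n iterations the matrix has $3^n$ rows and columns and contains $8^n$ zero entries, since each step keeps eight of the nine sub-blocks.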

The Sierpińsky triangle can be constructed using the following iterative steps:
Step 1: Select three points (vertices of the triangle) in a two-dimensional plane. Let us call them $x_a$, $x_b$, $x_c$;
Plot the points $x_a$, $x_b$, $x_c$;
Choose the number of iterations N;
Step 2: Select an initial point $x_0$. Set n = 1;
Step 3: If n ≤ N, then do the following:
Select one of the three vertices $\{x_a, x_b, x_c\}$ at random, and let us call this point $p_n$;
Calculate the point $x_n = \frac{x_{n-1} + p_n}{2}$, and plot $x_n$;
Set n = n + 1 and go to Step 3;
Otherwise go to Step 4;
Step 4: Plot the points of the sequence $x_0, x_1, \ldots, x_N$.

In R, the construction and the visualization of the Sierpińsky triangle can be achieved using Listing 15.19. →Figure 15.16 (center), which is an output of Listing 15.19, shows the visualization of the Sierpińsky triangle after $5 \times 10^5$ iterations.
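
Listing 15.19 is not reproduced here. The random midpoint procedure described in the steps above (often called the chaos game) can be sketched as follows. The vertex coordinates, the initial point, the seed, and the function name `chaos_game` are assumptions made for illustration.

```r
set.seed(1)  # fixed seed so that the figure is reproducible

# Repeatedly move halfway from the current point towards a random vertex.
chaos_game <- function(vertices, n_iter, x0 = c(0.2, 0.2)) {
  pts <- matrix(NA_real_, nrow = n_iter + 1, ncol = 2)
  pts[1, ] <- x0
  picks <- sample(nrow(vertices), n_iter, replace = TRUE)  # p_n, uniform
  for (n in seq_len(n_iter)) {
    pts[n + 1, ] <- (pts[n, ] + vertices[picks[n], ]) / 2
  }
  pts
}

# Vertices of an equilateral triangle with unit side.
vertices <- rbind(c(0, 0), c(1, 0), c(0.5, sqrt(3) / 2))
pts <- chaos_game(vertices, n_iter = 5e4)
plot(pts, pch = ".", asp = 1, xlab = "", ylab = "", axes = FALSE)
points(vertices, pch = 19, col = "red")
```

Because each new point is the midpoint between the previous point and a vertex, an orbit that starts inside the triangle never leaves it.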

15.7.2 The Barnsley fern


Named after the mathematician who introduced it, the Barnsley fern [→11] is a fractal, which can
be constructed using the following iterative process:
Step 1: Set a = (0, 0.85, 0.2, −0.15), b = (0, 0.04, −0.26, 0.28), c = (0, −0.04, 0.23, 0.26), d = (0.16, 0.85, 0.22, 0.24), e = (0, 0, 0, 0), f = (0, 1.6, 1.6, 0.44);
Choose the number of iterations N.
Step 2: Set $x_0 = 0$, $y_0 = 0$, and n = 1;
Step 3: If n ≤ N, then do the following:
Select at random a value r ∈ (0, 1),
If r < 0.01, then set j = 1 and go to Step 4,
If 0.01 ≤ r < 0.86, then set j = 2 and go to Step 4,
If 0.86 ≤ r < 0.93, then set j = 3 and go to Step 4,
If 0.93 ≤ r, then set j = 4 and go to Step 4;
Otherwise, go to Step 5;
Step 4: Set $x_n = a_j x_{n-1} + b_j y_{n-1} + e_j$, $y_n = c_j x_{n-1} + d_j y_{n-1} + f_j$, where $a_j$, $b_j$, $c_j$, $d_j$, $e_j$, $f_j$ denote the $j$th component of the vectors a, b, c, d, e and f, respectively.
Set n = n + 1 and go to Step 3;
Step 5: Plot the points of the sequence $(x_0, y_0), (x_1, y_1), \ldots, (x_N, y_N)$.

In R, the construction and the visualization of the Barnsley fern can be achieved using Listing 15.20. →Figure 15.16 (right), which is an output of Listing 15.20, shows the visualization of the Barnsley fern after $10^6$ iterations.
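
Listing 15.20 is not reproduced here. The iterative process above can be sketched as follows, using the coefficient vectors from Step 1 and drawing the index j with `findInterval()` so that the four maps are chosen with probabilities 0.01, 0.85, 0.07, and 0.07. The seed, the smaller iteration count, and the function name `barnsley_fern` are illustrative.

```r
set.seed(1)  # fixed seed so that the figure is reproducible

barnsley_fern <- function(n_iter) {
  a  <- c(0, 0.85, 0.2, -0.15);   b <- c(0, 0.04, -0.26, 0.28)
  cc <- c(0, -0.04, 0.23, 0.26);  d <- c(0.16, 0.85, 0.22, 0.24)
  e  <- c(0, 0, 0, 0);            f <- c(0, 1.6, 1.6, 0.44)
  # j = 1 if r < 0.01, 2 if r < 0.86, 3 if r < 0.93, otherwise 4.
  j <- findInterval(runif(n_iter), c(0.01, 0.86, 0.93)) + 1L
  x <- numeric(n_iter + 1)
  y <- numeric(n_iter + 1)                 # (x_0, y_0) = (0, 0)
  for (n in seq_len(n_iter)) {
    x[n + 1] <- a[j[n]]  * x[n] + b[j[n]] * y[n] + e[j[n]]
    y[n + 1] <- cc[j[n]] * x[n] + d[j[n]] * y[n] + f[j[n]]
  }
  cbind(x = x, y = y)
}

fern <- barnsley_fern(1e5)
plot(fern, pch = ".", col = "darkgreen", asp = 1, axes = FALSE,
     xlab = "", ylab = "")
```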
Figure 15.16 Left: The Sierpińsky carpet. Center: The Sierpińsky triangle. Right: The Barnsley fern.

15.7.3 Julia sets


Let $z_n$ be a sequence defined by the following recurrence relationship:

$$z_{n+1} = z_n^m + c \tag{15.18}$$

with c and $z_0 \in \mathbb{C}$ and $m \in \mathbb{N}$.
For a given value of c, the associated Julia set [→103] is defined by the boundary between the set of $z_0$ values that have bounded orbits, and those which do not. For instance, when m = 2, for any $c \in \mathbb{C}$ the recurrence relationship $z_{n+1} = z_n^2 + c$ defines a quadratic Julia set.

In R, the construction and the visualization of the Julia set can be done using Listing 15.21 and
Listing 15.22. Figures →15.17, →15.18, →15.19, which have all been produced using Listing 15.21,
illustrate the evolution of the quadratic Julia set according to the value of the complex parameter
c.
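
Listings 15.21 and 15.22 are not reproduced here. A common way to visualize a Julia set, sketched below, is the escape-time method: iterate (→15.18) from every starting point $z_0$ on a grid and count how many iterations the orbit stays within a chosen radius. The escape radius of 2 (adequate for the quadratic case when |c| ≤ 2), the grid limits, and the function name `julia_escape` are assumptions of this sketch.

```r
# Escape-time counts for z_{n+1} = z_n^m + c over a square grid of z_0 values.
julia_escape <- function(c_param, m = 2, n_grid = 200, max_iter = 50,
                         radius = 2, lim = 1.6) {
  re <- seq(-lim, lim, length.out = n_grid)
  im <- seq(-lim, lim, length.out = n_grid)
  z  <- outer(re, im, function(x, y) complex(real = x, imaginary = y))
  count <- matrix(0L, n_grid, n_grid)
  for (i in seq_len(max_iter)) {
    alive <- Mod(z) <= radius              # orbits that have not escaped yet
    z[alive] <- z[alive]^m + c_param
    count[alive] <- count[alive] + 1L
  }
  list(re = re, im = im, count = count)
}

j <- julia_escape(c_param = complex(real = -0.74543, imaginary = 0.11301))
image(j$re, j$im, j$count, col = hcl.colors(50), asp = 1,
      xlab = "Re(z0)", ylab = "Im(z0)")
```

Grid points whose count equals `max_iter` approximate the set of bounded orbits; the Julia set is the boundary of that region.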
Figure 15.17 Quadratic Julia sets. Left: c = 0.7 ; Center: c = −0.074543 + 0.11301i ; Right:
c = 0.770978 + 0.08545i .

Figure 15.18 Quadratic Julia sets. Left: c = 0.7 . Center: c = −0.74543 + 0.11301i . Right:
c = 0.770978 + 0.08545i .

Figure 15.19 Quadratic Julia sets. Left: c = −1.75 . Center: c = −i . Right: c = −0.835 − 0.2321i .

15.7.4 Mandelbrot set


The Mandelbrot set [→125] is the set of all $c \in \mathbb{C}$, such that the sequence $z_n$ defined by the
recurrence relationship (→15.19) is bounded.

$$\begin{cases} z_0 = 0, \\ z_{n+1} = z_n^m + c, \quad m \in \mathbb{N}. \end{cases} \tag{15.19}$$

Formally, the Mandelbrot set can be defined as follows:

$$M = \{c \in \mathbb{C} : z_0 = 0 \text{ and } |z_n| \nrightarrow \infty \text{ as } n \to \infty\}. \tag{15.20}$$

The Mandelbrot system (→15.19) can be reformulated in $\mathbb{R}^2$ by substituting $z_n = x_n + i y_n$ and
$c = a + ib$ with their real and imaginary parts, respectively. For instance, when $m = 2$, the
system (→15.19) is called the quadratic Mandelbrot set, and it can be reformulated in $\mathbb{R}^2$ as
follows:

$$\begin{cases} x_0 = y_0 = 0, \\ x_{n+1} = x_n^2 - y_n^2 + a, \\ y_{n+1} = 2 x_n y_n + b. \end{cases} \tag{15.21}$$
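The real form of the quadratic system can be iterated directly. The following sketch is not Listing 15.23; the function name `in_mandelbrot`, the iteration cap, and the escape radius 2 are assumptions, and a finite number of iterations can only approximate the boundedness condition in (→15.20).

```r
# Iterate the real form of the quadratic Mandelbrot system for c = a + ib.
# Returns TRUE if the orbit of (0, 0) stays within `radius` for max_iter steps.
in_mandelbrot <- function(a, b, max_iter = 200, radius = 2) {
  x <- 0
  y <- 0
  for (n in seq_len(max_iter)) {
    x_new <- x^2 - y^2 + a     # real part of z^2 + c
    y     <- 2 * x * y + b     # imaginary part, computed from the old x
    x     <- x_new
    if (x^2 + y^2 > radius^2) return(FALSE)
  }
  TRUE
}

in_mandelbrot(-1, 0)   # orbit 0, -1, 0, -1, ... is bounded
in_mandelbrot(1, 0)    # orbit 0, 1, 2, 5, ... escapes
```

Evaluating `in_mandelbrot()` on a grid of (a, b) values and plotting the result gives a coarse picture of the set.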

In R, the construction and the visualization of the quadratic Mandelbrot set can be done using
the scripts in Listing 15.23 and Listing 15.24. The graphs in →Figure 15.20, produced using Listing
15.23 and Listing 15.24, illustrate some visualization of the quadratic Mandelbrot set depending
on the values of its parameters.
Figure 15.20 Left: $z_{n+1} = z_n^2 + c$. Center: $z_{n+1} = c \cos(z_n)/\sqrt{0.8}$. Right: $z_{n+1} = z_n^3 - z_n + c - (2/3)/\sqrt{3}$.

15.8 Exercises
1. Consider the following dynamical system:

$$x_{n+1} = x_n^2 \quad \text{for } n = 0, 1, 2, 3, \ldots$$

Use R to simulate the dynamics of $x_n$ using the initial conditions $x_0 = 1$ and
$x_0 = 3$, for $n = 1, \ldots, 500$.

Plot the corresponding cobweb graph, as well as the graph of the evolution of $x_n$
over time.
2. Consider the following dynamical system:

$$z_{t+1} - z_t = z_t (1 - z_t) \quad \text{for } t = 0, 1, 2, 3, \ldots$$

Use R to simulate the dynamics of $z_t$ using the initial conditions $z_0 = 0.2$ and
$z_0 = 5$ for $t = 1, \ldots, 500$.

Plot the corresponding cobweb graph, as well as the graph of the evolution of $z_t$
over time.
3. Let $x_n$ be the number of fish in generation n in a lake. The evolution of the fish
population can be modeled using the following model:

$$x_{n+1} = 8 x_n e^{-x_n}.$$

Use R to simulate the dynamics of the fish population using the initial conditions
$x_0 = 1$ and $x_0 = \log(8)$ for $n = 1, \ldots, 500$.

Plot the corresponding cobweb graph, as well as the graph of the dynamics of the
population number, over time.
4. Consider the following predator–prey model for x and y:

$$\begin{aligned} \frac{dx}{dt} &= Ax - Bxy, \\ \frac{dy}{dt} &= -Cy + Dxy. \end{aligned} \tag{15.22}$$

Use R to solve the system (→15.22) using the following initial conditions and
values of the parameters for $t \in [0, 200]$:
(a) x(0) = 81 , y(0) = 18 , A = 1.5 , B = 1.1 , C = 2.9 , D = 1.2 ;

(b) x(0) = 150 , y(0) = 81 , A = 5 , B = 3.1 , C = 1.9 , D = 2.1 .

Plot the corresponding solutions in the phase plane (x, y) , and the evolution of
the population of both species over time.
5. Use R to plot, in 3D, the Lorenz system (→15.12) using the parameters
a = 15 , r = 32 , b = 3 , and the following initial conditions: $x_1(0) = 0.03$,
$x_2(0) = 0.03$, $x_3(0) = 0.03$; $x_1(0) = 0.5$, $x_2(0) = 0.21$, $x_3(0) = 0.55$.
16 Graph theory and network analysis
This chapter provides a mathematical introduction to networks and graphs. To facilitate this
introduction, we will focus on basic definitions and highlight basic properties of defining
components of networks. In addition to quantitative network measures for complex networks, e. g.,
distance- and degree-based measures, we also survey some important graph algorithms,
including breadth-first search and depth-first search. Furthermore, we discuss different classes of
networks and graphs that find widespread applications in biology, economics, and the social
sciences [→10], [→23], [→53].

16.1 Introduction
A network G = (V , E) consists of nodes v ∈ V and edges e ∈ E , see [→94]. Often, an
undirected network is called a graph, but in this chapter we will not distinguish between a network
and a graph and use both terms interchangeably. In →Figure 16.1, we show some examples for
undirected and directed networks. The networks shown on the left-hand side are called undirected
networks, whereas those on the right-hand side are called directed networks since each edge has a
direction pointing from one node to another. Furthermore, all four networks, depicted in →Figure
16.1, are connected [→94], i. e., none of them has isolated vertices. For example, removing the
edge between the two nodes of an undirected network with only two vertices leaves merely two
isolated nodes.
Weighted networks are obtained by assigning weights to each edge. →Figure 16.2 depicts two
weighted, undirected networks (left) and two weighted, directed networks (right). A weight
between two vertices, $w_{AB}$, is usually a real number. The range of these weights depends on the
application context. For example, $w_{AB}$ could be a positive real number indicating the distance
between two cities, or two goods in a warehouse [→156].


From the examples above, it becomes clear that there exist many different graphs with a
given number of vertices. We call two graphs isomorphic if they have the same structure, even
though they may be drawn differently [→94].
In general, graphs or networks can be analyzed by using quantitative and qualitative methods
[→52]. For instance, a quantitative method to analyze graphs is a graph measure to quantify
structural information [→52]. In this chapter, we focus on quantitative techniques and in Section
→16.3 we present important examples thereof.
Figure 16.1 Two undirected (left) and two directed (right) networks with two nodes.

Figure 16.2 Weighted undirected and directed graphs with two vertices.

16.2 Basic types of networks


In the previous section, we discussed the basic units of which networks are made. In this
section, we construct larger networks, which can consist of many vertices and edges. In Section
→16.1, we just discussed the graphical visualization of networks without providing a formal
characterization thereof. In the following, we will provide such a formal characterization because
it is crucial for studying and visualizing graphs.

16.2.1 Undirected networks


To define a network formally, we specify its set of vertices or nodes, V, and its set of edges, E. That
means, any vertex $i \in V$ is a node of the network. Similarly, any element $E_{ij} \in E$ is an edge of
the network, which means that the vertices i and j are connected with each other. →Figure 16.3
shows an example of a network with $V = \{1, 2, 3, 4, 5\}$ and $E = \{E_{12}, E_{23}, E_{34}, E_{14}, E_{35}\}$. For
example, node $3 \in V$ and edge $E_{34}$ are part of the network shown by →Figure 16.3. From
→Figure 16.3, we further see that node 3 is connected with node 4, but also, node 4 is connected
with node 3. For this reason, we call such an edge undirected. In fact, the graph shown by →Figure
16.3 is an undirected network. It is evident that in an undirected network the symbol $E_{ij}$ has the
same meaning as $E_{ji}$, because the order of the nodes in this network is not important.


Definition 16.2.1.

An undirected network $G = (V, E)$ is defined by a vertex set V and an edge set $E \subseteq \binom{V}{2}$.

$E \subseteq \binom{V}{2}$ means that all edges of G belong to the set of subsets of vertices with 2 elements.
The size of G is the cardinality of the node set V, and is often denoted by |V | . The notation |E|
stands for the number of edges in the network. From →Figure 16.3, we see that this network has 5
vertices ( |V | = 5 ) and 5 edges ( |E| = 5 ).
In order to encode a network by utilizing a mathematical representation, we use a matrix
representation. The adjacency matrix A is a square matrix with |V | rows and |V |
columns. The matrix elements $A_{ij}$ of the adjacency matrix provide the connectivity of
a network.

Definition 16.2.2.

The adjacency matrix A for an undirected network G is defined by

$$A_{ij} = \begin{cases} 1 & \text{if } i \text{ is connected with } j \text{ in } G, \\ 0 & \text{otherwise,} \end{cases} \tag{16.1}$$

for $i, j \in V$.
As an example, let us consider the graph in →Figure 16.3. The corresponding adjacency matrix is

$$A = \begin{pmatrix} 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix}. \tag{16.2}$$

Since this network is undirected, its adjacency matrix is symmetric, that means $A_{ij} = A_{ji}$
holds for all i and j.
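As a small illustration of Definition 16.2.2, the adjacency matrix of the network in →Figure 16.3 can be assembled from its edge set in base R. The variable names are chosen for this sketch only.

```r
# Build the adjacency matrix of the network in Figure 16.3 from its edge set
# E = {E12, E23, E34, E14, E35}; each row of `edges` is one pair (i, j).
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(1, 4), c(3, 5))
N <- 5
A <- matrix(0, N, N)
A[edges] <- 1            # set A_ij = 1 for each listed pair (i, j)
A[edges[, 2:1]] <- 1     # undirected network: also set A_ji = 1

isSymmetric(A)           # symmetry check, A_ij = A_ji
sum(A) / 2               # number of edges |E|, each edge is counted twice in A
```

Indexing a matrix by a two-column matrix, as in `A[edges]`, addresses one (row, column) entry per row of `edges`, which keeps the construction free of explicit loops.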


Figure 16.3 An undirected network.

16.2.2 Geometric visualization of networks


From the previous discussions, we see that the graphical visualization of a network is not
determined by its definition. This is illustrated in →Figure 16.4, where we show the same network
as in →Figure 16.3, but with different positions of the vertices. When comparing their adjacency
matrices (→16.2), one can see that these networks are identical. In general, a network represents a
topological object instead of a geometrical one. This means that we can arbitrarily deform the
network visually as long as V and E remain unchanged, as shown in →Figure 16.4. Therefore, the
formal definition of a given network does not include any geometric information, such as the
coordinates at which the vertices are positioned in a plane, or features such as edge length
and bendiness. In order to highlight this issue, we added a Cartesian coordinate system to the
right panel of →Figure 16.4 when drawing the graph. The good news is that, as long as we do not
require a visualization of a network, the topological information about it is sufficient to conduct
any analysis possible.
In contrast, from →Figure 16.3 and →Figure 16.4, we can see that the visualization of a
network is not unique and for a specific visualization often additional information is utilized. This
information could either be motivated by certain structural aspects of the network we are trying
to visualize, e. g., properties of vertices or edges (see Section →16.3.1), or even by domain-
specific information (e. g., from biology or economics). An important consequence of the
"arbitrariness" of a network visualization is that there is no formal mapping from G to its
visualization.
Figure 16.4 Two different visualizations of the network depicted in →Figure 16.3.

16.2.3 Directed and weighted networks


We will start this section with some basic definitions for directed networks.

Definition 16.2.3.

A directed network, G = (V , E) , is defined by a vertex set V and an edge set E ⊆ V × V .

E ⊆ V × V means that the directed edges of G form a subset of all possible directed edges.
The expression V × V is a Cartesian product and the corresponding result is the set of all
ordered pairs of vertices, i. e., of all possible directed edges. If u, v ∈ V , then we write (u, v)
to express that there exists a directed edge from u to v.
The definition of the adjacency matrix of a directed graph is very similar to that of an
undirected graph.

Definition 16.2.4.

The components of an adjacency matrix, A, for a directed network, G, are defined by

$$A_{ij} = \begin{cases} 1 & \text{if there is a connection from } i \text{ to } j \text{ in } G, \\ 0 & \text{otherwise,} \end{cases} \tag{16.3}$$

for $i, j \in V$.
In contrast with equation (→16.1), here, we choose the start vertex (i) and the end vertex (j) of a
directed edge. →Figure 16.5 presents a directed network with the following adjacency matrix:

$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}. \tag{16.4}$$

Here, we can see that $A^t \neq A$. Therefore, the transpose of the adjacency matrix, A, of a directed
graph is not always equal to A.

For example, the edge set of the directed network, depicted in →Figure 16.5, is
$E = \{(2, 1), (2, 3), (4, 1), (3, 4), (3, 5)\}$.

Figure 16.5 A directed network.
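The role of the edge direction in Definition 16.2.4 can be checked with a short base R sketch that uses the edge set of →Figure 16.5 stated above. The variable names are illustrative.

```r
# Directed adjacency matrix from the edge set of Figure 16.5;
# each row of `arcs` is an ordered pair (start vertex i, end vertex j).
arcs <- rbind(c(2, 1), c(2, 3), c(4, 1), c(3, 4), c(3, 5))
A <- matrix(0, 5, 5)
A[arcs] <- 1                 # only A_ij is set, A_ji stays 0

isSymmetric(A)               # compares A with its transpose t(A)
out_degree <- rowSums(A)     # number of edges leaving each vertex
in_degree  <- colSums(A)     # number of edges entering each vertex
```

In contrast to the undirected case, the reversed pairs are not added, so the row sums and the column sums of A generally differ.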

Now, we define a weighted, directed network.

Definition 16.2.5.

The components of an adjacency matrix, W, for a weighted, directed network, G, are defined by

$$W_{ij} = \begin{cases} w_{ij} & \text{if there is a connection from } i \text{ to } j \text{ in } G, \\ 0 & \text{otherwise,} \end{cases} \tag{16.5}$$

for $i, j \in V$.
In equation (→16.5), $w_{ij} \in \mathbb{R}$ denotes the weight associated with an edge from vertex i to
vertex j.

→Figure 16.6 depicts the weighted, directed network with the following adjacency matrix:

$$W = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 2 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 3 & 1 \\ 3 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}. \tag{16.6}$$

From the adjacency matrix W, we can identify the following (real) weights: $w_{21} = 2$, $w_{23} = 1$,
$w_{34} = 3$, $w_{41} = 3$, $w_{35} = 1$.

Figure 16.6 Example of a weighted and directed network.

16.2.4 Walks, paths, and distances in networks


We start this section with some basic definitions.

Definition 16.2.6.

A walk w of length μ in a network is a sequence of μ edges, which are not necessarily different. We
write $w = v_1 v_2, v_2 v_3, \ldots, v_{\mu} v_{\mu+1}$. We also call the walk w closed if $v_1 = v_{\mu+1}$.
Definition 16.2.7.

A path P is a special walk, where all the edges and all the vertices are different.
In a directed graph, a closed path is also called a cycle.
Let us illustrate these definitions by way of the network examples depicted in →Figure 16.7. If
we consider the upper graph on the left hand side, we see that 12, 23, 34 is an undirected path, as
all vertices and edges are different. This path has a length of 3. On the other hand, in the upper
graph of the right hand side, 12, 23, 32 is a walk of length 3. By considering the same graph, we
also find that 14, 43, 34, 41 is a closed walk, as it starts and ends in vertex 1. This closed walk has a
length of 4.
Now, let us consider the lower graph on the left hand side of →Figure 16.7. In this graph, 12,
23, 34 is a directed path of length 3 as the underlying graph is directed.

Figure 16.7 Undirected and directed path.

In the lower graph on the right-hand side, the path 23, 34, 41 has length 3, but does
not represent a cycle, as its start and end vertices are not the same.
Now, we define the term distance between vertices in a network.

Definition 16.2.8.

A shortest path between two vertices is a path of minimum length connecting them.


Also, we define the topological distance between two vertices in a network.
Definition 16.2.9.

The number of edges in the shortest path connecting the vertices u and v is the topological
distance d(u, v) .
Again, we consider the upper graph on the right hand side of →Figure 16.7. For instance, the
path 12, 23, 34, for going from vertex 1 to vertex 4 has length 3 and is obviously not the shortest
one. Calculating the shortest path yields d(1, 4) = 1 .

16.3 Quantitative network measures


Many quantitative network measures, also called network scores or indices, have been developed
to characterize structural properties of networks, see, e. g., [→19], [→43], [→67]. These measures
have often been used for characterizing network classes discussed in Section →16.5, or to
identify distinct network patterns, such as linear and cyclic subgraphs. In the following, we discuss
the most important measures to characterize networks structurally. In case no remark is made,
we always assume that the networks are undirected.
In general, we distinguish between global and local graph measures. A global measure maps
the entire network to a real number. A local measure maps a component of the graph, e. g., a
vertex, an edge, or a subgraph to a real number. The design of these measures depends on the
application domain.

16.3.1 Degree and degree distribution

Definition 16.3.1.

Let G = (V , E) be a network. The degree $k_i$ of the vertex $v_i$ is the number of edges, which are
incident with the vertex $v_i$.

In order to characterize complex networks by their degree distributions [→23], [→128], we


utilize the following definition:

Definition 16.3.2.

Let G = (V , E) be a network. We define the degree distribution as follows:

$$P(k) := \frac{\delta_k}{N}, \tag{16.7}$$

where $|V| := N$ and $\delta_k$ denotes the number of vertices in the network, G, of degree k.

It is clear that equation (→16.7) represents the proportion of vertices in G possessing degree
k.
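For a network given by its adjacency matrix, the degrees and the degree distribution (→16.7) follow from the row sums. The sketch below uses the adjacency matrix (→16.2) of the network in →Figure 16.3; the variable names are illustrative.

```r
# Adjacency matrix (16.2) of the undirected network in Figure 16.3
A <- matrix(c(0, 1, 0, 1, 0,
              1, 0, 1, 0, 0,
              0, 1, 0, 1, 1,
              1, 0, 1, 0, 0,
              0, 0, 1, 0, 0), nrow = 5, byrow = TRUE)

k <- rowSums(A)              # degree k_i of each vertex v_i
P <- table(k) / length(k)    # P(k) = delta_k / N for every occurring degree k
P
```

Here, one vertex has degree 1, three vertices have degree 2, and one vertex has degree 3, so that the proportions are 0.2, 0.6, and 0.2, respectively.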
Degree-based statistics have been used in various application areas in computer science. For
example, it is known that the vertex degrees of many real-world networks, such as www-
graphs and social networks [→2], [→23], [→24], are not Poisson distributed. Instead, their
degree distributions often follow, at least approximately, a power law:

$$P(k) \sim k^{-\gamma}, \quad \gamma > 1. \tag{16.8}$$

16.3.2 Clustering coefficient


The clustering coefficient, $C_i$, is a local measure [→198] defined, for a particular vertex $v_i$, as
follows:

$$C_i = \frac{2 e_i}{n_i (n_i - 1)} = \frac{e_i}{t_i}. \tag{16.9}$$

Here, $n_i$ is the number of neighbors of vertex $v_i$, $e_i$ is the number of adjacent pairs among the
neighbors of $v_i$, and $t_i = n_i (n_i - 1)/2$ is the number of all possible pairs of neighbors. Because
$0 \le e_i \le t_i$, $C_i$ is the probability that two neighbors of node $v_i$ are
themselves connected with each other. →Figure 16.8 depicts an example of a graph as well as the
calculation of the corresponding local clustering coefficient.
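Equation (→16.9) translates almost literally into base R. In the following sketch, the function name `local_clustering` and the example network (a triangle 1, 2, 3 with an additional vertex 4 attached to vertex 3) are hypothetical and are not taken from →Figure 16.8; returning `NA` for vertices with fewer than two neighbors is a convention chosen here, because the denominator of (→16.9) is zero in that case.

```r
# Local clustering coefficient C_i for vertex i of an undirected network
# with adjacency matrix A.
local_clustering <- function(A, i) {
  nb  <- which(A[i, ] == 1)        # neighbors of vertex i
  n_i <- length(nb)
  if (n_i < 2) return(NA_real_)    # no pairs of neighbors: left undefined here
  e_i <- sum(A[nb, nb]) / 2        # adjacent pairs among the neighbors
  2 * e_i / (n_i * (n_i - 1))
}

# Hypothetical example: triangle 1-2-3 plus vertex 4 attached to vertex 3
E <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
A <- matrix(0, 4, 4)
A[E] <- 1
A[E[, 2:1]] <- 1
cc <- sapply(1:4, function(i) local_clustering(A, i))
```

For vertex 3, only one of the three possible pairs of neighbors is connected, which gives a coefficient of 1/3, whereas vertices 1 and 2 have the coefficient 1.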

Figure 16.8 Local clustering coefficient.

16.3.3 Path-based measures


Path- and distance-based measures have proven useful, especially when characterizing
networks [→64], [→104]. For example, the average path length and the diameter of a network
have been used to characterize classes of biological and technical networks, see [→64], [→104],
[→196]. An important finding is that the average path lengths and diameters of certain biological
networks are rather small compared to the size of a network, see [→115], [→128], [→143].
In the following, we briefly survey important path and distance-based network measures, see
[→29], [→31], [→93], [→94], [→174]. Starting from a network G = (V , E) , we define the distance
matrix as follows:

Definition 16.3.3.
The distance matrix is defined by

$$\bigl(d(v_i, v_j)\bigr)_{v_i, v_j \in V}, \tag{16.10}$$

where $d(v_i, v_j)$ is the topological distance between $v_i$ and $v_j$.


Similarly, the mean or characteristic distance of a network, G = (V , E) , can be defined as


follows:

Definition 16.3.4.

$$\bar{d}(G) := \frac{1}{\binom{N}{2}} \sum_{1 \le i < j \le N} d(v_i, v_j). \tag{16.11}$$

We also define other well-known distance-based graph measures [→94] that have been used
extensively in various disciplines [→57], [→197].

Definition 16.3.5.

Let G = (V , E) be a network. The eccentricity of a vertex v ∈ V is defined by


$$\sigma(v) = \max_{u \in V} d(u, v). \tag{16.12}$$

Definition 16.3.6.

Let G = (V , E) be a network. The diameter of the network, G, is defined by

$$\rho(G) = \max_{v \in V} \sigma(v). \tag{16.13}$$

Definition 16.3.7.

Let G = (V , E) be a network. The radius of the network, G, is defined by

$$r(G) = \min_{v \in V} \sigma(v). \tag{16.14}$$
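The measures of Definitions 16.3.3 to 16.3.7 can be illustrated in base R for the network in →Figure 16.3. The distance matrix is computed here with the Floyd–Warshall scheme, which is a choice made for this sketch and is not prescribed by the text; the function name `floyd_warshall` is illustrative.

```r
# All-pairs topological distances via the Floyd-Warshall scheme
floyd_warshall <- function(A) {
  N <- nrow(A)
  D <- ifelse(A == 1, 1, Inf)   # adjacent vertices have distance 1
  diag(D) <- 0
  for (k in seq_len(N)) {
    # allow vertex k as an intermediate vertex on a path from i to j
    D <- pmin(D, outer(D[, k], D[k, ], "+"))
  }
  D
}

# Adjacency matrix (16.2) of the network in Figure 16.3
A <- matrix(c(0, 1, 0, 1, 0,
              1, 0, 1, 0, 0,
              0, 1, 0, 1, 1,
              1, 0, 1, 0, 0,
              0, 0, 1, 0, 0), nrow = 5, byrow = TRUE)
N <- nrow(A)

D         <- floyd_warshall(A)                    # distance matrix (16.10)
mean_dist <- sum(D[upper.tri(D)]) / choose(N, 2)  # mean distance (16.11)
ecc       <- apply(D, 1, max)                     # eccentricities (16.12)
diameter  <- max(ecc)                             # diameter (16.13)
radius    <- min(ecc)                             # radius (16.14)
```

For this network, the ten pairwise distances sum to 16, so that the mean distance is 1.6, the diameter is 3 (attained between vertices 1 and 5), and the radius is 2.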

16.3.4 Centrality measures


These graph measures have been investigated extensively by social scientists for analyzing the
communication within groups of people [→80], [→81], [→197]. For instance, it could be interesting
to know how important or distinct vertices, e. g., representing persons, in social networks are
[→197]. In the context of social networks, importance can be seen as centrality. Following this idea,
numerous centrality measures [→92], [→197] have been developed to determine whether
vertices, e. g., representing persons, may act distinctly with respect to the communication ability
in these networks. In this section, we briefly review the most important centrality measures, see
[→80], [→81], [→197].
Definition 16.3.8.

Let G = (V , E) be a network. The so-called degree centrality of a vertex v ∈ V is defined by

$$C_D(v) = k_v, \tag{16.15}$$

where $k_v$ denotes the degree of the vertex v.


When analyzing directed networks, the degree centrality can be defined straightforwardly by
utilizing the definition of the in-degree and out-degree [→94]. Now, let us define the well-known
betweenness centrality measure [→80], [→81], [→159], [→197].

Definition 16.3.9.

Let G = (V , E) be a network. The betweenness centrality is defined by


$$C_B(v_k) = \sum_{v_i, v_j \in V,\; v_i \neq v_j} \frac{\sigma_{v_i v_j}(v_k)}{\sigma_{v_i v_j}}, \tag{16.16}$$

where $\sigma_{v_i v_j}$ stands for the number of shortest paths from $v_i$ to $v_j$, and $\sigma_{v_i v_j}(v_k)$ for the number
of shortest paths from $v_i$ to $v_j$ that include $v_k$.

In fact, the quantity

$$\frac{\sigma_{v_i v_j}(v_k)}{\sigma_{v_i v_j}} \tag{16.17}$$

can be seen as the probability that $v_k$ lies on a shortest path connecting $v_i$ with $v_j$.


A further well-known measure of centrality is called closeness centrality.

Definition 16.3.10.

Let G = (V , E) be a network. The closeness centrality is defined by


$$C_C(v_k) = \frac{1}{\sum_{i=1}^{N} d(v_k, v_i)}, \tag{16.18}$$

where $d(v_k, v_i)$ is the number of edges on a shortest path between $v_k$ and $v_i$.

When there exists more than one shortest path connecting $v_k$ with $v_i$, $d(v_k, v_i)$ remains
unchanged.
The measure $C_C(v_k)$ has often been used to determine how close a vertex is to the other vertices
in a given network [→197].
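Degree and closeness centrality can be sketched in base R for the network in →Figure 16.3. The distances are obtained here from powers of the adjacency matrix (the smallest k for which the entry (i, j) of the k-th power of A is positive equals d(i, j)); this is one simple option for small connected networks and is an assumption of the sketch, not the method of the text.

```r
# Adjacency matrix (16.2) of the network in Figure 16.3
A <- matrix(c(0, 1, 0, 1, 0,
              1, 0, 1, 0, 0,
              0, 1, 0, 1, 1,
              1, 0, 1, 0, 0,
              0, 0, 1, 0, 0), nrow = 5, byrow = TRUE)
N <- nrow(A)

C_D <- rowSums(A)                      # degree centrality (16.15)

# Topological distances from matrix powers: the entry (i, j) of A^k counts
# the walks of length k from i to j, so the first k with a positive entry
# is the distance d(i, j).
D <- matrix(Inf, N, N)
diag(D) <- 0
P <- diag(N)
for (k in seq_len(N - 1)) {
  P <- P %*% A
  D[P > 0 & is.infinite(D)] <- k
}

C_C <- 1 / rowSums(D)                  # closeness centrality (16.18)
```

Vertex 3 has the largest degree and also the largest closeness centrality, 1/5, because the sum of its distances to all other vertices is the smallest.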

16.4 Graph algorithms


In this section, we discuss some important graph algorithms. Graph algorithms are frequently
used for search problems on graphs. In general, search problems on a graph require finding or visiting
certain distinct vertices. An example thereof is to visit all vertices of an input graph in the
tree-like hierarchy that results from selecting an arbitrary root vertex in the input graph.
The two most prominent examples of graph algorithms for performing graph-based searches are
the breadth-first and depth-first algorithms, see [→38].

16.4.1 Breadth-first search


Breadth-first search (BFS) is a well-known and simple graph algorithm [→38]. The underlying
principle of this algorithm relates to discovering all reachable vertices and touching all edges
systematically, starting from a given vertex s. After selecting s, all neighbors of s are discovered,
and so forth. Here, discovering vertices in a graph involves determining the topological distance
(see Definition →16.2.9) between s and all other reachable vertices.
Starting with a graph G = (V , E) , the algorithm BFS uses colors in order to symbolize the state
of the vertices as follows:
white: the unseen vertices are white; initially, all vertices are white;
grey: the vertex is seen, but it needs to be determined whether it has white neighbors;
black: the vertex is processed, i. e., this vertex and all of its neighbors were seen.
→Figure 16.9 shows an example by using a stack approach, where the colors are omitted. The first
graph in →Figure 16.9 is the input graph. The start vertex is vertex 2. The two stacks on the left
hand side in each situation show the vertices, which have already been visited along with their
parents. For instance, we see that after four steps of the algorithm (the fifth graph in the first row
of →Figure 16.9), we have discovered 3 vertices, whose topological distance equals 1. Also, we see
in the fifth graph in the first row of →Figure 16.9 that the vertices 1, 4, and 5 have been visited
together with their parent relations. Finally, all vertices have been visited in the last graph in
→Figure 16.10 and, hence, BFS ends.
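The coloring scheme described above can be turned into a compact base R sketch. The function name `bfs`, the use of a first-in-first-out queue, and the example network (the one in →Figure 16.3 rather than the input graph of →Figure 16.9) are choices made for this illustration.

```r
# Breadth-first search on an undirected network with adjacency matrix A,
# starting from vertex s. Returns topological distances and parent vertices.
bfs <- function(A, s) {
  N <- nrow(A)
  color  <- rep("white", N)            # initially, all vertices are unseen
  dist   <- rep(Inf, N)
  parent <- rep(NA_integer_, N)
  color[s] <- "grey"
  dist[s]  <- 0
  queue <- s                           # first-in-first-out queue of grey vertices
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    for (u in which(A[v, ] == 1)) {
      if (color[u] == "white") {       # u is seen for the first time
        color[u]  <- "grey"
        dist[u]   <- dist[v] + 1
        parent[u] <- v
        queue <- c(queue, u)
      }
    }
    color[v] <- "black"                # v and all of its neighbors were seen
  }
  list(dist = dist, parent = parent)
}

# Network of Figure 16.3, start vertex 2
A <- matrix(c(0, 1, 0, 1, 0,
              1, 0, 1, 0, 0,
              0, 1, 0, 1, 1,
              1, 0, 1, 0, 0,
              0, 0, 1, 0, 0), nrow = 5, byrow = TRUE)
res <- bfs(A, 2)
```

The vector `res$dist` contains the topological distances from the start vertex, and `res$parent` encodes the parent relations, from which the search tree can be reconstructed.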
Figure 16.9 The first graph is the input graph to run BFS. The start vertex is vertex 2. The steps
are shown together with a stack showing the visited and parent vertices.

16.4.2 Depth-first search


Depth-first search (DFS) is another graph algorithm for searching graphs [→38]. Suppose we start
at a certain vertex. In case a vertex we visit has a still unexplored neighbor, we visit this neighbor
and continue going into the depth to find another unexplored neighbor, if it exists. We continue
recursively with this procedure until we cannot go any deeper. Then, we perform backtracking
to find an edge that leads further into the depth.
To start, we highlight all vertices as not found (white). The basic strategy of DFS is as follows:
  Highlight the current vertex v as found (grey)
  While there exists an edge {u, v} with a not found successor u:
    Perform the search recursively from u. That is,
      Explore {u, w} and visit w. Explore from w in the depth until it ends
      Highlight u as finished (black)
      Perform backtracking from u to v
  Highlight v as finished (black)
Finally, we obtain all vertices, starting from the start vertex. →Figure 16.11 shows an input graph
to run DFS. Then, →Figure 16.11 shows the steps to explore the vertices in the depth, starting
from vertex 0. →Figure 16.12 shows the last five graphs before DFS ends.
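The recursive strategy can be sketched in base R as follows. The function name `dfs` and the example network (the one in →Figure 16.3, not the input graph of →Figure 16.11) are illustrative, and neighbors are visited in increasing order of their labels, which is an arbitrary choice.

```r
# Recursive depth-first search on a network with adjacency matrix A.
# Returns the order in which vertices are found (grey) and finished (black).
dfs <- function(A, start) {
  N <- nrow(A)
  color    <- rep("white", N)          # all vertices are initially not found
  found    <- integer(0)
  finished <- integer(0)
  visit <- function(v) {
    color[v] <<- "grey"                # v is found
    found    <<- c(found, v)
    for (u in which(A[v, ] == 1)) {
      if (color[u] == "white") visit(u)   # go into the depth from u
    }
    color[v] <<- "black"               # v is finished, backtrack to the caller
    finished <<- c(finished, v)
  }
  visit(start)
  list(found = found, finished = finished)
}

# Network of Figure 16.3, start vertex 1
A <- matrix(c(0, 1, 0, 1, 0,
              1, 0, 1, 0, 0,
              0, 1, 0, 1, 1,
              1, 0, 1, 0, 0,
              0, 0, 1, 0, 0), nrow = 5, byrow = TRUE)
res <- dfs(A, 1)
```

For this network, the search runs from vertex 1 over 2 and 3 to vertex 4, backtracks to vertex 3, and then explores vertex 5, so that vertex 4 is finished first and the start vertex last.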

Figure 16.10 The last two graphs when running BFS on the input graph shown in →Figure 16.9.
Figure 16.11 The first graph is the input graph to run DFS. The start vertex is vertex 0. The steps
are shown together with a stack showing the visited and parent vertices.

Figure 16.12 The last five graphs when running DFS on the input graph shown in →Figure 16.11.
16.4.3 Shortest paths
Determining the shortest paths in networks has been a long-standing problem in graph theory
[→38], [→58]. For instance, finding the flight with the earliest arrival time in a given aviation
network [→117] requires the determination of all shortest paths. Other examples for the
application of shortest paths are graph optimization problems, e. g., for transportation networks
of production processes [→156].
A classical algorithm for determining the shortest paths within networks is due to Dijkstra
[→58]. It is interesting to note that other methods in algorithmic graph theory, e. g., for determining
minimum spanning trees (see Section →16.4.4) and breadth-first search, are closely related to
Dijkstra’s method, see [→38].
Dijkstra’s method can be described as follows. Given a network G = (V , E) and a starting
vertex v ∈ V , the algorithm finds the shortest paths to all other vertices in G. In this case,
Dijkstra’s algorithm [→58] generates a so-called shortest path tree containing all the vertices that
lie on the shortest paths.
We describe the basic steps of Dijkstra’s algorithm in order to determine the shortest paths
starting from a given vertex to all other vertices in G. Here, we assume that the input graph has
vertex labels and real, nonnegative edge labels [→38], [→58]:
  We create the set of shortest path trees (SPTS), containing the vertices that are in a shortest
  path tree. These vertices have the property that they have minimum distance from the
  starting vertex. Initially, SPTS = ∅ .
  We assign the initial distance value ∞ to all vertices of the input graph. Also, we set the
  distance value of the starting vertex equal to zero.
  While the vertex set of SPTS does not contain all vertices of the input graph, the following
  applies:
    Select a vertex v ∈ V with minimum distance value that is not contained in the vertex set
    of SPTS
    Put v into the vertex set of SPTS
    Update the distance values of all vertices that are adjacent to v. In order to
    update the distances, we iterate over all adjacent vertices. For every vertex u ∈ V
    adjacent to v, we perform the following: If the sum of the assigned
    distance value of the vertex v (from the starting vertex) and the weight of the edge
    {v, u} is less than the distance value of u, update the distance value of u to this sum.
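The steps above can be sketched in base R. The function name `dijkstra`, the convention that a zero entry of the weight matrix W means "no edge" (so all edge weights are assumed to be strictly positive), and the small example network are assumptions of this illustration; the example is not the graph of →Figure 16.13.

```r
# Dijkstra's algorithm on a weighted network with weight matrix W
# (W[i, j] > 0 is the weight of the edge from i to j, 0 means no edge).
dijkstra <- function(W, s) {
  N <- nrow(W)
  dist    <- rep(Inf, N)               # initial distance values
  dist[s] <- 0                         # the starting vertex has distance zero
  parent  <- rep(NA_integer_, N)
  in_spts <- rep(FALSE, N)             # SPTS is initially empty
  while (!all(in_spts)) {
    cand <- which(!in_spts)
    v <- cand[which.min(dist[cand])]   # vertex outside SPTS with minimum distance
    if (is.infinite(dist[v])) break    # remaining vertices are unreachable
    in_spts[v] <- TRUE
    for (u in which(W[v, ] > 0 & !in_spts)) {
      if (dist[v] + W[v, u] < dist[u]) {
        dist[u]   <- dist[v] + W[v, u] # shorter path to u via v
        parent[u] <- v
      }
    }
  }
  list(dist = dist, parent = parent)
}

# Hypothetical weighted, undirected example; rows of e are (i, j, weight)
e <- rbind(c(1, 2, 1), c(1, 3, 4), c(2, 3, 2), c(3, 4, 1))
W <- matrix(0, 4, 4)
W[e[, 1:2]] <- e[, 3]
W[e[, 2:1]] <- e[, 3]
res <- dijkstra(W, 1)
```

In this example, the direct edge from vertex 1 to vertex 3 has weight 4, but the path over vertex 2 has total weight 3, so the distance value of vertex 3 is updated once. The vector `res$parent` describes the resulting shortest path tree.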

Now we demonstrate the application of this algorithm with an example. The input graph is
given in →Figure 16.13 (A). The vertex set of SPTS is initially empty, and we choose vertex 1
as the start node. The initial distance values can be seen in →Figure 16.13 (B). We perform some steps
to see how the set of shortest paths emerges; see →Figure 16.14. The vertices highlighted in
red are the ones in the shortest path tree. The graph shown in →Figure 16.14 in situation D is the
final shortest path tree consisting of all vertices of the input graph in →Figure 16.13. That means, the
shortest path tree gives all shortest paths from vertex 1 to all other vertices.
Figure 16.13 (A) The input graph. (B) The graph with initial vertex weights.

Figure 16.14 Steps when running Dijkstra’s algorithm for the graph in (A) shown in →Figure
16.13.

As a remark, we would like to note that the graph shown in →Figure 16.13 is a weighted,
undirected graph (see Section →16.2.3). So, using the algorithm of Dijkstra [→58] makes sense for
edge-weighted graphs, as the shortest path between two vertices of a graph depends on these
weights. Interestingly, the shortest path problem becomes simpler if we consider
unweighted networks. If all edges in a network are unweighted, we may set all edge weights to 1.
Then, Dijkstra’s algorithm reduces to the search of the topological distances between vertices, see
Definition →16.2.9.
Let us consider the graph A in →Figure 16.15. In case we determine all shortest paths from
vertex 1 to vertex 4, we see that there exists more than one shortest path between these two
vertices. We find the shortest paths 1-3-4 and 1-2-4. So, the shortest path problem does not
possess a unique solution. The same holds when considering the shortest paths between vertex 1
and vertex 5. The calculations yield the two shortest paths 1-3-4-5 and 1-2-4-5.

Figure 16.15 Calculating shortest paths in unweighted networks.

Another example for calculating shortest paths in unweighted networks is given by graph
B in →Figure 16.15. The shown graph $P_n$ is referred to as the path graph [→186] with n
vertices. We observe that there exist n − 1 pairs of vertices with d(u, v) = 1 , n − 2 pairs of
vertices with d(u, v) = 2 , and so forth. Finally, we see that there exists only n − (n − 1) = 1 pair
with d(u, v) = n − 1 . Here, d(u, v) = n − 1 is just the diameter of $P_n$.
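The counting argument for the path graph can be checked with a short calculation. Since there are $n - k$ pairs of vertices at distance $k$, the counts must add up to the total number of vertex pairs, and the characteristic distance of Definition 16.3.4 follows in closed form:

```latex
\sum_{k=1}^{n-1} (n-k) = \binom{n}{2}, \qquad
\bar{d}(P_n) = \frac{1}{\binom{n}{2}} \sum_{k=1}^{n-1} k\,(n-k)
             = \frac{2}{n(n-1)} \cdot \frac{(n-1)\,n\,(n+1)}{6}
             = \frac{n+1}{3}.
```

For instance, for $n = 3$ the distances are 1, 1, and 2, which gives a mean distance of 4/3.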

In Listing 16.1, we show an example of how shortest paths can be found by using R. For this
example, we use a small-world network with n = 25 nodes. The command distances() gives only
the lengths of the paths, whereas the command shortest_paths() provides one shortest path. In contrast,
all_shortest_paths() returns all shortest paths.

16.4.4 Minimum spanning tree


In Section →16.5, we will provide Definition →16.5.1, formally introducing what a tree is.
Informally, it is an acyclic and connected graph [→94]. In this section, we discuss spanning trees
and the minimum spanning tree problem [→15], [→38].
Suppose we start with an undirected input graph $G = (V_G, E_G)$. A spanning tree
$T = (V_T, E_T)$ of G is a tree, where $V_T = V_G$. In this case, we say that the tree T spans G, as the
vertex sets of the two graphs are the same and every edge of T belongs to G.
→Figure 16.16 shows an input graph G with a possible spanning tree T. It is obvious, by definition,
that there often exists more than one spanning tree of a given graph G. The problem of
determining spanning trees gets more complex if we consider weighted networks. In case we
start with an edge-labeled graph, one could determine the so-called minimum spanning tree
[→38]. This can be achieved by adding up the weights of all edges of each spanning tree and, finally,
searching for the tree with minimum cost among all existing spanning trees. Again, the minimum
spanning tree for a given network need not be unique. Well-known algorithms to determine the minimum
spanning tree are due to Prim and Kruskal, see, e. g., [→38]. We emphasize that the application of
those methods may result in different minimum spanning trees. Here, we just demonstrate
Kruskal’s algorithm [→38] representing a greedy approach. Let G = (V , E) be a connected
graph with real edge weights. The main steps are the following:
We arrange the edges according to their weights in ascending order
We add edges to the resulting minimum spanning tree as follows: we start with the smallest
weight and end with the largest weight by consecutively adding edges according to their
weight costs
We only add the described edges if the process does not create a cycle
→Figure 16.17 shows the sequence of steps when applying Kruskal’s algorithm to the shown
input graph G. We choose any edge with the smallest weight as depicted in situation A. In B,
we choose the next smallest edge, and so on. We repeat this procedure according to the
algorithmic steps above, skipping every edge that would create a cycle, until all vertices are
connected. Note, intermediate subgraphs can be disconnected (see C). One possible minimum
spanning tree is shown in situation E. Differences
between the algorithms due to Kruskal and Prim are explained in, e. g., [→38].
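The three steps of Kruskal’s algorithm can be sketched in base R. The function name `kruskal`, the representation of the graph as a three-column matrix (i, j, weight), the simple component labels used for the cycle test, and the example network are assumptions of this illustration; the example is not the input graph of →Figure 16.17.

```r
# Kruskal's algorithm. `edges` is a matrix with rows (i, j, weight) and
# N is the number of vertices. Returns the edges of a minimum spanning tree.
kruskal <- function(edges, N) {
  edges <- edges[order(edges[, 3]), , drop = FALSE]  # ascending weights
  comp  <- seq_len(N)              # component label of each vertex
  keep  <- logical(nrow(edges))
  for (r in seq_len(nrow(edges))) {
    i <- edges[r, 1]
    j <- edges[r, 2]
    if (comp[i] != comp[j]) {      # different components: no cycle is created
      keep[r] <- TRUE
      comp[comp == comp[j]] <- comp[i]   # merge the two components
    }
  }
  edges[keep, , drop = FALSE]
}

# Hypothetical weighted example with 4 vertices and 5 edges
edges <- rbind(c(1, 2, 1), c(2, 3, 2), c(1, 3, 3), c(3, 4, 4), c(2, 4, 5))
mst_edges <- kruskal(edges, 4)
sum(mst_edges[, 3])                # total weight of the spanning tree
```

In this example, the edge {1, 3} with weight 3 is skipped because vertices 1 and 3 are already connected via vertex 2, and the resulting tree has N − 1 = 3 edges with total weight 7.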

Figure 16.16 The graph G and a spanning tree T.


Figure 16.17 The input graph G and some subgraphs to achieve a minimum spanning tree in E.

In Listing 16.2, we show an example of how the minimum spanning tree can be found by using
R. For this example, we use a small-world network with n = 25 nodes. The command mst() gives
the underlying minimum spanning tree.

16.5 Network models and graph classes


In this section, we introduce important network models and classes, which have been used in
many disciplines [→25], [→94]. All of these models are characterized by specific structural
properties.

16.5.1 Trees
We start with the formal definition of a tree [→94], already briefly introduced in Section →16.4.4.

Definition 16.5.1.

A tree is a graph G = (V , E) that is connected and acyclic. A graph is acyclic if it does not contain
any cycle.
In fact, there exist several characterizations for trees which are equivalent [→100].
Theorem 16.5.1.
Let G = (V , E) be a graph, and let |V | := N . The following assertions are equivalent:
1. G = (V , E) is a tree.
2. Any two vertices of G are connected by a unique path.
3. G is connected, but for each edge e ∈ E , G\{e} is disconnected.
4. G is connected and has exactly N − 1 edges.
5. G is cycle free and has exactly N − 1 edges.
Special types of trees are rooted trees [→94]. Rooted trees often appear in graph algorithms,
e. g., when performing a search or sorting [→38].

Definition 16.5.2.

A rooted tree is a tree containing one designated root vertex. There is a unique path from the root
vertex to every other vertex in the tree, and all edges are directed away from the root.
→Figure 16.18 presents a rooted tree, in which the root is at the very top of the tree, whereas all
other vertices are placed on lower levels. The tree in →Figure 16.18 is an unordered tree,
which means that the order of the vertices is arbitrary. For instance, the order of the green and
orange vertex can be swapped.
Figure 16.18 A rooted tree with its designated root vertex.

Classes of rooted trees include ordered and binary-rooted trees [→94].

Definition 16.5.3.

An ordered tree is a rooted tree assigning a fixed order to the children of each vertex.

Definition 16.5.4.

A binary tree is an ordered tree, where each vertex has at most two children.

16.5.2 Generalized trees


Undirected and directed rooted trees can be generalized by so-called generalized trees [→51],
[→132]. A generalized tree is also hierarchical like an ordinary rooted tree, but its edge set allows
a richer connectivity among the vertices. →Figure 16.19 shows an undirected generalized tree
with four levels, including the root level.
Figure 16.19 A generalized tree.

We now give a formal definition of an undirected generalized tree [→63].

Definition 16.5.5.

A generalized tree GT is defined by a vertex set V, an edge set E, a level set L, and a multilevel
function L . The edge set E will be defined in Definition →16.5.7. The vertex and edge sets define
the connectivity, and the level set and the multilevel function induce a hierarchy between the
nodes of GT . The index r ∈ V indicates the root.
The multilevel function is defined as follows [→63].

Definition 16.5.6.

The function L : V ∖ {r} → L is called a multilevel function.


The multilevel function L assigns to all nodes, except r, an element l ∈ L , which corresponds to
the level they possess.

Definition 16.5.7.

A generalized tree as defined by Definition →16.5.5 has three edge types [→63]:
Edges with |L (m) − L (n)| = 1 are called kernel edges ( $E_1$ ).

Edges with |L (m) − L (n)| = 0 are called cross edges ( $E_2$ ).

Edges with |L (m) − L (n)| > 1 are called up edges ( $E_3$ ).

Note that for an ordinary rooted tree as defined by Definition →16.5.2, we always obtain
|L (m) − L (n)| = 1 for all pairs (m, n) of adjacent vertices. From the definitions given above and
the visualization in →Figure 16.19, it is clear that a generalized tree is a tree-like graph with a
hierarchy, and may contain cycles.
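
As a hedged illustration of Definition →16.5.7, the following base-R sketch classifies the edges of a small, hypothetical generalized tree by the level difference of their end points. The level vector, the edge list, and the convention of placing the root on level 0 are assumptions made for this example.

```r
# Levels of the vertices 1..6; vertex 1 is the root (level 0 by assumption)
level <- c(0, 1, 1, 2, 2, 3)

# Hypothetical edge list: five tree-like edges plus two additional edges
edges <- rbind(c(1, 2), c(1, 3), c(2, 4), c(3, 5), c(5, 6),
               c(2, 3),    # connects two vertices on the same level
               c(1, 5))    # skips one level

# Absolute level difference of the two end points of every edge
level_diff <- abs(level[edges[, 1]] - level[edges[, 2]])

edge_type <- ifelse(level_diff == 1, "kernel",
                    ifelse(level_diff == 0, "cross", "up"))
print(table(edge_type))
```

An ordinary rooted tree would produce kernel edges only; the cross edge and the up edge are what allow cycles in the generalized tree.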

16.5.3 Random networks


Random networks have also been studied in many fields, including computer science and
network physics [→183]. This class of networks is based on the seminal work of Erdős and
Rényi, see [→76], [→77].
By definition, a random graph with N vertices can be obtained by connecting every pair of vertices
with probability p. Then, the expected number of edges for an undirected random graph is given
by
$$E(n) = p\,\frac{N(N-1)}{2}. \tag{16.19}$$

In what follows, we survey important properties of random networks [→59]. For instance, the
degree $k_i$ of a vertex $v_i$ follows a binomial distribution,
$$P(k_i = k) = \binom{N-1}{k}\, p^k (1-p)^{N-1-k}, \tag{16.20}$$
since the degree of the vertex $v_i$ is at most N − 1 ; in fact, the probability that a particular set
of k edges is present, while the remaining N − 1 − k possible edges are absent, equals
$p^k (1-p)^{N-1-k}$, and there exist $\binom{N-1}{k}$ possibilities to choose the k neighbors
from the N − 1 other vertices.
Considering the limit N → ∞ , Equation (→16.20) yields
$$P(k_i = k) \sim \frac{z^k \exp(-z)}{k!}. \tag{16.21}$$

We emphasize that z = p(N − 1) is the expected number of edges for a vertex. This implies that
if N goes to infinity, the degree distribution of a vertex in a random network can be approximated
by the Poisson distribution. For this reason, random networks are often referred to as Poisson
random networks [→142].
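
The quality of the Poisson approximation can be checked numerically. The following base-R sketch compares equations (→16.20) and (→16.21) for a fixed expected degree z and growing N; the value z = 4, the range of k, and the helper name max_gap are assumptions for illustration.

```r
z <- 4        # expected degree, held fixed while N grows (assumed value)
k <- 0:20     # degrees at which both distributions are compared

# Largest pointwise difference between the binomial and the Poisson probabilities
max_gap <- function(N) {
  p <- z / (N - 1)                                  # chosen so that z = p (N - 1)
  binomial <- dbinom(k, size = N - 1, prob = p)     # equation (16.20)
  poisson  <- dpois(k, lambda = z)                  # equation (16.21)
  max(abs(binomial - poisson))
}

gaps <- sapply(c(50, 500, 5000), max_gap)
print(gaps)   # the gap shrinks as N grows
```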
In addition, one can demonstrate that the degree distribution of the whole random network also
approximately follows a Poisson distribution:
$$P(X_k = r) \sim \frac{z^r \exp(-z)}{r!}. \tag{16.22}$$
Here, $X_k$ denotes the number of vertices in the network that possess degree k, so that
$X_k = r$ means that there exist r such vertices [→4].
As an application, we recall the already introduced clustering coefficient $C_i$ for a vertex $v_i$,
represented by equation (→16.9). In general, this quantity has been defined as the ratio of the
number $|E_i|$ of existing connections among its $k_i$ nearest neighbors divided by the total
number of possible connections. This consideration yields the following:
$$C_i = \frac{2|E_i|}{k_i (k_i - 1)}. \tag{16.23}$$
Therefore, $C_i$ is the probability that two neighbors of $v_i$ are connected with each other in a
random graph, and $C_i = p$ . This can be approximated by
$$C_i \sim \frac{z}{N}, \tag{16.24}$$
as the average degree of a vertex equals z = p(N − 1) ∼ pN .



The two examples of random networks shown in →Figure 16.20 can be generated using the
following R code:
Figure 16.20 Random networks with p = 0.01 (left) and p = 0.1 (right).
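
A minimal base-R sketch of how random networks of this kind might be generated is given below. It is an assumption-laden stand-in, not the book's listing: the helper name er_adjacency and the network size N = 100 are hypothetical, and only the two connection probabilities are taken from the figure caption.

```r
set.seed(1)

# Adjacency matrix of an undirected random graph: every vertex pair is
# connected independently with probability p.
er_adjacency <- function(N, p) {
  A <- matrix(0L, N, N)
  upper <- upper.tri(A)
  A[upper] <- rbinom(sum(upper), size = 1, prob = p)  # one trial per vertex pair
  A + t(A)                                            # mirror to make A symmetric
}

N <- 100                                  # assumed network size
A_sparse <- er_adjacency(N, p = 0.01)
A_dense  <- er_adjacency(N, p = 0.1)

n_edges  <- c(sum(A_sparse), sum(A_dense)) / 2
expected <- c(0.01, 0.1) * N * (N - 1) / 2            # equation (16.19)
print(rbind(observed = n_edges, expected = expected))
```

The observed edge counts fluctuate around the expected values from equation (→16.19).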

16.5.4 Small-world networks


Small-world networks were introduced by Watts and Strogatz [→198]. These networks possess
two interesting structural properties. Watts and Strogatz [→198] found that small-world networks
have a high clustering coefficient and also a short (average) distance among vertices. Small-world
networks have been explored in several disciplines, such as network science, network biology, and
web mining [→190], [→195], [→203].
In the following, we present a procedure developed by Watts and Strogatz [→198] in order to
generate small-world networks.
To start, all vertices of the graph are arranged on a ring, and each vertex is connected with its
k/2 nearest neighbors on either side. →Figure 16.21 (left) shows an example using k = 4 . For
each vertex, the connection to its next neighbor (1st neighbor) is highlighted in blue and the
connection to its second next neighbor (2nd neighbor) in red.
Second, start with an arbitrary vertex i and rewire its connection to its nearest neighbor on,
e. g., the right side with probability $p_{rw}$ to any other vertex j in the network. Then, choose
the next vertex in the ring in a clockwise direction and repeat this procedure.
Third, after all first-neighbor connections have been checked, repeat this procedure for the
second and all higher-order neighbors, if present, successively.
This algorithm guarantees that each connection occurring in the network is chosen exactly once
and rewired with probability $p_{rw}$ . Hence, the rewiring probability, $p_{rw}$ , controls the
disorder of the resulting network topology. For $p_{rw} = 0$ , the regular topology is conserved,
whereas $p_{rw} = 1$ results in a random network. Intermediate values $0 < p_{rw} < 1$ give a
topological structure that is between these two extremes.


→Figure 16.21 (right) shows an example of a small-world network generated with the
following R code:
Figure 16.21 Small-world networks with $p_{rw} = 0.0$ (left) and $p_{rw} = 0.10$ (right). The two
rewired edges are shown in light blue and red.

The generation of a small-world network by using the Watts–Strogatz algorithm consists of two
main parts:
First, the adjacency matrix is initialized in a way that only the nearest k/2 neighbor vertices
on each side are connected. The order of the vertices is arbitrarily induced by the labeling of
the vertices from 1 to N. This allows identifying, e. g., i + f as the fth neighbor of vertex i
with f ∈ N . For instance, f = 1 corresponds to the next neighbor of i. The modulo function is
used to ensure that the neighbor indices i + f remain in the range of {1, … , N } . Due to this
fact the vertices can be seen as organized on a ring. We would like to emphasize that for the
algorithm to work, the number of neighbors k needs to be an even number.
Second, each connection in the network is tested once if it should be rewired with
probability $p_{rw}$ . To do this, a random number, c, between 0 and 1 is uniformly sampled and
tested in an if-clause. Then, if $c \le p_{rw}$ , a connection between vertex i and i + f is rewired.
In this case, we first need to remove the old connection between these vertices and then
draw a random integer, d, from {1, … , N } ∖ {i} to select a new vertex to connect with i.
We would like to note that in order to avoid a self-connection of vertex i, we need to remove
the index i from the set {1, … , N } .
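
The two parts described above can be sketched in base R as follows. This is a reconstruction from the description, not the original listing; the function name watts_strogatz and its argument names are assumptions. As in the description, a rewired connection may land on an already existing one, in which case the total number of edges decreases.

```r
watts_strogatz <- function(N, k, p_rw) {
  stopifnot(k %% 2 == 0, k < N, N >= 3)   # k must be even
  A <- matrix(0L, N, N)

  # Part 1: connect every vertex i with its f-th neighbour, f = 1, ..., k/2
  for (f in seq_len(k / 2)) {
    for (i in seq_len(N)) {
      j <- (i + f - 1) %% N + 1           # modulo keeps the index in 1..N (ring)
      A[i, j] <- A[j, i] <- 1L
    }
  }

  # Part 2: test each initial connection once for rewiring
  for (f in seq_len(k / 2)) {
    for (i in seq_len(N)) {
      j <- (i + f - 1) %% N + 1
      if (runif(1) <= p_rw) {
        A[i, j] <- A[j, i] <- 0L                 # remove the old connection
        d <- sample(setdiff(seq_len(N), i), 1)   # new end point, never i itself
        A[i, d] <- A[d, i] <- 1L
      }
    }
  }
  A
}

set.seed(2)
A_regular <- watts_strogatz(N = 20, k = 4, p_rw = 0)     # regular ring lattice
A_small   <- watts_strogatz(N = 20, k = 4, p_rw = 0.1)   # a few rewired edges
print(table(rowSums(A_small)))                           # degree frequencies
```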

16.5.5 Scale-free networks


Neither random nor small-world networks have a property frequently observed in real-world
networks, namely a scale-free behavior of the degrees [→4],
$$P(k) \sim k^{-\gamma}. \tag{16.25}$$
To explain this common feature, Barabási and Albert introduced a model [→8], now known as the
Barabási–Albert (BA) or preferential attachment model [→142]. This model results in so-called
scale-free networks, which have a degree distribution following a power law [→8]. A major
difference between the preferential attachment model and the other algorithms, described above,
for generating random or small-world networks is that the BA model does not assume a fixed
number of vertices, N, whose connections are then rewired iteratively with a fixed probability;
instead, in this model N grows. Each newly added vertex is connected with a certain probability
(which is not constant) to other vertices already present in the network. The attachment
probability defined by
$$p_i = \frac{k_i}{\sum_j k_j} \tag{16.26}$$
is proportional to the degree $k_i$ of these vertices, explaining the name of the model. This way,
each new vertex is connected to e ∈ N existing vertices in the network.


→Figure 16.22 presents two examples of scale-free networks generated using the following R
code:
Figure 16.22 Scale-free networks with n = 200 (left) and n = 1000 (right).
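
A minimal base-R sketch of preferential attachment according to equation (→16.26) is shown below. It is an illustration under assumptions, not the book's listing: the function name barabasi_albert, the argument m (number of connections per new vertex), and the complete seed graph on m + 1 vertices are hypothetical choices.

```r
barabasi_albert <- function(n, m = 1) {
  stopifnot(n >= m + 2)
  edges  <- t(combn(m + 1, 2))                  # seed: complete graph on m + 1 vertices
  degree <- c(rep(m, m + 1), rep(0, n - m - 1)) # degrees; 0 for vertices not yet added

  for (v in (m + 2):n) {
    old <- seq_len(v - 1)
    # choose m distinct existing vertices with probability proportional to their degree
    targets <- sample(old, size = m, prob = degree[old])
    edges <- rbind(edges, cbind(rep(v, m), targets, deparse.level = 0))
    degree[targets] <- degree[targets] + 1
    degree[v] <- m
  }
  list(edges = edges, degree = degree)
}

set.seed(1)
g <- barabasi_albert(n = 200, m = 1)
print(table(g$degree))   # many low-degree vertices, few high-degree hubs
```

Because the network grows while it is wired, early vertices accumulate many connections, which is what produces the heavy tail of the degree distribution.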

16.6 Further reading


For a general introduction to graph theory, we recommend [→88], [→141]. For graph algorithms
the book by [→38] provides a cornucopia of useful algorithms that can be applied to many graph
structures. An introduction to the usage of networks in biology, economics, and finance can be
found in [→67], [→74]. As an initial reading about network science, the article [→75] provides an
elementary overview.

16.7 Summary
Despite the fact that graph theory is a mathematical subject, similar to linear algebra and analysis,
it has a closer connection to practical applications. For this reason, many real-world networks have
been studied in many disciplines, such as chemistry, computer science, and economics [→64], [→65],
[→143]. A possible explanation for this is provided by the intuitive representation of many natural
networks, e. g., transportation networks of trains and planes, acquaintance networks between
friends, or social networks on Twitter or Facebook. Also, many attributes of graphs, e. g., paths or
the degrees of nodes, have a rather intuitive meaning. This motivates the widespread application of
graphs and networks in nearly all application areas. However, we have also seen in this chapter
that the analysis of graphs can be quite intricate, requiring a thorough understanding of the
previous chapters.

16.8 Exercises
1. Let G = (V , E) be a graph with V = {1, 2, 3, 4, 5} and
E = {{1, 2}, {2, 4}, {1, 3}, {3, 4}, {4, 5}} . Use R to obtain the following results:
Calculate all vertex degrees of G.
Calculate all shortest paths of G.
Calculate diam(G) .
Calculate the number of cycles of G.
2. Generate 5 arbitrary trees with 10 vertices. Calculate their number of edges by
using R, and confirm |E| = 10 − 1 = 9 for all 5 generated trees.
3. Let G = (V , E) be a graph with V = {1, 2, 3, 4, 5, 6} and
E = {{1, 2}, {2, 4}, {1, 3}, {3, 4}, {4, 5}, {5, 6}} . Calculate the number of
spanning trees for the given graph, G.
4. Generate scale-free networks with the BA algorithm. Specifically, generate two
different networks, one for n = 1000 and m = 1 and one for n = 1000 and
m = 3 . Determine for each network the degree distribution of the resulting
network and compare them with each other.
5. Generate small-world networks for n = 2500 . Determine the rewiring probability
$p_{rw}$ which separates small-world networks from random networks. Hint:
Investigate the behavior of the clustering coefficient and the average shortest
paths graphically.
6. Identify practical examples of generalized trees by mapping real-world
observations to this graph structure. Are the directories in a computer organized
as a tree or a generalized tree? Starting from your desktop and considering
shortcuts, does this change the answer?
17 Probability theory
Probability theory is a mathematical subject that is concerned with probabilistic behavior of
random variables. In contrast, all topics of the previous chapters in Part →III were concerned with
deterministic behavior of variables. Specifically, the meaning of a probability is a measure
quantifying the likelihood that events will occur. This significant difference between a
deterministic and probabilistic behavior of variables indicates the importance of this field for
statistics, machine learning, and data science in general, as they all deal with the practical
measurement or estimation of probabilities and related entities from data.
This chapter introduces some basic concepts and key characteristics of probability theory,
discrete and continuous distributions, and concentration inequalities. Furthermore, we discuss the
convergence of random variables, e. g., the law of large numbers or the central limit theorem.

17.1 Events and sample space


To learn about a phenomenon in science, it is common to perform an experiment. If this
experiment is repeated under the same conditions, then it is called a random experiment. The
result of an experiment is called an outcome, and the collection of all outcomes constitutes the
sample space, Ω. A subset of the sample space, A ⊂ Ω , is called an event.

Example 17.1.1.
If we toss a coin once, there are two possible outcomes. Either we obtain a “head” (H) or a “tail”
(T). Each of these outcomes is called an elementary event, $\omega_i$ (or a sample point). In this
case, the sample space is Ω = {H , T } = {(H ), (T )} , or abstractly $\{\omega_1, \omega_2\}$ .
Points in the sample space ω ∈ Ω correspond to an outcome of a random experiment, and subsets
of the sample space, A ⊂ Ω , e. g., A = {T } , are events.

Example 17.1.2.
If we toss a coin three times, the sample space is
Ω = {(H , H , H ), (T , H , H ), (H , T , H ), … , (T , T , T )} , and the elementary outcomes are
triplets composed of elements in {H , T } . It is important to note that the number of triplets in Ω
is the total number of different combinations. In this case, the number of different elements in Ω
is $2^3 = 8$.
From the second example, it is clear that although there are only two elementary outcomes,
i. e., H and T, the size of the sample space can grow by repeating such base experiments.
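
The sample space of Example 17.1.2 can be enumerated directly. The following base-R sketch is only an illustration; the column names toss1 to toss3 and the chosen event ("at least two heads") are assumptions.

```r
outcomes <- c("H", "T")

# Sample space for three coin tosses: all triplets over {H, T}
omega <- expand.grid(toss1 = outcomes, toss2 = outcomes, toss3 = outcomes,
                     stringsAsFactors = FALSE)
n_outcomes <- nrow(omega)        # 2^3 = 8 elementary outcomes

# An event is a subset of the sample space, e.g., "at least two heads"
A <- omega[rowSums(omega == "H") >= 2, ]
print(A)
```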

17.2 Set theory


Before we proceed with the definition of a probability, we provide some necessary background
information about set theory. As highlighted in the examples above, a set is a basic entity on
which the following rests.

Definition 17.2.1.
A set, A, containing no elements is called an empty set, and it is denoted by ∅.
Definition 17.2.2.
If for every element a ∈ A we also have a ∈ B , then A is a subset of B, and this relationship is
denoted by A ⊂ B .

Definition 17.2.3.
The complement of a set A with respect to the entire space Ω, denoted $A^c$ or $\bar{A}$ , is such
that if $a \in A^c$ , then a ∈ Ω , but not in A.
There is a helpful graphical visualization of sets, called Venn diagram, that allows an insightful
representation of set operations. In →Figure 17.1 (left), we visualize the complement of a set A. In
this figure, the entire space Ω is represented by the large square, and the set A is the inner circle
(blue), whereas its complement $\bar{A}$ is the area around it (white). In contrast, in →Figure 17.1
(right), the set A is the outer shaded area, and $\bar{A}$ is the inner circle (white).

Figure 17.1 Visualization of a set A and its complement $\bar{A}$ . Here $\Omega = A \cup \bar{A}$ .


Definition 17.2.4.
Two sets A and B are called equivalent if A ⊂ B and B ⊂ A . In this case A = B .

Definition 17.2.5.
The intersection of two sets A and B consists only of the points that are in A and in B, and such a
relationship is denoted by A ∩ B , i. e., A ∩ B = {x ∣ x ∈ A and x ∈ B} .

Definition 17.2.6.
The union of two sets A and B consists of all points that are either in A or in B, or in A and B, and
this relationship is denoted by A ∪ B , i. e., A ∪ B = {x ∣ x ∈ A or x ∈ B} .
→Figure 17.2 provides a visualization of the intersection (left) and the union (right) of two sets
A and B.

Definition 17.2.7.
The set difference between two sets, A and B, consists of the points that are only in A, but not in B,
and this relationship is denoted by A ∖ B , i. e., A ∖ B = {x ∣ x ∈ A and x ∉ B} .
Using R, the four aforementioned set operations can be carried out as follows:

These commands represent the computational realization of the above Definitions →17.2.4 to
→17.2.7, which describe the equivalence, intersection, union, and set difference of sets.
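
A minimal sketch of these four operations with base-R functions follows; the two example sets are hypothetical, and it is an assumption that these base functions are the commands meant above.

```r
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5)

is_equal     <- setequal(A, B)    # equivalence of A and B (Definition 17.2.4)
common       <- intersect(A, B)   # intersection of A and B (Definition 17.2.5)
combined     <- union(A, B)       # union of A and B (Definition 17.2.6)
only_in_A    <- setdiff(A, B)     # set difference A \ B (Definition 17.2.7)

print(list(is_equal = is_equal, intersection = common,
           union = combined, difference = only_in_A))
```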

Figure 17.2 Venn diagrams of two sets. Left: Intersection of A and B, A ∩ B . Right: Union of A and
B, A ∪ B .

Theorem 17.2.1.
For three given sets A, B, and C, the following relations hold:
1. Commutativity: A ∪ B = B ∪ A , and A ∩ B = B ∩ A .
2. Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C , and A ∩ (B ∩ C) = (A ∩ B) ∩ C .
3. Distributivity: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) , and
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) .

4. $(A^c)^c = A$ .

For the complement of a set, a bar over the symbol is frequently used instead of the
superscript “c”, i. e., $\bar{A} = A^c$ .

Definition 17.2.8.
Two sets $A_1$ and $A_2$ are called mutually exclusive if the following holds: $A_1 \cap A_2 = \emptyset$ .

If n sets $A_i$ with i ∈ {1, … , n} are mutually exclusive, then $A_i \cap A_j = \emptyset$ holds for
all i and j with i ≠ j.

Theorem 17.2.2 (De Morgan’s Laws).


For two given sets, A and B, the following relations hold:
$$\overline{(A \cup B)} = \bar{A} \cap \bar{B}, \tag{17.1}$$
$$\overline{(A \cap B)} = \bar{A} \cup \bar{B}. \tag{17.2}$$

From the above relationships, a negation of a union leads to an intersection, and vice versa.
Therefore, De Morgan’s Laws provide a means for interchanging a union and an intersection via
an application of a negation.
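
De Morgan’s Laws can be verified on a small example. In the following base-R sketch, the universe omega, the sets A and B, and the helper complement are hypothetical choices for illustration.

```r
omega <- 1:10                 # entire space (assumed example)
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)

# Complement with respect to omega
complement <- function(S) setdiff(omega, S)

# Equation (17.1): complement of the union equals intersection of the complements
law_1 <- setequal(complement(union(A, B)),
                  intersect(complement(A), complement(B)))

# Equation (17.2): complement of the intersection equals union of the complements
law_2 <- setequal(complement(intersect(A, B)),
                  union(complement(A), complement(B)))

print(c(law_1 = law_1, law_2 = law_2))
```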

17.3 Definition of probability


The definition of a probability is based on the following three axioms introduced by Kolmogorov
[→48]:

Axiom 17.3.1.
For every event A,
Pr (A) ≥ 0. (17.3)

Axiom 17.3.2.
For the sample space Ω,
Pr (Ω) = 1. (17.4)

Axiom 17.3.3.
For every infinite sequence of disjoint events $A_1, A_2, \dots$ ,
$$\Pr(A_1 \cup A_2 \cup \cdots) = \sum_{i=1}^{\infty} \Pr(A_i). \tag{17.5}$$

Definition 17.3.1.
We call Pr (A) a probability of event A if it fulfills all the three axioms above.
Such a probability is also called a probability measure on sample space Ω. For clarity, we repeat
that Ω contains all possible outcomes. There are different conventions to denote such a
probability, and frequent choices are “Pr” or “P”. In the following, we use the latter for
brevity.
These three axioms form the basis of probability theory, from which all other properties can
be derived.
From the definition of a probability and the three axioms above, a couple of useful
identities follow, including:
1. If A ⊂ B , then P (A) ≤ P (B) .
2. For every event A, 0 ≤ P (A) ≤ 1 .
3. For every event A, $P(A^c) = 1 - P(A)$ .
4. P (∅) = 0 .
5. For every finite set of disjoint events $\{A_1, \dots, A_k\}$ ,
$$P(A_1 \cup A_2 \cup \dots \cup A_k) = \sum_{i=1}^{k} P(A_i). \tag{17.6}$$

6. For two events A and B,


P (A ∪ B) = P (A) + P (B) − P (A ∩ B). (17.7)

Probabilities are called coherent if they obey the rules from the three axioms above. Examples for
the contrary will be given below.
We would like to note that the above definition of probability does not give a description
about how to quantify it. Classically, Laplace provided such a quantification for equiprobable
elementary outcomes, i. e., for $p(\omega_i) = 1/m$ for $\Omega = \{\omega_1, \dots, \omega_m\}$ .
In this case, the probability of an event A is given by the number of elements in A divided by the
total number of possible outcomes, i. e., p(A) = |A|/m . In practice, not all problems can be
captured by this approach, because usually the probabilities, $p(\omega_i)$ , are not all equal. For
this reason, a frequentist quantification or a Bayesian quantification of probability, which holds
for general probability values, is used [→91], [→161].
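
As a hedged illustration of Laplace’s quantification, the following base-R sketch computes p(A) = |A|/m for two fair dice; the experiment and the event (the sum equals 7) are assumptions chosen for the example.

```r
# Sample space of two fair dice: m = 36 equiprobable elementary outcomes
omega <- expand.grid(die1 = 1:6, die2 = 1:6)
m <- nrow(omega)

# Event A: the sum of both dice equals 7
in_A <- omega$die1 + omega$die2 == 7

p_A <- sum(in_A) / m    # |A| / m
print(p_A)
```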

17.4 Conditional probability

Definition 17.4.1 (Conditional probability).


For two events A and B with P (B) > 0 , the conditional probability of A, given B, is defined by
$$P(A|B) = \frac{P(A \cap B)}{P(B)}. \tag{17.8}$$

In the case P (B) = 0 , the conditional probability P (A|B) is not defined.

Definition 17.4.2 (Partition of the sample space).


Suppose that the events $\{A_1, \dots, A_k\}$ are disjoint, i. e., $A_i \cap A_j = \emptyset$ for all
i, j ∈ {1, … , k} with i ≠ j, and $\Omega = A_1 \cup \dots \cup A_k$ . Then, the sets
$\{A_1, \dots, A_k\}$ form a partition of the sample space Ω.


Theorem 17.4.1 (Law of total probability).


Suppose that the events $\{B_1, \dots, B_k\}$ are disjoint and form a partition of the sample space
Ω, and $P(B_i) > 0$ for all i. Then, for an event A ⊂ Ω ,
$$P(A) = \sum_{i=1}^{k} P(A|B_i) P(B_i). \tag{17.9}$$

Proof.
From the identity
A = A ∩ Ω (17.10)

we have
A = A ∩ (B1 ∪ ⋯ ∪ Bk ), (17.11)

since $\{B_1, \dots, B_k\}$ is a partition of Ω.
From the distributivity of the intersection over the union (Theorem →17.2.1), it follows that
$$A = (A \cap B_1) \cup \dots \cup (A \cap B_k). \tag{17.12}$$
Since every pair of terms in equation (→17.12) is disjoint, i. e.,
$(A \cap B_i) \cap (A \cap B_j) = \emptyset$ for i ≠ j, because of
$A \cap (B_i \cap B_j) = A \cap \emptyset = \emptyset$ , the probability of A can be deduced from
equation (→17.12) as follows:
\begin{align}
P(A) &= P((A \cap B_1) \cup \dots \cup (A \cap B_k)) \tag{17.13} \\
&= P(A \cap B_1) + \dots + P(A \cap B_k) \tag{17.14} \\
&= P(A|B_1)P(B_1) + \dots + P(A|B_k)P(B_k) \tag{17.15} \\
&= \sum_{i=1}^{k} P(A|B_i)P(B_i). \tag{17.16}
\end{align}
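
A small numerical illustration of equation (→17.9) in base R follows. All numbers are hypothetical: three machines B1, B2, B3 produce all items of a factory (a partition), and A is the event that an item is defective.

```r
# P(B_i): share of items produced by each machine (assumed values, sum to 1)
p_B <- c(B1 = 0.5, B2 = 0.3, B3 = 0.2)

# P(A | B_i): probability of a defect, given the machine (assumed values)
p_A_given_B <- c(B1 = 0.01, B2 = 0.02, B3 = 0.05)

# Law of total probability, equation (17.9)
p_A <- sum(p_A_given_B * p_B)
print(p_A)
```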

17.5 Conditional probability and independence


The definition of joint probability and conditional probability allows us to connect two or more
events. However, the question is, when are two events said to be independent? This is specified in
the next definition.

Definition 17.5.1.
Two events A and B are called independent, or statistically independent, if one of the following
conditions holds:
1. P (A ∩ B) = P (A)P (B)
2. P (A|B) = P (A) if P (B) > 0
3. P (B|A) = P (B) if P (A) > 0
Theorem 17.5.1.
If two events A and B are independent, then the following statements hold:
1. $\bar{A}$ and B are independent
2. A and $\bar{B}$ are independent
3. $\bar{A}$ and $\bar{B}$ are independent

The extension to more than two events deserves attention, because it requires independence
among all subsets of the events.
Definition 17.5.2.
The n events $A_1, A_2, \dots, A_n \in \mathcal{A}$ are called independent if the following
condition holds for all subsets I of {1, … , n} :
$$P\Big(\bigcap_{i \in I} A_i\Big) = \prod_{i \in I} P(A_i). \tag{17.17}$$
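
Independence of two events can be checked by comparing P(A ∩ B) with P(A)P(B). The following base-R sketch does this for two fair dice; the events A, B, and C and the helper prob are assumptions chosen for illustration.

```r
omega <- expand.grid(die1 = 1:6, die2 = 1:6)   # 36 equiprobable outcomes

# Laplace probability of an event given as a logical vector over omega
prob <- function(event) mean(event)

A <- omega$die1 %% 2 == 0              # first die shows an even number
B <- omega$die1 + omega$die2 == 7      # the sum equals 7
C <- omega$die1 + omega$die2 == 8      # the sum equals 8

A_B_independent <- isTRUE(all.equal(prob(A & B), prob(A) * prob(B)))
A_C_independent <- isTRUE(all.equal(prob(A & C), prob(A) * prob(C)))
print(c(A_B = A_B_independent, A_C = A_C_independent))
```

Here A and B are independent (1/12 = 1/2 · 1/6), whereas A and C are not (3/36 differs from 1/2 · 5/36).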

17.6 Random variables and their distribution function

Definition 17.6.1.
For a given sample space Ω, a random variable X is a function that assigns to each outcome a ∈ Ω a
real number, i. e., X(a) = x ∈ R with X : Ω → R . The codomain of the function X is
C = {x ∣ x = X(a), a ∈ Ω} ⊂ R .

In the above definition, we emphasized that a random variable is a function, assigning real
numbers to outcomes. For brevity this is mostly neglected when one speaks about random variables.
However, it should not be forgotten.
Furthermore, we want to note that the probability function has not been used explicitly in the
definition. However, it can be used to connect a random variable to the probability of an event.
For example, given a random variable X and a subset of its codomain S ⊂ C , we obtain
P (X ∈ S) = P ({a ∈ Ω ∣ X(a) ∈ S}), (17.18)

since {a ∈ Ω ∣ X(a) ∈ S} ⊂ Ω .
Similarly, for a single element, S = {x} , we obtain
P (X = x) = P ({a ∈ Ω ∣ X(a) = x}). (17.19)

In this way, the probability values for events are clearly defined.

Definition 17.6.2.
The cumulative distribution function of a random variable X is a function
$F_X : \mathbb{R} \to [0, 1]$ defined by
$$F_X(x) = P(X \le x). \tag{17.20}$$
In this definition, the right-hand side term is interpreted as in equations (→17.18) and (→17.19) by
$$P(X \le x) = P(\{a \in \Omega \mid X(a) \le x\}). \tag{17.21}$$

Frequently, a cumulative distribution function is just called a distribution function.
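
As a hedged illustration of equation (→17.20), the following base-R sketch evaluates the distribution function of a random variable that takes the values 0 and 1 with probability 1/2 each; the vectors values and probs and the function name F_X are assumptions.

```r
values <- c(0, 1)        # possible values of X (assumed example)
probs  <- c(0.5, 0.5)    # their probabilities

# F_X(x) = P(X <= x): add up the probabilities of all values not exceeding x
F_X <- function(x) sapply(x, function(t) sum(probs[values <= t]))

x_grid <- c(-1, 0, 0.5, 1, 2)
print(rbind(x = x_grid, F = F_X(x_grid)))
```

The result is a step function that jumps at the possible values of X and is constant in between.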

Example 17.6.1.
Suppose that we have a fair coin and define a random variable by X(H ) = 1 and X(T ) = 0 for
a probability space with Ω = {H , T } . We can find a piecewise definition of the corresponding
distribution function as follows:
$$F_X(x) = P(\{a \in \Omega \mid X(a) \le x\}) =
\begin{cases}
P(\emptyset) = 0 & \text{for } x < 0; \\
P(\{T\}) = 1/2 & \text{for } 0 \le x < 1; \\
P(\{T, H\}) = 1 & \text{for } x \ge 1.
\end{cases} \tag{17.22}$$

Figure 17.3 Distribution function $F_X(x)$ for equation (→17.22).

The circle at the end of the steps in →Fig. 17.3 means that the end points are not included, but
all points up to the end points themselves are. Mathematically, this corresponds to an open
interval boundary indicated by “)”, e. g., [0,1) for the second step in →Fig. 17.3.
Theorem 17.6.1.
The cumulative distribution function, F (x) , has the following properties:
1. $F(-\infty) = \lim_{x \to -\infty} F(x) = 0$ and $F(\infty) = \lim_{x \to \infty} F(x) = 1$ ;
2. $F(x+) = F(x)$ , i. e., F is continuous from the right;
3. F (x) is monotone and nondecreasing: $x_1 \le x_2 \Rightarrow F(x_1) \le F(x_2)$ ;
4. $P(X > x) = 1 - F(x)$ ;
5. $P(x_1 < X \le x_2) = F(x_2) - F(x_1)$ ;
6. $P(X = x) = F(x) - F(x-)$ ;
7. $P(x_1 \le X \le x_2) = F(x_2) - F(x_1-)$ .

17.7 Discrete and continuous distributions

From the connection between a random variable and its probability value, given by equation
(→17.18), we can now introduce the definition of discrete and continuous random variables as
well as their corresponding distributions.

Definition 17.7.1.
If a random variable, X, can only assume a finite number of different values, e. g.,
$x_1, \dots, x_n$ , then X is called a discrete random variable. Furthermore, the collection
$P(X = x_i)$ for all i ∈ {1, … , n} is called the discrete distribution of X.

Definition 17.7.2.
Let X be a discrete random variable. The probability function of X, denoted f (x) , is defined for
every real number, x, as follows:
f (x) = P (X = x). (17.23)

Given these two definitions and the properties of probability values, it can be shown that the
following conditions hold:
1. f (x) = 0 , if x is not a possible value of the random variable X;
2. $\sum_{i=1}^{n} f(x_i) = 1$ , if the $x_i$ are all the possible values for the random variable X.

Definition 17.7.3.
If a random variable, X, can assume an infinite number of values in an interval, e. g., between a
and b ∈ R , then X is called a continuous random variable. The probability of X being within an
interval [a, b] is given by the integral
$$P(a \le X \le b) = \int_a^b f(x)\,dx. \tag{17.24}$$

Here, the nonnegative function f (x) is called the probability density function of X.
It can be shown that
$$\int_{-\infty}^{\infty} f(x)\,dx = 1. \tag{17.25}$$
It is important to note that the probability for a single point $x_0 \in \mathbb{R}$ is zero, because
$$P(x_0 \le X \le x_0) = \int_{x_0}^{x_0} f(x)\,dx = 0. \tag{17.26}$$

In Section →17.12, we will discuss some important continuous distributions. However, here we
want to give an example for such a distribution.

17.7.1 Uniform distribution


The simplest continuous distribution is the uniform distribution. It has a constant density function
within the range [a, b] , with a, b ∈ R , and it is defined by
$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } x \in [a, b]; \\ 0 & \text{otherwise.} \end{cases} \tag{17.27}$$

The notation Unif ([a, b]) is often used to denote a uniform distribution in the interval [a, b] .
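
The uniform density can be explored with the base-R functions dunif and punif. In the following sketch, the interval limits a = 2 and b = 5 and the evaluation points are assumptions for illustration.

```r
a <- 2
b <- 5

# Density of Unif([a, b]), equation (17.27)
f <- function(x) dunif(x, min = a, max = b)

density_inside  <- f(3)                                     # 1 / (b - a)
density_outside <- f(6)                                     # 0 outside [a, b]
total_mass      <- integrate(f, lower = a, upper = b)$value # integral of f over [a, b]
p_3_to_4        <- punif(4, a, b) - punif(3, a, b)          # P(3 <= X <= 4)

print(c(density_inside, density_outside, total_mass, p_3_to_4))
```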

17.8 Expectation values and moments


In the previous sections, we discussed discrete and continuous distributions for random variables.
In principle, such distributions contain all information about a random variable X. Practically, there
are specific properties of such distributions that are of great importance, and these are related to
expectation values.

17.8.1 Expectation values


The following definition specifies what is meant by an expectation value of a random variable.

Definition 17.8.1.

The expectation value of a random variable X, denoted E[X] , is defined by
$$E[X] = \sum_i x_i f(x_i), \quad \text{for a discrete random variable } X, \tag{17.28}$$
$$E[X] = \int x f(x)\,dx, \quad \text{for a continuous random variable } X. \tag{17.29}$$

The expectation value of X is also called the mean of X.


A generalization of the above definition can be given, leading to the expectation value of a
function g(X) for a random variable X:
$$E[g(X)] = \sum_i g(x_i) f(x_i), \quad \text{for a discrete random variable } X, \tag{17.30}$$
$$E[g(X)] = \int g(x) f(x)\,dx, \quad \text{for a continuous random variable } X. \tag{17.31}$$
From the definition of the expectation value of a random variable, several important
properties follow that hold for discrete and continuous random variables.
Theorem 17.8.1.
Suppose that X and $X_1, \dots, X_n$ are random variables. Then the following results hold:
$$E[Y] = aE[X] + b, \tag{17.32}$$
for Y = aX + b , with a and b finite constants in R .
$$E[X_1 + \dots + X_n] = \sum_{i=1}^{n} E[X_i]. \tag{17.33}$$
If $X_1, \dots, X_n$ are independent random variables and $E[X_i]$ is finite for every i, then
$$E\Big[\prod_{i=1}^{n} X_i\Big] = \prod_{i=1}^{n} E[X_i]. \tag{17.34}$$

17.8.2 Variance
An important special case for an expectation value of a function is given by
$$g(X) = (X - \mu)^2 \tag{17.35}$$
with μ = E[X] . In this case, we write
$$\mathrm{Var}(X) = E[g(X)] = E[(X - \mu)^2]. \tag{17.36}$$
Due to the importance of this expression, it has its own name. It is called the variance of X. If the
mean of X, μ, is not finite, or if it does not exist, then Var(X) does not exist.
There is a related measure, called the standard deviation, which is just the square root of the
variance of X, denoted $\mathrm{sd}(X) = \sqrt{\mathrm{Var}(X)}$ . Frequently, the Greek symbol
$\sigma^2$ is used to denote the variance, i. e.,
$$\sigma^2(X) = \mathrm{Var}(X). \tag{17.37}$$
In this case, the standard deviation assumes the form $\mathrm{sd}(X) = \sqrt{\mathrm{Var}(X)} = \sigma$ .

The variance has the following properties:

1. For Y = a + bX : $\mathrm{Var}(Y) = b^2\, \mathrm{Var}(X)$ .
2. If $X_1, \dots, X_n$ are independent random variables:
$\mathrm{Var}(X_1 + \dots + X_n) = \mathrm{Var}(X_1) + \dots + \mathrm{Var}(X_n)$ .
3. If $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ with independent $X_i$ and
$\mathrm{Var}(X_i) = \mathrm{Var}(X)$ for all i: $\mathrm{Var}(\bar{X}) = \frac{\mathrm{Var}(X)}{n}$ .

Property (3) has important practical implications, because it says that the mean of a sample of
size n of independent random variables that all have the same variance has a variance that is
reduced by the factor 1/n . If we take the square root of
$\mathrm{Var}(\bar{X}) = \mathrm{Var}(X)/n$ , we get the standard deviation of $\bar{X}$ given by
$$SE = \mathrm{sd}(\bar{X}) = \frac{\mathrm{sd}(X)}{\sqrt{n}}. \tag{17.38}$$
This is another important quantity called the standard error ( SE ).


A frequent error observed in applications is the usage of sd(X) when the standard error, SE ,
should be used. For instance, if one performs a repeated analysis leading to ten error measures,
$E_i$ for i ∈ {1, … , 10} , e. g., when performing a 10-fold cross validation [→68], one is interested
in the standard error of $E_{tot} = \frac{1}{10} \sum_{i=1}^{10} E_i$ , and not in the standard
deviation of the individual errors $E_i$ .
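
The difference between the two quantities can be seen in a short base-R sketch. The ten error values below are hypothetical and only serve to illustrate equation (→17.38).

```r
# Hypothetical error measures from a 10-fold cross validation
E <- c(0.21, 0.18, 0.25, 0.19, 0.22, 0.20, 0.24, 0.17, 0.23, 0.21)

E_tot <- mean(E)                    # average error over the ten folds
sd_E  <- sd(E)                      # spread of the individual errors
SE    <- sd_E / sqrt(length(E))     # standard error of E_tot, equation (17.38)

print(c(E_tot = E_tot, sd = sd_E, SE = SE))
```

The standard error is smaller than the standard deviation by the factor 1/√10, reflecting that the mean of ten values is more stable than a single value.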

17.8.3 Moments
Along the same principle as for the definition of the variance of a random variable X, one can
define further expectation values.

Definition 17.8.2.

For a random variable X and a function
$$g(X) = X - \mu, \tag{17.39}$$
with μ = E[X] , the kth central moment of X, denoted $m'_k$ , is defined by
$$m'_k = E[g(X)^k] = E[(X - \mu)^k]. \tag{17.40}$$
For k = 2 , the central moment of X is just the variance of X. Analogously, one defines the kth
moment of a random variable.

Definition 17.8.3.

For a random variable X and a function
$$g(X) = X, \tag{17.41}$$
the kth moment of X, denoted $m_k$ , is defined by
$$m_k = E[g(X)^k] = E[X^k]. \tag{17.42}$$

17.8.4 Covariance and correlation


The covariance between two random variables X and Y is defined by

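$$\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]. \tag{17.43}$$

The covariance has the following important properties:

1. Symmetry: $\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$;
2. If $Y = a + bX$: $\operatorname{Cov}(X, Y) = b \operatorname{Var}(X)$;
3. $\operatorname{Cov}(X, Y) \le \sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}$.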

Definition 17.8.4.

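The linear correlation, often referred to as simply correlation, between two random variables X and Y is defined by

$$\operatorname{Cor}(X, Y) = \frac{E[(X - E[X])(Y - E[Y])]}{\sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}} \tag{17.44}$$

$$= \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}}. \tag{17.45}$$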

The linear correlation has the following properties:


1. It is normalized: $-1 \le \operatorname{Cor}(X, Y) \le 1$;
2. $\operatorname{Cov}(X, Y) \le \sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}$.

We call X and Y positively correlated if $\operatorname{Cor}(X, Y) > 0$, and negatively correlated if $\operatorname{Cor}(X, Y) < 0$. When $\operatorname{Cor}(X, Y) = 0$, X and Y are said to be linearly uncorrelated. Frequently, the correlation is denoted by the Greek letter ρ, i. e., $\rho(X, Y) = \operatorname{Cor}(X, Y)$.


The above correlation has been introduced by Karl Pearson. For this reason it is also called
Pearson’s correlation coefficient.
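The properties of the covariance and the correlation can be illustrated with the base R functions cov, var, and cor. The following is a small sketch on simulated data; the coefficients a = 3 and b = 2 and the noise level are arbitrary assumptions made for illustration.

```r
set.seed(2)                          # arbitrary seed
x <- rnorm(1000)
y <- 3 + 2 * x                       # exact linear relation Y = a + bX with b = 2
noisy <- y + rnorm(1000, sd = 2)     # same relation plus independent noise

cov_xy <- cov(x, y)                  # equals b * var(x) for Y = a + bX
cor_xy <- cor(x, y)                  # equals 1 for an exact linear relation with b > 0
cor_noisy <- cor(x, noisy)           # strictly between 0 and 1 because of the noise
# correlation computed by hand from Eq. (17.45)
manual <- cov(x, noisy) / sqrt(var(x) * var(noisy))
```

Here, cor uses Pearson's correlation coefficient by default, so manual and cor_noisy agree up to floating-point error.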

17.9 Bivariate distributions


Now we generalize the distribution of one random variable to the joint distribution of two
random variables.

Definition 17.9.1 (Discrete joint distributions).


The joint cumulative distribution function $F : \mathbb{R}^2 \to [0, 1]$ for the discrete random variables X and Y is given by

$$F(x, y) = P(X \le x \text{ and } Y \le y). \tag{17.46}$$

The corresponding joint probability function $f : \mathbb{R}^2 \to [0, 1]$ is given by

$$f(x, y) = P(X = x \text{ and } Y = y). \tag{17.47}$$

Theorem 17.9.1.
Let X and Y be two discrete random variables with joint probability function $f(x, y)$. If $(x_a, y_a)$ is not in the definition range of $(X, Y)$, then $f(x_a, y_a) = 0$. Furthermore

$$\sum_{\forall i} f(x_i, y_i) = 1, \tag{17.48}$$

and

$$P((X, Y) \in Z) = \sum_{(x, y) \in Z} f(x, y). \tag{17.49}$$

For evaluating such a discrete joint probability function, the corresponding probabilities can be presented in the form of a table. In →Table 17.1, we present an example of a discrete joint probability function $f(x, y)$ with $X \in \{x_1, x_2\}$ and $Y \in \{y_1, y_2, y_3\}$.

Table 17.1 An example of a discrete joint probability function f(x, y) with X ∈ {x1, x2} and Y ∈ {y1, y2, y3}.

              Y
              y1           y2           y3
X     x1      f(x1, y1)    f(x1, y2)    f(x1, y3)
      x2      f(x2, y1)    f(x2, y2)    f(x2, y3)
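A table like →Table 17.1 can be represented in R as a matrix. The sketch below uses hypothetical probability values, chosen only so that all entries sum to one, and illustrates Eq. (17.48) and Eq. (17.49) as well as the marginal distributions obtained by summing over rows or columns.

```r
# Hypothetical joint probabilities (not taken from the text)
f <- matrix(c(0.10, 0.20, 0.10,
              0.25, 0.15, 0.20),
            nrow = 2, byrow = TRUE,
            dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2", "y3")))

total <- sum(f)          # Eq. (17.48): all probabilities sum to 1
p_x <- rowSums(f)        # marginal distribution of X
p_y <- colSums(f)        # marginal distribution of Y
# Eq. (17.49) for the set Z = {(x1, y1), (x2, y3)}
p_Z <- f["x1", "y1"] + f["x2", "y3"]
```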

17.10 Multivariate distributions


For multivariate distributions, i. e., for $f(x_1, \ldots, x_n)$ with $n > 2$, the above definitions generalize naturally. However, the practical characterization of such distributions, e. g., in the form of tables like →Table 17.1, causes problems, because 3, 4, or 100-dimensional tables are not manageable. Fortunately, for random variables that have a dependency structure that can be represented by a directed acyclic graph (DAG), there is a simple representation.
By application of the chain rule, one can show that every such joint probability distribution factorizes in the following way:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \operatorname{pa}(X_i)). \tag{17.50}$$

Here, $\operatorname{pa}(X_i)$ denotes the “parents” of variable $X_i$. In →Figure 17.4 (left), we show an example for $n = 5$. The joint probability distribution $P(X_1, \ldots, X_5)$ factorizes in

$$P(X_1, \ldots, X_5) = p(X_1)\, p(X_2)\, p(X_3 \mid X_1)\, p(X_4 \mid X_1, X_2)\, p(X_5 \mid X_1, X_2). \tag{17.51}$$

Similarly, the joint probability distribution, for →Figure 17.4 (right), can be written as follows
P (X1 , … , X5 ) = p(X1 )p(X2 )p(X3 )p(X4 |X1 , X2 , X3 )p(X5 |X4 ). (17.52)

The advantage of such a factorization is that the numerical specification of the joint probability distribution is distributed over the terms $p(X_i \mid \operatorname{pa}(X_i))$. Importantly, each of these terms can be represented by a simple table, similar to →Table 17.1.


Figure 17.4 Examples of factorization of a joint probability distribution that can be represented by
a DAG.

The DAGs shown in →Figure 17.4, together with the factorizations of their joint probability distributions, are examples of so-called Bayesian networks [→114], [→149]. Bayesian networks are special examples of probabilistic models called graphical models [→116].

17.11 Important discrete distributions


In this section, we discuss some important distributions of discrete random variables that occur frequently in data science applications. For example, flipping a coin or rolling a die leads to discrete outcomes. In the case of a coin, we observe either a “head” or a “tail”. For a die, we observe faces with the numbers 1 to 6 on them. In general, a random variable, X, has a discrete distribution if the sample space of X is either finite or countably infinite. For convenience, in the following, we label these discrete values by integers, for instance, by defining “head” = 1 and “tail” = 0.

17.11.1 Bernoulli distribution


One of the simplest discrete distributions, and yet a very important one, is the Bernoulli distribution.
For this distribution, the sample space consists of only two outcomes {0,1}. The probabilities for
these events are defined by
P (X = 1) = p, (17.53)

P (X = 0) = 1 − p. (17.54)

As a short notation, we write X ∼ Bern(p) for a random variable, X, drawn from a Bernoulli
distribution with parameter p. Hence, the symbol ∼ means “is drawn from” or “is sampled from”.
The R-package Rlab provides the Bernoulli distribution. With the help of the command rbern, we
can draw 10 random variables from a distribution with p = 0.5 .
An alternative is to use the sample command. Here it is important to sample with replacement.

A simple example for a discrete random variable with a Bernoulli distribution is a coin toss.
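As a hedged illustration that avoids an extra package, the following sketch draws 10 Bernoulli samples with base R only. It relies on the fact that a Bernoulli distribution is a Binomial distribution with a single trial, and it also shows the alternative via sample with replacement; the seed is an arbitrary choice.

```r
set.seed(3)                              # arbitrary seed
p <- 0.5
# Bernoulli(p) is Binomial(size = 1, prob = p)
x1 <- rbinom(10, size = 1, prob = p)
# alternative: sample from {0, 1} with replacement
x2 <- sample(c(0, 1), size = 10, replace = TRUE, prob = c(1 - p, p))
```

Without replace = TRUE, sample could return each of the two values at most once, so drawing 10 values would fail.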

17.11.2 Binomial distribution


A Binomial distribution is based on Bernoulli distributed random variables. Suppose that we observe N independently drawn random variables $X_i \sim \operatorname{Bern}(p)$ with $i \in \{1, \ldots, N\}$ and $P(X_i = 1) = p$. Then the probability to observe n “1s” (e. g. heads) from the N tosses is given by

$$P(X = n) = \binom{N}{n} p^n (1 - p)^{N - n}. \tag{17.55}$$

As a short notation, we write $X \sim \operatorname{Binom}(N, p)$. For example, Binom(6, 0.2) is obtained in R as shown in Listing 17.4.
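The following sketch is not the listing referred to above, but a minimal illustration of how Binom(6, 0.2) can be evaluated with the base R functions dbinom and rbinom, together with a check against Eq. (17.55).

```r
N <- 6
p <- 0.2
n <- 0:N
pmf <- dbinom(n, size = N, prob = p)              # P(X = n) for n = 0, ..., N
manual <- choose(N, n) * p^n * (1 - p)^(N - n)    # Eq. (17.55) by hand
draws <- rbinom(5, size = N, prob = p)            # five random samples
```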

In →Figure 17.5, we visualize two Binomial distributions with different parameters. Each bar corresponds to $P(X = n)$ for a specific value of n.

Figure 17.5 Binomial distribution, Binom(N = 6, p = 0.3) (left) and Binom(N = 6, p = 0.1) (right).

For large N and values of p that are not too close to 0 or 1, the Binomial distribution can be approximated by a normal distribution (discussed in detail in Sec. →17.12.4). In this case, one can set the mean value to $\mu = Np$, and the standard deviation to $\sigma = \sqrt{Np(1 - p)}$ for the normal distribution. The advantage of such an approximation is that the normal distribution is computationally easier to handle than the Binomial distribution. As a rule of thumb, this approximation can be used if $Np(1 - p) > 9$. Alternatively, it can be used if $Np > 5$ (for $p \le 0.5$) or $N(1 - p) > 5$ (for $p > 0.5$).
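The quality of the normal approximation can be inspected numerically. The sketch below assumes N = 100 and p = 0.5, which satisfies the rule of thumb, and adds a continuity correction of 0.5; this correction is a common refinement and not part of the text above.

```r
N <- 100
p <- 0.5
rule_ok <- N * p * (1 - p) > 9                     # rule of thumb: 25 > 9
mu <- N * p                                        # mean of the normal approximation
sigma <- sqrt(N * p * (1 - p))                     # its standard deviation
exact <- pbinom(55, size = N, prob = p)            # exact P(X <= 55)
# normal approximation with continuity correction (an added assumption)
approx_normal <- pnorm(55 + 0.5, mean = mu, sd = sigma)
c(exact = exact, normal = approx_normal)
```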

To illustrate how figures such as →Figure 17.5 (right) can be generated, we provide below a listing, using ggplot, for producing such a figure.

In the following, we do not provide the scripts for the visualizations of similar figures, but only for
the values of the distributions. However, by following the example in Listing 17.5, such
visualizations can be generated easily.
Figure 17.6 Binomial distribution, pbinom(n, size=6, prob=0.6) (left) and qbinom(p, size=6, prob=0.6)
(right).

So far we have seen that R provides, for each available distribution, a function to sample random variables from this distribution and a function to obtain the corresponding probability density (for discrete distributions, the probability mass).
For the Binomial distribution, these functions are called rbinom and dbinom. For other distributions, the following naming pattern applies:
r’name-of-the-distribution’: draw random samples from the distribution;
d’name-of-the-distribution’: density of the distribution.
There are two more standard functions available that provide useful information about a distribution. The first one is the distribution function, also called cumulative distribution function, because it provides $P(X \le n)$, i. e., the probability up to a certain value of n, which is given by

$$P(X \le n) = \sum_{m=0}^{n} P(X = m). \tag{17.56}$$

The second function is the quantile function, which provides information about the value of n, for which $P(X \le n) = p$ holds. In R, the names of these functions follow the pattern:
p’name-of-the-distribution’: distribution function;
q’name-of-the-distribution’: quantile function.
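For the Binomial distribution, the four name patterns correspond to rbinom, dbinom, pbinom, and qbinom. The following sketch uses the parameters of →Figure 17.6 and checks Eq. (17.56). Note that for a discrete distribution the quantile function returns the smallest n with P(X ≤ n) ≥ p, since equality cannot be reached for every p.

```r
N <- 6
prob <- 0.6
cdf <- pbinom(0:N, size = N, prob = prob)                   # distribution function
cdf_manual <- cumsum(dbinom(0:N, size = N, prob = prob))    # Eq. (17.56)
q50 <- qbinom(0.5, size = N, prob = prob)                   # smallest n with P(X <= n) >= 0.5
```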

17.11.3 Geometric distribution


Suppose that we observe an infinite number of independent and identically distributed (iid) random variables $X_i \sim \operatorname{Bern}(p)$. Then, the probability that the first n consecutive observations are tails, followed by a head, is given by the geometric distribution defined by

$$P(X = n) = (1 - p)^n p. \tag{17.57}$$

For example, if we observe 0001… then the first $n = 3$ observations show consecutively tail, and the probability for this to happen is given by $P(X = 3) = (1 - p)^3 p$.

Using R, sampling from $X \sim \operatorname{Geom}(p = 0.4)$ is obtained as shown in Listing 17.6.
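As a minimal sketch (not the listing referred to above), the base R functions rgeom and dgeom use the same parametrization as Eq. (17.57), i. e., they count the number of failures before the first success.

```r
set.seed(4)                       # arbitrary seed
p <- 0.4
draws <- rgeom(10, prob = p)      # number of tails before the first head
p3 <- dgeom(3, prob = p)          # P(X = 3)
manual <- (1 - p)^3 * p           # Eq. (17.57) for n = 3
```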

17.11.4 Negative binomial distribution


Suppose that we observe an infinite number of independent and identically distributed random variables $X_i \sim \operatorname{Bern}(p)$. Then the probability to observe n tails before we observe r “heads” is given by the negative binomial distribution, defined by

$$P(X = n) = \binom{r + n - 1}{n} p^r (1 - p)^n. \tag{17.58}$$

For instance, sampling from $X \sim \operatorname{nbinom}(r = 6, p = 0.2)$ using R can be done as follows:
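A minimal sketch with the base R functions rnbinom and dnbinom is given below; their argument size corresponds to r in Eq. (17.58). The seed and the number of samples are arbitrary choices.

```r
set.seed(5)                                          # arbitrary seed
r <- 6
p <- 0.2
draws <- rnbinom(10, size = r, prob = p)             # tails observed before r heads
n <- 0:5
pmf <- dnbinom(n, size = r, prob = p)                # P(X = n)
manual <- choose(r + n - 1, n) * p^r * (1 - p)^n     # Eq. (17.58) by hand
```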

17.11.5 Poisson distribution


The Poisson distribution expresses the probability of a given number of independent events occurring in a fixed time interval. The Poisson distribution is defined by

$$P(X = n) = \frac{\lambda^n \exp(-\lambda)}{n!}. \tag{17.59}$$

For example, sampling from $X \sim \operatorname{pois}(\lambda = 3)$ using R can be done as follows:

→Figure 17.7 visualizes two Poisson distributions.


Figure 17.7 Poisson distribution, pois(λ = 3) (left) and pois(λ = 1) (right).

It is worth noting that the Poisson distribution can be obtained from a Binomial distribution
for N → ∞ and p → 0 , assuming that λ = N p remains constant. This means that for large N
and small p we can use the Poisson distribution with λ = N p to approximate a Binomial
distribution, because the former is easier to handle computationally. Two rules of thumb say that
this approximation is good if N ≥ 20 and p ≤ 0.05 , or if N ≥ 100 and N p ≤ 10 .
This approximation explains also why the Poisson distribution is used to describe rare events
that have a small probability to occur, e. g., radioactive decay of chemical elements. Other
examples of rare events include spelling errors on a book page, the number of visitors of a certain
website, or the number of infections due to a virus.
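The following sketch samples from a Poisson distribution with rpois and compares a Binomial distribution with its Poisson approximation. The values N = 1000 and p = 0.003 are illustrative assumptions that satisfy the second rule of thumb above.

```r
set.seed(6)                                       # arbitrary seed
draws <- rpois(10, lambda = 3)                    # ten samples from pois(lambda = 3)

N <- 1000
p <- 0.003
lambda <- N * p                                   # N >= 100 and N * p = 3 <= 10
n <- 0:10
pois_pmf <- dpois(n, lambda = lambda)
manual <- lambda^n * exp(-lambda) / factorial(n)  # Eq. (17.59) by hand
# largest pointwise difference between Binomial and Poisson probabilities
max_diff <- max(abs(dbinom(n, size = N, prob = p) - pois_pmf))
```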

17.12 Important continuous distributions


Similar to discrete distributions, there are also important continuous distributions, i. e. for
continuous random variables, which will be discussed in the following.

17.12.1 Exponential distribution


The probability density function of the exponential distribution is defined by

$$f(x) = \begin{cases} \lambda \exp(-\lambda x) & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases} \tag{17.60}$$

The parameter λ of the exponential distribution must be strictly positive, i. e., $\lambda > 0$.
Figure 17.8 Exponential distribution: dexp(rate = 1) (left) and pexp(rate = 1) (right).

17.12.2 Beta distribution


The probability density function of the Beta distribution is defined by

$$f(x) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad x \in\, ]0, 1[. \tag{17.61}$$

In the denominator of the definition of the Beta distribution appears the Beta function, which is defined by

$$B(\alpha, \beta) = \int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1}\, dx. \tag{17.62}$$

The parameters α and β in the Beta function must be strictly positive.

Figure 17.9 Beta distribution: dbeta(α = 2, β = 2) (left) and pbeta(α = 2, β = 2) (right).

17.12.3 Gamma distribution


The density function of the gamma distribution is defined by

$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha) \beta^{\alpha}}\, x^{\alpha - 1} \exp(-x/\beta) & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases} \tag{17.63}$$

The parameters α and β must be strictly positive. In the denominator of the density appears the gamma function, Γ, which is defined as follows:

$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} \exp(-t)\, dt. \tag{17.64}$$

Figure 17.10 Gamma distribution: dgamma(α = 2, β = 2) (left) and pgamma(α = 2, β = 2) (right).

17.12.4 Normal distribution


The normal distribution is the most important probability distribution in statistics, because it describes many natural phenomena well. The normal distribution is also known as the Gaussian distribution or the bell-shaped distribution.

17.12.4.1 One-dimensional normal distribution

The density function of the one-dimensional normal distribution is defined by

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad -\infty \le x \le \infty. \tag{17.65}$$

An important special case of the normal distribution is the standard normal distribution defined by

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \quad -\infty \le x \le \infty. \tag{17.66}$$

The standard normal distribution has a mean of 0, and a variance of 1.


Figure 17.11 One-dimensional normal distribution. Left: Different values of σ ∈ {0.5, 1, 3} for a
constant mean of μ = 0 . Right: Different values of μ ∈ {−1, 1, 3} for a constant standard
deviation of σ = 2 .

17.12.4.2 Two-dimensional normal distribution

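The density function of the normal distribution in $\mathbb{R}^2$ is defined by

$$f(x) = c \exp\left(-\frac{1}{2(1 - \rho^2)} \left[\frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} - 2\rho\, \frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1 \sigma_2}\right]\right), \quad x = (x_1, x_2) \in \mathbb{R}^2, \tag{17.67}$$

where ρ is the correlation between $X_1$ and $X_2$, and the factor c is given by

$$c = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}. \tag{17.68}$$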
Figure 17.12 Two-dimensional normal distribution. In addition, projections on the $x_1$- and $x_2$-axis are shown, presenting a perspective view.

A visualization of a two-dimensional normal distribution is shown in →Figure 17.12. This figure also shows projections on the $x_1$- and $x_2$-axis, resulting in one-dimensional projections. In contrast, →Figure 17.13 shows a contour plot of this distribution. Such a plot shows slices of the density that are parallel to the $x_1$–$x_2$ plane.
Figure 17.13 Two-dimensional normal distribution: heat map and contour plot.

17.12.4.3 Multivariate normal distribution

The density function of the multivariate normal distribution is defined by

$$f(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{(x - \mu) \Sigma^{-1} (x - \mu)^t}{2}\right), \quad x \in \mathbb{R}^n. \tag{17.69}$$

Here, $x \in \mathbb{R}^n$ is an n-dimensional random variable and the parameters of the density are its mean, $\mu \in \mathbb{R}^n$, and the $n \times n$ covariance matrix Σ. $|\Sigma|$ is the determinant of Σ. For $n = 2$, we obtain the two-dimensional normal distribution given in Eqn. →17.67.

17.12.5 Chi-square distribution


The density function of the chi-square distribution is defined by

$$f(x) = \frac{1}{2^{k/2}\, \Gamma(k/2)}\, x^{\frac{k}{2} - 1} \exp\left(-\frac{x}{2}\right), \quad 0 \le x < \infty. \tag{17.70}$$

It is worth noting that for k iid random variables $X_i \sim N(0, 1)$, $Y = \sum_{i=1}^{k} X_i^2$ follows a chi-square distribution with k degrees of freedom, $Y \sim \chi_k^2$.

Figure 17.14 Chi-square distribution. Left: Different values of the degree of freedom
k ∈ {2, 7, 20} . Right: Cumulative distribution function.

An example of the application of the Chi-square distribution is the sampling distribution for a
Chi-square test, which is a statistical hypothesis test that can be used to study the variance or the
distribution of data [→171].
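The connection between standard normal random variables and the chi-square distribution can be checked by simulation. The sketch below assumes k = 7 degrees of freedom; the seed, the number of repetitions, and the evaluation point 10 are arbitrary choices.

```r
set.seed(7)                                # arbitrary seed
k <- 7                                     # degrees of freedom
reps <- 20000                              # number of simulated values of Y
z <- matrix(rnorm(k * reps), nrow = reps)  # each row holds k iid N(0, 1) values
y <- rowSums(z^2)                          # Y = sum of k squared standard normals
mean_y <- mean(y)                          # a chi-square variable with k df has mean k
emp <- mean(y <= 10)                       # empirical P(Y <= 10)
theo <- pchisq(10, df = k)                 # theoretical P(Y <= 10)
```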

17.12.6 Student’s t-distribution


The density function of the t-distribution with ν degrees of freedom is defined by

$$f(x) = \frac{\Gamma((\nu + 1)/2)}{\sqrt{\nu\pi}\, \Gamma(\nu/2)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}, \quad -\infty \le x \le \infty. \tag{17.71}$$

Here ν can assume integer values.

If $Z \sim N(0, 1)$ and $Y \sim \chi_\nu^2$ (Chi-square distribution with ν degrees of freedom) are two independent random variables, then

$$X = \frac{Z}{\sqrt{Y/\nu}}, \tag{17.72}$$

follows a Student’s t-distribution, i. e., $X \sim t_\nu$.

The Student’s t-distribution is also used as a sampling distribution for hypothesis tests.
Specifically, it is used for a t-test that can be used to compare the mean value of one or two
populations, i. e., groups of measurements, each with a certain number of samples [→171].
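The construction in Eq. (17.72) can be illustrated by simulation. The following sketch assumes ν = 5 and compares the simulated values with the distribution function pt of R at the arbitrary point 1.

```r
set.seed(8)                      # arbitrary seed
nu <- 5                          # degrees of freedom
reps <- 20000
z <- rnorm(reps)                 # Z ~ N(0, 1)
y <- rchisq(reps, df = nu)       # Y ~ chi-square with nu degrees of freedom
x <- z / sqrt(y / nu)            # Eq. (17.72)
emp <- mean(x <= 1)              # empirical P(X <= 1)
theo <- pt(1, df = nu)           # P(X <= 1) for the t-distribution
```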

Figure 17.15 Student’s t-distribution. Left: Different values of the degree of freedom
k ∈ {2, 7, 20} . Right: QQnormal plot for t-distribution with k = 100 .

17.12.7 Log-normal distribution


The density function of the log-normal distribution is defined by

$$f(x) = \frac{1}{\sqrt{2\pi}\, \sigma x} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad 0 < x < \infty. \tag{17.73}$$
Figure 17.16 Log-normal distribution. Left: Constant μ = 0.0 and varying σ ∈ {1.25, 0.75, 0.25}
. Right: Constant σ = 0.75 and varying μ ∈ {3.0, 2.0, 1.0} .

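The log-normal distribution, shown in →Figure 17.16, has the following summary measures:

$$\text{mean:}\quad \exp\left(\mu + \frac{\sigma^2}{2}\right), \tag{17.74}$$

$$\text{variance:}\quad \exp(2\mu + \sigma^2)\left(\exp(\sigma^2) - 1\right), \tag{17.75}$$

$$\text{mode:}\quad \exp(\mu - \sigma^2). \tag{17.76}$$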

17.12.8 Weibull distribution


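The density function of the Weibull distribution is defined by

$$f(x) = \frac{\beta}{\lambda} \left(\frac{x}{\lambda}\right)^{\beta - 1} \exp\left(-\left(\frac{x}{\lambda}\right)^{\beta}\right), \quad 0 < x < \infty. \tag{17.77}$$

The Weibull distribution, shown in →Figure 17.17, has the following summary measures:

$$\text{mean:}\quad \lambda\, \Gamma(1 + 1/\beta), \tag{17.78}$$

$$\text{variance:}\quad \lambda^2 \left[\Gamma(1 + 2/\beta) - \left(\Gamma(1 + 1/\beta)\right)^2\right], \tag{17.79}$$

$$\text{mode:}\quad \lambda \left(\frac{\beta - 1}{\beta}\right)^{1/\beta}, \quad \beta > 1, \tag{17.80}$$

where Γ denotes the Gamma function.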


Figure 17.17 Weibull distribution. Left: Constant value of λ = 1.0 and varying
β ∈ {1.0, 2.0, 3.5} . Right: Constant value of β = 2.0 and varying λ ∈ {0.9, 2.0, 4.0} .

In biostatistics, the log-normal distribution and the Weibull distribution find their applications
in survival analysis [→112]. Specifically, these distributions are used as a parametric model for the
baseline hazard function of a Cox proportional hazard model, which can be used to model time-
to-event processes by considering covariates.

17.13 Bayes’ theorem


Bayes’ theorem provides a systematic way to calculate the inverse for a given conditional probability [→114]. For instance, suppose that the conditional probability $P(D|H)$ for two events D and H is given, but we are interested in $P(H|D)$, which can be viewed as the inverse conditional probability of $P(D|H)$. Bayes’ theorem provides a way to obtain it.
In its simplest form, Bayes’ theorem can be stated as follows:

$$P(H|D) = \frac{P(D|H)\, P(H)}{P(D)}. \tag{17.81}$$

Its proof follows directly from the definition of conditional probabilities and the commutativity of the intersection.
The terms in the above equation have the following names:
$P(H)$ is called the prior probability, or prior.
$P(D|H)$ is called the likelihood.
$P(D)$ is just a normalizing constant, sometimes called marginal likelihood.
$P(H|D)$ is called the posterior probability or posterior.

The letters denoting the above variables, i. e., D and H, are arbitrary, but by using D for “data” and H for “hypothesis”, one can interpret equation →17.81 as the change of the probability for a hypothesis (given by the prior) after considering new data about this hypothesis (given by the posterior).
Bayes’ theorem can be generalized to more variables.
Theorem 17.13.1 (Bayes’ theorem).
Let the events $B_1, \ldots, B_k$ be a partition of the space S such that $P(B_i) > 0$ for all $i \in \{1, \ldots, k\}$ and $P(A) > 0$. Then, for $i \in \{1, \ldots, k\}$, we have

$$P(B_i|A) = \frac{P(A|B_i)\, P(B_i)}{\sum_{j=1}^{k} P(A|B_j)\, P(B_j)}. \tag{17.82}$$

To understand the utility of Bayes’ theorem, let us consider the following example: Suppose that a medical test for a disease is performed on a patient, and this test has a reliability of 90 %. That means, if a patient has this disease, the test will be positive with a probability of 90 %. Furthermore, assume that if the patient does not have the disease, the test will be positive with a probability of 10 %. Let us assume that a patient tests positive for this disease. What is the probability that this patient has this disease? The answer to this question can be obtained using Bayes’ theorem.
In order to make the usage of Bayes’ theorem more intuitive, we adopt the formulation in equation (→17.82). Specifically, let us denote a positive test by $A = T^+$, a sick patient that has the disease (D) by $B_1 = D^+$, and a healthy patient that does not have the disease by $B_2 = D^-$. Then, equation (→17.82) becomes

$$P(D^+|T^+) = \frac{P(T^+|D^+)\, P(D^+)}{P(T^+|D^-)\, P(D^-) + P(T^+|D^+)\, P(D^+)}. \tag{17.83}$$

Note that $D^+$ and $D^-$ provide a partition of the sample space, because $P(D^+) + P(D^-) = 1$ (either the patient is sick or healthy). From the provided information about the medical test, see above, we can identify the following entities:

$$P(T^+|D^+) = 0.9, \tag{17.84}$$

$$P(T^+|D^-) = 0.1. \tag{17.85}$$

At this point, the following observation can be made: the knowledge about the medical test is not enough to calculate the probability $P(D^+|T^+)$, because we also need information about $P(D^+)$ and $P(D^-)$.

These probabilities correspond to the prevalence of the disease in the population and are independent of the characteristics of the performed medical test. Let us consider two different diseases: one is a common disease and one is a rare disease. For the common (c) disease, we assume $P_c(D^+) = 1/1000$, and for the rare (r) disease $P_r(D^+) = 1/1000000$. That means, for the common disease, one person in 1000 is, on average, sick, whereas, for the rare disease, only one person in 1000000 is sick. This gives us

$$\text{Common disease:}\quad P_c(D^+) = 1/10^3, \quad P_c(D^-) = 1 - 1/10^3, \tag{17.86}$$

$$\text{Rare disease:}\quad P_r(D^+) = 1/10^6, \quad P_r(D^-) = 1 - 1/10^6. \tag{17.87}$$

Using these numbers in equation (→17.83) yields

$$\text{Common disease:}\quad P_c(D^+|T^+) = 0.0089, \tag{17.88}$$

$$\text{Rare disease:}\quad P_r(D^+|T^+) = 9.00 \cdot 10^{-6}. \tag{17.89}$$

It is worth noting that although the used medical test has the exact same characteristics, given by $P(T^+|D^+)$ and $P(T^+|D^-)$ (see equation (→17.84) and (→17.85)), the resulting probabilities are different from each other. More precisely,

$$P_c(D^+|T^+) = 992.1 \cdot P_r(D^+|T^+), \tag{17.90}$$

which makes it almost 1000 times more likely to suffer from the common disease than the rare disease, if tested positive.
The above example demonstrates that the context, as provided by $P(D^+)$ and $P(D^-)$, is crucial in order to obtain a sensible result.
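The calculation for this example can be reproduced with a few lines of R. The function name posterior and its argument names are hypothetical choices made for this sketch; the numerical inputs are the ones given in the example.

```r
# Posterior P(D+ | T+) from Eq. (17.83); names are illustrative only
posterior <- function(prevalence, sens = 0.9, false_pos = 0.1) {
  # sens = P(T+ | D+), false_pos = P(T+ | D-), prevalence = P(D+)
  sens * prevalence / (sens * prevalence + false_pos * (1 - prevalence))
}

p_common <- posterior(1 / 1000)      # common disease, roughly 0.0089
p_rare <- posterior(1 / 1000000)     # rare disease, roughly 9.0e-06
ratio <- p_common / p_rare           # close to 1000
```

Because posterior is vectorized, calling it on a grid of prevalence values, e. g., seq(0, 1, length.out = 100), gives the full dependence of the posterior on the prevalence.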


Finally, in →Figure 17.18, we present some results for a repeated analysis of the above example, using different values for $P(D^+)$ from the full range of possible prevalence probabilities, i. e., from [0,1]. With the test characteristics given above, the probability to have the disease, if tested positive, stays below 5 % only for prevalence values $P(D^+)$ below roughly 0.6 %. Furthermore, we can see that the functional relation between $P(D^+)$ and $P(D^+|T^+)$ is strongly nonlinear. Such a functional behavior makes it difficult to make good guesses for the values of $P(D^+|T^+)$ without doing the underlying mathematics properly.
After this example, demonstrating the use of Bayes’ theorem, we will now provide the proof of the theorem.
Figure 17.18 $P(D^+|T^+)$ as a function of the prevalence probability $P(D^+)$ for a common disease. The horizontal line corresponds to 5 %.

Proof.
From the definition of a conditional probability for two events A and B,

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \tag{17.91}$$

follows the identity

$$P(A \cap B_i) = P(B_i|A)\, P(A) = P(A|B_i)\, P(B_i), \tag{17.92}$$

since $P(A \cap B_i) = P(B_i \cap A)$.

Rearranging equation (→17.92) leads to

$$P(B_i|A) = \frac{P(A|B_i)\, P(B_i)}{P(A)}. \tag{17.93}$$

Using the law of total probability and assuming that $\{B_1, \ldots, B_k\}$ is a partition of the sample space, we can write

$$P(A) = \sum_{j=1}^{k} P(A|B_j)\, P(B_j). \tag{17.94}$$

Substituting this in equation (→17.93) gives

$$P(B_i|A) = \frac{P(A|B_i)\, P(B_i)}{\sum_{j=1}^{k} P(A|B_j)\, P(B_j)}, \tag{17.95}$$

which is Bayes’ theorem. □


It is because of the simplicity of this “proof” that the Bayes’ theorem is sometimes also
referred to as the Bayes’ rule.

17.14 Information theory


Information theory is based on the application of probability theory concerned with the
quantification, storage, and communication of information. It builds upon the fundamental work
of Claude Shannon [→169]. In this chapter, we introduce the key concept of entropy and related
entities, e. g., conditional entropy, mutual information, and Kullback–Leibler divergence.

17.14.1 Entropy
Shannon defined the entropy for a discrete random variable X, assuming values in $\{X_1, \ldots, X_n\}$ with probability mass function $p_i = P(X_i)$, as follows:

Definition 17.14.1 (Entropy).

The entropy, $H(X)$, of a discrete random variable X is given by

$$H(X) = E[-\log(P(X))] = -\sum_{i=1}^{n} p_i \log(p_i). \tag{17.96}$$

Usually, the logarithm is base 2, because the entropy is expressed in bits (that means its unit is
a bit). However, sometimes, other bases are used, hence, attention to this is required.
The entropy is a measure of the uncertainty of a random variable. Specifically, it quantifies the
average amount of information needed to describe the random variable.

Properties of the entropy

The entropy has the following properties:


Positivity: $H(X) \ge 0$
Symmetry: Let Π be a permutation of the indices $1, \ldots, n$ of the probability mass function $P(X_i)$ such that $P(X_{\Pi(i)})$ is a new probability mass function for the discrete random variable $X'$ with $X'_i = X_{\Pi(i)}$. Then,

$$H(X) = H(X'). \tag{17.97}$$

Maximum: The maximum of the entropy is assumed for $P(X_i) = 1/n = \text{const.}$ $\forall i$, for $\{X_1, \ldots, X_n\}$.

The definition of the entropy can be extended to a continuous random variable, X, with probability density function $f(x)$ and $X \in D$ as follows:

$$H(X) = E[-\log(f(X))] = -\int_{x \in D} f(x) \log(f(x))\, dx. \tag{17.98}$$

In this case, the entropy is also called differential entropy.


In →Figure 17.19, we present an example of the entropy for a random variable X that can assume two values, i. e.,

$$X = \begin{cases} 0, & \text{with probability } 1 - p \\ 1, & \text{with probability } p \end{cases} \tag{17.99}$$

Clearly, the entropy is nonnegative for all values of p (it vanishes only for $p = 0$ and $p = 1$), and assumes its maximum for $p = 0.5$ with $H(p = 0.5) = 1$ bit. In order to plot the entropy, we used $n = 50$ different values for p obtained with the R command p <- seq(from=0, to=1, length.out=n).


Figure 17.19 Visualization of the entropy H (p) for different values of p. The vertical dashed line
(red) indicates the maximum of H (p) .
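The values underlying such a plot can be computed as follows. This is a minimal sketch; the helper name entropy_bits is a hypothetical choice, and the convention 0 · log(0) = 0 is assumed for the boundary values of p.

```r
# Entropy in bits of a discrete distribution given as a probability vector
entropy_bits <- function(prob) {
  # use the convention 0 * log2(0) = 0
  terms <- ifelse(prob > 0, -prob * log2(prob), 0)
  sum(terms)
}

n <- 50
p <- seq(from = 0, to = 1, length.out = n)
# entropy of the two-valued random variable in Eq. (17.99) for each p
H <- sapply(p, function(q) entropy_bits(c(q, 1 - q)))
H_half <- entropy_bits(c(0.5, 0.5))   # maximum: 1 bit
```

A call such as plot(p, H, type = "l") then gives a curve of the kind described in the figure caption.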

Similar to the joint probability and the conditional probability, there are also extensions of the
entropy along these lines.

Definition 17.14.2 (Joint entropy).

Let X and Y be two random variables assuming values in $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$. Furthermore, let $p_{ij} = P(X_i, Y_j)$ be their joint probability distribution. Then, the joint entropy of X and Y, denoted $H(X, Y)$, is given by

$$H(X, Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} \log(p_{ij}). \tag{17.100}$$

Definition 17.14.3 (Conditional entropy).

Let X and Y be two random variables assuming values in $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ with probability distribution $p_i = P(X_i)$ for X.

Furthermore, let $p_{ji} = P(Y_j|X_i)$ be their conditional probability distribution and $H(Y|X = x_i)$ the entropy of Y, conditioned on $X = x_i$. Then, the conditional entropy of Y given X, denoted $H(Y|X)$, is given by

$$H(Y|X) = \sum_{i=1}^{n} p_i\, H(Y|X = x_i) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_i\, p_{ji} \log(p_{ji}). \tag{17.101}$$

Properties of the conditional entropy

The joint and conditional entropy have the following properties:


Chain rule: H (Y |X) = H (X, Y ) − H (X) ;
Symmetry: H (X, Y ) = H (X) + H (Y |X) = H (Y ) + H (X|Y )

17.14.2 Kullback–Leibler divergence


The Kullback–Leibler divergence, also called relative entropy, is a measure of the distance
between two probability distributions [→39].

Definition 17.14.4 (Kullback–Leibler divergence).

Let X and Y be two random variables assuming values in $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ with the probability distributions $p_i = P(X_i)$ and $q_i = P(Y_i)$. Then the Kullback–Leibler divergence for X and Y, denoted $KL(P \parallel Q)$, is given by

$$KL(P \parallel Q) = \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{q_i}\right). \tag{17.102}$$

Properties of the Kullback–Leibler divergence

The Kullback–Leibler divergence has the following properties:


The Kullback–Leibler divergence KL(P ∥ Q) is nonsymmetric;
Gibbs’ inequality: KL(P ∥ Q) ≥ 0 ;
KL(P ∥ Q) = 0 if and only if both distributions are identical, i. e., P = Q .
→Figure 17.20 presents an example of the Kullback–Leibler divergence. On the left-hand side is depicted the probability distribution $p(x)$, which is a gamma distribution, and $q(x)$, which is a normal distribution. On the right-hand side is shown only the logarithm, $\log(p/q)$, of both distributions. The vertical dashed lines indicate the intersection points between both distributions. At these points the sign of the logarithm changes, as shown on the right-hand side, since $\log(x)$ is positive for $x > 1$ and negative for $0 < x < 1$.
Figure 17.20 An example for the Kullback–Leibler divergence. On the left-hand side, we show the probability distribution $p(x)$ (a gamma distribution) and $q(x)$ (a normal distribution). On the right-hand side, we show only the logarithm, $\log(p/q)$, of both distributions.
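For discrete distributions, the Kullback–Leibler divergence and its properties can be illustrated with a few lines of R. The helper name kl_bits and the two probability vectors are hypothetical choices; the logarithm to base 2 is assumed, so the result is measured in bits.

```r
# Kullback-Leibler divergence in bits for two discrete distributions
# defined on the same n values; terms with p_i = 0 contribute 0
kl_bits <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

p <- c(0.5, 0.3, 0.2)     # hypothetical distribution P
q <- c(0.4, 0.4, 0.2)     # hypothetical distribution Q
kl_pq <- kl_bits(p, q)    # KL(P || Q)
kl_qp <- kl_bits(q, p)    # KL(Q || P), in general a different value
kl_pp <- kl_bits(p, p)    # identical distributions give 0
```

The two values kl_pq and kl_qp differ, which illustrates that the divergence is nonsymmetric.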

17.14.3 Mutual information


Another measure, called mutual information, follows from the definition of the Kullback–Leibler divergence by the transformation $p(x) \to p(x, y)$ and $q(x) \to p(x)p(y)$. It measures the amount of information that one random variable, X, carries about another random variable Y.

Definition 17.14.5 (Mutual information).

Let X and Y be two random variables assuming values in $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ with the probability distributions $p_i = P(X_i)$ and $q_j = P(Y_j)$.

Furthermore, let $p_{ij} = P(X_i, Y_j)$ be their joint probability distribution. Then, the mutual information of X and Y, denoted $I(X, Y)$, is given by

$$I(X, Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} \log\left(\frac{p_{ij}}{p_i q_j}\right). \tag{17.103}$$

Properties of the mutual information

The mutual information has the following properties:


Symmetry: $I(X, Y) = I(Y, X)$
If X and Y are two independent random variables, $I(X, Y) = 0$
$I(X, Y) = H(X) + H(Y) - H(X, Y)$
$I(X, Y) = H(X, Y) - H(X|Y) - H(Y|X)$
$I(X, X) = H(X)$
$I(X, Y) = H(X) - H(X|Y)$

From the last relationship above follows a further property of the conditional entropy:

$$H(X|Y) \le H(X). \tag{17.104}$$
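The relationships between entropy, conditional entropy, and mutual information can be verified numerically for a small joint distribution. In the sketch below, the joint probability table pxy contains hypothetical values, and the helper entropy_bits is an illustrative name; logarithms are taken to base 2.

```r
# Entropy in bits; zero probabilities are dropped (0 * log2(0) = 0)
entropy_bits <- function(prob) {
  prob <- prob[prob > 0]
  -sum(prob * log2(prob))
}

# Hypothetical joint distribution of X (rows) and Y (columns)
pxy <- matrix(c(0.30, 0.10,
                0.15, 0.45), nrow = 2, byrow = TRUE)
px <- rowSums(pxy)                                    # marginal of X
py <- colSums(pxy)                                    # marginal of Y

mi_def <- sum(pxy * log2(pxy / outer(px, py)))        # Eq. (17.103)
mi_ent <- entropy_bits(px) + entropy_bits(py) - entropy_bits(pxy)
h_x_given_y <- entropy_bits(pxy) - entropy_bits(py)   # chain rule: H(X|Y)

# for independent variables the joint is the product of the marginals
indep <- outer(px, py)
mi_indep <- sum(indep * log2(indep / outer(rowSums(indep), colSums(indep))))
```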

In →Figure 17.21, we visualize the relationships between entropies and mutual information.
This graphical representation of the abstract relationships helps in summarizing these nontrivial
dependencies and in gaining an intuitive understanding.

Figure 17.21 Visualization of the nontrivial relationships between entropies and mutual
information.

In contrast with the correlation discussed in Section →17.8.4, mutual information measures
linear and nonlinear dependencies between X and Y. This extension makes this measure a popular
choice for practical applications. For instance, the mutual information has been used to estimate
the regulatory effects between genes [→44] to construct gene regulatory networks [→69], [→72],
[→139]. It has also been used to estimate finance networks representing the relationships
between stocks from, e. g., the New York stock exchange [→66] or investor trading networks [→7].

17.15 Law of large numbers


The law of large numbers is an important result, because it provides a systematic connection
between the sample mean from a distribution and its population mean. In other words, the law of
large numbers provides a theoretical foundation for using a finite sample from a distribution to
make a statement about the underlying (unknown) population mean. However, before we are in a
position to state and prove the law of large numbers, we need to introduce some inequalities
between probability values and expectation values.
Theorem 17.15.1.
For a given random variable X with $P(X \ge 0) = 1$ and every real $t \in \mathbb{R}$ with $t > 0$, the following inequality holds:

$$P(X \ge t) \le \frac{E[X]}{t}. \tag{17.105}$$

This inequality is called the Markov inequality.

Theorem 17.15.2.
For a given random variable X with finite $\operatorname{Var}(X)$ and every real $t \in \mathbb{R}$ with $t > 0$, the following inequality holds:

$$P(|X - E[X]| \ge t) \le \frac{\operatorname{Var}(X)}{t^2}. \tag{17.106}$$

This inequality is called the Chebyshev inequality.

Proof.
To prove the Chebyshev inequality, we set $Y = |X - E[X]|^2$. This guarantees $P(Y \ge 0) = 1$, because Y is nonnegative. Furthermore, $E[Y] = \operatorname{Var}(X)$ per definition of the variance. Now, application of the Markov inequality and setting $s = t^2$ gives

\begin{align}
P(|X - E[X]| \ge t) &= P(|X - E[X]|^2 \ge t^2), \tag{17.107} \\
&= P(Y \ge s) \le \frac{E[Y]}{s}, \tag{17.108} \\
&= \frac{\operatorname{Var}(X)}{s} = \frac{\operatorname{Var}(X)}{t^2}. \tag{17.109}
\end{align}


It is important to emphasize that the two above inequalities hold for every probability distribution
that satisfies the stated conditions. Despite this generality, they allow a specific statement
about the distance of a random sample from the mean of the distribution. For example, for
t = 4σ, we obtain

$$P(|X - E[X]| \ge 4\sigma) \le \frac{1}{16} \approx 0.063. \tag{17.110}$$

That means, for every distribution with finite variance, the probability that the distance between a
random sample X and E[X] is at least four standard deviations is at most 6.3 %.
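The bound can be checked empirically. The following is a minimal sketch, assuming an exponential distribution with rate 1 (mean 1, standard deviation 1) as the test case; the distribution, the sample size, and the variable names are illustrative choices and not part of the text above.

```r
# Empirical check of the Chebyshev bound P(|X - E[X]| >= k * sigma) <= 1 / k^2.
# Assumption: X ~ Exp(1), for which mu = 1 and sigma = 1.
set.seed(1)
x     <- rexp(1e5, rate = 1)
mu    <- 1
sigma <- 1
k     <- c(2, 3, 4)

# Relative frequency of samples at least k standard deviations from the mean
empirical <- sapply(k, function(kk) mean(abs(x - mu) >= kk * sigma))
bound     <- 1 / k^2

print(data.frame(k = k, empirical = empirical, chebyshev_bound = bound))
```

For a specific distribution the observed frequencies are typically far below the bound, which reflects that the Chebyshev inequality has to hold for all distributions with finite variance.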
At the beginning of this chapter, we stated briefly the result of the law of large numbers. Before
we formulate it formally, we have one last point that requires some clarification. This point relates
to the mean of a sample. Suppose that we have a random sample of size n, given by $X_1, \dots, X_n$,
and each $X_i$ is drawn from the same distribution with mean μ and variance $\sigma^2$. Furthermore,
each $X_i$ is drawn independently from the other samples. We call such samples independent and
identically distributed (iid) random variables. Then,


$$E[X_1] = \dots = E[X_n] = \mu, \tag{17.111}$$

and

$$\operatorname{Var}(X_1) = \dots = \operatorname{Var}(X_n) = \sigma^2. \tag{17.112}$$

The question of interest here is the following: what is the expectation value of the sample mean?
The sample mean of the sample $X_1, \dots, X_n$ is given by

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \tag{17.113}$$

Here, we emphasize the dependence on n by the subscript of the mean value. From this, we can
obtain the expectation value of $\bar{X}_n$ by applying the rules for the expectation values discussed in
Section →17.8, giving

$$E[\bar{X}_n] = E\Big[\frac{1}{n} \sum_{i=1}^{n} X_i\Big] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \mu. \tag{17.114}$$

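Similarly, we can obtain the variance of the sample mean, i. e., $\operatorname{Var}(\bar{X}_n)$, by

$$\begin{aligned}
\operatorname{Var}(\bar{X}_n) &= \operatorname{Var}\Big(\frac{1}{n} \sum_{i=1}^{n} X_i\Big) && \text{(17.115)} \\
&= \frac{1}{n^2} \operatorname{Var}\Big(\sum_{i=1}^{n} X_i\Big) && \text{(17.116)} \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) && \text{(17.117)} \\
&= \frac{1}{n^2}\, n \sigma^2 = \frac{\sigma^2}{n}, && \text{(17.118)}
\end{aligned}$$

where the step to (17.117) uses the independence of the $X_i$.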

These results are interesting, because they demonstrate that the expectation value of the sample
mean is identical to the mean of the distribution, but the variance of the sample mean is reduced
by a factor of 1/n compared to the variance of the distribution. Hence, the sampling distribution
of $\bar{X}_n$ becomes more and more peaked around μ with increasing values of n, and has a smaller
variance than the distribution of X for all n > 1.

Furthermore, application of the Chebyshev inequality for $X = \bar{X}_n$ gives

$$P(|\bar{X}_n - E[\bar{X}_n]| \ge t) \le \frac{\operatorname{Var}(\bar{X}_n)}{t^2}. \tag{17.119}$$

Using the previous results, we obtain

$$\Pr(|\bar{X}_n - \mu| \ge t) \le \frac{\sigma^2}{n t^2}. \tag{17.120}$$

This is a precise probabilistic relationship between the distance of the sample mean $\bar{X}_n$ from the
mean μ as a function of the sample size n. Hence, this relationship can be used to get an estimate
for the number of samples required in order for the sample mean to be “close” to the population
mean μ.

We are now in a position to finally present the result known as the law of large numbers,
which adds a further component to the above considerations for the sample mean. Specifically, so
far, we know that the expectation of the sample mean is the mean of the distribution (see
equation (→17.114)) and that the probability of a distance of at least t between $\bar{X}_n$ and μ
decreases systematically for increasing n (see equation (→17.120)). However, so far, we did not
assess the opposite behavior of equation (→17.120), namely what is $P(|\bar{X}_n - \mu| < t)$?

$$P(|\bar{X}_n - \mu| < t) = 1 - P(|\bar{X}_n - \mu| \ge t) \ge 1 - \frac{\sigma^2}{n t^2}. \tag{17.121}$$

Taking the limit n → ∞, the above equation yields

$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| < t) = 1. \tag{17.122}$$

This last expression is the result of the law of large numbers. That means, the law of large
numbers states that, in the limit n → ∞, the distance between $\bar{X}_n$ and μ stays with a
probability of 1 below any arbitrarily small value of t > 0.

Formally, in statistics there is a special symbol that is reserved for the type of convergence
presented in equation (→17.122), which is written as

$$\bar{X}_n \xrightarrow{p} \mu. \tag{17.123}$$

The “p” over the arrow means that the sample mean converges in probability to μ.

Theorem 17.15.3 (Law of large numbers).
Suppose that we have an iid sample of size n, $X_1, \dots, X_n$, where each $X_i$ is drawn from the same
distribution with mean μ and variance $\sigma^2$. Then, the sample mean $\bar{X}_n$ converges in probability to μ,

$$\bar{X}_n \xrightarrow{p} \mu. \tag{17.124}$$

17.16 Central limit theorem

In the previous section, we saw that the expectation value and the variance of the sample mean
are μ and $\sigma^2/n$, respectively, if the distribution from which the samples are drawn has a
mean of μ and a variance of $\sigma^2$. What we did not discuss, so far, is the distributional form of the
sample mean. This is the topic addressed by the central limit theorem.
Theorem 17.16.1 (Central limit theorem).
Let $X_1, \dots, X_n$ be an iid sample from a distribution with mean μ and variance $\sigma^2$. Then,

$$\lim_{n \to \infty} \Pr\Big(\frac{\bar{X}_n - \mu}{\sqrt{\sigma^2/n}} \le x\Big) = F(x). \tag{17.125}$$

Here, F is the cumulative distribution function of the standard normal distribution, and x is a fixed real
number.
To understand the importance of the central limit theorem, we would like to emphasize that
equation (→17.125) holds for a large sample from any distribution with finite variance, whether
discrete or continuous. In this case, $(\bar{X}_n - \mu)/(\sigma/n^{1/2})$ can be approximated by a standard
normal distribution. This implies that $\bar{X}_n$ can be approximated by a normal distribution with
mean μ and variance $\sigma^2/n$.

The central limit theorem is one of the reasons why the normal distribution plays such a
prominent role in statistics, machine learning, and data science. Even when individual random
variables do not come from a normal distribution (i. e., they are not sampled from a normal
distribution), their sum is approximately normally distributed for large n.
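Both statements about the sample mean, its variance $\sigma^2/n$ and its approximate normality, can be illustrated by simulation. The sketch below assumes samples from an exponential distribution with rate 1 (so μ = 1 and σ = 1); the sample size `n`, the number of repetitions `reps`, and the evaluation points `x_grid` are arbitrary illustrative choices.

```r
# Simulation sketch for the law of large numbers and the central limit theorem.
# Assumption: X_i ~ Exp(1), a clearly non-normal distribution with mu = sigma = 1.
set.seed(42)
n     <- 50      # size of each sample
reps  <- 20000   # number of simulated samples
mu    <- 1
sigma <- 1

# One sample mean per simulated sample
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

# Mean and variance of the sample mean versus mu and sigma^2 / n
print(c(mean_xbar = mean(xbar), var_xbar = var(xbar), sigma2_over_n = sigma^2 / n))

# Standardized sample mean versus the standard normal distribution function
z      <- (xbar - mu) / (sigma / sqrt(n))
x_grid <- c(-1, 0, 1)
emp    <- sapply(x_grid, function(x) mean(z <= x))
print(cbind(x = x_grid, empirical = emp, normal = pnorm(x_grid)))
```

Increasing `n` should bring the empirical distribution function closer to `pnorm`, while small `n` leaves a visible skew inherited from the exponential distribution.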

17.17 Concentration inequalities


In Section →17.15, we already discussed the Markov and Chebyshev inequalities, because they are
needed to prove the law of large numbers. In general, such inequalities, also called concentration
inequalities or probabilistic inequalities, play an important role in proving theorems about
random variables, since they provide bounds on the behavior of random variables and their
deviations, e. g., from expectation values. Aside from this, they also provide insights into the
laws of probability theory. For this reason, we present, in the following, some additional
concentration inequalities.

17.17.1 Hoeffding’s inequality


Theorem 17.17.1 (Hoeffding’s inequality).
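Let $X_1, \dots, X_n$ be some iid random variables with finite mean, $a_i \le X_i \le b_i$ ∀i, sample mean
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $\mu = E[\bar{X}]$. Then, for any ϵ > 0, the following inequalities hold:

$$P(\bar{X} - \mu \ge \epsilon) \le \exp\Big(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Big), \tag{17.126}$$

$$P(|\bar{X} - \mu| \ge \epsilon) \le 2 \exp\Big(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Big). \tag{17.127}$$

By setting $\epsilon' = n\epsilon$, one obtains inequalities for $S = \sum_{i=1}^{n} X_i$,

$$P(S - E[S] \ge \epsilon') \le \exp\Big(-\frac{2 \epsilon'^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Big), \tag{17.128}$$

$$P(|S - E[S]| \ge \epsilon') \le 2 \exp\Big(-\frac{2 \epsilon'^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Big). \tag{17.129}$$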

As an application of Hoeffding’s inequality, we consider the following example:

Example 17.17.1.

Suppose $X_1, \dots, X_n$ are independent and identically distributed random variables with
$X_i \sim \text{Bernoulli}(p)$, so that $a \le X_i \le b$ with a = 0 and b = 1, ∀i. Then, from Hoeffding’s
inequality we obtain the following inequality:

$$P(|\bar{X} - p| \ge \epsilon) \le 2 \exp(-2 n \epsilon^2). \tag{17.130}$$

Hoeffding’s inequality finds applications in statistical learning theory [→192].
Specifically, it can be used to estimate a bound for the difference between the in-sample error
$E_{\text{in}}$ and the out-of-sample error $E_{\text{out}}$. More generally, it is used for deriving learning
bounds for models [→138].
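The Bernoulli bound (→17.130) can be compared with simulated frequencies. This is a minimal sketch under assumed values n = 100, p = 0.3 and ϵ = 0.1, none of which come from the text; the number of repetitions `reps` is likewise an arbitrary choice.

```r
# Hoeffding bound for Bernoulli(p) samples versus a simulated frequency.
# Assumptions (illustrative only): n = 100, p = 0.3, eps = 0.1.
set.seed(123)
n    <- 100
p    <- 0.3
eps  <- 0.1
reps <- 20000

# Each binomial draw divided by n is the mean of n Bernoulli(p) variables
xbar <- rbinom(reps, size = n, prob = p) / n

empirical <- mean(abs(xbar - p) >= eps)   # simulated P(|Xbar - p| >= eps)
hoeffding <- 2 * exp(-2 * n * eps^2)      # right-hand side of (17.130)

print(c(empirical = empirical, hoeffding_bound = hoeffding))
```

The bound does not depend on p, which is why it can be loose for a particular p but remains valid for all of them.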

17.17.2 Cauchy–Schwarz inequality


Let us define the scalar product for two random variables X and Y by
$$X \cdot Y = E[XY]. \tag{17.131}$$

Then, we obtain the following probabilistic version of the Cauchy–Schwarz inequality for
expectation values
$$E[XY]^2 \le E[X^2]\, E[Y^2]. \tag{17.132}$$

Using the Cauchy–Schwarz inequality, we can show that the correlation between two linearly
dependent random variables X and Y has absolute value 1, i. e.,
$$|\rho(X, Y)| = 1 \quad \text{if } Y = aX + b \text{ with } a, b \in \mathbb{R},\ a \ne 0, \tag{17.133}$$
where ρ(X, Y) = 1 for a > 0 and ρ(X, Y) = −1 for a < 0.

17.17.3 Chernoff bounds


Chernoff bounds are typically tighter than the Markov and Chebyshev bounds, but
they require stronger assumptions [→137].
In a general form, Chernoff bounds are defined by
$$P(X \ge a) \le \frac{E[\exp(tX)]}{\exp(ta)} \quad \text{for } t > 0, \tag{17.134}$$

$$P(X \le a) \le \frac{E[\exp(tX)]}{\exp(ta)} \quad \text{for } t < 0. \tag{17.135}$$

Here, E[exp(tX)] is the moment-generating function of X. There are many different Chernoff
bounds for different probability distributions and different values of the parameter t. Here, we
provide a bound for Poisson trials, i. e., a sum of independent Bernoulli random variables, which are
allowed to have different expectation values, i. e., $P(X_i = 1) = p_i$.

Theorem 17.17.2.
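Let $X_1, \dots, X_n$ be independent Bernoulli random variables with $P(X_i = 1) = p_i$, and let
$\bar{X}_n = \sum_{i=1}^{n} X_i$ be a Poisson trial with $\mu = E[\bar{X}_n] = \sum_{i=1}^{n} p_i$. Then ∀δ ∈ (0, 1]

$$\Pr(\bar{X}_n \le (1 - \delta)\mu) < \Big(\frac{\exp(-\delta)}{(1 - \delta)^{(1-\delta)}}\Big)^{\mu}, \tag{17.136}$$

whereas for δ > 0

$$\Pr(\bar{X}_n \ge (1 + \delta)\mu) < \Big(\frac{\exp(\delta)}{(1 + \delta)^{(1+\delta)}}\Big)^{\mu}. \tag{17.137}$$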

Example 17.17.2.

As an example, we use this bound to estimate the probability of observing m = 40 or fewer heads
when tossing a fair coin n = 100 times. Here μ = 50, and from (1 − δ)μ = 40 it follows that
δ = 0.2. This gives the bound P(X ≤ m) < 0.34.
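The numbers in this example can be reproduced directly. The sketch below evaluates the lower-tail bound (→17.136) for the coin-tossing setting and, as an additional comparison not made in the text, the exact binomial tail probability from `pbinom`.

```r
# Chernoff lower-tail bound for n = 100 fair coin tosses and m = 40 heads.
n  <- 100
p  <- 0.5
m  <- 40
mu <- n * p                    # expected number of heads, 50

delta    <- 1 - m / mu         # solves (1 - delta) * mu = m, giving 0.2
chernoff <- (exp(-delta) / (1 - delta)^(1 - delta))^mu

# Exact P(X <= m) for X ~ Binomial(n, p), shown only for comparison
exact <- pbinom(m, size = n, prob = p)

print(c(delta = delta, chernoff_bound = chernoff, exact = exact))
```

The bound rounds to 0.34, while the exact tail probability is much smaller, so the Chernoff bound is valid but far from tight here.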

17.18 Further reading


For readers interested in advanced reading material about the topics of this chapter, we
recommend for probability theory [→17], [→18], [→48], [→87], [→105], [→147], [→150], Bayesian
analysis [→84], [→101], [→180], and for information theory [→39], [→83]. An excellent tutorial on
Bayesian analysis can be found in [→173], and a thorough introduction to information theory,
with focus on machine learning, is provided by [→123]. For developing a better and intuitive
understanding of the terms discussed in this chapter, we recommend the textbooks [→118],
[→145]. Finally, for a historical perspective on the development of probability, the book by [→91]
provides a good overview.

17.19 Summary
Probability theory plays a pivotal role when dealing with data, because essentially every
measurement contains errors. Hence, there is an accompanying uncertainty that needs to be
quantified probabilistically when dealing with data. In this sense, probability theory is an
important extension of deterministic mathematical fields, e. g., linear algebra, graph theory and
analysis, which cannot account for such uncertainties. Unfortunately, probabilistic methods are
usually more difficult to understand and require, for this reason, much more practice. However, once
mastered, they add considerably to the analysis and the understanding of real-world problems,
which is essential for any method in data science.

17.20 Exercises
1. In Section →17.11.2, we discussed that under certain conditions a Binomial
distribution can be approximated by a Poisson distribution. Show this result
numerically, using R. Use different approximation conditions and evaluate these.
How can this be quantified? Hint: See Section →17.14.2 about the Kullback–Leibler
divergence.
2. Calculate the mutual information for the discrete joint distribution P (X, Y ) given
in →Table 17.2.

Table 17.2 Numerical values of a discrete joint distribution P(X, Y) with
$X \in \{x_1, x_2\}$ and $Y \in \{y_1, y_2, y_3\}$.

              Y
         y1    y2    y3
X   x1   0.2   0.3   0.1
    x2   0.1   0.1   0.2

Table 17.3 Numerical values of a discrete joint distribution P(X, Y) with
$X \in \{x_1, x_2\}$ and $Y \in \{y_1, y_2, y_3\}$ in dependence on the parameter z.

              Y
         y1    y2        y3
X   x1   z     0.5 − z   0.1
    x2   0.1   0.1       0.2

3. Use R to calculate the mutual information for the discrete joint distribution
P (X, Y ) given in →Table 17.3, for z ∈ S = [0, 0.5) , and plot the mutual

information as a function of z. What happens for z values outside the interval S?


4. Apply Bayes’ theorem to doping tests in sports. Specifically, suppose that we
have a doping test that correctly identifies someone who is doping with a
probability of 99 %, i. e., P(+|doping) = 0.99, and has a false positive
probability of 1 %, i. e., P(+|no doping) = 0.01. Furthermore, assume that the
percentage of people who are doping is 1 %. What is the probability that someone
who tests positive is doping?
18 Optimization
Optimization problems arise whenever we try to select the best element from a set of
available alternatives. Frequently, this consists of finding the best parameters of a function with
respect to an optimization criterion. This is especially difficult if we have a high-dimensional
problem, meaning that there are many such parameters that must be optimized. Since most
models used in data science have many parameters, optimization (or
optimization theory) is necessary to devise these models.
In this chapter, we will introduce some techniques used to address unconstrained and
constrained, as well as deterministic and probabilistic optimization problems, including Newton’s
method, simulated annealing and the Lagrange multiplier method. We will discuss examples and
available packages in R that can be used to solve the aforementioned optimization problems.

18.1 Introduction
In general, an optimization problem is characterized by the following:
a set of alternative choices called decision variables;
a set of parameters called uncontrollable variables;
a set of requirements to be satisfied by both decision and uncontrollable variables, called
constraints;
some measure(s) of effectiveness expressed in terms of both decision and uncontrollable
variables, called objective-function(s).

Definition 18.1.1.
A set of decision variables that satisfy the constraints is called a solution to the problem.
The aim of an optimization problem is to find, among all solutions to the problem, a solution that
corresponds to either
the maximal value of the objective function, in which case the problem is referred to as a
maximization problem, e. g. maximizing the profit;
the minimal value of the objective-function, in which case the problem is referred to as a
minimization problem, e. g. minimizing the cost; or
a trade-off value of many and generally conflicting objective-functions, in which case the
problem is referred to as a multicriteria optimization problem.
Optimization problems are widespread in every activity, where numerical information is
processed, e. g. mathematics, physics, engineering, economics, systems biology, etc. For instance,
typical examples of optimization applications in systems biology include therapy treatment
planning and scheduling, probe design and selection, genomics analysis, etc.

18.2 Formulation of an optimization problem


Optimization problems are often formulated using mathematical models. Let
$x = (x_1, x_2, \dots, x_n)$ denote the decision variables; then a general formulation of an
optimization problem is written as follows:

$$\underset{x \in \mathbb{R}^n}{\text{Optimize}}\ f(x), \quad \text{subject to: } x \in S \subseteq \mathbb{R}^n, \tag{18.1}$$

i. e., the problem is to find the solution $x^* \in S$, if it exists, such that for all x ∈ S, we have
$f(x^*) \le f(x)$, if “Optimize” stands for “minimize”;
$f(x^*) \ge f(x)$, if “Optimize” stands for “maximize”.
The function f denotes the objective function or the cost-function, whereas S is the feasible set,
and any x ∈ S is called a feasible solution to the problem.

Definition 18.2.1.
A solution x̄ to the problem (→18.1) is called a local optimum if
f (x̄) ≤ f (x) for all x in a neighborhood of x̄ , for a minimization-type problem, or
f (x̄) ≥ f (x) for all x in a neighborhood of x̄ , for a maximization-type problem.

Definition 18.2.2.
A solution $x^*$ to the problem (→18.1) is called a global optimum if
$f(x^*) \le f(x)$ for all x ∈ S, for a minimization-type problem, or
$f(x^*) \ge f(x)$ for all x ∈ S, for a maximization-type problem.

If S = ∅, then the problem (→18.1) has no solution, otherwise,
1. if $f(x^*)$ is finite, then the problem (→18.1) has a finite optimal solution;
2. if $f(x^*) = -\infty$ (for a minimization-type problem) or $f(x^*) = \infty$ (for a
   maximization-type problem), then the problem (→18.1) is unbounded, i. e., the
   optimal value of the objective function is not a finite number and, therefore,
   cannot be achieved.
When $S = \mathbb{R}^n$, the optimal solution, $x^*$, can be any stationary point of f over $\mathbb{R}^n$, such that
$f(x^*) \le f(x)$ (respectively $f(x^*) \ge f(x)$) for a minimization-type problem (respectively for a
maximization-type problem) for all $x \in \mathbb{R}^n$. In this case, the problem (→18.1) is termed an
unconstrained optimization problem. On the other hand, if $S \subset \mathbb{R}^n$, then the problem (→18.1) is
called a constrained optimization problem, in which case, S is determined by a set of constraints.


Typically, one can distinguish three types of constraints:
Equality constraints: g(x) = c, where $g : \mathbb{R}^n \to \mathbb{R}$ and $c \in \mathbb{R}$, e. g.,

4x1 + 25x2 − 7x3 + ⋯ − 5xn = 59.

Inequality constraints: h(x) ≤ c or h(x) ≥ c, where $h : \mathbb{R}^n \to \mathbb{R}$, and $c \in \mathbb{R}$, e. g.,
2
4x1 + 25x2 − 7x3 + ⋯ − 5xn ≤ 59

4x1 + 25x2 − 7x3 + ⋯ − 5xn ≥ 59.

Integrality constraints: e. g., $x \in \mathbb{Z}^n$, $x \in \mathbb{N}^n$.

Remark 18.2.1.
If all the decision variables, $x_i$, in the problem (→18.1) take only discrete values (e. g. 0,1,2,…),
then the problem is called a discrete optimization problem, otherwise it is called a continuous
optimization problem. When there is a combination of discrete and continuous variables, the
problem is called a mixed optimization problem.

Remark 18.2.2.
Any minimization problem can be rewritten as a maximization problem, and vice versa, by
substituting the objective function f (x) with z(x) = −f (x) .
Therefore, from now on, we will focus exclusively on minimization-type optimization problems.

18.3 Unconstrained optimization problems


Unconstrained optimization problems arise in various practical applications, including data fitting,
engineering design, and process control. Techniques for solving unconstrained optimization
problems form the foundation of most methods used to solve constrained optimization problems.
These methods can be classified into two categories: gradient-based methods and derivative-free
methods.
Typically, the formulation of an unconstrained optimization problem can be written as follows:
$$\underset{x \in \mathbb{R}^n}{\text{Minimize}}\ f(x). \tag{18.2}$$

18.3.1 Gradient-based methods


Gradient-based algorithms for solving unconstrained optimization problems assume that the
function to be minimized in (→18.1) is twice continuously differentiable, and are based upon the
following conditions: a solution $x^* \in \mathbb{R}^n$ is said to be a local optimum of f if
1. the gradient of f at the point $x^*$, $\nabla f(x^*)$, is zero, i. e.,
   $\frac{\partial f(x^*)}{\partial x_j} = 0$, j = 1, …, n, and
2. the Hessian matrix of f at the point $x^*$, $\nabla^2 f(x^*)$, is positive definite, i. e.,
   $\eta^T \nabla^2 f(x^*)\, \eta > 0$ for all nonzero $\eta \in \mathbb{R}^n$.
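The two conditions can be checked numerically for a candidate point. The sketch below assumes the simple quadratic $f(x_1, x_2) = x_1^2 + x_2^2$ with the candidate $x^* = (0, 0)$ as an illustration; the names `f`, `gr` and `x_star` are hypothetical, and the Hessian is approximated with base R's `optimHess()` rather than derived by hand.

```r
# Numerical check of the first- and second-order optimality conditions.
# Assumption: illustrative quadratic objective with analytic gradient.
f  <- function(x) x[1]^2 + x[2]^2
gr <- function(x) c(2 * x[1], 2 * x[2])

x_star <- c(0, 0)                    # candidate solution

grad <- gr(x_star)                   # condition 1: gradient should vanish
H    <- optimHess(x_star, f, gr)     # finite-difference Hessian at x_star
eig  <- eigen(H, symmetric = TRUE)$values

print(grad)
print(eig)                           # condition 2: all eigenvalues positive
```

Checking the eigenvalues is equivalent to checking $\eta^T \nabla^2 f(x^*) \eta > 0$ for all nonzero η, because a symmetric matrix is positive definite exactly when all its eigenvalues are positive.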

The general principle of the gradient-based algorithms can be summarized by the following steps:

Step 1: Set k = 0, and choose an initial point $x^{(k)} = x^{(0)}$ and some convergence criteria;
Step 2: Test for convergence: if the conditions for convergence are satisfied, then we can
stop, and $x^{(k)}$ is the solution. Otherwise, go to Step 3;
Step 3: Computation of a search direction (also termed a descent direction): find a vector
$d_k \ne 0$ that defines a suitable direction that, if followed, will bring us as close as possible to
the solution, $x^*$;
Step 4: Computation of the step-size: find a scalar $\alpha_k > 0$ such that
$$f(x^{(k)} + \alpha_k d_k) < f(x^{(k)}).$$
Step 5: Updating the variables: set $x^{(k+1)} = x^{(k)} + \alpha_k d_k$, k = k + 1, and go to Step 2.
The main difference between the various gradient-based methods lies in the computation of the
descent direction (Step 3) and the computation of the step-size (Step 4).
In R, various gradient-based methods have been implemented either as stand-alone
packages or as part of a general-purpose optimization package.

18.3.1.1 The steepest descent method

The steepest descent method, also called the gradient descent method, uses the negative of the
gradient vector, at each point, as the search direction for each iteration; thus, steps 3 and 4 are
performed as follows:
Step 3: the descent direction is given by $d_k = -\frac{\nabla f(x^{(k)})}{\|\nabla f(x^{(k)})\|}$;
Step 4: the step-size is given by $\alpha_k = \operatorname{argmin}_{\alpha} f(x^{(k)} + \alpha d_k)$.
In R, an implementation of the steepest descent method can be found in the package pracma.
Let us consider the following problem:

$$\min_{(x_1, x_2) \in \mathbb{R}^2} f(x_1, x_2) = x_1^2 + x_2^2. \tag{18.3}$$

The contour plot of the function $f(x_1, x_2)$, depicted in →Figure 18.1 (left), is obtained using the
following script:
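A minimal sketch of a script of this kind is given here, using base R graphics only. The plotting range [−2, 2], the grid resolution, and the number of contour levels are assumptions, so the result need not match Figure 18.1 exactly.

```r
# Contour plot of f(x1, x2) = x1^2 + x2^2 with base R graphics.
# Assumptions: plotting range, grid size and number of levels are illustrative.
f  <- function(x1, x2) x1^2 + x2^2
x1 <- seq(-2, 2, length.out = 101)
x2 <- seq(-2, 2, length.out = 101)

# Evaluate f on the grid; z[i, j] = f(x1[i], x2[j])
z <- outer(x1, x2, f)

# Draw the plot only in an interactive session, so the script also runs in batch mode
if (interactive()) {
  contour(x1, x2, z, nlevels = 15,
          xlab = expression(x[1]), ylab = expression(x[2]))
}
```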
Using the steepest descent method, the problem (→18.3) can be solved in R as follows:
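The sketch below implements Steps 1 to 5 with the steepest descent choices for the direction and the step-size, written in base R rather than with the pracma package. The function name `steepest_descent`, the starting point (1, 1), the line-search interval [0, 10], and the tolerances are all assumptions made for illustration.

```r
# Steepest descent for f(x1, x2) = x1^2 + x2^2, base R only.
# Assumptions: start point, line-search interval and tolerances are illustrative.
f  <- function(x) x[1]^2 + x[2]^2
gr <- function(x) c(2 * x[1], 2 * x[2])

steepest_descent <- function(f, gr, x0, tol = 1e-8, maxiter = 100) {
  x <- x0
  for (k in seq_len(maxiter)) {
    g     <- gr(x)
    gnorm <- sqrt(sum(g^2))
    if (gnorm < tol) break                 # Step 2: convergence test on the gradient norm
    d <- -g / gnorm                        # Step 3: normalized negative gradient
    # Step 4: one-dimensional minimization of f along d on a bounded interval
    alpha <- optimize(function(a) f(x + a * d),
                      interval = c(0, 10), tol = 1e-10)$minimum
    x <- x + alpha * d                     # Step 5: update the iterate
  }
  list(par = x, value = f(x), iterations = k)
}

res <- steepest_descent(f, gr, x0 = c(1, 1))
print(res)
```

For this objective the negative gradient points straight at the minimizer, so the iterate lands near the origin after the first line search; less well-conditioned functions generally need many more iterations.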

Let us consider the following problem:

$$\max_{(x_1, x_2) \in \mathbb{R}^2} g(x_1, x_2) = e^{(x_1 - 2x_1^2 - x_2^2)} \sin\big(6(x_1 + x_2 + x_1 x_2^2)\big). \tag{18.4}$$

The contour plot of the function $g(x_1, x_2)$, depicted in →Figure 18.1 (right), is obtained using the
following script:

Figure 18.1 Left: contour plot of the function $f(x_1, x_2)$ in (→18.3) in the $(x_1, x_2)$ plane; right:
contour plot of the function $g(x_1, x_2)$ in (→18.4) in the $(x_1, x_2)$ plane.

Most of the optimization methods available in R, including the steepest descent, are implemented
for minimization problems. Since the solution that maximizes a function h(x) minimizes the
function −h(x) , we can solve the problem (→18.5) to find the solution to (→18.4), and then
multiply the value of the objective-function of (→18.5) by −1 to recover the value of the objective-
function of (→18.4).
$$\min_{x_1, x_2} -g(x_1, x_2) = -e^{(x_1 - 2x_1^2 - x_2^2)} \sin\big(6(x_1 + x_2 + x_1 x_2^2)\big). \tag{18.5}$$

Using the steepest descent method, implemented in R, the problem (→18.5) can be solved as
follows:
Note that the convergence and solution given by the steepest descent method depend on
both the form of the function to be minimized and the initial solution.

18.3.1.2 The conjugate gradient method

The conjugate gradient method is a modification to the steepest descent method, which takes
into account the history of the gradients to move more directly towards the optimum. The
computation of the descent direction (Step 3) and the step-size (Step 4) are performed as follows:
Step 3: the descent direction is given by
$$d_k = \begin{cases} -\nabla f(x^{(k)}), & k = 0, \\ -\nabla f(x^{(k)}) + \beta_k d_{k-1}, & k \ge 1, \end{cases}$$
where several types of formulas for $\beta_k$ have been proposed. The best-known formulas are
those proposed by Fletcher–Reeves (FR), Polak–Ribière–Polyak (PRP) and Hestenes–Stiefel
(HS), and they are defined as follows:
$$\beta_k^{FR} = \frac{\|\nabla f(x^{(k)})\|^2}{\|\nabla f(x^{(k-1)})\|^2}, \tag{18.6}$$

$$\beta_k^{PRP} = \frac{(\nabla f(x^{(k)}))^T y_{k-1}}{\|\nabla f(x^{(k-1)})\|^2}, \tag{18.7}$$

$$\beta_k^{HS} = \frac{(\nabla f(x^{(k)}))^T y_{k-1}}{d_{k-1}^T y_{k-1}}, \tag{18.8}$$

where ‖·‖ denotes the Euclidean norm, and $y_{k-1} = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})$.
Step 4: The step-size $\alpha_k$ is such that
$$f(x^{(k)}) - f(x^{(k)} + \alpha_k d_k) \ge -\delta \alpha_k (\nabla f(x^{(k)}))^T d_k, \tag{18.9}$$

$$|(\nabla f(x^{(k)} + \alpha_k d_k))^T d_k| \le -\sigma (\nabla f(x^{(k)}))^T d_k, \tag{18.10}$$

where 0 < δ < σ < 1.


In R, the implementation of the conjugate gradient method can be found in the general
multipurpose package optimx. This implementation of the conjugate gradient method can be used
to solve the problem (→18.3) as follows:
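As a stand-in that avoids the optimx dependency, the sketch below calls base R's `optim()` with `method = "CG"` on problem (→18.3). This is a swapped-in interface, not the optimx call the text refers to; the starting point (1, 1) and the choice `type = 1` (the Fletcher–Reeves update in `optim`) are assumptions.

```r
# Conjugate gradient on f(x1, x2) = x1^2 + x2^2 using base R's optim().
# Assumptions: start point (1, 1); Fletcher-Reeves update via control$type = 1.
f  <- function(x) x[1]^2 + x[2]^2
gr <- function(x) c(2 * x[1], 2 * x[2])

res_cg <- optim(par = c(1, 1), fn = f, gr = gr, method = "CG",
                control = list(type = 1))

print(res_cg$par)          # estimated minimizer
print(res_cg$value)        # objective value at the minimizer
print(res_cg$convergence)  # 0 indicates successful convergence
```

In `optim`, `type = 2` selects the Polak–Ribière update and `type = 3` the Beale–Sorenson update; the Hestenes–Stiefel formula (→18.8) is not among the built-in choices.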
Now, let us use the conjugate gradient method, implemented in the package optimx to solve
the problem (→18.4).
The solution to the problem (→18.4) with the initial solution $x^{(0)} = (x_1^{(0)}, x_2^{(0)}) = (1, 1)$ is
$\bar{x} = (1.5112, 2.016)$, and $f(\bar{x}) = -0.0008079518$, which is a local minimum. However, in contrast
with the steepest descent method, the conjugate gradient method converges with the initial
solution $x^{(0)} = (1, 1)$.

18.3.1.3 Newton’s method

In contrast with the steepest descent and conjugate gradient methods, which only use first-order
information, i. e., the first derivative (or the gradient) term, Newton’s method requires a
second-order derivative (or the Hessian) to estimate the descent direction. Steps 3 and 4 are
performed as follows:
Step 3: $d_k = -[\nabla^2 f(x^{(k)})]^{-1} \nabla f(x^{(k)})$ is the descent direction, where $\nabla^2 f(x)$ is the
Hessian of f at the point x.
Step 4: $\alpha_k = \operatorname{argmin}_{\alpha} f(x^{(k)} + \alpha d_k)$.
Since the computation of the Hessian matrix is generally expensive, several modifications of
Newton’s method have been suggested in order to improve its computational efficiency. One
variant of Newton’s method is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method, which
uses the gradient to iteratively approximate the inverse of the Hessian matrix
$H_k^{-1} = [\nabla^2 f(x^{(k)})]^{-1}$, as follows:
$$H_k^{-1} = \Big(I - \frac{s_k y_k^T}{s_k^T y_k}\Big) H_{k-1}^{-1} \Big(I - \frac{y_k s_k^T}{s_k^T y_k}\Big) + \frac{s_k s_k^T}{s_k^T y_k},$$
where $s_k = x^{(k)} - x^{(k-1)}$ and $y_k = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})$.

In R, the implementation of the BFGS variant of Newton’s method can be found in the general
multipurpose package optimx. This implementation of the BFGS method can be used to solve the
problem (→18.3), as follows:
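A minimal sketch using base R's `optim()` with `method = "BFGS"` is shown here as a stand-in for the optimx call; the starting point (1, 1) is an assumption, and the analytic gradient `gr` is supplied so that no finite-difference approximation is needed.

```r
# BFGS quasi-Newton method on f(x1, x2) = x1^2 + x2^2 using base R's optim().
# Assumption: start point (1, 1) chosen for illustration.
f  <- function(x) x[1]^2 + x[2]^2
gr <- function(x) c(2 * x[1], 2 * x[2])

res_bfgs <- optim(par = c(1, 1), fn = f, gr = gr, method = "BFGS")

print(res_bfgs$par)          # estimated minimizer
print(res_bfgs$value)        # objective value at the minimizer
print(res_bfgs$counts)       # number of function and gradient evaluations
```

If `gr` is omitted, `optim` falls back to a finite-difference gradient, which costs extra function evaluations and can reduce accuracy.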

Now, let us use the BFGS method, implemented in the package optimx, to solve the problem
(→18.4).
18.3.2 Derivative-free methods
Gradient-based methods rely upon information about at least the gradient of the objective-
function to estimate the direction of search and the step size. Therefore, if the derivative of the
function cannot be computed, because, for example, the objective-function is discontinuous,
these methods often fail. Furthermore, although these methods can perform well on functions
with only one extremum (unimodal functions), such as (→18.3), their efficiency in solving problems
with multimodal functions depends upon how far the initial solution is from the global minimum,
i. e., gradient-based methods are more or less efficient in finding the global minimum only if they
start from an initial solution sufficiently close to it. Therefore, the solution obtained using these
methods may be one of several local minima, and we often cannot be sure that the solution is the
global minimum. In this section, we will present some commonly used derivative-free methods,
which aim to reduce the limitations of the gradient-based methods by providing an alternative to
the computation of the derivatives of the objective-functions. These methods can be very efficient
in handling complex problems, where the functions are either discontinuous or improperly
defined.

18.3.2.1 The Nelder–Mead method

The Nelder–Mead method is an effective and computationally compact simplex algorithm for
finding a local minimum of a function of several variables. Hence, it can be used to solve
unconstrained optimization problems of the form:
$$\min_{x} f(x), \quad x \in \mathbb{R}^n. \tag{18.11}$$

Definition 18.3.1.

A simplex is an n-dimensional polytope that is the convex hull of n + 1 vertices.


The Nelder–Mead method iteratively generates a sequence of simplices to approximate an
optimal solution to the problem (→18.11). At each iteration, the n + 1 vertices of the simplex are
ranked such that
$$f(x_1) \le f(x_2) \le \dots \le f(x_{n+1}). \tag{18.12}$$

Thus, $x_1$ and $x_{n+1}$ correspond to the best and worst vertices, respectively.
At each iteration, the Nelder–Mead method consists of four possible operations: reflection,
expansion, contraction, and shrinking. Each of these operations has a scalar parameter associated
with it. Let us denote by α, β, γ, and δ the parameters associated with the aforementioned
operations, respectively. These parameters are chosen such that α > 0 , β > 1 , 0 < γ < 1 , and
0 < δ < 1.

Then, the Nelder–Mead simplex algorithm, as described in Lagarias et al. [→207], can be
summarized as follows:
Step 0: Generate a simplex with n + 1 vertices, and choose a convergence criterion;
Step 1: Sort the n + 1 vertices according to their objective-function values, i. e., so that
(→18.12) holds. Then, evaluate the centroid of the points in the simplex, excluding $x_{n+1}$,
given by: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$;
Step 2:
Calculate the reflection point $x_r = \bar{x} + \alpha(\bar{x} - x_{n+1})$;
If $f(x_1) \le f(x_r) < f(x_n)$, then perform a reflection by replacing $x_{n+1}$ with $x_r$;
Step 3:
If $f(x_r) < f(x_1)$, then calculate the expansion point $x_e = \bar{x} + \beta(x_r - \bar{x})$;
If $f(x_e) < f(x_r)$, then perform an expansion by replacing $x_{n+1}$ with $x_e$;
otherwise (i. e. $f(x_e) \ge f(x_r)$), then perform a reflection by replacing $x_{n+1}$ with $x_r$;
Step 4:
If $f(x_n) \le f(x_r) < f(x_{n+1})$, then calculate the outside contraction point
$x_{oc} = \bar{x} + \gamma(x_r - \bar{x})$;
If $f(x_{oc}) \le f(x_r)$, then perform an outside contraction by replacing $x_{n+1}$ with $x_{oc}$;
otherwise (i. e. if $f(x_{oc}) > f(x_r)$), then go to Step 6;
Step 5:
If $f(x_r) \ge f(x_{n+1})$, then calculate the inside contraction point
$x_{ic} = \bar{x} - \gamma(x_r - \bar{x})$;
If $f(x_{ic}) < f(x_{n+1})$, then perform an inside contraction by replacing $x_{n+1}$ with $x_{ic}$;
otherwise (i. e. if $f(x_{ic}) \ge f(x_{n+1})$), then go to Step 6;
Step 6: Perform a shrink by updating $x_i$, 2 ≤ i ≤ n + 1 as follows:
$$x_i = x_1 + \delta(x_i - x_1);$$
Step 7: Repeat Step 1 through Step 6 until convergence.


In R, an implementation of the Nelder–Mead method can be found in the general multipurpose
package optimx. The Nelder–Mead method in the package optimx, can be used to solve the
problem (→18.3) as follows:
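The sketch below uses base R's `optim()`, whose default method is Nelder–Mead, as a stand-in for the optimx call; the starting point (1, 1) is an assumption. Note that `optim` names its control parameters differently from the text: in `optim`, `alpha` is the reflection factor, `beta` the contraction factor, and `gamma` the expansion factor.

```r
# Nelder-Mead simplex search on f(x1, x2) = x1^2 + x2^2 using base R's optim().
# Assumption: start point (1, 1); no gradient is required by this method.
f <- function(x) x[1]^2 + x[2]^2

res_nm <- optim(par = c(1, 1), fn = f, method = "Nelder-Mead",
                control = list(alpha = 1,     # reflection factor
                               beta  = 0.5,   # contraction factor (optim naming)
                               gamma = 2))    # expansion factor (optim naming)

print(res_nm$par)          # estimated minimizer
print(res_nm$value)        # objective value at the minimizer
print(res_nm$counts)       # number of function evaluations
```

The values passed in `control` are the `optim` defaults, written out only to make the correspondence with the reflection, expansion, and contraction operations explicit.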

Now, let us use the Nelder–Mead method, implemented in the package optimx, to solve the
problem (→18.4).
18.3.2.2 Simulated annealing

The efficiency of the optimization methods, previously discussed, depends on the proximity of the
initial point, from which they started, to the optimum. Therefore, they cannot always guarantee a
global minimum, since they may be trapped in one of several local minima. Simulated annealing is
based on a neighborhood search strategy, derived from the physical analogy of cooling material
in a heat bath, which occasionally allows uphill moves.
Simulated annealing is based on the Metropolis algorithm [→133], which simulates the
change in energy within a system when subjected to the cooling process; eventually, the system
converges to a final “frozen” state of a certain energy.
Let us consider a system with a state described by an n-dimensional vector x, for which the
function to be minimized is f (x) . This is equivalent to an unconstrained minimization problem.
Let T, denoting the generalized temperature, be a scalar quantity, which has the same dimensions
as f. Then, the Metropolis algorithm description, for a nonatomic system, can be summarized as
follows:
Step 0:
Construct an initial solution $x_0$; set $x = x_0$;
Set the number of Monte Carlo steps $N_{MC} = 0$;
Set the temperature, T, to some high value, $T_0$.
Step 1: Choose a transition Δx at random.
Step 2: Evaluate Δf = f(x + Δx) − f(x).
Step 3:
If Δf ≤ 0, then accept the state by updating x as follows:
x ⟵ x + Δx.
Otherwise (i. e., Δf > 0) then,
– Generate a random number u ∈ [0, 1];
– If $u < e^{-\Delta f / T}$, then accept the state by updating x as follows:
x ⟵ x + Δx;
Step 4:
Update the temperature value as follows: $T ⟵ T - \varepsilon_T$, where $\varepsilon_T \ll T$ is a specified
positive real value.
Update the number of Monte Carlo steps: $N_{MC} ⟵ N_{MC} + 1$.
Step 5:
If T ≤ 0, then stop, and return x;
Otherwise (i. e. T > 0) then go to Step 1.
In R, an implementation of the simulated annealing method can be found in the package GenSA,
and it can be used to solve the problem (→18.3) as follows:
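The Metropolis steps listed above can be sketched in base R without calling GenSA. The function name `metropolis_anneal`, the Gaussian proposal with standard deviation `step_sd`, the initial temperature `T0`, and the decrement `eps_T` are assumptions; the sketch also keeps track of the best point visited, which is an addition to the steps in the text.

```r
# Simulated annealing with the Metropolis acceptance rule, base R only.
# Assumptions: Gaussian proposals, linear cooling T <- T - eps_T, start at (1, 1).
f <- function(x) x[1]^2 + x[2]^2

metropolis_anneal <- function(f, x0, T0 = 1, eps_T = 1e-3, step_sd = 0.1) {
  x <- x0;    fx <- f(x)            # Step 0: initial solution
  best <- x;  fbest <- fx           # best point seen so far (not in the original steps)
  Temp <- T0                        # "Temp" avoids masking R's T (alias for TRUE)
  n_mc <- 0
  while (Temp > 0) {                # Step 5: stop once the temperature reaches zero
    dx   <- rnorm(length(x), sd = step_sd)   # Step 1: random transition
    fnew <- f(x + dx)
    df   <- fnew - fx                        # Step 2: change in the objective
    # Step 3: always accept downhill moves, accept uphill moves with prob. exp(-df / Temp)
    if (df <= 0 || runif(1) < exp(-df / Temp)) {
      x <- x + dx; fx <- fnew
      if (fx < fbest) { best <- x; fbest <- fx }
    }
    Temp <- Temp - eps_T                     # Step 4: cooling
    n_mc <- n_mc + 1
  }
  list(par = best, value = fbest, steps = n_mc)
}

set.seed(2024)
res_sa <- metropolis_anneal(f, x0 = c(1, 1))
print(res_sa)
```

Because the search is random, results vary with the seed and the cooling parameters; returning the best visited point guarantees only that the reported value is no worse than the starting value.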

Now, let us use the simulated annealing method, implemented in the package GenSA, to solve
the problem (→18.4).
18.4 Constrained optimization problems
Constrained optimization problems describe most of the real-world optimization problems. Their
complexity depends on the properties of the functional relationships between the decision
variables in both the objective function and the constraints.

18.4.1 Constrained linear optimization problems


A linear optimization problem, also referred to as a linear programming problem, occurs when the
objective function f and the equality and inequality constraints are all linear. The general structure
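of such a problem can be written as follows:

$$\begin{array}{lll}
\text{Optimize} & f(x) = \sum_{j=1}^{n} c_j x_j & \\
\text{subject to} & \sum_{j=1}^{n} a_{ij} x_j \le b_i, & i \in I \subseteq \{1, \dots, m\}; \\
& \sum_{j=1}^{n} a_{kj} x_j \ge b_k, & k \in K \subseteq \{1, \dots, m\}; \\
& \sum_{j=1}^{n} a_{rj} x_j = b_r, & r \in R \subseteq \{1, \dots, m\}; \\
& l_j \le x_j \le u_j, & j = 1, \dots, n.
\end{array}$$

Optimize = Minimize or Maximize;
$x \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$ is a linear function;
I, K, and R are disjoint and I ∪ K ∪ R = {1, …, m};
$l_j, u_j \in \mathbb{R} \cup \{\pm\infty\}$;
The coefficients $c_j$, $a_{ij}$, $a_{kj}$, $a_{rj}$, $b_i$, $b_k$ and $b_r$ are given real constants.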
Linear constrained optimization problems can be solved using algorithms, such as the simplex
method or the interior point method.
In R, methods for solving linear optimization problems can be found in the package lpSolveAPI.
Let us consider the following constrained linear optimization problems:
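$$
(P_1)\quad
\begin{aligned}
\text{Maximize}\quad & f(x_1, x_2) = x_1 + 3x_2 \\
\text{subject to}\quad & x_1 + x_2 \le 14 \\
& -2x_1 + 3x_2 \le 12 \\
& 2x_1 - x_2 \le 12 \\
& x_1, x_2 \ge 0
\end{aligned}
$$

$$
(P_2)\quad
\begin{aligned}
\text{Minimize}\quad & f(x_1, x_2) = x_2 - x_1 \\
\text{subject to}\quad & 2x_1 - x_2 \ge -2 \\
& x_1 - x_2 \le 2 \\
& x_1 + x_2 \le 5 \\
& x_1, x_2 \ge 0
\end{aligned}
$$

$$
(P_3)\quad
\begin{aligned}
\text{Maximize}\quad & f(x_1, x_2) = 5x_1 + 7x_2 \\
\text{subject to}\quad & x_1 + x_2 \ge 6 \\
& x_1 \ge 4 \\
& x_2 \le 3 \\
& x_1, x_2 \ge 0
\end{aligned}
$$

$$
(P_4)\quad
\begin{aligned}
\text{Minimize}\quad & f(x_1, x_2) = x_2 - x_1 \\
\text{subject to}\quad & 2x_1 - x_2 \ge -2 \\
& x_1 - 2x_2 \le -8 \\
& x_1 + x_2 \le 5 \\
& x_1, x_2 \ge 0
\end{aligned}
$$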
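For two-variable problems such as these, a package-free cross-check is possible, because a linear program with nonnegative variables that has a finite optimum attains it at a vertex of the feasible region. The following base R sketch enumerates the vertices by intersecting pairs of constraint boundaries. The helper name `lp_vertices` and its tolerance are assumptions introduced here; the approach is only practical for very small problems and cannot detect unboundedness.

```r
# Vertices of the polygon {x in R^2 : A x <= b}: intersect every pair of
# constraint boundaries and keep the intersection points that are feasible.
lp_vertices <- function(A, b, tol = 1e-9) {
  pairs <- combn(nrow(A), 2)
  verts <- list()
  for (k in seq_len(ncol(pairs))) {
    rows <- pairs[, k]
    M <- A[rows, , drop = FALSE]
    if (abs(det(M)) < tol) next                 # parallel boundaries: skip
    v <- solve(M, b[rows])
    if (all(A %*% v <= b + tol)) verts[[length(verts) + 1]] <- v
  }
  if (length(verts) == 0) return(matrix(numeric(0), ncol = 2))
  unique(round(do.call(rbind, verts), 8))
}

# (P1) in "<=" form; the last two rows encode x1 >= 0 and x2 >= 0
A1 <- rbind(c(1, 1), c(-2, 3), c(2, -1), c(-1, 0), c(0, -1))
b1 <- c(14, 12, 12, 0, 0)
V1 <- lp_vertices(A1, b1)
best <- V1[which.max(V1 %*% c(1, 3)), ]   # maximize x1 + 3 * x2 over the vertices
best                                      # vertex (6, 8), objective value 30

# (P4): rows of type ">=" are multiplied by -1 to obtain "<=" rows
A4 <- rbind(c(-2, 1), c(1, -2), c(1, 1), c(-1, 0), c(0, -1))
b4 <- c(2, -8, 5, 0, 0)
n_vertices_p4 <- nrow(lp_vertices(A4, b4))
n_vertices_p4                             # 0: no feasible vertex, so (P4) is infeasible
```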

Since most of the optimization methods available in R are implemented for minimization-type problems, and the solution that maximizes a function $f(x)$ also minimizes the function $-f(x)$, it is necessary to multiply the objective functions of the problems $(P_1)$ and $(P_3)$ by $-1$ and solve the corresponding minimization problems. Afterwards, we multiply the resulting optimal values by $-1$ to recover the values of the objective functions of $(P_1)$ and $(P_3)$.

The problem $(P_1)$ can be solved using the lpSolveAPI package as follows:
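A minimal sketch of such a call is given below; it is an illustration written for this passage, not the book's original listing. It follows the sign convention described above (minimize $-f$, then negate the optimal value). The object names `lp1`, `status1`, `x_p1`, and `f_p1` are chosen here, and the `requireNamespace()` guard only keeps the sketch runnable where the package is not installed.

```r
# Guard: run the sketch only where lpSolveAPI is installed
if (requireNamespace("lpSolveAPI", quietly = TRUE)) {
  library(lpSolveAPI)

  lp1 <- make.lp(nrow = 0, ncol = 2)          # empty model with 2 decision variables
  set.objfn(lp1, c(-1, -3))                   # minimize -f, i.e. maximize f
  add.constraint(lp1, c(1, 1), "<=", 14)
  add.constraint(lp1, c(-2, 3), "<=", 12)
  add.constraint(lp1, c(2, -1), "<=", 12)
  set.bounds(lp1, lower = c(0, 0))            # x1, x2 >= 0 (also the default)

  status1 <- solve(lp1)                       # 0 signals that an optimum was found
  x_p1 <- get.variables(lp1)
  f_p1 <- -get.objective(lp1)                 # undo the sign change
  print(list(status = status1, x = x_p1, f = f_p1))
}
```

Working the problem by hand gives the vertex $(x_1, x_2) = (6, 8)$ with $f = 30$, which is what the solver is expected to report.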
The problem $(P_2)$ can be solved using the lpSolveAPI package as follows:

The problem $(P_3)$ can be solved using the lpSolveAPI package as follows:


The problem $(P_4)$ can be solved using the lpSolveAPI package as follows:

Suppose that, in the problem $(P_1)$, $x_1$ is a binary variable (i. e., it takes only the value 0 or 1) and $x_2$ is an integer variable. Then it is necessary to set them to the appropriate type before solving the problem $(P_1)$. This can be done as follows:
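A minimal sketch of this step, again written for illustration rather than taken from the original listing, uses `set.type()` from lpSolveAPI. The model is rebuilt from scratch so that the block is self-contained; the object names are assumptions, and the guard merely skips the block where the package is unavailable.

```r
if (requireNamespace("lpSolveAPI", quietly = TRUE)) {
  library(lpSolveAPI)

  mip1 <- make.lp(nrow = 0, ncol = 2)
  set.objfn(mip1, c(-1, -3))                      # minimize -f, i.e. maximize f
  add.constraint(mip1, c(1, 1), "<=", 14)
  add.constraint(mip1, c(-2, 3), "<=", 12)
  add.constraint(mip1, c(2, -1), "<=", 12)

  set.type(mip1, columns = 1, type = "binary")    # x1 in {0, 1}
  set.type(mip1, columns = 2, type = "integer")   # x2 integer, x2 >= 0 by default

  status_mip <- solve(mip1)
  x_mip <- get.variables(mip1)
  f_mip <- -get.objective(mip1)                   # undo the sign change
  print(list(status = status_mip, x = x_mip, f = f_mip))
}
```

With $x_1 \in \{0, 1\}$, the second constraint limits $x_2$ to at most 4, so the expected optimum is $(x_1, x_2) = (1, 4)$ with $f = 13$.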

18.4.2 Constrained nonlinear optimization problems


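In its general form, a nonlinear constrained optimization problem, with n variables and m constraints, can be written as follows:
$$
\begin{aligned}
\text{Optimize}\quad & f(x) \\
\text{subject to}\quad & g_i(x) = b_i, \quad i \in I \subseteq \{1, \ldots, m\}; \\
& h_j(x) \le d_j, \quad j \in J \subseteq \{1, \ldots, m\}; \\
& h_k(x) \ge d_k, \quad k \in K \subseteq \{1, \ldots, m\}; \\
& x \le u; \\
& x \ge l,
\end{aligned}
\tag{18.13}
$$
where
Optimize = Minimize or Maximize;
$f : \mathbb{R}^n \to \mathbb{R}$, $g_i : \mathbb{R}^n \to \mathbb{R}$, $\forall\, i \in I$, $h_r : \mathbb{R}^n \to \mathbb{R}$, $\forall\, r \in J \cup K$, with at least one of these functions being nonlinear;
$I$, $J$, and $K$ are disjoint and $I \cup J \cup K = \{1, \ldots, m\}$;
$b_i, d_j, d_k \in \mathbb{R} \cup \{\pm\infty\}$, $\forall\, i, j, k$;
$l, u \in (\mathbb{R} \cup \{\pm\infty\})^n$.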

The solution to constrained nonlinear optimization problems, in the form of (→18.13), can be
obtained using the Lagrange multiplier method.

18.4.3 Lagrange multiplier method


Without loss of generality, assume that the problem (→18.13) is a minimization problem, and let us multiply the inequality constraints of type “≥” by −1; then the problem can be rewritten as:
$$
\begin{aligned}
\text{Minimize}\quad & z = f(x) \\
\text{subject to}\quad & g_i(x) = b_i, \quad i = 1, \ldots, p, \text{ with } p \le m; \\
& h_j(x) \le d_j, \quad j = 1, \ldots, m - p.
\end{aligned}
\tag{18.14}
$$

The Lagrangian of the problem (→18.14), denoted L, is defined as follows:
$$
L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{p} \lambda_i \left(g_i(x) - b_i\right) + \sum_{j=1}^{m-p} \mu_j \left(h_j(x) - d_j\right),
\tag{18.15}
$$
where $\lambda_i$ and $\mu_j$ are the Lagrangian multipliers associated with the constraints $g_i(x) = b_i$ and $h_j(x) \le d_j$, respectively.

The fundamental result behind the Lagrangian formulation (→18.15) can be summarized as follows: suppose that a solution $x^* = (x_1^*, x_2^*, \ldots, x_n^*)$ minimizes the function $f(x)$ subject to the constraints $g_i(x) = b_i$, for $i = 1, \ldots, p$, and $h_j(x) \le d_j$, for $j = 1, \ldots, m - p$. Then we have one of the following:
1. Either there exist vectors $\lambda^* = (\lambda_1^*, \ldots, \lambda_p^*)$ and $\mu^* = (\mu_1^*, \ldots, \mu_{m-p}^*)$ such that
$$
\nabla f(x^*) + \sum_{i=1}^{p} \lambda_i^* \nabla g_i(x^*) + \sum_{j=1}^{m-p} \mu_j^* \nabla h_j(x^*) = 0;
\tag{18.16}
$$
$$
\mu_j^* \left(h_j(x^*) - d_j\right) = 0, \quad j = 1, \ldots, m - p;
\tag{18.17}
$$
$$
\mu_j^* \ge 0, \quad j = 1, \ldots, m - p;
\tag{18.18}
$$
2. Or the vectors $\nabla g_i(x^*)$, for $i = 1, \ldots, p$, and $\nabla h_j(x^*)$, for $j = 1, \ldots, m - p$, are linearly dependent.
The result of greatest interest is the first one, i. e., case 1. From the equation (→18.17), either $\mu_j^*$ is zero or $h_j(x^*) - d_j = 0$. This provides various possible solutions, and the optimal solution is one of these. For an optimal solution, $x^*$, some of the inequality constraints will be satisfied at equality, and others will not. The latter can be ignored, whereas the former are exactly those for which $h_j(x^*) - d_j = 0$ holds in (→18.17). Thus, the constraints $\mu_j^* \left(h_j(x^*) - d_j\right) = 0$ mean that either an inequality constraint is satisfied at equality, or the Lagrangian multiplier $\mu_j^*$ is zero.

The conditions (→18.16)–(→18.18) are referred to as the Karush–Kuhn–Tucker (KKT) conditions, and they are necessary conditions for a solution to a nonlinear constrained optimization problem to be optimal. For a maximization-type problem, the KKT conditions remain unchanged with the exception of the first condition (→18.16), which is written as
$$
\nabla f(x^*) - \sum_{i=1}^{p} \lambda_i^* \nabla g_i(x^*) - \sum_{j=1}^{m-p} \mu_j^* \nabla h_j(x^*) = 0.
$$
Note that the KKT conditions (→18.16)–(→18.18) represent the stationarity, the complementary slackness, and the dual feasibility, respectively. Other supplementary KKT conditions are the primal feasibility conditions defined by the constraints of the problem (→18.14).
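As a small worked illustration of how the conditions (→18.16)–(→18.18) are used, consider the following toy problem, which is chosen here for the example and does not come from the text: minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$ and $x_1 \le 1/4$.

```latex
\begin{align*}
L(x,\lambda,\mu) &= x_1^2 + x_2^2 + \lambda\,(x_1 + x_2 - 1) + \mu\left(x_1 - \tfrac{1}{4}\right) \\[4pt]
\text{Stationarity:}\quad & 2x_1 + \lambda + \mu = 0, \qquad 2x_2 + \lambda = 0 \\
\text{Complementary slackness:}\quad & \mu\left(x_1 - \tfrac{1}{4}\right) = 0 \\[4pt]
\text{Case } \mu = 0:\quad & x_1 = x_2 = \tfrac{1}{2}, \text{ which violates } x_1 \le \tfrac{1}{4} \\
\text{Case } x_1 = \tfrac{1}{4}:\quad & x_2 = \tfrac{3}{4}, \quad \lambda = -\tfrac{3}{2}, \quad
  \mu = -2x_1 - \lambda = 1 \ge 0
\end{align*}
```

Only the second case satisfies primal and dual feasibility, so the candidate solution is $x^* = (1/4, 3/4)$ with objective value $5/8$. Because this toy problem is convex, the KKT point is also the global minimizer.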
In R, an implementation of the Lagrange multiplier method, for solving nonlinear constrained
optimization problems, can be found in the package Rsolnp.
Let us use the function solnp from the R package Rsolnp to solve the following constrained
nonlinear minimization problem:
$$
(P)\quad
\begin{aligned}
\text{Minimize}\quad & f(x_1, x_2) = e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1 x_2 + 2x_2 + 1\right) \\
\text{subject to}\quad & x_1 + x_2 = 1 \\
& x_1 x_2 \ge -10
\end{aligned}
$$
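As a package-free cross-check of $(P)$, the equality constraint can be used to eliminate $x_2$, which leaves a one-dimensional problem on an interval. The base R sketch below does this with `optimize()` and then verifies the KKT conditions at the resulting boundary point. It does not use Rsolnp, and the variable names are assumptions made for this illustration.

```r
f <- function(x) exp(x[1]) * (4 * x[1]^2 + 2 * x[2]^2 + 4 * x[1] * x[2] + 2 * x[2] + 1)

# Use the equality constraint to eliminate x2 = 1 - x1
f_reduced <- function(x1) f(c(x1, 1 - x1))

# x1 * (1 - x1) >= -10  <=>  x1^2 - x1 - 10 <= 0, i.e. x1 lies between the two roots
lo <- (1 - sqrt(41)) / 2
hi <- (1 + sqrt(41)) / 2
opt <- optimize(f_reduced, interval = c(lo, hi), tol = 1e-10)
x_star <- c(opt$minimum, 1 - opt$minimum)
x_star          # numerically at the left end of the interval

# KKT check at that boundary point, with g(x) = x1 + x2 and h(x) = -x1 * x2 <= 10
xb <- c(lo, 1 - lo)
grad_f <- c(f(xb) + exp(xb[1]) * (8 * xb[1] + 4 * xb[2]),
            exp(xb[1]) * (4 * xb[2] + 4 * xb[1] + 2))
grad_g <- c(1, 1)
grad_h <- c(-xb[2], -xb[1])

# Stationarity: grad_f + lambda * grad_g + mu * grad_h = 0
mult <- solve(cbind(grad_g, grad_h), -grad_f)
lambda <- unname(mult[1])
mu <- unname(mult[2])
c(lambda = lambda, mu = mu)   # mu > 0: dual feasibility holds
```

The reduced objective is increasing on the feasible interval, so the minimum sits where the inequality constraint is active, and the positive multiplier `mu` is consistent with (→18.17) and (→18.18).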
18.5 Some applications in statistical machine learning
Much of statistical theory, including statistical machine learning, concerns the efficient
use of collected data to estimate the unknown parameters of a model that answers the
questions of interest.

18.5.1 Maximum likelihood estimation


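The likelihood function of a parameter $\omega$ for a given observed data set, $\mathcal{D}$, denoted by $\mathcal{L}$, is defined by
$$
\mathcal{L}(\omega) = \kappa P(\mathcal{D}, \omega), \quad \omega \in \Omega;
\tag{18.19}
$$
where $\kappa$ is a constant independent of $\omega$, $P(\mathcal{D}, \omega)$ is the probability of the observed data set, and $\Omega$ is the feasible set of $\omega$.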
When the data set, $\mathcal{D}$, consists of a complete random sample $x_1, x_2, \ldots, x_n$ from a discrete probability distribution with probability function $p(x|\omega)$, the probability of the observed data set is given by
$$
P(\mathcal{D}, \omega) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \,|\, \omega)
\tag{18.20}
$$
$$
= \prod_{i=1}^{n} P(X_i = x_i \,|\, \omega) = \prod_{i=1}^{n} p(x_i|\omega).
\tag{18.21}
$$

When the data set $\mathcal{D}$ consists of a complete random sample $x_1, x_2, \ldots, x_n$ from a continuous probability distribution with probability density function $f(x|\omega)$, then $x_i \in \mathbb{R}$ and the observation $x_i$ falls within a small interval $[x_i, x_i + \Delta x_i]$ with approximate probability $\Delta x_i f(x_i|\omega)$. The probability of the observed data set is then given by
$$
P(\mathcal{D}, \omega) \approx \prod_{i=1}^{n} \Delta x_i f(x_i|\omega) = \prod_{i=1}^{n} \Delta x_i \prod_{i=1}^{n} f(x_i|\omega).
\tag{18.22}
$$

Definition 18.5.1.

Without loss of generality, the likelihood function of a sample is proportional to the product of the conditional probability of the data sample, given the parameter of interest, i. e.,
$$
\mathcal{L}(\omega) \propto \prod_{i=1}^{n} f(x_i|\omega), \quad \omega \in \Omega.
\tag{18.23}
$$

Definition 18.5.2.

The value of the parameter $\omega$ that maximizes the likelihood $\mathcal{L}(\omega)$, and hence the probability of the observed data set $P(\mathcal{D}, \omega)$, is known as the maximum likelihood estimator (MLE) of $\omega$ and is denoted $\hat{\omega}$.
Note that the MLE $\hat{\omega}$ is a function of the data sample $x_1, x_2, \ldots, x_n$. The likelihood function (→18.23) is often complex to manipulate and, in practice, it is more convenient to work with the logarithm of $\mathcal{L}(\omega)$, i. e., $\log \mathcal{L}(\omega)$. Because the logarithm is strictly increasing, it yields the same optimal parameter $\hat{\omega}$.
The MLE problem can then be formulated as the following optimization problem, which can be solved using the numerical methods, implemented in R, presented in the previous sections.
$$
\underset{\omega \in \Omega}{\text{Maximize}}\ \log \mathcal{L}(\omega).
\tag{18.24}
$$
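A minimal sketch of (→18.24) in base R is given below. The exponential model, the small data vector, and the search interval are assumptions chosen for illustration; the exponential case is convenient because the numerical result can be compared with the closed-form MLE, which is the reciprocal of the sample mean.

```r
# Assumed sample, treated as i.i.d. draws from an exponential distribution
x <- c(0.5, 1.2, 0.3, 2.1, 0.9, 1.7, 0.4, 1.1)

# Log-likelihood of the rate omega: sum_i log f(x_i | omega),
# with density f(x | omega) = omega * exp(-omega * x)
log_lik <- function(omega) sum(dexp(x, rate = omega, log = TRUE))

# One-dimensional maximization of the log-likelihood over an assumed interval
fit <- optimize(log_lik, interval = c(1e-6, 50), maximum = TRUE, tol = 1e-10)
omega_hat <- fit$maximum

# Closed-form MLE for the exponential rate, used as a check
omega_closed <- 1 / mean(x)
c(numerical = omega_hat, closed_form = omega_closed)
```

For models with several parameters, the same pattern applies with `optim()` and a vector-valued argument, usually by minimizing the negative log-likelihood.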

18.5.2 Support vector classification


Suppose that we are given the following data points: $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$. The fundamental idea behind the concept of support vector machine (SVM) classification [→191] is to find a pair $(w, b) \in \mathbb{R}^m \times \mathbb{R}$ such that the hyperplane defined by $\langle w, x \rangle + b = 0$ separates the data points labeled $y_i = +1$ from those labeled $y_i = -1$, and maximizes the distance to the closest points from either class. If the points $(x_i, y_i)$, $i = 1, \ldots, n$, are linearly separable, then such a pair exists.
Let $(x_1, y_1)$ and $(x_2, y_2)$, with $y_1 = +1$ and $y_2 = -1$, be the closest points on either side of the optimal hyperplane defined by $\langle w, x \rangle + b = 0$. Then, after rescaling $w$ and $b$ appropriately, we have
$$
\begin{cases}
\langle w, x_1 \rangle + b = +1, \\
\langle w, x_2 \rangle + b = -1.
\end{cases}
\tag{18.25}
$$

From (→18.25), we have $\langle w, (x_1 - x_2) \rangle = 2 \implies \left\langle \frac{w}{\|w\|}, (x_1 - x_2) \right\rangle = \frac{2}{\|w\|}$. Hence, for the distance between the points $(x_1, y_1)$ and $(x_2, y_2)$ to be maximum, we need the ratio $\frac{2}{\|w\|}$ to be as large as possible or, equivalently, we need the ratio $\frac{\|w\|}{2}$ to be as small as possible, i. e.,
$$
\underset{w \in \mathbb{R}^m}{\text{minimize}}\ \frac{1}{2}\|w\|^2.
\tag{18.26}
$$

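Generalizing (→18.25) to all the points $(x_i, y_i)$ yields the following:
$$
\begin{cases}
\langle w, x_i \rangle + b \ge +1, & \text{if } y_i = +1 \\
\langle w, x_i \rangle + b \le -1, & \text{if } y_i = -1
\end{cases}
\implies y_i \left(\langle w, x_i \rangle + b\right) \ge 1.
\tag{18.27}
$$
Thus, to construct such an optimal hyperplane, it is necessary to solve the following problem:
$$
\begin{aligned}
\underset{w \in \mathbb{R}^m,\ b \in \mathbb{R}}{\text{minimize}}\quad & z(w) = \frac{1}{2}\|w\|^2 \\
\text{subject to}\quad & y_i \left(\langle w, x_i \rangle + b\right) \ge 1, \quad \text{for } i = 1, \ldots, n.
\end{aligned}
\tag{18.28}
$$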

The above problem is a constrained optimization problem with a nonlinear (quadratic) objective
function and linear constraints, which can be solved using the Lagrange multiplier method.
The Lagrangian associated with the problem (→18.28) can be defined as follows:

$$
L(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \lambda_i \left(y_i \left(\langle w, x_i \rangle + b\right) - 1\right),
\tag{18.29}
$$
where $\lambda_i \ge 0$, for $i = 1, \ldots, n$, denote the Lagrange multipliers.
The Lagrangian (→18.29) must be minimized with respect to w and b, and maximized with respect to λ.
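Solving
$$
\begin{cases}
\dfrac{\partial}{\partial b} L(w, b, \lambda) = 0 \\[6pt]
\dfrac{\partial}{\partial w} L(w, b, \lambda) = 0
\end{cases}
\tag{18.30}
$$
yields
$$
\sum_{i=1}^{n} \lambda_i y_i = 0,
\tag{18.31}
$$
$$
w = \sum_{i=1}^{n} \lambda_i y_i x_i.
\tag{18.32}
$$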

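Substituting w into the Lagrangian (→18.29) leads to the following optimization problem, also known as the dual formulation of the support vector classifier:
$$
\begin{aligned}
\underset{\lambda \in \mathbb{R}^n}{\text{maximize}}\quad & Z(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \langle x_i, x_j \rangle \\
\text{subject to}\quad & \sum_{i=1}^{n} \lambda_i y_i = 0, \\
& \lambda_i \ge 0, \quad \text{for } i = 1, \ldots, n.
\end{aligned}
\tag{18.33}
$$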

Both problems (→18.28) and (→18.33) can be solved in R using the package Rsolnp, as illustrated in
Listing (18.18). However, since the constraints of (→18.28) are relatively complex, it is
computationally easier to solve the problem (→18.33) and then recover the vector w through
(→18.32).
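To make the two formulations concrete, the following base R sketch solves the primal problem (→18.28) for a tiny, linearly separable data set with `constrOptim()`, and then recovers $w$ from the dual multipliers through $w = \sum_i \lambda_i y_i x_i$. It is an illustration under stated assumptions, not the Rsolnp listing referred to above: the data, the starting point, and the multipliers (worked out by hand for these four points) are all chosen here for the example.

```r
# Toy, linearly separable data (assumed for illustration): n = 4 points in R^2
X <- rbind(c(1, 1), c(2, 2), c(-1, -1), c(-2, -1))
y <- c(+1, +1, -1, -1)

# Primal problem: theta = (w1, w2, b), objective 0.5 * ||w||^2
obj  <- function(theta) 0.5 * sum(theta[1:2]^2)
grad <- function(theta) c(theta[1:2], 0)

# constrOptim() expects the constraints in the form ui %*% theta - ci >= 0,
# here y_i * (<w, x_i> + b) - 1 >= 0
ui <- y * cbind(X, 1)     # row i is y_i * (x_i, 1)
ci <- rep(1, nrow(X))

theta0 <- c(1, 1, 0)      # strictly feasible starting point
fit <- constrOptim(theta0, obj, grad, ui = ui, ci = ci)
w_hat <- fit$par[1:2]
b_hat <- fit$par[3]
c(w = w_hat, b = b_hat)   # close to w = (0.5, 0.5), b = 0

# Dual route: for this data set only the points (1, 1) and (-1, -1) are
# support vectors, and maximizing Z(lambda) by hand gives lambda = 0.25 for both
lambda <- c(0.25, 0, 0.25, 0)
w_dual <- colSums(lambda * y * X)   # w = sum_i lambda_i * y_i * x_i
w_dual
```

Both routes lead to the same separating hyperplane, $x_1 + x_2 = 0$, whose margin is $1/\|w\| = \sqrt{2}$ on each side.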

18.6 Further reading


For advanced readings about numerical methods in optimization, we recommend the following
references: [→13], [→16], [→144]. For stochastic optimization, the textbook [→162] provides a
good introduction and overview.

18.7 Summary
Optimization is a broad and complex topic. One of the major challenges in optimization is the
determination of global optima for nonlinear and high-dimensional problems. Generally,
optimization methods find applications in attempts to optimize a parametric decision-making
process, such as classification, clustering, or regression of data. The corresponding optimization
problems either involve complex nonlinear functions or are based on data points, i. e., the
problems include discontinuities. Knowledge about optimization methods can be helpful in
designing analysis methods, since they usually involve difficult optimization solutions. Hence, a
parsimonious approach for designing such analysis methods will also help to keep optimization
problems tractable.

18.8 Exercises
1. Consider the following unconstrained problem:
$$
\max_{(x_1, x_2) \in \mathbb{R}^2} f(x_1, x_2) = -(x_1 - 5)^2 - (x_2 - 3)^2.
\tag{18.34}
$$
Using R, provide the contour plot of the function $f(x_1, x_2)$ and solve the problem (→18.34) using
the steepest descent method;
the conjugate gradient method;
Newton’s method;
the Nelder–Mead method;
Simulated annealing;
2. Consider the following unconstrained problem:
1 (18.35)
2 2
min z(x1 , x2 ) = −2x1 − 3x2 + x1 + 2x2 − 3x1 x2 .
(x1 ,x2 )∈R
2
5

Using R, provide the contour plot of the function $z(x_1, x_2)$ and solve the problem (→18.35) using
the steepest descent method;
the conjugate gradient method;
Newton’s method;
the Nelder–Mead method;
Simulated annealing;
3. Using the R package lpSolveAPI, solve the following linear programming problems:
Minimize f (x1 , x2 ) = 2x1 + 3x2

1 1
subject to x1 + x2 ≤ 4
4
2

(A) x1 + 3x2 ≥ 20

x1 + x2 = 10

x1 , x2 ≥ 0

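$$
(B)\quad
\begin{aligned}
\text{Maximize}\quad & f(x_1, x_2) = 3x_1 + x_2 \\
\text{subject to}\quad & x_1 + x_2 \ge 3 \\
& 2x_1 + x_2 \le 4 \\
& x_1 + x_2 = 3 \\
& x_1, x_2 \ge 0
\end{aligned}
$$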

4. Using the function solnp from the R package Rsolnp, solve the following nonlinear
constrained optimization problems:
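$$
(A)\quad
\begin{aligned}
\text{Minimize}\quad & f(x_1, x_2) = (x_1 - 1)^2 + (x_2 - 2)^2 \\
\text{subject to}\quad & -x_1 + x_2 = 1 \\
& x_1 + x_2 \le 3
\end{aligned}
$$

$$
(B)\quad
\begin{aligned}
\text{Minimize}\quad & f(x_1, x_2) = -x_1^2 - x_2^2 + 3x_1 + 5x_2 \\
\text{subject to}\quad & x_1 + x_2 \le 7 \\
& x_1 \le 5 \\
& x_2 \le 6
\end{aligned}
$$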
Bibliography
[1] H. Abelson, G. J. Sussman, and J. Sussman. Structure and
Interpretation of Computer Programs. MIT Press; 2nd edition,
1996. a, b, c
[2] L. Adamic and B. Huberman. Power-law distribution of the
world wide web. Science, 287:2115a, 2000. →
[3] W. A. Adkins and M. G. Davidson. Ordinary Differential
Equations. Undergraduate Texts in Mathematics. Springer New
York, 2012. a, b
[4] R. Albert and A. L. Barabási. Statistical mechanics of complex
networks. Rev. Mod. Phys., 74:47–97, 2002. a, b
[5] G. R. Andrews. Foundations of Multithreaded, Parallel, and
Distributed Programming. Addison-Wesley, 1999. →
[6] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, and H. Butler
et al. Gene ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nat. Genet., 25(1):25–29, May 2000. →
[7] K. Baltakys, J. Kanniainen, and F. Emmert-Streib. Multilayer
aggregation of investor trading networks. Sci. Rep., 1:8198,
2018. →
[8] A. L. Barabási and R. Albert. Emergence of scaling in random
networks. Science, 206:509–512, 1999. a, b
[9] A. L. Barabási and Z. N. Oltvai. Network biology:
understanding the cell’s functional organization, Nat. Rev., 5:101–
113, 2004. →
[10] Albert-László Barabási. Network science. Philos. Trans. R. Soc.
Lond. A, 371(1987):20120375, 2013. →
[11] M. Barnsley. Fractals Everywhere. Morgan Kaufmann, 2000.

[12] R. G. Bartle and D. R. Sherbert. Introduction to Real Analysis.
Wiley Publishing, 1999. →
[13] Mokhtar S Bazaraa, Hanif D Sherali, and Chitharanjan M
Shetty. Nonlinear Programming: Theory and Algorithms. John Wiley
& Sons, 2013. →
[14] R. A. Becker and J. M. Chambers. An Interactive Environment
for Data Analysis and Graphics. Wadsworth & Brooks/Cole, Pacific
Grove, CA, USA, 1984. →
[15] M. Behzad, G. Chartrand, and L. Lesniak-Foster. Graphs &
Digraphs. International Series. Prindle, Weber & Schmidt, 1979.

[16] D. P. Bertsekas. Nonlinear Programming. Athena Scientific
Optimization and Computation Series. Athena Scientific, 2016. →
[17] Dimitri P Bertsekas and John N Tsitsiklis. Introduction to
probability, volume 1, 2002. →
[18] Joseph K Blitzstein and Jessica Hwang. Introduction to
Probability. Chapman and Hall/CRC, 2014. →
[19] D. Bonchev. Information Theoretic Indices for Characterization
of Chemical Structures. Research Studies Press, Chichester, 1983.

[20] D. Bonchev and D. H. Rouvray. Chemical Graph Theory:
Introduction and Fundamentals. Mathematical Chemistry. Abacus
Press, 1991. →
[21] D. Bonchev and D. H. Rouvray. Complexity in Chemistry,
Biology, and Ecology. Mathematical and Computational
Chemistry. Springer, New York, NY, USA, 2005. →
[22] G. S. Boolos, J. P. Burgess, and R. C. Jeffrey. Computability and
Logic Cambridge University Press; 5th edition, 2007. →
[23] S. Bornholdt and H. G. Schuster. Handbook of Graphs and
Networks: From the Genome to the Internet. John Wiley & Sons,
Inc., New York, NY, USA, 2003. a, b, c
[24] U. Brandes and T. Erlebach. Network Analysis. Lecture Notes
in Computer Science. Springer, Berlin Heidelberg New York,
2005. →
[25] A. Brandstädt, V. B. Le, and J. P. Sprinrad. Graph Classes. A
Survey. SIAM Monographs on Discrete Mathematics and
Applications, 1999. →
[26] L. Breiman. Random forests. Mach. Learn., 45:5–32, 2001. →
[27] O. Bretscher. Linear Algebra with Applications. Prentice Hall;
3rd edition, 2004. a, b, c, d, e, f, g, h, i, j
[28] Sergey Brin and Lawrence Page. The anatomy of a large-
scale hypertextual Web search engine. Comput. Netw. ISDN Syst.,
30(1–7):107–117, 1998. →
[29] M. Brinkmeier and T. Schank. Network statistics. In U.
Brandes and T. Erlebach, editors, Network Analysis, Lecture Notes
of Computer Science, pages 293–317. Springer, 2005. →
[30] I. A. Bronstein, A. Semendjajew, G. Musiol, and H. Mühlig.
Taschenbuch der Mathematik. Harri Deutsch Verlag, 1993. →
[31] F. Buckley and F. Harary. Distance in Graphs. Addison Wesley
Publishing Company, 1990. →
[32] P. E. Ceruzzi. A History of Modern Computing. MIT Press; 2nd
edition, 2003. a, b, c, d
[33] S. Chiaretti, X. Li, R. Gentleman, A. Vitale, M. Vignetti, F.
Mandelli, J. Ritz, and R. Foa. Gene expression profile of adult t-cell
acute lymphocytic leukemia identifies distinct subsets of patients
with different response to therapy and survival. Blood,
103(7):2771–2778, 2003. →
[34] W. F. Clocksin and C. S. Mellish. Programming in Prolog: Using
the ISO Standard. Springer, 2002. →
[35] S. Cole-Kleene. Mathematical Logic. Dover Books on
Mathematics. Dover Publications, 2002. a, b, c
[36] B. Jack Copeland, C. J. Posy, and O. Shagrir. Elements of
Information Theory. The MIT Press, 2013. a, b, c, d
[37] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to
Algorithms. MIT Press, 1990. a, b, c, d, e, f, g, h, i, j
[38] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.
Introduction to Algorithms. MIT Press, 2001. a, b, c, d, e, f, g, h, i, j,
k, l, m
[39] T. M. Cover and J. A. Thomas. Information Theory. John Wiley
& Sons, Inc., 1991. a, b
[40] T. M. Cover and J. A. Thomas. Elements of Information Theory.
Wiley Series in Telecommunications and Signal Processing. Wiley
& Sons, 2006. →
[41] N. Cristianini and J. Shawe-Taylor. An Introduction to Support
Vector Machines. Cambridge University Press, Cambridge, UK,
2000. a, b
[42] Gabor Csardi and Tamas Nepusz. The igraph software
package for complex network research. InterJournal, Complex
Systems:1695, 2006, http://igraph.sf.net. →
[43] L. da F. Costa, F. Rodrigues, and G. Travieso.
Characterization of complex networks: a survey of
measurements. Adv. Phys., 56:167–242, 2007. →
[44] R. de Matos Simoes and F. Emmert-Streib. Influence of
statistical estimators of mutual information and data
heterogeneity on the inference of gene regulatory networks.
PLoS ONE, 6(12):e29279, 2011. →
[45] R. de Matos Simoes and F. Emmert-Streib. Bagging
statistical network inference from large-scale gene expression
data. PLoS ONE, 7(3):e33624, 2012. →
[46] Pierre Lafaye de Micheaux, Rémy Drouilhet, and Benoit
Liquet. The r software. 2013. →
[47] J. Debasish. C++ and Object Oriented Programming Paradigm.
PHI Learning Pvt. Ltd., 2005. a, b, c, d, e
[48] Morris H DeGroot and Mark J Schervish. Probability and
statistics. Pearson Education, 2012. a, b
[49] M. Dehmer. Die analytische Theorie der Polynome.
Nullstellenschranken für komplexwertige Polynome. Weissensee-
Verlag, Berlin, Germany, 2004. →
[50] M. Dehmer. On the location of zeros of complex
polynomials. J. Inequal. Pure Appl. Math., 7(1), 2006. a, b
[51] M. Dehmer. Strukturelle Analyse web-basierter Dokumente.
Multimedia und Telekooperation. Deutscher Universitäts Verlag,
Wiesbaden, 2006. →
[52] M. Dehmer, editor. Structural Analysis of Complex Networks.
Birkhäuser Publishing, 2010. a, b
[53] M. Dehmer and F. Emmert-Streib, editors. Analysis of
Complex Networks: From Biology to Linguistics. Wiley-VCH,
Weinheim, 2009. →
[54] M. Dehmer, K. Varmuza, S. Borgert, and F. Emmert-Streib.
On entropy-based molecular descriptors: statistical analysis of
real and synthetic chemical structures. J. Chem. Inf. Model.,
49:1655–1663, 2009. →
[55] M. Dehmer, K. Varmuza, S. Borgert, and F. Emmert-Streib.
On entropy-based molecular descriptors: statistical analysis of
real and synthetic chemical structures. J. Chem. Inf. Model.,
49(7):1655–1663, 2009. →
[56] R. Devaney and M. W. Hirsch. Differential Equations,
Dynamical Systems, and an Introduction to Chaos. Academic Press,
2004. →
[57] J. Devillers and A. T. Balaban. Topological Indices and Related
Descriptors in QSAR and QSPR. Gordon and Breach Science
Publishers, Amsterdam, The Netherlands, 1999. →
[58] E. W. Dijkstra. A note on two problems in connection with
graphs. Numer. Math., 1:269–271, 1959. a, b, c, d, e, f, g
[59] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of Networks.
From Biological Networks to the Internet and WWW. Oxford
University Press, 2003. →
[60] J. Duckett. Beginning HTML, XHTML, CSS, and JavaScript. Wrox,
2009. →
[61] F Emmert-Streib and M Dehmer. Defining data science by a
data-driven quantification of the community. Machine Learning
and Knowledge Extraction, 1(1):235–251, 2019. a, b
[62] F. Emmert-Streib. Exploratory analysis of spatiotemporal
patterns of cellular automata by clustering compressibility. Phys.
Rev. E, 81(2):026103, 2010. →
[63] F. Emmert-Streib and M. Dehmer. Topolocial mappings
between graphs, trees and generalized trees. Appl. Math.
Comput., 186(2):1326–1333, 2007. a, b, c
[64] F. Emmert-Streib and M. Dehmer, editors. Analysis of
Microarray Data: A Network-based Approach. Wiley VCH
Publishing, 2010. a, b, c, d, e
[65] F. Emmert-Streib and M. Dehmer. Identifying critical
financial networks of the djia: towards a network based index.
Complexity, 16(1), 2010. →
[66] F. Emmert-Streib and M. Dehmer. Influence of the time
scale on the construction of financial networks. PLoS ONE,
5(9):e12884, 2010. →
[67] F. Emmert-Streib and M. Dehmer. Networks for systems
biology: conceptual connection of data and function. IET Syst.
Biol., 5:185–207, 2011. a, b
[68] F. Emmert-Streib and M. Dehmer. Evaluation of regression
models: model assessment, model selection and generalization
error. Machine Learning and Knowledge Extraction, 1(1):521–551,
2019. →
[69] F. Emmert-Streib, M. Dehmer, and B. Haibe-Kains.
Untangling statistical and biological models to understand
network inference: the need for a genomics network ontology.
Front. Genet., 5:299, 2014. →
[70] F. Emmert-Streib, M. Dehmer, and O. Yli-Harja. Against
Dataism and for data sharing of big biomedical and clinical data
with research parasites. Front. Genet., 7:154, 2016. →
[71] F. Emmert-Streib and G. V. Glazko. Network biology: a direct
approach to study biological function. Wiley Interdiscip. Rev., Syst.
Biol. Med., 3(4):379–391, 2011. →
[72] F. Emmert-Streib, G. V. Glazko, Gökmen Altay, and Ricardo
de Matos Simoes. Statistical inference and reverse engineering
of gene regulatory networks from observational expression
data. Front. Genet., 3:8, 2012. →
[73] F. Emmert-Streib, S. Moutari, and M. Dehmer. The process
of analyzing data is the emergent feature of data science. Front.
Genet., 7:12, 2016. →
[74] F. Emmert-Streib, S. Tripathi, O. Yli-Harja, and M. Dehmer.
Understanding the world economy in terms of networks: a
survey of data-based network science approaches on economic
networks. Front. Appl. Math. Stat., 4:37, 2018. a, b
[75] Frank Emmert-Streib and Matthias Dehmer. Network
science: from chemistry to digital society. Front. Young Minds,
2019. →
[76] P. Erdös and A. Rényi. On random graphs. I. Publ. Math.,
6:290–297, 1959. →
[77] P. Erdös and A. Rényi. On random graphs. Publ. Math. Inst.
Hung. Acad. Sci., 5:17, 1960. →
[78] G Fichtenholz. Differentialrechnung und Integralrechnung.
Verlag Harri Deutsch, 1997. a, b, c
[79] R. W. Floyd. The paradigms of programming. Commun. ACM,
22(8):455–460, 1979. a, b, c, d
[80] L. C. Freeman. A set of measures of centrality based on
betweenness. Sociometry, 40, 1977. a, b, c
[81] L. C. Freeman. Centrality in social networks: conceptual
clarification. Soc. Netw., 1:215–239, 1979. a, b, c
[82] Thomas M. J. Fruchterman and Edward M. Reingold. Graph
drawing by force-directed placement. Softw. Pract. Exp.,
21(11):1129–1164, 1991. →
[83] R. G. Gallager. Information Theory and Reliable
Communication. Wiley, 1968. →
[84] A Gelman, J B Carlin, H S Stern, and D B Rubin. Bayesian Data
Analysis. Chapman & Hall/CRC, 2003. →
[85] C. Gershenson. Classification of random boolean networks.
In R. K. Standish, M. A. Bedau, and H. A. Abbass, editors, Artificial
Life VIII, pages 1–8. MIT Press, Cambridge, 2003. →
[86] G. H. Golub and C. F. Van Loan. Matrix Computation. The
Johns Hopkins University, 2012. →
[87] Geoffrey Grimmett, Geoffrey R Grimmett, and David
Stirzaker. Probability and Random Processes. Oxford University
Press, 2001. →
[88] Jonathan L Gross and Jay Yellen. Graph Theory and Its
Applications. CRC Press, 2005. →
[89] Grundlagen der Informatik für Ingenieure, 2008. Course
materials, School of Computer Science, Otto-von-Guericke-
University Magdeburg, Germany. a, b, c, d
[90] I. Gutman. The energy of a graph: old and new results. In A.
Betten, A. Kohnert, R. Laue, and A. Wassermann, editors,
Algebraic Combinatorics and Applications, pages 196–211. Springer
Verlag, Berlin, 2001. →
[91] Ian Hacking. The Emergence of Probability: A Philosophical
Study of Early Ideas About Probability, Induction and Statistical
Inference. Cambridge University Press, 2006. a, b
[92] P. Hage and F. Harary. Eccentricity and centrality in
networks. Soc. Netw., 17:57–63, 1995. →
[93] R. Halin. Graphentheorie. Akademie Verlag, Berlin, Germany,
1989. →
[94] F. Harary. Graph Theory. Addison Wesley Publishing
Company, Reading, MA, USA, 1969. a, b, c, d, e, f, g, h, i, j, k
[95] R. Harrison, L. G. Smaraweera, M. R. Dobie, and P. H. Lewis.
Comparing programming paradigms: an evaluation of functional
and object-oriented programs. Softw. Eng. J., 11(4):247–254,
1996. a, b
[96] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of
Statistical Learning. Springer, Berlin, New York, 2001. →
[97] D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Understanding
Robust and Exploratory Data Analysis. Wiley, New York, 1983. →
[98] R. E. Hodel. An Introduction to Mathematical Logic. Dover
Publications, 2013. a, b, c, d, e, f, g, h, i
[99] A. S. Householder. The Numerical Treatment of a Single
Nonlinear Equation. McGraw-Hill, New York, NY, USA, 1970. →
[100] T. Ihringer. Diskrete Mathematik. Teubner, Stuttgart, 1994.
a, b
[101] Edwin T Jaynes. Probability Theory: The Logic of Science.
Cambridge University Press, 2003. →
[102] J. Jost. Partial Differential Equations. Springer, New York, NY,
USA, 2007. a, b
[103] G. Julia. Mémoire sur l’itération des fonctions rationnelles.
J. Math. Pures Appl., 8:47–245, 1918. →
[104] B. Junker, D. Koschützki, and F. Schreiber. Exploration of
biological network centralities with centibin. BMC Bioinform.,
7(1):219, 2006. a, b
[105] Joseph B Kadane. Principles of Uncertainty. Chapman and
Hall/CRC, 2011. →
[106] Tomihisa Kamada, Satoru Kawai, et al. An algorithm for
drawing general undirected graphs. Inf. Process. Lett., 31(1):7–15,
1989. →
[107] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of
Genes and Genomes. Nucleic Acids Res., 28:27–30, 2000. →
[108] Daniel Kaplan and Leon Glass. Understanding Nonlinear
Dynamics. Springer Science & Business Media, 2012. →
[109] S. A. Kauffman. The Origin of Order: Self Organization and
Selection in Evolution. Oxford University Press, USA, 1993. a, b, c
[110] S. V. Kedar. Programming Paradigms and Methodology.
Technical Publications, 2008. a, b, c, d
[111] U. Kirch-Prinz and P. Prinz. C++. Lernen und professionell
anwenden. mitp Verlag, 2005. a, b, c, d, e
[112] D. G. Kleinbaum and M. Klein. Survival Analysis: A Self-
Learning Text. Statistics for Biology and Health. Springer, 2005.

[113] V. Kontorovich, L. A. Beltrn, J. Aguilar, Z. Lovtchikova, and K.
R. Tinsley. Cumulant analysis of Rössler attractor and its
applications. Open Cybern. Syst. J., 3:29–39, 2009. →
[114] Kevin B Korb and Ann E Nicholson. Bayesian Artificial
Intelligence. CRC Press, 2010. a, b
[115] R. C. Laubenbacher. Modeling and Simulation of Biological
Networks. Proceedings of Symposia in Applied Mathematics.
American Mathematical Society, 2007. →
[116] S. L. Lauritzen. Graphical Models. Oxford Statistical Science
Series. Oxford University Press, 1996. →
[117] M. Z. Li, M. S. Ryerson, and H. Balakrishnan. Topological
data analysis for aviation applications. Transp. Res., Part E, Logist.
Transp. Rev., 128:149–174, 2019. →
[118] Dennis V Lindley. Understanding Uncertainty. John Wiley &
Sons, 2013. →
[119] E. N. Lorenz. Deterministic nonperiodic flow. J. Atmos. Sci.,
20:130–141, 1963. →
[120] A. J. Lotka. Elements of Physical Biology. Williams and
Wilkins, 1925. →
[121] K. C. Louden. Compiler Construction: Principles and Practice.
Course Technology, 1997. a, b
[122] K. C. Louden and K. A. Lambert. Programming Languages:
Principles and Practice. Advanced Topics Series. Cengage
Learning, 2011. a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v,
w, x, y, z, aa, ab, ac, ad, ae, af, ag, ah, ai, aj, ak, al, am
[123] D. J. C. MacKay. Information Theory, Inference and Learning
Algorithms. Cambridge University Press, 2003. →
[124] D. Maier. Theory of Relational Databases. Computer Science
Press; 1st edition, 1983. →
[125] B. B. Mandelbrot. The Fractal Geometry of Nature. W. H.
Freeman and Company, San Francisco, 1983. a, b
[126] E. G. Manes and A. A. Arbib. Algebraic Approaches to
Program Semantics. Monographs in Computer Science. Springer,
1986. a, b, c, d, e, f, g
[127] M. Marden. Geometry of polynomials. Mathematical Surveys
of the American Mathematical Society, Vol. 3. Rhode Island, USA,
1966. →
[128] O. Mason and M. Verwoerd. Graph theory and networks in
biology. IET Syst. Biol., 1(2):89–119, 2007. a, b
[129] N. Matloff. The Art of R Programming: A Tour of Statistical
Software Design. No Starch Press, 2011. →
[130] B. D. McKay. Graph isomorphisms. Congr. Numer., 730:45–
87, 1981. →
[131] J. M. McNamee. Numerical Methods for Roots of Polynomials.
Part I. Elsevier, 2007. →
[132] A. Mehler, M. Dehmer, and R. Gleim. Towards logical
hypertext structure. A graph-theoretic perspective. In
Proceedings of I2CS’04, Lecture Notes, pages 136–150. Springer,
Berlin–New York, 2005. →
[133] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, and A.
H. Teller. Equations of state calculations by fast computing
machines. J. Chem. Phys., 21(6):1087–1092, 1953. →
[134] C. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM,
2000. →
[135] M. Mignotte and D. Stefanescu. Polynomials: An Algorithmic
Approach. Discrete Mathematics and Theoretical Computer
Science. Springer, Singapore, 1999. a, b, c, d, e, f
[136] J. C. Mitchell. Concepts in Programming Languages.
Cambridge University Press, 2003. a, b, c
[137] Michael Mitzenmacher and Eli Upfal. Probability and
Computing: Randomization and Probabilistic Techniques in
Algorithms and Data Analysis. Cambridge University Press, 2017.

[138] Mehryar Mohri, Afshin Rostamizadeh, and Ameet
Talwalkar. Foundations of Machine Learning. MIT Press, 2018. →
[139] D. Moore, R. de Matos Simoes, M. Dehmer, and F. Emmert-
Streib. Prostate cancer gene regulatory network inferred from
RNA-Seq data. Curr. Genomics, 20(1):38–48, 2019. →
[140] C. Müssel, M. Hopfensitz, and H. A. Kestler. Boolnet—an R
package for generation, reconstruction and analysis of Boolean
networks. Bioinformatics, 26(10):1378–1380, 2010. →
[141] M. Newman. Networks: An Introduction. Oxford University
Press, Oxford, 2010. →
[142] M. E. J. Newman. The structure and function of complex
networks. SIAM Rev., 45:167–256, 2003. a, b
[143] M. E. J. Newman, A. L. Barabási, and D. J. Watts. The
Structure and Dynamics of Networks. Princeton Studies in
Complexity. Princeton University Press, 2006. a, b
[144] Jorge Nocedal and Stephen Wright. Numerical Optimization.
Springer Science & Business Media, 2006. →
[145] Peter Olofsson. Probabilities: The Little Numbers That Rule
Our Lives. John Wiley & Sons, 2015. →
[146] G. O’Regan. Mathematics in Computing: An Accessible Guide
to Historical, Foundational and Application Contexts. Springer,
2012. →
[147] A. Papoulis. Probability, Random Variables, and Stochastic
Processes. McGraw-Hill, 1991. →
[148] Lothar Papula. Mathematik für Ingenieure und
Naturwissenschaftler Band 1: Ein Lehr-und Arbeitsbuch für das
Grundstudium. Springer-Verlag, 2018. →
[149] J. Pearl. Probabilistic Reasoning in Intelligent Systems.
Morgan-Kaufmann, 1988. →
[150] J. Pitman. Probability. Springer Texts in Statistics. Springer
New York, 1999. →
[151] V. V. Prasolov. Polynomials. Springer, 2004. →
[152] T. W. Pratt, M. V. Zelkowitz, and T. V. Gopal. Programming
Languages: Design and Implementation, volume 4. Prentice-Hall,
2000. →
[153] R, software, a language and environment for statistical
computing. www.r-project.org, 2018. R Development Core Team,
Foundation for Statistical Computing, Vienna, Austria. →
[154] R Development Core Team. R: A Language and Environment
for Statistical Computing. R Foundation for Statistical Computing,
Vienna, Austria, 2008. ISBN 3-900051-07-0. →
[155] Q. I. Rahman and G. Schmeisser. Analytic Theory of
Polynomials. Critical Points, Zeros and Extremal Properties.
Clarendon Press, Oxford, UK, 2002. a, b, c
[156] J.-P. Rodrigue, C. Comtois, and B. Slack. The Geography of
Transport Systems. Taylor & Francis, 2013. a, b
[157] O. E. Rössler. An equation for hyperchaos. Phys. Lett.,
71A:155–157, 1979. →
[158] W. Rudin. Real and Complex Analysis. McGraw-Hill, 3rd
edition, 1986. a, b, c, d, e, f, g, h, i
[159] G. Sabidussi. The centrality index of a graph. Psychometrika,
31:581–603, 1966. →
[160] A. Salomaa. Formal Languages. Academic Press, 1973. a, b
[161] Leonard J Savage. The Foundations of Statistics. Courier
Corporation, 1972. →
[162] J. Schneider and S. Kirkpatrick. Stochastic Optimization.
Scientific Computation. Springer Berlin Heidelberg, 2007. →
[163] Uwe Schöning. Algorithmen—kurz gefasst. Spektrum
Akademischer Verlag, 1997. →
[164] Uwe Schöning. Theoretische Informatik—kurz gefasst.
Spektrum Akademischer Verlag, 2001. a, b, c
[165] H. G. Schuster. Deterministic Chaos. Wiley VCH Publisher,
1988. →
[166] K. Scott. The SQL Programming Language. Jones & Bartlett
Publishers, 2009. a, b
[167] M. L. Scott. Programming Language Pragmatics. Morgan
Kaufmann, 2009. a, b
[168] R. W. Sebesta. Concepts of Programming Languages, volume
9. Addison-Wesley Reading, 2009. a, b, c, d, e, f
[169] C. E. Shannon and W. Weaver. The Mathematical Theory of
Communication. University of Illinois Press, 1949. a, b
[170] L. Shapiro. Organization of relational models. In
Proceedings of Intern. Conf. on Pattern Recognition, pages 360–365,
1982. →
[171] D. J. Sheskin. Handbook of Parametric and Nonparametric
Statistical Procedures. CRC Press, Boca Raton, FL; 3rd edition,
2004. a, b
[172] W. Sierpiński. On curves which contain the image of any
given curve. Mat. Sbornik. In Russian. French translation in Oeuvres
Choisies II, 30:267–287, 1916. →
[173] Devinderjit Sivia and John Skilling. Data Analysis: A Bayesian
Tutorial. OUP Oxford, 2006. →
[174] V. A. Skorobogatov and A. A. Dobrynin. Metrical analysis of
graphs. MATCH Commun. Math. Comput. Chem., 23:105–155,
1988. →
[175] P. Smith. An Introduction to Formal Logic. Cambridge
University Press, 2003. a, b, c, d, e, f
[176] K. Soetaert, J. Cash, and F. Mazzia. Solving Differential
Equations in R. Springer-Verlag, New York, 2012. →
[177] K. Soetaert and P. M. J. Herman. A Practical Guide to
Ecological Modelling. Using R as a Simulation Platform. Springer-
Verlag, New York, 2009. →
[178] D. Ştefănescu. Bounds for real roots and applications to
orthogonal polynomials. In Computer Algebra in Scientific
Computing, 10th International Workshop, CASC 2007, Bonn,
Germany, pages 377–391, 2007. →
[179] S. Sternberg. Dynamical Systems. Dover Publications, New
York, NY, USA, 2010. →
[180] James V Stone. Bayes’ Rule: A Tutorial Introduction to
Bayesian Analysis. Sebtel Press, 2013. →
[181] S. H. Strogatz. Nonlinear Dynamics and Chaos: With
Applications to Physics, Biology, Chemistry, and Engineering.
Addison-Wesley, Reading, 1994. →
[182] K. Sydsaeter, P. Hammond, and A. Strom. Essential
Mathematics for Economic Analysis. Pearson; 4th edition, 2012. a,
b, c, d, e
[183] S. Thurner. Statistical mechanics of complex networks. In
M. Dehmer and F. Emmert-Streib, editors, Analysis of Complex
Networks: From Biology to Linguistics, pages 23–45. Wiley-VCH,
2009. →
[184] J. P. Tignol. Galois’ Theory of Algebraic Equations. World
Scientific Publishing Company, 2016. →
[185] Mary Tiles. Mathematics: the language of science? Monist,
67(1):3–17, 1984. →
[186] N. Trinajstić. Chemical Graph Theory. CRC Press, Boca Raton,
FL, USA, 1992. a, b, c
[187] S. Tripathi, M. Dehmer, and F. Emmert-Streib. NetBioV: an
R package for visualizing large-scale data in network biology.
Bioinformatics, 384, 2014. a, b
[188] S. B. Trust. Role of Mathematics in the Rise of Science.
Princeton Legacy Library. Princeton University Press, 2014. →
[189] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, New
York, 1977. →
[190] V. van Noort, B. Snel, and M. A. Huynen. The yeast
coexpression network has a small-world, scale-free architecture
and can be explained by a simple model. EMBO Rep., 5(3):280–
284, 2004. →
[191] V. Vapnik. Statistical Learning Theory. J. Wiley, 1998. →
[192] Vladimir Naumovich Vapnik. The Nature of Statistical
Learning Theory. Springer, 1995. →
[193] V. Volterra. Variations and fluctuations of the number of
individuals in animal species living together. In R. N. Chapman,
editor, Animal Ecology, McGraw–Hill, 1931. →
[194] J. von Neumann. The Theory of Self-Reproducing Automata.
University of Illinois Press, Urbana, 1966. →
[195] Andreas Wagner and David A. Fell. The small world inside
large metabolic networks. Proc. R. Soc. Lond. B, Biol. Sci.,
268(1478):1803–1810, 2001. →
[196] J. Wang and G. Provan. Characterizing the structural
complexity of real-world complex networks. In J. Zhou, editor,
Complex Sciences, volume 4 of Lecture Notes of the Institute for
Computer Sciences, Social Informatics and Telecommunications
Engineering, pages 1178–1189. Springer, Berlin/Heidelberg,
Germany, 2009. →
[197] S. Wasserman and K. Faust. Social Network Analysis:
Methods and Applications. Structural Analysis in the Social
Sciences. Cambridge University Press, 1994. a, b, c, d, e, f, g
[198] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-
world’ networks. Nature, 393:440–442, 1998. a, b, c, d
[199] A. Weil. Basic Number Theory. Springer, 2005. →
[200] Hadley Wickham. ggplot2: Elegant Graphics for Data
Analysis. Springer, 2016. →
[201] Hadley Wickham. Advanced R. Chapman and Hall/CRC; 2nd
edition, 2019. →
[202] R. Wilhelm and D. Maurer. Übersetzerbau: Theorie,
Konstruktion, Generierung. Springer, 1997. a, b, c, d
[203] Thomas Wilhelm, Heinz-Peter Nasheuer, and Sui Huang.
Physical and functional modularity of the protein network in
yeast. Mol. Cell. Proteomics, 2(5):292–298, 2003. →
[204] Leland Wilkinson. The grammar of graphics. In Handbook
of Computational Statistics, pages 375–414. Springer, 2012. →
[205] S. Wolfram. Statistical mechanics of cellular automata.
Rev. Mod. Phys., 55(3):601–644, 1983. →
[206] S. Wolfram. A New Kind of Science. Wolfram Media, 2002. →
[207] J. A. Wright, M. H. Wright, P. Lagarias, and J. C. Reeds.
Convergence properties of the Nelder–Mead simplex algorithm
in low dimensions. SIAM J. Optim., 9:112–147, 1998. →
Subject Index
A
adjacency matrix 1
algorithm 1
analysis 1
antiderivative 1
Asynchronous Random Boolean Networks 1
attractor 1
attractors 1
aviation network 1

B
bar plot 1
basic programming 1
basin of the attractor 1
Bayesian networks 1
Bayes’ theorem 1
Bernoulli distribution 1
Beta distribution 1
betweenness centrality 1
bifurcation 1
bifurcation point 1
binomial coefficient 1
Binomial distribution 1
bivariate distribution 1
boolean functions 1
Boolean logic 1
boolean value 1
Boundary Value ODE 1
Boundary Value ODE problem 1
breadth-first search 1
byte code compilation 1

C
Cartesian space 1
Cauchy–Schwarz inequality 1
cellular automata 1
centrality 1
central limit theorem 1, 2
chaotic behavior 1
character string 1
Chebyshev inequality 1
Chernoff bounds 1
Chi-square distribution 1
Cholesky factorization 1
Classical Random Boolean Networks 1
closeness centrality 1
clustering coefficient 1
cobweb graph 1
codomain of a function 1
complexity 1
complex number 1
computability 1
concentration inequalities 1
conditional entropy 1
conditional probability 1, 2
conjugate gradient 1
constrained optimization 1
constraints 1
continuous distributions 1
contour plot 1
coordinate systems 1
correlation 1
covariance 1
Cramer’s method 1
critical point 1
cross product 1
cumulative distribution function 1
curvature 1

D
data science 1
data structures 1
decision variables 1
definite integral 1
degree 1
degree centrality 1
degree distribution 1
De Morgan’s laws 1
density plot 1
dependency structure 1
depth-first search 1
derivative 1
derivative-free methods 1
determinant 1
Deterministic Asynchronous Random Boolean Networks 1
Deterministic Generalized Asynchronous Random Boolean
Networks 1
diameter 1
differentiable 1
differential equations 1
differentiation 1
directed acyclic graph 1
directed network 1
Dirichlet conditions 1
discrete distributions 1
distance 1
distance matrix 1
distribution function 1
domain of a function 1
dot plot 1
dot product 1
dynamical system 1
dynamical systems 1

E
eccentricity 1
economic cost function 1
edge 1
Eigenvalues 1
Eigenvectors 1
elliptic PDE 1, 2
entropy 1
error handling 1
Euclidean norm 1
Euclidean space 1
Euler equations 1
exception handling 1
expectation value 1
exponential distribution 1
extrema 1

F
First fundamental Theorem of calculus 1
first-order PDE 1
fixed point 1
Fletcher–Reeves 1
fractal 1
functional programming 1

G
Gamma distribution 1
Generalized Asynchronous Random Boolean Networks 1
generalized tree 1
global maximum 1
global minimum 1
global network layout 1
gradient 1, 2
gradient-based algorithms 1
graph 1
graph algorithms 1
graphical models 1
graph measure 1

H
Hadamard 1
heat equation 1
Hessian 1, 2
Hestenes–Stiefel 1
histogram 1
Hoeffding’s inequality 1
hyperbolic PDE 1, 2

I
image plot 1
imperative programming 1
indefinite integral 1
information flow 1
information theory 1
Initial Value ODE 1
Initial Value ODE problem 1
integral 1
Intermediate value theorem 1

J
Jacobian 1
joint probability 1

K
Kolmogorov 1
Kullback–Leibler divergence 1

L
Lagrange multiplier 1
Lagrange polynomial 1, 2
law of large numbers 1
law of total probability 1
layered network layout 1
likelihood 1
likelihood function 1
limes 1
limiting value 1, 2
linear algebra 1
linear optimization 1
linux 1
local maximum 1
local minimum 1
logical statement 1
logic programming 1
logistic map 1
Log-normal distribution 1
Lotka–Volterra equations 1
LU factorization 1

M
Maclaurin series 1
Markov inequality 1
matrices 1
matrix factorization 1
matrix norms 1
maximization 1
maximum likelihood estimation 1
minimization 1
mixed product 1
modular network layout 1
moment 1
multi-valued function 1
multivariate distribution 1
mutual information 1

N
Negative binomial distribution 1
Nelder-Mead method 1
NetBioV 1
network 1
network visualization 1
Neumann conditions 1
Newton’s method 1
node 1
non-linear constrained optimization 1
normal distribution 1
numerical integration 1

O
objective-function 1
Object-oriented programming 1
operations with matrices 1
optimization 1
orbit 1
ordinary differential equations – ODE 1
orthogonal unit vectors 1
over-determined linear system 1

P
package 1
parabolic PDE 1, 2
partial derivative 1, 2
partial differential equations – PDE 1
path 1
Pearson’s correlation coefficient 1
periodic behavior 1
periodic point 1
pie chart 1
plot 1
p-norm 1
Poisson distribution 1
Poisson’s equation 1
Polak–Ribière–Polyak 1
polynomial interpolation 1
posterior 1
predator–prey system 1
prior 1
probability 1
programming languages 1
programming paradigm 1

Q
QR factorization 1
Qualitative techniques 1
quantitative method 1

R
radius 1
Random Boolean network 1
random networks 1
random variables 1
rank of a matrix 1
reading data 1
real sequences 1
repositories 1
Riemann sum 1
Robin conditions 1
root finding 1
rug plot 1
Rules of de Morgan 1

S
sample space 1
scalar product 1
scale-free networks 1
scatterplot 1
scope of variables 1
search direction 1
Second fundamental Theorem of calculus 1
second-order PDE 1
self-similarity 1
sequence 1
set operations 1, 2
sets 1
Sherman–Morrison–Woodbury formula 1
shortest path 1
Sierpiński’s carpet 1
Simulated Annealing 1
singular value decomposition – SVD 1
small-world networks 1
sorting vectors 1
spanning trees 1
special matrices 1
stable fixed point 1
stable point 1
standard deviation 1
standard error 1
stationary point 1
statistical machine learning 1
steepest descent 1
strip plot 1
Student’s t-distribution 1
Support Vector Machine 1
systems of linear equations 1

T
Taylor series expansion 1
trace 1
transportation networks 1
tree 1
triangular linear system 1
Turing completeness 1
Turing machines 1

U
ubuntu 1
unconstrained optimization 1
uncontrollable variables 1
under-determined linear system 1
undirected network 1
uniform distribution 1
useful commands 1

V
variance 1
vector decomposition 1
vector projection 1
vector reflection 1
vector rotation 1
vectors 1
vector scaling 1
vector sum 1
vector translation 1
Venn diagram 1

W
walk 1
wave equation 1
Weibull distribution 1
weighted network 1
well-determined linear system 1
writing data 1
writing functions 1
Notes
1 Vi is a very simple yet powerful and fast editor used on
Unix or Linux computers.
1 Note that the direction of the vector $\vec{u}$ for the cross
product $\vec{V} \times \vec{W}$ is determined by the right-hand rule,
i.e., it is given by the direction of the right-hand thumb
when the other four fingers are rotated from $\vec{V}$ to $\vec{W}$.
1 When we speak about a random sample, we mean an independent and
identically distributed (iid) sample.
