The Data Visualization Workshop
A self-paced, practical approach
to transforming your complex data
into compelling, captivating graphics
All rights reserved. No part of this course may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy
of the information presented. However, the information contained in this course
is sold without warranty, either express or implied. Neither the authors, nor Packt
Publishing, nor its dealers and distributors will be held liable for any damages caused
or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this course by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
Reviewers: Rohan Chikorde, Joshua Görner, Anshu Kumar, Piotr Malak, Ashish Pratik
Patil, Narinder Kaur Saini, and Ankit Verma
Acquisitions Editors: Manuraj Nair, Royluis Rodrigues, Kunal Sawant, Sneha Shinde,
Anindya Sil, and Karan Wadekar
Editorial Board: Megan Carlisle, Samuel Christa, Mahesh Dhyani, Heather Gopsill,
Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Dominic Pereira,
Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur,
Nitesh Thakur, and Jonathan Wray
ISBN: 978-1-80056-884-6
Birmingham B3 2PB, UK
Table of Contents
Preface
Introduction
Introduction to Data Visualization
The Importance of Data Visualization
Data Wrangling
Tools and Libraries for Visualization
Overview of Statistics
Measures of Central Tendency
Measures of Dispersion
Correlation
Types of Data
Summary Statistics
NumPy
Exercise 1.01: Loading a Sample Dataset and Calculating the Mean Using NumPy
Activity 1.01: Using NumPy to Compute the Mean, Median, Variance, and Standard Deviation of a Dataset
Basic NumPy Operations
Indexing
Slicing
Splitting
Iterating
Exercise 1.02: Indexing, Slicing, Splitting, and Iterating
Advanced NumPy Operations
Filtering
Sorting
Combining
Reshaping
Slicing
Iterating
Series
Sorting
Reshaping
Introduction
Comparison Plots
Line Chart
Uses
Example
Design Practices
Examples
Design Practices
Examples
Design Practices
Examples
Design Practices
Examples
Design Practices
Correlogram
Examples
Design Practices
Examples
Design Practice
Activity 2.02: Road Accidents Occurring over Two Decades
Composition Plots
Pie Chart
Use
Examples
Design Practices
Design Practice
Examples
Design Practices
Examples
Design Practice
Example
Design Practice
Example
Design Practice
Example
Design Practice
Examples
Examples
Design Practice
Activity 2.04: Frequency of Trains during Different Time Intervals
Geoplots
Dot Map
Use
Example
Design Practices
Design Practices
Examples
Design Practices
Appendix
Index
Preface
The Data Visualization Workshop will guide you through the world of data visualization
and help you to unlock simple secrets for transforming data into meaningful visuals
with the help of exciting exercises and activities.
Starting with an introduction to data visualization, this book shows you how to
first prepare raw data for visualization using NumPy and pandas operations. As
you progress, you'll use plotting techniques, such as comparison and distribution,
to identify relationships and similarities between datasets. You'll then work
through practical exercises to simplify the process of creating visualizations using
Python plotting libraries such as Matplotlib and Seaborn. If you've ever wondered
how popular companies like Uber and Airbnb use geoplotlib for geographical
visualizations, this book has got you covered, helping you analyze and understand
the process effectively. Finally, you'll use the Bokeh library to create dynamic
visualizations that can be integrated into any web page.
By the end of this workshop, you'll have learned how to present engaging
mission-critical insights by creating impactful visualizations with real-world data.
Audience
The Data Visualization Workshop is for beginners who want to learn data visualization,
as well as developers and data scientists who are looking to enrich their practical data
science skills. Prior knowledge of data analytics, data science, and visualization is not
mandatory. Knowledge of Python basics and high-school-level math will help you
grasp the concepts covered in this data visualization book more quickly
and effectively.
Chapter 2, All You Need to Know about Plots, will explain the design practices for certain
plots. You will design attractive, tangible visualizations and learn to identify the best
plot type for a given dataset and scenario.
About the Book | iii
Chapter 3, A Deep Dive into Matplotlib, will teach you the fundamentals of Matplotlib
and how to create visualizations using the built-in plots that are provided by the
library. You will also practice how to customize your visualization plots and write
mathematical expressions using TeX.
Chapter 5, Plotting Geospatial Data, will teach you how to utilize Geoplotlib to create
stunning geographical visualizations, identify the different types of geospatial charts,
and create complex visualizations using tile providers and custom layers.
Chapter 6, Making Things Interactive with Bokeh, will introduce Bokeh, which is used
to create insightful web-based visualizations that can be extended into beautiful,
interactive visualizations that can easily be integrated into your web page.
Chapter 7, Combining What We Have Learned, will apply all the concepts learned
in the previous chapters, using three new datasets in combination
with practical activities for Matplotlib, Seaborn, Geoplotlib, and Bokeh.
Conventions
Code words in text, database table names, folder names, filenames, file extensions,
path names, dummy URLs, user input, and Twitter handles are shown as follows:
"Note that by simply passing the axis parameter in the np.mean() call, we can
define the dimension our data will be aggregated on. axis=0 aggregates down the rows
(one value per column) and axis=1 aggregates across the columns (one value per row)."
Words that you see on the screen (for example, in menus or dialog boxes) appear in
the same format.
"In this book, you will learn how to use Python in combination with various libraries,
such as NumPy, pandas, Matplotlib, Seaborn, and geoplotlib, to create impactful
data visualizations using real-world data."
Code Presentation
Lines of code that span multiple lines are split using a backslash ( \ ). When the code
is executed, Python will ignore the backslash, and treat the code on the next line as a
direct continuation of the current line.
For example:
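The following is a minimal sketch of the backslash convention. The readings list and the variable names are hypothetical and serve only to show how a long expression can be split across two lines.

```python
# Hypothetical readings, used only to illustrate line continuation
readings = [4, 8, 15, 16, 23, 42]

# The backslash tells Python that the expression continues on the next line
total = readings[0] + readings[1] + readings[2] \
        + readings[3] + readings[4] + readings[5]

# The same expression written on a single line gives the same result
same_total = readings[0] + readings[1] + readings[2] + readings[3] + readings[4] + readings[5]
```

Both variables hold the same value, because Python joins the two physical lines before evaluating the expression.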
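Comments are added into code to help explain specific bits of logic. Single-line
comments are denoted using the # symbol, and multi-line comments are enclosed
in triple quotes, as follows:
import random
import numpy as np
"""
Define a seed for the random number generator to ensure the
result will be reproducible
"""
seed = 1
np.random.seed(seed)
# Assumes random is Python's standard-library module, which seeds with seed()
random.seed(seed)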
Installing Python
The following section will help you to install Python on Windows, macOS, and
Linux systems.
3. Ensure that you install a version relevant to the architecture of your system
(either 32-bit or 64-bit). You can find out this information in the System
Properties window of your OS.
4. After you download the installer, simply double-click on the file and follow the
on-screen instructions.
1. Open Command Prompt and verify that Python 3 is not already installed by
running python3 --version.
3. Alternatively, you can install Python with the Anaconda Linux distribution by
downloading the installer from https://www.anaconda.com/distribution/#linux and
following the instructions.
1. Open the Terminal for Mac by pressing CMD + Spacebar, type terminal in the
open search box, and hit Enter.
Installing Libraries
pip comes pre-installed with Anaconda. Once Anaconda is installed on your
machine, all the required libraries can be installed using pip, for example, pip
install numpy. Alternatively, you can install all the required libraries using pip
install -r requirements.txt. You can find the requirements.txt file at
https://packt.live/3dgg8Hv.
2. You can either download it using GitHub Desktop or as a zipped folder by clicking
on the green Clone or download button.
3. You can open a Jupyter Notebook using the Anaconda Navigator by clicking the
Launch button under the Jupyter Notebook icon.
4. You can also open a Jupyter Notebook using the Anaconda Prompt. To do this,
open the Anaconda Prompt and run the following command:
jupyter notebook
5. Once you have launched Jupyter Notebook, a list of all files and folders will be
presented. You can open the Jupyter Notebook file you wish to work with by
simply double-clicking it.
1. To import libraries such as NumPy and pandas, run the following code. This will
import the whole numpy library into your current file:
import numpy # import numpy
2. In the first cells of the exercises and activities of this book, you will see the
following code. It lets us use np instead of numpy in our code to call methods
from numpy:
import numpy as np # import numpy and assign alias np
3. To import only part of a library, name the members you need:
from numpy import mean # only import the mean method of numpy
We've tried to support interactive versions of all activities and exercises, but we
recommend a local installation as well for instances where this support isn't available.
Introduction
Unlike machines, people are usually not equipped to interpret large amounts of
information from raw numbers and text. Of all our cognitive capabilities, we
understand things best through the visual processing of information. When data
is represented visually, the probability of understanding complex structures and
numbers increases.
Python has recently emerged as a programming language that performs well for
data analysis. It has applications across data science pipelines: its libraries convert
data into a usable format (such as pandas), analyze it (such as NumPy), and extract
useful conclusions from the data to represent it in a visually appealing manner (such as
Matplotlib or Bokeh). Python provides data visualization libraries that can help you
assemble graphical representations efficiently.
In this book, you will learn how to use Python in combination with various libraries,
such as NumPy, pandas, Matplotlib, Seaborn, and geoplotlib, to create impactful
data visualizations using real-world data. Besides that, you will also learn about
the features of different types of charts and compare their advantages and
disadvantages. This will help you choose the chart type that's suited to visualizing
your data.
Once we understand the basics, we can cover more advanced concepts, such
as interactive visualizations and how Bokeh can be used to create animated
visualizations that tell a story. Upon completing this book, you will be able
to perform data wrangling, extract relevant information, and visualize your
findings descriptively.
Data Wrangling
Data wrangling is the process of transforming raw data into a suitable
representation for various tasks. It is the discipline of augmenting, cleaning, filtering,
standardizing, and enriching data in a way that allows it to be used in a downstream
task, which in our case is data visualization.
Look at the following data wrangling process flow diagram to understand how
accurate and actionable data can be obtained for business analysts to work on:
In relation to the preceding figure, the following steps explain the flow of the data
wrangling process:
3. The cleaned data is then transformed into graphs, from which findings can
be derived.
For example, employee engagement can be measured based on raw data gathered
from feedback surveys, employee tenure, exit interviews, one-on-one meetings,
and so on. This data is cleaned and made into graphs based on parameters such
as referrals, faith in leadership, and scope of promotions. The percentages, that is,
information derived from the graphs, help us reach our result, which is to determine
the measure of employee engagement.
However, Python is the most popular language in the industry. Its ease of use and the
speed at which you can manipulate and visualize data, combined with the availability
of a number of libraries, make Python the best choice for data visualization.
Note
MATLAB (https://www.mathworks.com/products/matlab.html), R (https://
www.r-project.org), and Tableau (https://www.tableau.com) are not part of
this book; we will only cover the relevant tools and libraries for Python.
Overview of Statistics
Statistics is a combination of the analysis, collection, interpretation, and
representation of numerical data. Probability is a measure of the likelihood that an
event will occur and is quantified as a number between 0 and 1.
A discrete probability distribution shows all the values that a random variable can
take, together with their probability. The following diagram illustrates an example of
a discrete probability distribution. If we have a six-sided die, we can roll each number
between 1 and 6. We have six events that can occur based on the number that's
rolled. There is an equal probability of rolling any of the numbers, and the individual
probability of any of the six events occurring is 1/6:
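The die example can be sketched in a few lines of Python. This is only an illustration; the names distribution and at_most_two are hypothetical, and exact fractions are used so that the probabilities add up to exactly 1.

```python
from fractions import Fraction

# Each face of a fair six-sided die has the same probability, 1/6
distribution = {face: Fraction(1, 6) for face in range(1, 7)}

# The probabilities of all possible events must add up to 1
total_probability = sum(distribution.values())

# Probability of rolling a 1 or a 2: add the individual probabilities
at_most_two = distribution[1] + distribution[2]
```

Because every value a discrete random variable can take is listed together with its probability, questions such as "how likely is a roll of at most 2?" reduce to adding entries of the table.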
Figure 1.4: Continuous probability distribution for the time taken to reach home
• Median: This is the middle value of the ordered dataset. If there is an even
number of observations, the median will be the average of the two middle
values. The median is less prone to outliers compared to the mean, where
outliers are values that differ markedly from the rest of the data.
• Mode: Our last measure of central tendency, the mode is defined as the most
frequent value. There may be more than one mode in cases where multiple
values are equally frequent.
For example, a die was rolled 10 times, and we got the following numbers: 4, 5, 4, 3, 4,
2, 1, 1, 2, and 1.
The mean is calculated by summing all the observed values and dividing the sum by
the number of observations: (4+5+4+3+4+2+1+1+2+1)/10=2.7.
To calculate the median, the die rolls have to be ordered according to their values.
The ordered values are as follows: 1, 1, 1, 2, 2, 3, 4, 4, 4, 5. Since we have an even
number of die rolls, we need to take the average of the two middle values. The
average of the two middle values is (2+3)/2=2.5.
The modes are 1 and 4 since they are the two most frequent events.
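The three measures for the die rolls above can be checked with Python's standard statistics module. This is a small sketch rather than the book's own code; statistics.multimode requires Python 3.8 or later, and the variable names are chosen for illustration.

```python
import statistics

# The ten die rolls from the example above
rolls = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

# Sum of all values divided by the number of observations
mean = statistics.mean(rolls)

# Average of the two middle values of the ordered rolls
median = statistics.median(rolls)

# multimode returns every value that shares the highest frequency
modes = sorted(statistics.multimode(rolls))
```

The results match the manual calculation: a mean of 2.7, a median of 2.5, and the two modes 1 and 4.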
Measures of Dispersion
Dispersion, also called variability, is the extent to which a probability distribution is
stretched or squeezed.
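• Variance: The variance is the expected value of the squared deviation from
the mean. It describes how far a set of numbers is spread out from their mean.
Variance is calculated as follows, where $\mu$ is the mean and $N$ is the number
of observations:
$$\mathrm{Var}(X) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$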
• Range: This is the difference between the largest and smallest values in
a dataset.
• Interquartile range: Also called the midspread or middle 50%, this is the
difference between the 75th and 25th percentiles, or between the upper and
lower quartiles.
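As a rough illustration, the measures of dispersion listed above can be computed with NumPy for the ten die rolls used earlier. The variable names are hypothetical. np.var divides by the number of observations by default, and np.percentile uses linear interpolation by default, so other quartile conventions may give slightly different values.

```python
import numpy as np

# The ten die rolls from the central tendency example
rolls = np.array([4, 5, 4, 3, 4, 2, 1, 1, 2, 1])

# Mean of the squared deviations from the mean (population variance)
variance = np.var(rolls)

# Difference between the largest and smallest values
value_range = rolls.max() - rolls.min()

# Difference between the 75th and 25th percentiles
q75, q25 = np.percentile(rolls, [75, 25])
interquartile_range = q75 - q25
```

For this data, the variance is 2.01, the range is 4, and the interquartile range is 2.75.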
Correlation
The measures we have discussed so far only considered single variables. In contrast,
correlation describes the statistical relationship between two variables:
Note
One thing you should be aware of is that correlation does not imply
causation. Correlation describes the relationship between two or more
variables, while causation describes how one event is caused by another.
For example, consider a scenario in which ice cream sales are correlated
with the number of drowning deaths. But that doesn't mean that ice
cream consumption causes drowning. There could be a third variable,
say temperature, that may be responsible for this correlation. Higher
temperatures may cause an increase in both ice cream sales and more
people engaging in swimming, which may be the real reason for the
increase in deaths due to drowning.
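The ice cream scenario can be sketched with NumPy's corrcoef function. All numbers below are made up for illustration only: both series are constructed to depend on temperature and not on each other, yet they still come out strongly correlated.

```python
import numpy as np

# Hypothetical daily temperatures in degrees Celsius
temperature = np.array([14.0, 17.0, 20.0, 23.0, 26.0, 29.0, 32.0])

# Both series are generated from temperature plus a small fixed disturbance
ice_cream_sales = 12.0 * temperature + np.array([5.0, -3.0, 4.0, -6.0, 2.0, -1.0, 3.0])
drownings = 0.4 * temperature + np.array([0.5, -0.2, 0.1, 0.4, -0.6, 0.3, -0.1])

# corrcoef returns the correlation matrix; [0, 1] is the pairwise coefficient
sales_vs_drownings = np.corrcoef(ice_cream_sales, drownings)[0, 1]
sales_vs_temperature = np.corrcoef(ice_cream_sales, temperature)[0, 1]
```

The coefficient between sales and drownings is close to 1 even though neither variable appears in the formula for the other, which is exactly the situation the note warns about.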
10 | The Importance of Data Visualization and Data Exploration
Example
Suppose you want to find a decent apartment to rent that is not too expensive
compared to other apartments you've found. The other apartments (all belonging to
the same locality) you found on a website are priced as follows: $700, $850, $1,500,
and $750 per month. Let's calculate some statistical measures to help us make
a decision:
As an exercise, you can try and calculate the variance as well. However, note that
compared with the other measures, the median value ($800) is a better statistical
measure in this case since it is less prone to outliers (the rent amount of $1,500).
Given that all apartments belong to the same locality, you can clearly see that the
apartment costing $1,500 is priced much higher than the other
apartments. A simple statistical analysis helped us to narrow down our choices.
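The measures for the four rents can be reproduced with a short NumPy sketch. The variable names are chosen for illustration, and np.std is used with its default of dividing by the number of observations.

```python
import numpy as np

# Monthly rents of the four apartments from the example
rents = np.array([700, 850, 1500, 750])

mean_rent = np.mean(rents)              # pulled upward by the $1,500 apartment
median_rent = np.median(rents)          # average of the two middle rents
std_rent = np.std(rents)                # population standard deviation
rent_range = rents.max() - rents.min()  # largest rent minus smallest rent
```

The mean is $950, the median is $800, the standard deviation is roughly $322, and the range is $800. The gap between the mean and the median already hints that a single expensive apartment is skewing the average.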
Types of Data
It is important to understand what kind of data you are dealing with so that you can
select both the right statistical measure and the right visualization. We categorize
data as categorical/qualitative and numerical/quantitative. Categorical data describes
characteristics, for example, the color of an object or a person's gender. We can
further divide categorical data into nominal and ordinal data. In contrast to nominal
data, ordinal data has an order.
Numerical data can be divided into discrete and continuous data. We speak of
discrete data if the data can only have certain values, whereas continuous data can
take any value (sometimes limited to a range).
Another aspect to consider is whether the data has a temporal domain – in other
words, is it bound to time or does it change over time? If the data is bound to a
location, it might be interesting to show the spatial relationship, so you should keep
that in mind as well. The following flowchart classifies the various data types:
Summary Statistics
In real-world applications, we often encounter enormous datasets. Therefore,
summary statistics are used to summarize important aspects of data. They
are necessary to communicate large amounts of information in a compact and
simple way.
We have already covered measures of central tendency and dispersion, which are
both summary statistics. It is important to know that measures of central tendency
show a center point in a set of data values, whereas measures of dispersion show
how much the data varies.
The following table gives an overview of which measure of central tendency is best
suited to a particular type of data:
Figure 1.8: Best suited measures of central tendency for different types of data
In the next section, we will learn about the NumPy library and implement a few
exercises using it.
NumPy
When handling data, we often need a way to work with multidimensional arrays.
As we discussed previously, we also have to apply some basic mathematical and
statistical operations on that data. This is exactly where NumPy positions itself. It
provides support for large n-dimensional arrays and has built-in support for many
high-level mathematical and statistical operations.
Note
Before NumPy, there was a library called Numeric. However, it's no longer
used, because NumPy's signature ndarray allows the performant handling
of large and high-dimensional matrices.
Ndarrays are the essence of NumPy. They are what makes it faster than using
Python's built-in lists. Unlike the built-in list data type, ndarrays provide a
strided view of memory (similar to, for example, int[] in Java). Since they are
homogeneously typed, meaning all the elements must be of the same type, the stride
is consistent, which results in less memory wastage and better access times.
The stride is the number of locations between the beginnings of two adjacent
elements in an array. It is normally measured in bytes or in units of the size of
the array elements. A stride can be larger than or equal to the size of the element, but not
smaller; otherwise, it would intersect the memory location of the next element.
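NumPy exposes the strides of an array directly, which makes the idea easy to inspect. The following sketch uses a hypothetical 3x4 array of 8-byte integers; the dtype is fixed explicitly so that the byte counts do not depend on the platform.

```python
import numpy as np

# Hypothetical 3x4 array of 64-bit integers stored row by row
matrix = np.arange(12, dtype=np.int64).reshape(3, 4)

# Each element occupies 8 bytes
element_size = matrix.itemsize

# Bytes to step to reach the next row and the next column, respectively
row_stride, column_stride = matrix.strides

# Taking every second column gives a view whose column stride is doubled
every_other_column = matrix[:, ::2]
view_strides = every_other_column.strides
```

Here the column stride equals the element size (8 bytes) and the row stride is 32 bytes, that is, four elements. The sliced view reuses the same memory and only changes its column stride to 16 bytes, which is why slicing an ndarray does not copy data.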
Note
Remember that NumPy arrays have a defined data type. This means you
are not able to insert strings into an integer type array. NumPy is mostly
used with double-precision data types.
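The effect of the fixed data type can be seen in a small sketch. The array names are hypothetical, and the integer type is set explicitly because NumPy's default integer size depends on the platform.

```python
import numpy as np

# Hypothetical integer array with an explicitly defined data type
counts = np.array([1, 2, 3], dtype=np.int64)

# Assigning text that cannot be converted to an integer raises a ValueError
try:
    counts[0] = 'seven'
    error_message = None
except ValueError as error:
    error_message = str(error)

# Assigning a float does not fail, but the fractional part is discarded
counts[1] = 2.9

# Arrays created from Python floats use double precision (float64)
measurements = np.array([1.0, 2.5])
```

After running this, counts still has the int64 type and its second element is 2, not 2.9, so it is worth checking the dtype of an array before writing values into it.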
The following are some of the built-in methods that we will use in the exercises and
activities of this chapter.
mean
Note
The # symbol in the code snippet below denotes a code comment.
Comments are added into code to help explain specific bits of logic.
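A minimal sketch of np.mean follows. The small dataset array is hypothetical and stands in for the data loaded in the exercises.

```python
import numpy as np

# Hypothetical 3x4 dataset used only for illustration
dataset = np.array([[1.0, 2.0, 3.0, 4.0],
                    [2.0, 4.0, 6.0, 8.0],
                    [10.0, 20.0, 30.0, 40.0]])

# mean value of the first row
first_row_mean = np.mean(dataset[0])

# mean value of all the values in the dataset
overall_mean = np.mean(dataset)
```

The first call averages only the four values of the first row, whereas the second call averages all twelve values.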
median
Several of the mathematical operations have the same interface. This makes them
easy to interchange if necessary. The median, var, and std methods will be used in
the upcoming exercises and activities:
Note that we can index every element from the end of our dataset as we can from
the front by using reverse indexing. It's a simple way to get the last or several of
the last elements of a list. Instead of starting at [0] for the first element, reverse
indexing starts with dataset[-1] for the last element and then decreases until
dataset[-len(dataset)], which is the first element in the dataset.
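Reverse indexing combined with np.median might look like the following sketch. The dataset array is hypothetical and only illustrates the indexing.

```python
import numpy as np

# Hypothetical 3x4 dataset used only for illustration
dataset = np.array([[1.0, 2.0, 3.0, 4.0],
                    [2.0, 4.0, 6.0, 8.0],
                    [10.0, 20.0, 30.0, 40.0]])

# median of the last row
last_row_median = np.median(dataset[-1])

# median of all the values in the last two rows
last_two_rows_median = np.median(dataset[-2:])

# dataset[-len(dataset)] refers to the same row as dataset[0]
same_first_row = np.array_equal(dataset[-len(dataset)], dataset[0])
```

The last row has the median 25.0, and the eight values of the last two rows have the median 9.0.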
var
As we mentioned in the Overview of Statistics section, the variance describes how far
a set of numbers is spread out from their mean. We can calculate the variance using
the var method of NumPy:
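A short sketch of np.var on a hypothetical dataset follows. By default, np.var divides by the number of observations; passing ddof=1 gives the sample variance instead.

```python
import numpy as np

# Hypothetical 3x4 dataset used only for illustration
dataset = np.array([[1.0, 2.0, 3.0, 4.0],
                    [2.0, 4.0, 6.0, 8.0],
                    [10.0, 20.0, 30.0, 40.0]])

# variance of the first row (divides by the number of values)
first_row_variance = np.var(dataset[0])

# sample variance of the first row (divides by the number of values minus 1)
first_row_sample_variance = np.var(dataset[0], ddof=1)

# variance of each column, aggregated down the rows
column_variances = np.var(dataset, axis=0)
```

The first row has a variance of 1.25, and the axis argument works in the same way as it does for np.mean.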
std
One of the advantages of the standard deviation is that it is expressed in the
same unit as the data itself, unlike the variance, which is expressed in squared
units. The std method works just like the others:
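The relationship between np.std and np.var can be sketched as follows, again on a hypothetical dataset.

```python
import numpy as np

# Hypothetical 3x4 dataset used only for illustration
dataset = np.array([[1.0, 2.0, 3.0, 4.0],
                    [2.0, 4.0, 6.0, 8.0],
                    [10.0, 20.0, 30.0, 40.0]])

# standard deviation of the first row
first_row_std = np.std(dataset[0])

# the standard deviation is the square root of the variance
matches_variance = np.isclose(first_row_std, np.sqrt(np.var(dataset[0])))

# standard deviation of each row, aggregated across the columns
row_stds = np.std(dataset, axis=1)
```

Since the second row is the first row multiplied by 2, its standard deviation is also twice as large, which shows that the measure scales with the unit of the data.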
Now we will do an exercise to load a dataset and calculate the mean using
these methods.
Note
All the exercises and activities in this chapter will be developed in Jupyter
Notebooks. Please download the GitHub repository with all the prepared
templates from https://packt.live/31USkof. Make sure you have installed all
the libraries as mentioned in the preface.
Exercise 1.01: Loading a Sample Dataset and Calculating the Mean Using NumPy
In this exercise, we will be loading the normal_distribution.csv dataset and
calculating the mean of each row and each column in it:
1. Using Anaconda Navigator, launch either JupyterLab or Jupyter Notebook. In
the directory of your choice, create a Chapter01/Exercise1.01 folder.
import numpy as np
Note
The code snippet shown here uses a backslash ( \ ) to split the logic
across multiple lines. When the code is executed, Python will ignore the
backslash, and treat the code on the next line as a direct continuation of the
current line.
dataset = \
    np.genfromtxt('../../Datasets/normal_distribution.csv', \
                  delimiter=',')
Note
In the preceding snippet, and for the rest of the book, we will be using a
relative path to load the datasets. However, for the preceding code to work
as intended, you need to follow the folder arrangement shown at this
link: https://packt.live/3ftUu3P. Alternatively, you can also use the absolute
path; for example:
dataset = np.genfromtxt('C:/Datasets/normal_distribution.csv', delimiter=',')
If your Jupyter Notebook is saved in the same folder as the dataset, then you can
simply use the filename:
dataset = np.genfromtxt('normal_distribution.csv', delimiter=',')
The genfromtxt method helps load the data from a given text or .csv file. If
everything works as expected, the generation should run through without any
error or output.
Note
The numpy.genfromtxt method is less efficient than the pandas.read_csv
method. We shall refrain from going into the details of why this
is the case as this explanation is beyond the scope of this text.
5. Check the data you just imported by simply writing the name of the ndarray in
the next cell. Simply executing a cell that returns a value, such as an ndarray,
will use Jupyter formatting, which looks nice and, in most cases, displays more
information than using print:
6. Print the shape using the dataset.shape command to get a quick overview of
our dataset. This will give us output in the form (rows, columns):
dataset.shape
We can also refer to the rows as instances and the columns as features. This means
that our dataset has 24 instances and 8 features. The output of the preceding
code is as follows:
(24, 8)
7. Calculate the mean after loading and checking our dataset. The first row in
a NumPy array can be accessed by simply indexing it with zero; for example,
dataset[0]. As we mentioned previously, NumPy has some built-in functions
for calculations such as the mean. Call np.mean() and pass in the dataset's
first row to get the result:
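np.mean(dataset[0])
100.177647525
8. Calculate the mean of the first column in the same way, indexing it with
dataset[:, 0]:
np.mean(dataset[:, 0])
99.76743510416668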
9. Calculate the mean for every single row, aggregated in an array, using the axis
tools of NumPy. Note that by simply passing the axis parameter in the
np.mean() call, we can define the dimension our data will be aggregated on.
axis=0 aggregates down the rows and returns one value per column, while
axis=1 aggregates across the columns and returns one value per row. Get the
result for each row by using axis=1:
np.mean(dataset, axis=1)
To get the mean of each column instead, use axis=0:
np.mean(dataset, axis=0)
10. Calculate the mean of the whole matrix by calling np.mean() without an axis
argument, which averages over every value in the dataset:
np.mean(dataset)
100.16536917390624
Note
To access the source code for this specific section, please refer to
https://packt.live/30IkAMp.
You are already one step closer to using NumPy in combination with plotting libraries
and creating impactful visualizations. Since we've now covered the very basics and
calculated the mean, it's now up to you to solve the upcoming activity.
Activity 1.01: Using NumPy to Compute the Mean, Median, Variance, and Standard
Deviation of a Dataset
In this activity, we will use the skills we've learned to import datasets and perform
some basic calculations (mean, median, variance, and standard deviation) to complete
our tasks.
4. Load the dataset and calculate the mean of the third row. Access the third row
by using index 2, dataset[2].
5. Index the last element of an ndarray in the same way a regular Python list can be
accessed. dataset[:, -1] will give us the last element of every row.
6. Get a submatrix of the first three elements of each of the first three
rows by using the double-indexing mechanism of NumPy.
8. Use reverse indexing to define a range to get the last three columns. We can use
dataset[:, -3:] here.
9. Aggregate the values along an axis to calculate the rows. We can use
axis=1 here.
10. Calculate the variance for each column using axis 0.
11. Calculate the variance of the intersection of the last two rows and the first
two columns.
Note
The solution for this activity can be found on page 388.
You have now completed your first activity using NumPy. In the following activities,
this knowledge will be consolidated further.
Indexing
Indexing elements in a NumPy array, at a high level, works the same as with built-in
Python lists. Therefore, we can index elements in multi-dimensional matrices:
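As a minimal sketch of the indexing variants described here, the following snippet uses a small, hypothetical 3x4 array (not one of the book's datasets) so that every result can be checked by eye:

```python
import numpy as np

# Hypothetical 3x4 array standing in for a loaded dataset
dataset = np.arange(12).reshape(3, 4)

first_row = dataset[0]     # the whole first row
last_row = dataset[-1]     # negative indices count from the end
value_a = dataset[1][2]    # chained indexing, as with nested Python lists
value_b = dataset[1, 2]    # NumPy's combined (row, column) syntax
corner = dataset[-2, -1]   # last value of the second-to-last row

print(first_row, last_row, value_a, value_b, corner)
```

Both dataset[1][2] and dataset[1, 2] return the same element; the combined syntax is usually preferred with NumPy because it addresses the element in a single indexing operation.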
Slicing
Slicing has also been adapted from Python's lists. Being able to easily slice parts of
lists into new ndarrays is very helpful when handling large amounts of data:
# rows 1 and 2
dataset[1:3]
Splitting
Splitting data can be helpful in many situations, from plotting only half of your time-
series data to separating test and training data for machine learning algorithms.
There are two ways of splitting your data, horizontally and vertically. Horizontal
splitting can be done with the hsplit method. Vertical splitting can be done with the
vsplit method:
# split horizontally in 3 equal lists
np.hsplit(dataset, (3))
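To make the difference between the two split directions concrete, here is a small sketch on a hypothetical 4x6 array. It also shows what happens when the array cannot be divided evenly:

```python
import numpy as np

# Hypothetical array with 4 rows and 6 columns
dataset = np.arange(24).reshape(4, 6)

# hsplit cuts along the columns: three blocks of shape (4, 2)
hor_splits = np.hsplit(dataset, 3)

# vsplit cuts along the rows: two blocks of shape (2, 6)
ver_splits = np.vsplit(dataset, 2)

# 6 columns cannot be divided into 4 equal parts, so NumPy raises a ValueError
try:
    np.hsplit(dataset, 4)
    split_failed = False
except ValueError:
    split_failed = True

print(hor_splits[0].shape, ver_splits[0].shape, split_failed)
```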
Iterating
Iterating the NumPy data structures, ndarrays, is also possible. It steps over the
whole list of data one after another, visiting every single element in the ndarray once.
Considering that they can have several dimensions, indexing gets very complex.
The nditer is a multi-dimensional iterator object that iterates over a given number
of arrays:
The ndenumerate will give us exactly this index, thus returning (0, 1) for the second
value in the first row:
Note
The triple-quotes ( """ ) shown in the code snippet below are used to
denote the start and end points of a multi-line code comment. Comments
are added into code to help explain specific bits of logic.
"""
iterating over the whole dataset with indices matching the
position in the dataset
"""
for index, value in np.ndenumerate(dataset):
    print(index, value)
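The contrast between the two iterators can be sketched on a tiny, hypothetical 2x2 array: nditer yields only the values, whereas ndenumerate pairs each value with its multi-dimensional index.

```python
import numpy as np

# Hypothetical 2x2 array
dataset = np.array([[1.5, 2.5],
                    [3.5, 4.5]])

# nditer visits every element once, without any position information
flat_values = [float(x) for x in np.nditer(dataset)]

# ndenumerate yields (index, value) pairs, where index is a tuple
indexed = [(index, float(value)) for index, value in np.ndenumerate(dataset)]

print(flat_values)
print(indexed[1])  # the second value in the first row
```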
NumPy | 23
Note
You can obviously plot a distribution and show the spread of data, but here
we want to practice implementing the aforementioned operations using the
NumPy library.
Let's use the features of NumPy to index, slice, split, and iterate ndarrays.
Indexing
import numpy as np
dataset = np.genfromtxt('../../Datasets/'\
'normal_distribution_splittable.csv', \
delimiter=',')
Note
As mentioned in the previous exercise, here too we have used a relative
path to load the dataset. You can change the path depending on where you
have saved the Jupyter Notebook and the dataset.
24 | The Importance of Data Visualization and Data Exploration
Remember that we need to show that our dataset is closely distributed around
a mean of 100; that is, whatever value we wish to show/calculate should be
around 100. For this purpose, first we will calculate the mean of the values of the
second and the last row.
4. Use simple indexing for the second row, as we did in our first exercise. For a
clearer understanding, all the elements of the second row are saved to a variable
and then we calculate the mean of these elements:
second_row = dataset[1]
np.mean(second_row)
96.90038836444445
5. Now, reverse index the last row and calculate the mean of that row. Always
remember that providing a negative number as the index value will index the list
from the end:
last_row = dataset[-1]
np.mean(last_row)
100.18096645222221
From the outputs obtained in steps 4 and 5, we can say that these values indeed
are close to 100. To further convince our client, we will access the first value of
the first row and the last value of the second-to-last row.
6. Index the first value of the first row using the Python standard syntax of [0][0]:
first_val_first_row = dataset[0][0]
np.mean(first_val_first_row)
99.14931546
7. Use reverse indexing to access the last value of the second-to-last row (we want
to use the combined access syntax here). Remember that -1 means the
last element:
last_val_second_last_row = dataset[-2, -1]
np.mean(last_val_second_last_row)
101.2226037
Note
For steps 6 and 7, even if you had not used np.mean(), you would have
got the same values as presently shown. This is because the mean of a
single value will be the value itself. You can try the above steps with the
following code:
first_val_first_row = dataset[0][0]
first_val_first_row
last_val_second_last_row = dataset[-2, -1]
last_val_second_last_row
From all the preceding outputs, we can confidently say that the values we obtained
hover around a mean of 100. Next, we'll use slicing, splitting, and iterating to achieve
our goal.
Slicing
1. Create a 2x2 matrix that starts at the second row and second column using
[1:3, 1:3]:
"""
slicing an intersection of 4 elements (2x2) of the
second and third rows and the second and third columns
"""
subsection_2x2 = dataset[1:3, 1:3]
np.mean(subsection_2x2)
95.63393608250001
2. In this task, we want to have every other element of the fifth row. Provide
indexing of ::2 as our second element to get every second element of the
given row:
98.35235805800001
Introducing the second colon into the indexing allows us to add another layer
of complexity. The third value, the step, allows us to only select certain values
(such as every other element) by providing a value of 2. This means it skips the
values in between and only takes every second element from the given row.
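The start:stop:step notation can be tried out on a small array where the expected values are obvious. The following is only an illustrative sketch on a hypothetical 4x5 array, not the exercise dataset:

```python
import numpy as np

# Hypothetical 4x5 array; row 2 holds the values 10 to 14
dataset = np.arange(20).reshape(4, 5)

every_other = dataset[2, ::2]    # step of 2: every second element of the row
reversed_row = dataset[2, ::-1]  # step of -1: the row in reverse order
sub_2x2 = dataset[1:3, 1:3]      # rows 1-2 and columns 1-2 (zero-based)

print(every_other)
print(reversed_row)
print(sub_2x2)
```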
100.18096645222222
Splitting
1. Use the hsplit method to split our dataset into three equal parts:
hor_splits = np.hsplit(dataset,(3))
Note that if the dataset can't be split evenly into the given number of parts,
NumPy will raise an error.
2. Split the first third into two equal parts vertically. Use the vsplit method to
vertically split the dataset in half. It works like hsplit:
ver_splits = np.vsplit(hor_splits[0],(2))
3. Compare the shapes. We can see that the subset has the required half of the
rows and a third of the columns:
print("Dataset", dataset.shape)
print("Subset", ver_splits[0].shape)
Dataset (24, 9)
Subset (12, 3)
Iterating
curr_index = 0
for x in np.nditer(dataset):
    print(x, curr_index)
    curr_index += 1
Looking at the given piece of code, we can see that the index is simply
incremented with each element. This only works with one-dimensional data. If
we want to index multi-dimensional data, this won't work.
2. Use the ndenumerate method to iterate over the whole dataset. It provides
two positional values, index and value:
for index, value in np.ndenumerate(dataset):
    print(index, value)
Notice that all the output values we obtained are close to our mean value of 100.
Thus, we have successfully managed to convince our client using several NumPy
methods that our data is closely distributed around the mean value of 100.
Note
To access the source code for this specific section, please refer to
https://packt.live/2Neteuh.
We've already covered most of the basic data wrangling methods for NumPy. In the
next section, we'll take a look at more advanced features that will give you the tools
you need to get better at analyzing your data.
Filtering
Filtering is a very powerful tool that can be used to clean up your data if you want to
avoid outlier values.
If we only want to extract the indices of the values that match our given condition, we
can use the built-in where method. For example, np.where(dataset > 5) will
return the indices (as a tuple of arrays, one per dimension) of the values from the
initial dataset that are greater than 5:
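The three filtering approaches mentioned in this section (a Boolean condition in brackets, np.extract, and np.where) can be compared side by side. This is a minimal sketch on a hypothetical 2x3 array whose values were chosen by hand:

```python
import numpy as np

# Hypothetical values scattered around 100
dataset = np.array([[98.0, 106.5, 91.2],
                    [100.4, 93.7, 107.1]])

# Boolean mask in brackets: returns the matching values as a 1-D array
above_105 = dataset[dataset > 105]

# np.extract with a combined condition (note the parentheses around each part)
between_90_and_95 = np.extract((dataset > 90) & (dataset < 95), dataset)

# np.where returns one index array per dimension
rows, cols = np.where(np.abs(dataset - 100) < 1)
close_indices = list(zip(rows.tolist(), cols.tolist()))

print(above_105, between_90_and_95, close_indices)
```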
Sorting
Sorting each row of a dataset can be really useful. Using NumPy, we are also able to
sort on other dimensions, such as columns.
In addition, argsort gives us the possibility to get a list of indices, which would
result in a sorted list:
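As a short illustration of sorting along different axes and of argsort, consider the following sketch on a hypothetical 2x3 array:

```python
import numpy as np

# Hypothetical 2x3 array
dataset = np.array([[3, 8, 2],
                    [9, 1, 7]])

row_sorted = np.sort(dataset)          # default axis=-1: sorts within each row
col_sorted = np.sort(dataset, axis=0)  # sorts within each column

# argsort returns the indices that would sort the first row
order = np.argsort(dataset[0])
first_row_sorted = dataset[0][order]   # fancy indexing with those indices

print(row_sorted)
print(col_sorted)
print(order, first_row_sorted)
```

Note that np.sort returns a sorted copy, so the original dataset keeps its order.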
Combining
Stacking rows and columns onto an existing dataset can be helpful when you have
two datasets of the same dimension saved to different files.
If we use hstack, we stack our datasets "next to each other," meaning that the
elements from the first row of dataset_1 will be followed by the elements of the
first row of dataset_2. This will be applied to each row:
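The following sketch shows hstack next to its counterpart, vstack, on two small, hypothetical arrays named dataset_1 and dataset_2 as in the text above:

```python
import numpy as np

# Two hypothetical datasets of the same shape
dataset_1 = np.array([[1, 2],
                      [3, 4]])
dataset_2 = np.array([[5, 6],
                      [7, 8]])

# hstack: rows of dataset_2 are appended to the rows of dataset_1
side_by_side = np.hstack((dataset_1, dataset_2))

# vstack: dataset_2 is placed below dataset_1
stacked = np.vstack((dataset_1, dataset_2))

# Splitting and stacking are inverse operations
restored = np.hstack(np.hsplit(side_by_side, 2))

print(side_by_side)
print(stacked)
```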
Reshaping
Reshaping can be crucial for some algorithms. Depending on the nature of your data,
it might help you to reduce dimensionality to make visualization easier:
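A minimal sketch of reshape on a hypothetical 3x4 array follows. Passing -1 for one dimension asks NumPy to infer its size from the total number of elements:

```python
import numpy as np

# Hypothetical 3x4 array with 12 elements
dataset = np.arange(12).reshape(3, 4)

single_row = dataset.reshape(1, 12)  # one row holding all 12 values
inferred = dataset.reshape(-1, 2)    # 2 columns; NumPy infers 6 rows
flat = dataset.reshape(-1)           # a flat, one-dimensional array

print(single_row.shape, inferred.shape, flat.shape)
```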
import numpy as np
dataset = np.genfromtxt('../../Datasets/'\
'normal_distribution_splittable.csv', \
delimiter=',')
dataset
Note
For ease of presentation, we have shown only a part of the output.
Filtering
1. Get values greater than 105 by supplying the condition > 105 in the brackets:
You can see in the preceding figure that all the values in the output are greater
than 105.
2. Extract the values of our dataset that are between the values 90 and 95. To
use more complex conditions, we might want to use the extract method
of NumPy:
The preceding output clearly shows that only values lying between 90 and 95
are printed.
3. Use the where method to get the indices of values that have a delta of less than
1; that is, the absolute value of [individual value] – 100 should be less than 1. Use
those indices (row, col) in a list comprehension and print them out:
rows, cols = np.where(abs(dataset - 100) < 1)
one_away_indices = [[rows[index], \
cols[index]] for (index, _) \
in np.ndenumerate(rows)]
one_away_indices
The where method from NumPy allows us to get indices (rows, cols)
for each of the matching values.
Figure 1.17: Indices of the values that have a delta of less than 1
Let us confirm that we indeed obtained the right indices. The first pair of indices,
0,0, refers to the very first value in the output shown in Figure 1.14. Indeed, this is
the correct value, as abs(99.14931546 – 100) < 1. We can quickly check
this for a couple more values and conclude that the code has worked
as intended.
Note
List comprehensions are Python's way of mapping over data. They're a
handy notation for creating a new list with some operation applied to every
element of the old list.
For example, if we want to square the value of every element in this list,
numbers = [1, 2, 3, 4, 5], we would use a list comprehension like
this: squared_list = [x*x for x in numbers]. This gives us the
following list: [1, 4, 9, 16, 25]. To get a better understanding
of list comprehensions, please visit
https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions.
Sorting
row_sorted = np.sort(dataset)
row_sorted
Compare the preceding output with that in Figure 1.14. What do you observe?
The values along the rows have been sorted in an ascending order as expected.
2. With multi-dimensional data, we can use the axis parameter to define which
dimension should be sorted. Use axis 0 to sort the values by column:
np.sort(dataset, axis=0)
3. Create a sorted index list and use fancy indexing to get access to sorted
elements easily. To keep the order of our dataset and obtain only the values of a
sorted dataset, we will use argsort:
index_sorted = np.argsort(dataset[0])
dataset[0][index_sorted]
As can be seen from the preceding output, we have obtained the first row with
sorted values.
Combining
4. Use the combining features to add the second half of the first column back
together, add the second column to our combined dataset, and add the third
column to our combined dataset.
halfed_first[0]
After stacking the second half of our split dataset, we have one-third of our initial
dataset stacked together again. Now, we want to add the other two remaining
datasets to our first_col dataset.
6. Use the hstack method to combine our already combined first_col with
the second of the three split datasets:
A truncated version of the output resulting from the preceding code is as follows:
7. Use hstack to combine the last one-third column with our dataset. This is the
same thing we did with our second-third column in the previous step:
A truncated version of the output resulting from the preceding code is as follows:
Reshaping
1. Reshape our dataset into a single list using the reshape method:
A truncated version of the output resulting from the preceding code is as follows:
2. Provide a -1 for the dimension. This tells NumPy to figure the dimension
out itself:
A truncated version of the output resulting from the preceding code is as follows:
Note
To access the source code for this specific section, please refer to
https://packt.live/2YD4AZn.
You have now used many of the basic operations that are needed so that you
can analyze a dataset. Next, we will be learning about pandas, which will provide
several advantages when working with data that is more complex than simple
multi-dimensional numerical data. pandas also supports different data types in
datasets, meaning that we can have columns that hold strings and others that
have numbers.
NumPy, as you've seen, has some powerful tools. Some of them are even more
powerful when combined with pandas DataFrames.
pandas
The pandas Python library provides data structures and methods for manipulating
different types of data, such as numerical and temporal data. These operations are
easy to use and highly optimized for performance.
Data formats, such as CSV and JSON, and databases can be used to create
DataFrames. DataFrames are the internal representations of data and are very
similar to tables but are more powerful since they allow you to efficiently apply
operations such as multiplications, aggregations, and even joins. Importing and
reading both files and in-memory data is abstracted into a user-friendly interface.
When it comes to handling missing data, pandas provides built-in solutions to clean up
and augment your data, meaning it fills in missing values with reasonable values.
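To give a feeling for what handling missing data can look like, here is a minimal sketch. The DataFrame and its column names (density, code) are made up for illustration, and filling with the column mean is just one possible choice of a "reasonable" value:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with one missing value
df = pd.DataFrame({"density": [10.0, np.nan, 30.0],
                   "code": ["A", "B", "C"]})

# Count the missing values in a column
missing_count = int(df["density"].isna().sum())

# Fill the gap with the column mean (the mean skips missing values by default)
filled = df.fillna({"density": df["density"].mean()})

# Alternatively, drop every row that contains a missing value
dropped = df.dropna()

print(missing_count)
print(filled)
print(dropped)
```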
Integrated indexing and label-based slicing in combination with fancy indexing (what
we already saw with NumPy) make handling data simple. More complex techniques,
such as reshaping, pivoting, and melting data, together with the possibility of easily
joining and merging data, provide powerful tooling so that you can handle your
data correctly.
If you're working with time-series data, operations such as date range generation,
frequency conversion, and moving window statistics can provide an advanced
interface for data wrangling.
Note
The installation instructions for pandas can be found here:
https://pandas.pydata.org/. At the time of writing, the latest version is v0.25.3
(used in this book); however, any v0.25.x release should be suitable.
pandas | 45
• Less intuition needed: Many methods, such as joining, selecting, and loading
files, can be used without much prior intuition and without giving up much of the
powerful nature of pandas.
• Easy DataFrame design: DataFrames are designed for operations with and on
large datasets.
Disadvantages of pandas
The following are some of the disadvantages of pandas:
• Less applicable: Due to its higher abstraction, it's generally less applicable than
NumPy. Especially when used outside of its scope, operations can get complex.
• More memory: Due to the internal representation of DataFrames and the way
pandas trades memory for a more performant execution, the memory usage
of complex operations can spike.
• Hidden complexity: Less experienced users often tend to overuse methods and
execute them several times instead of reusing what they've already calculated.
This hidden complexity makes users think that the operations themselves are
simple, which is not the case.
Note
Always try to think about how to design your workflows instead of
excessively using operations.
Now, we will do an exercise to load a dataset and calculate the mean using pandas.
Exercise 1.04: Loading a Sample Dataset and Calculating the Mean Using pandas
In this exercise, we will be loading the world_population.csv dataset and
calculating the mean of some rows and columns. Our dataset holds the yearly
population density for every country. Let's use pandas to perform this exercise:
import pandas as pd
4. Now, check the data you just imported by simply writing the name of the dataset
in the next cell. pandas uses a data structure called DataFrames. Print some of
the rows. To avoid filling the screen, use the pandas head() method:
dataset.head()
Both head() and tail() let you provide a number, n, as a parameter, which
describes how many rows should be returned.
Note
Simply executing a cell that returns a value such as a DataFrame will use
Jupyter formatting, which looks nicer and, in most cases, displays more
information than using print.
5. Print out the shape of the dataset to get a quick overview using the
dataset.shape command. This works the same as it does with NumPy ndarrays. It
will give us the output in the form (rows, columns):
dataset.shape
(264, 60)
6. Index the column with the year 1961. pandas DataFrames and Series have
built-in methods for calculations, such as the mean. This means we can simply
call mean() on the selected column to get the result:
dataset["1961"].mean()
The printed output should look as follows:
176.91514132840555
7. Check the difference in population density over the years by repeating the
previous step with the column for the year 2015 (the mean population density
more than doubled in the given time range):
dataset["2015"].mean()
368.70660104001837
8. To get the mean for every single country (row), we can make use of pandas axis
tools. Use the mean() method on the dataset on axis=1, meaning all the rows,
and return the first 10 rows using the head() method:
dataset.mean(axis=1).head(10)
9. Get the mean for each column and return the last 10 entries:
dataset.mean(axis=0).tail(10)
Since pandas DataFrames can have different data types in each column,
aggregating this value on the whole dataset out of the box makes no sense. By
default, axis=0 will be used, which means that this will give us the same result
as the cell prior to this.
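The effect of the axis parameter on a DataFrame with mixed column types can be sketched as follows. The tiny DataFrame and its labels are hypothetical. Passing numeric_only=True is an addition that is not part of the exercise; it explicitly restricts the calculation to numeric columns, which is worth considering because newer pandas releases may raise an error when asked to average text columns:

```python
import pandas as pd

# Hypothetical excerpt: one text column and two numeric year columns
df = pd.DataFrame(
    {"code": ["AAA", "BBB"], "1961": [10.0, 30.0], "2015": [20.0, 60.0]},
    index=["Country A", "Country B"],
)

# axis=1: one mean per row (per country)
row_means = df.mean(axis=1, numeric_only=True)

# axis=0 (the default): one mean per column (per year)
col_means = df.mean(axis=0, numeric_only=True)

print(row_means)
print(col_means)
```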
Note
To access the source code for this specific section, please refer to
https://packt.live/37z3Us1.
We've now seen that the interface of pandas has some similar methods to NumPy,
which makes it really easy to understand. We have now covered the very basics,
which will help you solve the first exercise using pandas. In the following exercise,
you will consolidate your basic knowledge of pandas and use the methods you just
learned to solve several computational tasks.
Exercise 1.05: Using pandas to Compute the Mean, Median, and Variance of a
Dataset
In this exercise, we will take the previously learned skills of importing datasets and
basic calculations and apply them to solve the tasks of our first exercise using pandas.
Let's use pandas features such as mean, median, and variance to make some
calculations on our data:
import pandas as pd
3. Use the read_csv method to load the aforementioned dataset and use the
index_col parameter to define the first column as our index:
dataset = \
pd.read_csv('../../Datasets/world_population.csv', \
index_col=0)
dataset[0:2]
5. Now, index the third row by using dataset.iloc[[2]]. Use the axis
parameter to get the mean of the country rather than the yearly column:
dataset.iloc[[2]].mean(axis=1)
6. Index the last row of the DataFrame using -1 as the index for the
iloc() method:
dataset.iloc[[-1]].mean(axis=1)
7. Calculate the mean value of the values labeled as Germany using loc, which
works based on the index column:
dataset.loc[["Germany"]].mean(axis=1)
8. Calculate the median value of the last row by using reverse indexing and
axis=1 to aggregate the values in the row:
dataset.iloc[[-1]].median(axis=1)
9. Use reverse indexing to get the last three rows with dataset[-3:] and
calculate the median for each of them:
dataset[-3:].median(axis=1)
10. Calculate the median population density values for the first 10 countries of the
list using the head and median methods:
dataset.head(10).median(axis=1)
Figure 1.37: Usage of the axis to calculate the median of the first 10 rows
When handling larger datasets, the order in which methods are executed
matters. Think about what head(10) does for a moment. It simply takes
your dataset and returns the first 10 rows in it, cutting down your input to the
median() method drastically.
The last method we'll cover here is the variance. pandas provides a consistent API,
which makes it easy to use.
11. Calculate the variance of the dataset and return only the last five columns:
dataset.var().tail()
12. Calculate the mean for the year 2015 using both NumPy and pandas separately:
Note
To access the source code for this specific section, please refer to
https://packt.live/2N7E2Kh.
This exercise of how to use NumPy's mean method with a pandas DataFrame shows
that, in some cases, NumPy has better functionality. However, the DataFrame format
of pandas is more applicable, so we combine both libraries to get the best out
of both.
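A minimal sketch of this interoperability is shown below, using a hypothetical single-column DataFrame rather than the exercise dataset. It also illustrates one difference to be aware of: pandas skips missing values by default, while np.mean on a plain array propagates them:

```python
import numpy as np
import pandas as pd

# Hypothetical column of values for the year 2015
df = pd.DataFrame({"2015": [100.0, 200.0, 300.0]})

pandas_mean = df["2015"].mean()
numpy_mean = np.mean(df["2015"].to_numpy())
print("pandas", pandas_mean)
print("numpy", numpy_mean)

# With a missing value, the two libraries behave differently on raw arrays
with_gap = pd.Series([1.0, np.nan, 3.0])
pandas_gap_mean = with_gap.mean()                   # NaN is skipped
numpy_gap_mean = np.mean(with_gap.to_numpy())       # NaN propagates
numpy_nan_aware = np.nanmean(with_gap.to_numpy())   # NaN-aware variant
```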
You've completed your first exercise with pandas, which showed you some of the
similarities, and also differences when working with NumPy and pandas. In the
following exercise, this knowledge will be consolidated. You'll also be introduced to
more complex features and methods of pandas.
Indexing
Indexing with pandas is a bit more complex than with NumPy. We can only access
columns with a single bracket. To use the indices of the rows to access them, we need
the iloc method. If we want to access them with index_col (which was set in the
read_csv call), we need to use the loc method:
# index the 2000 col
dataset["2000"]
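The three access paths described above (brackets for columns, iloc for positions, loc for labels) can be sketched on a small, hypothetical DataFrame whose index plays the role of index_col:

```python
import pandas as pd

# Hypothetical excerpt with country names as the index
dataset = pd.DataFrame(
    {"1999": [1.0, 2.0, 3.0], "2000": [1.5, 2.5, 3.5]},
    index=["Germany", "India", "Singapore"],
)

col_2000 = dataset["2000"]             # column by header, returned as a Series
last_row = dataset.iloc[-1]            # row by position, returned as a Series
india_frame = dataset.loc[["India"]]   # row by label, kept as a DataFrame
value = dataset.loc["India", "2000"]   # a single value by row and column label

print(col_2000)
print(last_row)
print(india_frame)
print(value)
```

Single brackets around a label return a Series (or a scalar), whereas double brackets return a DataFrame, which is what makes chaining further DataFrame methods possible.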
Slicing
Slicing with pandas is even more powerful. We can use the default slicing syntax
we've already seen with NumPy or use multi-selection. If we want to slice different
rows or columns by name, we can simply pass a list into the brackets:
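Here is a short sketch of both slicing styles on a hypothetical DataFrame; the numbers are invented and only the labels mirror the ones used later in this chapter:

```python
import pandas as pd

# Hypothetical values; only the labels resemble the real dataset
dataset = pd.DataFrame(
    {"1970": [1.0, 2.0, 3.0, 4.0],
     "1990": [1.1, 2.1, 3.1, 4.1],
     "2010": [1.2, 2.2, 3.2, 4.2]},
    index=["Germany", "India", "Singapore", "United States"],
)

# Positional slice, as with NumPy: rows at positions 1 and 2
middle_rows = dataset.iloc[1:3]

# Multi-selection by name: pass a list of labels
selected = dataset.loc[["Germany", "Singapore"]]

# Chaining: pick rows by label, then columns by header
subframe = dataset.loc[["Germany", "Singapore"]][["1970", "2010"]]

print(middle_rows)
print(subframe)
```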
Iterating
Iterating DataFrames is also possible. Considering that they can have several
dimensions and dtypes, the indexing is very high level and iterating over each
row has to be done separately:
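Row-wise iteration with iterrows() can be sketched as follows on a hypothetical two-column DataFrame. Each step yields the row label and the row itself as a Series:

```python
import pandas as pd

# Hypothetical excerpt with country names as the index
dataset = pd.DataFrame(
    {"1999": [1.0, 2.0], "2000": [1.5, 2.5]},
    index=["Germany", "India"],
)

visited = []
for index, row in dataset.iterrows():
    # index is the row label; row is a Series keyed by column header
    visited.append((index, row["2000"]))
    print(index, row["2000"])
```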
Series
A pandas Series is a one-dimensional labeled array that is capable of holding any
type of data. We can create a Series by loading datasets from a .csv file, Excel
spreadsheet, or SQL database. There are many different ways to create them, such as
the following:
• NumPy arrays:
# import pandas
import pandas as pd
# import numpy
import numpy as np
# creating a numpy array
numarr = np.array(['p','y','t','h','o','n'])
ser = pd.Series(numarr)
print(ser)
• Python lists:
# import pandas
import pandas as pd
# creating a Python list
plist = ['p','y','t','h','o','n']
ser = pd.Series(plist)
print(ser)
Let's use the indexing, slicing, and iterating operations to display the population
density of Germany, Singapore, United States, and India for years 1970, 1990,
and 2010.
Indexing
import pandas as pd
4. Index the row with the index_col "United States" using the
loc method:
dataset.loc[["United States"]].head()
Figure 1.40: A few columns from the output showing indexing United States
with the loc method
5. Use reverse indexing in pandas to index the second to last row using the
iloc method:
dataset.iloc[[-2]]
6. Columns are indexed using their header. This is the first line of the CSV file. Index
the column with the header of 2000 as a Series:
dataset["2000"].head()
Remember, the head() method simply returns the first five rows.
7. First, get the data for the year 2000 as a DataFrame and then select India using
the loc() method using chaining:
dataset[["2000"]].loc[["India"]]
Since the double brackets notation returns a DataFrame once again, we can
chain method calls to get distinct elements.
8. Use the single brackets notation to get the distinct value for the population
density of India in 2000:
dataset["2000"].loc["India"]
If we want to only retrieve a Series object, we must replace the double brackets
with single ones. The output of the preceding code is as follows:
354.326858357522
Slicing
1. Create a slice with the rows 2 to 5 using the iloc() method again:
2. Use the loc() method to access several rows in the DataFrame and use the
nested brackets to provide a list of elements. Slice the dataset to get the rows for
Germany, Singapore, United States, and India:
3. Use chaining to get the rows for Germany, Singapore, United States, and India
and return only the values for the years 1970, 1990, and 2010. Since the double
bracket queries return new DataFrames, we can chain methods and therefore
access distinct subframes of our data:
Figure 1.46: Slices some of the countries and their population density
for 1970, 1990, and 2010
Iterating
1. Iterate our dataset and print out the countries up until Angola using the
iterrows() method. The index will be the name of our row, and the row will
hold all the columns:
Note
To access the source code for this specific section, please refer to
https://packt.live/2YKqHNM.
We've already covered most of the underlying data wrangling methods using pandas.
In the next exercise, we'll take a look at more advanced features such as filtering,
sorting, and reshaping to prepare you for the next chapter.
Filtering
Filtering in pandas has a higher-level interface than NumPy. You can still use the
simple brackets-based conditional filtering. However, you're also able to use more
complex queries, for example, filter rows based on labels using likeness, which
allows us to search for a substring using the like argument and even full regular
expressions using regex:
# years containing an 8
dataset.filter(like="8", axis=1)
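The different filter options can be compared on a small, hypothetical DataFrame. Note that filter() works on labels (column headers or index entries), while the bracket syntax with a condition works on values:

```python
import pandas as pd

# Hypothetical values; row and column labels chosen to exercise each filter
dataset = pd.DataFrame(
    {"1978": [90.0, 14.0, 6.0, 3500.0],
     "1980": [95.0, 15.0, 7.0, 3600.0],
     "2000": [110.0, 17.0, 13.0, 6000.0]},
    index=["Albania", "Finland", "Angola", "Singapore"],
)

picked_years = dataset.filter(items=["1978", "2000"])  # exact column names
years_with_8 = dataset.filter(like="8", axis=1)        # substring match
years_2000s = dataset.filter(regex="^2", axis=1)       # regular expression
rows_with_a = dataset.filter(regex="^A", axis=0)       # filter on the index
rows_land = dataset.filter(like="land", axis=0)

# Value-based filtering uses a Boolean condition in brackets
dense = dataset[dataset["2000"] > 500]

print(years_with_8.columns.tolist(), rows_with_a.index.tolist())
```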
Sorting
Sorting each row or column based on a given row or column will help you analyze
your data better and find the ranking of a given dataset. With pandas, we are able to
do this pretty easily. Sorting in ascending and descending order can be done using
the parameter known as ascending. The default sorting order is ascending. Of
course, you can do more complex sorting by providing more than one value in the
by=[ ] list. The additional values are then used to order rows for which the first
value is the same:
# values sorted by 1999
dataset.sort_values(by=["1999"])
# values sorted by 1994 descending
dataset.sort_values(by=["1994"], ascending=False)
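To see how the by and ascending parameters interact, including the tie-breaking behavior of a second sort key, consider this sketch on a hypothetical three-row DataFrame:

```python
import pandas as pd

# Hypothetical values; rows B and C tie in the 2015 column
dataset = pd.DataFrame(
    {"1961": [30.0, 10.0, 20.0], "2015": [5.0, 50.0, 50.0]},
    index=["A", "B", "C"],
)

lowest_first = dataset.sort_values(by=["1961"])
highest_first = dataset.sort_values(by=["1961"], ascending=False)

# Sort by 2015 descending; ties are resolved by 1961 ascending
tie_broken = dataset.sort_values(by=["2015", "1961"],
                                 ascending=[False, True])

print(lowest_first.index.tolist())
print(highest_first.index.tolist())
print(tie_broken.index.tolist())
```

sort_values returns a new, sorted DataFrame and leaves the original untouched unless inplace=True is passed.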
Reshaping
Reshaping can be crucial for easier visualization and algorithms. However, depending
on your data, this can get really complex:
dataset.pivot(index=["1999"] * len(dataset), \
columns="Country Code", values="1999")
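How a list passed as index is interpreted may differ between pandas versions, so the following sketch reaches the same single-row result in a slightly different way. It adds a helper column (named year here, purely an illustrative choice) that holds the repeated label and then pivots using column names only. The DataFrame is a hypothetical two-row excerpt:

```python
import pandas as pd

# Hypothetical excerpt with a country code column and one year column
dataset = pd.DataFrame(
    {"Country Code": ["DEU", "IND"], "1999": [230.0, 350.0]},
    index=["Germany", "India"],
)

# Add a helper column that repeats the label "1999" for every row
long_form = dataset.assign(year="1999")

# Pivot: one row per year, one column per country code
reshaped = long_form.pivot(index="year", columns="Country Code",
                           values="1999")

print(reshaped)
```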
Filtering
3. Use the read_csv method to load the dataset, again defining our first column
as an index column:
4. Use filter instead of using the bracket syntax to filter for specific items. Filter
the dataset for columns 1961, 2000, and 2015 using the items parameter:
5. Use conditions to get all the countries that had a higher population density than
500 in 2000. Simply pass this condition in brackets:
"""
filtering countries that had a greater population density
than 500 in 2000
"""
dataset[(dataset["2000"] > 500)][["2000"]]
Figure 1.49: Filtering out values that are greater than 500 in the 2000 column
6. Search for arbitrary columns or rows (depending on the axis given) that match
a certain regex. Get all the columns that start with 2 by passing ^2 (meaning
that the label starts with 2):
dataset.filter(regex="^2", axis=1).head()
7. Filter the rows instead of the columns by passing axis=0. This will be helpful
for situations when we want to filter all the rows that start with A:
dataset.filter(regex="^A", axis=0).head()
8. Use the like query to find only the countries that contain the word land, such
as Switzerland:
dataset.filter(like="land", axis=0).head()
Sorting
1. Use the sort_values method to get the countries with the
lowest population density for the year 1961:
dataset.sort_values(by=["1961"])[["1961"]].head(10)
2. Repeat the sort for the year 2015 to compare the results:
dataset.sort_values(by=["2015"])[["2015"]].head(10)
We can see that the order of the countries with the lowest population density
has changed a bit, but that the first three entries remain unchanged.
3. Sort column 2015 in descending order to show the biggest values first:
dataset.sort_values(by=["2015"], \
ascending=False)[["2015"]].head(10)
Reshaping
1. Get a DataFrame where the columns are country codes and the only row is
the year 2015. Since we only have one 2015 label, we need to duplicate it as
many times as our dataset's length:
Figure 1.56: Reshaping the dataset into a single row for the values of 2015
Note
To access the source code for this specific section, please refer to
https://packt.live/2N0xHQZ.
You now know the basic functionality of pandas and have already applied it to a real-
world dataset. In the final activity for this chapter, we will try to analyze a forest fire
dataset to get a feeling for mean forest fire sizes and whether the temperature of
each month is proportional to the number of fires.
• area: The burned area of the forest (in ha): 0.00 to 1090.84
Note
We will only be using the month, temp, and area columns in this activity.
3. Print the first two rows of the dataset to get a feeling for its structure.
1. Filter the dataset so that it only contains entries that have an area larger than 0.
2. Get the mean, min, max, and std of the area column and see what information
this gives you.
3. Sort the filtered dataset using the area column and print the last 20 entries
using the tail method to see how many huge values it holds.
4. Then, get the median of the area column and visually compare it to the
mean value.
1. Get a list of unique values from the month column of the dataset.
2. Get the number of entries for the month of March using the shape member of
our DataFrame.
3. Now, iterate over all the months, filter our dataset for the rows containing the
given month, and calculate the mean temperature. Print a statement with the
number of fires, the mean temperature, and the month.
Note
The solution for this activity can be found on page 391.
You have now completed this topic on pandas, which concludes this chapter.
We have learned about the essential tools that help you wrangle and work with
data. pandas is an incredibly powerful and widely used tool for wrangling and
understanding data.
Summary
NumPy and pandas are essential tools for data wrangling. Their user-friendly
interfaces and performant implementation make data handling easy. Even though
they only provide a little insight into our datasets, they are valuable for wrangling,
augmenting, and cleaning our datasets. Mastering these skills will improve the quality
of your visualizations.
In this chapter, we learned about the basics of NumPy, pandas, and statistics. Even
though the statistical concepts we covered are basic, they are necessary to enrich
our visualizations with information that, in most cases, is not directly provided in
our datasets. This hands-on experience will help you implement the exercises and
activities in the following chapters.
In the next chapter, we will focus on the different types of visualizations and how
to decide which visualization would be best for our use case. This will give you
theoretical knowledge so that you know when to use a specific chart type and why.
It will also lay down the fundamentals of the remaining chapters in this book, which
will focus on teaching you how to use Matplotlib and seaborn to create the plots
we have discussed here. After we have covered basic visualization techniques with
Matplotlib and seaborn, we will dive more in-depth and explore the possibilities of
interactive and animated charts, which will introduce an element of storytelling into
our visualizations.
2
All You Need to Know about
Plots
Overview
This chapter will teach you the fundamentals of the various types of plots
such as line charts, bar charts, bubble plots, radar charts, and so on. For
each plot type that we discuss, we will also describe best practices and
use cases. The activities presented in this chapter will enable you to apply
the knowledge gained. By the end of this chapter, you will be equipped
with the important skill of identifying the best plot type for a given dataset
and scenario.
Introduction
In the previous chapter, we learned how to work with new datasets and get familiar
with their data and structure. We also got hands-on experience of how to analyze and
transform them using different data wrangling techniques such as filtering, sorting,
and reshaping. All of these techniques will come in handy when working with further
real-world datasets in the coming activities.
In this chapter, we will focus on various visualizations and identify which visualization
is best for showing certain information for a given dataset. We will describe every
visualization in detail and give practical examples, such as comparing different stocks
over time or comparing the ratings for different movies. Starting with comparison
plots, which are great for comparing multiple variables over time, we will look at their
types (such as line charts, bar charts, and radar charts).
We will then move on to relation plots, which are handy for showing relationships
among variables. We will cover scatter plots for showing the relationship between two
variables, bubble plots for three variables, correlograms for variable pairs, and finally,
heatmaps for visualizing multivariate data.
The chapter will further explain composition plots (used to visualize variables that
are part of a whole), as well as pie charts, stacked bar charts, stacked area charts,
and Venn diagrams. To give you a deeper insight into the distribution of variables,
we will discuss distribution plots, describing histograms, density plots, box plots, and
violin plots.
Finally, we will talk about dot maps, connection maps, and choropleth maps, which
can be categorized into geoplots. Geoplots are useful for visualizing geospatial data.
Let’s start with the family of comparison plots, including line charts, bar charts, and
radar charts.
Note
The data used in this chapter has been provided to demonstrate the
different types of plots available to you. In each case, the data itself will be
revisited and explained more fully in a later chapter.
Comparison Plots
Comparison plots include charts that are ideal for comparing multiple variables
or variables over time. Line charts are great for visualizing variables over time. For
comparison among items, bar charts (also called column charts) are the best way
to go. For a short time period (say, fewer than 10 time points), vertical bar charts
can be used as well. Radar charts or spider plots are great for visualizing multiple
variables for multiple groups.
Line Chart
Line charts are used to display quantitative values over a continuous time period and
show information as a series. A line chart is ideal for a time series that is connected
by straight-line segments.
The value being measured is placed on the y-axis, while the x-axis is the timescale.
Uses
• Line charts are great for comparing multiple variables and visualizing trends for
both single as well as multiple variables, especially if your dataset has many time
periods (more than 10).
• For smaller time periods, vertical bar charts might be the better choice.
The following diagram shows the trend of real estate prices (in millions of US dollars)
across two decades. Line charts are ideal for showing data trends:
Example
The following figure is a multiple-variable line chart that compares the stock-closing
prices for Google, Facebook, Apple, Amazon, and Microsoft. A line chart is great for
comparing values and visualizing the trend of the stock. As we can see, Amazon
shows the highest growth:
Figure 2.2: Line chart showing stock trends for five companies
Design Practices
• Avoid too many lines per chart.
Note
For plots with multiple variables, a legend should be given to describe
each variable.
Bar Chart
In a bar chart, the bar length encodes the value. There are two variants of bar charts:
vertical bar charts and horizontal bar charts.
Use
While they are both used to compare numerical values across categories, vertical bar
charts are sometimes used to show a single variable over time.
• Another common mistake is to use bar charts to show central tendencies among
groups or categories. Use box plots or violin plots to show statistical measures or
distributions in these cases.
Examples
The following diagram shows a vertical bar chart. Each bar shows the marks out of
100 that 5 students obtained in a test:
The following diagram shows a horizontal bar chart. Each bar shows the marks out of
100 that 5 students obtained in a test:
The following diagram compares movie ratings, giving two different scores. The
Tomatometer is the percentage of approved critics who have given a positive review
for the movie. The Audience Score is the percentage of users who have given a score
of 3.5 or higher out of 5. As we can see, The Martian is the only movie with both a
high Tomatometer and Audience Score. The Hobbit: An Unexpected Journey has a
relatively high Audience Score compared to the Tomatometer score, which might be
due to a huge fan base:
Design Practices
• The axis corresponding to the numerical variable should start at zero. Starting
with another value might be misleading, as it makes a small value difference look
like a big one.
• Use horizontal labels as long as the number of bars is small and the
chart doesn’t look too cluttered.
• The labels can be rotated to different angles if there isn’t enough space to
present them horizontally. You can see this on the labels of the x-axis of the
preceding diagram.
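The first design practice can be checked with a little arithmetic. The following sketch uses a hypothetical helper, drawn_height_ratio, and made-up scores to show how the drawn bar lengths compare when the axis does or does not start at zero:

```python
def drawn_height_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar lengths for values a and b
    when the numerical axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)

# Hypothetical test scores for two students.
score_a, score_b = 90, 85

honest = drawn_height_ratio(score_a, score_b)          # axis starts at 0
truncated = drawn_height_ratio(score_a, score_b, 80)   # axis starts at 80

print(f"axis from 0:  bar A is {honest:.2f}x as long as bar B")
print(f"axis from 80: bar A is {truncated:.2f}x as long as bar B")
```

With the truncated axis, a difference of five marks makes one bar look twice as long as the other.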
Radar Chart
Radar charts (also known as spider or web charts) visualize multiple variables with
each variable plotted on its own axis, resulting in a polygon. All axes are arranged
radially, starting at the center with equal distances between one another, and have
the same scale.
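To make the geometry concrete, here is a minimal sketch of how the polygon for one group could be computed. The function name radar_vertices and the marks are assumptions made for illustration; plotting libraries handle this step for you.

```python
import math

def radar_vertices(values, max_value):
    """Return (x, y) polygon vertices for one group on a radar chart.

    Axes are spaced evenly around the circle and share one scale,
    so each value is divided by the same max_value.
    """
    n = len(values)
    vertices = []
    for i, value in enumerate(values):
        angle = 2 * math.pi * i / n      # equal angular spacing between axes
        radius = value / max_value       # same scale on every axis
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices

# Hypothetical marks out of 100 in four subjects.
marks = [80, 60, 90, 70]
for x, y in radar_vertices(marks, max_value=100):
    print(f"({x:.2f}, {y:.2f})")
```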
Uses
• Radar charts are great for comparing multiple quantitative variables for a single
group or multiple groups.
• They are also useful for showing which variables score high or low within a
dataset, making them ideal for visualizing performance.
Examples
The following diagram shows a radar chart for a single variable. This chart displays
the marks a student scored in different subjects:
The following diagram shows a radar chart for two variables/groups. Here, the chart
explains the marks that were scored by two students in different subjects:
The following diagram shows a radar chart for multiple variables/groups. Each chart
displays data about a student’s performance in different subjects:
Figure 2.8: Radar chart with faceting for multiple variables (multiple students)
Design Practices
• Try to display 10 factors or fewer on a single radar chart to make it easier
to read.
• Use faceting (displaying each variable in a separate plot) for multiple variables/
groups, as shown in the preceding diagram, in order to maintain clarity.
In the first section, we learned which plots are suitable for comparing items. Line
charts are great for comparing something over time, whereas bar charts are for
comparing different items. Last but not least, radar charts are best suited for
visualizing multiple variables for multiple groups. In the following activity, you can
check whether you understood which plot is best for which scenario.
2. You are given the following bar and radar charts. List the advantages and
disadvantages of both charts. Which is the better chart for this task in your
opinion, and why?
The following diagram shows a bar chart for the employee skills:
The following diagram shows a radar chart for the employee skills:
Note
The solution to this activity can be found on page 397.
Concluding the activity, you hopefully have a good understanding of deciding which
comparison plots are best for the situation. In the next section, we will discuss
different relation plots.
Relation Plots
Relation plots are perfectly suited to showing relationships among variables. A
scatter plot visualizes the correlation between two variables for one or multiple
groups. Bubble plots can be used to show relationships between three variables.
The additional third variable is represented by the dot size. Heatmaps are great for
revealing patterns or correlations between two qualitative variables. A correlogram is
a perfect visualization for showing the correlation among multiple variables.
Scatter Plot
Scatter plots show data points for two numerical variables, displaying a variable on
both axes.
Uses
• You can detect whether a correlation (relationship) exists between two variables.
• They allow you to plot the relationship between multiple groups or categories
using different colors.
• A bubble plot, which is a variation of the scatter plot, is an excellent tool for
visualizing the correlation of a third variable.
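What the eye judges in a scatter plot can also be expressed as a number. The following sketch computes the Pearson correlation coefficient by hand for made-up height and weight values; pearson_r is a helper written for this illustration, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical heights (cm) and weights (kg).
heights = [150, 160, 165, 172, 180, 188]
weights = [52, 58, 63, 70, 79, 88]

print(f"r = {pearson_r(heights, weights):.3f}")
```

A value close to 1 corresponds to points that rise together along a line, a value close to -1 to points that fall along a line, and a value near 0 to no linear relationship.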
Examples
The following diagram shows a scatter plot of height and weight of persons
belonging to a single group:
The following diagram shows the same data as in the previous plot but differentiates
between groups. In this case, we have different groups: A, B, and C:
The following diagram shows the correlation between body mass and the maximum
longevity for various animals grouped by their classes. There is a positive correlation
between body mass and maximum longevity:
Figure 2.13: Correlation between body mass and maximum longevity for animals
Design Practices
• Start both axes at zero to represent data accurately.
• Use contrasting colors for data points and avoid using symbols for scatter plots
with multiple groups or categories.
Examples
The following diagram shows the correlation between body mass and the maximum
longevity for animals in the Aves class. The marginal histograms are also shown,
which helps to get a better insight into both variables:
Bubble Plot
A bubble plot extends a scatter plot by introducing a third numerical variable. The
value of the variable is represented by the size of the dots. The area of the dots is
proportional to the value. A legend is used to link the size of the dot to an actual
numerical value.
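Because the area, not the radius, is proportional to the value, the radius grows with the square root of the value. A minimal sketch with a hypothetical helper, bubble_radius, and made-up weights:

```python
import math

def bubble_radius(value, area_per_unit=1.0):
    """Radius of a bubble whose AREA (not radius) is proportional to value."""
    area = value * area_per_unit
    return math.sqrt(area / math.pi)

# Hypothetical weights in kg; the second is four times the first.
small, large = 20, 80

r_small = bubble_radius(small)
r_large = bubble_radius(large)

print(f"radius ratio: {r_large / r_small:.2f}")        # square root of the value ratio
print(f"area ratio:   {(r_large / r_small) ** 2:.2f}")  # equals the value ratio
```

Scaling the radius directly would exaggerate large values. As far as we are aware, the s parameter of Matplotlib's scatter function is already specified as an area (in points squared), so values can be passed to it without taking a square root.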
Use
Bubble plots help to show a correlation between three variables.
Example
The following diagram shows a bubble plot that highlights the relationship between
the height and age of humans, with the weight of each person represented by
the size of the bubble:
Figure 2.15: Bubble plot showing the relation between height and age of humans
Design Practices
• The design practices for the scatter plot are also applicable to the bubble plot.
• Don’t use bubble plots for very large amounts of data, since too many bubbles
make the chart difficult to read.
Correlogram
A correlogram is a combination of scatter plots and histograms. Histograms will be
discussed in detail later in this chapter. A correlogram or correlation matrix visualizes
the relationship between each pair of numerical variables using a scatter plot.
The diagonals of the correlation matrix represent the distribution of each variable in
the form of a histogram. You can also plot the relationship between multiple groups
or categories using different colors. A correlogram is a great chart for exploratory
data analysis to get a feel for your data, especially the correlation between
variable pairs.
Examples
The following diagram shows a correlogram for the height, weight, and age of
humans. The diagonal plots show a histogram for each variable. The off-diagonal
elements show scatter plots between variable pairs:
The following diagram shows the correlogram with data samples separated by color
into different groups:
Design Practices
• Start both axes at zero to represent data accurately.
• Use contrasting colors for data points and avoid using symbols for scatter plots
with multiple groups or categories.
Heatmap
A heatmap is a visualization where values contained in a matrix are represented
as colors or color saturation. Heatmaps are great for visualizing multivariate data
(data in which analysis is based on more than two variables per observation),
where categorical variables are placed in the rows and columns and a numerical or
categorical variable is represented as colors or color saturation.
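The mapping from matrix values to color saturation can be sketched as a simple rescaling. The helper to_intensity and the sales figures below are assumptions made for illustration; a plotting library would then translate each intensity into a color from a colormap.

```python
def to_intensity(matrix):
    """Rescale every cell to the range 0..1 (0 = lightest, 1 = darkest)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo or 1          # avoid division by zero for a constant matrix
    return [[(v - lo) / span for v in row] for row in matrix]

# Hypothetical units sold: rows are products, columns are websites.
units_sold = [
    [120, 300, 80],
    [450, 210, 60],
]

for row in to_intensity(units_sold):
    print(" ".join(f"{v:.2f}" for v in row))
```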
Use
The visualization of multivariate data can be done using heatmaps as they are great
for finding patterns in your data.
Examples
The following diagram shows a heatmap for the most popular products on the
electronics category page across various e-commerce websites, where the color
shows the number of units sold. In the following diagram, we can see that the
darker colors represent more units sold, as shown in the key:
Let’s see the same example we saw previously in an annotated heatmap, where the
color shows the number of units sold:
Figure 2.19: Annotated heatmap for popular products in the electronics category
Design Practice
• Select colors and contrasts that will be easily visible to individuals with vision
problems so that your plots are more inclusive.
In this section, we introduced various plots for relating a variable to other variables
and looked at their uses, and multiple examples for the different relation plots were
given. The following activity will give you some practice in working with heatmaps.
1. Identify the two years during which the number of road accidents was
the lowest.
2. For the past two decades, identify the month for which accidents showed a
marked decrease:
Note
The solution to this activity can be found on page 397.
Composition Plots
Composition plots are ideal if you think about something as a part of a whole. For
static data, you can use pie charts, stacked bar charts, or Venn diagrams. Pie charts
or donut charts help show proportions and percentages for groups. If you need an
additional dimension, stacked bar charts are great. Venn diagrams are the best way
to visualize overlapping groups, where each group is represented by a circle. For data
that changes over time, you can use either stacked bar charts or stacked area charts.
Pie Chart
Pie charts illustrate numerical proportions by dividing a circle into slices. Each arc
length represents a proportion of a category. The full circle equates to 100%. For
humans, it is easier to compare bars than arc lengths; therefore, it is recommended
to use bar charts or stacked bar charts the majority of the time.
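The proportions behind a pie chart are easy to compute by hand. The following sketch uses a hypothetical helper, pie_slices, and made-up water usage figures to turn category totals into percentages and slice angles:

```python
def pie_slices(counts):
    """Return (label, percentage, angle in degrees) for each category."""
    total = sum(counts.values())
    return [(label, 100 * n / total, 360 * n / total)
            for label, n in counts.items()]

# Hypothetical liters of water used per day in one household.
usage = {"Toilet": 30, "Shower": 40, "Washing": 20, "Other": 10}

# Print the slices from largest to smallest.
for label, pct, angle in sorted(pie_slices(usage), key=lambda s: s[1], reverse=True):
    print(f"{label:8s} {pct:5.1f}%  {angle:6.1f} degrees")
```

The percentages always add up to 100 and the angles to 360, which is what makes a pie chart a composition plot.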
Use
To compare items that are part of a whole.
Examples
The following diagram shows household water usage around the world:
Design Practices
• Arrange the slices according to their size in increasing/decreasing order, either in
a clockwise or counterclockwise manner.
Design Practice
• Use the same color that’s used for the category for the subcategories. Use
varying brightness levels for the different subcategories.
Use
• To compare variables that can be divided into sub-variables
Examples
The following diagram shows a generic stacked bar chart with five groups:
Figure 2.24: Stacked bar chart to show sales of laptops and mobiles
The following diagram shows a 100% stacked bar chart with the same data that was
used in the preceding diagram:
Figure 2.25: 100% stacked bar chart to show sales of laptops, PCs, and mobiles
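The difference between a stacked bar chart and a 100% stacked bar chart is a normalization step: each group's values are divided by the group total. A minimal sketch, assuming made-up quarterly sales and a hypothetical helper named to_percent_stack:

```python
def to_percent_stack(rows):
    """Convert each group's absolute values into shares that sum to 100."""
    result = {}
    for group, parts in rows.items():
        total = sum(parts.values())
        result[group] = {name: 100 * v / total for name, v in parts.items()}
    return result

# Hypothetical unit sales per quarter.
sales = {
    "Q1": {"laptops": 30, "PCs": 10, "mobiles": 60},
    "Q2": {"laptops": 45, "PCs": 15, "mobiles": 90},
}

for group, shares in to_percent_stack(sales).items():
    print(group, {name: round(s, 1) for name, s in shares.items()})
```

In this example, the second quarter sold more units overall, yet both quarters have identical shares. The 100% variant makes relative composition easy to compare but hides the absolute totals.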
The following diagram illustrates the daily total sales of a restaurant over several
days. The daily total sales of non-smokers are stacked on top of the daily total sales
of smokers:
Figure 2.26: Daily total restaurant sales categorized by smokers and non-smokers
Design Practices
• Use contrasting colors for stacked bars.
• Ensure that the bars are adequately spaced to eliminate visual clutter. The ideal
space guideline between each bar is half the width of a bar.
Use
To show trends for time series that are part of a whole.
Examples
The following diagram shows a stacked area chart with the net profits of Google,
Facebook, Twitter, and Snapchat over a decade:
Figure 2.27: Stacked area chart to show net profits of four companies
Design Practice
• Use transparent colors to improve information visibility. This will help you to
analyze the overlapping data and you will also be able to see the grid lines.
In this section, we covered various composition plots and we will conclude this
section with the following activity.
1. Looking at the following line chart, analyze the sales of each manufacturer
and identify the one whose fourth-quarter performance is exceptional when
compared to the third quarter.
2. Analyze the performance of all manufacturers and make a prediction about two
companies whose sales units will show a downward and an upward trend:
3. What would be the advantages and disadvantages of using a stacked area chart
instead of a line chart?
Note
The solution to this activity can be found on page 398.
Venn Diagram
Venn diagrams, also known as set diagrams, show all possible logical relations
between a finite collection of different sets. Each set is represented by a circle. The
circle size illustrates the importance of a group. The size of overlap represents the
intersection between multiple groups.
Use
To show overlaps for different sets.
Example
The following diagram shows a Venn diagram for students in two groups taking
the same class in a semester, visualizing the intersection of the two groups:
Figure 2.29: Venn diagram showing students taking the same class
From the preceding diagram, we can note that there are eight students in just group
A, four students in just group B, and one student in both groups.
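The same counts can be reproduced with Python's built-in set type. The student IDs below are invented so that they match the numbers in the diagram:

```python
# Hypothetical student IDs chosen to match the counts above:
# eight only in group A, four only in group B, one in both.
group_a = set(range(1, 10))        # students 1..9
group_b = {9, 10, 11, 12, 13}      # student 9 is in both groups

only_a = group_a - group_b         # set difference
only_b = group_b - group_a
both = group_a & group_b           # set intersection

print(len(only_a), len(only_b), len(both))
```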
Design Practice
• It is not recommended to use Venn diagrams if you have more than three
groups. It would become difficult to understand.
Distribution Plots
Distribution plots give a deep insight into how your data is distributed. For a single
variable, a histogram is effective. For multiple variables, you can either use a box
plot or a violin plot. The violin plot visualizes the densities of your variables, whereas
the box plot just visualizes the median, the interquartile range, and the range for
each variable.
Histogram
A histogram visualizes the distribution of a single numerical variable. Each bar
represents the frequency for a certain interval. Histograms help get an estimate
of statistical measures. You see where values are concentrated, and you can easily
detect outliers. You can either plot a histogram with absolute frequency values or,
alternatively, normalize your histogram. If you want to compare distributions of
multiple variables, you can use different colors for the bars.
Use
Get insights into the underlying distribution for a dataset.
Example
The following diagram shows the distribution of the Intelligence Quotient (IQ) for a
test group. The dashed lines represent the standard deviation on each side of the mean
(the solid line):
Design Practice
• Try different numbers of bins (data intervals), since the shape of the histogram
can vary significantly.
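The effect of the bin count can be seen even without plotting. The following sketch counts made-up IQ scores with a hypothetical helper named histogram, first using three bins and then six. It assumes that all values lie between lo and hi.

```python
def histogram(values, bins, lo, hi):
    """Count how many values fall into each of `bins` equal-width intervals.

    Assumes lo <= value <= hi for every value.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical IQ scores.
scores = [82, 88, 91, 95, 97, 99, 100, 101, 103, 104, 108, 112, 118, 125]

for bins in (3, 6):
    print(bins, "bins:", histogram(scores, bins, lo=80, hi=128))
```

The same data looks like one broad peak with three bins and shows more detail with six, which is why it is worth trying several bin counts.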
Density Plot
A density plot shows the distribution of a numerical variable. It is a variation of a
histogram that uses kernel smoothing, allowing for smoother distributions. One
advantage these have over histograms is that density plots are better at determining
the distribution shape since the distribution shape for histograms heavily depends on
the number of bins (data intervals).
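Kernel smoothing can be sketched in a few lines: every data point contributes a small bell-shaped curve, and the density is the average of these curves. The following example assumes a Gaussian kernel and a hand-picked bandwidth; the function gaussian_kde is written for this illustration and is not a library call.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    n = len(samples)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density

# Hypothetical measurements; the bandwidth is chosen by hand for illustration.
samples = [4.1, 4.8, 5.0, 5.3, 5.9, 7.2, 7.5]
density = gaussian_kde(samples, bandwidth=0.5)

for x in (4.0, 5.0, 6.5, 7.5):
    print(f"density({x}) = {density(x):.3f}")
```

The bandwidth plays a role similar to the bin width of a histogram: a small bandwidth gives a bumpy curve, while a large one smooths away detail.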
Use
To compare the distribution of several variables by plotting the density on the same
axis and using different colors.
Example
The following diagram shows a basic density plot:
Design Practice
• Use contrasting colors to plot the density of multiple variables.
Box Plot
The box plot shows multiple statistical measurements. The box extends from the
lower to the upper quartile values of the data, thus allowing us to visualize the
interquartile range (IQR). The horizontal line within the box denotes the median.
The parallel extending lines from the boxes are called whiskers; they indicate the
variability outside the lower and upper quartiles. There is also an option to show data
outliers, usually as circles or diamonds, past the end of the whiskers.
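The numbers that a box plot draws can be computed with the standard statistics module. The sketch below assumes a common convention in which whiskers extend to the most extreme data point within 1.5 times the IQR of the box (this is, for example, Matplotlib's default); other conventions exist. The helper box_plot_stats and the heights are made up for illustration.

```python
import statistics

def box_plot_stats(values, whisker=1.5):
    """Numbers a box plot draws, plus the points shown as outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit, high_limit = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if low_limit <= v <= high_limit]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_limit or v > high_limit],
    }

# Hypothetical heights in cm, with one unusually tall person.
heights = [150, 155, 160, 162, 165, 168, 170, 172, 175, 210]
print(box_plot_stats(heights))
```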
Use
Compare statistical measures for multiple variables or groups.
Examples
The following diagram shows a basic box plot that shows the height of a group
of people:
The following diagram shows a basic box plot for multiple variables. In this case, it
shows heights for two different groups – adults and non-adults:
In the next section, we will learn about the features, uses, and best practices of the
violin plot.
Violin Plot
Violin plots are a combination of box plots and density plots. Both the statistical
measures and the distribution are visualized. The thick black bar in the center
represents the interquartile range, while the thin black line corresponds to the
whiskers in a box plot. The white dot indicates the median. On both sides of the
centerline, the density is visualized.
Use
Compare statistical measures and density for multiple variables or groups.
Examples
The following diagram shows a violin plot for a single variable and shows how
students have performed in Math:
From the preceding diagram, we can see that most of the students scored
around 40-60 in the Math test.
The following diagram shows a violin plot for two variables and shows the
performance of students in English and Math:
Figure 2.36: Violin plot for multiple variables (English and Math)
From the preceding diagram, we can say that on average, the students have scored
more in English than in Math, but the highest score was secured in Math.
The following diagram shows a violin plot for a single variable divided into three
groups, and shows the performance of three divisions of students in English based
on their score:
Figure 2.37: Violin plot with multiple categories (three groups of students)
From the preceding diagram, we can note that on average, division C has scored the
highest, division B has scored the lowest, and division A is, on average, in between
divisions B and C.
Design Practice
• Scale the axes accordingly so that the distribution is clearly visible and not flat.
In this section, distribution plots were introduced. In the following activity, we will
have a closer look at histograms.
1. Looking at the following histogram, can you identify the interval during which a
maximum number of trains arrive?
2. How would the histogram change if in the morning, the same total number of
trains arrive as in the afternoon, and if you have the same frequencies for all
time intervals?
Note
The solution to this activity can be found on page 398.
With that activity, we conclude the section about distribution plots and we will
introduce geoplots in the next section.
Geoplots
Geoplots are a great way to visualize geospatial data. Choropleth maps can
be used to compare quantitative values for different countries, states, and so on. If
you want to show connections between different locations, connection maps are the
way to go.
Dot Map
In a dot map, each dot represents a certain number of observations. Each dot has the
same size and value (the number of observations each dot represents). The dots are
not meant to be counted; they are only intended to give an impression of magnitude.
The size and value are important factors for the effectiveness and impression of the
visualization. You can use different colors or symbols for the dots to show multiple
categories or groups.
Use
To visualize geospatial data.
Example
The following diagram shows a dot map where each dot represents a certain number
of bus stops throughout the world:
Design Practices
• Do not show too many locations. You should still be able to see the map to get a
feel for the actual location.
• Choose a dot size and value so that in dense areas, the dots start to blend. The
dot map should give a good impression of the underlying spatial distribution.
Choropleth Map
In a choropleth map, each tile is colored to encode a variable. A tile represents
a geographic region, such as a county or a country. Choropleth maps provide
a good way to show how a variable varies across a geographic area. One thing to keep
in mind for choropleth maps is that the human eye naturally gives more attention to
larger areas, so you might want to normalize your data, for example, by dividing each
value by the area of its region.
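One way to read this normalization is to divide each region's value by its area, so that the color encodes a density instead of a raw total. The regions and numbers in the following sketch are invented for illustration:

```python
# Hypothetical regions: (population, area in square kilometers).
regions = {
    "Region A": (2_000_000, 100_000),
    "Region B": (1_500_000, 10_000),
    "Region C": (500_000, 2_000),
}

# Coloring by raw population favors large regions; density corrects for area.
density = {name: pop / area for name, (pop, area) in regions.items()}

for name in sorted(density, key=density.get, reverse=True):
    print(f"{name}: {density[name]:.0f} people per square km")
```

Here, the region with the largest population has the lowest density, so the two colorings would tell quite different stories.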
Use
To visualize geospatial data grouped into geographic regions, for example, states
or countries.
Example
The following diagram shows a choropleth map of a weather forecast in the USA:
Figure 2.40: Choropleth map showing a weather forecast for the USA
Design Practices
• Use darker colors for higher values, as they are perceived as being higher in
magnitude.
• Limit the color gradation, since the human eye is limited in how many colors it
can easily distinguish between. Seven color gradations should be enough.
Connection Map
In a connection map, each line represents a certain number of connections between
two locations. The link between the locations can be drawn with a straight or rounded
line, representing the shortest distance between them.
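On a globe, the shortest path between two locations follows a great circle, which is why connections are often drawn as curved lines on a flat map. As a rough sketch, the length of such a path can be estimated with the haversine formula, assuming a spherical Earth with a radius of 6,371 km; the function name and the approximate coordinates are chosen for illustration.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate coordinates of London and New York.
london = (51.5, -0.1)
new_york = (40.7, -74.0)
print(f"{great_circle_km(*london, *new_york):.0f} km")
```

Such a length could also be used to color the lines when a colormap encodes the length of each connection.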
Each line has the same thickness and value (the number of connections each line
represents). The lines are not meant to be counted; they are only intended to give
an impression of magnitude. The size and value of a connection line are important
factors for the effectiveness and impression of the visualization.
You can use different colors for the lines to show multiple categories or groups, or
you can use a colormap to encode the length of the connection.
Use
To visualize connections.
Examples
The following diagram shows a connection map of flight connections around
the world:
Figure 2.41: Connection map showing flight connections around the world
Design Practices
• Do not show too many connections as it will be difficult for you to analyze the
data. You should still see the map to get a feel for the actual locations of the start
and end points.
• Choose a line thickness and value so that the lines start to blend in dense
areas. The connection map should give a good impression of the underlying
spatial distribution.
Geoplots are special plots that are great for visualizing geospatial data. In the
following section, we want to briefly talk about what’s generally important when it
comes to creating good visualizations.
• A visualization should tell a story and be designed for your audience. Before
creating your visualization, think about your target audience; create simple
visualizations for a non-specialist audience and more technical detailed
visualizations for a specialist audience. Think about a story to tell with your
visualization so that your visualization leaves an impression on the audience.
• Keep it simple and don’t overload the visualization with too much information.
What Makes a Good Visualization? | 129
2. How could we improve the visualizations? Sketch the right visualization for
both scenarios.
The first visualization is supposed to illustrate the top 30 YouTube music channels
according to their number of subscribers:
Figure 2.42: Pie chart showing the top 30 YouTube music channels
Note
The solution to this activity can be found on page 399.
The following diagram shows the population by different income groups using a
density plot:
The following diagram shows the population by different income groups using a
box plot:
The following diagram shows the population by different income groups using a
violin plot:
Note
The solution to this activity can be found on page 401.
Summary
This chapter covered the most important visualizations, categorized into comparison,
relation, composition, distribution, and geoplots. For each plot, a description,
practical examples, and design practices were given. Comparison plots, such as line
charts, bar charts, and radar charts, are well suited to comparing multiple variables
or variables over time. Relation plots are perfectly suited to show relationships
between variables. Scatter plots, bubble plots, which are an extension of scatter plots,
correlograms, and heatmaps were considered.
Composition plots are ideal if you need to think about something as part of a
whole. We first covered pie charts and continued with stacked bar charts, stacked
area charts, and Venn diagrams. For distribution plots that give a deep insight into
how your data is distributed, histograms, density plots, box plots, and violin plots
were considered. Regarding geospatial data, we discussed dot maps, connection
maps, and choropleth maps. Finally, some remarks were provided on what makes a
good visualization.
In the next chapter, we will dive into Matplotlib and create our own visualizations. We
will start by introducing the basics, followed by talking about how you can add text
and annotations to make your visualizations more comprehensible. We will continue
creating simple plots and using layouts to include multiple plots within a visualization.
At the end of the next chapter, we will explain how you can use Matplotlib to
visualize images.
3
A Deep Dive into Matplotlib
Overview
This chapter describes the fundamentals of Matplotlib and teaches you
how to create visualizations using the built-in plots that are provided by
the library. Specifically, you will create various visualizations such as bar
plots, pie charts, radar plots, histograms, and scatter plots through various
exercises and activities. You will also learn basic skills such as loading,
saving, plotting, and manipulating the color scale of images. You will
also be able to customize your visualization plots and write mathematical
expressions using TeX.
138 | A Deep Dive into Matplotlib
Introduction
In the previous chapter, we focused on various visualizations and identified which
visualization is best suited to show certain information for a given dataset. We
learned about the features, uses, and best practices for various plots
such as comparison plots, relation plots, composition plots, distribution plots,
and geoplots.
Matplotlib is probably the most popular plotting library for Python. It is used for data
science and machine learning visualizations all around the world. John Hunter was
an American neurobiologist who began developing Matplotlib in 2003. It aimed to
emulate the commands of the MATLAB software, which was the scientific standard
back then. Several features, such as the global style of MATLAB, were introduced into
Matplotlib to make the transition to Matplotlib easier for MATLAB users. This chapter
teaches you how to best utilize the various functions and methods of Matplotlib to
create insightful visualizations.
Before we start working with Matplotlib to create our first visualizations, we will need
to understand the hierarchical structure of plots in Matplotlib. We will then cover the
basic functionality, such as creating, displaying, and saving Figures. Before covering
the most common visualizations, text and legend functions will be introduced.
After that, layouts will be covered, which enable multiple plots to be combined
into one. We will end the chapter by explaining how to plot images and how to use
mathematical expressions.
Within this hierarchy, we find Python objects that control axes, tick marks, legends,
titles, text boxes, the grid, and many other objects. All of these objects can
be customized.
• Figure
The Figure is the outermost container that allows you to draw multiple plots
within it. It not only holds the Axes object but also has the ability to configure
the Title.
• Axes
An Axes object is an actual plot, or subplot, depending on whether you want to plot
single or multiple visualizations. Its sub-objects include the x-axis, y-axis, spines,
and legends.
Observing this design, we can see that this hierarchical structure allows us to create a
complex and customizable visualization.
When looking at the "anatomy" of a Figure (shown in the following diagram), we get
an idea about the complexity of a visualization. Matplotlib gives us the ability not only
to display data, but also to design the whole Figure around it by adjusting the Grid, X
and Y ticks, tick labels, and the Legend.
This implies that we can modify every single bit of a plot, starting from the Title and
Legend, right down to the major and minor ticks on the spines:
Taking a deeper look into the anatomy of a Figure object, we can observe the
following components:
• Grid: Vertical and horizontal lines used as an extension of the tick marks
• X/Y axis label: Text labels for the X and Y axes below the spines
• Minor tick: Small value indicators between the major tick marks
• Minor tick label: Text label that will be displayed at the minor ticks
Pyplot Basics | 141
• Major tick label: Text label that will be displayed at the major ticks
• Markers: Plotting type that plots every data point with a defined marker
Pyplot Basics
pyplot contains a simpler interface for creating visualizations that allows users to
plot the data without explicitly configuring the Figure and Axes themselves. They are
automatically configured to achieve the desired output. It is handy to use the alias
plt to reference the imported submodule, as follows:
import matplotlib.pyplot as plt
The following sections describe some of the common operations that are performed
when using pyplot.
Creating Figures
You can use plt.figure() to create a new Figure. This function returns a
Figure instance, but it is also passed to the backend. Every Figure-related command
that follows is applied to the current Figure and does not need to know the
Figure instance.
By default, the Figure has a width of 6.4 inches and a height of 4.8 inches with a dpi
(dots per inch) of 100. To change the default values of the Figure, we can use the
parameters figsize and dpi.
Even though it is not necessary to explicitly create a Figure, this is a good practice if
you want to create multiple Figures at the same time.
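As a minimal sketch of the figsize and dpi parameters described above (the values 10, 5, and 300 are arbitrary choices for illustration), the following snippet creates a larger, high-resolution Figure and derives its size in pixels:

```python
import matplotlib.pyplot as plt

# Create a Figure that is 10 inches wide and 5 inches high at 300 dpi
fig = plt.figure(figsize=(10, 5), dpi=300)

# The size in pixels is the size in inches multiplied by the dpi
width_px, height_px = fig.get_size_inches() * fig.dpi
print(width_px, height_px)
```

Keeping the returned Figure instance in a variable, such as fig, is optional, but it makes it easier to address a specific Figure later on.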
Closing Figures
Figures that are no longer used should be closed by explicitly calling plt.close(),
which also cleans up memory efficiently.
If nothing is specified, the plt.close() command will close the current Figure.
To close a specific Figure, you can either provide a reference to a Figure instance or
provide the Figure number. To find the number of a Figure object, we can make use
of the number attribute, as follows:
plt.gcf().number
The plt.close('all') command is used to close all active Figures. The following
example shows how a Figure can be created and closed:
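A minimal sketch of creating and closing Figures could look as follows. The Figure numbers 10 and 11 are arbitrary choices for illustration, and plt.get_fignums() is only used here to inspect which Figures are still open:

```python
import matplotlib.pyplot as plt

plt.figure(num=10)   # create a Figure with the number 10
plt.figure(num=11)   # create a second Figure; it becomes the current one
current = plt.gcf().number  # number of the current Figure

plt.close(10)        # close the Figure with the number 10
remaining = plt.get_fignums()  # numbers of the Figures that are still open

plt.close('all')     # close every remaining Figure
```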
For a small Python script that only creates a visualization, explicitly closing a Figure
isn't required, since the memory will be cleaned in any case once the program
terminates. But if you create lots of Figures, it might make sense to close Figures in
between so as to save memory.
Format Strings
Before we actually plot something, let's quickly discuss format strings. They are a
neat way to specify colors, marker types, and line styles. A format string is specified
as [color][marker][line], where each item is optional. If the color is the only
item in the format string, you can use any color specification from matplotlib.colors.
Matplotlib recognizes the following formats, among others:
• RGB or RGBA float tuples (for example, (0.2, 0.4, 0.3) or (0.2, 0.4, 0.3, 0.5))
All the available marker options are illustrated in the following figure:
All the available line styles are illustrated in the following diagram. In general, solid
lines should be used. We recommend restricting the use of dashed and dotted lines
to either visualize some bounds/targets/goals or to depict uncertainty, for example, in
a forecast:
To conclude, format strings are a handy way to quickly customize colors, marker
types, and line styles. It is also possible to use arguments, such as color, marker,
and linestyle.
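To illustrate the equivalence, the following sketch (with made-up data points) draws one line using a format string and a second line using the corresponding keyword arguments:

```python
import matplotlib.pyplot as plt

plt.figure()
# Format string: 'r' = red, 'o' = circle marker, '--' = dashed line
line_fmt, = plt.plot([0, 1, 2, 3], [2, 4, 6, 8], 'ro--')
# The same appearance expressed with keyword arguments
line_kw, = plt.plot([0, 1, 2, 3], [1, 3, 5, 7],
                    color='r', marker='o', linestyle='--')
```

The format string is more compact, whereas the keyword arguments are easier to read and accept any color specification.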
Plotting
With plt.plot([x], y, [fmt]), you can plot data points as lines and/or
markers. The function returns a list of Line2D objects representing the plotted
data. By default, if you do not provide a format string (fmt), the data points will be
connected with straight, solid lines. plt.plot([0, 1, 2, 3], [2, 4, 6,
8]) produces a plot, as shown in the following diagram. Since x is optional and the
default values are [0, …, N-1], plt.plot([2, 4, 6, 8]) results in the
same plot:
If you want to plot markers instead of lines, you can just specify a format string with
any marker type. For example, plt.plot([0, 1, 2, 3], [2, 4, 6, 8],
'o') displays data points as circles, as shown in the following diagram:
To plot multiple data pairs, the syntax plt.plot([x], y, [fmt], [x], y2,
[fmt2], …) can be used. plt.plot([2, 4, 6, 8], 'o', [1, 5, 9,
13], 's') results in the following diagram. Similarly, you can use plt.plot
multiple times, since we are working on the same Figure and Axes:
Any Line2D properties can be used instead of format strings to further customize
the plot. For example, the following code snippet shows how we can additionally
specify the linewidth and markersize arguments:
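A possible version of such a snippet is sketched below. The data values are reused from the previous examples, while the concrete values for linewidth and markersize are assumptions chosen for illustration:

```python
import matplotlib.pyplot as plt

plt.figure()
# Line2D properties passed as keyword arguments apply to all lines of the call
lines = plt.plot([2, 4, 6, 8], '-o', [1, 5, 9, 13], '-s',
                 linewidth=3, markersize=10)
```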
Besides providing data using lists or NumPy arrays, it might be handy to use pandas
DataFrames, as explained in the next section.
Ticks
Tick locations and labels can be set manually if Matplotlib's default isn't sufficient.
Considering the previous plot, it might be preferable to only have ticks at integer
values on the x-axis. One way to accomplish this is to use plt.xticks() and
plt.yticks() to either get or set the ticks manually.
plt.xticks(ticks, [labels], [**kwargs]) sets the current tick locations
and labels of the x-axis.
Parameters:
• ticks: List of tick locations; if an empty list is passed, ticks will be disabled.
• labels (optional): You can optionally pass a list of labels for the
specified locations.
import numpy as np
plt.figure(figsize=(6, 3))
plt.plot([2, 4, 6, 8], 'o', [1, 5, 9, 13], 's')
plt.xticks(ticks=np.arange(4))
plt.figure(figsize=(6, 3))
plt.plot([2, 4, 6, 8], 'o', [1, 5, 9, 13], 's')
plt.xticks(ticks=np.arange(4), \
labels=['January', 'February', 'March', 'April'], \
rotation=20)
If you want to do even more sophisticated things with ticks, you should look into tick
locators and formatters. For example,
ax.xaxis.set_major_locator(plt.NullLocator()) would remove the major ticks of
the x-axis, and ax.xaxis.set_major_formatter(plt.NullFormatter()) would
remove the major tick labels, but not the tick locations of the x-axis.
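As a small sketch of locators and formatters in action, the following snippet reuses the data from the tick examples. MultipleLocator is one of the locators in matplotlib.ticker; choosing it here, and applying the NullFormatter to the y-axis, are assumptions made for illustration:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, NullFormatter

plt.figure(figsize=(6, 3))
plt.plot([2, 4, 6, 8], 'o', [1, 5, 9, 13], 's')
ax = plt.gca()
# Place a major tick on the x-axis at every multiple of 1
ax.xaxis.set_major_locator(MultipleLocator(1))
# Keep the major tick locations of the y-axis but hide their labels
ax.yaxis.set_major_formatter(NullFormatter())
```

Compared to plt.xticks(), a locator keeps working when the data or the axis limits change, since the tick positions are computed rather than fixed.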
Displaying Figures
plt.show() is used to display a Figure or multiple Figures. To display Figures
within a Jupyter Notebook, simply set the %matplotlib inline command at the
beginning of the code.
If you forget to use plt.show(), the plot won't show up. We will learn how to save
the Figure in the next section.
Saving Figures
The plt.savefig(fname) function saves the current Figure. There are some useful
optional parameters you can specify, such as dpi, format, or transparent. The
following code snippet gives an example of how you can save a Figure:
plt.figure()
plt.plot([1, 2, 4, 5], [1, 3, 4, 3], '-o')
# bbox_inches='tight' removes the outer white margins
plt.savefig('lineplot.png', dpi=300, bbox_inches='tight')
Note
All exercises and activities will be developed in Jupyter Notebook. Please
download the GitHub repository with all the prepared templates from
https://packt.live/2HkTW1m. The datasets used in this chapter can be
downloaded from https://packt.live/3bzApYN.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(dpi=200)
4. Plot the following data pairs (x, y) as circles, which are connected via line
segments: (1, 1), (2, 3), (4, 4), and (5, 3). Then, visualize the plot:
Figure 3.12: A simple visualization created with the help of given data pairs and connected
via line segments
5. Save the plot using the plt.savefig() method. Here, we can either provide a
filename within the method or specify the full path:
plt.savefig('Exercise3.01.png', bbox_inches='tight')
Note
To access the source code for this specific section, please refer to
https://packt.live/2URkzlE.
This exercise showed you how to create a line plot in Matplotlib and how to use
format strings to quickly customize the appearance of the specified data points. Don't
forget to use bbox_inches='tight' to remove the outer white margins. In the
following section, we will cover how to further customize plots by adding text and
a legend.
Labels
Matplotlib provides a few label functions that we can use for setting labels to the x-
and y-axes. The plt.xlabel() and plt.ylabel() functions are used to set the
label for the current axes. The set_xlabel() and set_ylabel() functions are
used to set the label for specified axes.
Example:
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
You should (always) add labels to make a visualization more self-explanatory. The
same is valid for titles, which will be discussed now.
Titles
A title describes a particular chart/graph. The titles are placed above the axes in the
center, left edge, or right edge. There are two options for titles – you can either set
the Figure title or the title of an Axes. The suptitle() function sets the title for
the current or a specified Figure. The title() function sets the title for
the current or a specified Axes.
Example:
fig = plt.figure()
fig.suptitle('Suptitle', fontsize=10, fontweight='bold')
This creates a bold Figure title with the text Suptitle and a font size of 10:
plt.title('Title', fontsize=16)
The plt.title function adds a title to the current Axes, in this case with the text
Title and a font size of 16.
Basic Text and Legend Functions | 153
Text
There are two options for text – you can either add text to a Figure or text to an Axes.
The figtext(x, y, text) function adds text at position (x, y) in Figure coordinates,
whereas the text(x, y, text) function adds text at position (x, y) in the data
coordinates of the current Axes.
Example:
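The following sketch shows both options. The plotted line, the text positions, and the Figure-level text are assumptions chosen for illustration; the bbox dictionary styles the box that is drawn around the Axes text:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.gca()
ax.plot([0, 2, 4, 6, 8], [0, 2, 4, 6, 8])
# Axes text: positioned at x=4, y=6 in data coordinates
txt = ax.text(4, 6, 'Text in Data Coords',
              bbox={'facecolor': 'yellow', 'alpha': 0.5, 'pad': 10})
# Figure text: positioned at 50% of the width and 2% of the height of the Figure
fig_txt = plt.figtext(0.5, 0.02, 'Text in Figure Coords', ha='center')
```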
This creates a yellow text box with the text Text in Data Coords.
Annotations
Compared to text that is placed at an arbitrary position on the Axes, annotations are
used to annotate some features of the plot. In annotations, there are two locations
to consider: the annotated location, xy, and the location of the annotation text,
xytext. It is useful to specify the parameter arrowprops, which results in an
arrow pointing to the annotated location.
Example:
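A minimal sketch of such an annotation is given below. The plotted data and the y-axis limits are assumptions that merely provide a feature to point at:

```python
import matplotlib.pyplot as plt

plt.figure()
ax = plt.gca()
ax.plot([0, 2, 4, 6, 8, 10], [0, 1, 2, 1, 0, 1])
ax.set_ylim(0, 5)
# xy is the annotated point, xytext is where the annotation text is placed
ann = ax.annotate('Example of Annotate', xy=(4, 2), xytext=(8, 4),
                  arrowprops=dict(facecolor='green', shrink=0.05))
```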
This creates a green arrow pointing to the data coordinates (4, 2) with the text
Example of Annotate at data coordinates (8, 4):
Legends
A legend describes the content of the plot. To add a legend to your Axes, we have to
specify the label parameter at the time of plot creation. Calling plt.legend() for
the current Axes or Axes.legend() for a specific Axes will add the legend. The loc
parameter specifies the location of the legend.
Example:
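The following sketch uses made-up data and label texts. Each call to plt.plot() receives a label, and plt.legend() collects these labels into a legend in the upper-left corner:

```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot([1, 2, 3], label='Label 1')
plt.plot([2, 4, 3], label='Label 2')
# Only plots that were given a label appear in the legend
legend = plt.legend(loc='upper left')
```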
Labels, titles, text, annotations, and a legend are great ways to add textual
information to a visualization and therefore make it more understandable and
self-explanatory. But don't overdo it. Too much text can be overwhelming. The following
activity gives you the opportunity to consolidate the theoretical foundations learned
in this section.
Let's look at the following scenario: you are interested in investing in stocks. You
downloaded the stock prices for the "big five": Amazon, Google, Apple, Facebook, and
Microsoft. You want to visualize the closing prices in dollars to identify trends. This
dataset is available in the Datasets folder that you had downloaded initially. The
following are the steps to perform:
1. Import the necessary modules and enable plotting within a Jupyter Notebook.
3. Use Matplotlib to create a line chart visualizing the closing prices for the past
5 years (whole data sequence) for all five companies. Add labels, titles, and a
legend to make the visualization self-explanatory. Use plt.grid() to add a
grid to your plot. If necessary, adjust the ticks in order to make them readable.
After executing the preceding steps, the expected output should be as follows:
Note
The solution to this activity can be found on page 402.
This covers the most important things about pyplot. In the following section, we will
talk about how to create various plots in Matplotlib.
Basic Plots | 157
Basic Plots
In this section, we are going to go through the different types of simple plots. This
includes bar charts, pie charts, stacked bar and area charts, histograms, box plots,
scatter plots, and bubble plots. Please refer to the previous chapter to get more
details about these plots. More sophisticated plots, such as violin plots, will be
covered in the next chapter, using Seaborn instead of Matplotlib.
Bar Chart
The plt.bar(x, height, [width]) function creates a vertical bar plot. For horizontal
bars, use the plt.barh() function.
Important parameters:
• width (optional): Specifies the width of all bars; the default is 0.8
Example:
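A minimal sketch with made-up categories and values is given here. The width argument is set to its default value only to make the parameter visible:

```python
import matplotlib.pyplot as plt

plt.figure()
# One bar per category; the heights are given in the same order
bars = plt.bar(['A', 'B', 'C', 'D'], [20, 25, 40, 10], width=0.8)
```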
The preceding code creates a bar plot, as shown in the following diagram:
If you want to have subcategories, you have to use the plt.bar() function
multiple times with shifted x-coordinates. This is done in the following example and
illustrated in the figure that follows. NumPy's arange() function returns evenly
spaced values within a given interval. The gca() function returns the current Axes
of the current Figure. The
set_xticklabels() function is used to set the x-tick labels with the list of given
string labels.
Example:
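The approach described above can be sketched as follows. The labels, values, and bar width are assumptions chosen for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ['A', 'B', 'C', 'D']
x = np.arange(len(labels))  # evenly spaced positions 0, 1, 2, 3
width = 0.4

plt.figure()
# Shift each series by half a bar width to the left and right of the tick
plt.bar(x - width / 2, [20, 25, 40, 10], width=width)
plt.bar(x + width / 2, [30, 15, 30, 20], width=width)
# Ticks and tick labels must be set manually
plt.xticks(x)
ax = plt.gca()
ax.set_xticklabels(labels)
```

With more than two subcategories, the width and the shifts have to be reduced accordingly so that the bars of one group do not overlap with the next group.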
After providing the theoretical foundation for creating bar charts in Matplotlib, you
can apply your acquired knowledge in practice with the following activity.
1. Import the necessary modules and enable plotting within a Jupyter Notebook.
3. Use Matplotlib to create a visually appealing bar plot comparing the two scores
for all five movies.
4. Use the movie titles as labels for the x-axis. Use percentages at intervals of 20
for the y-axis and minor ticks at intervals of 5. Add a legend and a suitable title to
the plot.
5. Use functions that are required to explicitly specify the axes. To get the reference
to the current axes, use ax = plt.gca(). To add minor y-ticks, use
Axes.set_yticks([ticks], minor=True). To add a horizontal grid for major
ticks, use Axes.yaxis.grid(which='major'), and to add a dashed
horizontal grid for minor ticks, use
Axes.yaxis.grid(which='minor', linestyle='--').
Note
The solution to this activity can be found on page 404.
After practicing the creation of bar plots, we will discuss how to create pie charts in
Matplotlib in the following section.
Pie Chart
The plt.pie(x, [explode], [labels], [autopct]) function creates a
pie chart.
Important parameters:
• explode (optional): Specifies the fraction of the radius offset for each slice. The
explode-array must have the same length as the x-array.
Example:
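A minimal sketch with made-up fractions and labels is shown below. The first slice is offset by 10% of the radius, and autopct prints the percentage of each slice:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))
plt.pie([0.4, 0.3, 0.2, 0.1],
        explode=(0.1, 0, 0, 0),       # one entry per slice
        labels=['A', 'B', 'C', 'D'],
        autopct='%.0f%%')             # percentage without decimals
ax = plt.gca()
```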
After this short introduction to pie charts, we will create a more sophisticated
pie chart that visualizes the water usage in a common household in the
following exercise.
# Import statements
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
data = pd.read_csv('../../Datasets/water_usage.csv')
4. Use a pie chart to visualize water usage. Highlight one usage of your choice using
the explode parameter. Show the percentages for each slice and add a title:
# Create figure
plt.figure(figsize=(8, 8), dpi=300)
# Create pie plot
plt.pie('Percentage', explode=(0, 0, 0.1, 0, 0, 0), \
labels='Usage', data=data, autopct='%.0f%%')
# Add title
plt.title('Water usage')
# Show plot
plt.show()
Note
To access the source code for this specific section, please refer to
https://packt.live/3frXRrZ.
In the next section, we will learn how to generate a stacked bar chart and implement
an activity on it.
plt.bar(x, bars1)
plt.bar(x, bars2, bottom=bars1)
plt.bar(x, bars3, bottom=np.add(bars1, bars2))
Let's get some more practice with stacked bar charts in the following activity.
Use the dataset tips from Seaborn, which contains multiple entries of restaurant bills,
and create a matrix where the elements contain the sum of the total bills for each day
and smokers/non-smokers:
Note
For this activity, we will import the Seaborn library using import seaborn as sns.
The dataset can be loaded using this code:
bills = sns.load_dataset('tips').
We will learn in detail about this in Chapter 4, Simplifying Visualizations
Using Seaborn.
1. Import all the necessary dependencies and load the tips dataset. Note that we
have to import the Seaborn library to load the dataset.
2. Use the given dataset and create a matrix where the elements contain the sum
of the total bills for each day and split according to smokers/non-smokers.
3. Create a stacked bar plot, stacking the summed total bills separated according to
smoker and non-smoker for each day.
After executing the preceding steps, the expected output should be as follows:
Note
The solution to this activity can be found on page 406.
In the following section, stacked area charts will be covered, which, in comparison
to stacked bar charts, are suited to visualizing part-of-a-whole relationships for time
series data.
• y: Specifies the y-values of the data series. Multiple series can be passed either
as a 2D array or as any number of 1D arrays, for example,
plt.stackplot(x, y1, y2, y3, …).
• labels (optional): Specifies the labels as a list or tuple for each data series.
Example:
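The following sketch stacks two made-up data series that are passed as separate 1D arrays. The label texts are assumptions as well:

```python
import matplotlib.pyplot as plt

plt.figure()
# x-values first, followed by one list of y-values per series
polys = plt.stackplot([1, 2, 3, 4], [2, 4, 5, 8], [1, 5, 4, 2],
                      labels=['Series A', 'Series B'])
plt.legend(loc='upper left')
```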
Let's get some more practice regarding stacked area charts in the following activity.
Activity 3.04: Comparing Smartphone Sales Units Using a Stacked Area Chart
In this activity, we will compare smartphone sales units using a stacked area chart.
Let's look at the following scenario: you want to invest in one of the five biggest
smartphone manufacturers. Looking at the quarterly sales units as part of a whole
may be a good indicator of which company to invest in:
1. Import the necessary modules and enable plotting within a Jupyter Notebook.
After executing the preceding steps, the expected output should be as follows:
Figure 3.24: Stacked area chart comparing sales units of different smartphone
manufacturers
Note
The solution to this activity can be found on page 409.
In the following section, the histogram will be covered, which helps to visualize the
distribution of a single numerical variable.
Histogram
A histogram visualizes the distribution of a single numerical variable. Each bar
represents the frequency for a certain interval. The plt.hist(x) function creates a
histogram.
Important parameters:
• bins (optional): Specifies the number of bins as an integer or specifies the bin
edges as a list.
• range (optional): Specifies the lower and upper range of the bins as a tuple.
Example:
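A minimal sketch using both parameters is given here. The sample data is randomly generated for illustration, and the number of bins and the range are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=0, scale=1, size=1000)  # illustrative sample data

plt.figure()
# 30 equal-width bins between -3 and 3; values outside the range are ignored
counts, bin_edges, patches = plt.hist(x, bins=30, range=(-3, 3))
```

The function returns the counts per bin and the bin edges, which can be handy if you need the underlying numbers in addition to the plot.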
Histograms are a good way to visualize an estimated density of your data. If you're
only interested in summary statistics, such as central tendency or dispersion, box
plots, which are covered next, are more suitable.
Box Plot
The box plot shows multiple statistical measurements. The box extends from the
lower to the upper quartile values of the data, thereby allowing us to visualize the
interquartile range. For more details regarding the plot, refer to the previous chapter.
The plt.boxplot(x) function creates a box plot.
Important parameters:
• x: Specifies the input data. It specifies either a 1D array for a single box, or a
sequence of arrays for multiple boxes.
• notch (optional): If true, notches will be added to the plot to indicate the
confidence interval around the median.
Example:
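The following sketch draws two notched boxes from randomly generated samples. The data and the tick labels A and B are assumptions chosen for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
x1 = rng.normal(loc=100, scale=15, size=100)  # illustrative sample data
x2 = rng.normal(loc=110, scale=10, size=100)

plt.figure()
# A sequence of two arrays produces two boxes; notch=True adds the notches
result = plt.boxplot([x1, x2], notch=True)
# By default, the boxes are placed at the x-positions 1 and 2
plt.xticks([1, 2], ['A', 'B'])
```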
Now that we've introduced histograms and box plots in Matplotlib, our theoretical
knowledge can be practiced in the following activity, where both charts are used to
visualize data regarding the intelligence quotient.
Note
The plt.axvline(x, [color=…], [linestyle=…]) function
draws a vertical line at position x.
1. Import the necessary modules and enable plotting within a Jupyter Notebook.
# IQ samples
iq_scores = [126, 89, 90, 101, 102, 74, 93, 101, 66, \
120, 108, 97, 98, 105, 119, 92, 113, 81, \
104, 108, 83, 102, 105, 111, 102, 107, 103, \
89, 89, 110, 71, 110, 120, 85, 111, 83, 122, \
120, 102, 84, 118, 100, 100, 114, 81, 109, 69, \
97, 95, 106, 116, 109, 114, 98, 90, 92, 98, \
91, 81, 85, 86, 102, 93, 112, 76, 89, 110, \
75, 100, 90, 96, 94, 107, 108, 95, 96, 96, \
114, 93, 95, 117, 141, 115, 95, 86, 100, 121, \
103, 66, 99, 96, 111, 110, 105, 110, 91, 112, \
102, 112, 75]
3. Plot a histogram with 10 bins for the given IQ scores. IQ scores are normally
distributed with a mean of 100 and a standard deviation of 15. Visualize the
mean as a vertical solid red line, and the standard deviation using dashed
vertical lines. Add labels and a title. The expected output is as follows:
4. Create a box plot to visualize the same IQ scores. Add labels and a title. The
expected output is as follows:
5. Create a box plot for each of the IQ scores of the different test groups. Add
labels and a title. The following are IQ scores for different test groups that we
can use as data:
group_a = [118, 103, 125, 107, 111, 96, 104, 97, 96, \
114, 96, 75, 114, 107, 87, 117, 117, 114, \
117, 112, 107, 133, 94, 91, 118, 110, 117, \
86, 143, 83, 106, 86, 98, 126, 109, 91, 112, \
120, 108, 111, 107, 98, 89, 113, 117, 81, 113, \
112, 84, 115, 96, 93, 128, 115, 138, 121, 87, \
112, 110, 79, 100, 84, 115, 93, 108, 130, 107, \
106, 106, 101, 117, 93, 94, 103, 112, 98, 103, \
70, 139, 94, 110, 105, 122, 94, 94, 105, 129, \
110, 112, 97, 109, 121, 106, 118, 131, 88, 122, \
125, 93, 78]
group_b = [126, 89, 90, 101, 102, 74, 93, 101, 66, \
120, 108, 97, 98, 105, 119, 92, 113, 81, \
104, 108, 83, 102, 105, 111, 102, 107, 103, \
89, 89, 110, 71, 110, 120, 85, 111, 83, \
122, 120, 102, 84, 118, 100, 100, 114, 81, \
109, 69, 97, 95, 106, 116, 109, 114, 98, \
90, 92, 98, 91, 81, 85, 86, 102, 93, 112, \
76, 89, 110, 75, 100, 90, 96, 94, 107, 108, \
95, 96, 96, 114, 93, 95, 117, 141, 115, 95, \
86, 100, 121, 103, 66, 99, 96, 111, 110, 105, \
110, 91, 112, 102, 112, 75]
group_c = [108, 89, 114, 116, 126, 104, 113, 96, 69, 121, \
109, 102, 107, 122, 104, 107, 108, 137, 107, 116, \
98, 132, 108, 114, 82, 93, 89, 90, 86, 91, \
99, 98, 83, 93, 114, 96, 95, 113, 103, 81, \
107, 85, 116, 85, 107, 125, 126, 123, 122, 124, \
115, 114, 93, 93, 114, 107, 107, 84, 131, 91, \
108, 127, 112, 106, 115, 82, 90, 117, 108, 115, \
113, 108, 104, 103, 90, 110, 114, 92, 101, 72, \
109, 94, 122, 90, 102, 86, 119, 103, 110, 96, \
90, 110, 96, 69, 85, 102, 69, 96, 101, 90]
group_d = [93, 99, 91, 110, 80, 113, 111, 115, 98, 74, \
96, 80, 83, 102, 60, 91, 82, 90, 97, 101, \
89, 89, 117, 91, 104, 104, 102, 128, 106, 111, \
79, 92, 97, 101, 106, 110, 93, 93, 106, 108, \
85, 83, 108, 94, 79, 87, 113, 112, 111, 111, \
79, 116, 104, 84, 116, 111, 103, 103, 112, 68, \
54, 80, 86, 119, 81, 84, 91, 96, 116, 125, \
99, 58, 102, 77, 98, 100, 90, 106, 109, 114, \
102, 102, 112, 103, 98, 96, 85, 97, 110, 131, \
92, 79, 115, 122, 95, 105, 74, 85, 85, 95]
Note
The solution to this activity can be found on page 411.
Scatter Plot
Scatter plots show data points for two numerical variables, with one variable on
each axis. plt.scatter(x, y) creates a scatter plot of y versus x, with
optionally varying marker size and/or color.
Important parameters:
Example:
plt.scatter(x, y)
Note
The Axes.set_xscale('log') and the Axes.set_
yscale('log') change the scale of the x-axis and y-axis to a
logarithmic scale, respectively.
Let's visualize the correlation between body mass and maximum longevity for various animals with the help of a scatter plot:
# Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
data = pd.read_csv('../../Datasets/anage_data.csv')
4. The given dataset is not complete. Filter the data so that you end up with
samples containing a body mass and a maximum longevity. Sort the data
according to the animal class. Here, the np.isfinite() function checks whether
each element is finite, that is, neither infinite nor NaN:
# Preprocessing
longevity = 'Maximum longevity (yrs)'
mass = 'Body mass (g)'
data = data[np.isfinite(data[longevity]) \
& np.isfinite(data[mass])]
5. Create a scatter plot visualizing the correlation between the body mass and the
maximum longevity. Use different colors to group data samples according to
their class. Add a legend, labels, and a title. Use a log scale for both the x-axis
and y-axis:
# Create figure
plt.figure(figsize=(10, 6), dpi=300)
# Create scatter plot
plt.scatter(amphibia[mass], amphibia[longevity], \
label='Amphibia')
plt.scatter(aves[mass], aves[longevity], \
label='Aves')
plt.scatter(mammalia[mass], mammalia[longevity], \
label='Mammalia')
plt.scatter(reptilia[mass], reptilia[longevity], \
label='Reptilia')
# Add legend
plt.legend()
# Log scale
ax = plt.gca()
ax.set_xscale('log')
ax.set_yscale('log')
# Add labels
plt.xlabel('Body mass in grams')
plt.ylabel('Maximum longevity in years')
# Show plot
plt.show()
From the preceding output, we can see the relationship between body mass in grams
and maximum longevity in years for the various animal classes.
Note
To access the source code for this specific section, please refer to
https://packt.live/3fsozRf.
Bubble Plot
The plt.scatter function is used to create a bubble plot. To visualize a third or
fourth variable, the parameters s (size) and c (color) can be used.
Example:
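A minimal sketch with randomly generated data is given below. The factor of 500 and the alpha value are assumptions chosen so that the bubbles are clearly visible:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
x = rng.random(20)
y = rng.random(20)
z = rng.random(20)   # third variable, mapped to the marker size
c = rng.random(20)   # fourth variable, mapped to the marker color

plt.figure()
# s is given in points squared, so the raw values are scaled up to be visible
points = plt.scatter(x, y, s=z * 500, c=c, alpha=0.5)
cbar = plt.colorbar()
```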
The colorbar function adds a colorbar to the plot, which indicates the value of the
color. The result is shown in the following diagram:
Layouts
There are multiple ways to define a visualization layout in Matplotlib. By layout, we
mean the arrangement of multiple Axes within a Figure. We will start with subplots
and how to use the tight layout to create visually appealing plots and then cover
GridSpec, which offers a more flexible way to create multi-plots.
Layouts | 181
Subplots
It is often useful to display several plots next to one another. Matplotlib offers the
concept of subplots, which are multiple Axes within a Figure. These plots can be grids
of plots, nested plots, and so on.
To share the x-axis or y-axis, the parameters sharex and sharey must be set,
respectively. The shared axes will have the same limits, ticks, and scale.
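Two common ways of creating a 2x2 grid of subplots are sketched here. The series variable holds made-up data for illustration:

```python
import matplotlib.pyplot as plt

# Illustrative data: one series per subplot
series = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4], [2, 2, 3, 3]]

# Example 1a: create all Axes at once and iterate over the returned 2x2 array
fig, axes = plt.subplots(2, 2)
for ax, data in zip(axes.ravel(), series):
    ax.plot(data)

# Example 1b: add one Axes at a time; the index counts from 1, row by row
fig2 = plt.figure()
for i, data in enumerate(series):
    plt.subplot(2, 2, i + 1)
    plt.plot(data)
```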
Both examples yield the same result, as shown in the following diagram:
Example 2:
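A sketch of shared axes, again with made-up data, could look as follows. One series deliberately uses a much larger value range to make the effect of sharing visible:

```python
import matplotlib.pyplot as plt

# Illustrative data; the second series has a much larger value range
series = [[1, 2, 3, 4], [40, 30, 20, 10], [1, 3, 2, 4], [2, 2, 3, 3]]

# All subplots share the limits, ticks, and scale of both axes
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax, data in zip(axes.ravel(), series):
    ax.plot(data)
```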
Setting sharex and sharey to True results in the following diagram. This allows
for a better comparison:
Subplots are an easy way to create a Figure with multiple plots of the same size
placed in a grid. They are not really suited for more sophisticated layouts.
Tight Layout
The plt.tight_layout() function adjusts subplot parameters (primarily padding
between the Figure edge and the edges of subplots, and padding between the edges
of adjacent subplots) so that the subplots fit well in the Figure.
Examples:
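The following sketch creates four subplots with titles and axis labels, which is a typical situation in which neighboring subplots overlap. The data and the texts are placeholders:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)
for ax in axes.ravel():
    ax.plot([1, 2, 3, 4])
    ax.set_title('Title')
    ax.set_xlabel('X Label')
    ax.set_ylabel('Y Label')

# Without this call, titles and axis labels of neighboring subplots can overlap
plt.tight_layout()
```

Commenting out the last line and comparing both results is a quick way to see the effect of the tight layout.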
Radar Charts
Radar charts, also known as spider or web charts, visualize multiple variables, with
each variable plotted on its own axis, resulting in a polygon. All axes are arranged
radially, starting at the center with equal distance between each other, and have the
same scale.
# Import settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
"""
Sample data
Attributes: Efficiency, Quality, Commitment, Responsible Conduct,
Cooperation
"""
data = \
pd.DataFrame({'Employee': ['Alex', 'Alice', \
'Chris', 'Jennifer'], \
'Efficiency': [5, 4, 4, 3,],
'Quality': [5, 5, 3, 3],
'Commitment': [5, 4, 4, 4],
'Responsible Conduct': [4, 4, 4, 3],
'Cooperation': [4, 3, 4, 5]})
attributes = list(data.columns[1:])
values = list(data.values[:, 1:])
employees = list(data.values[:, 0])
angles = [n / float(len(attributes)) * 2 \
* np.pi for n in range(len(attributes))]
# Close the plot
angles += angles[:1]
values = np.asarray(values)
values = np.concatenate([values, values[:, 0:1]], axis=1)
5. Create subplots with the polar projection. Set a tight layout so that
nothing overlaps:
# Create figure
plt.figure(figsize=(8, 8), dpi=150)
# Create subplots
for i in range(4):
    ax = plt.subplot(2, 2, i + 1, polar=True)
    ax.plot(angles, values[i])
    ax.set_yticks([1, 2, 3, 4, 5])
    # Skip the duplicated closing angle so that ticks and labels match in number
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(attributes)
    ax.set_title(employees[i], fontsize=14, color='r')
# Set tight layout
plt.tight_layout()
# Show plot
plt.show()
From the preceding output, we can clearly see how the various team members have
performed in terms of metrics such as Quality, Efficiency, Cooperation, Responsible
Conduct, and Commitment. You can easily draw the conclusion that Alex outperforms
his colleagues when all metrics are considered. In the next section, we will learn how to
use the GridSpec function.
Note
To access the source code for this specific section, please refer to
https://packt.live/3e6is4X.
GridSpec
The matplotlib.gridspec.GridSpec(nrows, ncols) function specifies the
geometry of the grid in which a subplot will be placed. For example, you can specify
a grid with three rows and four columns. As a next step, you have to define which
elements of the gridspec are used by a subplot; elements of a gridspec are accessed
in the same way as NumPy arrays. You could, for example, only use a single element
of a gridspec for a subplot and therefore end up with 12 subplots in total. Another
possibility, as shown in the following example, is to create a bigger subplot using 3x3
elements of the gridspec and another three subplots with a single element each.
Example:
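import matplotlib.gridspec
import matplotlib.pyplot as plt
# series is assumed to be a sequence of at least four plottable datasets
gs = matplotlib.gridspec.GridSpec(3, 4)
ax1 = plt.subplot(gs[:3, :3])
ax2 = plt.subplot(gs[0, 3])
ax3 = plt.subplot(gs[1, 3])
ax4 = plt.subplot(gs[2, 3])
ax1.plot(series[0])
ax2.plot(series[1])
ax3.plot(series[2])
ax4.plot(series[3])
plt.tight_layout()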
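The preceding snippet relies on a series variable that is defined elsewhere. As a self-contained sketch of the same layout, the following version generates four hypothetical random-walk series (an assumption made purely for illustration) and places them in one large subplot and three single-cell subplots:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

# Hypothetical data: four random-walk series (not part of the workshop datasets)
rng = np.random.default_rng(0)
series = [np.cumsum(rng.normal(size=50)) for _ in range(4)]

fig = plt.figure(figsize=(8, 6))
gs = GridSpec(3, 4, figure=fig)

# One large subplot spanning all three rows and the first three columns
ax_main = fig.add_subplot(gs[:3, :3])
# Three single-cell subplots stacked in the last column
ax_side = [fig.add_subplot(gs[row, 3]) for row in range(3)]

ax_main.plot(series[0])
for ax, s in zip(ax_side, series[1:]):
    ax.plot(s)

fig.tight_layout()
```

Slicing the gridspec works like slicing a NumPy array, so gs[:3, :3] selects a 3x3 block, while gs[0, 3] selects a single cell. When running outside a notebook, call plt.show() to display the Figure.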
1. Import the necessary modules and enable plotting within a Jupyter Notebook.
2. The AnAge dataset, which was used in the previous exercise, is not complete.
Filter the data so that you keep only the samples that contain both a body mass
and a maximum longevity. Then, select all of the samples of the Aves class with a
body mass of less than 20,000.
3. Create a Figure with a constrained layout. Create a gridspec of size 4x4. Create a
scatter plot of size 3x3 and marginal histograms of size 1x3 and 3x1. Add labels
and a Figure title.
After executing the preceding steps, the expected output should be as follows:
Note
The solution to this activity can be found on page 415.
Next, we will learn how to work with image data in our visualizations.
Images
If you want to include images in your visualizations or work with image data,
Matplotlib offers several functions for you. In this section, we will show you how to
load, save, and plot images with Matplotlib.
Note
The images that are used in this section are sourced from
https://unsplash.com/.
Loading Images
If you encounter image formats that are not supported by Matplotlib, we recommend
using the Pillow library to load the image. In Matplotlib, loading images is part of the
image submodule. We use the alias mpimg for the submodule, as follows:
import os
import matplotlib.image as mpimg
img_filenames = os.listdir('../../Datasets/images')
imgs = \
[mpimg.imread(os.path.join('../../Datasets/images', \
img_filename)) \
for img_filename in img_filenames]
The os.listdir() method in Python is used to get the list of all files and
directories in the specified directory and then the os.path.join() function is
used to join one or more path components intelligently.
Saving Images
Sometimes, it might be helpful to get an insight into the color values. We can simply
add a color bar to the image plot. It is recommended to use a colormap with high
contrast—for example, jet:
plt.imshow(img, cmap='jet')
plt.colorbar()
Another way to get insight into the image values is to plot a histogram, as shown in
the following diagram. To plot the histogram for an image array, the array has to be
flattened using numpy.ravel:
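As a minimal sketch of this idea, the following snippet uses a randomly generated array as a stand-in for a loaded image (the array and its size are assumptions for illustration), flattens it with ravel(), and plots the histogram of its pixel values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for a loaded image: 64x64 RGB values in [0, 255]
rng = np.random.default_rng(42)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# ravel() flattens the (height, width, channels) array into one dimension
flat = img.ravel()

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(flat, bins=256, range=(0, 256))
ax.set_xlabel('Pixel value')
ax.set_ylabel('Count')
```

With a real image, replace img with the array returned by mpimg.imread(). Note that some formats are loaded as floats in the range 0 to 1, in which case the range argument would need to be adjusted.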
To plot multiple images in a grid, we can simply use plt.subplots and plot an
image per Axes:
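A possible sketch of such a grid is shown here. It uses four randomly generated arrays in place of real images, which is an assumption made so that the snippet runs without any files:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for four loaded images (float RGB values in [0, 1))
rng = np.random.default_rng(1)
imgs = [rng.random((32, 32, 3)) for _ in range(4)]

fig, axes = plt.subplots(2, 2, figsize=(6, 6))
# axes is a 2x2 array of Axes; ravel() lets us iterate over it in reading order
for ax, img in zip(axes.ravel(), imgs):
    ax.imshow(img)
```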
In some situations, it would be neat to remove the ticks and add labels.
axes.set_xticks([]) and axes.set_yticks([]) remove x-ticks and y-ticks,
respectively. axes.set_xlabel('label') adds a label:
fig, axes = plt.subplots(1, 2)
labels = ['coast', 'beach']
for i in range(2):
    axes[i].imshow(imgs[i])
    axes[i].set_xticks([])
    axes[i].set_yticks([])
    axes[i].set_xlabel(labels[i])
1. Import the necessary modules and enable plotting within a Jupyter Notebook.
3. Visualize the images in a 2x2 grid. Remove the axes and give each image a label.
After executing the preceding steps, the expected output should be as follows:
Note
The solution to this activity can be found on page 418.
In this activity, we have plotted images in a 2x2 grid. In the next section, we will learn
the basics of how to write and plot a mathematical expression.
Writing Mathematical Expressions
plt.xlabel(r'$x$')
plt.ylabel(r'$\cos(x)$')
TeX examples:
• r'$\alpha_i>\beta_i$' produces αᵢ > βᵢ.
• r'$\sqrt[3]{8}$' produces ∛8 (the cube root of 8).
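The following sketch puts these expressions on a small cosine plot. It uses raw strings (the r prefix), because in a regular Python string sequences such as \a and \b are interpreted as control characters before Matplotlib ever sees them. The plot itself is only an illustrative assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.cos(x))
# Raw strings keep the backslashes intact for Matplotlib's math renderer
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$\cos(x)$')
ax.set_title(r'$\alpha_i>\beta_i$ and $\sqrt[3]{8}$')
```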
Summary
In this chapter, we provided a detailed introduction to Matplotlib, one of the most
popular visualization libraries for Python. We started off with the basics of pyplot
and its operations, and then followed up with a deep insight into the numerous
possibilities that help to enrich visualizations with text. Using practical examples, this
chapter covered the most popular plotting functions that Matplotlib offers, including
comparison charts, and composition and distribution plots. It concluded with how to
visualize images and write mathematical expressions.
In the next chapter, we will learn about the Seaborn library. Seaborn is built on top
of Matplotlib and provides a higher-level abstraction to create visualizations in an
easier way. One neat feature of Seaborn is the easy integration of DataFrames from
the pandas library. Furthermore, Seaborn offers a few more plots out of the box,
including more advanced visualizations, such as violin plots.
4
Simplifying Visualizations
Using Seaborn
Overview
In this chapter, we will see how Seaborn differs from Matplotlib and
construct effective plots leveraging the advantages of Seaborn. Specifically,
you will use Seaborn to plot bivariate distributions, heatmaps, pairwise
relationships, and so on. This chapter also teaches you how to use
FacetGrid for visualizing plots for multiple variables separately. By the end
of this chapter, you will be able to explain the advantages Seaborn has
compared to Matplotlib and design visually appealing and insightful
plots efficiently.
Introduction
In the previous chapter, we took an in-depth look at Matplotlib, one of the most
popular plotting libraries for Python. Various plot types were covered, and we looked
into customizing plots to create aesthetic plots.
With Seaborn, we attempt to make visualization a central part of data exploration and
understanding. Internally, Seaborn operates on DataFrames and arrays that contain
the complete dataset. This enables it to perform semantic mappings and statistical
aggregations that are essential for displaying informative visualizations. Seaborn can
also be used to simply change the style and appearance of Matplotlib visualizations.
• Built-in color palettes that can be used to reveal patterns in the dataset
• A dataset-oriented interface
Advantages of Seaborn
Working with DataFrames using Matplotlib adds some inconvenient overhead. For
example, simply exploring your dataset can take up a lot of time, since you require
some additional data wrangling to be able to plot the data from the DataFrames
using Matplotlib.
Seaborn, however, is built to operate on DataFrames and full dataset arrays, which
makes this process simpler. It internally performs the necessary semantic mappings
and statistical aggregation to produce informative plots.
Note
The American Community Survey (ACS) Public-Use Microdata
Samples (PUMS) dataset (one-year estimate from 2017) from
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html
is used in this chapter. This dataset is later used
in Chapter 07, Combining What We Have Learned. This dataset can also be
downloaded from GitHub. Here is the link: https://packt.live/3bzApYN.
Seaborn uses Matplotlib to draw plots. Even though many tasks can be accomplished
with just Seaborn, further customization might require the usage of Matplotlib. In
Seaborn, you only provide the names of the variables in the dataset and the roles
they play in the plot. Unlike in Matplotlib, it is not necessary to translate the
variables into parameters of the visualization.
Other potential obstacles are the default Matplotlib parameters and configurations.
The default parameters in Seaborn provide better visualizations without
additional customization. We will look at these default parameters in detail in the
upcoming topics.
For users who are already familiar with Matplotlib, the extension with Seaborn is self-
evident, since the core concepts are mostly similar.
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
Seaborn categorizes Matplotlib's parameters into two groups. The first group
contains parameters for the aesthetics of the plot, while the second group scales
various elements of the plot so that it can be easily used in different contexts, such as
visualizations that are used for presentations and posters.
Here is an example:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
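The two parameter groups mentioned before this example can also be inspected directly. As a rough sketch, sns.axes_style() returns the aesthetic parameters of a style and sns.plotting_context() returns the scaling parameters of a context, both as dictionary-like objects:

```python
import seaborn as sns

# Group 1: aesthetic parameters of a style
style_params = sns.axes_style("whitegrid")
# Group 2: scaling parameters of a context
context_params = sns.plotting_context("poster")

print("Grid enabled:", style_params["axes.grid"])
print("Font size:", context_params["font.size"])
```

Looking at these dictionaries is a quick way to find out which Matplotlib settings a given style or context actually changes.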
Here is an example:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.figure()
x1 = [10, 20, 5, 40, 8]
The aesthetics are only changed temporarily. The result is shown in the
following diagram:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
sns.despine()
plt.legend()
plt.show()
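Changing the aesthetics only temporarily, as mentioned earlier, can be sketched with sns.axes_style() used as a context manager. The style names and the check of the axes.grid setting are chosen here only for illustration:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")

# Inside the with block, the darkgrid style applies
with sns.axes_style("darkgrid"):
    fig, ax = plt.subplots()
    ax.plot([10, 20, 5, 40, 8], label='Group A')
    ax.plot([30, 43, 9, 7, 20], label='Group B')
    ax.legend()
    grid_inside = plt.rcParams["axes.grid"]

# After the block, the previous style (white) is active again
grid_after = plt.rcParams["axes.grid"]
print(grid_inside, grid_after)
```

Only Axes created inside the with block pick up the temporary style, so the Figure or subplot has to be created within it.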
In the next section, we will learn to control the scale of plot elements.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")
plt.figure()
x1 = [10, 20, 5, 40, 8]
x2 = [30, 43, 9, 7, 20]
plt.plot(x1, label='Group A')
plt.plot(x2, label='Group B')
plt.legend()
plt.show()
Contexts are an easy way to use preconfigured scales of plot elements for different
use cases. We will apply them in the following exercise, which uses a box plot to
compare the IQ scores of different test groups.
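To get a feeling for what a context changes, the following sketch compares the base font size that Seaborn uses in each of its four built-in contexts:

```python
import seaborn as sns

contexts = ["paper", "notebook", "talk", "poster"]
# plotting_context() returns the scaling parameters without applying them
font_sizes = {name: sns.plotting_context(name)["font.size"] for name in contexts}

for name, size in font_sizes.items():
    print(f"{name}: font.size = {size}")
```

The sizes grow from paper to poster, which is why the same plot looks appropriately scaled on a slide or a printed poster without manual tuning.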
Note
All the exercises and activities in this chapter are developed using Jupyter
Notebook. The files can be downloaded from the following link:
https://packt.live/2ONDmLl. All the datasets used in this chapter can be found at
https://packt.live/3bzApYN.
Exercise 4.01: Comparing IQ Scores for Different Test Groups by Using a Box Plot
In this exercise, we will generate a box plot using Seaborn. We will compare IQ scores
among different test groups using a box plot of the Seaborn library to demonstrate
how easy and efficient it is to create plots with Seaborn provided that we have a
proper DataFrame. This exercise also shows how to quickly change the style and
context of a Figure using the pre-configurations supplied by Seaborn.
Let's compare IQ scores among different test groups using the Seaborn library:
3. Use the pandas read_csv() function to read the data located in the
Datasets folder:
mydata = pd.read_csv("../../Datasets/iq_scores.csv")
4. Access the data of each test group in the column. Convert this into a list using
the tolist() method. Once the data of each test group has been converted
into a list, assign this list to variables of each respective test group:
group_a = mydata[mydata.columns[0]].tolist()
group_b = mydata[mydata.columns[1]].tolist()
group_c = mydata[mydata.columns[2]].tolist()
group_d = mydata[mydata.columns[3]].tolist()
5. Print the values of each group to check whether the data inside it is converted
into a list. This can be done with the help of the print() function:
print(group_a)
print(group_b)
print(group_c)
print(group_d)
6. Once we have the data for each test group, we need to construct a DataFrame
from this data. This can be done with the help of the pd.DataFrame()
function, which is provided by pandas:
7. If you don't create your own DataFrame, it is often helpful to print the column
names, which is done by calling print(data.columns). The output is
as follows:
You can see that our DataFrame has two variables with the labels Groups and
IQ score. This is especially interesting since we can use them to specify which
variable to plot on the x-axis and which one on the y-axis.
8. Now, since we have the DataFrame, we need to create a box plot using the
boxplot() function provided by Seaborn. Within this function, specify the
variables for both the axes along with the DataFrame. Make Groups the variable
to plot on the x-axis, and IQ score the variable for the y-axis. Pass data as
a parameter. Here, data is the DataFrame that we obtained from the previous
step. Moreover, use the whitegrid style, set the context to talk, and remove
all axes spines, except the one on the bottom:
plt.figure(dpi=150)
# Set style
sns.set_style('whitegrid')
# Create boxplot
sns.boxplot(x='Groups', y='IQ score', data=data)
# Despine
sns.despine(left=True, right=True, top=True)
# Add title
plt.title('IQ scores for different test groups')
# Show plot
plt.show()
The despine() function helps in removing the top and right spines from the
plot by default (without passing any arguments to the function). Here, we have
also removed the left spine. Using the title() function, we have set the title
for our plot. The show() function visualizes the plot.
After executing the preceding steps, the final output should be as follows:
From the preceding diagram, we can conclude that Seaborn offers visually appealing
plots out of the box and allows easy customization, such as changing the style,
context, and spines. Once a suitable DataFrame exists, the plotting is achieved with
a single function. Column names are automatically used for labeling the axis. Even
categorical variables are supported out of the box.
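The DataFrame construction described in step 6 can be sketched as follows. The scores and the group labels (Group A to Group D) are hypothetical stand-ins for the lists read from iq_scores.csv; only the column names Groups and IQ score are taken from the exercise:

```python
import pandas as pd

# Hypothetical IQ scores standing in for the lists created in step 4
group_a = [118, 103, 125]
group_b = [126, 89, 90]
group_c = [108, 89, 114]
group_d = [93, 99, 91]

# Long format: one row per observation, with a label column and a value column
data = pd.DataFrame({
    'Groups': ['Group A'] * len(group_a) + ['Group B'] * len(group_b)
              + ['Group C'] * len(group_c) + ['Group D'] * len(group_d),
    'IQ score': group_a + group_b + group_c + group_d,
})
print(data.columns)
```

This long format, with one categorical column and one numerical column, is what allows the box plot to be created by naming just the two columns.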
Note
To access the source code for this specific section, please refer to
https://packt.live/3hwvR8m.
Another great advantage of Seaborn is color palettes, which are introduced in the
following section.
Color Palettes
Color is a very important factor for your visualization. Color can reveal patterns in
data if used effectively or hide patterns if used poorly. Seaborn makes it easy to
select and use color palettes that are suited to your task. The color_palette()
function provides an interface for many of the possible ways to generate
color palettes.
You can set the palette for all plots with set_palette(). This function accepts the
same arguments as color_palette(). In the following sections, we will explain
how color palettes are divided into different groups.
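As a brief sketch of how the two functions relate, the following snippet creates a palette with color_palette(), sets the same palette as the default with set_palette(), and then reads the current default back. The choice of the colorblind palette with four colors is arbitrary:

```python
import seaborn as sns

# color_palette() returns a list-like object of RGB tuples
palette = sns.color_palette("colorblind", 4)

# set_palette() accepts the same arguments and makes the palette the default
sns.set_palette("colorblind", 4)

# Called without arguments, color_palette() returns the current default palette
current = sns.color_palette()
print(len(palette), len(current))
```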
Choosing the best color palette is not straightforward and, to some extent, subjective.
To make a good decision, you have to know the characteristics of your data. There are
three general groups of color palettes, namely, categorical, sequential, and diverging,
which we will break down in the following sections.
Some examples where it is suitable to use categorical color palettes are line charts
showing stock trends for different companies, and a bar chart with subcategories;
basically, any time you want to group your data.
There are six default categorical palettes in Seaborn: deep, muted, bright, pastel,
dark, and colorblind. The code for each palette is provided in the following
snippets. Which of these palettes you use is largely a matter of taste. Choose the
one you prefer and the one that best fits the overall theme of the visualization.
It's never a bad idea to use the colorblind palette to account for colorblind
people. The following is the code to create a deep color palette, followed by the
code for the other five palettes:
palette1 = sns.color_palette("deep")
sns.palplot(palette1)
palette2 = sns.color_palette("muted")
sns.palplot(palette2)
palette3 = sns.color_palette("bright")
sns.palplot(palette3)
palette4 = sns.color_palette("pastel")
sns.palplot(palette4)
palette5 = sns.color_palette("dark")
sns.palplot(palette5)
palette6 = sns.color_palette("colorblind")
sns.palplot(palette6)
One of the sequential color palettes that Seaborn offers is cubehelix palettes. They
have a linear increase or decrease in brightness and some variation in hue, meaning
that even when converted to black and white, the information is preserved.
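A minimal sketch of a cubehelix palette is shown here. The number of colors is arbitrary, and the mean of the RGB channels is used only as a crude stand-in for brightness:

```python
import seaborn as sns

# Eight colors from the default cubehelix palette
palette = sns.cubehelix_palette(8)
sns.palplot(palette)

# Crude brightness estimate per color: the mean of the RGB channels
brightness = [sum(rgb) / 3 for rgb in palette]
print([round(b, 2) for b in brightness])
```

With the default settings, the palette runs from a light color to a dark one, so the printed values decrease from start to end.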
Creating custom sequential palettes that only produce colors that start at either light
or dark desaturated colors and end with a specified color can be accomplished with
light_palette() or dark_palette(). Two examples are given in
the following:
custom_palette2 = sns.light_palette("magenta")
sns.palplot(custom_palette2)
The preceding palette can also be reversed by setting the reverse parameter to
True in the following code:
custom_palette3 = sns.light_palette("magenta", reverse=True)
sns.palplot(custom_palette3)
By default, creating a color palette only returns a list of colors. If you want to use it as
a colormap object, for example, in combination with a heatmap, set the
as_cmap=True argument, as demonstrated in the following example:
x = np.arange(25).reshape(5, 5)
ax = sns.heatmap(x, cmap=sns.cubehelix_palette(as_cmap=True))
custom_palette4 = sns.color_palette("coolwarm", 7)
sns.palplot(custom_palette4)
As we already mentioned, colors, when used effectively, can reveal patterns in data.
Spend some time thinking about which color palette is best for certain data. Let's
apply color palettes to visualize temperature changes in the following exercise.
Note
The dataset used for this exercise is sourced from
https://data.giss.nasa.gov/gistemp/ (accessed January 7, 2020). For more details
about the dataset, visit the website, looking at the FAQs in particular. This
dataset is also available in your Datasets folder.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
data = pd.read_csv("../../Datasets/"\
"northern_surface_temperature.csv", \
index_col=['Year'])
data = data.transpose()
4. Create a custom-diverging palette that diverges to blue (240 degrees on the hue
wheel) for low values and to red (15 degrees on the hue wheel) for high values.
Set the saturation as s=99. Make sure that the diverging_palette()
function returns a colormap by setting as_cmap=True:
5. Plot the heatmap for every 5 years. To ensure that the neutral color corresponds
to no temperature change (the value is zero), set center=0:
plt.figure(dpi=200)
sns.heatmap(data.iloc[:, ::5], cmap=heat_colormap, center=0)
plt.title("Temperature Changes from 1880 to 2015 " \
"(base period 1951-1980)")
plt.savefig('temperature_change.png', dpi=300, \
bbox_inches='tight')
The preceding diagram helps us to visualize the surface temperature change for
the Northern Hemisphere for past years.
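The palette described in step 4 can be sketched as follows. The variable name heat_colormap matches the one used in step 5; the two lookups at the ends of the colormap are added here only to show which colors the extremes map to:

```python
import seaborn as sns

# Diverge from blue (hue 240) for low values to red (hue 15) for high values
heat_colormap = sns.diverging_palette(240, 15, s=99, as_cmap=True)

# A colormap maps values in [0, 1] to RGBA tuples
low = heat_colormap(0.0)
high = heat_colormap(1.0)
print(low, high)
```

Because as_cmap=True returns a Matplotlib colormap object instead of a list of colors, the result can be passed directly to the cmap argument of sns.heatmap().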
Note
To access the source code for this specific section, please refer to
https://packt.live/3fracg8.
Let's now perform an activity to create a heatmap using a real-life dataset with
various color palettes.
3. Use your own appropriate colormap. Make sure that the lowest value is the
brightest, and the highest the darkest, color. After executing the preceding steps,
the expected output should be as follows:
Note
The solution to this activity can be found on page 420.
After the in-depth discussion about various color palettes, we will introduce some
more advanced plots that Seaborn offers in the following section.
Bar Plots
In the last chapter, we already explained how to create bar plots with Matplotlib.
Creating bar plots with subgroups was quite tedious, but Seaborn offers a very
convenient way to create various bar plots. In Seaborn, the height of each bar can
also represent an estimate of central tendency, while uncertainty is indicated by
error bars at the top of the bar.
The following example gives you a good idea of how this works:
import pandas as pd
import seaborn as sns
data = pd.read_csv("../Datasets/salary.csv")
sns.set(style="whitegrid")
sns.barplot(x="Education", y="Salary", hue="District", data=data)
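To illustrate what the bar heights represent, the following sketch uses a small hypothetical DataFrame (not the salary.csv dataset) and compares the plotted bars with the group means computed by pandas. By default, barplot() uses the mean as its estimate of central tendency:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical salary data, for illustration only
data = pd.DataFrame({
    'Education': ['Bachelor', 'Bachelor', 'Bachelor',
                  'Master', 'Master', 'Master'],
    'Salary': [40000, 50000, 60000, 55000, 65000, 75000],
})

fig, ax = plt.subplots()
sns.barplot(x='Education', y='Salary', data=data, ax=ax)

# The same estimate computed directly with pandas
group_means = data.groupby('Education')['Salary'].mean()
print(group_means)
```

The error bars are computed by bootstrapping, so their exact extent can vary slightly between runs.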
Let's get some practice with Seaborn bar plots in the following activity.
3. Use Seaborn to create a visually appealing bar plot that compares the two scores
for all five movies.
After executing the preceding steps, the expected output should appear
as follows:
Note
The solution to this activity can be found on page 422.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
# Note: distplot() is deprecated since Seaborn 0.11; histplot() with
# kde=True is the suggested replacement in newer versions
sns.distplot(data.loc[:, 'Age'])
plt.xlabel('Age')
plt.ylabel('Density')
The KDE plot is shown in the following diagram, along with a shaded area under
the curve:
A scatter plot shows each observation as points on the x and y axes. Additionally, a
histogram for each variable is shown:
import pandas as pd
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
sns.set(style="white")
sns.jointplot(x="Annual Salary", y="Age", data=data)
The scatter plot with marginal histograms is shown in the following diagram:
It is also possible to use the KDE procedure to visualize bivariate distributions. The
joint distribution is shown as a contour plot, as demonstrated in the following code:
The joint distribution is shown as a contour plot in the center of the diagram. The
darker the color, the higher the density. The marginal distributions are visualized on
the top and on the right.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('../../Datasets/age_salary_hours.csv')
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(data, hue='Education')
Note
The age_salary_hours dataset is derived from
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html.
A pair plot, also called a correlogram, is shown in the following diagram. Scatter plots
are shown for all variable pairs on the off-diagonal, while KDEs are shown on the
diagonal. Groups are highlighted by different colors:
Violin Plots
A different approach to visualizing statistical measures is by using violin plots. They
combine box plots with the kernel density estimation procedure that we described
previously and provide a richer description of the variable's distribution. Additionally,
the quartile and whisker values from the box plot are shown inside the violin.
import pandas as pd
import seaborn as sns
data = pd.read_csv("../../Datasets/salary.csv")
sns.set(style="whitegrid")
sns.violinplot(x='Education', y='Salary', hue='Gender', \
data=data, split=True, cut=0)
The violin plot shows both statistical measures and the probability distribution. The
data is divided into education groups, which are shown on the x-axis, and gender
groups, which are highlighted by different colors.
With the next activity, we will conclude the section about advanced plots. After
that, multi-plots in Seaborn are introduced.
3. Create a pandas DataFrame from the data for each respective group.
4. Create a violin plot for the IQ scores of the different test groups using Seaborn's
violinplot function.
5. Use the whitegrid style, set the context to talk, and remove all axes spines,
except the one on the bottom. Add a title to the plot.
After executing the preceding steps, the final output should appear as follows:
Note
The solution to this activity can be found on page 424.
Multi-Plots in Seaborn
In the previous topic, we introduced a multi-plot, namely, the pair plot. In this topic,
we want to talk about a different way to create flexible multi-plots.
FacetGrid
The FacetGrid is useful for visualizing a certain plot for multiple variables separately.
A FacetGrid can be drawn with up to three dimensions: row, col, and hue. The first
two have the obvious relationship with the rows and columns of an array. The hue is
the third dimension and is shown in different colors. The FacetGrid class has to be
initialized with a DataFrame, and the names of the variables that will form the row,
column, or hue dimensions of the grid. These variables should be categorical
or discrete.
• row, col, hue: Variables that define subsets of the given data, which will be
drawn on separate facets in the grid
Initializing the grid does not draw anything on it yet. To visualize data on this grid, the
FacetGrid.map() method has to be used. You can provide any plotting function
and the name(s) of the variable(s) in the DataFrame to the plot:
• *args: The column names in data that identify variables to plot. The data for
each variable is passed to func in the order in which the variables are specified.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("../../Datasets/salary.csv")[:1000]
g = sns.FacetGrid(data, col='District')
g.map(plt.scatter, 'Salary', 'Age')
Visualize the given data using a FacetGrid with two columns. The first column should
show the number of subscribers for each YouTube channel, whereas the second
column should show the number of views. The goal of this activity is to get some
practice working with FacetGrids. The following are the steps to implement
this activity:
1. Use pandas to read the YouTube.csv dataset located in the Datasets folder.
2. Access the data of each group in the column, convert this into a list, and assign
this list to variables of each respective group.
3. Create a pandas DataFrame with the preceding data, using the data of each
respective group.
After executing the preceding steps, the final output should appear as follows:
Note
The solution to this activity can be found on page 427.
In the next section, we will learn how to plot a regression plot using Seaborn.
Regression Plots
Regression is a technique in which we estimate the relationship between a
dependent variable (mostly plotted along the y-axis) and an independent variable
(mostly plotted along the x-axis). Given a dataset, we can assign independent and
dependent variables and then use various regression methods to find out the relation
between these variables. Here, we will only cover linear regression; however, Seaborn
provides a wider range of regression functionality if needed.
import numpy as np
import seaborn as sns
x = np.arange(100)
# normal distribution with mean 0 and a standard deviation of 5
y = x + np.random.normal(0, 5, size=100)
sns.regplot(x=x, y=y)
The regplot() function draws a scatter plot, a regression line, and a 95%
confidence interval for that regression, as shown in the following diagram:
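To see what the regression line represents, the following sketch fits a straight line to the same kind of synthetic data with np.polyfit(). This is an illustration of an ordinary least-squares fit and not a look into Seaborn's internals; the fixed seed is an assumption added for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(100)
# Linear relationship y = x with normally distributed noise (std. dev. of 5)
y = x + rng.normal(0, 5, size=100)

# Least-squares fit of a degree-1 polynomial: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Since the data was generated around the line y = x, the estimated slope should be close to 1 and the intercept close to 0.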
Note
The dataset used is from
http://genomics.senescence.info/download.html#anage. The dataset can also be
downloaded from GitHub. Here is the link to it: https://packt.live/3bzApYN.
After executing the preceding steps, the output should appear as follows:
Note
The solution to this activity can be found on page 430.
In the next section, we will learn how to create tree maps using the Squarify library.
Squarify
At this point, we will briefly talk about tree maps. Tree maps display hierarchical
data as a set of nested rectangles. Each group is represented by a rectangle whose
area is proportional to its value. Using color schemes, it is possible to represent
hierarchies (groups, subgroups, and so on). Compared to pie charts, tree maps use
space efficiently. Matplotlib and Seaborn do not offer tree maps, and so the Squarify
library that is built on top of Matplotlib is used. Seaborn is a great addition for
creating color palettes.
Note
To install Squarify, first launch the command prompt from the
Anaconda Navigator. Then, execute the following command:
pip install squarify.
The following code snippet is a basic tree map example. It requires the
squarify library:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
colors = sns.light_palette("brown", 4)
squarify.plot(sizes=[50, 25, 10, 15], \
label=["Group A", "Group B", "Group C", "Group D"], \
color=colors)
plt.axis("off")
plt.show()
Now, let's have a look at a real-world example that uses tree maps in the
following exercise.
Note
Before beginning the exercise, make sure you have installed Squarify by
executing pip install squarify on your command prompt. The
water_usage.csv dataset used in this exercise is sourced from
https://www.epa.gov/watersense/how-we-use-water. Their data originates from
https://www.waterrf.org/research/projects/residential-end-uses-water-version-2.
This dataset is also available in your Datasets folder.
mydata = pd.read_csv("../../Datasets/water_usage.csv", \
index_col=0)
4. Create a list of labels by accessing each column from the preceding dataset.
Here, the astype('str') method is used to cast the fetched data to the string
type:
labels = mydata['Usage'] \
+ ' (' + mydata['Percentage'].astype('str') + '%)'
5. To create a tree map visualization of the given data, use the plot() function of
the squarify library. This function takes three parameters. The first parameter
is a list of all the percentages, and the second parameter is a list of all the labels,
which we got in the previous step. The third parameter is the colormap that can
be created by using the light_palette() function of the Seaborn library:
# Create figure
plt.figure(dpi=200)
# Create tree map
squarify.plot(sizes=mydata['Percentage'], \
label=labels, \
color=sns.light_palette('green', mydata.shape[0]))
plt.axis('off')
# Add title
plt.title('Water usage')
# Show plot
plt.show()
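The label construction from step 4 can be tried out on its own. The following sketch uses a tiny hypothetical DataFrame whose values are made up for illustration; only the column names Usage and Percentage come from the exercise:

```python
import pandas as pd

# Hypothetical values; the real figures come from water_usage.csv
mydata = pd.DataFrame({
    'Usage': ['Toilet', 'Shower', 'Faucet'],
    'Percentage': [24, 20, 19],
})

# Numbers must be cast to strings before they can be concatenated with text
labels = mydata['Usage'] \
         + ' (' + mydata['Percentage'].astype('str') + '%)'
print(labels.tolist())
```

Without the astype('str') call, adding a string to an integer column would raise a TypeError.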
Note
To access the source code for this specific section, please refer to
https://packt.live/3fxRzqZ.
To conclude this exercise, you can see that tree maps are great for visualizing part-
of-a-whole relationships. We immediately see that using the toilet requires the most
water, followed by showers.
Activity 4.06: Visualizing the Impact of Education on Annual Salary and Weekly
Working Hours
In this activity, we will generate multiple plots using a real-life dataset. You're asked
to get insights on whether the education of people has an influence on their annual
salary and weekly working hours. You ask 500 people in the state of New York about
their age, annual salary, weekly working hours, and their education. You first want
to know the percentage for each education type, so you use a tree map.
Two violin plots will be used to visualize the annual salary and weekly working hours.
Compare in each case to what extent education has an impact.
It should also be taken into account that all visualizations in this activity are designed
to be suitable for colorblind people. In principle, this is always a good idea to bear
in mind.
Note
The American Community Survey (ACS) Public-Use Microdata
Samples (PUMS) dataset (one-year estimate from 2017) from
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html
is used in this activity. This dataset is later used
in Chapter 07, Combining What We Have Learned. This dataset can also be
downloaded from GitHub. Here is the link: https://packt.live/3bzApYN.
3. Create a subplot with two rows to visualize two violin plots for the annual salary
and weekly working hours, respectively. Compare in each case to what extent
education has an impact. To exclude pensioners, only consider people younger
than 65. Use a colormap that is suitable for colorblind people. subplots()
can be combined with Seaborn's plotting functions by simply passing the respective
axes as the ax argument. The following output will be generated after
implementing this step:
Note
The solution to this activity can be found on page 432.
Summary
In this chapter, we demonstrated how Seaborn helps to create visually appealing
figures. We discussed various options for controlling Figure aesthetics, such as Figure
style, controlling spines, and setting the context of visualizations. We talked about
color palettes in detail. Further visualizations were introduced for univariate and
bivariate distributions. Moreover, we discussed FacetGrids for creating multi-plots,
and regression plots as a way to analyze the relationships between two variables.
Finally, we discussed the Squarify library, which is used to create tree maps.
In the next chapter, we will work with a different category of data, called geospatial
data. The prominent attribute of such a dataset is the presence of geo-coordinates
that can be used to plot elements at a given position on a map. We will visualize
poaching points, the density of cities around the world, and create a more interactive
visualization that only displays data points of the currently selected country.
5
Plotting Geospatial Data
Overview
By the end of this chapter, you will be able to utilize geoplotlib to create
stunning geographical visualizations and identify the different types of
geospatial charts. You will be able to demonstrate datasets containing
geospatial data for plotting and create complex visualizations using tile
providers and custom layers.
Introduction
geoplotlib is an open-source Python library for geospatial data visualizations. It has
a wide range of geographical visualizations and supports hardware acceleration.
It also provides high-performance rendering for large datasets with millions of data
points. As discussed in earlier chapters, Matplotlib provides various ways to visualize
geographical data.
However, Matplotlib is not designed for this task because its interfaces are
complicated and inconvenient to use. Matplotlib also restricts how geographical
data can be displayed. The Basemap and Cartopy libraries allow you to plot on
a world map, but these packages do not support drawing on map tiles. Map tiles
are the underlying rectangular, square, or hexagonal image slabs that together form
a seamless map of the world. They are lightweight and requested individually, so only
the tiles that are currently in view need to be loaded.
geoplotlib, on the other hand, was designed precisely for this purpose; it not only
provides map tiles but also allows for interactivity and simple animations. It provides
a simple interface that allows access to compelling geospatial visualizations such
as histograms, point-based plots, tessellations such as Voronoi or Delaunay, and
choropleth plots.
In the exercises and activities in this chapter, we will use geoplotlib in combination
with different real-world datasets to do the following:
• Discover dense areas within cities in Europe that have a high population
• Create a custom animated layer that displays the time series data of aircraft
geoplotlib uses the concept of layers that can be placed on top of one another,
providing a powerful interface for even complex visualizations. It comes with several
common visualization layers that are easy to set up and use.
From the preceding diagram, we can see that geoplotlib is built on top of NumPy/
SciPy and Pyglet/OpenGL. These libraries take care of numerical operations and
rendering. Both components are based on Python, therefore enabling the use of the
full Python ecosystem.
Note
All the datasets used in this chapter can be found at
https://packt.live/3bzApYN. All the files of exercises and
activities can be found here: https://packt.live/2UJRbyt.
geoplotlib fully integrates into the Python ecosystem. This even enables us to
plot geographical data inline inside our Jupyter Notebooks. This possibility allows
us to design our visualizations quickly and iteratively.
• Simplicity: Looking at the example provided here, we can quickly see that
geoplotlib abstracts away the complexity of plotting map tiles and already-
provided layers such as dot density and histogram. It has a simple API that
provides common visualizations. These visualizations can be created using
custom data with only a few lines of code.
The core attributes of our datasets are lat and lon values. Latitude and
longitude values enable us to index every single location on Earth. In geoplotlib,
we need them to tell the library where on the map our elements need to be
rendered. If our dataset comes with lat and lon columns, we can display each
of those data points, for example, as dots on a map, with five lines of code.
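As a minimal sketch of those few lines, assuming a CSV file with lat and lon columns (the file name cities.csv is hypothetical), the following uses geoplotlib's read_csv, dot, and show functions. The small header check is our own addition for illustration and only relies on the standard library:

```python
import csv
import io

REQUIRED_COLUMNS = {"lat", "lon"}


def has_lat_lon(csv_text):
    """Return True if the CSV header contains both lat and lon columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return REQUIRED_COLUMNS.issubset(name.strip() for name in header)


def plot_dots(path):
    """Render every row of the CSV file at `path` as a dot on the map."""
    # Imported lazily so the helper above works without geoplotlib installed.
    import geoplotlib
    from geoplotlib.utils import read_csv

    dataset = read_csv(path)   # data object exposing the lat and lon fields
    geoplotlib.dot(dataset)    # add a dot density layer
    geoplotlib.show()          # open the map window


sample = "name,lat,lon\nBerlin,52.52,13.405\n"
print(has_lat_lon(sample))
# plot_dots("cities.csv")  # hypothetical file; opens a window when run locally
```

Checking the header first is optional, but it gives a clearer error than a failed plot when a dataset uses names such as Latitude and Longitude instead.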
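In addition, we can use the f_tooltip argument to show a popup for each point. The
function we pass receives the data point and returns the text to display, which is
taken from the City column here:
import geoplotlib
from geoplotlib.utils import DataAccessObject
# dataset_filtered is a pandas DataFrame with lat and lon columns
dataset_obj = DataAccessObject(dataset_filtered)
geoplotlib.dot(dataset_obj, \
               f_tooltip=lambda d:d['City'].title())
geoplotlib.show()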
Executing this code will result in the following dot density plot:
Figure 5.2: Dot density layer of cities in Brazil and an overlay of the city on hovering
Next, we will create geographical visualizations without much effort and discover the
advantages of using geoplotlib in combination with pandas. We will implement an
exercise that plots the cities of the world and will experience the performance of
the library when plotting thousands of dots on our map.
Geospatial Visualizations
Voronoi tessellation, Delaunay triangulation, and choropleth plots are a few of
the geospatial visualizations that will be used in this chapter. An explanation for each
of them is provided here.
Voronoi Tessellation
In a Voronoi tessellation, each pair of neighboring data points is separated by a line
that is the same distance from both data points. The separation creates cells that
mark, for every given location, which data point is closest. The closer the data points
are to each other, the smaller the cells.
The following example shows how you can simply use the voronoi method to
create this visualization:
geoplotlib.show()
After importing the dependencies we need, we read the dataset using the read_csv
method of pandas (or geoplotlib). We then use it as data for our voronoi method,
which handles all the complex logic of plotting the data on the map.
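To make the idea of a Voronoi cell concrete, here is a small sketch that is independent of geoplotlib: a location belongs to the cell of whichever data point is nearest to it. The toy coordinates are plain planar values chosen for illustration. The plot_voronoi helper shows how the voronoi and set_smoothing functions could be combined, assuming a dataset with lat and lon fields:

```python
import math


def nearest_site(point, sites):
    """Return the index of the site closest to `point`.

    All locations for which this returns i together form the Voronoi cell
    of site i.
    """
    return min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))


def plot_voronoi(dataset):
    """Draw a Voronoi layer for a dataset that has lat and lon fields."""
    import geoplotlib  # lazy import: only needed when actually plotting

    geoplotlib.voronoi(dataset, line_color='b')
    geoplotlib.set_smoothing(True)  # anti-aliased lines
    geoplotlib.show()


# Toy planar coordinates, for illustration only.
sites = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(nearest_site((1.0, 1.0), sites))   # lies in the cell of the first site
print(nearest_site((3.0, 0.5), sites))   # lies in the cell of the second site
```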
In addition to the data itself, we can set several parameters, such as general
smoothing using the set_smoothing method. The smoothing of the lines
uses anti-aliasing:
Delaunay Triangulation
A Delaunay triangulation is related to Voronoi tessellation. When connecting each
data point to every other data point whose Voronoi cell shares an edge with its own,
we end up with a plot that is triangulated. The closer the data points are to each
other, the smaller the triangles will be. This gives us a visual clue about the density
of points in specific areas. When combined with color gradients, we get insights about
points of interest, which can be compared with a heatmap:
geoplotlib.show()
This example uses the same dataset as before, that is, population density in Brazil.
The structure of the code is the same as in the voronoi example.
After importing the dependencies that we need, we read the dataset using the
read_csv method and then use it as data for our delaunay method, which handles all of
the complex logic of plotting data on the map.
In addition to the data itself, we can again use the set_smoothing method to
smooth the lines using anti-aliasing.
Choropleth Plot
This kind of geographical plot displays areas such as the states of a country in
a shaded or colored manner. The shade or color of the plot is determined by a
single data point or a set of data points. It gives an abstract view of a geographical
area to visualize the relationships and differences between the different areas. In
the following code and visual example, we can see that the unemployment rate
determines the shade of each state of the US. The darker the shade, the higher
the rate:
"""
plot the outlines of the states and color them using the unemployment
rate
"""
cmap = ColorMap('Reds', alpha=255, levels=10)
geoplotlib.geojson('../../Datasets/us_states_shapes.json', \
fill=True, color=get_color, \
f_tooltip=lambda properties: properties['NAME'])
geoplotlib.geojson('../../Datasets/us_states_shapes.json', \
fill=False, color=[255, 255, 255, 64])
geoplotlib.set_bbox(BoundingBox.USA)
geoplotlib.show()
We will cover what each line does in more detail later. However, to give you a better
understanding of what is happening here, we will quickly cover the sections of the
preceding code.
The first few lines import all the necessary dependencies, including geoplotlib and
json, which will be used to load our dataset, which is provided in this format.
After the import statements, we see a get_color method. This method returns
a color that has been determined by the unemployment rate of the given data point.
This method defines how dark the red value will be. In the last section of the script,
we read our dataset and use it with the geojson method.
The choropleth plot is one of the few visualizations that does not have a method
assigned that is solely used for this kind of plot. We use the geojson() method to
create more complex shapes than simple dots. By using the f_tooltip argument,
we can also display the name of the state we are hovering over.
The BoundingBox object defines the "corners" of the viewport. We can use it to
set an initial focus when running our visualization, which helps the user see what the
visualization is about without panning around and zooming first.
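The role of the get_color method can be illustrated without geoplotlib. The sketch below is a simplified stand-in for a ColorMap with 10 levels: it quantizes a rate into discrete steps and returns an RGBA list in which higher rates produce a deeper red. The rate_to_rgba helper, the rates, and the upper limit of 0.1 are made up for illustration, and the colors do not reproduce the actual 'Reds' palette:

```python
def rate_to_rgba(rate, max_rate, levels=10, alpha=255):
    """Map a rate to one of `levels` red shades; higher rates are deeper red.

    Simplified stand-in for a color map lookup; it does not reproduce the
    actual 'Reds' palette.
    """
    clipped = min(max(rate, 0.0), max_rate)
    # Quantize to a level index in 0 .. levels - 1.
    level = min(int(clipped / max_rate * levels), levels - 1)
    # Level 0 is the lightest shade, the last level is pure red.
    fade = 255 - round(level * 255 / (levels - 1))
    return [255, fade, fade, alpha]


# Made-up illustrative rates, not real unemployment figures.
unemployment = {"Nevada": 0.048, "Hawaii": 0.021}


def get_color(properties):
    """Color callback: receives a feature's properties, returns RGBA."""
    rate = unemployment.get(properties["NAME"], 0.0)
    return rate_to_rgba(rate, max_rate=0.1)


print(get_color({"NAME": "Nevada"}))
print(get_color({"NAME": "Hawaii"}))
```

A function of this shape can be passed as the color argument of the geojson method, which calls it with the properties of each feature.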
Executing this code with the right example dataset provides the
following visualization:
Figure 5.5: Choropleth plot of unemployment rates in the US; the darker the color, the
higher the value
Exercise 5.01: Plotting Poaching Density Using Dot Density and Histograms
In this exercise, we'll be looking at the primary use of geoplotlib's plot methods for
dot density, histograms, and Voronoi diagrams. For this, we will make use of data
on various poaching incidents.
The dataset that we will be using here contains data from poaching incidents in
Tanzania. The dataset consists of 268 rows and 6 columns (id_report,
date_report, description, created_date, lat, and lon).
Note that geoplotlib requires your dataset to have both lat and lon columns. These
columns are the geographical data for latitude and longitude, which are used to
determine how to plot the data. The following are the steps to perform:
dataset = read_csv('../../Datasets/poaching_points_cleaned.csv')
4. Print out the dataset and look at its type. What difference do you see compared
to a pandas DataFrame? Let's take a look:
6. Plot each row of our dataset as a single point on the map using a dot density
layer by calling the dot method. Then, call the show method to render the map
with a given layer:
Only looking at the lat and lon values in the dataset won't give us a very good
idea of where on the map our elements are located or how far apart they are.
We're not able to draw conclusions and get insights into our dataset without
visualizing our data points on a map. When looking at the rendered map, we
can instantly see that some areas have more incidents than others. This insight
couldn't have been easily identified by simply looking at the numbers in the
dataset itself.
7. Visualize the density using the hist method, which will create a Histogram
Layer on top of our map tiles. Then, define a binsize of 20. This will allow us
to set the size of the hist bins in our visualization:
8. Create a Voronoi plot using the same dataset. Use a color map cmap of
'Blues_r' and define the max_area parameter as 1e5:
# plotting a voronoi map
geoplotlib.voronoi(dataset, cmap='Blues_r', \
max_area=1e5, alpha=255)
geoplotlib.show()
Note
To access the source code for this specific section, please refer to
https://packt.live/2UIwGkT.
This section does not currently have an online interactive example, and will
need to be run locally.
Voronoi plots are good for visualizing the density of data points, too. Voronoi
introduces a little bit more complexity with several parameters, such as cmap,
max_area, and alpha. Here, cmap denotes the color map used to fill the cells, alpha
denotes the opacity of those colors, and max_area denotes a constant that
determines how the area of each Voronoi cell is mapped to a color.
If we compare this Voronoi visualization with the histogram plot, we can see that
one area draws a lot of attention. The center-right edge of the plot shows quite a
large dark blue area with an even darker center: something that could've easily been
overlooked with the histogram plot.
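To build an intuition for the binsize parameter of the histogram layer used in this exercise, the following sketch counts points per square bin. It is a simplified stand-in rather than geoplotlib's implementation, and it assumes that points are given as screen coordinates in pixels. The bin_points helper and the sample points are hypothetical:

```python
from collections import Counter


def bin_points(points, binsize):
    """Count how many (x, y) screen points fall into each square bin.

    Bins are binsize x binsize pixels; the key is the bin's (column, row).
    """
    return Counter((int(x // binsize), int(y // binsize)) for x, y in points)


# Hypothetical screen coordinates in pixels, for illustration only.
points = [(5, 5), (12, 18), (25, 3), (39, 19), (41, 2)]
counts = bin_points(points, binsize=20)
print(counts[(0, 0)], counts[(1, 0)], counts[(2, 0)])
```

A larger binsize merges neighboring bins, which smooths the picture but hides small hotspots. A smaller one does the opposite.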
We have now covered the basics of geoplotlib. It has many more methods, but they
all have a similar API that makes using the other methods simple. Since we have
looked at some very basic visualizations, it's now up to you to solve the first activity.
3. List all the datatypes that are present in it and verify that they are correct.
Then, map the Latitude and Longitude columns to lat and lon.
5. Use the agg method of pandas to get the average number of cities per country.
6. Obtain the number of cities per country (the first 20 entries) and extract the
countries that have a population of greater than zero.
8. Again, filter your remaining data for cities with a population of greater
than 100,000.
9. To get a better understanding of the density of our data points on the map, use
a Voronoi tessellation layer.
10. Filter down the data even further to only cities in countries such as Germany and
Great Britain.
11. Finally, use a Delaunay triangulation layer to find the most densely
populated areas.
Figure 5.13: A Delaunay triangle visualization of cities in Germany and Great Britain
Note
The solution for this activity can be found on page 436.
You have now completed your first activity using geoplotlib. Note how we made use
of different plots to get the information we required. Next, we will look at some more
custom features of geoplotlib that will allow us to change the map tiles provider and
create custom plotting layers.
{
"type": "Feature",
"properties": {
"name": "Dinagat Islands"
},
"geometry": {
"type": "Point",
"coordinates": [125.6, 10.1]
}
}
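Since GeoJSON is plain JSON, a feature like the one above can be inspected with Python's built-in json module. The following sketch parses it from a string and reads its members using simple indexing. Note that GeoJSON stores coordinates in longitude, latitude order:

```python
import json

feature_text = """
{
  "type": "Feature",
  "properties": {"name": "Dinagat Islands"},
  "geometry": {"type": "Point", "coordinates": [125.6, 10.1]}
}
"""

feature = json.loads(feature_text)
lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order is [lon, lat]
print(feature["properties"]["name"], lat, lon)
```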
2. Since the geojson method of geoplotlib only needs a path to the
us_states.json dataset instead of a DataFrame or object, we don't need to load it.
However, since we still want to see what kind of data we are handling, we must
open the GeoJSON file and load it as a json object. We can then access its
members using simple indexing:
Our dataset contains a few properties. Only the state name, NAME, and the
number of census areas, CENSUSAREA, are important for us in this exercise.
Note
Geospatial applications prefer GeoJSON files for persisting and exchanging
geographical data.
3. Extract the names of all the states of the USA from the dataset. Next, print the
number of states in the dataset and then print all the states as a list:
4. If your GeoJSON file is valid, that is, if it has the expected structure, create a
GeoJSON plot using the geojson() method of geoplotlib:
After calling the show method, the map will show up with a focus on North
America. In the following diagram, we can already see the borders of each state:
5. Rather than assigning a single value to each state, we want the darkness to
represent the number of census areas. To do this, we have to provide a method
for the color property. Map the CENSUSAREA attribute to a ColorMap class
object with 10 levels to allow a good distribution of color. Provide a maxvalue
of 300000 to the to_color method to define the upper limit of our dataset:
As you can see in the code example, we can provide three arguments to our
ColorMap. The first one, 'Reds', in our case, defines the basic coloring
scheme. The alpha argument defines how opaque we want the color to be,
255 being 100% opaque, and 0 completely invisible. Those 8-bit values for the
Red, Green, Blue, and Alpha (RGBA) values are commonly used in styling: they
all range from 0 to 255. With the levels argument, we can define how many
"steps," that is, levels of red values, we can map to.
6. Use the us_states.json file in the Datasets folder to visualize the different
states. First, provide the color mapping to our color parameter and set the
fill parameter to True. Then, draw a black outline for each state. Use the
color argument and provide the RGBA value for black. Lastly, use the USA
constant of the BoundingBox class to set the bounding box:
"""
plotting the shaded states and adding another layer which plots the
state outlines in black
our BoundingBox should focus the USA
"""
geoplotlib.geojson('../../Datasets/us_states.json', \
fill=True, color=get_color)
geoplotlib.geojson('../../Datasets/us_states.json', \
fill=False, color=[0, 0, 0, 255])
geoplotlib.set_bbox(BoundingBox.USA)
geoplotlib.show()
A new window will open, displaying the country, USA, with the areas of its
states filled with different shades of red. The darker areas represent higher
census areas.
7. To give the user some more information about this plot, use the f_tooltip
argument to provide a tooltip displaying the name and census area value of the
state currently hovered over:
geoplotlib.geojson('../../Datasets/us_states.json', \
fill=False, color=[0, 0, 0, 255])
geoplotlib.set_bbox(BoundingBox.USA)
geoplotlib.show()
Upon hovering, we will get a tooltip for each of the plotted areas displaying the
name of the state and the census area value.
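A tooltip function of this kind can be sketched as follows. The function passed as f_tooltip receives the properties of the hovered feature and returns the text to display. The state_tooltip and plot_states names, the exact tooltip wording, and the sample values are our own assumptions for illustration:

```python
def state_tooltip(properties):
    """Build the tooltip text from a GeoJSON feature's properties."""
    return '{} - Area: {}'.format(properties['NAME'],
                                  properties['CENSUSAREA'])


def plot_states(path, get_color):
    """Draw filled states with tooltips; get_color maps properties to RGBA."""
    import geoplotlib  # lazy import: only needed when plotting
    from geoplotlib.utils import BoundingBox

    geoplotlib.geojson(path, fill=True, color=get_color,
                       f_tooltip=state_tooltip)
    geoplotlib.geojson(path, fill=False, color=[0, 0, 0, 255])
    geoplotlib.set_bbox(BoundingBox.USA)
    geoplotlib.show()


# Placeholder values, not taken from the dataset.
print(state_tooltip({'NAME': 'Example State', 'CENSUSAREA': 12345.6}))
```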
Note
To access the source code for this specific section, please refer to
https://packt.live/30PX9Rh.
This section does not currently have an online interactive example, and will
need to be run locally.
You've already built different plots and visualizations using geoplotlib. In this exercise,
we looked at displaying data from a GeoJSON file and creating a choropleth plot.
In the following topics, we will cover more advanced customizations that will give you
the tools to create more powerful visualizations.
Tile Providers
geoplotlib supports the use of different tile providers. This means that any
OpenStreetMap tile server can be used as a backdrop for our visualization. Some of
the popular free tile providers include Stamen Watercolor, Stamen Toner, Stamen
Toner Lite, and DarkMatter. Changing the tile provider can be done in two ways:
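geoplotlib contains a few built-in tile providers with shortcuts. The following code
shows how to use one of them:
geoplotlib.tiles_provider('darkmatter')
Alternatively, a custom tile provider can be supplied as an object that defines the
url, tiles_dir, and attribution entries: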
geoplotlib.tiles_provider({\
'url': lambda zoom, \
xtile, ytile:
'http://a.tile.stamen.com/'\
'watercolor/%d/%d/%d.png' \
% (zoom, xtile, ytile),\
'tiles_dir': 'tiles_dir',
'attribution': \
'Python Data Visualization | Packt'\
})
The caching in tiles_dir is mandatory since, each time the map is scrolled or
zoomed into, we query new map tiles if they are not already downloaded. Without
a cache, this can lead to the tile provider refusing your requests because too many
of them occur in a short period of time.
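The url entry of a custom tile provider is simply a function of the zoom level and the two tile indices. As a sketch, it can be written as a named function instead of a lambda, which makes it easy to check the generated addresses before handing the object to geoplotlib. The hot_tile_url name and the attribution text are our own choices, and the URL pattern is the one used later in this chapter:

```python
def hot_tile_url(zoom, xtile, ytile):
    """Build the URL of one map tile from its zoom level and tile indices."""
    return ('http://a.tile.openstreetmap.fr/hot/%d/%d/%d.png'
            % (zoom, xtile, ytile))


custom_provider = {
    'url': hot_tile_url,
    'tiles_dir': 'custom_tiles',          # local cache directory for tiles
    'attribution': 'Custom Tiles Provider',
}

print(custom_provider['url'](3, 4, 2))

# With geoplotlib installed, the provider would be activated like this:
# import geoplotlib
# geoplotlib.tiles_provider(custom_provider)
# geoplotlib.show()
```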
In the following exercise, we'll take a quick look at how to switch the map tile
provider. It might not seem convincing at first, but it can take your visualizations to
the next level if leveraged correctly.
import geoplotlib
We won't use a dataset in this exercise since we want to focus on the map tiles
and tile providers.
geoplotlib.show()
This will display an empty world map since we haven't specified a tile provider.
By default, it will use the CartoDB Positron map tiles.
In this example, we used the darkmatter map tiles. As you can see, they are
very dark and will make your visualizations pop out.
Note
We can also use different map tiles such as watercolor, toner,
toner-lite, and positron in a similar way.
geoplotlib.tiles_provider({
'url': lambda zoom, \
xtile, ytile: \
'http://a.tile.openstreetmap.fr/'\
'hot/%d/%d/%d.png' \
% (zoom, xtile, ytile),\
'tiles_dir': 'custom_tiles',
'attribution': 'Custom Tiles '\
'Provider – Humanitarian map style'\
})
geoplotlib.show()
Figure 5.20: Humanitarian map tiles from the custom tile providers object
Some map tile providers have strict request limits, so you may see warning
messages if you're zooming in too fast.
Note
To access the source code for this specific section, please refer to
https://packt.live/3e6WjTT.
This section does not currently have an online interactive example, and will
need to be run locally.
You now know how to change the tile provider to give your visualization one more
layer of customizability. This also introduces us to another layer of complexity. It
all depends on the concept of our final product and whether we want to use the
"default" map tiles or some artistic map tiles.
The next section will cover how to create custom layers that can go far beyond
the ones we have described in this book. We'll look at the basic structure of the
BaseLayer class and what it takes to create a custom layer.
Custom Layers
Now that we have covered the basics of visualizing geospatial data with built-in
layers and methods to change the tile provider, we will focus on defining our own
custom layers. Custom layers allow you to create more complex data visualizations.
They also help with adding more interactivity and animation to them. Creating a
custom layer starts by defining a new class that extends the BaseLayer class that's
provided by geoplotlib. Besides the __init__ method, which initializes the class
level variables, we also have to, at the very least, override the draw method of the
BaseLayer class.
Depending on the nature of your visualization, you might also want to implement
the invalidate method, which takes care of map projection changes such as
zooming into your visualization. Both the draw and invalidate methods receive
a Projection object that takes care of the latitude and longitude mapping on our
two-dimensional viewport. These mapped points can be handed to an instance of a
BatchPainter object that provides primitives such as points, lines, and shapes to
draw those coordinates onto your map.
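The structure described above can be sketched as a minimal layer. The PointsLayer name and the painter_factory argument are hypothetical; the latter only exists so that the class can be exercised without opening a window. The fallback in the except branch is likewise our own addition so that the sketch can be read and run where geoplotlib is not installed:

```python
try:
    from geoplotlib.layers import BaseLayer
    from geoplotlib.core import BatchPainter
except Exception:  # fallback so the sketch also runs without geoplotlib
    BaseLayer = object
    BatchPainter = None


class PointsLayer(BaseLayer):
    """Hypothetical minimal layer that draws every row as a red point."""

    def __init__(self, data, painter_factory=BatchPainter):
        self.data = data                      # needs 'lon' and 'lat' entries
        self.painter_factory = painter_factory
        self.painter = None

    def invalidate(self, proj):
        # Called when the projection changes (for example, after zooming):
        # rebuild the painter with freshly projected screen coordinates.
        self.painter = self.painter_factory()
        x, y = proj.lonlat_to_screen(self.data['lon'], self.data['lat'])
        self.painter.set_color([255, 0, 0])
        self.painter.points(x, y, 2)

    def draw(self, proj, mouse_x, mouse_y, ui_manager):
        # Called every frame: replay the batched drawing commands.
        self.painter.batch_draw()
```

With geoplotlib installed, such a layer would be registered with geoplotlib.add_layer(PointsLayer(dataset)) before calling geoplotlib.show().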
class CountrySelectLayer(BaseLayer):
self.country_num = (self.country_num + 1) \
% len(countries)
return True
elif key == pyglet.window.key.LEFT:
self.country_num = (self.country_num - 1) \
% len(countries)
return True
return False
europe_bbox = BoundingBox(north=68.574309, \
west=-25.298424, \
south=34.266013, \
east=47.387123)
geoplotlib.add_layer(CountrySelectLayer(dataset, europe_bbox))
geoplotlib.show()
As we've seen several times before, we first import all the necessary dependencies for
this plot, including geoplotlib. BaseLayer and BatchPainter are dependencies
we haven't seen before, since they are only needed when writing custom layers.
The BatchPainter class is another helper for our implementation that lets us
trigger the drawing of elements onto the map.
When creating the custom layer, we simply provide the BaseLayer class in the
parentheses to tell Python to extend the given class.
The class then needs to implement at least two of the provided methods,
__init__ and draw.
__init__ defines what happens when a new custom layer is instantiated. This is
used to set the state of our layer; here, we define values such as our data to be used
and create a new BatchPainter class.
The draw method is called every frame and draws the defined elements using the
BatchPainter class.
In this method, we can do all sorts of calculations such as, in this case, filtering our
dataset to only contain the values of the current active timestamp. In addition to that,
we make the viewport follow our current lat and lon values by fitting the projection
to a new BoundingBox.
Since we don't want to draw everything from scratch with every frame, we use the
invalidate method, which only updates the points when the viewport changes,
for example, when zooming.
When using interaction elements, such as switching through our countries using
the arrow keys, we can return True from the on_key_release method to trigger
the redrawing of all the points, or False to leave the map unchanged.
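The wrap-around selection used in the key handler relies on the modulo operator, which in Python also works for negative numbers. The following sketch isolates that logic from the layer. The CountrySelector class and the country list are hypothetical and only serve as an illustration:

```python
class CountrySelector:
    """Keeps track of the selected country and wraps around at both ends."""

    def __init__(self, countries):
        self.countries = countries
        self.country_num = 0

    def step(self, direction):
        """Move by +1 (right) or -1 (left); return True if a redraw is due."""
        if direction not in (1, -1):
            return False          # unrelated key: nothing to redraw
        self.country_num = ((self.country_num + direction)
                            % len(self.countries))
        return True

    @property
    def current(self):
        return self.countries[self.country_num]


selector = CountrySelector(['Germany', 'Spain', 'France'])
selector.step(-1)           # moving left from the first entry wraps to the last
print(selector.current)
selector.step(1)            # moving right from the last entry wraps to the first
print(selector.current)
```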
Once our class is defined, we can call the add_layer method of geoplotlib to add
the newly defined layer to our visualization and finally call show() to show the map.
When executing the preceding example code, we get a visualization that, upon
switching the selected country with the arrow keys, draws the cities for the selected
country using dots on the map:
The following figure shows the cities in Spain after changing the selected country
using the arrow keys:
Figure 5.22: The selection of cities in Spain after changing the country using the arrow keys
In the following exercise, we will create our own animated visualization by using what
we've learned about custom layers in the preceding example.
Note
Since geoplotlib operates on OpenGL, this process is highly performant and
can even draw complex visualizations quickly.
Let's create a custom layer that will allow us to display geospatial data and animate
the data points over time:
dataset = pd.read_csv('../../Datasets/flight_tracking.csv')
3. Use the head method to list the first five rows of the dataset and to understand
the columns:
4. Rename the latitude and longitude columns to lat and lon by using the
rename method provided by pandas:
# renaming columns latitude to lat and longitude to lon
dataset = dataset.rename(index=str, \
columns={"latitude": "lat", "longitude": "lon"})
Take another look at the first five elements of the dataset, and observe that the
names of the columns have changed to lat and lon:
Figure 5.24: The dataset with the lat and lon columns
5. Since we want to get a visualization over time in this activity, we need to work
with date and time. If we take a closer look at our dataset, it shows us that
date and time are separated into two columns. Combine date and time into
a timestamp, using the to_epoch method already provided:
6. Use to_epoch and the apply method provided by the pandas DataFrame to
create a new column called timestamp that holds the Unix timestamp:
"""
create a new column called timestamp with the to_epoch method applied
"""
dataset['timestamp'] = dataset.apply(lambda x: to_epoch\
(x['date'], x['time']), \
axis=1)
7. Take another look at our dataset. We now have a new column that holds the
Unix timestamps:
Since our dataset is now ready to be used with all the necessary columns
in place, we can start writing our custom layer. This layer will display each
point once it reaches the timestamp that's provided in the dataset. It will be
displayed for a few seconds before it disappears. We'll need to keep track of the
current timestamp in our custom layer. Consolidating what we learned in the
theoretical section of this topic, we have an __init__ method that constructs
our custom TrackLayer.
8. In the draw method, filter the dataset for all the elements that are in the
mentioned time range and display each element of the filtered list on the map
with the color that's provided by the colorbrewer method.
Since our dataset only contains data from a specific time range and we're always
incrementing the time, we want to check whether there are still any elements
with timestamps after the current timestamp. If not, we want to set our
current timestamp to the earliest timestamp that's available in the dataset. The
following code shows how we can create a custom layer:
class TrackLayer(BaseLayer):
    def __init__(self, dataset, bbox=BoundingBox.WORLD):
        self.data = dataset
        self.cmap = colorbrewer(self.data['hex_ident'], \
                                alpha=200)
        self.time = self.data['timestamp'].min()
        self.painter = BatchPainter()
        self.view = bbox
    def draw(self, proj, mouse_x, mouse_y, ui_manager):
        self.painter = BatchPainter()
        df = self.data.where((self.data['timestamp'] \
                              > self.time) \
                             & (self.data['timestamp'] \
                                <= self.time + 180))
9. Define a custom BoundingBox that focuses our view on this area, since the
dataset only contains data from the area around Leeds in the UK:
Figure 5.26: Final animated tracking map that displays the routes of the aircraft
Note
To access the source code for this specific section, please refer to
https://packt.live/3htmztU.
This section does not currently have an online interactive example, and will
need to be run locally.
You have now completed the custom layer exercise using geoplotlib. We've applied
several preprocessing steps to shape the dataset as we want to have it. We've also
written a custom layer to display spatial data in the temporal space. Our custom layer
even has a level of animation. This is something we'll look into more in the following
chapter about Bokeh. We will now implement an activity that will help us get more
acquainted with custom layers in geoplotlib.
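Two of the building blocks from this exercise can be sketched in plain Python: combining date and time into a Unix timestamp, and selecting the rows that fall into the 180-second window after the current time. The date and time formats below are assumptions about the dataset, the values are interpreted as UTC to keep the result reproducible, and the sample times are made up. The visible_rows helper is hypothetical:

```python
from datetime import datetime, timezone


def to_epoch(date, time):
    """Combine a date and a time string into a Unix timestamp in seconds.

    Assumes 'YYYY/MM/DD' dates and 'HH:MM:SS.fff' times, read as UTC.
    """
    parsed = datetime.strptime('{} {}'.format(date, time),
                               '%Y/%m/%d %H:%M:%S.%f')
    return round(parsed.replace(tzinfo=timezone.utc).timestamp())


def visible_rows(rows, current_time, window=180):
    """Keep rows with current_time < timestamp <= current_time + window."""
    return [row for row in rows
            if current_time < row['timestamp'] <= current_time + window]


# Made-up sample times, for illustration only.
rows = [{'timestamp': to_epoch('2017/09/11', t)}
        for t in ('17:02:06.418', '17:03:30.000', '17:10:00.000')]
start = rows[0]['timestamp']
print(len(visible_rows(rows, start)))
```

The window is open at its lower end, mirroring the comparison used in the draw method, so a point whose timestamp equals the current time is not drawn.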
Activity 5.02: Visualizing City Density by the First Letter Using an Interactive
Custom Layer
In this last activity for geoplotlib, you'll combine all the methodologies learned in the
previous exercises and the activity to create an interactive visualization that displays
the cities that start with a given letter, by merely pressing the left and right arrow keys
on your keyboard.
Since we use the same setup to create custom layers as the library does, you will be
able to understand the library implementations of most of the layers provided by
geoplotlib after this activity.
5. Filter the dataset to only contain European cities by using the given
europe_country_codes list.
6. Compare the length of all data with the filtered data of Europe by printing the
length of both.
7. Filter down the European dataset to get a dataset that only contains cities that
start with the letter Z.
8. Print its length and the first five rows using the head method.
9. Create a dot density plot with a tooltip that shows the country code and the
name of the city separated by a -. Use the DataAccessObject to create a
copy of our dataset, which allows the use of f_tooltip. The following is the
expected output of the dot density plot:
10. Create a Voronoi plot with the same dataset that only contains cities that start
with Z. Use the 'Reds_r' color map and set the alpha value to 50 to make
sure you still see the map tiles. The following is the expected output of the
Voronoi plot:
11. Create a custom layer that plots all the cities in the Europe dataset that start with
the provided letter. Make it interactive so that by using the left and right arrow
keys, we can switch between the letters. To do that, first, filter the self.data
dataset in the invalidate method using the current letter acquired from the
start_letters array using self.start_letter indexing.
12. Create a new BatchPainter() instance and project the lon and lat values
to x and y values. Use the BatchPainter instance to paint the points on the
map with a size of 2.
13. Call the batch_draw() method in the draw method and use the
ui_manager to add an info dialog to the screen telling the user which starting
letter is currently being used.
15. Add the custom layer using the add_layer method and provide the given
europe_bbox as a BoundingBox class.
The following is the expected output of the custom filter layer:
Figure 5.29: A custom filter layer displaying European cities starting with A
If we press the right arrow twice, we will see the cities starting with C instead:
Figure 5.30: A custom filter layer displaying European cities starting with C
Note
The solution for this activity can be found on page 447.
This last activity has a custom layer that uses all the properties described by
geoplotlib. All of the layers that geoplotlib already provides are created using the same
structure. This means that you're now able to dig into the source code and create
your own advanced layers.
Summary
In this chapter, we covered basic and advanced concepts and methods of geoplotlib.
It gave us a quick insight into internal processes, and we learned how to practically
apply the library to our own problem statements. Most of the time, the built-in plots
should suit your needs pretty well. If you're interested in building animated or even
interactive visualizations, you will have to create custom layers that enable
those features.
In the following chapter, we'll get some hands-on experience with the Bokeh library
and build visualizations that can easily be integrated into web pages. Once we have
finished using Bokeh, we'll conclude the chapter with an activity that allows you to
work with a new dataset and a library of your choice so that you can come up with
your very own visualization.
6
Making Things
Interactive with Bokeh
Overview
In this chapter, we will design interactive plots using the Bokeh library. By
the end of this chapter, you will be able to use Bokeh to create insightful
web-based visualizations and explain the difference between two interfaces
for plotting. You will identify when to use the Bokeh server and create
interactive visualizations.
Introduction
Bokeh is an interactive visualization library focused on modern browsers and the
web. Unlike those of Matplotlib or geoplotlib, the plots and visualizations we are going to
create in this chapter will be based on JavaScript widgets. Bokeh allows us to create
visually appealing plots and graphs nearly out of the box without much styling. In
addition to that, it helps us construct performant interactive dashboards based on
large static datasets or even streaming data.
Bokeh has been around since 2013, with version 1.4.0 being released in November
2019. It targets modern web browsers to present interactive visualizations to users
rather than static images. The following are some of the features of Bokeh:
• Supports multiple languages: Unlike Matplotlib and geoplotlib, Bokeh has
libraries for both Python and JavaScript, in addition to several other
popular languages.
• Beautiful chart styling: The tech stack is based on Tornado in the backend
and on BokehJS, Bokeh's own JavaScript library, in the frontend. BokehJS renders
the plots in the browser, and its default styling allows us to create
beautiful plots without much custom styling.
Since we are using Jupyter Notebook throughout this book, it's worth mentioning that
Bokeh, including its interactivity, is natively supported in Notebook.
Concepts of Bokeh
The basic concept of Bokeh is, in some ways, comparable to that of Matplotlib. In
Bokeh, we have a figure as our root element, which has sub-elements such as a title,
an axis, and glyphs. Glyphs, which can take on different shapes such as circles,
bars, and triangles, have to be added to a figure. The following hierarchy shows the
different concepts of Bokeh:
Interfaces in Bokeh
The interface-based approach provides different levels of complexity for users that
either want to create some basic plots with very few customizable parameters or
want full control over their visualizations to customize every single element of their
plots. This layered approach is divided into two levels:
Note
The models interface is the basic building block for all plots.
The following are the two levels of the layered approach to interfaces:
• bokeh.plotting
The vital thing to note here is that even though the setup of the figure's
sub-elements is done automatically, we can still configure them. When using this
interface, the creation of the scene graph used by BokehJS is handled automatically too.
• bokeh.models
This low-level interface is composed of two libraries: the JavaScript library called
BokehJS, which gets used for displaying the charts in the browser, and the core
plot creation Python code, which provides the developer interface. Internally, the
definition created in Python creates JSON objects that hold the declaration for
the JavaScript representation in the browser.
The models interface provides complete control over how Bokeh plots and
widgets (elements that enable users to interact with the data displayed) are
assembled and configured. This means that it is up to the developer to ensure
the correctness of the scene graph (a collection of objects describing
the visualization).
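To make the relationship between the two interfaces more tangible, here is a minimal sketch (not part of the original text; the data values are made up) that builds a figure with bokeh.plotting and then inspects the bokeh.models objects that were assembled for us:

```python
from bokeh.models import LinearAxis, Plot
from bokeh.plotting import figure

# bokeh.plotting sets up axes, grids, and tools automatically
plot = figure(title='Plotting interface')
renderer = plot.line([1, 2, 3], [4, 6, 5], line_width=2)

# the figure is itself a models-level Plot object
is_models_plot = isinstance(plot, Plot)

# the automatically created sub-elements remain configurable
x_axis = plot.xaxis[0]
x_axis.axis_label = 'x values'

print(is_models_plot, type(x_axis).__name__, type(renderer).__name__)
```

With the models interface, each of these objects (the axis, the renderer, and so on) would have to be created and attached by hand.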
Output
Outputting Bokeh charts is straightforward. There are three ways this can be done:
• The .show() method: The primary option is to display the plot in an HTML page
using this method.
• The inline .show() method: When using inline plotting with a Jupyter
Notebook, the .show() method will allow you to display the chart inside
your Notebook.
The most powerful way of providing your visualization is through the use of the
Bokeh server.
Bokeh Server
Bokeh creates scene graph JSON objects that will be interpreted by the BokehJS
library to create the visualization output. This process gives you a unified format for
other languages to create the same Bokeh plots and visualizations, independently of
the language used.
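As a rough illustration of this JSON representation, the following sketch serializes a small plot with json_item from bokeh.embed. The plot and its data are placeholders chosen for this example, and the exact keys of the resulting dictionary can differ between Bokeh versions:

```python
import json

from bokeh.embed import json_item
from bokeh.plotting import figure

plot = figure(title='Scene graph demo')
plot.line([1, 2, 3], [3, 1, 2])

# json_item turns the plot into a JSON-compatible dictionary
item = json_item(plot)
serialized = json.dumps(item)

print(sorted(item.keys()))
print(len(serialized), 'characters of JSON')
```

The serialized string is what BokehJS would interpret in the browser to draw the plot.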
To create more complex visualizations and leverage the tooling provided by Python,
we need a way to keep our visualizations in sync with one another. This way, we can
not only filter data but also do calculations and operations on the server-side, which
updates the visualizations in real-time.
In addition to that, since we will have an entry point for data, we can create
visualizations that get fed by streams instead of static datasets. This design provides a
way to develop more complex systems with even greater capabilities.
Looking at the scheme of this architecture, we can see that the documents are
provided on the server-side, then moved over to the browser, which then inserts
them into the BokehJS library. This insertion will trigger the interpretation by BokehJS,
which will then create the visualization. The following diagram describes how the
Bokeh server works:
Presentation
In Bokeh, presentations help make the visualization more interactive by using
different features, such as interactions, styling, tools, and layouts.
Interactions
Probably the most exciting feature of Bokeh is its interactions. There are two types of
interactions: passive and active.
Passive interactions are actions that the users can take that don't change the
dataset. In Bokeh, this is called the inspector. As we mentioned before, the inspector
contains attributes such as zooming, panning, and hovering over data. This tooling
allows the user to inspect the data in more detail and might provide better insights
by allowing the user to observe a zoomed-in subset of the visualized data points. The
elements highlighted with a box in the following figure show the essential passive
interaction elements provided by Bokeh. They include zooming, panning, and
clipping data.
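The passive interaction tools can be chosen when creating a figure. The following sketch assumes some made-up data points and selects the tools through the tools argument of figure:

```python
from bokeh.plotting import figure

# the tools string lists the passive interaction tools to attach
plot = figure(title='Passive interactions',
              tools='pan,wheel_zoom,box_zoom,reset,hover')
plot.scatter([1, 2, 3, 4], [2, 5, 3, 6], size=10)

# each entry of plot.tools is a tool object from bokeh.models
tool_names = [type(tool).__name__ for tool in plot.tools]
print(tool_names)
```

None of these tools modify the underlying data; they only change how the user looks at it.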
Active interactions are actions that directly change the displayed data. This includes
actions such as selecting subsets of data or filtering the dataset based on parameters.
Widgets are the most prominent of active interactions since they allow users to
manipulate the displayed data with handlers. Examples of available widgets are
buttons, sliders, and checkboxes.
Referring back to the subsection about the output styles, these widgets can be
used in both the so-called standalone applications in the browser and the Bokeh
server. This will help us consolidate the recently learned theoretical concepts and
make things more transparent. Some of the interactions in Bokeh are tab panes,
dropdowns, multi-selects, radio groups, text inputs, check button groups, data tables,
and sliders. The elements highlighted with a red box in the following figure show a
custom active interaction widget for the same plot we looked at in the example of
passive interaction.
Integrating
Embedding Bokeh visualizations can take two forms:
Bokeh is a little bit more complicated than Matplotlib with Seaborn and has its
drawbacks like every other library. Once you have the basic workflow down, however,
you're able to quickly extend basic visualizations with interactivity features to give
power to the user.
Note
One interesting feature of older Bokeh releases is the to_bokeh method, which
allows you to plot Matplotlib figures with Bokeh without configuration overhead.
Further information about this method is available at
https://bokeh.pydata.org/en/0.12.3/docs/user_guide/compat.html.
In the following exercises and activities, we'll consolidate the theoretical knowledge
and build several simple visualizations to explain Bokeh and its two interfaces.
After we've covered the basic usage, we will compare the plotting and models
interfaces and work with widgets that add interactivity to the visualizations.
Basic Plotting
As mentioned before, the plotting interface of Bokeh gives us a higher-level
abstraction, which allows us to quickly visualize data points on a grid.
from bokeh.io import output_notebook
output_notebook()
Before we can create a plot, we need to import the dataset. In the examples in this
chapter, we will work with a computer hardware dataset. It can be imported by using
pandas' read_csv method.
The basic flow when using the plotting interface is comparable to that of
Matplotlib. We first create a figure. This figure is then used as a container to define
elements and call methods on:
show(plot)
Once we have created a new figure instance using the imported figure() method,
we can use it to draw lines, circles, or any glyph objects that Bokeh offers. Note that
the first two arguments of the plot.line method are sequences that must contain an
equal number of elements.
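As a minimal sketch of this flow, the following example draws a line plot from two short lists. The values are hypothetical stand-ins for the cache memory column of the hardware dataset, and the variable names are chosen only for this illustration:

```python
from bokeh.plotting import figure

# hypothetical stand-in values; both lists have the same length
hardware_index = [0, 1, 2, 3, 4]
cache_memory = [256, 32, 32, 64, 16]

# the figure acts as the container for all glyphs
plot = figure(title='Cache per Hardware',
              x_axis_label='Hardware index',
              y_axis_label='Cache Memory')
line_renderer = plot.line(hardware_index, cache_memory, line_width=5)

# in a notebook, show(plot) from bokeh.plotting would display the figure
```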
To display the plot, we then call the show() method we imported from the
bokeh.plotting interface earlier on. The following figure shows the output of the
preceding code:
Figure 6.5: Line plot showing the cache memory of different hardware
Since the interface of different plotting types is unified, scatter plots can be created in
the same way as line plots:
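A short sketch of such a scatter plot, again using hypothetical stand-in values instead of the hardware dataset, could look like this:

```python
from bokeh.plotting import figure

# hypothetical stand-in values for the hardware dataset
hardware_index = [0, 1, 2, 3, 4]
cache_memory = [256, 32, 32, 64, 16]

plot = figure(title='Cache per Hardware',
              x_axis_label='Hardware index',
              y_axis_label='Cache Memory')
# only the glyph method changes compared to a line plot
scatter_renderer = plot.scatter(hardware_index, cache_memory,
                                size=5, color='red')
```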
Figure 6.6: Scatter plot showing the cache memory of different hardware
show(plot)
Figure 6.7: Line plots displaying the cache memory and cycle time per
hardware with the legend
When looking at the preceding example, we can see that once we have several lines,
the visualization can get cluttered.
We can give the user the ability to mute, meaning defocus, the clicked element in
the legend.
plot.legend.click_policy="mute"
show(plot)
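A self-contained sketch of a mutable legend follows. The data values are hypothetical, and the muted_alpha argument (which controls how transparent a muted line becomes) is an addition made for this illustration:

```python
from bokeh.plotting import figure

# hypothetical stand-in values for the hardware dataset
hardware_index = [0, 1, 2, 3, 4]
cache_memory = [256, 32, 32, 64, 16]
cycle_time = [125, 29, 29, 26, 23]

plot = figure(title='Cache and Cycle time per Hardware',
              x_axis_label='Hardware index',
              y_axis_label='Value')
plot.line(hardware_index, cache_memory, line_width=2, color='orange',
          legend_label='Cache Memory', muted_alpha=0.2)
plot.line(hardware_index, cycle_time, line_width=2,
          legend_label='Cycle time', muted_alpha=0.2)

# clicking a legend entry now fades the matching line
plot.legend.click_policy = 'mute'
```

Setting click_policy to 'hide' instead would remove the clicked line from view entirely.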
Figure 6.8: Line plots displaying the cache memory and cycle time per hardware with a
mutable legend; cycle time is also muted
Note
All the exercises and activities in this chapter are developed using
Jupyter Notebook. The files can be downloaded from the following link:
https://packt.live/39txwH5. All the datasets used in this chapter can be found
at https://packt.live/3bzApYN.
import pandas as pd
from bokeh.plotting import figure, show
3. Import and call the output_notebook method from the io interface of Bokeh
to display the plots inside a Jupyter Notebook:
dataset = pd.read_csv('../../Datasets/world_population.csv', \
index_col=0)
5. Verify that our data has been successfully loaded by calling head on
our DataFrame:
dataset.head()
Figure 6.9: Loading the top five rows of the world_population dataset
using the head method
6. Populate our x-axis and y-axis with some data extraction. The x-axis will hold all
the years that are present in our columns. The y-axis will hold the population
density values of the countries. Start with Germany:
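One way this extraction might look is sketched below. The small DataFrame is a made-up stand-in for world_population.csv, assuming that countries form the index and that the year columns are named with digits; the density values are illustrative only:

```python
import pandas as pd

# tiny stand-in for world_population.csv (values are illustrative)
dataset = pd.DataFrame(
    {'Country Code': ['DEU', 'CHE'],
     '1960': [210.2, 135.5],
     '1961': [212.0, 138.4],
     '1962': [213.9, 141.9]},
    index=['Germany', 'Switzerland'])

# year columns are the ones whose name starts with a digit
years = [column for column in dataset.columns if column[0].isdigit()]

# population density of Germany, in the same order as years
de_vals = [dataset.loc['Germany', year] for year in years]

print(years)
print(de_vals)
```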
7. After extracting the necessary data, create a new plot by calling the Bokeh
figure method. Provide parameters such as title, x_axis_label, and
y_axis_label to define the descriptions displayed on our plot. Once our
plot is created, we can add glyphs to it. Here, we will use a simple line. Set the
legend_label parameter next to the x and y values to get an informative
legend in our visualization:
"""
plotting the population density change in Germany in the given years
"""
plot = figure(title='Population Density of Germany', \
x_axis_label='Year', \
y_axis_label='Population Density')
plot.line(years, de_vals, line_width=2, legend_label='Germany')
show(plot)
Figure 6.10: Creating a line plot from the population density data of Germany
8. Now add another country—in this case, Switzerland. Use the same technique
that we used with Germany to extract the data for Switzerland:
9. We can add several layers of glyphs to our figure plot. We can also
stack different glyphs on top of one another to emphasize specific data. Add an
orange line to the plot that displays the data from Switzerland. Also, plot orange
circles for each data point of the ch_vals list and assign them the same
legend_label to combine both representations, the line and the circles, into one
legend entry:
"""
plotting the data for Germany and Switzerland in one visualization,
adding circles for each data point for Switzerland
"""
plot = \
figure(title='Population Density of Germany and Switzerland', \
x_axis_label='Year', y_axis_label='Population Density')
plot.line(years, de_vals, line_width=2, legend_label='Germany')
plot.line(years, ch_vals, line_width=2, color='orange', \
          legend_label='Switzerland')
plot.circle(years, ch_vals, size=4, line_color='orange', \
fill_color='white', legend_label='Switzerland')
show(plot)
10. When looking at a larger amount of data for different countries, it makes sense
to have a plot for each of them separately. Use the gridplot layout:
"""
plotting the Germany and Switzerland plot in two different
visualizations that are interconnected in terms of view port
"""
from bokeh.layouts import gridplot
plot_de = figure(title='Population Density of Germany', \
x_axis_label='Year', \
y_axis_label='Population Density', \
plot_height=300)
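The listing above stops after the first figure. A self-contained sketch of the whole idea, using made-up values and assuming that the two plots are linked by sharing their ranges, might look like this:

```python
from bokeh.layouts import gridplot
from bokeh.plotting import figure

# made-up stand-in values for both countries
years = [2010, 2011, 2012]
de_vals = [234.6, 230.0, 230.8]
ch_vals = [198.0, 200.2, 202.4]

plot_de = figure(title='Population Density of Germany',
                 x_axis_label='Year',
                 y_axis_label='Population Density')
plot_de.line(years, de_vals, line_width=2)

# sharing the ranges links panning and zooming between both plots
plot_ch = figure(title='Population Density of Switzerland',
                 x_axis_label='Year',
                 y_axis_label='Population Density',
                 x_range=plot_de.x_range,
                 y_range=plot_de.y_range)
plot_ch.line(years, ch_vals, line_width=2, color='orange')

# one inner list per row: both plots end up next to each other
plot_grid = gridplot([[plot_de, plot_ch]])
```

Passing [[plot_de], [plot_ch]] instead places one plot per row, which arranges them vertically.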
Figure 6.12: Using a gridplot to display the country plots next to each other
Figure 6.13: Using the gridplot method to arrange the visualizations vertically
Note
To access the source code for this specific section, please refer to
https://packt.live/2Beg0KY.
We have now covered the very basics of Bokeh. Using the plotting interface makes
it easy to get some quick visualizations in place. This helps you understand the data
you're working with.
This simplicity is achieved by abstracting away complexity, and we lose much control
by using the plotting interface. In the next exercise, we'll compare the plotting
and models interfaces to show you how much abstraction is added to plotting.
import numpy as np
import pandas as pd
from bokeh.io import output_notebook
output_notebook()
dataset = pd.read_csv('../../Datasets/world_population.csv', \
index_col=0)
4. Call head on our DataFrame to verify that our data has been
successfully loaded:
dataset.head()
Figure 6.14: Loading the top five rows of the world_population dataset
using the head method
6. Create three lists that have years present in the dataset, the mean population
density for the whole dataset for each year, and the mean population density
per year for Japan:
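A possible way to build these three lists is sketched below on a made-up stand-in for the dataset (countries as the index, year columns named with digits). The names mean_pop_vals and jp_vals follow the later listings of this exercise:

```python
import numpy as np
import pandas as pd

# tiny stand-in for world_population.csv (values are illustrative)
dataset = pd.DataFrame(
    {'Country Code': ['DEU', 'JPN', 'CHE'],
     '1960': [200.0, 250.0, 150.0],
     '1961': [210.0, 260.0, 160.0]},
    index=['Germany', 'Japan', 'Switzerland'])

# all year columns present in the dataset
years = [column for column in dataset.columns if column[0].isdigit()]

# mean population density over all countries for each year
mean_pop_vals = [np.mean(dataset[year]) for year in years]

# one single-element Series per year, indexed by 'Japan'
jp_vals = [dataset.loc[['Japan']][year] for year in years]

print(years, mean_pop_vals)
```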
7. Create a plot and add our glyph elements to it. Plot the global mean
with a line and the mean of Japan with crosses. Set the legend location to the
bottom-right corner:
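plot = \
figure(title='Global Mean Population Density compared to Japan', \
       x_axis_label='Year', y_axis_label='Population Density')
# global mean as a line, Japan as crosses
plot.line(years, mean_pop_vals, line_width=2, \
          legend_label='Global Mean')
plot.cross(years, jp_vals, legend_label='Japan', line_color='red')
plot.legend.location = 'bottom_right'
show(plot)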
Figure 6.15: Line plots comparing the global mean population density with that of Japan
The models interface is of a much lower level than other interfaces. We can
already see this when looking at the list of imports we need for a
comparable plot.
9. Before we build our plot, we have to find the min and max values for the y-axis
since we don't want to have too large or too small a range of values. Get all the
mean values for global and Japan without any invalid values. Get their smallest
and largest values and pass them to the constructor of Range1d. For the x-axis,
our list of years is pre-defined:
# skip the first and last entries of both lists
extracted_mean_pop_vals = \
[val for i, val in enumerate(mean_pop_vals) \
 if i not in [0, len(mean_pop_vals) - 1]]
extracted_jp_vals = \
[jp_val['Japan'] for i, jp_val in enumerate(jp_vals) \
 if i not in [0, len(jp_vals) - 1]]
min_pop_density = min(extracted_mean_pop_vals)
min_jp_density = min(extracted_jp_vals)
min_y = int(min(min_pop_density, min_jp_density))
max_pop_density = max(extracted_mean_pop_vals)
max_jp_density = max(extracted_jp_vals)
max_y = int(max(max_jp_density, max_pop_density))
xdr = Range1d(int(years[0]), int(years[-1]))
ydr = Range1d(min_y, max_y)
10. Next, create two Axis objects, which will be used to display the axis lines and
the label for the axis. Since we also want ticks between the different values, pass
in a Ticker object that creates this setup:
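A minimal sketch of this step is shown below. The ranges and the number of desired ticks are placeholders chosen for this example, and the axes are attached right away so that the snippet is complete on its own:

```python
from bokeh.models import BasicTicker, LinearAxis, Plot, Range1d

# placeholder ranges for the sake of a self-contained example
plot = Plot(x_range=Range1d(1960, 2016), y_range=Range1d(0, 400))

# each axis gets its own ticker that decides where ticks are placed
x_axis = LinearAxis(axis_label='Year',
                    ticker=BasicTicker(desired_num_ticks=10))
y_axis = LinearAxis(axis_label='Population Density',
                    ticker=BasicTicker(desired_num_ticks=10))

# layout elements are attached to a side of the plot
plot.add_layout(x_axis, 'below')
plot.add_layout(y_axis, 'left')
```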
11. Create the title by passing a Title object to the title attribute of the
Plot object:
# creating the plot object
title = \
Title(align = 'left', \
text = 'Global Mean Population Density compared to Japan')
plot = Plot(x_range=xdr, y_range=ydr, plot_width=650, \
plot_height=600, title=title)
12. Try to display our plot now by using the show method. Since we have no
renderers defined at the moment, we will get an error. We need to add elements
to our plot:
"""
error will be thrown because we are missing renderers that are
created when adding elements
"""
show(plot)
13. Insert the data into a DataSource object. This can then be used to map the
data source to the glyph object that will be displayed in the plot:
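The following sketch shows the idea with a ColumnDataSource and a Line glyph. The data values, the color, and the ranges are assumptions made for this illustration; only the line of the global mean is covered here:

```python
from bokeh.models import ColumnDataSource, Line, Plot, Range1d

# made-up stand-in values
years = [1960, 1961, 1962]
mean_pop_vals = [200.0, 210.0, 220.0]

# the data source holds named columns
line_source = ColumnDataSource(dict(x=years, y=mean_pop_vals))

# the glyph refers to the columns by name, not to the data itself
line_glyph = Line(x='x', y='y', line_color='#2678b2', line_width=2)

plot = Plot(x_range=Range1d(1960, 1962), y_range=Range1d(0, 300))
# add_glyph connects source and glyph and returns a renderer
line_renderer = plot.add_glyph(line_source, line_glyph)
```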
14. Use the appropriate add method to add objects to the plot. For layout elements such as
the Axis objects, use the add_layout method. Glyphs, which display our data,
have to be added with the add_glyph method:
plot.add_layout(x_axis, 'below')
plot.add_layout(y_axis, 'left')
line_renderer = plot.add_glyph(line_source, line_glyph)
cross_renderer = plot.add_glyph(cross_source, cross_glyph)
15. Show our plot again to see our lines are in place:
show(plot)
Figure 6.17: A models interface-based plot displaying the lines and axes
16. Use a Legend object to add a legend to the plot. Each LegendItem object will be
displayed in one line in the legend:
17. Create the grid by instantiating two Grid objects, one for each axis. Provide the
tickers of the previously created x and y axes:
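The legend and grid steps can be sketched together as follows. All values, labels, and the legend location are placeholders for this illustration, and the snippet recreates a minimal plot so that it runs on its own:

```python
from bokeh.models import (BasicTicker, ColumnDataSource, Grid, Legend,
                          LegendItem, Line, LinearAxis, Plot, Range1d)

plot = Plot(x_range=Range1d(1960, 1962), y_range=Range1d(0, 300))
x_axis = LinearAxis(ticker=BasicTicker())
y_axis = LinearAxis(ticker=BasicTicker())
plot.add_layout(x_axis, 'below')
plot.add_layout(y_axis, 'left')

source = ColumnDataSource(dict(x=[1960, 1961, 1962],
                               y=[200.0, 210.0, 220.0]))
line_renderer = plot.add_glyph(source, Line(x='x', y='y'))

# each LegendItem ties a label to one or more renderers
legend = Legend(items=[LegendItem(label='Global Mean',
                                  renderers=[line_renderer])],
                location='bottom_right')

# grids reuse the tickers of the axes so that lines match the ticks
x_grid = Grid(dimension=0, ticker=x_axis.ticker)
y_grid = Grid(dimension=1, ticker=y_axis.ticker)

plot.add_layout(legend)
plot.add_layout(x_grid)
plot.add_layout(y_grid)
```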
18. Finally, use the add_layout method to add the grid and the legend to our plot.
After this, display our complete plot, which will look like the one we created in
the first task using the plotting interface with only four lines of code:
plot.add_layout(legend)
plot.add_layout(x_grid)
plot.add_layout(y_grid)
show(plot)
Figure 6.18: Full recreation of the visualization done with the plotting interface
As you can see, the models interface should not be used for simple plots. It's
meant to provide the full power of Bokeh to experienced users that have specific
requirements that need more than the plotting interface.
Note
To access the source code for this specific section, please refer to
https://packt.live/3fq8pIf.
We have now looked at the difference between the high-level plotting and low-level
models interfaces. This will help us understand the internal workings and
potential future errors better. In the following activity, we'll use what we've already
learned to create a basic visualization that plots the mean car price of each
manufacturer from our dataset.
Next, we will color each data point with a color based on a given value. In Bokeh, like
in geoplotlib, this can be done using ColorMapper.
ColorMapper can map specific values to a given color in the selected spectrum. By
providing the minimum and maximum value for a variable, we define the range in
which colors are returned:
color_mapper = LinearColorMapper(palette='Magma256', \
low=min(dataset['cach']), \
high=max(dataset['cach']))
show(plot)
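A self-contained sketch of color mapping follows. The cache values are hypothetical stand-ins for the cach column, and the mapper is applied through the field and transform keys of the color argument:

```python
from bokeh.models import ColumnDataSource, LinearColorMapper
from bokeh.plotting import figure

# hypothetical stand-in for the cach column of the dataset
cache_memory = [256, 32, 32, 64, 16]
source = ColumnDataSource(dict(x=list(range(len(cache_memory))),
                               y=cache_memory))

# values between low and high are spread across the palette
color_mapper = LinearColorMapper(palette='Magma256',
                                 low=min(cache_memory),
                                 high=max(cache_memory))

plot = figure(title='Cache per Hardware',
              x_axis_label='Hardware index',
              y_axis_label='Cache Memory')
# the color of each point is derived from its y value
plot.scatter(x='x', y='y', size=10, source=source,
             color={'field': 'y', 'transform': color_mapper})
```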
Next, we will implement all the concepts related to Bokeh we have learned so far.
Note that we will use only the make and price columns in our activity.
In the process, we will first plot all cars with their prices and then slowly develop
a more sophisticated visualization that also uses color to visually emphasize the
manufacturers with the highest mean prices.
3. Load the automobiles.csv dataset from the Datasets folder using pandas.
Make sure that the dataset is loaded by displaying the first five elements of
the dataset.
5. Add a new column index to our dataset by assigning it to the values from our
dataset.index.
6. Create a new figure and plot each car using a scatter plot with the index and
price column. Give the visualization a title of Car prices and name the x-axis
Car Index. The y-axis should be named Price.
Grouping cars from manufacturers together
7. Group the dataset using groupby and the column make. Then use the mean
method to get the mean value for each column. We don't want the make
column to be used as an index, so provide the as_index=False
argument to groupby.
Adding color
12. Plot each manufacturer and provide a size argument with a size of 15.
13. Provide the color argument to the scatter method and use the field and
transform attributes to provide the column (y) and the color_mapper.
14. Set the label orientation to vertical.
Figure 6.20: Final visualization displaying the mean car price for each manufacturer
Note
The solution for this activity can be found on page 456.
In the next section, we will create interactive visualizations that allow the user to
modify the data that is displayed.
Adding Widgets
One of the most powerful features of Bokeh is the ability to use widgets to
interactively change the data that's displayed in a visualization. To understand the
importance of interactivity in your visualizations, imagine seeing a static visualization
about stock prices that only shows data for the last year.
If you're interested in seeing the current year or even visually comparing it to the
recent and coming years, static plots won't be suitable. You would need to create one
plot for every year or even overlay different years on one visualization, which would
make it much harder to read.
Comparing this to a simple plot that lets the user select the date range they want, we
can already see the advantages. You can guide the user by restricting values and only
displaying what you want them to see. Developing a story behind your visualization is
very important, and doing this is much easier if the user has ways of interacting with
the data.
Bokeh widgets work best when used in combination with the Bokeh server. However,
using the Bokeh server approach is beyond the scope of this book, since we would
need to work with plain Python files. Instead, we will use a hybrid approach that
only works with the Jupyter Notebook.
We will look at the different widgets and how to use them before going in and
building a basic plot with one of them. There are a few different options regarding
how to trigger updates, which are also explained in this section. The widgets that will
be covered in the following exercise are explained in the following table:
The general way to create a new widget visible in a Jupyter Notebook is to define
a new method and wrap it into an interact widget. We'll be using the "syntactic
sugar" way of adding a decorator to a method—that is, by using annotations. This will
give us an interactive element that will be displayed after the executable cell, like in
the following example:
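A minimal sketch of such a cell, reconstructed from the surrounding description (the function name and the default text are assumptions), could look like this:

```python
from ipywidgets import interact

# a string value makes interact render a text box
@interact(Value='Input Text')
def text_input(Value):
    print(Value)
```

When executed in a Jupyter Notebook, the widget appears below the cell, and the function is called again each time the text changes.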
In the preceding example, we first import the interact element from the
ipywidgets library. This then allows us to define a new method and annotate it
with the @interact decorator.
The Value attribute tells the interact element which widget to use based on the
data type of the argument. In our example, we provide a string, which will give us a
TextBox widget. We can refer to the preceding table to determine which Value
data type will return which widget.
The print statement in the preceding code prints whatever has been entered in the
textbox below the widget.
Note
The methods that we can use interact with always have the same structure.
We will look at several examples in the following exercise.
3. In this first task, we will add interactive widgets to the interactive element of
IPython. Import the necessary interact and interact_manual elements
from ipywidgets:
4. Create a checkbox widget and print out the result of the interactive element:
@interact(Value=False)
def checkbox(Value=False):
    print(Value)
Figure 6.23: Interactive checkbox that will switch from False to True if checked
Note
@interact() is called a decorator. It wraps the annotated method into
the interact component. This allows us to display and react to the change of
the drop-down menu. The method will be executed every time the value of
the dropdown changes.
@interact(Value=options)
def dropdown(Value=options[0]):
    print(Value)
7. Create two widgets, a dropdown and a checkbox with the same value, as in the
last two tasks:
@interact(Select=options, Display=False)
def uif(Select, Display):
    print(Select, Display)
9. Create an int slider using values of 0 and 100 as the min and max values
passed to the @interact decorator. Set continuous_update to False to only trigger
an update on mouse release:
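from ipywidgets import widgets
# the slider only reports its value once the mouse is released
slider = widgets.IntSlider(min=0, max=100, continuous_update=False)
@interact(Value=slider)
def slider(Value=0.0):
    print(Value)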
Figure 6.28: Interactive int slider that only triggers upon mouse release
Note
Although the outputs of Figure 6.27 and Figure 6.28 look the same, in Figure
6.28, the slider triggers only upon mouse release.
Note
Compared to the previous cells, this one contains the interact_manual
decorator instead of interact. This will add an execution button that
will trigger the update of the value instead of triggering with every change.
This can be really useful when working with larger datasets, where the
recalculation time would be large. Because of this, you don't want to trigger
the execution for every small step, but only once you have selected the
correct value.
Note
To access the source code for this specific section, please refer to
https://packt.live/3e8G60B.
After looking at several example widgets and how to create and use them in the
previous exercise, we will now use a real-world stock_price dataset to create a
basic plot and add simple interactive widgets.
The dataset of this exercise is a stock_prices dataset. This means that we will be
looking at data over a range of time. As this is a large and variable dataset, it will be
easier to show and explain widgets such as slider and dropdown on it. The dataset
is available in the Datasets folder of the GitHub repository; here is the link to it:
https://packt.live/3bzApYN. Follow these steps:
import pandas as pd
4. After downloading the dataset and moving it into the Datasets folder of this
chapter, import our stock_prices.csv data:
dataset = pd.read_csv('../../Datasets/stock_prices.csv')
5. Test whether the data has been loaded successfully by executing the head
method on the dataset:
dataset.head()
Figure 6.30: Loading the top five rows of the stock_prices dataset using the head method
Since the date column has no information about the hour, minute, and second,
we want to avoid displaying them in the visualization later on and display the
year, month, and day.
6. Create a new column that holds the formatted short version of the date value.
Print out the first five rows of the dataset to see the new column, short_date:
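One possible way to derive such a column is sketched below on a made-up stand-in for the stock dataset. It assumes that the date column holds strings with a time component and uses pandas' datetime helpers, which is not necessarily how the original solution does it:

```python
import pandas as pd

# tiny stand-in for stock_prices.csv (values are illustrative)
dataset = pd.DataFrame({
    'date': ['2016-01-05 00:00:00',
             '2016-01-06 00:00:00',
             '2016-01-07 00:00:00'],
    'symbol': ['AAPL', 'AAPL', 'AAPL'],
    'high': [105.85, 102.37, 100.13]})

# parse the timestamps and keep only year, month, and day
dataset['short_date'] = \
    pd.to_datetime(dataset['date']).dt.strftime('%Y-%m-%d')

print(dataset[['date', 'short_date']])
```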
dataset.head()
Note
The execution of the cell will take a moment since it's a fairly large dataset.
Please be patient.
In this task, we will create a basic visualization with the stock price dataset.
This will be your first interactive visualization in which you can dynamically
change the stock that is displayed in the graph. We will get used to one of the
aforementioned interactive widgets: the drop-down menu. It will be the main
point of interaction for our visualization.
7. Import the already-familiar figure and show methods from the plotting
interface. Since we also want to have a panel with two tabs displaying different
plot styles, also import the Panel and Tabs classes from the models interface:
8. Create two tabs. The first tab will contain a line plot of the given data, while the
second will contain a circle-based representation of the same data. Create a
legend that will display the name of the currently viewed stock:
# function name assumed here; it returns the tab-based plot for one stock
def get_plot(stock):
    stock_name = stock['symbol'].unique()[0]
    line_plot=figure(title='Stock prices', \
                     x_axis_label='Date', \
                     x_range=stock['short_date'], \
                     y_axis_label='Price in $USD')
    line_plot.line(stock['short_date'], stock['high'], \
                   legend_label=stock_name)
    line_plot.xaxis.major_label_orientation = 1
    circle_plot=figure(title='Stock prices', \
                       x_axis_label='Date', \
                       x_range=stock['short_date'], \
                       y_axis_label='Price in $USD')
    circle_plot.circle(stock['short_date'], stock['high'], \
                       legend_label=stock_name)
    circle_plot.xaxis.major_label_orientation = 1
    line_tab=Panel(child=line_plot, title='Line')
    circle_tab=Panel(child=circle_plot, title='Circles')
    tabs = Tabs(tabs=[ line_tab, circle_tab ])
    return tabs
9. Get a list of all the stock names in our dataset by using the unique method for
our symbol column:
Once we have done this, use this list as an input for the interact element.
10. Add the drop-down widget in the decorator and call the method that returns our
visualization in the show method with the selected stock. Only provide the first
25 entries of each stock. By default, the stock of Apple should be displayed; its
symbol in the dataset is AAPL. This will give us a visualization that is displayed
in a pane with two tabs. The first tab will display an interpolated line, and the
second tab will display the values as circles:
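The data handling behind the last two steps can be sketched without any widgets. The DataFrame below is a made-up stand-in for the stock dataset, and first_entries is a hypothetical helper introduced only for this example:

```python
import pandas as pd

# tiny stand-in for stock_prices.csv (values are illustrative)
dataset = pd.DataFrame({
    'symbol': ['AAPL', 'MSFT', 'AAPL', 'MSFT', 'AAPL'],
    'short_date': ['2016-01-05', '2016-01-05', '2016-01-06',
                   '2016-01-06', '2016-01-07'],
    'high': [105.85, 55.39, 102.37, 54.40, 100.13]})

# list of all stock names, usable as drop-down options
stock_names = dataset['symbol'].unique()

def first_entries(symbol='AAPL', count=25):
    """Return the first count rows of the selected stock."""
    return dataset[dataset['symbol'] == symbol][:count]

stock = first_entries('AAPL')
print(list(stock_names), len(stock))
```

In the notebook, a function like this would be called from the method decorated with @interact, and its result would be passed on to the plotting method.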
The following screenshot shows the output of the code in step 11:
Note
We can already see that each date is displayed on the x-axis. If we want to
display a bigger time range, we have to customize the ticks on our x-axis.
This can be done using ticker objects.
Note
To access the source code for this specific section, please refer to
https://packt.live/3fnfPvI.
We have now covered the very basics of widgets and how to use them in a
Jupyter Notebook.
Note
If you want to learn more about using widgets and which widgets can be
used in Jupyter, visit
https://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html and
https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html.
In the following activity, we will make use of the Bokeh DataSource (here, a
ColumnDataSource) to add a tooltip overlay to our plot that is displayed upon
hovering over the data points. In most cases, we can use pandas DataFrames to
feed data into our plot, but certain features, such as tooltips, require a
DataSource:
data_source = \
ColumnDataSource(data=dict(vendor_name=dataset['vendor_name'], \
model=dataset['model'], \
cach=dataset['cach'], \
x=dataset['index'], \
y=dataset['cach']))
show(plot)
Figure 6.34: Cache memory plotted as dots with tooltip overlay displaying the vendor,
model, and amount of memory
We want to use the nationality, gold, silver, and bronze columns to create
a custom visualization that lets us dig through the Olympians.
Our visualization will display each country that participated in a coordinate system
where the x-axis represents the number of medals won and the y-axis represents the
number of athletes. Using interactive widgets, we will be able to filter the displayed
countries by both the maximum number of medals won and the maximum number
of athletes.
Figure 6.35: Final interactive visualization that displays the scatter plot
There are many options when it comes to choosing which interactivity to use. We will
focus on only two widgets to make it easier for you to understand the concepts. In
the end, we will have a visualization that allows us to filter countries by the number
of medals they won and the number of athletes they placed in the Olympics, and that
shows more information about each country upon hovering over its data point:
3. Import figure and show from Bokeh and interact and widgets from
ipywidgets to get started.
4. Load our olympia2016_athletes.csv dataset from the Datasets folder
and set up the interaction elements. Scroll down until you reach the cell that says
getting the max number of medals and athletes of all countries. Extract the
two numbers from the dataset.
6. Set up the @interact method, which will display the complete visualization.
The only code we will write here is to show the return value of the get_plot
method that gets all the interaction element values as parameters.
7. After implementing the decorated method, move up in the Notebook and work on
the get_plot method.
8. First, filter our countries dataset that contains all the countries that placed
athletes in the Olympic games. Check whether they have a lower or equal
number of medals and athletes than our max values passed as arguments.
9. Create our DataSource and use it for the tooltips and the printing of the
circle glyphs.
10. After that, create a new plot using the figure method that has the following
attributes: title set to Rio Olympics 2016 - Medal comparison,
x_axis_label set to Number of Medals, and y_axis_label set to
Num of Athletes.
11. Execute every cell starting from the get_plot cell to the bottom—again,
making sure that all implementations are captured.
12. When executing the cell that contains the @interact decorator, you will
see a scatter plot that displays a circle for every country displaying additional
information, such as the shortcode of the country, the number of athletes,
and the number of gold, silver, and bronze medals.
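The filtering described in step 8 can be sketched as follows. This is only an illustration under assumptions: the per-country table, its column names (medals, athletes), and the helper filter_countries are hypothetical and not taken from the activity's notebook, and the values are invented.

```python
import pandas as pd

# Hypothetical per-country summary with invented values and placeholder codes.
countries = pd.DataFrame({
    'nationality': ['AAA', 'BBB', 'CCC'],
    'medals': [264, 160, 13],
    'athletes': [567, 441, 53],
})

def filter_countries(data, max_medals, max_athletes):
    """Keep countries that are at or below both slider values."""
    mask = (data['medals'] <= max_medals) & (data['athletes'] <= max_athletes)
    return data[mask]

filtered = filter_countries(countries, max_medals=200, max_athletes=500)
print(filtered['nationality'].tolist())
```

The filtered frame is what would then be wrapped in the DataSource of step 9, so that the circles and the tooltips read from the same data.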
Note
The solution for this activity can be found on page 465.
As we mentioned before, when working with interactive features and Bokeh, you
might want to read up about the Bokeh server a little bit more. It will give you more
options to express your creativity by creating animated plots and visualizations that
can be explored by several people at the same time.
Summary
In this chapter, we have looked at another option for creating visualizations with a
whole new focus: web-based Bokeh plots. We also discovered ways in which we can
make our visualizations more interactive and give the user the chance to explore data
in a different way.
As we mentioned in the first part of this chapter, Bokeh is a comparatively new tool
that empowers developers to use their favorite language to create easily portable
visualizations for the web. After working with Matplotlib, Seaborn, geoplotlib, and
Bokeh, we can see some standard interfaces and similar ways to work with those
libraries. After studying the tools that are covered in this book, it will be simple to
understand new plotting tools.
In the next and final chapter, we will introduce a new real-life dataset to create
visualizations. This last chapter will allow you to consolidate the concepts and tools
that you have learned about in this book and further enhance your skills.
7
Combining What We Have
Learned
Overview
In this chapter, we will apply all the concepts that we have learned in all
the previous chapters. We will use three new datasets in combination
with practical activities for Matplotlib, Seaborn, geoplotlib, and Bokeh. By
the end of this chapter, you will be able to apply your skills in Matplotlib
and Seaborn. We will create a time series with Bokeh, and finally, we will
analyze geospatial data with geoplotlib. We will conclude this chapter with a
summary that recaps what we've learned throughout the book.
Introduction
In recent chapters, we've learned about some of the most widely used and state-of-
the-art visualization libraries for Python. In the previous chapter, we advanced from
simple static plots to building interactive visualizations using Bokeh, which allowed us
to gain control over what is displayed to the users.
To consolidate what we have learned, we will provide you with three sophisticated
activities. Each activity uses one of the libraries that we have covered in this book.
Each activity has a more extensive dataset than we have used before, which will
prepare you to work with real-world examples.
In the first activity, we will consolidate the acquired knowledge in Matplotlib and
Seaborn. For a quick recap, Matplotlib allows the generation of various plot types
with just a few lines of code. Seaborn is based on Matplotlib and provides a high-level
interface for creating visually appealing charts. It dramatically extends Matplotlib with
predefined visualization styles and color palettes.
Note
All activities will be developed in the Jupyter Notebook or Jupyter Lab.
Please download the GitHub repository with all the prepared templates and
datasets from https://packt.live/2tSthph.
The American Community Survey (ACS) Public-Use Microdata Samples (PUMS) dataset
(one-year estimate from 2017) from https://www.census.gov/programs-surveys/acs/
technical-documentation/pums/documentation.2017.html is used.
Download the following datasets and place the extracted CSV file in the
Datasets subdirectory: https://www2.census.gov/programs-surveys/acs/data/
pums/2017/1-Year/csv_pny.zip and https://www2.census.gov/programs-surveys/acs/data/
pums/2017/1-Year/csv_hny.zip.
2. Use pandas to read both CSV files located in the Datasets folder.
3. Use the given PUMA (public use microdata area code based on the 2010 census
definition, which are areas with populations of 100k or more) ranges to further
divide the dataset into NYC districts (Bronx, Manhattan, Staten Island, Brooklyn,
and Queens):
# PUMA ranges
bronx = [3701, 3710]
manhatten = [3801, 3810]
staten_island = [3901, 3903]
brooklyn = [4001, 4018]
queens = [4101, 4114]
nyc = [bronx[0], queens[1]]
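One possible way to apply these ranges is sketched below on a few made-up rows. The column names PUMA and WAGP are assumed here, and the helper in_range is hypothetical; the bounds are treated as inclusive.

```python
import pandas as pd

# PUMA ranges copied from the listing above (treated as inclusive bounds).
bronx = [3701, 3710]
queens = [4101, 4114]
nyc = [bronx[0], queens[1]]

# Made-up rows; 'PUMA' and 'WAGP' are assumed column names.
people = pd.DataFrame({'PUMA': [100, 3705, 3902, 4114, 4200],
                       'WAGP': [30000, 42000, 51000, 38000, 45000]})

def in_range(data, puma_range):
    """Select the rows whose PUMA code lies within [low, high]."""
    low, high = puma_range
    return data[data['PUMA'].between(low, high)]

print(len(in_range(people, bronx)))
print(len(in_range(people, nyc)))
```

Because the district ranges are contiguous, the nyc range built from the first Bronx code and the last Queens code covers all five districts in one selection.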
4. In the dataset, each sample has a weight that reflects how much it counts
toward the total population. Therefore, we cannot simply calculate the median
of the raw values. Use the given weighted_median function in the following code
to compute the median:
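The function supplied with the activity is not reproduced here. To illustrate the idea only, a minimal weighted median could look like the following sketch; the name weighted_median_sketch is hypothetical, and the notebook's own function may differ, for example in how it handles ties.

```python
def weighted_median_sketch(values, weights):
    """Return the smallest value at which the cumulative weight reaches
    half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError('values must not be empty')

# Three made-up samples: the first one represents ten households.
incomes = [20000, 50000, 90000]
weights = [10, 1, 1]
print(weighted_median_sketch(incomes, weights))
```

The unweighted median of these three incomes would be 50000, which shows why the sample weights cannot be ignored.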
5. In this subtask, we will create a plot containing multiple subplots that visualize
information with regard to NYC wages. Before we create the plots, some data
wrangling is necessary.
6. Compute the average wage by gender for the given occupation categories for the
population of NYC:
7. Compute the wage frequencies for New York and NYC. Use the following yearly
wage intervals: 10k steps between 0 and 100k, 50k steps between 100k and
200k, and >200k:
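The interval scheme can be written down as a list of bin edges. The sketch below is one possible reading of it, with assumptions: wages are non-negative, each interval includes its lower edge, and the helper wage_frequencies is hypothetical.

```python
from bisect import bisect_right

# Lower edges of the yearly wage intervals: 10k steps up to 100k,
# 50k steps up to 200k, and a final open-ended interval from 200k upward.
edges = list(range(0, 100_000, 10_000)) + [100_000, 150_000, 200_000]

def wage_frequencies(wages, weights=None):
    """Return the share of (optionally weighted) samples per interval."""
    weights = weights or [1] * len(wages)
    counts = [0] * len(edges)
    for wage, weight in zip(wages, weights):
        # Index of the last edge that is <= wage; negative wages go to bin 0.
        index = max(bisect_right(edges, wage) - 1, 0)
        counts[index] += weight
    total = sum(counts)
    return [count / total for count in counts]

frequencies = wage_frequencies([5_000, 15_000, 120_000, 250_000])
print(len(edges), frequencies)
```

This gives 13 intervals in total: ten of width 10k, two of width 50k, and one open-ended interval.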
8. Create a plot containing multiple subplots that visualize information with regard
to NYC wages. First, visualize the median household income for the US, New
York, NYC, and its districts. Next, visualize the average wage by gender for
the given occupation categories for the population of NYC. Lastly, visualize the
wage distribution for New York and NYC, using the following yearly wage
intervals: 10k steps between 0 and 100k, 50k steps between 100k and 200k,
and >200k.
Figure 7.1: Wage statistics for New York City in comparison with New York
and the United States
9. Use a tree map to visualize the percentage for the given occupation
subcategories for the population of NYC:
occ_subcategories = \
{'Management,\nBusiness,\nand Financial': [10, 950], \
'Computer, Engineering,\nand Science': [1000, 1965], \
'Education,\nLegal,\nCommunity Service,'\
'\nArts,\nand Media': [2000, 2960], \
'Healthcare\nPractitioners\nand\nTechnical': [3000, 3540], \
'Service': [3600, 4650], \
'Sales\nand Related': [4700, 4965], \
'Office\nand Administrative\nSupport': [5000, 5940], \
'': [6000, 6130], \
'Construction\nand Extraction': [6200, 6940], \
'Installation,\nMaintenance,\nand Repair': [7000, 7630], \
'Production': [7700, 8965], \
'Transportation\nand Material\nMoving': [9000, 9750]}
10. Use a heatmap to show the correlation between difficulties (self-care difficulty,
hearing difficulty, vision difficulty, independent living difficulty, ambulatory
difficulty, veteran service-connected disability, and cognitive difficulty) and age
groups (<5, 5-11, 12-14, 15-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, and 75+)
in NYC. Following is the expected output:
Note
The solution to this activity can be found on page 472.
In the next section, we will perform an activity on Bokeh using a real-life scenario.
Bokeh
Stock price data is one of the most exciting types of data for many people. It is
highly dynamic and continually changing. To understand it, we need high levels of
interactivity to not only look at the stocks of interest, but also to compare different
stocks, see their traded volume and the highs/lows of the given dates, and see
whether the price rose or fell compared with the day before.
4. Make sure that the dataset is loaded by displaying the first five elements of
the dataset.
7. Import figure and show from Bokeh and interact and widgets from
ipywidgets to get started.
8. Execute the cells from top to bottom until you reach the cell that has
the comment #extracting the necessary data. Start your
implementation there.
9. Get the unique stock names from the dataset. Filter the dates to keep only
those from 2016, and get the unique dates among them. Create a list that
contains the strings open-close and volume, which will be used for the radio
buttons to switch between the two plots.
10. After extracting the necessary data, set up the interaction elements. Create
widgets for the following: a dropdown for the first stock name (the default value
will be AAPL) and a dropdown for the second stock name that will be compared
to the first (the default value will be AON).
12. Define a RadioButtons widget to choose between the candlestick plot and
the plot that displays the traded volume (the default value will be open-close,
which will display the candlestick plot).
13. Set up the @interact method that finally displays the complete visualization.
Provide the interaction elements that have just been set up with the
@interact decorator and call the show method with the get_plot method we
executed before.
14. After implementing the decorated method, move up in our notebook and
work on the add_candle_plot method, which draws the so-called candlestick
visualization that is often used with stock price data. The method gets a
plot object, a stock_name, a stock_range containing the data of only the
selected date range that was defined with the widgets, and a color for the
line. Create a segment for the vertical line of each candle, and either a green
or red vbar to color code whether the close price is lower than the open price.
Once the candles are created, calculate the mean for every (high/low) pair and
draw a continuous line with the given color running through the mean point of
each candle.
15. Move on and implement the line plot in the cell that contains the get_plot
method. Plot a line for the data from stock_1 with a blue color. Plot a line for
the data from stock_2 with an orange color.
16. Before finalizing this activity, add muting to our legend, so that clicking on
one of the entries in the legend of the visualization changes the way the
corresponding elements are displayed. The resulting visualization should look
somewhat like the following image:
Figure 7.4: Final interactive visualization that displays the candlestick plot
The following figure shows the final interactive visualization of the volume plot:
Figure 7.5: Final interactive visualization that displays the volume plot
Note
The solution to this activity can be found on page 484.
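The per-candle calculations from step 14 can be tried out independently of any plotting. The following sketch is a plain-Python illustration under assumptions: the helper candle_summary and the sample prices are made up, and red is assumed to mark a close below the open.

```python
def candle_summary(open_price, close_price, high, low):
    """Return the candle color and the mean of the high/low pair."""
    # Assumed convention: red if the close is lower than the open, else green.
    color = 'red' if close_price < open_price else 'green'
    midpoint = (high + low) / 2
    return color, midpoint

# Two made-up trading days.
days = [
    {'open': 10.0, 'close': 11.0, 'high': 12.0, 'low': 9.0},
    {'open': 11.0, 'close': 10.5, 'high': 11.5, 'low': 10.0},
]
summaries = [candle_summary(d['open'], d['close'], d['high'], d['low'])
             for d in days]
print(summaries)
```

In the activity, the colors would drive the vbar glyphs, and the midpoints would form the continuous line drawn through the candles.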
As we mentioned before, when working with interactive features and Bokeh, you
might want to read up about the Bokeh server a little bit more. It will give you more
options to create animated plots and visualizations that can be explored by several
people at the same time.
Geoplotlib
The dataset that is used in this activity is from Airbnb, which is publicly available
online. Accommodation listings have two predominant features: latitude and
longitude. Those two features allow us to create geospatial visualizations that give
us a better understanding of attributes such as the distribution of accommodation
across each city.
In this activity, we will use geoplotlib to create a visualization that maps each
accommodation to a dot on a map. Each dot is colored based on either the price or
rating of that listing. The two attributes can be switched by pressing the left and right
keys on the keyboard.
In theory, we should see a price increase the closer we get to the center of
Manhattan. It will be very fascinating to see whether the ratings for the given
accommodations also increase as we get closer to the center of Manhattan:
3. Understand the dataset by observing the variables and the first few entries.
4. Since our dataset once again has columns that are named Latitude
and Longitude instead of lat and lon, rename those columns to their
short versions.
5. To use a color map that changes color based on the price of accommodation, we
need a numeric value that can easily be compared with the price of any
other listing.
Therefore, create a new column called dollar_price that will hold the value
of the price column as a float. Make sure to fill all the NaN values of the price
column with $0.0, and those of the review_scores_rating column with 0.0, by
using the fillna() method of the dataset.
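A minimal sketch of this conversion is shown below on three made-up listings. It assumes that prices are stored as strings with a leading dollar sign and, possibly, thousands separators; the activity's own solution may parse the column differently.

```python
import pandas as pd

# Made-up listings; the string format of 'price' is an assumption.
dataset = pd.DataFrame({
    'price': ['$120.00', None, '$1,250.00'],
    'review_scores_rating': [95.0, None, 88.0],
})

# Fill missing values as described in the step above.
dataset['price'] = dataset['price'].fillna('$0.0')
dataset['review_scores_rating'] = dataset['review_scores_rating'].fillna(0.0)

# Strip the currency symbol and separators, then convert to float.
dataset['dollar_price'] = (
    dataset['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)
print(dataset['dollar_price'].tolist())
```

Filling the missing prices before the conversion means that every listing ends up with a comparable numeric value.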
6. This dataset has 96 columns. When working with such a huge dataset, it makes
sense to think about what data we really need and create a subsection of our
dataset that only holds the data we need. Print all the columns that are available
and an example for that column to decide what information is suitable.
7. Trim down the number of columns our working dataset has by creating a
subsection of the columns with id, latitude (as lat), longitude (as lon),
price (in $), and review_scores_rating.
8. Print the first five rows of the trimmed-down dataset.
10. Create a new ValueLayer class that extends the geoplotlib BaseLayer class.
11. Initiate the following instance variables in the __init__ method of the
ValueLayer class: first, self.data, which holds the dataset; second,
self.display, which holds the currently selected attribute name; third,
self.painter, which holds an instance of the BatchPainter class; fourth,
self.view, which holds the BoundingBox; and lastly, self.cmap, which holds a
color map with the jet color schema, an alpha of 255, and 100 levels:
12. Implement the bbox, draw, and on_key_release methods of the
ValueLayer class. First, return the self.view variable in the bbox
method. Then, in the draw method, set the ui_manager.info text to
Use left and right to switch between the displaying of price and ratings.
Currently displaying: dollar_price or review_scores_rating,
depending on what the self.display variable holds. Next, in the
on_key_release method, check whether the left or right key has been pressed
and switch the self.display variable between dollar_price and
review_scores_rating. Lastly, return True if the left or the right key has
been pressed to trigger redrawing the dots; otherwise, return False.
13. Given the data, plot each point on the map with a color that is defined by the
currently selected attribute, either price or rating. First, in the invalidate
method, assign a new BatchPainter instance to the self.painter variable.
Second, get the max value of the dataset for the attribute named by the current
self.display variable. Third, use a log scale if dollar_price is used;
otherwise, use a lin scale. Fourth, map the value to a color using the cmap
object we defined in the __init__ method and plot each point with the given
color onto the map with a size of 5.
This is not the most efficient solution, but it will do for now.
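To see why a log scale suits prices while a linear scale suits ratings, the following sketch maps a value to one of 100 color levels by hand. It is a plain-Python stand-in written under assumptions, not geoplotlib's color map implementation; the function name to_level is hypothetical.

```python
import math

def to_level(value, max_value, scale='lin', levels=100):
    """Map a non-negative value to a color level in [0, levels - 1]."""
    if max_value <= 0:
        return 0
    if scale == 'log':
        # log1p keeps zero values (for example, filled NaNs) at level 0.
        fraction = math.log1p(value) / math.log1p(max_value)
    else:
        fraction = value / max_value
    fraction = min(max(fraction, 0.0), 1.0)
    return int(fraction * (levels - 1))

# A 100-dollar listing when the most expensive one costs 10,000 dollars.
print(to_level(100, 10_000, scale='lin'))
print(to_level(100, 10_000, scale='log'))
```

With a few very expensive listings, the linear scale pushes almost every price to the lowest levels, while the log scale spreads typical prices across the color range.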
The following is an expected output that shows a dot map with color based
on price:
Figure 7.8: New York Airbnb dot map, colored based on the price
The following is an expected output that shows a dot map with color based
on rating:
Figure 7.9: New York Airbnb dot map, colored based on the ratings
Note
The solution to this activity can be found on page 493.
As we can now see, writing custom layers for geoplotlib is a good approach for
focusing on the attributes that you are interested in.
Summary
This chapter gave us a short overview and recap of everything that was covered in
this book based on three extensive practical activities. In Chapter 1, The Importance of
Data Visualization and Data Exploration, we started with a Python library journey that
we used as a guide throughout the whole book. We first talked about the importance
of data and visualizing this data to get meaningful insights from it and gave a quick
recap of different statistical concepts.
Through several activities, we learned how to import and handle datasets with
NumPy and pandas. In Chapter 2, All You Need to Know about Plots, we discussed
various plot/chart visualizations and which visualizations are best for displaying
certain information. We mentioned the use case, design practices, and practical
examples for each plot type.
In Chapter 3, A Deep Dive into Matplotlib, we thoroughly covered Matplotlib and started
with the basic concepts. Next, we dived deeper into the numerous possibilities for
enriching visualizations with text. Emphasis was put on explaining almost all plotting
functions Matplotlib offers using practical examples. Furthermore, we talked about
different ways to create layouts. The chapter was rounded off by demonstrating how
you can visualize images and write mathematical expressions.
Visualizing geospatial data was covered in Chapter 5, Plotting Geospatial Data, using
geoplotlib. Understanding how geoplotlib is structured internally explained why we
had to work with the pyglet library when adding interactivity to our visualizations. We
worked with different datasets and built both static and interactive visualizations for
geospatial data.
In Chapter 6, Making Things Interactive with Bokeh, we focused on working with Bokeh,
which targets modern web browsers to present interactive visualizations. Starting
with simple examples, we explored the most significant advantage of Bokeh, namely,
interactive widgets.
We ended the book with this chapter, applying all the skills that we've learned
through three real-life datasets.
With the conclusion of this book, you should now have the practical knowledge and
skills to design your own data visualizations using various Python libraries such as
NumPy, pandas, Matplotlib, Seaborn, geoplotlib, and Bokeh.
Appendix
1. Import NumPy:
import numpy as np
dataset = np.genfromtxt('../../Datasets/normal_distribution.csv', \
delimiter=',')
dataset[0:2]
4. Load the dataset and calculate the mean of the third row. Access the third row
by using index 2, dataset[2]:
np.mean(dataset[2])
100.20466135250001
5. Index the last element of an ndarray in the same way a regular Python list can be
accessed. dataset[:, -1] will give us the last column of every row:
np.mean(dataset[:,-1])
100.4404927375
Chapter 1: The Importance of Data Visualization and Data Exploration | 389
6. Get the submatrix formed by the first three rows and the first three
columns by using the double-indexing mechanism of NumPy, which gives us an
interface to extract sub-selections:
"""
calculate the mean of the intersection of the first 3 rows \
and first 3 columns
"""
np.mean(dataset[0:3, 0:3])
97.87197312333333
7. Calculate the median of the last row of the dataset. Don't use the length of the
dataset as the index:
np.median(dataset[-1])
99.18748092
8. Use reverse indexing to define a range to get the last three columns using
dataset[:, -3:]:
np.median(dataset[:, -3:])
99.47332349999999
9. To calculate the median of each row, aggregate the values along axis=1:
np.median(dataset, axis=1)
10. Calculate the variance of each column by aggregating along axis=0:
np.var(dataset, axis=0)
11. Calculate the variance of the intersection of the last two rows and the first two
columns. When only looking at a very small subset of the matrix (2x2 elements),
we can apply what we learned in the statistical overview to observe that the
value is much smaller than the variance of the whole dataset:
np.var(dataset[-2:, :2])
4.674691991769191
The values of the variance might seem a little bit strange at first. You can always
go back to the Measures of Dispersion section to recap what you've learned so far.
Note
A small subset of a dataset does not display the attributes of the whole.
12. Calculate the standard deviation of the dataset. Just remember that the variance
is not the standard deviation:
np.std(dataset)
4.838197554269257
Note
To access the source code for this specific section, please refer to
https://packt.live/3hroPlv.
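The indexing and axis conventions used in this solution can be checked on a small array whose values are easy to verify by hand. The 3x3 array below is made up for illustration and is unrelated to normal_distribution.csv.

```python
import numpy as np

# Small made-up array so that the results can be verified by hand.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 10.0]])

row_medians = np.median(data, axis=1)  # axis=1 yields one value per row
col_vars = np.var(data, axis=0)        # axis=0 yields one value per column
last_cols = data[:, -2:]               # last two columns of every row
corner = data[-2:, :2]                 # last two rows, first two columns

print(row_medians)
print(col_vars)
print(corner)
# The standard deviation is the square root of the variance.
print(np.isclose(np.std(data), np.sqrt(np.var(data))))
```

A quick way to remember the axis argument is that it names the dimension that is collapsed: axis=1 collapses the columns and leaves one result per row.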
import pandas as pd
dataset = pd.read_csv('../../Datasets/forestfires.csv')
3. Print the first two rows of the dataset to get a feeling for its structure:
dataset[0:2]
4. Filter the dataset so that it only contains rows that have an area value larger
than 0. Our dataset contains several rows with an area of 0, and we only want
to look at rows with a non-zero area for now:
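The code for this step is not shown in this extract. A minimal sketch of such a filter, using a boolean mask on a few made-up rows in place of forestfires.csv, might look as follows; the name area_dataset matches the variable used in the later steps.

```python
import pandas as pd

# Made-up rows standing in for the forest fires dataset.
dataset = pd.DataFrame({'month': ['mar', 'aug', 'aug', 'sep'],
                        'temp': [8.2, 21.0, 23.4, 19.5],
                        'area': [0.0, 0.36, 12.5, 0.0]})

# Keep only the rows with a burned area larger than 0.
area_dataset = dataset[dataset['area'] > 0]
print(area_dataset)
```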
Figure 1.61: Filtered dataset with only rows that have an area larger than 0
5. Get the mean, min, max, and std of the area column and see what information
this gives you. First, let's find the mean value:
area_dataset["area"].mean()
24.600185185185182
area_dataset["area"].min()
0.09
area_dataset["area"].max()
1090.84
area_dataset["area"].std()
86.50163460412126
6. Sort the filtered dataset using the area column and print the last 20 entries using
the tail method to see how many very large values it holds:
area_dataset.sort_values(by=["area"]).tail(20)
7. Get the median of the area column and visually compare it to the mean value:
area_dataset["area"].median()
6.37
8. To compare the number of fires and the temperature across months, first get
a list of the unique values in the month column of the dataset:
months = dataset["month"].unique()
months
9. Get the number of entries for the month of March using the shape member of
our DataFrame:
dataset[dataset["month"] == "mar"].shape[0]
54
10. Now, iterate over all months, filter our dataset for rows containing the given
month, and calculate the mean temperature. Print a statement containing the
number of fires, the mean temperature, and the month:
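The loop for this step is not shown in this extract. One way it could be written is sketched below on made-up rows; the temp column name and the wording of the printed statement are assumptions.

```python
import pandas as pd

# Made-up rows standing in for the forest fires dataset.
dataset = pd.DataFrame({'month': ['mar', 'aug', 'aug', 'sep'],
                        'temp': [8.0, 21.0, 23.0, 19.5]})

summary = {}
for month in dataset['month'].unique():
    # Rows for the current month only.
    month_dataset = dataset[dataset['month'] == month]
    fires = month_dataset.shape[0]
    avg_temp = month_dataset['temp'].mean()
    summary[month] = (fires, avg_temp)
    print(f'{fires} fires in {month} with a mean temperature of {avg_temp:.1f}')
```

Collecting the results in a dictionary, as done here, is optional; it simply makes the per-month values available for later use.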
Figure 1.64: Amount of forest fires and mean temperature for each month
Note
To access the source code for this specific section, please refer to
https://packt.live/2NeLJ1H.
1. Bar charts and radar charts are great for comparing multiple variables for
multiple groups.
2. Suggested response: The bar chart is great for comparing the skill attributes of
the different employees, but it is not the best choice when it comes to getting
an overall impression of an employee, due to the fact that the skills are not
displayed directly next to one another.
The radar chart is great for this scenario because you can both compare
performance across employees and directly observe the individual performance
for each skill attribute.
3. Suggested response:
For both the bar and radar charts, adding a title and labels would help to
understand the plots better. Additionally, using different colors for the different
employees in the radar chart would help to keep the different employees apart.
1. Suggested response: If we look at Figure 2.20, we can see that the years 2000
and 2015 have the lightest colored squares overall. These are the two years that
have the lowest accident rates.
2. Suggested response: If we look at the trend for each month, that is, January,
April, July, and October for the past two decades, we can see a decreasing trend
in the number of accidents taking place in January.
The activity about road accidents gave you a simple example of how to use heatmaps
to illustrate the relationship between multiple variables. In the next section, we will
cover composition plots.
First Visualization
Suggested response:
1. The proposed visualization has multiple faults. First, a pie chart is supposed to
show part-of-a-whole relations, which is not the case for this task since we only
consider the top 30 YouTube music channels and not all channels. Second, 30
values are too many to visualize within a pie chart. Third, the labels overlap. Also,
it is difficult to quantify the slices as there is no unit of measurement specified.
Second Visualization
Suggested response:
1. This is also an example of using the wrong chart type. A line chart was used
to compare different categories that do not have any temporal relation.
Furthermore, informative guides such as legends and labels are missing.
2. The following diagram shows how the data should have been represented using
a comparative bar chart:
Chapter 2: All You Need to Know about Plots | 401
Figure 2.49: Comparative bar chart displaying casino data for 2 days
Since we were asked to visualize the median, the interquartile ranges, and the
underlying density of populations from different income groups, violin plots are
the best choice, as they visualize both summary statistics and a kernel density
estimate. A density plot shows only the density, whereas box plots illustrate only
summary statistics.
# Import statements
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
4. Use Matplotlib to create a line chart that visualizes the closing prices for the
past 5 years (whole data sequence) for all five companies. Add labels, titles,
and a legend to make the visualization self-explanatory. Use the plt.grid()
function to add a grid to your plot:
# Create figure
plt.figure(figsize=(16, 8), dpi=300)
# Plot data
plt.plot('date', 'close', data=google, label='Google')
plt.plot('date', 'close', data=facebook, label='Facebook')
plt.plot('date', 'close', data=apple, label='Apple')
plt.plot('date', 'close', data=amazon, label='Amazon')
plt.plot('date', 'close', data=microsoft, label='Microsoft')
Chapter 3: A Deep Dive into Matplotlib | 403
From the preceding diagram, we can see that the stock prices of Google and
Amazon are high compared to Facebook, Microsoft, and Apple.
Note
To access the source code for this specific section, please refer to
https://packt.live/2Y35oHT.
# Import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
movie_scores = pd.read_csv('../../Datasets/movie_scores.csv')
4. Use Matplotlib to create a visually appealing bar plot comparing the two scores
for all five movies. Use the movie titles as labels for the x-axis. Use percentages
at intervals of 20 for the y-axis, and minor ticks at intervals of 5. Add a legend
and a suitable title to the plot:
# Create figure
plt.figure(figsize=(10, 5), dpi=300)
# Create bar plot
pos = np.arange(len(movie_scores['MovieTitle']))
width = 0.3
plt.bar(pos - width / 2, movie_scores['Tomatometer'], \
width, label='Tomatometer')
plt.bar(pos + width / 2, movie_scores['AudienceScore'], \
width, label='Audience Score')
# Specify ticks
plt.xticks(pos, rotation=10)
plt.yticks(np.arange(0, 101, 20))
# Get current Axes for setting tick labels and horizontal grid
ax = plt.gca()
# Set tick labels
ax.set_xticklabels(movie_scores['MovieTitle'])
ax.set_yticklabels(['0%', '20%', '40%', '60%', '80%', '100%'])
# Add minor ticks for y-axis in the interval of 5
ax.set_yticks(np.arange(0, 100, 5), minor=True)
# Add major horizontal grid with solid lines
ax.yaxis.grid(which='major')
# Add minor horizontal grid with dashed lines
ax.yaxis.grid(which='minor', linestyle='--')
# Add title
plt.title('Movie comparison')
# Add legend
plt.legend()
# Show plot
plt.show()
In the preceding output, we can see that the audience liked the movie "The
Hobbit: An Unexpected Journey" when compared to other movies that were
rated high by Tomatometer.
Note
To access the source code for this specific section, please refer to
https://packt.live/30NVXhs.
2. Import the necessary modules and enable plotting within the Jupyter notebook:
# Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Note that we have imported the Seaborn library to load the built-in dataset that
the library provides.
# Load dataset
bills = sns.load_dataset('tips')
4. Use the given dataset and create a matrix where the elements contain the sum
of the total bills for each day and are split by smokers/non-smokers:
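# Setup assumed from a step omitted in this listing: day labels, their
# positions, the smoker categories, and the bills split by day
days = ['Thur', 'Fri', 'Sat', 'Sun']
days_range = np.arange(len(days))
smoker = ['Yes', 'No']
bills_by_days = [bills[bills['day'] == day] for day in days]
bills_by_days_smoker = \
    [[bills_by_days[day][bills_by_days[day]['smoker'] == s] \
      for s in smoker] for day in days_range]
total_by_days_smoker = \
    [[bills_by_days_smoker[day][s]['total_bill'].sum() \
      for s in range(len(smoker))] for day in days_range]
totals = np.asarray(total_by_days_smoker)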
Here, the asarray() function converts the nested list into a NumPy array.
5. Create a stacked bar plot, stacking the summed total bills separated by smoker
and non-smoker for each day. Add a legend, labels, and a title:
# Create figure
plt.figure(figsize=(10, 5), dpi=300)
# Create stacked bar plot
plt.bar(days_range, totals[:, 0], label='Smoker')
plt.bar(days_range, totals[:, 1], bottom=totals[:, 0], \
label='Non-smoker')
# Add legend
plt.legend()
# Add labels and title
plt.xticks(days_range)
ax = plt.gca()
ax.set_xticklabels(days)
ax.yaxis.grid()
plt.ylabel('Daily total sales in $')
plt.title('Restaurant performance')
# Show plot
plt.show()
Figure 3.52: Stacked bar chart showing restaurant performance on different days
In the preceding output, we can see that the highest sales were made on
Saturday by both smokers and non-smokers.
Note
To access the source code for this specific section, please refer to
https://packt.live/3ea2IxY.
Activity 3.04: Comparing Smartphone Sales Units Using a Stacked Area Chart
Solution:
# Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
sales = pd.read_csv('../../Datasets/smartphone_sales.csv')
4. Create a visually appealing stacked area chart. Add a legend, labels, and a title:
# Create figure
plt.figure(figsize=(10, 6), dpi=300)
# Create stacked area chart
# One label per series passed to stackplot, in the same order
labels = ['Apple', 'Samsung', 'Huawei', 'Xiaomi', 'OPPO']
plt.stackplot('Quarter', 'Apple', 'Samsung', 'Huawei', \
'Xiaomi', 'OPPO', data=sales, labels=labels)
# Add legend
plt.legend()
# Add labels and title
plt.xlabel('Quarters')
plt.ylabel('Sales units in thousands')
plt.title('Smartphone sales units')
# Show plot
plt.show()
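A stacked area chart draws each series on top of the running total of the series below it. The following sketch uses NumPy only, with made-up sales figures for three hypothetical vendors, to show the boundaries that such a chart fills between. It illustrates the idea and is not taken from Matplotlib's implementation:

```python
import numpy as np

# Illustrative unit sales for three hypothetical vendors over four quarters
series = np.array([
    [10, 12, 14, 16],   # vendor A
    [5, 6, 7, 8],       # vendor B
    [1, 2, 3, 4],       # vendor C
])

# Upper boundary of each layer: running total down the vendor axis
upper = np.cumsum(series, axis=0)
# Lower boundary of each layer: the upper boundary of the layer beneath it
lower = np.vstack([np.zeros(series.shape[1], dtype=series.dtype), upper[:-1]])

for name, low, high in zip('ABC', lower, upper):
    print(name, low, high)
```

The top boundary of the last layer equals the total across all vendors, which is why a stacked area chart shows both the individual shares and the overall trend.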
Figure 3.53: Stacked area chart comparing sales units of different smartphone
manufacturers
Note
To access the source code for this specific section, please refer to
https://packt.live/2CckMJC.
Let's visualize the IQ of different groups using a histogram and a box plot:
1. Import the necessary modules and enable plotting within a Jupyter notebook:
# Import statements
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# IQ samples
iq_scores = [126, 89, 90, 101, 102, 74, 93, 101, 66, \
120, 108, 97, 98, 105, 119, 92, 113, 81, \
104, 108, 83, 102, 105, 111, 102, 107, 103, \
89, 89, 110, 71, 110, 120, 85, 111, 83, \
122, 120, 102, 84, 118, 100, 100, 114, 81, \
109, 69, 97, 95, 106, 116, 109, 114, 98, \
90, 92, 98, 91, 81, 85, 86, 102, 93, \
112, 76, 89, 110, 75, 100, 90, 96, 94, \
107, 108, 95, 96, 96, 114, 93, 95, 117, \
141, 115, 95, 86, 100, 121, 103, 66, 99, \
96, 111, 110, 105, 110, 91, 112, 102, 112, 75]
3. Plot a histogram with 10 bins for the given IQ scores. IQ scores are normally
distributed with a mean of 100 and a standard deviation of 15. Visualize the
mean as a vertical solid red line, and the standard deviation using dashed
vertical lines. Add labels and a title:
# Create figure
plt.figure(figsize=(6, 4), dpi=150)
# Create histogram
plt.hist(iq_scores, bins=10)
plt.axvline(x=100, color='r')
plt.axvline(x=115, color='r', linestyle= '--')
plt.axvline(x=85, color='r', linestyle= '--')
# Add labels and title
plt.xlabel('IQ score')
plt.ylabel('Frequency')
plt.title('IQ scores for a test group of a hundred adults')
# Show plot
plt.show()
4. Create a box plot to visualize the IQ scores. Add labels and a title:
# Create figure
plt.figure(figsize=(6, 4), dpi=150)
# Create box plot
plt.boxplot(iq_scores)
# Add labels and title
ax = plt.gca()
ax.set_xticklabels(['Test group'])
plt.ylabel('IQ score')
plt.title('IQ scores for a test group of a hundred adults')
# Show plot
plt.show()
5. The following are IQ scores for different test groups that we can use as data:
group_a = [118, 103, 125, 107, 111, 96, 104, 97, 96, \
114, 96, 75, 114, 107, 87, 117, 117, 114, \
117, 112, 107, 133, 94, 91, 118, 110, 117, \
86, 143, 83, 106, 86, 98, 126, 109, 91, \
112, 120, 108, 111, 107, 98, 89, 113, 117, \
81, 113, 112, 84, 115, 96, 93, 128, 115, \
138, 121, 87, 112, 110, 79, 100, 84, 115, \
93, 108, 130, 107, 106, 106, 101, 117, 93, \
94, 103, 112, 98, 103, 70, 139, 94, 110, \
105, 122, 94, 94, 105, 129, 110, 112, 97, \
109, 121, 106, 118, 131, 88, 122, 125, 93, 78]
group_b = [126, 89, 90, 101, 102, 74, 93, 101, 66, 120, \
108, 97, 98, 105, 119, 92, 113, 81, 104, 108, \
83, 102, 105, 111, 102, 107, 103, 89, 89, 110, \
71, 110, 120, 85, 111, 83, 122, 120, 102, 84, \
118, 100, 100, 114, 81, 109, 69, 97, 95, 106, \
116, 109, 114, 98, 90, 92, 98, 91, 81, 85, \
86, 102, 93, 112, 76, 89, 110, 75, 100, 90, \
96, 94, 107, 108, 95, 96, 96, 114, 93, 95, \
117, 141, 115, 95, 86, 100, 121, 103, 66, 99, \
96, 111, 110, 105, 110, 91, 112, 102, 112, 75]
group_c = [108, 89, 114, 116, 126, 104, 113, 96, 69, 121, \
109, 102, 107, 122, 104, 107, 108, 137, 107, 116, \
98, 132, 108, 114, 82, 93, 89, 90, 86, 91, \
99, 98, 83, 93, 114, 96, 95, 113, 103, 81, \
107, 85, 116, 85, 107, 125, 126, 123, 122, 124, \
115, 114, 93, 93, 114, 107, 107, 84, 131, 91, \
108, 127, 112, 106, 115, 82, 90, 117, 108, 115, \
113, 108, 104, 103, 90, 110, 114, 92, 101, 72, \
109, 94, 122, 90, 102, 86, 119, 103, 110, 96, \
90, 110, 96, 69, 85, 102, 69, 96, 101, 90]
group_d = [93, 99, 91, 110, 80, 113, 111, 115, 98, 74, \
96, 80, 83, 102, 60, 91, 82, 90, 97, 101, \
89, 89, 117, 91, 104, 104, 102, 128, 106, 111, \
79, 92, 97, 101, 106, 110, 93, 93, 106, 108, \
85, 83, 108, 94, 79, 87, 113, 112, 111, 111, \
79, 116, 104, 84, 116, 111, 103, 103, 112, 68, \
54, 80, 86, 119, 81, 84, 91, 96, 116, 125, \
99, 58, 102, 77, 98, 100, 90, 106, 109, 114, \
102, 102, 112, 103, 98, 96, 85, 97, 110, 131, \
92, 79, 115, 122, 95, 105, 74, 85, 85, 95]
6. Create a box plot for each of the IQ scores of different test groups. Add labels
and a title:
# Create figure
plt.figure(figsize=(6, 4), dpi=150)
# Create box plots
plt.boxplot([group_a, group_b, group_c, group_d])
# Add labels and title
ax = plt.gca()
ax.set_xticklabels(['Group A', 'Group B', 'Group C', 'Group D'])
plt.ylabel('IQ score')
plt.title('IQ scores for different test groups')
# Show plot
plt.show()
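A box plot summarizes each sample by its quartiles. As a sketch of what is being drawn, the following computes these values with NumPy for the first nine scores of iq_scores, assuming Matplotlib's default whisker length of 1.5 times the interquartile range:

```python
import numpy as np

# First nine values of iq_scores from the activity
sample = np.array([126, 89, 90, 101, 102, 74, 93, 101, 66])

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1
# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, median, q3)
print(sorted(outliers.tolist()))
```

The box spans q1 to q3, the line inside it marks the median, and the whiskers extend to the most extreme values that still lie within the fences.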
Note
To access the source code for this specific section, please refer to
https://packt.live/3e80Wx4.
1. Import the necessary modules and enable plotting within a Jupyter notebook:
# Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load dataset
data = pd.read_csv('../../Datasets/anage_data.csv')
3. Filter the data so that you end up with samples containing a body mass and a
maximum longevity as the given dataset is not complete. Select all the samples
of the aves class and with a body mass smaller than 20,000:
# Preprocessing
longevity = 'Maximum longevity (yrs)'
mass = 'Body mass (g)'
data = data[np.isfinite(data[longevity]) \
& np.isfinite(data[mass])]
# Select the Aves class with a body mass below 20,000 g
aves = data[data['Class'] == 'Aves']
aves = aves[aves[mass] < 20000]
4. Create a Figure with a constrained layout. Create a gridspec of size 4x4. Create a
scatter plot of size 3x3 and marginal histograms of size 1x3 and 3x1. Add labels
and a Figure title:
# Create figure
fig = plt.figure(figsize=(8, 8), dpi=150, \
constrained_layout=True)
# Create gridspec
gs = fig.add_gridspec(4, 4)
# Specify subplots
histx_ax = fig.add_subplot(gs[0, :-1])
histy_ax = fig.add_subplot(gs[1:, -1])
scatter_ax = fig.add_subplot(gs[1:, :-1])
# Create plots
scatter_ax.scatter(aves[mass], aves[longevity])
histx_ax.hist(aves[mass], bins=20, density=True)
histx_ax.set_xticks([])
histy_ax.hist(aves[longevity], bins=20, density=True, \
orientation='horizontal')
histy_ax.set_yticks([])
# Add labels and title
plt.xlabel('Body mass in grams')
plt.ylabel('Maximum longevity in years')
fig.suptitle('Scatter plot with marginal histograms')
# Show plot
plt.show()
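The preprocessing in step 3 applies several conditions to the same DataFrame. A minimal sketch of combining them in a single boolean mask is shown here, assuming a small hypothetical DataFrame (toy) with the column names used in the activity; notna() is used in place of np.isfinite() for brevity:

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the columns used in the activity
toy = pd.DataFrame({
    'Class': ['Aves', 'Aves', 'Mammalia', 'Aves'],
    'Body mass (g)': [150.0, 35000.0, 500.0, np.nan],
    'Maximum longevity (yrs)': [12.0, 40.0, 5.0, 20.0],
})

mass = 'Body mass (g)'
longevity = 'Maximum longevity (yrs)'

# Keep rows that have both values, belong to Aves, and weigh under 20,000 g
mask = (toy[mass].notna() & toy[longevity].notna()
        & (toy['Class'] == 'Aves') & (toy[mass] < 20000))
toy_aves = toy[mask]
print(toy_aves)
```

Because every condition is evaluated against the same DataFrame and joined with &, no intermediate result can be overwritten by accident.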
Note
To access the source code for this specific section, please refer to
https://packt.live/2Ccl03q.
1. Import the necessary modules and enable plotting within a Jupyter notebook:
# Import statements
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
# Load images
img_filenames = sorted(os.listdir('../../Datasets/images'))
imgs = [mpimg.imread(os.path.join('../../Datasets/images', \
img_filename)) \
for img_filename in img_filenames]
3. Visualize the images in a 2x2 grid. Remove the axes and give each image a label:
# Create subplot
fig, axes = plt.subplots(2, 2, figsize=(6, 6), dpi=150)
axes = axes.ravel()
# Specify labels
labels = ['coast', 'beach', 'building', 'city at night']
# Plot images
for i in range(len(imgs)):
    axes[i].imshow(imgs[i])
    axes[i].set_xticks([])
    axes[i].set_yticks([])
    axes[i].set_xlabel(labels[i])
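The same grid can be filled by iterating over the flattened Axes together with the images and labels. The following is a sketch under stated assumptions: random arrays (toy_imgs) stand in for the loaded images, and the Agg backend is selected only so that the example runs without a display; in a Jupyter notebook that line can be dropped:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Random arrays stand in for the four loaded images
rng = np.random.default_rng(0)
toy_imgs = [rng.random((8, 8, 3)) for _ in range(4)]
toy_labels = ['coast', 'beach', 'building', 'city at night']

fig, axes = plt.subplots(2, 2, figsize=(6, 6))
# ravel() flattens the 2x2 array of Axes so it can be zipped with the images
for ax, img, label in zip(axes.ravel(), toy_imgs, toy_labels):
    ax.imshow(img)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(label)
```

Using zip() avoids index bookkeeping and stops at the shortest sequence, so a missing image or label does not raise an IndexError.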
Note
To access the source code for this specific section, please refer to
https://packt.live/3hxvFWv.
Find the patterns in the flight passengers' data with the help of a heatmap:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
data = pd.read_csv("../../Datasets/flight_details.csv")
4. Now, we can use the pivot() function to transform the data into a format that
is suitable for heatmaps:
5. Use the heatmap() function of the Seaborn library to visualize this data. Within
this function, we pass parameters such as DataFrame and colormap. Since
we got data from the preceding code, we will pass it as a DataFrame in the
heatmap() function. Also, we will create our own colormap and pass it as a
second parameter to this function:
plt.figure(dpi=200)
# you can use any sequential color palette
sns.heatmap(data, cmap=sns.cubehelix_palette(rot=-.3, \
as_cmap=True))
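The pivot() call mentioned in step 4 reshapes long-format records into the matrix that a heatmap expects. A minimal sketch follows, assuming hypothetical column names (Years, Months, Passengers) and made-up values; the actual columns of flight_details.csv may differ:

```python
import pandas as pd

# Hypothetical long-format records: one row per (year, month) pair
long_df = pd.DataFrame({
    'Years': [2001, 2001, 2002, 2002],
    'Months': ['January', 'February', 'January', 'February'],
    'Passengers': [112, 118, 115, 126],
})

# One row per month, one column per year, passenger counts as cell values
wide_df = long_df.pivot(index='Months', columns='Years', values='Passengers')
print(wide_df)
```

Each cell of wide_df then corresponds to one colored tile when the DataFrame is passed to a heatmap function.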
Chapter 4: Simplifying Visualizations Using Seaborn | 421
Note
To access the source code for this specific section, please refer to
https://packt.live/2UOMTov.
Compare the scores of five different movies using a bar plot provided by the
Seaborn library:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.read_csv("../../Datasets/movie_scores.csv", \
index_col=0)
4. Construct a DataFrame from this given data. This can be done with the help of
the pd.DataFrame() function provided by pandas. The following code gives
us a better idea of this:
movie_scores = \
pd.DataFrame({"Movie Title": list(mydata["MovieTitle"]) * 2, \
"Score": list(mydata["AudienceScore"]) \
+ list(mydata["Tomatometer"]), \
"Type": ["Audience Score"] \
* len(mydata["AudienceScore"]) + ["Tomatometer"] \
* len(mydata["Tomatometer"])})
sns.set()
plt.figure(figsize=(10, 5), dpi=300)
We compared the ratings of Audience Score and Tomatometer for five different
movies and concluded that the ratings matched for the movie The Martian.
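The DataFrame in step 4 is built by concatenating lists by hand. As an alternative sketch, pandas' melt() performs the same wide-to-long reshaping. The DataFrame below (toy_scores) is hypothetical and only reuses the column names from the solution; note that melt() uses the original column names as Type values, so they read AudienceScore rather than Audience Score:

```python
import pandas as pd

# Hypothetical scores; the column names follow those used in the solution
toy_scores = pd.DataFrame({
    'MovieTitle': ['Movie A', 'Movie B'],
    'AudienceScore': [80, 65],
    'Tomatometer': [90, 60],
})

# melt() stacks the two score columns into one Score column with a Type key
toy_long = toy_scores.melt(id_vars='MovieTitle',
                           value_vars=['AudienceScore', 'Tomatometer'],
                           var_name='Type', value_name='Score')
print(toy_long)
```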
Note
To access the source code for this specific section, please refer to
https://packt.live/2B7ohQZ.
Compare IQ scores among different test groups using the Seaborn library:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
3. Use the read_csv() function of pandas to read the data located in the
Datasets folder:
mydata = pd.read_csv("../../Datasets/iq_scores.csv")
4. Access the data of each test group in the column. Convert this into a list using
the tolist() method. Once the data of each test group has been converted
into a list, assign this list to the variables of each respective test group:
group_a = mydata[mydata.columns[0]].tolist()
group_b = mydata[mydata.columns[1]].tolist()
group_c = mydata[mydata.columns[2]].tolist()
group_d = mydata[mydata.columns[3]].tolist()
5. Print the variables of each group to check whether the data inside it has been
converted into a list. This can be done with the help of the print() function:
print(group_a)
print(group_b)
print(group_c)
print(group_d)
6. Once we get the data for each test group, we need to construct a DataFrame
from this given data. This can be done with the help of the pd.DataFrame()
function that's provided by pandas:
7. Now that we have the DataFrame, create a violin plot using the
violinplot() function provided by Seaborn. Pass the column names for both
axes along with the DataFrame: Groups for the x-axis and IQ score for the
y-axis. Here, data is the DataFrame that we obtained from the previous step:
plt.figure(dpi=150)
# Set style
sns.set_style('whitegrid')
# Create violin plot
sns.violinplot(x='Groups', y='IQ score', data=data)
# Despine
sns.despine(left=True, right=True, top=True)
# Add title
plt.title('IQ scores for different test groups')
# Show plot
plt.show()
The despine() function helps to remove the top and right spines from the
plot. Here, we have also removed the left spine. Using the title() function, we
have set the title for our plot. The show() function helps to visualize the plot.
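Step 6 describes building a DataFrame from the group lists. A minimal sketch of one way to do this is shown here, using two short illustrative lists (toy_a, toy_b) instead of the full groups; the column names Groups and IQ score are taken from the violinplot() call in step 7:

```python
import pandas as pd

# Short illustrative lists; in the activity these come from iq_scores.csv
toy_a = [118, 103, 125]
toy_b = [126, 89, 90]

# One row per score, with a label column naming the group it belongs to
toy_data = pd.DataFrame({
    'Groups': ['Group A'] * len(toy_a) + ['Group B'] * len(toy_b),
    'IQ score': toy_a + toy_b,
})
print(toy_data.groupby('Groups')['IQ score'].count())
```

This long format, with one categorical column and one value column, is what allows a single plotting call to draw one violin per group.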
Note
To access the source code for this specific section, please refer to
https://packt.live/30OU8ka.
Visualize the total number of subscribers and the total number of views for the top
30 YouTube channels by using the FacetGrid() function that's provided by the
Seaborn library:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
3. Use the read_csv() function of pandas to read the data located in the
Datasets folder:
mydata = pd.read_csv("../../Datasets/YouTube.csv")
4. Access the channel, subscriber, and view data in the respective columns.
Convert each column into a list by using the tolist() method and assign
the resulting lists to the channels, subs, and views variables:
channels = mydata[mydata.columns[0]].tolist()
subs = mydata[mydata.columns[1]].tolist()
views = mydata[mydata.columns[2]].tolist()
5. Print each variable to check whether its data has been converted into a
list. This can be done with the help of the print() function:
print(channels)
print(subs)
print(views)
6. Once we get the data for channels, subs, and views, we need to construct
a DataFrame from the given data. This can be done with the help of the
pd.DataFrame() function that's provided by pandas:
data = pd.DataFrame({'YouTube Channels': channels + channels, \
'Subscribers in millions': subs + views, \
'Type': ['Subscribers'] * len(subs) \
+ ['Views'] * len(views)})
sns.set()
g = sns.FacetGrid(data, col='Type', hue='Type', \
sharex=False, height=8)
g.map(sns.barplot, 'Subscribers in millions', 'YouTube Channels')
plt.show()
We can conclude that the YouTube channel T-Series has both the highest
number of subscribers and views in the music category.
Note
To access the source code for this specific section, please refer to
https://packt.live/3d9qLLU.
Visualize the linear relationship between maximum longevity and body mass
in a regression plot by using the regplot() function that's provided by the
Seaborn library:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
3. Use the read_csv() function of pandas to read the data located in the
Datasets folder:
mydata = pd.read_csv("../../Datasets/anage_data.csv")
4. Filter the data so that you end up with samples containing a body mass and
maximum longevity. Only consider samples for the Mammalia class and a body
mass of less than 200,000. This preprocessing can be seen in the following code:
5. Once the preprocessing is done, plot the data using the regplot() function
that's provided by the Seaborn library. There are three parameters inside the
regplot() function that have to be specified. The first two parameters are
mass and longevity, wherein the body mass data will be shown on the
x-axis, and the maximum longevity data will be shown on the y-axis. For the third
parameter, provide the DataFrame obtained from the previous step:
# Create figure
sns.set()
plt.figure(figsize=(10, 6), dpi=300)
# Create a scatter plot
We can conclude that there is a linear relationship between body mass and
maximum longevity for the Mammalia class.
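To make the notion of a linear relationship concrete, the following sketch fits an ordinary least-squares line with NumPy on synthetic data. The numbers are invented and perfectly linear by construction, and the example does not reproduce Seaborn's plotting or its confidence band:

```python
import numpy as np

# Synthetic, perfectly linear data: longevity = 0.002 * mass + 5
toy_mass = np.array([1000.0, 5000.0, 10000.0, 50000.0])
toy_longevity = 0.002 * toy_mass + 5.0

# A degree-1 polynomial fit returns the slope and intercept of the line
slope, intercept = np.polyfit(toy_mass, toy_longevity, deg=1)
print(round(slope, 4), round(intercept, 2))
```

With real measurements the points scatter around the fitted line, and the spread of that scatter indicates how well a straight line describes the data.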
Note
To access the source code for this specific section, please refer to
https://packt.live/2UNM5Ax.
Activity 4.06: Visualizing the Impact of Education on Annual Salary and Weekly
Working Hours
Solution:
You're asked to determine whether education has an influence on annual salary and
weekly working hours. You ask 500 people in the state of New York about their age,
annual salary, weekly working hours, and their education. You first want to know the
percentage for each education type, so you use a tree map. Two violin plots
will be used to visualize the annual salary and weekly working hours. Compare in
each case to what extent education has an impact.
All visualizations in this activity should also be suitable for colorblind
people, which is good practice in general:
2. Import the necessary modules and enable plotting within a Jupyter notebook:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
sns.set()
4. Use a tree map to visualize the percentages for each education type:
# Create figure
plt.figure(figsize=(9, 6), dpi=200)
squarify.plot(percentages, label=labels, \
color=sns.color_palette('colorblind', \
len(degrees)))
plt.axis('off')
# Add title
plt.title('Degrees')
# Show plot
plt.show()
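The tree map in step 4 relies on a list of percentages and matching labels. One plausible way to derive such values with pandas is sketched here; the survey answers and all names prefixed with toy_ are hypothetical, and the real dataset's education labels may differ:

```python
import pandas as pd

# Hypothetical survey answers; the real dataset's labels may differ
toy_survey = pd.DataFrame({
    'Education': ['Bachelor', 'Master', 'Bachelor', 'PhD', 'Bachelor'],
})

# Share of each education type, in percent
shares = toy_survey['Education'].value_counts(normalize=True) * 100
toy_percentages = shares.tolist()
# Label each tile with the category name and its rounded share
toy_labels = ['{}\n({:.0f}%)'.format(name, pct)
              for name, pct in shares.items()]
print(toy_labels)
```

Keeping the percentages and labels in the same order matters, because the tree map pairs them by position.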
5. Create a subplot with two rows to visualize two violin plots for the annual salary
and weekly working hours, respectively. Compare in each case to what extent
education has an impact. To exclude pensioners, only consider people younger
than 65. Use a colormap that is suitable for colorblind people. subplots() can
be used in combination with Seaborn's plot, by simply passing the ax argument
with the respective axes:
ordered_degrees = sorted(list(degrees))
ordered_degrees = [ordered_degrees[4], ordered_degrees[3], \
ordered_degrees[1], ordered_degrees[0], \
ordered_degrees[2]]
data = data.loc[data['Age'] < 65]
# Set color palette to colorblind
sns.set_palette('colorblind')
# Create subplot with two rows
fig, ax = plt.subplots(2, 1, dpi=200, figsize=(8, 8))
sns.violinplot(x='Education', y='Annual Salary', data=data, \
               cut=0, order=ordered_degrees, ax=ax[0])
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=10)
sns.violinplot(x='Education', y='Weekly hours', data=data, \
               cut=0, order=ordered_degrees, ax=ax[1])
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=10)
plt.tight_layout()
# Add title
fig.suptitle('Impact of Education on Annual Salary and '\
'Weekly Working Hours')
# Show figure
plt.show()
Figure 4.58: Violin plots showing the impact of education on annual salary
and weekly working hours
The preceding output helps us to analyze the impact of education on annual salary
and weekly working hours.
Note
To access the source code for this specific section, please refer to
https://packt.live/2AIDJ66.
Let's plot the geospatial data on a map and find the densely populated areas of cities
in Europe that have a population of more than 100,000:
import numpy as np
import pandas as pd
import geoplotlib
Note
If we import our dataset without defining the dtype attribute of the
Region column as a String type, we will get a warning telling us
that it has a mixed datatype. We can get rid of this warning by explicitly
defining the type of the values in this column, which we can do by using the
dtype parameter.
3. Check the dtype attribute of each column using the dtypes attribute of
a DataFrame:
Note
Here, we can see the datatypes of each column. Since the String type is
not a primitive datatype, it's displayed as an object.
4. Use the head() method of a pandas DataFrame to display the first five entries:
5. Map the Latitude and Longitude columns into the lat and lon columns
by using simple code:
Most datasets won't be in the format that you desire. Some of them might have
their Latitude and Longitude values hidden in a different column. This is
where the data wrangling skills of Chapter 1, The Importance of Data Visualization
and Data Exploration, are required.
6. Our dataset is now ready for the first plotting. Use a DotDensityLayer to see
all of our data points:
7. Before we start breaking down our data to get a better and more workable
dataset, we want to understand the outlines of all of our data. Display the
number of countries and the number of cities that our dataset holds:
234 Countries
3173958 Cities
8. Use the size() method, which returns a Series object, to see each grouped
element on its own:
9. Display the average number of cities per country using the agg method
of pandas:
13563.923076923076
Reduce the amount of data we are working with by removing all the cities that
don't have a population value, meaning a population of 0, in this case:
Note
Breaking down and filtering your data is one of the most important aspects
of getting good insights. Cluttered visualizations can hide information.
Chapter 5: Plotting Geospatial Data | 441
10. Display the first five items of the new dataset to get a basic indication of what
the values in the Population column will look like:
11. Now, take a look at our reduced dataset with the help of a dot density plot:
"""
showing all cities with a defined population \
with a dot density plot
"""
geoplotlib.dot(dataset_with_pop)
geoplotlib.show()
On the new dot plot, we can already see some improvements in terms of clarity.
However, we still have too many dots on our map. Given the activity definition,
we can filter our dataset further by only looking at cities with a population of
more than 100k.
12. Filter the dataset to contain only cities with a population of more than 100k:
13. In addition to just plotting our 100k dataset, fix our viewport to a specific
bounding box. Since our data is spread across the world, use the built-in WORLD
constant of the BoundingBox class:
"""
displaying all cities >= 100k population with a fixed bounding box
(WORLD) in a dot density plot
"""
from geoplotlib.utils import BoundingBox
geoplotlib.dot(dataset_100k)
geoplotlib.set_bbox(BoundingBox.WORLD)
geoplotlib.show()
Figure 5.37: Dot density visualization of cities with a population of 100,000 or more
14. Compare the output with the previous plots; it gives us a better view of where
the highest number of cities with a population of more than 100,000 is. Find the
areas of these cities that are the most densely packed using a Voronoi plot:
The resulting visualization is exactly what we were searching for. On the Voronoi
plot, we can see clear tendencies. Germany, Great Britain, Nigeria, India, Japan,
Java, the East Coast of the USA, and Brazil stick out. We can now filter our
data and only look at those countries to find the ones that are best suited to
this scenario.
Note
You can also create a custom colormap gradient with the ColorMap class.
15. Filter the dataset to only countries in Europe, such as Germany and Great Britain.
Use the or operator when adding a filter to our data. This will allow us to filter
for Germany and Great Britain at the same time:
16. Use Delaunay triangulation to find the areas that have the most densely
packed cities:
"""
using Delaunay triangulation to find the most densely populated area
"""
geoplotlib.delaunay(dataset_europe, cmap='hot_r')
geoplotlib.show()
By using a hot_r color map, we can quickly get a good visual representation
and make the areas of interest pop out. Here, the areas around Cologne,
Birmingham, and Manchester really stick out:
Figure 5.39: A Delaunay triangle visualization of cities in Germany and Great Britain
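The data preparation steps of this activity (mapping the coordinate columns, counting cities per country, and filtering by population and country) can be sketched with pandas alone. The DataFrame below (toy_cities) is hypothetical and uses made-up values with the column names mentioned in the activity:

```python
import pandas as pd

# Hypothetical rows using the column names mentioned in the activity
toy_cities = pd.DataFrame({
    'Country': ['de', 'de', 'gb', 'fr'],
    'City': ['cologne', 'smalltown', 'birmingham', 'paris'],
    'Population': [1000000.0, 0.0, 1100000.0, 2100000.0],
    'Latitude': [50.94, 51.00, 52.48, 48.86],
    'Longitude': [6.96, 7.00, -1.90, 2.35],
})

# Copy the coordinates into the lat and lon columns described in step 5
toy_cities['lat'] = toy_cities['Latitude']
toy_cities['lon'] = toy_cities['Longitude']

# Number of cities per country, returned as a Series
cities_per_country = toy_cities.groupby('Country').size()

# Keep cities above 100k, then only Germany or Great Britain
toy_100k = toy_cities[toy_cities['Population'] > 100000]
toy_europe = toy_100k[(toy_100k['Country'] == 'de')
                      | (toy_100k['Country'] == 'gb')]
print(cities_per_country.mean())
print(toy_europe['City'].tolist())
```

Each comparison must be wrapped in parentheses before combining it with |, because the operator binds more tightly than ==.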
Note
To access the source code for this specific section, please refer to
https://packt.live/3hBl0dE.
This section does not currently have an online interactive example, and will
need to be run locally.
Activity 5.02: Visualizing City Density by the First Letter Using an Interactive
Custom Layer
Solution:
dataset = pd.read_csv('../../Datasets/world_cities_pop.csv', \
                      dtype={'Region': str})
Note
If we import our dataset without defining the dtype parameter of the
Region column as a String type, we will get a warning telling us
that it has a mixed datatype. We can get rid of this warning by explicitly
defining the type of the values in this column, which we can do by using the
dtype parameter.
3. Check the data type of each column using the dtypes attribute of
a DataFrame:
5. Focus your attention on European countries and their cities. A list of all
European countries is as follows:
6. Given this list, we want to use filtering to get a dataset that only contains
European cities. The filtering works exactly as we learned in Chapter 1, The
Importance of Data Visualization and Data Exploration. Use the
europe_country_codes list to filter down our dataset by using the isin()
method as a condition for our DataFrame:
7. Print both the length of our whole dataset and the filtered down dataset:
We want to take a quick look at the cities starting with Z in the dataset using
a DotDensity plot and also get some information about the cities using the
previously seen f_tooltip argument. To use the f_tooltip argument, we
need to wrap our dataset in DataAccessObject.
9. Create a new DataAccessObject from our dataset of cities starting with Z,
visualize it with a dot plot, and use a tooltip that outputs the Country and
City name separated by a - (for example, Ch - Zürich):
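from geoplotlib.utils import DataAccessObject
geoplotlib_data = DataAccessObject(cities_starting_z)
# title() capitalizes both the country code and the city name
geoplotlib.dot(geoplotlib_data, \
               f_tooltip=lambda d: '{} - {}'\
               .format(d['Country'], d['City']).title())
geoplotlib.show()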
10. As a second step, we want to use a voronoi plot to display the density of cities
starting with the letter Z. Create a new voronoi plot using a color map of
Reds_r, max area of 1e5, and an alpha value of 50 so that we can still see the
mapping peeking through:
"""
displaying the density of cities starting with \
z using a voronoi plot
"""
geoplotlib.voronoi(cities_starting_z, cmap='Reds_r', \
max_area=1e5, alpha=50)
geoplotlib.show()
Figure 5.43: A Voronoi plot showing the density of cities starting with Z in Europe
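The two filtering steps used so far, restricting the data to European countries with isin() and then to cities starting with a given letter, can be sketched as follows. The DataFrame (toy_world) and the shortened code list (toy_europe_codes) are hypothetical, and str.startswith() is one possible way to select by first letter rather than the confirmed approach of the solution:

```python
import pandas as pd

# Hypothetical rows with lowercase city names and country codes
toy_world = pd.DataFrame({
    'Country': ['ch', 'us', 'de', 'ch'],
    'City': ['zürich', 'zion', 'zwickau', 'bern'],
})
# Shortened, illustrative list of country codes
toy_europe_codes = ['ch', 'de']

# isin() yields a boolean mask that is True for rows with a listed country
toy_europe = toy_world[toy_world['Country'].isin(toy_europe_codes)]
# str.startswith() narrows the result to cities beginning with z
toy_z = toy_europe[toy_europe['City'].str.startswith('z')]
print(toy_z['City'].tolist())
```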
Now we will create an interactive visualization that displays each city, as a dot,
that starts with the currently selected first letter. The letter selected by default
will be A. We need a way to iterate through the letters using the left and right
arrows. As described in the introductory Custom Layers section, we can make use
of the on_key_release method, which is specifically designed for this.
11. Filter the self.data dataset in the invalidate method using the current
letter acquired from the start_letters array using the
self.start_letter index:
# custom layer creation
import pyglet
import geoplotlib
from geoplotlib.layers import BaseLayer
from geoplotlib.core import BatchPainter
from geoplotlib.utils import BoundingBox
start_letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', \
                 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', \
                 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
12. Create a new BatchPainter() instance and project the lon and lat values
to the x and y values. Use the BatchPainter instance to paint the points on
the map with a size of 2:
self.painter = BatchPainter()
x, y = proj.lonlat_to_screen(start_letter_data['lon'], \
start_letter_data['lat'])
self.painter.points(x, y, 2)
13. Call the batch_draw() method in the draw method and use the
ui_manager to add an info dialog to the screen telling the user which
starting letter is currently being used:
15. Now call the add_layer() method of geoplotlib, providing our custom layer
with the given BoundingBox class of Europe:
geoplotlib.add_layer(FilterLayer(europe_dataset, europe_bbox))
geoplotlib.show()
Figure 5.44: A dot density plot of cities starting with A in Europe in the custom layer
Pressing the right arrow key twice will lead to the custom layer plotting the cities
starting with a C:
Figure 5.45: A dot density plot of cities starting with C in Europe in the custom layer
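The arrow-key navigation can be reduced to index arithmetic on start_letters. The following sketch leaves out geoplotlib and pyglet entirely; next_index is a hypothetical helper, and wrapping around at the ends of the alphabet is an assumption rather than something the solution is confirmed to do:

```python
import string

# The 26 uppercase letters, 'A' through 'Z'
start_letters = list(string.ascii_uppercase)


def next_index(current, step):
    """Move through start_letters by step, wrapping around at either end."""
    return (current + step) % len(start_letters)


index = 0                       # 'A' is selected by default
index = next_index(index, 1)    # right arrow
index = next_index(index, 1)    # right arrow again
print(start_letters[index])
```

In a custom layer, the key handler would update the index in this way and then trigger a redraw so that the new letter's cities are painted.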
Note
To access the source code for this specific section, please refer to
https://packt.live/2Y63NBi.
This section does not currently have an online interactive example, and will
need to be run locally.
import pandas as pd
from bokeh.io import output_notebook
output_notebook()
dataset = pd.read_csv('../../Datasets/automobiles.csv')
4. Use the head method to print the first five rows of the dataset:
dataset.head()
Figure 6.36: Loading the top five rows of the automobile dataset
Chapter 6: Making Things Interactive with Bokeh | 457
2. First, use the index as our x-axis since we just want to plot each car with its price.
Create a new column in our dataset that uses dataset.index as values:
dataset['index'] = dataset.index
Once we have our usable index column, we can plot our cars.
3. Create a new figure and plot each car using a scatter plot with the index and
price column. Give the visualization a title of Car prices and name the x-axis
Car Index. Name the y-axis Price:
from bokeh.plotting import figure, show
plot = figure(title='Car prices', x_axis_label='Car Index', \
              y_axis_label='Price')
plot.scatter(dataset['index'], dataset['price'])
show(plot)
1. Group the dataset using groupby and the make column. Then use the mean
method to get the mean value for each column. We don't want the make
column to be used as an index, so provide the as_index=False argument to
groupby. Print out the grouped average dataset to see how it differs from the
initial dataset:
Figure 6.38: New grouped dataset with mean values for columns
Note that we are dealing with categorical data, the manufacturer name,
this time.
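The grouping described in step 1 can be sketched as follows. The DataFrame (toy_cars) is a hypothetical excerpt with invented values, and passing numeric_only=True is an assumption made here so that text columns are skipped when averaging:

```python
import pandas as pd

# Hypothetical excerpt with the make and price columns used in the activity
toy_cars = pd.DataFrame({
    'make': ['audi', 'audi', 'bmw'],
    'price': [13950.0, 17450.0, 16430.0],
    'body-style': ['sedan', 'sedan', 'sedan'],
})

# as_index=False keeps make as a regular column instead of the index;
# numeric_only=True skips text columns that cannot be averaged
toy_grouped = toy_cars.groupby('make', as_index=False).mean(numeric_only=True)
print(toy_grouped)
```

Because make stays a regular column, it can be passed directly as the categorical x values of a plot.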
show(grouped_plot)
grouped_plot.xaxis.major_label_orientation = "vertical"
show(grouped_plot)
Figure 6.40: Car manufacturers with their mean car prices and vertical make labels
Adding color
To give the user a little bit more information about the data, we want to add
some color based on the mean price of each manufacturer. In addition to that,
we also want to increase the size of the points to make them pop more.
5. Create a new figure with the same name, labels, and x_range as before. Plot
each manufacturer and provide a size argument with a size of 15. Provide the
color argument to the scatter method and use the field and transform
attributes to provide the column (y) and color_mapper. As we've done before,
set the label orientation to vertical:
from bokeh.models import LinearColorMapper
color_mapper = \
    LinearColorMapper(palette='Magma256', \
                      low=min(grouped_average['price']), \
                      high=max(grouped_average['price']))
grouped_colored_plot = \
    figure(title='Car Manufacturer Mean Prices', \
           x_axis_label='Car Manufacturer', \
           y_axis_label='Mean Price', \
           x_range=grouped_average['make'])
grouped_colored_plot.scatter(grouped_average['make'], \
grouped_average['price'], \
color={'field': 'y', \
'transform': color_mapper}, \
size=15)
grouped_colored_plot.xaxis.major_label_orientation = "vertical"
show(grouped_colored_plot)
Figure 6.41: Car manufacturers with their mean car prices colored based on the mean price
Note
To access the source code for this specific section, please refer to
https://packt.live/3hxyHdr.
import pandas as pd
dataset = pd.read_csv('../../Datasets/olympia2016_athletes.csv')
5. Call head on the DataFrame to test that our data has been successfully loaded:
dataset.head()
Figure 6.42: Loading the top five rows of the olympia2016_athletes dataset
using the head method
6. Import figure and show from the plotting interface and interact, as a
decorator, from the widgets interface:
from bokeh.plotting import figure, show, ColumnDataSource
from ipywidgets import interact, widgets
7. Get a list of unique countries and one for the number of athletes and the
number of medals per country. Use the groupby method of your dataset to
achieve this:
countries = dataset['nationality'].unique()
athletes_per_country = dataset.groupby('nationality').size()
medals_per_country = dataset.groupby('nationality')\
                     [['gold', 'silver', 'bronze']].sum()
8. Use two IntSlider widgets that will control the max numbers for the number
of athletes and/or medals a country is allowed to have in order to be displayed
in the visualization. Get the maximum number of medals of all the countries and
the maximum number of athletes of all the countries:
max_medals = medals_per_country.sum(axis=1).max()
max_athletes = athletes_per_country.max()
9. Use those maximum numbers as the maximum for two IntSlider widgets.
Display the max_athletes_slider in a vertical orientation and the max_
medals_slider in a horizontal orientation. In the visualization, they should be
described as Max. Athletes and Max. Medals:
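max_athletes_slider=\
widgets.IntSlider(value=max_athletes, min=0, max=max_athletes, \
                  step=1, description='Max. Athletes:', \
                  continuous_update=False, \
                  orientation='vertical')
max_medals_slider=\
widgets.IntSlider(value=max_medals, min=0, max=max_medals, \
                  step=1, description='Max. Medals:', \
                  continuous_update=False, \
                  orientation='horizontal')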
10. After setting up the widgets, implement the method that will be called with
each update of the interaction widgets. Use the @interact decorator for
this. Instead of value ranges or lists, provide the variable names of our already
created widgets in the decorator:
@interact(max_athletes=max_athletes_slider, \
          max_medals=max_medals_slider)
def get_olympia_stats(max_athletes, max_medals):
    show(get_plot(max_athletes, max_medals))
Since we have already set up the empty method that will return a plot, we can
call show() with the method call inside it to show the result once it is returned
from the get_plot method.
11. Scroll up and implement the plotting we skipped in a previous step. The two
arguments passed are max_athletes and max_medals. First, filter our
countries dataset, which contains all the countries that placed athletes in the
Olympic games, keeping only the countries whose number of medals and number of
athletes are both less than or equal to the max values that were passed as
arguments. Once we have a filtered dataset, create our DataSource. This
DataSource will be used both for the tooltips and the drawing of the circle glyphs.
Note
There is extensive documentation on how to set up and use tooltips,
which can be accessed with the following link:
https://bokeh.pydata.org/en/latest/docs/user_guide/tools.html.
12. Create a new plot using the figure method that has the following attributes:
title set to 'Rio Olympics 2016 - Medal comparison', x_axis_
label set to 'Number of Medals', and y_axis_label set to 'Num
of Athletes':
# creating the scatter plot
def get_plot(max_athletes, max_medals):
    filtered_countries=[]
    for country in countries:
        if (athletes_per_country[country] <= max_athletes and \
            medals_per_country.loc[country].sum() <= max_medals):
            filtered_countries.append(country)
    data_source=get_datasource(filtered_countries)
    TOOLTIPS=[('Country', '@countries'), ('Num of Athletes', '@y'), \
              ('Gold', '@gold'), ('Silver', '@silver'), \
              ('Bronze', '@bronze')]
13. Display every country with a different color by randomly creating the colors with
a six-digit hex code. The following method does this:
"""
get a 6 digit random hex color to differentiate the countries better
"""
import random
def get_random_color():
    return '#%06x' % random.randint(0, 0xFFFFFF)
14. Use a Bokeh ColumnDataSource object to handle our data and make it easily
accessible for our tooltip and glyphs. We want to display additional information
in a tooltip, so add the color field, which holds the required amount of random
colors; the countries field, which holds the filtered list of countries; the
gold, silver, and bronze fields, which hold the number of gold, silver,
and bronze medals for each country, respectively; the x field, which holds the
summed number of medals for each country; and the y field, which holds the
number of athletes for each country, to our DataSource object:
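The code for this step is not shown above. As a minimal sketch, the fields described in the step could be collected in a plain dictionary like this. The helper name build_source_data and its explicit arguments are assumptions made for illustration; in the notebook, the resulting dictionary would be wrapped with ColumnDataSource(data=...) inside get_datasource:

```python
import random

import pandas as pd


def get_random_color():
    # six-digit random hex color, as defined in the previous step
    return '#%06x' % random.randint(0, 0xFFFFFF)


def build_source_data(filtered_countries, athletes_per_country,
                      medals_per_country):
    # hypothetical helper: gathers the fields listed in the step
    medals = medals_per_country.loc[list(filtered_countries)]
    return dict(
        color=[get_random_color() for _ in filtered_countries],
        countries=list(filtered_countries),
        gold=medals['gold'].tolist(),
        silver=medals['silver'].tolist(),
        bronze=medals['bronze'].tolist(),
        # x: summed number of medals, y: number of athletes
        x=medals[['gold', 'silver', 'bronze']].sum(axis=1).tolist(),
        y=athletes_per_country.loc[list(filtered_countries)].tolist(),
    )


# tiny made-up example data, only to show the shape of the result
example_athletes = pd.Series({'USA': 5, 'GER': 3})
example_medals = pd.DataFrame(
    {'gold': [2, 1], 'silver': [1, 0], 'bronze': [0, 1]},
    index=['USA', 'GER'])
example_data = build_source_data(['GER'], example_athletes, example_medals)
```

Every list in the dictionary has one entry per filtered country, which is what allows the tooltip fields (@countries, @gold, and so on) to line up with the glyphs.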
15. Execute the last cell with our @interact decorator once more. This time, it will
display our scatter plot with our interactive widgets. We will see each country in
a different color. Upon hovering over them, we will get more information about
each country, such as its short name, number of athletes, and the number of
gold, silver, and bronze medals they earned. The resulting visualization should
look as follows:
You've built a full visualization to display and explore data from the 2016
Olympics. We added two widgets to our visualization, which allowed us to filter the
displayed countries.
Note
To access the source code for this specific section, please refer to
https://packt.live/2CdiAl5.
# Import statements
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import squarify
sns.set()
2. Use pandas to read both CSV files located in the Datasets folder:
p_ny = pd.read_csv('../../Datasets/acs2017/pny.csv')
h_ny = pd.read_csv('../../Datasets/acs2017/hny.csv')
3. Use the given PUMA (public use microdata area code based on the 2010 census
definition, which are areas with populations of 100,000 or more) ranges to
further divide the dataset into NYC districts (Bronx, Manhattan, Staten Island,
Brooklyn, and Queens):
# PUMA ranges
bronx = [3701, 3710]
manhatten = [3801, 3810]
staten_island = [3901, 3903]
brooklyn = [4001, 4017]
queens = [4101, 4114]
nyc = [bronx[0], queens[1]]
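def puma_filter(data, puma_ranges):
    return data.loc[(data['PUMA'] >= puma_ranges[0]) \
                    & (data['PUMA'] <= puma_ranges[1])]
h_bronx = puma_filter(h_ny, bronx)
h_manhatten = puma_filter(h_ny, manhatten)
h_staten_island = puma_filter(h_ny, staten_island)
h_brooklyn = puma_filter(h_ny, brooklyn)
h_queens = puma_filter(h_ny, queens)
h_nyc = puma_filter(h_ny, nyc)
p_nyc = puma_filter(p_ny, nyc)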
Chapter 7: Combining What We Have Learned | 473
5. In this subtask, we will create a plot containing multiple subplots that visualize
information with regard to NYC wages. Before we create the plots, some data
wrangling is necessary:
def median_household_income(data):
    query = data.loc[np.isfinite(data['HINCP']), \
                     ['HINCP', 'WGTP']].values
    return np.round(weighted_median(query[:, 0], query[:, 1]) \
                    * income_adjustement)
h_ny_income_median = median_household_income(h_ny)
h_nyc_income_median = median_household_income(h_nyc)
h_bronx_income_median = median_household_income(h_bronx)
h_manhatten_income_median = median_household_income(h_manhatten)
h_staten_island_income_median = \
median_household_income(h_staten_island)
h_brooklyn_income_median = median_household_income(h_brooklyn)
h_queens_income_median = median_household_income(h_queens)
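The weighted_median helper and the income_adjustement factor used above are not shown in this section. The following is a minimal sketch of one common way to compute a weighted median (the lower weighted median), assuming values and weights are one-dimensional and the weights are non-negative; it is not necessarily the implementation used in the notebook:

```python
import numpy as np


def weighted_median(values, weights):
    # sketch: sort by value, then find the first value at which the
    # cumulative weight reaches half of the total weight
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    cutoff = 0.5 * cumulative[-1]
    return values[np.searchsorted(cumulative, cutoff)]


# a household with weight 8 dominates the two households with weight 1
example_median = weighted_median([10, 20, 30], [1, 1, 8])
```

Weighting matters here because each survey row stands for WGTP households, so an unweighted median of the rows would not describe the population.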
6. Compute the average wage by gender for the given occupation categories for the
population of NYC:
occ_ranges = \
{'Management, Business, Science, and Arts '\
'Occupations': [10, 3540], \
'Service Occupations': [3600, 4650], \
'Sales and Office Occupations': [4700, 5940], \
'Natural Resources, Construction, and '\
'Maintenance Occupations': [6000, 7630], \
'Production, Transportation, and Material '\
'Moving Occupations': [7700, 9750]}
wages_male = wage_by_gender_and_occupation(p_nyc, 1)
wages_female = wage_by_gender_and_occupation(p_nyc, 2)
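The wage_by_gender_and_occupation function is called above, but its definition is not shown. A possible sketch follows, assuming the column names OCCP (occupation code), SEX (1 and 2, matching the calls above), WAGP (wage), and PWGTP (person weight). Unlike the calls above, this sketch takes occ_ranges as an explicit third argument so that it is self-contained:

```python
import numpy as np
import pandas as pd


def wage_by_gender_and_occupation(data, gender, occ_ranges):
    # sketch: weighted average wage per occupation category
    wages = []
    for low, high in occ_ranges.values():
        rows = data.loc[(data['OCCP'] >= low) & (data['OCCP'] <= high)
                        & (data['SEX'] == gender)
                        & np.isfinite(data['WAGP']), ['WAGP', 'PWGTP']]
        weight_sum = rows['PWGTP'].sum()
        if weight_sum == 0:
            # no matching people in this category
            wages.append(float('nan'))
        else:
            wages.append(float(np.average(rows['WAGP'],
                                          weights=rows['PWGTP'])))
    return wages


# made-up rows, only to illustrate the calculation
example_people = pd.DataFrame({'OCCP': [100, 200, 4000],
                               'SEX': [1, 1, 2],
                               'WAGP': [1000, 3000, 500],
                               'PWGTP': [1, 3, 2]})
example_ranges = {'A': [10, 3540], 'B': [3600, 4650]}
example_male = wage_by_gender_and_occupation(example_people, 1,
                                             example_ranges)
example_female = wage_by_gender_and_occupation(example_people, 2,
                                               example_ranges)
```

In the example, the male average for category A is (1000 * 1 + 3000 * 3) / 4 = 2500, because the second person carries three times the weight of the first.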
7. Compute the wage frequencies for New York and NYC. Use the following yearly
wage intervals: 10k steps between 0 and 100k, 50k steps between 100k and
200k, and >200k:
def wage_frequency(data):
    # Only consider people who have a job: salary > 0
    valid = data.loc[np.isfinite(data['WAGP']) \
                     & (data['WAGP'] > 0), ['WAGP', 'PWGTP']]
    overall_sum = np.sum(valid['PWGTP'].values)
    frequency = []
    for wage_bin in wage_bins.values():
        query = data.loc[(data['WAGP'] \
                          * income_adjustement > wage_bin[0]) \
                         & (data['WAGP'] \
                            * income_adjustement <= wage_bin[1]), \
                         ['PWGTP']].values
        frequency.append(np.sum(query) / overall_sum)
    return frequency
wages_nyc = wage_frequency(p_nyc)
wages_ny = wage_frequency(p_ny)
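The wage_bins dictionary that wage_frequency iterates over is not shown above. One way to build it from the intervals named in this step could look as follows. The helper name build_wage_bins and the label format are assumptions; only the interval boundaries come from the step description:

```python
import math


def build_wage_bins():
    # sketch: label -> [lower bound, upper bound]
    bins = {}
    # 10k steps between 0 and 100k
    for low in range(0, 100_000, 10_000):
        high = low + 10_000
        bins['{}-{}k'.format(low // 1000, high // 1000)] = [low, high]
    # 50k steps between 100k and 200k
    for low in range(100_000, 200_000, 50_000):
        high = low + 50_000
        bins['{}-{}k'.format(low // 1000, high // 1000)] = [low, high]
    # everything above 200k
    bins['>200k'] = [200_000, math.inf]
    return bins


wage_bins = build_wage_bins()
```

Because wage_frequency compares with > on the lower bound and <= on the upper bound, each wage falls into exactly one bin.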
8. Create a plot containing multiple subplots that visualize information with regard
to NYC wages. Now, visualize the median household income for the US, New
York, NYC, and its districts. Next, visualize the average wage by gender for
the given occupation categories for the population of NYC. Then, visualize the
wage distribution for New York and NYC. Lastly, use the following yearly wage
intervals: 10k steps between 0 and 100k, 50k steps between 100k and 200k,
and >200k:
ax2.set_xticks(x)
ax2.set_xticklabels(occ_categories, rotation=0, fontsize=8)
ax2.set_ylabel('Average Salary in $')
# Wage distribution
ax3.set_title('Wage Distribution', fontsize=14)
x = np.arange(len(wages_nyc)) + 1
width = 0.4
ax3.bar(x - width / 2, np.asarray(wages_nyc) \
* 100, width=width, label='NYC')
ax3.bar(x + width / 2, np.asarray(wages_ny) \
* 100, width=width, label='New York')
ax3.legend()
ax3.set_xticks(x)
ax3.set_xticklabels(wage_bins.keys(), rotation=90, fontsize=8)
ax3.set_ylabel('Percentage')
ax3.vlines(x=9.5, ymin=0, ymax=15, linestyle='--')
# Overall figure
fig.tight_layout()
plt.show()
Figure 7.10: Wage statistics for NYC in comparison with New York and the United States
9. Use a tree map to visualize the percentage for the given occupation
subcategories for the population of NYC:
def occupation_percentage(data):
    percentages = []
    overall_sum = np.sum(data.loc[(data['OCCP'] >= 10) \
                                  & (data['OCCP'] <= 9750), \
                                  ['PWGTP']].values)
    for occ in occ_subcategories.values():
        query = data.loc[(data['OCCP'] >= occ[0]) \
                         & (data['OCCP'] <= occ[1]), \
                         ['PWGTP']].values
        percentages.append(np.sum(query) / overall_sum)
    return percentages
occ_percentages = occupation_percentage(p_nyc)
Note
Please note that the terms here addressed refer solely to the classifications
of disabilities as defined by the US Census Bureau (accessible through the
following link: https://www.census.gov/topics/health/disability/guidance/data-
collection-acs.html). This language does not reflect the views or intentions of
Packt or its affiliates.
Independent living difficulty: Because of a physical, mental, or
emotional problem, having difficulties performing errands alone, such as
visiting a doctor's office or shopping (DOUT).
Ambulatory difficulty: Having serious difficulty walking or climbing
stairs (DPHY).
Self-care difficulty: Having difficulty bathing or dressing (DDRS).
10. Use a heatmap to show the correlation between the different disability types
(self-care difficulty, hearing difficulty, vision difficulty, independent living
difficulty, ambulatory difficulty, veteran service-connected disability, and
cognitive difficulty) and age groups (<5, 5-11, 12-14, 15-17, 18-24, 25-34, 35-44,
45-54, 55-64, 65-74, 75+) in New York City:
age_groups = {'<5': [0, 4], '5-11': [5, 11], '12-14': [12, 14], \
'15-17': [15, 17], '18-24': [18, 24], \
'25-34': [25, 34], '35-44': [35, 44], \
'45-54': [45, 54], '55-64': [55, 64], \
              '65-74': [65, 74], '75+': [75, np.inf]}
def difficulty_age_array(data):
    array = np.zeros((len(difficulties.values()), \
                      len(age_groups.values())))
    for d, diff in enumerate(difficulties.values()):
        for a, age in enumerate(age_groups.values()):
            age_sum = np.sum(data.loc[(data['AGEP'] >= age[0]) \
                                      & (data['AGEP'] <= age[1]), \
                                      ['PWGTP']].values)
            query = data.loc[(data['AGEP'] >= age[0]) \
                             & (data['AGEP'] <= age[1]) \
                             & (data[diff] == 1), ['PWGTP']].values
            array[d, a] = np.sum(query) / age_sum
    return array
array = difficulty_age_array(p_nyc)
# Heatmap
plt.figure(dpi=300)
ax = sns.heatmap(array * 100, \
                 cmap=sns.cubehelix_palette(rot=-.3, as_cmap=True))
ax.set_yticklabels(difficulties.keys(), rotation=0)
ax.set_xticklabels(age_groups.keys(), rotation=90)
ax.set_xlabel('Age Groups')
ax.set_title('Percentage of NYC population with difficulties', \
fontsize=14)
plt.show()
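The code in this step reads from a difficulties dictionary that is not shown above. A sketch of what it might contain is given here. The codes DOUT, DPHY, and DDRS are taken from the note preceding this step; the remaining four codes (DEAR, DEYE, DRATX, and DREM) are assumptions based on the ACS PUMS variable names and should be checked against the data dictionary of the dataset:

```python
# label shown on the heatmap -> column name in the person dataset
difficulties = {
    'Self-care difficulty': 'DDRS',           # from the note above
    'Hearing difficulty': 'DEAR',             # assumed column name
    'Vision difficulty': 'DEYE',              # assumed column name
    'Independent living difficulty': 'DOUT',  # from the note above
    'Ambulatory difficulty': 'DPHY',          # from the note above
    'Veteran service-connected disability': 'DRATX',  # assumed
    'Cognitive difficulty': 'DREM',           # assumed column name
}
```

The keys become the y-axis tick labels of the heatmap, and the values are the columns that difficulty_age_array compares with 1.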
Note
To access the source code for this specific section, please refer to
https://packt.live/3e7xU0z.
This section does not currently have an online interactive example, and will
need to be run locally.
output_notebook()
3. After downloading the dataset and moving it into the Datasets folder, import
our stock_prices.csv data:
4. Check the first five rows on our DataFrame to make sure that our data has been
loaded successfully:
5. Since the date column has no information regarding the hour, minute, and
second (all 00:00:00), avoid displaying them in the visualization later on by
simply displaying the year, month, and day. Create a new column that holds the
formatted short version of the date value. Display the first five elements of the
dataset again to validate your new column:
dataset['short_date'] = \
dataset.apply(lambda x: shorten_time_stamp(x), axis=1)
# looking at the dataset with shortened date
dataset.head()
Note
The execution of the cell will take a moment since it's a fairly large dataset.
Please, be patient.
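The shorten_time_stamp function applied in step 5 is not shown above. A minimal sketch is given here, assuming that the column is named date and that long values follow the pattern YYYY-MM-DD HH:MM:SS (both are assumptions based on the step description):

```python
from datetime import datetime


def shorten_time_stamp(row):
    # sketch: row is one DataFrame row (or any mapping with a 'date' key)
    shortened = str(row['date'])
    if len(shortened) > 10:
        # parse the long form and keep only year, month, and day
        parsed_date = datetime.strptime(shortened, '%Y-%m-%d %H:%M:%S')
        shortened = parsed_date.strftime('%Y-%m-%d')
    return shortened


example_short = shorten_time_stamp({'date': '2016-01-05 00:00:00'})
```

Keeping the result in ISO order (year first) also means that the string comparisons used later, such as >= '2016-01-01', sort the dates correctly.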
7. Scroll down to the cell that says #extracting the necessary data
before implementing the plotting. Make sure that you execute the cells below
that, even though this will simply pass and do nothing for now. Extract the
following information: a list of unique stock names that are present in the
dataset, a list of all short_dates that are in 2016, a sorted list of unique dates
generated from the previous list of dates from 2016, and a list with the values
open-close and volume:
# extracting the necessary data
stock_names=dataset['symbol'].unique()
dates_2016=dataset[dataset['short_date'] >= '2016-01-01']\
['short_date']
unique_dates_2016=sorted(dates_2016.unique())
value_options=['open-close', 'volume']
8. Given the extracted information from the preceding cell, define widgets and
provide the available options for it. Create a dropdown with the stock_names,
which, by default, should have the AAPL stock selected, named Compare:. The
second dropdown also uses stock_names, but, by default, should have the
AON stock selected, named to:
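drp_1=widgets.Dropdown(options=stock_names, value='AAPL', \
                       description='Compare:')
drp_2=widgets.Dropdown(options=stock_names, value='AON', \
                       description='to:')
9. Add a SelectionRangeSlider, labeled From-To, that selects a date range
from the sorted list of unique dates from 2016:
range_slider=\
widgets.SelectionRangeSlider(options=unique_dates_2016, \
                             index=(0,25), \
                             continuous_update=False, \
                             description='From-To', \
                             layout={'width': '500px'})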
10. Add a RadioButtons group that provides the open-close and volume
options. By default, open-close should be selected, named Metric:
value_radio=widgets.RadioButtons(options=value_options, \
value='open-close', \
description='Metric')
Note
As we mentioned in Chapter 6, Making Things Interactive with Bokeh, we
can also make use of the widgets that are described here: https://ipywidgets.
readthedocs.io/en/stable/examples/Widget%20List.html.
11. After setting up the widgets, implement the method that will be called with each
update of the interactive widgets. Use the @interact decorator for this.
Instead of value ranges or lists, provide the variable names of our already
created widgets in the decorator. The method will get four arguments:
stock_1, stock_2, date, and value.
Since we have already set up the empty method that will return the preceding
plot, call show() with the method call inside to show the result once it
is returned from the get_stock_for_2016 method. Now, create the
interact method:
@interact(stock_1=drp_1, stock_2=drp_2, date=range_slider, \
          value=value_radio)
def get_stock_for_2016(stock_1, stock_2, date, value):
    show(get_plot(stock_1, stock_2, date, value))
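The get_plot method itself is not shown in this section. One part of it has to restrict the dataset to a single stock and to the date range chosen with the slider. A sketch of that selection is given here; the helper name select_stock_range is an assumption, while the symbol and short_date columns are the ones used earlier in this activity:

```python
import pandas as pd


def select_stock_range(dataset, stock_name, date_range):
    # sketch: date_range is the (start, end) tuple from the range slider
    stock = dataset[dataset['symbol'] == stock_name]
    in_range = (stock['short_date'] >= date_range[0]) \
               & (stock['short_date'] <= date_range[1])
    return stock[in_range]


# made-up rows, only to illustrate the filtering
example_prices = pd.DataFrame({
    'symbol': ['AAPL', 'AAPL', 'AON', 'AAPL'],
    'short_date': ['2016-01-04', '2016-01-05', '2016-01-05', '2016-02-01'],
    'close': [105.3, 102.7, 88.1, 96.4]})
example_range = select_stock_range(example_prices, 'AAPL',
                                   ('2016-01-01', '2016-01-31'))
```

The comparison works on plain strings because short_date is stored in year-month-day order.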
12. Start with the so-called candlestick visualization, which is often used with stock
price data. Calculate the mean for every (high/low) pair and then plot those
data points with a line with the given color. Next, set up an add_candle_plot
function that gets a plot object, a stock_name parameter, a stock_range
parameter containing the data of only the selected date range that was defined
with the widgets, and a color for the line. Create a segment that creates the
vertical line, and either a green or red vbar to color code whether the close
price is lower than the open price. Once the candles are created, draw a
continuous line running through the mean (high, low) point of each candle:
plot.vbar(stock_range['short_date'][dec_1], w, \
stock_range['high'][dec_1], \
stock_range['close'][dec_1], \
fill_color="red", line_color="black", \
legend_label=('Mean price of ' + stock_name))
stock_mean_val=stock_range[['high', 'low']].mean(axis=1)
plot.line(stock_range['short_date'], stock_mean_val, \
legend_label=('Mean price of ' + stock_name), \
line_color=color, alpha=0.5)
Note
Make sure to reference the example provided in the Bokeh library here. You
can adapt the code in there to our arguments: https://bokeh.pydata.org/en/
latest/docs/gallery/candlestick.html.
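The excerpt above uses dec_1 and w without showing where they come from. The following sketch shows how such values are typically derived for a candlestick plot. The name inc_1, the bar width of 0.5, and the sample prices are assumptions made for illustration:

```python
import pandas as pd

# made-up prices for three trading days
stock_range = pd.DataFrame({
    'short_date': ['2016-01-04', '2016-01-05', '2016-01-06'],
    'open': [10.0, 12.0, 11.0],
    'close': [12.0, 11.0, 11.5],
    'high': [12.5, 12.5, 12.0],
    'low': [9.5, 10.5, 10.0]})

# boolean masks: days on which the price rose or fell
inc_1 = stock_range['close'] > stock_range['open']
dec_1 = stock_range['open'] > stock_range['close']
# width of each candle body (assumed value)
w = 0.5
# mean of high and low per day, used for the continuous line
stock_mean_val = stock_range[['high', 'low']].mean(axis=1)
```

Indexing a column with one of the masks, as in stock_range['short_date'][dec_1], keeps only the days on which the price fell, which is why those bars can be drawn in a single color.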
13. After you have implemented the add_candle_plot method, scroll down and
rerun the @interact cell. You will now see the candles being displayed for the
two selected stocks. The final missing step is implementing the plotting of the
lines if the volume value is selected.
14. Add an interactive legend that allows us to mute, meaning gray out, each stock
in the visualization:
return plot
Note
To make our legend interactive, please take a look at the documentation
for the legend feature: https://bokeh.pydata.org/en/latest/docs/user_guide/
interaction/legends.html.
The complete code for this step can be found on GitHub: https://github.com/
PacktWorkshops/The-Data-Visualization-Workshop/blob/master/Chapter07/
Activity7.02/Activity7.02.ipynb.
15. After our implementation has finished, execute the last cell with our @interact
decorator once more. This time, it will display our candlestick plot and, once we
switch to the volume RadioButton, we will see the volumes displayed that have
been traded at the given dates. The resulting visualization should look something
like this:
Figure 7.15: Final interactive visualization that displays the candlestick plot
Figure 7.16: Final interactive visualization that displays the volume plot
You have now built a full visualization to display and explore stock price data.
We added several widgets to our visualization that allow us to select the stocks
to be compared, restrict the displayed data to a specific date range, and even
display two different kinds of plots.
Note
To access the source code for this specific section, please refer to
https://packt.live/37ADxSM.
2. Use the read_csv method of pandas to load the .csv file. If your computer is
a little slow, use the smaller dataset:
4. Remember that geoplotlib needs latitude and longitude columns with the
names lat and lon. We will, therefore, add new columns for lat and lon and
assign the corresponding value columns to them:
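The code for this step is not shown here. A minimal sketch follows, assuming the source columns are named latitude and longitude (the names used later in this activity) and using a small made-up DataFrame in place of the real dataset:

```python
import pandas as pd

# stand-in for the Airbnb dataset loaded in the previous steps
dataset = pd.DataFrame({'latitude': [40.7, 40.8],
                        'longitude': [-74.0, -73.9]})

# geoplotlib looks for columns named lat and lon
dataset['lat'] = dataset['latitude']
dataset['lon'] = dataset['longitude']
```

Copying the columns, rather than renaming them, leaves the original latitude and longitude columns available for any later step that expects them.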
5. In order to use a color map that changes color based on the price of
accommodation, we need a value that can easily be compared and checked
whether it's smaller or bigger than any other listing. Therefore, create a new
column called dollar_price that will hold the value of the price column
as a float. Make sure to fill all the NaN values of the price column and the
review_scores_rating column with 0.0 by using the fillna() method
of the dataset:
"""
create new dollar_price column with the price as a number \
and replace the NaN values by 0 in the rating column
"""
dataset['price'] = dataset['price'].fillna('$0.0')
dataset['review_scores_rating'] = \
dataset['review_scores_rating'].fillna(0.0)
dataset['dollar_price'] = \
dataset['price'].apply(lambda x: convert_to_float(x))
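The convert_to_float function used above is not shown in this section. A possible sketch is given here, assuming prices are strings such as $1,250.00 with a leading dollar sign and comma thousands separators:

```python
def convert_to_float(price):
    # sketch: strip the currency symbol and separators, then parse
    try:
        return float(str(price).replace('$', '').replace(',', ''))
    except ValueError:
        # anything that cannot be parsed is treated as a price of 0.0
        return 0.0


example_price = convert_to_float('$1,250.00')
```

Falling back to 0.0 keeps the column numeric, which is what the color map needs in order to compare listings.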
6. This dataset has 96 columns. When working with such a large dataset, it makes
sense to think about what data we need and create a subsection of our dataset
that only holds that data. Before we can do that, we'll take a look at all the
available columns and an example entry for each column. This will help us decide
what information is suitable:
# print the col name and the first entry per column
for col in dataset.columns:
    print('{}\t{}'.format(col, dataset[col][0]))
Figure 7.18: Each column header with an example entry from the dataset
7. Trim down the number of columns our working dataset has by creating a
subsection of the columns with id, latitude as lat, longitude as lon,
price in $, and review_scores_rating:
"""
create a subsection of the dataset with the above-mentioned columns
"""
columns=['id', 'lat', 'lon', 'dollar_price', \
'review_scores_rating']
sub_data=dataset[columns]
Figure 7.19: Displaying the first five rows after keeping only five columns
"""
import DataAccessObject and create a data object \
as an instance of that class
"""
from geoplotlib.utils import DataAccessObject
data = DataAccessObject(sub_data)
# plotting the whole dataset with dots
geoplotlib.dot(data)
geoplotlib.show()
10. The final step is to write the custom layer. Define a ValueLayer class that
extends the BaseLayer object of geoplotlib. For the interactive feature
mentioned, we require an additional import. pyglet provides us with the
option to act on key presses:
class ValueLayer(BaseLayer):
    def bbox(self):
        # bounding box that gets used when the layer is created
        pass
11. Initiate the following instance variables in the __init__ method of the
ValueLayer class: first, self.data, which holds the dataset; second, self.
display, which holds the currently selected attribute name; third, self.
painter, which holds an instance of the BatchPainter class; fourth, self.
view, which holds the BoundingBox function; and lastly, self.cmap, which
holds a color map with the jet color schema and an alpha of 255 and
100 levels:
class ValueLayer(BaseLayer):
12. Implement the bbox, draw, and on_key_release methods for the
ValueLayer class. First, return the self.view variable in the bbox
method. Then, in the draw method, set the ui_manager.info text to Use left
and right to switch between the displaying of price and
ratings. Currently displaying: dollar_price or review_scores_rating,
depending on what the self.display variable holds. Lastly, in the
on_key_release method, check whether the left or right key has been pressed
and switch the self.display variable between dollar_price and review_
scores_rating. Return True if the left or the right key has been
pressed to trigger a redraw of the dots; otherwise, return False. The full
custom layer notebook cell will look like this:
class ValueLayer(BaseLayer):
    def bbox(self):
        # bounding box that gets used when a layer is created
        return self.view
13. Given the data, plot each point on the map with a color that is defined by the
currently selected attribute, either price or rating. First, in the invalidate
method, assign a new BatchPainter() instance to the self.painter
variable. Second, get the max value of the dataset given the current self.
display variable. Third, use a log scale if dollar_price is used; otherwise,
use a lin (linear) scale. Lastly, map the value to a color using the cmap object
we defined in the __init__ method and plot each point with the given color onto
the map with a size of 5:
class ValueLayer(BaseLayer):
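The bodies of the ValueLayer methods are not shown in full here. The two decisions described in steps 12 and 13, switching the displayed attribute and choosing the scale, can be sketched as plain functions that do not depend on geoplotlib. The function names toggle_display and scale_for are assumptions made for illustration:

```python
DISPLAY_OPTIONS = ('dollar_price', 'review_scores_rating')


def toggle_display(current):
    # switch between the two attributes, as on_key_release would do
    # when the left or right arrow key is released
    if current == DISPLAY_OPTIONS[0]:
        return DISPLAY_OPTIONS[1]
    return DISPLAY_OPTIONS[0]


def scale_for(display):
    # prices use a log scale, ratings use a linear scale
    return 'log' if display == 'dollar_price' else 'lin'


example_display = toggle_display('dollar_price')
example_scale = scale_for(example_display)
```

Inside the layer, the result of toggle_display would be stored in self.display, and the result of scale_for would be passed along with the value and the maximum when mapping a value to a color.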
After launching our visualization, we can see that our viewport is focused on
New York. Every accommodation is displayed with one dot. Each dot is colored,
based on either its price or (upon clicking the right or left arrow) the rating. We
can see that the general color gets closer to yellow/orange the closer we get to
central Manhattan. On the other hand, in the rating visualization, we can see that
the accommodation in central Manhattan appears to be rated lower than the
accommodation outside:
Figure 7.21: New York Airbnb dot map, colored based on price
The following diagram shows a dot map with color based on rating:
Figure 7.22: New York Airbnb dot map, colored based on ratings
You have just created an interactive visualization by writing your custom layer to
display and visualize price and rating information for Airbnb accommodations
spread across New York.
Note
To access the source code for this specific section, please refer to
https://packt.live/3eioPSA.
This section does not currently have an online interactive example, and will
need to be run locally.