4 Data Visualization
In Python, we have one more conditional statement called the elif statement. An elif statement is used to check additional conditions when the given if condition is false. It is like an if-else statement; the only difference is that an else block does not check a condition, whereas an elif block does.
Elif statements are therefore similar to if-else statements, but they let us evaluate multiple conditions.
Let’s take an example to implement the elif statement. In this example, the if block is executed if the if condition is true, the elif block is executed if the elif condition is true, and the else block is executed if both the if and elif conditions are false.
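A minimal sketch of such a chain (the variable marks and the cutoffs are assumptions, since the original example figure is not reproduced):

marks = 65
if marks >= 80:
    print("Distinction")
elif marks >= 50:
    print("Pass")    # executed here, since 65 >= 50
else:
    print("Fail")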
Nested if-else statements mean that an if statement or if-else statement is present inside another if or if-else block. Python provides this feature as well, which in turn helps us check multiple conditions in a given program: an if statement can be present inside another if statement, which is itself inside another if statement, and so on. A short sketch follows.
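A brief illustrative sketch (the variable name num and the checks are assumptions):

num = 15
if num > 0:
    if num % 2 == 0:
        print("positive and even")
    else:
        print("positive and odd")    # executed for 15
else:
    print("not positive")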
Numpy and Pandas > Numpy Overview - Creating and Accessing Numpy Arrays
What is NumPy?
NumPy is the fundamental package for scientific computing in Python. It is a Python library
that provides a multidimensional array object, various derived objects (such as masked arrays
and matrices), and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms,
basic linear algebra, basic statistical operations, random simulation and much more.
Difference between NumPy arrays and lists
There are several important differences between NumPy arrays and the standard Python sequences. NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically).
The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of differently sized elements.
NumPy arrays facilitate advanced mathematical and other types of operations on large numbers
of data. Typically, such operations are executed more efficiently and with less code than is
possible using Python’s built-in sequences.
A "numpy" array or "ndarray" is similar to a list. It's usually fixed in size and each element is
of the same type, we can cast the list to numpy array by first importing the numpy. Or We can
also quickly create the numpy array with arange function which creates an array within the
range specified.
In example Since there is no value after the comma (20,) this is a one-dimensional array.
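A minimal sketch of both creation methods (the list contents are an assumption):

import numpy as np

a = np.array([1, 2, 3, 4])    # cast a list to a NumPy array
b = np.arange(20)             # integers 0 through 19
print(b.shape)                # (20,) -- one-dimensional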
Accessing a NumPy 1D Array
Accessing and slicing operations for a 1D array are the same as for a list. Index values start at 0 and run up to the length of the array minus one.
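For instance (the values are assumptions):

import numpy as np

arr = np.arange(10)    # [0 1 2 ... 9]
print(arr[0])          # first element -> 0
print(arr[2:5])        # slice -> [2 3 4]
print(arr[-1])         # last element -> 9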
If you only use the arange function, it will output a one-dimensional array. To make it a two-dimensional array, chain its output with the reshape function.
In the following example, arange first creates 15 integers, and reshape then converts them into a two-dimensional array with 3 rows and 5 columns.
To access an element in a two-dimensional array, you need to specify an index for both the row
and the column.
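A short sketch of both steps:

import numpy as np

m = np.arange(15).reshape(3, 5)    # 3 rows, 5 columns
print(m[1, 2])                     # row 1, column 2 -> 7
print(m[0])                        # the entire first row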
Introduction to Pandas
What is pandas?
Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it. Python with pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc.
We can import the library or a dependency like pandas using the “import
pandas” command. We now have access to many pre-built classes and functions.
In order to be able to work with the data in Python, we’ll need to read the data file (csv, excel, dictionary, ...) into a Pandas DataFrame. A DataFrame is a way to represent and work with tabular data. Tabular data has rows and columns, just like our csv file. In order to read in the data, we’ll need to use the pandas.read_csv function. This function will take in a csv file and return a DataFrame.
What is csv?
csv stands for comma-separated values. A csv file is a delimited text file that uses a comma to separate values; it stores tabular data in plain text. Each line of the file is a data record, and each record consists of one or more fields, separated by commas.
Once the pandas library is imported (this assumes the library is installed), we can load a csv file using the pandas built-in function read_csv. A csv is a typical file type used to store data.
We simply type the word pandas, then a dot and the name of the function with all the inputs.
Typing pandas all the time may get tedious.
We can use the "as" statement to shorten the name of the library; in this case we use the
standard abbreviation pd. Now we type pd and a dot followed by the name of the function we
would like to use, in this case, read_csv.
We need to give the path of the csv file as an argument to the read_csv function; to read the path string correctly, we use 'r' as a prefix so that it is treated as a raw string. The result is stored in the variable df, which is short for “dataframe." Now that we have the data in a dataframe, we can work with it. We can use the head method to see the first rows of the data frame (five by default), or we can pass the number of rows to be checked as an argument to the head method, like df.head(5) for 5 rows.
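A minimal sketch (the file path is an assumption):

import pandas as pd

df = pd.read_csv(r'C:\data\homes.csv')    # r'' treats the path as a raw string
print(df.head())                          # first five rows by default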
If you want to write the dataframe to csv, we can simply use the to_csv function.
Syntax:
df.to_csv(EXPORT FILE PATH)
If you look at the data frame loaded back from such a file, you will see two indexes: one loaded from the csv file, and an unnamed index that pandas generates by default while loading the csv. This problem can be avoided by making sure that the writing of CSV files doesn’t write indexes, because the DataFrame will generate one anyway. We can do this by specifying the index = False parameter in the to_csv(...) function.
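A short sketch of the round trip (the file names are assumptions):

df.to_csv('homes_out.csv', index=False)    # don't write the row index
df2 = pd.read_csv('homes_out.csv')         # no extra "Unnamed: 0" column on re-load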
These are the three steps we should perform to do statistical analysis on a pandas dataframe: collect the data, load it into a DataFrame, and compute the statistics.
Collect the data
You can store the collected data in csv, excel, or dictionary format. For this demo we store the home data in one csv file. Once you load that file with read_csv, you will get a DataFrame.
Once you have your Data Frame ready, you’ll be able to get the Descriptive Statistics. We
can calculate the following statistics using the pandas package:
Mean
Total sum
Maximum
Minimum
Count
Median
Standard deviation
Variance
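A sketch of these computations on a dataframe df with an assumed numeric column 'price':

df['price'].mean()      # Mean
df['price'].sum()       # Total sum
df['price'].max()       # Maximum
df['price'].min()       # Minimum
df['price'].count()     # Count
df['price'].median()    # Median
df['price'].std()       # Standard deviation
df['price'].var()       # Variance
df.describe()           # several of these at once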
You can further break down the descriptive statistics into more fine-grained measures.
Pandas working with text data and datetime columns
While working with data, it is not unusual to encounter time series data. Working with datetime columns can be quite a challenging task. Luckily, pandas is great at handling time series data and provides a set of tools with which we can perform all the necessary tasks on date-time data.
Let’s see how we can convert a dataframe column of strings (in dd/mm/yyyy format) to datetime format. We cannot perform any time series-based operations on the dates if they are not in the right format. To be able to work with them, we need to convert the dates into the datetime format.
For any operation, we first need to create the data frame from the collected data; we can load the data from a csv file, an excel file, or any other source. Let us use a csv file for our demo.
Follow the lines of code below to load the data and convert it to a DataFrame.
Once the data frame is ready, use df.info() to get complete information about the dataframe.
As we can see in the output, the data type of the ‘DateTime’ column is object, i.e. string. Now we will convert it to datetime format using the pd.to_datetime() function.
After applying the pd.to_datetime() function to the DateTime column, we can see in the output that the format of the ‘DateTime’ column has been changed to the datetime format.
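A minimal sketch (the file and column names are assumptions):

import pandas as pd

df = pd.read_csv('readings.csv')
df.info()                                                        # 'DateTime' shows as object
df['DateTime'] = pd.to_datetime(df['DateTime'], dayfirst=True)   # dd/mm/yyyy strings
df.info()                                                        # 'DateTime' is now datetime64[ns]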
How to change the index of the dataframe
Most of the operations related to datetime require the DateTime column to be the index, or else they will throw an error.
We can change the index with the set_index() function. It takes two parameters: one is the name of the column you want to use as the index, and the other is inplace. When inplace=True is passed, the dataframe is modified in place; when inplace=False is passed (this is the default value, so it isn’t necessary), the operation returns a modified copy of the object.
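A short sketch:

df.set_index('DateTime', inplace=True)    # modify df in place
# or equivalently, keep the original and work on a copy:
# df2 = df.set_index('DateTime')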
Now the DateTime is the index of the dataframe. Now we can perform DateTime operations
very easily.
To check all the values that occurred in a particular year, say 2018, run the command df['2018'].
The above result gives all the values recorded in the year 2018.
Similarly, you can view the observations for a particular year, month, and date range by slicing on the index. For example, if you need the observations between May 3rd and May 4th of 2018,
the command is: df['5/3/2018':'5/4/2018']
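A short sketch of this partial-string indexing; in recent pandas versions, row lookups like these go through .loc:

df.loc['2018']                     # all rows from the year 2018
df.loc['2018-05']                  # all rows from May 2018
df.loc['5/3/2018':'5/4/2018']      # rows between two dates (inclusive)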
Pandas Indexing and Selecting Data
What is indexing?
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all the columns, or some of each of the rows and columns. Indexing is also known as subset selection.
Let’s load a csv file and convert it to a data frame to perform the indexing and selection operations. Once the data is loaded into the data frame, let’s make Name the index of this data frame.
Selecting a single column:
In order to select a single column, we simply put the name of the column between the brackets.
Selecting a single row:
In order to select a single row using .loc[], we put a single row label in the .loc function. In order to select multiple rows, we put all the row labels in a list and pass that to the .loc function.
Selecting multiple rows and columns:
In order to select two rows and two columns, we put the two row labels we want in one list and the two column labels in another:
Dataframe.loc[["row1", "row2"], ["column1", "column2"]]
In order to select all the rows and some columns, the syntax looks like:
Dataframe.loc[:, ["column1", "column2"]]
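A short sketch on the frame indexed by Name (the row and column labels are assumptions):

df.loc['Avery']                                    # a single row
df.loc[['Avery', 'Parker']]                        # multiple rows
df.loc[['Avery', 'Parker'], ['Team', 'Salary']]    # two rows, two columns
df.loc[:, ['Team', 'Salary']]                      # all rows, some columns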
Pandas - groupby
A groupby operation involves some combination of splitting the object, applying a function,
and combining the results. This can be used to group large amounts of data and compute
operations on these groups.
Let’s load a csv file and convert it to a data frame to perform the group-by operations. Now that we have the data frame ready, let’s group the data based on ‘hlpi_name’.
Let’s print the values contained in any one group; for that, use one of the ‘hlpi_name’ values. We use the function get_group() to find the entries contained in any of the groups.
groupby based on more than one category
Use the groupby() function to form groups based on more than one category (i.e., use more than one column to perform the splitting). Let’s print the first entries of all the groups formed, using the first() function.
Operations on groups
After splitting the data into groups, we can also apply a function to each group to perform some operations. Here is a sample example to get the sum of values in particular groups; the sketch below also recaps get_group() and first().
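A short sketch (the column names 'year' and 'income' and the group label are assumptions about the dataset):

gb = df.groupby('hlpi_name')
gb.get_group('All households')               # all rows of one group (label assumed)
df.groupby(['hlpi_name', 'year']).first()    # first entry in each multi-key group
gb['income'].sum()                           # sum of a column within each group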
We can also find min, max, average, etc.
Merge/Join Datasets
Joining and merging DataFrames is the core process to start with data analysis and machine learning tasks. It is one of the skills every Data Analyst or Data Scientist should master, because in almost all cases data comes from multiple sources and files. You may need to bring all the data into one place by some sort of join logic before starting your analysis. Thankfully, you have the most popular Python library, pandas, to your rescue! Pandas provides various facilities for easily combining different datasets.
We can merge two data frames in pandas python by using the merge() function. The different
arguments to merge() allow you to perform natural join, left join, right join, and full outer
join in pandas.
Before performing join operations, let’s first load the two csv files and convert them into data frames df1 and df2.
Natural join
Natural join keeps only rows that match from the data frames (df1 and df2); specify the argument how='inner'.
Syntax:
pd.merge(df1, df2, on='column', how='inner')
This returns only the rows in which the left table has matching keys in the right table.
Full outer join
Full outer join keeps all rows from both data frames; specify how='outer'.
Syntax:
pd.merge(df1, df2, on='column', how='outer')
Left outer join
Left outer join includes all the rows of your data frame df1 and only those from df2 that match; specify how='left'.
Syntax:
pd.merge(df1, df2, on='column', how='left')
Right outer join
Return all rows from the df2 table, and any rows with matching keys from the df1 table; specify how='right'.
Syntax:
pd.merge(df1, df2, on='column', how='right')
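A minimal end-to-end sketch (the frame contents and the key column 'id' are assumptions):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [10, 20, 30]})

pd.merge(df1, df2, on='id', how='inner')    # ids 2, 3
pd.merge(df1, df2, on='id', how='outer')    # ids 1, 2, 3, 4
pd.merge(df1, df2, on='id', how='left')     # ids 1, 2, 3
pd.merge(df1, df2, on='id', how='right')    # ids 2, 3, 4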
UNIT – IV
Introduction to Matplotlib
Matplotlib is the most popular plotting library for Python, and it gives control over every aspect of a figure. It was designed to give the end user a feel similar to MATLAB’s graphical plotting. In the coming sections we will also learn about Seaborn, which is built on top of matplotlib. The official page of Matplotlib is https://matplotlib.org. You can use this page for official installation instructions and various documentation links. One of the most important sections on this page is the gallery, https://matplotlib.org/gallery.html, which shows all the kinds of plots/figures that matplotlib is capable of creating for you. You can select any one of those, and it takes you to an example page with the figure and very well documented code. Another important page is https://matplotlib.org/api/pyplot_summary.html, which documents the pyplot functions.
Matplotlib's architecture is composed of three main layers: the back-end layer; the artist layer, where much of the heavy lifting happens and which is usually the appropriate programming paradigm when writing a web application server, a UI application, or perhaps a script to be shared with other developers; and the scripting layer, which is the appropriate layer for everyday purposes and is considered a lighter scripting interface that simplifies common tasks and allows quick and easy generation of graphics and plots.
The back-end layer has three built-in abstract interface classes: FigureCanvas, which defines and encompasses the area on which the figure is drawn; Renderer, an instance of which knows how to draw on the figure canvas; and finally Event, which handles user inputs such as keyboard strokes and mouse clicks.
Artist layer: It is composed of one main object, which is the Artist. The Artist is the object
that knows how to take the Renderer and use it to put ink on the canvas. Everything you see
on a Matplotlib figure is an Artist instance. The title, the lines, the tick labels, the images, and
so on, all correspond to an individual Artist. There are two types of Artist objects. The first
type is the primitive type, such as a line, a rectangle, a circle, or text. And the second type is
the composite type, such as the figure or the axes. The top-level Matplotlib object that
contains and manages all of the elements in a given graphic is the figure Artist, and the most
important composite artist is the axes because it is where most of the Matplotlib API plotting
methods are defined, including methods to create and manipulate the ticks, the axis lines, the
grid or the plot background. Now it is important to note that each composite artist may
contain other composite artists as well as primitive artists. So, a figure artist for example
would contain an axis artist as well as a rectangle or text artists.
Scripting layer: it was developed for scientists who are not professional programmers. The artist layer is syntactically heavy, as it is meant for developers and not for individuals whose goal is to perform quick exploratory analysis of some data. Matplotlib's scripting layer is essentially the matplotlib.pyplot interface, which automates the process of defining a canvas, defining a figure artist instance, and connecting them.
A line plot is used to represent quantitative values over a continuous interval or time period. It is generally used to depict trends in how the data has changed over time.
In this sub-section, we will see how to use matplotlib to read a csv file and then generate a plot. We will use a Jupyter notebook. First, we do a basic example to showcase what a line plot is.
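A minimal sketch of a basic line plot (the data values are assumptions):

import matplotlib.pyplot as plt

years = [2010, 2011, 2012, 2013, 2014]
counts = [120, 150, 135, 180, 210]

plt.plot(years, counts)    # draw counts over time as a line
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('A basic line plot')
plt.show()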
Now let us do a small case study using what we just learned:
Use the tolist() method to get the index and columns as lists. View the dimensions of the dataframe using the .shape attribute.
After that, let us clean the data set by removing a few unnecessary columns.
The default index is numerical, but it is more convenient to index based on country names.
Column names as numbers could be confusing. For example, the year 1985 could be misunderstood as the 1985th column. To avoid ambiguity, let us convert the column names to strings and then use them to call the full range of years.
Case Study – let us now study the trend of number of immigrants from
Bangladesh to Australia.
Since there are two rows of data, let us sum the values of each column and take the first 20 years (to eliminate other years for which no values are present).
Now let us use an area plot to visualize cumulative immigration to Canada from the top 5 countries. We will use the same process to clean the data that we used in the previous section.
URL - https://s3-api.us-geo.objectstorage.softlayer.net/cf-coursesdata/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx
Now clean up the data using the same process as in the previous section:
Bar Chart
A bar plot is a way of representing data where the length of the bars represents the
magnitude/size of the feature/variable. Bar graphs usually represent numerical and
categorical variables grouped in intervals.
Let's compare the number of Icelandic immigrants (country = 'Iceland') to Canada from year
1980 to 2013.
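A short sketch, assuming a series df_iceland that holds Iceland's yearly counts, as built in this workflow (matplotlib.pyplot imported earlier as plt):

df_iceland.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Year')
plt.ylabel('Number of immigrants')
plt.title('Icelandic immigrants to Canada, 1980 - 2013')
plt.show()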
What is the frequency distribution of the number (population) of new immigrants from the various countries to Canada in 2013?
To answer this one would need to plot a histogram - it partitions the x-axis into bins, assigns
each data point in our dataset to a bin, and then counts the number of data points that have
been assigned to each bin. So, the y-axis is the frequency or the number of data points in each
bin. Note that we can change the bin size and usually one needs to tweak it so that the
distribution is displayed nicely.
By default, the histogram method breaks up the dataset into 10 bins, and the resulting figure summarizes the bin ranges and the frequency distribution of immigration in 2013.
Notice that the x-axis labels do not match the bin edges. This can be fixed by passing in an xticks keyword that contains the list of bin edges, as follows:
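A sketch, assuming df_can holds the immigration table with years as string column names:

import numpy as np

count, bin_edges = np.histogram(df_can['2013'])    # 10 bins by default

df_can['2013'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)
plt.xlabel('Number of immigrants')
plt.ylabel('Number of countries')
plt.show()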
Let's use a pie chart to explore the proportion (percentage) of new immigrants grouped by
continents for the entire time period from 1980 to 2013. We can continue to use the same
dataframe further.
Raw code :
df_continents['Total'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%',
                            startangle=90,
                            shadow=True,
                            labels=None  # turn off labels on pie chart
                            )
plt.show()
Box Plot
A box plot is a way of statistically representing the distribution of the data through five main dimensions: the minimum, the first quartile, the median, the third quartile, and the maximum.
The minimum number of immigrants is around 200 (min), maximum number is around 1300
(max), and median number of immigrants is around 900 (median).
25% of the years for period 1980 - 2013 had an annual immigrant count of ~500 or fewer
(First quartile).
75% of the years for period 1980 - 2013 had an annual immigrant count of ~1100 or fewer
(Third quartile).
We can view the actual numbers by calling the describe() method on the dataframe.
Scatter Plots
A scatter plot (2D) is a useful method of comparing variables against each other. Scatter plots
look similar to line plots in that they both map independent and dependent variables on a 2D
graph. While the datapoints are connected by a line in a line plot, they are not connected in a
scatter plot. The data in a scatter plot is considered to express a trend. With further analysis
using tools like regression, we can mathematically calculate this relationship and use it to
predict trends outside the dataset.
Using a scatter plot, let's visualize the trend of total immigration to Canada (all countries
combined) for the years 1980 - 2013.
Step 1: Get the equation of the line of best fit. We will use Numpy's polyfit() method, passing in the x values, the y values, and the degree of the polynomial (1 for a straight line), as sketched below.
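A sketch of the fit, assuming df_tot has numeric columns 'year' and 'total':

import numpy as np

x = df_tot['year']     # independent variable (assumed column name)
y = df_tot['total']    # dependent variable (assumed column name)

fit = np.polyfit(x, y, deg=1)    # fit[0] is the slope, fit[1] the intercept
plt.scatter(x, y)
plt.plot(x, fit[0] * x + fit[1], color='red')    # line of best fit
plt.show()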
Let us compare Argentina's immigration to that of its neighbor Brazil. Let's do that using a
bubble plot of immigration from Brazil and Argentina for the years 1980 - 2013. We will set
the weights for the bubble as the normalized value of the population for each year.
There are several methods of normalization in statistics, each with its own use. In this case, we will use feature scaling to bring all values into the range [0, 1]. The general formula is:
X' = (X - Xmin) / (Xmax - Xmin)
where X is an original value and X' is the normalized value. The formula sets the max value in the dataset to 1 and the min value to 0. The rest of the datapoints are scaled to values between 0 and 1 accordingly.
Raw Code :
# Brazil
ax0 = df_can_t.plot(kind='scatter',
x='Year',
y='Brazil',
figsize=(14, 8),
alpha=0.5, # transparency
color='green',
s=norm_brazil * 2000 + 10, # pass in weights
xlim=(1975, 2015)
)
# Argentina
ax1 = df_can_t.plot(kind='scatter',
x='Year',
y='Argentina',
alpha=0.5,
color="blue",
s=norm_argentina * 2000 + 10,
ax = ax0
)
ax0.set_ylabel('Number of Immigrants')
ax0.set_title('Immigration from Brazil and Argentina from 1980 - 2013')
ax0.legend(['Brazil', 'Argentina'], loc='upper left', fontsize='x-large')
The size of the bubble corresponds to the magnitude of immigrating population for that year,
compared to the 1980 - 2013 data. The larger the bubble, the more immigrants in that year.
Waffle Chart
The first step is computing the proportion of each category with respect to the total. The second step is defining the overall size of the waffle chart. The third step is using the proportion of each category to determine its respective number of tiles.
Raw Code :
# if the number of tiles populated for the current category
# is equal to its corresponding allocated tiles...
if tile_index > sum(tiles_per_category[0:category_index]):
    # ...proceed to the next category
    category_index += 1
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a
specific word appears in a source of textual data (such as a speech, blog post, or database),
the bigger and bolder it appears in the word cloud.
Much better! However, ‘said’ isn't really an informative word. So, let's add it to our stop words and re-generate the cloud.
Introduction to Seaborn
Seaborn is a statistical plotting library built on top of matplotlib. It has beautiful default styles and is compatible with pandas dataframe objects. In order to install it, use one of the following commands:
pip install seaborn
conda install seaborn
The official GitHub page is https://github.com/mwaskom/seaborn. It has information along with the link to the official documentation page, https://seaborn.pydata.org/. The examples subsection (https://seaborn.pydata.org/examples/index.html) shows the visualizations Seaborn is able to produce. The other important section to visit is the one with the API information: https://seaborn.pydata.org/api.html.
Let us now delve into some of the functionality Seaborn provides. We will run some snippets of code in a Jupyter notebook (although you can use any other IDE) to exhibit the key features.
Distribution Plots
We will first try to do distribution plots. To get started, we first import one of the standard
datasets that comes with Seaborn. The one we choose for our exercise is diamonds.csv. You
can pick other datasets from https://github.com/mwaskom/seaborn-data.
As seen above, we get a histogram and a Kernel Density Estimate (KDE) plot. We can customize it
further by removing KDE and specifying the number of bins.
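A sketch of loading the dataset and drawing the distribution; dmd is the name assumed for the dataframe, and in recent seaborn versions histplot replaces the older distplot:

import seaborn as sns
import matplotlib.pyplot as plt

dmd = sns.load_dataset('diamonds')       # fetches diamonds.csv from seaborn-data
sns.histplot(dmd['price'], kde=True)     # histogram plus a KDE curve
plt.show()

sns.histplot(dmd['price'], kde=False, bins=30)    # no KDE, explicit bin count
plt.show()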
The above diagram shows that as the carat approaches the value ‘5’, the price of the diamond also increases, which is a phenomenon we would expect. The ‘jointplot’ function can take multiple values for the parameter ‘kind’: scatter, reg, resid, kde, hex.
Let us see the plot using ‘reg’ which will provide regression and kernel density fits.
1. Pairplot function: will plot pairwise relationships across an entire dataframe such that each
numerical variable will be depicted in the y-axis across a single row and in the x-axis across a
single column. Try the command: sns.pairplot(dmd)
2. Rugplot function: It plots datapoints in an array as sticks on an axis. Try the command:
sns.rugplot(dmd['price'])
3. Once you are comfortable with these then you can try out KDE plotting. KDE plotting is used
for visualizing the probability density of a continuous variable or a single graph for multiple
samples.
Reference: https://seaborn.pydata.org/generated/seaborn.kdeplot.html
Categorical Plots:
Now let us discuss how to use seaborn to plot categorical data. But let us first understand what a categorical variable is. A categorical variable is one that has multiple categories but no intrinsic ordering specified for the categories. For example, the blood type of a person can be any one of A, B, AB, or O.
Barplot and countplot allow you to aggregate data with respect to each category. Barplot lets you aggregate using some function, with mean as the default. The difference between countplot and barplot is that countplot simply counts the number of occurrences.
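A short sketch on the diamonds frame (the column names come from that dataset):

sns.barplot(x='cut', y='price', data=dmd)    # mean price per cut, by default
plt.show()

sns.countplot(x='cut', data=dmd)             # number of rows per cut
plt.show()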
Boxplot shows the quartiles of the dataset, while the whiskers extend to encompass the rest of the distribution but leave out the points that are outliers.
Now let us delve into matrix plots. They help to segregate data into color-encoded matrices, which can further help in unsupervised learning methods like clustering.
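A sketch of a correlation heatmap (the numeric_only flag assumes a reasonably recent pandas):

corr = dmd.corr(numeric_only=True)            # correlations between numeric columns
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()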
In this final section, we will explore seaborn and see how easy it is to create regression lines and fits using this library. The lmplot function allows you to display linear models.
Folium is a powerful data visualization library in Python that was built primarily to help
people visualize geospatial data. With Folium, you can create a map of any location in the
world if you know its latitude and longitude values. You can also create a map and
superimpose markers as well as clusters of markers on top of the map for cool and very
interesting visualizations. You can also create maps of different styles, such as street-level maps and Stamen maps.
Folium is not available by default, so we first need to install it before we can import it. We can use the command: conda install -c conda-forge folium=0.5.0 --yes
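A minimal sketch of creating and saving a map (the coordinates for Canada are approximate):

import folium

world_map = folium.Map()    # default world map
canada_map = folium.Map(location=[56.130, -106.347], zoom_start=4)
canada_map.save('canada_map.html')    # open the saved HTML file in a browser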
Go ahead. Try zooming in and out of the rendered map above. You can customize this default
definition of the world map by specifying the center of your map and the initial zoom level.
All locations on a map are defined by their respective Latitude and Longitude values. So, you
can create a map and pass in a center of Latitude and Longitude values of [0, 0]. For a
defined center, you can also define the initial zoom level into that location when the map is
rendered. The higher the zoom level the more the map is zoomed into the center. Let's create
a map centered around Canada and play with the zoom level to see how it affects the
rendered map.
As you can see, the higher the zoom level the more the map is zoomed into the given center.
Stamen Toner Maps
These are high-contrast B+W (black and white) maps. They are perfect for data mashups and
exploring river meanders and coastal zones. Let's create a Stamen Toner map of Canada with
a zoom level of 4.
Stamen Terrain Maps
These are maps that feature hill shading and natural vegetation colors. They showcase
advanced labeling and linework generalization of dual-carriageway roads. Let's create a
Stamen Terrain map of Canada with zoom level 4.
Mapbox Bright Maps
These are maps that are quite like the default style, except that the borders are not visible with
a low zoom level. Furthermore, unlike the default style where country names are displayed in
each country's native language, Mapbox Bright style displays all country names in English.
Let's create a world map with this style.
Case Study
Now that you are familiar with folium, let us use it for our next case study which is as
mentioned below:
Case Study: An e-commerce company, “Deliver4U”, wants to get into logistics. It wants to know the pattern for maximum pickup calls from different areas of the city throughout the day. This will result in:
Solution:
Data set: Please download the following from the location specified by the trainer.
The dataset contains two separate data files – train_del.csv and test_del.csv. The difference is that train_del.csv contains an additional column, trip_duration, which we will not need for our present analysis.
a) Import libraries – Pandas and Folium. Drop the trip_duration column and combine the 2
different files as one dataframe.
Let us now visualize the rides data using the HeatMap() class from folium.plugins.
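A sketch under the stated assumptions (the pickup_latitude/pickup_longitude column names and the city center are assumptions about this dataset):

import folium
from folium.plugins import HeatMap

city_map = folium.Map(location=[28.61, 77.21], zoom_start=11)    # assumed city center
points = df[['pickup_latitude', 'pickup_longitude']].values.tolist()    # [lat, lon] pairs
HeatMap(points, radius=10).add_to(city_map)
city_map.save('pickup_heatmap.html')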
Conclusion
Throughout the city, pickups are most probable in the central areas, so it is better to set up a large number of pickup stops at these locations.
Therefore, by using maps we can highlight trends, uncover patterns, and derive insights from the data.