
elif statements:

In Python, we have one more conditional statement called the elif statement. An elif statement is used to check additional conditions when the preceding if condition is false. It is like an if-else statement; the only difference is that in else we do not check a condition, whereas in elif we do.

Elif statements are therefore similar to if-else statements, but they can evaluate multiple conditions.

Let's take an example to implement the elif statement: the if block executes if the if-condition is true, the elif block executes if the elif-condition is true, and the else block executes if both the if and elif conditions are false.
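A minimal sketch of this if/elif/else flow (the variable marks and its value are assumed for illustration):

marks = 72  # assumed sample value

if marks >= 80:
    print("Distinction")
elif marks >= 50:
    print("Pass")   # runs because the if condition was false and this one is true
else:
    print("Fail")   # runs only when both conditions above are false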

Nested if-else statements

Nested if-else statements mean that an if statement or if-else statement is present inside another if or if-else block. Python provides this feature as well, which in turn helps us check multiple conditions in a given program: an if statement can be present inside another if statement, which is itself inside another if statement, and so on.
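A small sketch of nesting (the variable num is assumed):

num = 15  # assumed sample value

if num > 0:
    # this inner if-else runs only when the outer condition is true
    if num % 2 == 0:
        print("positive and even")
    else:
        print("positive and odd")
else:
    print("not positive")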

Numpy and Pandas > Numpy overview - Creating and Accessing Numpy Arrays

What is NumPy?

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Difference between numpy arrays and lists

There are several important differences between NumPy arrays and the standard Python sequences. NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically).

The elements in a NumPy array are all required to be of the same data type, and thus will be
the same size in memory. The exception: one can have arrays of (Python, including NumPy)
objects, thereby allowing for arrays of different sized elements.

NumPy arrays facilitate advanced mathematical and other types of operations on large numbers
of data. Typically, such operations are executed more efficiently and with less code than is
possible using Python’s built-in sequences.

CREATING NUMPY 1D ARRAY

A "numpy" array or "ndarray" is similar to a list. It is usually fixed in size and each element is of the same type. We can cast a list to a numpy array by first importing numpy, or we can quickly create a numpy array with the arange function, which creates an array within the specified range.

To verify the dimensionality of this array, use the shape property.

In the example, since there is no value after the comma in (20,), this is a one-dimensional array.
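A minimal sketch of both creation approaches:

import numpy as np

a = np.array([1, 2, 3, 4, 5])   # cast a Python list to an ndarray
b = np.arange(20)               # integers 0..19

print(b.shape)                  # (20,) -> one-dimensional, 20 elements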

Accessing NUMPY 1D ARRAY

Accessing and slicing operations for a 1D array are the same as for a list. Index values start from 0 and go up to the length of the array minus one.
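For example, continuing with the array b from the sketch above:

print(b[0])     # first element -> 0
print(b[-1])    # last element  -> 19
print(b[2:5])   # slice of indexes 2, 3, 4 -> [2 3 4]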

Creating numpy 2D ARRAY

If you only use the arange function, it will output a one-dimensional array. To make it a two-dimensional array, chain its output with the reshape function.

In this example, it will first create the 15 integers and then convert them to a two-dimensional array with 3 rows and 5 columns.
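A sketch of that chained call:

import numpy as np

m = np.arange(15).reshape(3, 5)  # 15 integers reshaped into 3 rows x 5 columns
print(m.shape)                   # (3, 5)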

Accessing NUMPY 2D ARRAY

To access an element in a two-dimensional array, you need to specify an index for both the row
and the column.
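For example, with the 3x5 array m from above:

print(m[1, 2])   # element at row 1, column 2 -> 7
print(m[0])      # entire first row -> [0 1 2 3 4]
print(m[:, 1])   # entire second column -> [ 1  6 11]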

Introduction to Pandas
What is Pandas?

Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures. Pandas is the backbone of most data projects.

Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it. Python with pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.

We can import the library or a dependency like pandas using the "import pandas" command. We then have access to many pre-built classes and functions.

In order to work with the data in Python, we need to read the data file (csv, excel, dictionary, ...) into a pandas DataFrame. A DataFrame is a way to represent and work with tabular data. Tabular data has rows and columns, just like our csv file. In order to read in the data, we use the pandas.read_csv function. This function takes in a csv file and returns a DataFrame.

What is CSV?

CSV stands for comma-separated values. A CSV file is a delimited text file that uses a comma to separate values, and it stores tabular data in plain text: each line of the file is a data record, and each record consists of one or more fields, separated by commas.

How to read a CSV file using pandas

Once the pandas library is installed and imported, we can load a csv file using the built-in function read_csv. A csv is a typical file type used to store data.

We simply type the word pandas, then a dot and the name of the function with all the inputs. Typing pandas all the time may get tedious, so we can use the "as" statement to shorten the name of the library; in this case we use the standard abbreviation pd. Now we type pd and a dot followed by the name of the function we would like to use, in this case, read_csv.

We need to give the path of the csv file as an argument to the read_csv function; to read the path string correctly we can use 'r' as a prefix so it is treated as a raw string. The result is stored in the variable df, which is short for "dataframe." Now that we have the data in a dataframe, we can work with it. We can use the head method to preview the data frame, or pass the number of rows to be shown as an argument, like df.head(5) for 5 rows.
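A minimal sketch (the file path is assumed for illustration):

import pandas as pd

df = pd.read_csv(r'C:\data\homes.csv')   # assumed path; 'r' keeps backslashes literal
print(df.head(5))                        # preview the first 5 rows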

How to write a CSV file using pandas

If you want to write the dataframe to csv, we can simply use the to_csv function.

Syntax:
df.to_csv('EXPORT FILE PATH')

Example:
If you look at the data frame example above, we see two indexes: one loaded from the csv file, and an unnamed index generated by default by pandas while loading the csv. This problem can be avoided by making sure that writing the CSV file doesn't write the index, because the DataFrame will generate it anyway. We can do this by specifying the index=False parameter in the to_csv(...) function.

df.to_csv('EXPORT FILE PATH', index=False)

Descriptive statistics using pandas

There are many collective methods to compute descriptive statistics and other related operations on a pandas DataFrame.

Steps to Follow for Descriptive Statistics

These are the three steps we should perform to do statistical analysis on a pandas dataframe:
1. Collect the data
2. Create the data frame
3. Get the descriptive statistics for the pandas dataframe

Collect the data:

To do any statistical analysis, collecting the data is the first important task. You can store the collected data in csv, excel, or dictionary format. For this demo we store the home data in one csv file.

Create the data frame:

We need to create the data frame based on the data collected. Give the homes csv file path location. Once you run the above code you will get the DataFrame.

Get the Descriptive Statistics for Pandas DataFrame

Once you have your data frame ready, you can get the descriptive statistics. We can calculate the following statistics using the pandas package:

Mean
Total sum
Maximum
Minimum
Count
Median
Standard deviation
Variance

With the describe function you get the complete descriptive statistics in one call.

The syntax is: df.describe()

You can further break down the descriptive statistics into individual measures:
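A sketch of the individual measures, assuming a numeric column named 'price' in the homes data:

print(df['price'].mean())    # mean
print(df['price'].sum())     # total sum
print(df['price'].max())     # maximum
print(df['price'].min())     # minimum
print(df['price'].count())   # count of non-null values
print(df['price'].median())  # median
print(df['price'].std())     # standard deviation
print(df['price'].var())     # variance

print(df.describe())         # all of the core statistics at once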
Pandas working with text data and datetime columns

While working with data, it is not unusual to encounter time series data. Working with datetime columns can be quite a challenging task. Luckily, pandas is great at handling time series data, and provides a set of tools with which we can perform all the necessary tasks on date-time data.

Let's see how we can convert a dataframe column of strings (in dd/mm/yyyy format) to datetime format. We cannot perform any time series-based operations on the dates if they are not in the right format; to be able to work with them, we are required to convert the dates into the datetime format.

Convert a Pandas dataframe column from string to datetime format

For any operation we first need to create the data frame based on the data collected; we can load the data from a csv file, an excel file, or any other source. Let us use a csv file for our demo. Follow the below lines of code to load the data and convert it to a data frame.

Once the data frame is ready, use df.info() to get complete information about the dataframe.

As we can see in the output, the data type of the 'DateTime' column is object, i.e. string. Now we will convert it to datetime format using the pd.to_datetime() function.
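A minimal sketch (the 'DateTime' column name follows the example above; the file path is assumed):

import pandas as pd

df = pd.read_csv(r'C:\data\timeseries.csv')   # assumed path
df.info()                                     # 'DateTime' shows as object (string)

df['DateTime'] = pd.to_datetime(df['DateTime'], dayfirst=True)  # dd/mm/yyyy strings
df.info()                                     # 'DateTime' is now datetime64[ns]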

After applying the pd.to_datetime() function to the DateTime column, we can see in the output that the format of the 'DateTime' column has been changed to the datetime format.
How to change the index of the dataframe

Most operations related to DateTime require the DateTime column as the primary index, or else they will throw an error.

We can change the index with the set_index() function. It takes two parameters: one is the column name you want to set as the index, and the other is inplace. When inplace=True is passed, the data is modified in place; when inplace=False is passed (this is the default value, so it isn't necessary), it performs the operation and returns a copy of the object.
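For example:

df.set_index('DateTime', inplace=True)  # make the DateTime column the index
print(df.index)                         # DatetimeIndex([...], name='DateTime')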

Now DateTime is the index of the dataframe, and we can perform DateTime operations very easily.

Data Frame Filtering based on index

How to filter data based on a particular year:

To check all the values that occurred in a particular year, say 2018, run the command df['2018'].

The above result gives all the values recorded in the year 2018.

How to filter data based on year and month

To view all observations that occurred in June 2018, we can select with a year-and-month string, as shown in the sketch below.

Similarly, if you want to view the observations after a particular year, month, and date, we can slice from that date onward.

If you need the observations between two dates, we slice with both endpoints. For example, if you need the observations between May 3rd and May 4th of 2018, the command is: df['5/3/2018':'5/4/2018']
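A consolidated sketch, assuming the DateTime index from the previous section (df.loc is the more robust spelling in recent pandas versions):

print(df.loc['2018'])                   # everything recorded in 2018
print(df.loc['6/2018'])                 # everything recorded in June 2018
print(df.loc['5/3/2018':])              # everything on or after May 3rd, 2018
print(df.loc['5/3/2018':'5/4/2018'])    # everything between May 3rd and May 4th, 2018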
Pandas Indexing and Selecting Data
What is Indexing?
Indexing in pandas simply means selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all the columns, or some of each of the rows and columns. Indexing is also known as subset selection.

Let's load a csv file and convert it to a data frame to perform the indexing and selection operations.

Once the data is loaded into the data frame, let's make Name the index of this data frame.

Note: the index acts as a primary key, so it should not contain duplicates.

Selecting a single column

In order to select a single column, we simply put the name of the column between brackets.

Selecting multiple columns

To select multiple columns, we must pass a list of columns to the indexing operator.

Selecting a single row:

In order to select a single row using .loc[], we put a single row label in the .loc function.

Selecting multiple rows:

In order to select multiple rows, we put all the row labels in a list and pass that to the .loc function.

Selecting multiple rows and columns:

In order to select two rows and two columns, we put the two row labels and the two column labels in separate lists:

Dataframe.loc[["row1", "row2"], ["column1", "column2"]]

In order to select all the rows and some columns, the syntax looks like:

Dataframe.loc[:, ["column1", "column2"]]
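A short sketch (the column names 'Age', 'City' and the row labels 'Avery', 'Blake' are assumed for illustration):

df['Age']                                    # single column
df[['Age', 'City']]                          # multiple columns (note the list)
df.loc['Avery']                              # single row by its Name label
df.loc[['Avery', 'Blake']]                   # multiple rows
df.loc[['Avery', 'Blake'], ['Age', 'City']]  # two rows, two columns
df.loc[:, ['Age', 'City']]                   # all rows, some columns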

Pandas - groupby

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Let's load a csv file and convert it to a data frame to perform the group-by operations.

groupby function based on a single category

Now that we have the data frame ready, let's group the data based on 'hlpi_name'.

Once the groupby operation is done, we get the result as a GroupBy object.

Let's print the values contained in any one of the groups. For that, use the name of the group within 'hlpi_name'. We use the get_group() function to find the entries contained in any of the groups.
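A sketch of the single-category grouping (the file name and the group label are assumed; the 'hlpi_name' column follows the example above):

import pandas as pd

df = pd.read_csv('hlpi.csv')                   # assumed file
groups = df.groupby('hlpi_name')               # returns a GroupBy object
print(groups.get_group('All households'))      # assumed group label for illustration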

groupby function based on more than one category

Use the groupby() function to form groups based on more than one category (i.e. use more than one column to perform the splitting).

We get the result as a GroupBy object.

Let's print the first entries in all the groups formed, using the first() function.

Operations on groups

After splitting the data into groups, we can apply a function to each group to perform some operations. Here is a sample example to get the sum of the values in particular groups.
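For example (the numeric column 'income' is assumed for illustration):

print(groups['income'].sum())   # per-group total of the assumed column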

We can also find the min, max, average, etc.
Merge/Join Datasets
Joining and merging DataFrames is a core process for starting data analysis and machine learning tasks. It is one of the toolkits every Data Analyst or Data Scientist should master, because in almost all cases data comes from multiple sources and files. You may need to bring all the data into one place by some sort of join logic and then start your analysis. Thankfully you have the most popular library in Python, pandas, to your rescue! Pandas provides various facilities for easily combining different datasets.

We can merge two data frames in pandas by using the merge() function. The different arguments to merge() allow you to perform natural (inner) joins, left joins, right joins, and full outer joins in pandas.

Understanding the different types of merge:

Before you perform join operations, let's first load the two csv files and convert them into data frames df1 and df2.

Natural join

A natural join keeps only the rows that match in both data frames (df1 and df2); specify the argument how='inner'.
Syntax:
pd.merge(df1, df2, on='column', how='inner')
Returns only the rows in which the left table has matching keys in the right table.

Full outer join

A full outer join keeps all rows from both data frames; specify how='outer'.

Syntax:

pd.merge(df1, df2, on='column', how='outer')

Returns all rows from both tables, joining records from the left that have matching keys in the right table.

Left outer join

A left outer join includes all the rows of your data frame df1 and only those from df2 that match; specify how='left'.

Syntax:

pd.merge(df1, df2, on='column', how='left')

Returns all rows from the left table, and any rows with matching keys from the right table.

Right outer join

A right outer join returns all rows from the df2 table, and any rows with matching keys from the df1 table; specify how='right'.

Syntax:

pd.merge(df1, df2, on='column', how='right')

Returns all rows from the right table, and any rows with matching keys from the left table.
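A compact sketch of all four joins (the frames and the key column 'id' are assumed):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [10, 20, 30]})

print(pd.merge(df1, df2, on='id', how='inner'))  # ids 2, 3 only
print(pd.merge(df1, df2, on='id', how='outer'))  # ids 1, 2, 3, 4
print(pd.merge(df1, df2, on='id', how='left'))   # ids 1, 2, 3
print(pd.merge(df1, df2, on='id', how='right'))  # ids 2, 3, 4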

UNIT – IV

Introduction to Matplotlib
Matplotlib is the most popular plotting library for Python and gives control over every aspect of a figure. It was designed to give the end user a feel similar to MATLAB's graphical plotting. In the coming sections we will also learn about Seaborn, which is built on top of matplotlib. The official page of Matplotlib is https://matplotlib.org. You can use this page for official installation instructions and various documentation links. One of the most important sections on this page is the gallery - https://matplotlib.org/gallery.html - it shows all the kinds of plots/figures that matplotlib is capable of creating for you. You can select any one of those, and it takes you to the example page with the figure and very well documented code. Another important page is https://matplotlib.org/api/pyplot_summary.html, which has the documentation of the pyplot functions.

Matplotlib's architecture is composed of three main layers: the back-end layer; the artist layer, where much of the heavy lifting happens and which is usually the appropriate programming paradigm when writing a web application server, a UI application, or a script to be shared with other developers; and the scripting layer, which is the appropriate layer for everyday purposes and is considered a lighter scripting interface that simplifies common tasks and allows quick and easy generation of graphics and plots.

Now let's go into each layer in a little more detail:

The back-end layer has three built-in abstract interface classes: FigureCanvas, which defines and encompasses the area on which the figure is drawn; Renderer, an instance of which knows how to draw on the figure canvas; and finally Event, which handles user inputs such as keyboard strokes and mouse clicks.

Artist layer: It is composed of one main object, which is the Artist. The Artist is the object
that knows how to take the Renderer and use it to put ink on the canvas. Everything you see
on a Matplotlib figure is an Artist instance. The title, the lines, the tick labels, the images, and
so on, all correspond to an individual Artist. There are two types of Artist objects. The first
type is the primitive type, such as a line, a rectangle, a circle, or text. And the second type is
the composite type, such as the figure or the axes. The top-level Matplotlib object that
contains and manages all of the elements in a given graphic is the figure Artist, and the most
important composite artist is the axes because it is where most of the Matplotlib API plotting
methods are defined, including methods to create and manipulate the ticks, the axis lines, the
grid or the plot background. Now it is important to note that each composite artist may
contain other composite artists as well as primitive artists. So, a figure artist for example
would contain an axis artist as well as a rectangle or text artists.

Scripting layer: it was developed for scientists who are not professional programmers. The artist layer is syntactically heavy, as it is meant for developers rather than for individuals whose goal is to perform quick exploratory analysis of some data. Matplotlib's scripting layer is essentially the matplotlib.pyplot interface, which automates the process of defining a canvas, defining a figure artist instance, and connecting them.
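A tiny sketch contrasting the two layers; both produce the same line chart:

# Artist layer: explicit canvas + figure objects
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure

fig = Figure()
canvas = FigureCanvas(fig)      # back-end canvas the figure draws on
ax = fig.add_subplot(111)       # the axes: the key composite artist
ax.plot([1, 2, 3], [4, 1, 9])
fig.savefig('artist_layer.png')

# Scripting layer: pyplot manages the canvas and figure for us
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 1, 9])
plt.savefig('pyplot_layer.png')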

Read a CSV and Generate a Line Plot with Matplotlib

A line plot is used to represent quantitative values over a continuous interval or time period. It is generally used to depict trends in how the data has changed over time.

In this sub-section, we will see how to read a csv file and then generate a plot with matplotlib. We will use a Jupyter notebook. First, we do a basic example to showcase what a line plot is.
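A minimal line plot sketch (the values are assumed sample data):

import matplotlib.pyplot as plt

years = [2010, 2011, 2012, 2013, 2014]
values = [12, 15, 11, 19, 24]

plt.plot(years, values)          # points joined by a line show the trend
plt.xlabel('Year')
plt.ylabel('Value')
plt.title('A basic line plot')
plt.show()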

Now let us do a small case study using what we just learned:

Download the dataset from the link:

https://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.asp

The data set has all the country immigration information. We will use the one for Australia for our case study.

Use the tolist() method to get the index and columns as lists. View the dimensions of the dataframe using the .shape attribute.

After that, let us clean the data set to remove a few unnecessary columns.

Let us rename the column names so that they make more sense.

The default index is numerical, but it is more convenient to index based on country names.

Remove the name of the index.
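A sketch of these cleaning steps, assuming the Australia sheet has been read into df_aus (all column names here are assumed for illustration):

print(df_aus.columns.tolist())   # columns as a list
print(df_aus.index.tolist())     # index as a list
print(df_aus.shape)              # (rows, columns)

df_aus.drop(['Type', 'Coverage'], axis=1, inplace=True)     # drop unneeded columns
df_aus.rename(columns={'OdName': 'Country'}, inplace=True)  # clearer names
df_aus.set_index('Country', inplace=True)                   # index by country name
df_aus.index.name = None                                    # remove the index name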

Let us now test it by pulling the data for Bangladesh.

Column names as numbers could be confusing. For example, the year 1985 could be misunderstood as the 1985th column. To avoid ambiguity, let us convert the column names to strings and then use that to call the full range of years.

We can also pass multiple criteria in the same line.


Let us review the changes we have made to our dataframes.

Case Study – let us now study the trend of the number of immigrants from Bangladesh to Australia.

Since there are two rows of data, let us sum the values of each column and take the first 20 years (to eliminate other years for which no values are present).



Next, we can plot by using the plot function. The x-axis is automatically plotted with the index values and the y-axis with the column values.
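A sketch of this step (the dataframe name follows the cleaning sketch above; the year range is assumed):

import matplotlib.pyplot as plt

years = list(map(str, range(1980, 2000)))    # first 20 years as strings
bd = df_aus.loc['Bangladesh', years].sum()   # sum the two rows per year

bd.plot(kind='line')                         # index -> x-axis, values -> y-axis
plt.xlabel('Year')
plt.ylabel('Number of immigrants')
plt.title('Immigration from Bangladesh to Australia')
plt.show()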

Basic Plots using Matplotlib

In the previous module we used a line plot to see immigration from Bangladesh to Australia. Now let us try different types of basic plots using matplotlib.



Area plot

Now let us use area plots to visualize cumulative immigration from the top 5 countries to Canada. We will use the same process to clean the data that we used in the previous section.

URL - https://s3-api.us-geo.objectstorage.softlayer.net/cf-coursesdata/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx

Now clean up the data using the same process as the one in the previous section:

The unstacked plot has a default transparency (alpha value) of 0.5. We can modify this value by passing in the alpha parameter.
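A sketch of the area plot, assuming df_top5 holds the yearly totals for the top 5 countries (transposed so years form the index):

import matplotlib.pyplot as plt

df_top5.plot(kind='area',
             stacked=False,   # unstacked variant
             alpha=0.35,      # override the default transparency of 0.5
             figsize=(14, 8))
plt.title('Immigration trend of top 5 countries')
plt.xlabel('Years')
plt.ylabel('Number of immigrants')
plt.show()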

Bar Chart

A bar plot is a way of representing data where the length of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals.

Let's compare the number of Icelandic immigrants (country = 'Iceland') to Canada from the year 1980 to 2013.
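A sketch (df_iceland is an assumed series of yearly counts for Iceland extracted from the cleaned dataframe):

import matplotlib.pyplot as plt

df_iceland.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Year')
plt.ylabel('Number of immigrants')
plt.title('Icelandic immigrants to Canada from 1980 to 2013')
plt.show()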



Histogram

How could you visualize the answer to the following question?

What is the frequency distribution of the number (population) of new immigrants from the various countries to Canada in 2013?

To answer this, one would need to plot a histogram: it partitions the x-axis into bins, assigns each data point in our dataset to a bin, and then counts the number of data points that have been assigned to each bin. So, the y-axis is the frequency, or the number of data points in each bin. Note that we can change the bin size; usually one needs to tweak it so that the distribution is displayed nicely.

By default, the histogram method breaks up the dataset into 10 bins. The figure below summarizes the bin ranges and the frequency distribution of immigration in 2013. We can see that in 2013:

178 countries contributed between 0 and 3412.9 immigrants

11 countries contributed between 3412.9 and 6825.8 immigrants

1 country contributed between 6825.8 and 10238.7 immigrants, and so on.



In the above plot, the x-axis represents the population range of immigrants in intervals of 3412.9. The y-axis represents the number of countries that contributed to the population.

Notice that the x-axis labels do not match the bin edges. This can be fixed by passing in an xticks keyword that contains the list of the bin edges, as follows:
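A sketch, assuming df_2013 is the 2013 immigration column of the cleaned dataframe:

import numpy as np
import matplotlib.pyplot as plt

count, bin_edges = np.histogram(df_2013)   # 10 bins by default; returns the edges

df_2013.plot(kind='hist',
             figsize=(8, 5),
             xticks=bin_edges)             # align x labels with the bin edges
plt.title('Histogram of immigration in 2013')
plt.ylabel('Number of countries')
plt.xlabel('Number of immigrants')
plt.show()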



Specialized Visualization Tools using Matplotlib
Pie Charts
A pie chart is a circular graphic that displays numeric proportions by dividing a circle (or pie) into proportional slices. You are most likely already familiar with pie charts, as they are widely used in business and media. We can create pie charts in Matplotlib by passing in the kind='pie' keyword.

Let's use a pie chart to explore the proportion (percentage) of new immigrants grouped by continent for the entire time period from 1980 to 2013. We can continue to use the same dataframe.



The above visual is not very clear; the numbers and text overlap in some instances. Let's make a few modifications to improve the visuals:

Raw code:

colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']
explode_list = [0.1, 0, 0, 0, 0.1, 0.1]  # ratio for each continent with which to offset each wedge

df_continents['Total'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%',
                            startangle=90,
                            shadow=True,
                            labels=None,           # turn off labels on pie chart
                            pctdistance=1.12,      # ratio between the center of each slice and the start of the text generated by autopct
                            colors=colors_list,    # add custom colors
                            explode=explode_list)  # 'explode' the lowest 3 continents

# scale the title up by 12% to match pctdistance
plt.title('Immigration to Canada by Continent [1980 - 2013]', y=1.12)
plt.axis('equal')

# add legend
plt.legend(labels=df_continents.index, loc='upper left')

plt.show()

Box Plot

A box plot is a way of statistically representing the distribution of the data through five main dimensions:

Minimum: smallest number in the dataset.
First quartile: middle number between the minimum and the median.
Second quartile (median): middle number of the (sorted) dataset.
Third quartile: middle number between the median and the maximum.
Maximum: highest number in the dataset.

We can immediately make a few key observations from the plot above:

The minimum number of immigrants is around 200 (min), the maximum number is around 1300 (max), and the median number of immigrants is around 900 (median).

25% of the years in the period 1980 - 2013 had an annual immigrant count of ~500 or fewer (first quartile).

75% of the years in the period 1980 - 2013 had an annual immigrant count of ~1100 or fewer (third quartile).

We can view the actual numbers by calling the describe() method on the dataframe.
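A sketch of the box plot itself (df_country is an assumed series of yearly immigration counts like the one described):

import matplotlib.pyplot as plt

df_country.plot(kind='box', figsize=(8, 6))
plt.title('Box plot of immigrants from 1980 - 2013')
plt.ylabel('Number of immigrants')
plt.show()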

Scatter Plots

A scatter plot (2D) is a useful method of comparing variables against each other. Scatter plots
look similar to line plots in that they both map independent and dependent variables on a 2D
graph. While the datapoints are connected by a line in a line plot, they are not connected in a
scatter plot. The data in a scatter plot is considered to express a trend. With further analysis
using tools like regression, we can mathematically calculate this relationship and use it to
predict trends outside the dataset.

Using a scatter plot, let's visualize the trend of total immigration to Canada (all countries
combined) for the years 1980 - 2013.

So, let's try to plot a linear line of best fit, and use it to predict the number of immigrants in 2015.

Step 1: Get the equation of the line of best fit. We will use Numpy's polyfit() method by passing in the following:

x: x-coordinates of the data.
y: y-coordinates of the data.
deg: degree of the fitting polynomial. 1 = linear, 2 = quadratic, and so on.

Step 2: Plot the regression line on the scatter plot.

'No. Immigrants = 5567 * Year + -10926195'
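A sketch of both steps, assuming df_tot has 'year' and 'total' columns for total immigration per year:

import numpy as np
import matplotlib.pyplot as plt

x = df_tot['year']
y = df_tot['total']

fit = np.polyfit(x, y, deg=1)                  # [slope, intercept] of the best-fit line

plt.scatter(x, y)                              # the scatter plot itself
plt.plot(x, fit[0] * x + fit[1], color='red')  # regression line on top
plt.xlabel('Year')
plt.ylabel('Total Immigration')
plt.show()

print('No. Immigrants = {0:.0f} * Year + {1:.0f}'.format(fit[0], fit[1]))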



Bubble Plots
A bubble plot is a variation of the scatter plot that displays three dimensions of data (x, y, z). The datapoints are replaced with bubbles, and the size of each bubble is determined by the third variable 'z', also known as the weight. In matplotlib, we can pass an array or scalar containing the weight of each point to the keyword s of plot().

Let us compare Argentina's immigration to that of its neighbor Brazil. Let's do that using a bubble plot of immigration from Brazil and Argentina for the years 1980 - 2013. We will set the weights for the bubbles as the normalized value of the population for each year.

Create the normalized weights

There are several methods of normalization in statistics, each with its own use. In this case, we will use feature scaling to bring all values into the range [0, 1]. The general formula is:

X' = (X - X_min) / (X_max - X_min)

where X is an original value and X' is the normalized value. The formula sets the max value in the dataset to 1 and the min value to 0. The rest of the datapoints are scaled to values between 0 and 1 accordingly.



Raw code:

# normalize Brazil data
norm_brazil = (df_can_t['Brazil'] - df_can_t['Brazil'].min()) / (df_can_t['Brazil'].max() - df_can_t['Brazil'].min())

# normalize Argentina data
norm_argentina = (df_can_t['Argentina'] - df_can_t['Argentina'].min()) / (df_can_t['Argentina'].max() - df_can_t['Argentina'].min())

Raw code:

# Brazil
ax0 = df_can_t.plot(kind='scatter',
                    x='Year',
                    y='Brazil',
                    figsize=(14, 8),
                    alpha=0.5,                  # transparency
                    color='green',
                    s=norm_brazil * 2000 + 10,  # pass in weights
                    xlim=(1975, 2015))

# Argentina
ax1 = df_can_t.plot(kind='scatter',
                    x='Year',
                    y='Argentina',
                    alpha=0.5,
                    color='blue',
                    s=norm_argentina * 2000 + 10,
                    ax=ax0)                     # draw on the same axes as Brazil

ax0.set_ylabel('Number of Immigrants')
ax0.set_title('Immigration from Brazil and Argentina from 1980 - 2013')
ax0.legend(['Brazil', 'Argentina'], loc='upper left', fontsize='x-large')

The size of the bubble corresponds to the magnitude of immigrating population for that year,
compared to the 1980 - 2013 data. The larger the bubble, the more immigrants in that year.

Waffle Chart

A waffle chart is an interesting visualization that is normally created to display progress toward goals. It is commonly an effective option when you are trying to add interesting visualization features to a visual that consists mainly of cells, such as an Excel dashboard.

The first step in creating a waffle chart is determining the proportion of each category with respect to the total.

The second step is defining the overall size of the waffle chart.

The third step is using the proportion of each category to determine its respective number of tiles.

The fourth step is creating a matrix that resembles the waffle chart and populating it.

Raw code:

# initialize the waffle chart as an empty matrix
waffle_chart = np.zeros((height, width))

# define indices to loop through the waffle chart
category_index = 0
tile_index = 0

# populate the waffle chart
for col in range(width):
    for row in range(height):
        tile_index += 1

        # if the number of tiles populated for the current category
        # is equal to its corresponding allocated tiles...
        if tile_index > sum(tiles_per_category[0:category_index]):
            # ...proceed to the next category
            category_index += 1

        # set the class value to an integer, which increases with class
        waffle_chart[row, col] = category_index

print('Waffle chart populated!')

Map the waffle chart matrix into a visual.



Prettify the chart.



Word Clouds

Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a
specific word appears in a source of textual data (such as a speech, blog post, or database),
the bigger and bolder it appears in the word cloud.
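A sketch of generating such a cloud, assuming the wordcloud package is installed and the novel's text is available as alice.txt:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = open('alice.txt', encoding='utf-8').read()   # assumed file

wc = WordCloud(background_color='white',
               max_words=2000,
               stopwords=set(STOPWORDS))
wc.generate(text)

plt.figure(figsize=(14, 18))              # resize so smaller words are readable
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()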



Interesting! So, in the first 2000 words of the novel, the most common words are Alice, said, little, Queen, and so on. Let's resize the cloud so that we can see the less frequent words a little better.

Much better! However, "said" isn't really an informative word, so let's add it to our stop words and re-generate the cloud.

UNIT - V

Introduction to Seaborn

Seaborn is a statistical plotting library built on top of matplotlib. It has beautiful default styles and is compatible with pandas dataframe objects. In order to install it, use the following commands:

Anaconda users: conda install seaborn

Python users: pip install seaborn

Seaborn's code is open source, so one can read it at https://github.com/mwaskom/seaborn. This page has information along with the link to the official documentation page - https://seaborn.pydata.org/. The examples subsection of this page (https://seaborn.pydata.org/examples/index.html) shows the visualizations Seaborn is able to produce. The other important section to visit is the API documentation - https://seaborn.pydata.org/api.html.

Seaborn functionalities and usage

Let us now delve into some of the functionalities Seaborn provides. We will run some code snippets in a Jupyter notebook (although you can use any other IDE) to exhibit the key features.

Distribution Plots

We will first try distribution plots. To get started, we import one of the standard datasets that come with Seaborn. The one we choose for our exercise is diamonds.csv. You can pick other datasets from https://github.com/mwaskom/seaborn-data.



We then use the displot function to plot the distribution of a single variable.

As seen above, we get a histogram and a Kernel Density Estimate (KDE) plot. We can customize it further by removing the KDE and specifying the number of bins.
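A sketch of both variants, assuming the diamonds dataset has been loaded as dmd:

import seaborn as sns
import matplotlib.pyplot as plt

dmd = sns.load_dataset('diamonds')      # fetches diamonds.csv from seaborn-data

sns.displot(dmd, x='price', kde=True)   # histogram with a KDE overlay
sns.displot(dmd, x='price', bins=30)    # histogram only, custom bin count
plt.show()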



Joint plot allows us to plot the relationship represented by bivariate data.

The above diagram shows that as the carat value approaches 5, the price of the diamond also increases, which is a phenomenon we also observe in practice. The jointplot function can take multiple values for the parameter 'kind': scatter, reg, resid, kde, hex.
Let us see the plot using 'reg', which will provide regression and kernel density fits.
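For example:

import seaborn as sns
import matplotlib.pyplot as plt

sns.jointplot(data=dmd, x='carat', y='price')              # default scatter
sns.jointplot(data=dmd, x='carat', y='price', kind='reg')  # regression fit added
plt.show()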



Next steps for students to try:

1. Pairplot function: plots pairwise relationships across an entire dataframe, such that each numerical variable is depicted on the y-axis across a single row and on the x-axis across a single column. Try the command: sns.pairplot(dmd)

2. Rugplot function: plots datapoints in an array as sticks on an axis. Try the command: sns.rugplot(dmd['price'])

3. Once you are comfortable with these, you can try out KDE plotting. KDE plotting is used for visualizing the probability density of a continuous variable, on a single graph for multiple samples.
Reference: https://seaborn.pydata.org/generated/seaborn.kdeplot.html

Categorical Plots:

Now let us discuss how to use seaborn to plot categorical data. But let us first understand what a categorical variable is. A categorical variable is one that has multiple categories but no intrinsic ordering specified for the categories. For example, the blood type of a person can be any one of A, B, AB or O.

Now let us see examples of the plots:

Barplot and countplot allow you to aggregate data with respect to each category. Barplot allows you to aggregate using some function, with the mean as the default.

The difference between countplot and barplot is that countplot explicitly counts the number of occurrences.
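For example, using the diamonds data again (the column names follow that dataset):

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(data=dmd, x='cut')            # number of diamonds per cut
sns.barplot(data=dmd, x='cut', y='price')   # mean price per cut (default estimator)
plt.show()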

Boxplot shows the quartiles of the dataset, while the whiskers extend to encompass the rest of the distribution, leaving out the points that are outliers.

Violinplot shows the distribution of data across several levels of categorical variable(s), thus helping in comparison of the distributions. Wherever actual datapoints are not present, KDE is used to estimate the remaining points.

The stripplot draws a scatterplot where one variable is categorical.
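For example:

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(data=dmd, x='cut', y='price')     # quartiles and outliers per cut
sns.violinplot(data=dmd, x='cut', y='price')  # full distribution shape per cut
sns.stripplot(data=dmd, x='cut', y='price')   # raw points along the category axis
plt.show()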



Matrix Plots:

Now let us delve into matrix plots. They help to segregate data into color-encoded matrices, which can further help in unsupervised learning methods like clustering.

The corr() function gives the matrix form of the correlation data.
The below command generates the heatmap.
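A sketch of the correlation heatmap (numeric_only restricts the correlation to numeric columns in recent pandas versions):

import seaborn as sns
import matplotlib.pyplot as plt

corr = dmd.corr(numeric_only=True)   # correlation matrix of the numeric columns
sns.heatmap(corr, annot=True)        # color-encoded matrix with values printed
plt.show()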



Let us now try the pivot_table formation. We need to select the appropriate data for that. Among the datasets available in seaborn, the flights data is the most suitable to depict this. Let us try to depict the total number of passengers for each month of the year.
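A sketch with the flights dataset:

import seaborn as sns
import matplotlib.pyplot as plt

flights = sns.load_dataset('flights')
pivot = flights.pivot_table(index='month', columns='year', values='passengers')

sns.heatmap(pivot)   # months vs years, colored by passenger count
plt.show()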



The cluster map uses hierarchical clustering. It no longer depicts months and years in order, but groups them by similarity in the passenger count. So, it can be inferred that April and May are similar in passenger volume.
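For example, with the pivot table from the previous sketch:

import seaborn as sns
import matplotlib.pyplot as plt

sns.clustermap(pivot)   # reorders rows/columns by similarity
plt.show()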



Regression plot:

In this final section, we will explore seaborn and see how efficient it is to create regression lines and fits using this library. The lmplot function allows you to display linear models.
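For example, with the diamonds data:

import seaborn as sns
import matplotlib.pyplot as plt

sns.lmplot(data=dmd, x='carat', y='price')   # scatter plus a fitted regression line
plt.show()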

Spatial Visualizations and Analysis in Python with Folium

Folium is a powerful data visualization library in Python that was built primarily to help people visualize geospatial data. With Folium, you can create a map of any location in the world if you know its latitude and longitude values. You can also create a map and superimpose markers, as well as clusters of markers, on top of the map for cool and very interesting visualizations. You can also create maps of different styles, such as street-level maps and Stamen maps.

Folium is not available by default, so we first need to install it before we can import it. We can use the command: conda install -c conda-forge folium=0.5.0 --yes



It is not available via the default conda channel. Try using the conda-forge channel to install folium as shown: conda install -c conda-forge folium

Generating the world map is straightforward in Folium. You simply create a Folium Map object and then display it. What is attractive about Folium maps is that they are interactive, so you can zoom into any region of interest despite the initial zoom level.

Go ahead, try zooming in and out of the rendered map above. You can customize this default definition of the world map by specifying the center of your map and the initial zoom level. All locations on a map are defined by their respective latitude and longitude values. So, you can create a map and pass in a center of latitude and longitude values of [0, 0]. For a defined center, you can also define the initial zoom level into that location when the map is rendered. The higher the zoom level, the more the map is zoomed into the center. Let's create a map centered around Canada and play with the zoom level to see how it affects the rendered map.

Let's create the map again with a higher zoom level.

As you can see, the higher the zoom level, the more the map is zoomed into the given center.
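A minimal sketch (the coordinates approximate the center of Canada):

import folium

world_map = folium.Map()                              # default world map

canada_map = folium.Map(location=[56.130, -106.346],  # [latitude, longitude]
                        zoom_start=4)                 # initial zoom level
canada_map                                            # render in a notebook cell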
Stamen Toner Maps

These are high-contrast B+W (black and white) maps. They are perfect for data mashups and
exploring river meanders and coastal zones. Let's create a Stamen Toner map of Canada with
a zoom level of 4.

Stamen Terrain Maps

These are maps that feature hill shading and natural vegetation colors. They showcase
advanced labeling and linework generalization of dual-carriageway roads. Let's create a
Stamen Terrain map of Canada with zoom level 4.

Mapbox Bright Maps

These are maps that are quite like the default style, except that the borders are not visible with
a low zoom level. Furthermore, unlike the default style where country names are displayed in
each country's native language, Mapbox Bright style displays all country names in English.
Let's create a world map with this style.

Case Study

Now that you are familiar with folium, let us use it for our next case study, which is as described below:

Case Study: An e-commerce company, "Deliver4U", wants to get into logistics. It wants to know the pattern of maximum pickup calls from different areas of the city throughout the day. This will allow it to:

i) Build the optimum number of stations where its pickup delivery personnel will be located.
ii) Ensure pickup personnel reach the pickup location at the earliest possible time.

For this, the company uses its existing customer data in Delhi to find the highest density of probable pickup locations in the future.

Solution:

Pre-requisites: Python, Jupyter Notebooks, Pandas

Data set: Please download the following from the location specified by the trainer.
The dataset contains two separate data files - train_del.csv and test_del.csv. The difference is that train_del.csv contains an additional column, trip_duration, which we will not need for our present analysis.

Importing and pre-processing data:

a) Import the libraries - Pandas and Folium. Drop the trip_duration column and combine the two different files into one dataframe.

We will need to generate some columns, such as month and other time features, using the datetime functionality of pandas. We will then use them with Folium.
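A sketch of this pre-processing (the file names follow the case study; the pickup timestamp column name is assumed):

import pandas as pd

train = pd.read_csv('train_del.csv').drop(columns=['trip_duration'])
test = pd.read_csv('test_del.csv')
df = pd.concat([train, test], ignore_index=True)   # one combined dataframe

df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])  # assumed column name
df['month'] = df['pickup_datetime'].dt.month
df['week'] = df['pickup_datetime'].dt.isocalendar().week
df['day'] = df['pickup_datetime'].dt.day
df['hour'] = df['pickup_datetime'].dt.hour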



Please note that the month, week, day, and hour columns will be used next for our analysis.
Note the following regarding visualizing spatial data with Folium:
• Maps are defined as a folium.Map object. We will need to add other objects on top of this before rendering.
• The different map tiles available for maps rendered by Folium can be seen at: https://github.com/python-visualization/folium/tree/master/folium/templates/tiles
• folium.Map(): the first thing to be executed when you work with Folium.

Let us define the default map object:

Let us now visualize the rides data using the HeatMap() plugin.



Code for reference:

from folium.plugins import HeatMap

df_copy = df[df.month > 4].copy()
df_copy['count'] = 1

base_map = generateBaseMap()   # helper that returns the default folium.Map defined above
HeatMap(data=df_copy[['pickup_latitude', 'pickup_longitude', 'count']]
        .groupby(['pickup_latitude', 'pickup_longitude'])
        .sum().reset_index().values.tolist(),
        radius=8,
        max_zoom=13).add_to(base_map)

Interpretation of the output:

There is high demand for pickups in the areas marked by the heat map, most probably central Delhi and other surrounding areas.
Now let us add the ability to place markers on the map by using the folium.ClickForMarker() object.
After adding the below line of code, we can add markers on the map to recommend points where logistic pickup stops can be built.
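A sketch of that line (the popup text is an assumption):

import folium

base_map.add_child(folium.ClickForMarker(popup='Proposed pickup stop'))
base_map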



We can also animate our heat maps to dynamically change the data on a timely basis, based on a certain dimension of time. This can be done using HeatMapWithTime(). Use the following code:

from folium.plugins import HeatMapWithTime

df_hour_list = []
for hour in df_copy.hour.sort_values().unique():
    df_hour_list.append(df_copy.loc[df_copy.hour == hour,
                                    ['pickup_latitude', 'pickup_longitude', 'count']]
                        .groupby(['pickup_latitude', 'pickup_longitude'])
                        .sum().reset_index().values.tolist())

base_map = generateBaseMap(default_zoom_start=11)
HeatMapWithTime(df_hour_list,
                radius=5,
                gradient={0.2: 'blue', 0.4: 'lime', 0.6: 'orange', 1: 'red'},
                min_opacity=0.5,
                max_opacity=0.8,
                use_local_extrema=True).add_to(base_map)
base_map

Conclusion
Throughout the city, pickups are most probable from the central area, so it is better to set up a large number of pickup stops at these locations.
Therefore, by using maps we can highlight trends, uncover patterns, and derive insights from the data.
