KEMBAR78
Grokking Data Science | PDF | Array Data Type | Array Data Structure
0% found this document useful (0 votes)
708 views61 pages

Grokking Data Science

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
708 views61 pages

Grokking Data Science

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 61

Creating the Workspace - Jupyter Notebooks

We'll coverthe following a

© First ThingsFirst: Why Python?


* Jupyter Notebook: WhatIt Is and WhyData Scientists Love It
* Howto Install Jupyter Notebook
» 1. Running Jupyter Notebooks With the AnacondaPythonDistribution
° 2. Getting Anaconda
© Creating Your First Jupyter Notebook
* Jupyter Notebook
* Bonus Tip

First Things First: Why Python?


Ofall the languages out there, whyis Python the most popularchoicein the machinelearning
anddata science world? Welike things that are simple andintuitive. Pythonis just that, simple
andintuitive. It’s readable,it’s low in complexity, and it’s easy to learn.

As a data scientist or machine learning engineer you can use a quick implementation in Python
to validate your ideas about some complex mathy conceptsin a fast and hassle-free manner.
It’s easily understandable for you andothers.

As a DataScientist, your life revolves arounddata. Outside the playground, you stumble upon
reality. Datain reallife is oftentimes raw,unstructured, incomplete, and large. Python comes
with the promise of knowing howto handletheseissues. But how does it do that? What is so
special about Python?

All hail the mighty packages! What's so special about Python arethe great open source code
repositories that are continuously being updated. These open source contributions give Python.
its superpowers and an edgeoverother languages. Thebest thing aboutusing these packagesis
that they have a minimal learning curve. Once you havea basic understanding of Python, you
can very easily import, use, and benefit from all of the packages outthere without having to
understandeverything is going on under the hood. Lastbutnot theleast, these packages are
completely free to use as well!

Since wehaveourdatascientist haton,let’s talk aboutdata. If youstill have doubts,I'll let this
survey from IBM convince you why weshouldlearn data science in Python andnot in R or any
other language.

(0ct27, 2016
‘= prionand (ache leaing” or "data sone: 0.180%
‘= Rand (machineering”or estasence’): 0.081%
Percentage of Matching 4Postings (%)

‘= sealsand racine earn or eatascence’): 0.080%

2012 2013 2014 2015 2016

‘The Most Popular Language For Machine Learning Is...

Jupyter Notebook: WhatIt ls and Why Data Scientists


LoveIt
This story begins with IPython.IPythonis an interactive command-line terminal for Python.
Command-line terminals are not everyone’s cupoftea, so in 2011 IPython introduced a new
tool named the Notebook — a modern and powerful webinterface to Python. In 2015 the
Notebook project was re-branded asthe Jupyter Notebook.

The Jupyter Notebookis an incredibly powerful andsleek tool for developing and presenting
data scienceprojects. It can integrate codeandits outputinto a single document, combining
visualizations, narrative text, mathematical equations, and other rich media. It’s simply
awesome.
SJUPYter noted00k iatcmspne assessoe

‘Simple spectral analysis


Annan oe oie: igoe realgncont ot aso

\Weegnyloadgata sing SPaoeppt


18 [1: fron scipy.to import watite
‘Arecan eastewts petal atesgMamesbuah specgranete:

Sedcbet title Spectrogtan Dy

As if all these features weren’t sufficient enough, Jupyter Notebook can handle manyother
languages,likeR, as well. Its intuitive workflows, ease of use, and zero-cost have madeit
THE toolat the heartof any data science project.

Essentially, Jupyteris a great interface to the Python language and a must-haveforall Data
Scienceprojects.

How to Install Jupyter Notebook


1. Running Jupyter Notebooks With the Anaconda Python Distribution

Theeasiest and moststraight forward wayto get started with Jupyter Notebooks is by installing
Anaconda. Anacondais the mostwidely used Pythondistribution for data science. It comes pre-
loaded with the most popularlibraries and tools (e.g., NumPy, Pandas, and Matplotlib). What
this meansis that immediately get to real work, skipping the pain of managingtons of
installations, dependencies, and OS-specific installation issues.

2. Getting Anaconda

Follow these simplesteps:

1. Download thelatest version of Anacondafor Python3 (ignore Python 2.7).


2. Install Anacondabyfollowing the instructions on the downloadpageand/orin the
executable.

Note: Thereare other ways of running the Jupyter Notebook aswell, e.g., via pip and Docker,
but weare going to keepit sweet and simple. Wewill stick with the mosthassle-free approach.

Creating Your First Jupyter Notebook


« Backgroundinformation? 7
« Installations? #

Let’s get started with a real Jupyter Notebook.

#® Note: You do notneedto go through theinstallation processright now.At the end of


this lessonthereis an in-built Jupyter Notebookfor you to play with, without having to
leave this page!

Runthe following command from your Anacondaterminal:

1 jupyter notebook

A Jupyter server is now runningin your terminal,listening to port 8888, andit will
automatically launch the application in your default web browserat http://localhost:8888if it
doesn’t happen automatically, you can use theurlto launch it yourself. You should see your
workspacedirectory,like in the screenshotbelow:

= Jupyter Logout
Fes Runing uses
‘Solelms peroactionson thm Uo |New |
o[-)s are # Lestodtes
a ects 11 aays 90
5 contats ‘anys age
(3 Desktop tt a9ys990
© Documents Sys a90
© Downioads| 2s 390
So Fovertes 11 days 90

You can create a new Python notebookbyclicking on the New [1] button(screenshot below)
andselecting the appropriate Python version (Python3)[2].

This doesthreethings for you:

1. It creates a new Notebookfile called Untitled.ipynb in your workspace.


2. It starts a Jupyter Python kernelto run this notebook.
3. It opens the newly created notebookin a newtab.

<= jupyter Logout


Files Running —Clusters 1
Select items to perform actions on them.
Upload |Newe &

~ TextFile
Folder
env
Terminal

Notebooks
Octave

Python 2
2» Python 3

As a “Hello-World-step” you can renameyour notebookto “Hello Data Science”by clicking


Untitled [1] and typing the new name, as shownin the snapshot below.

Anotebookcontainsa list ofcells. Each cell can contain executable code or formatted text
(Markdown). Right now, the notebookcontains only one emptycodecell. Try typing
print(“Hello Data Science!”) in the cell[2], then click on the run button [3] (or press Shift-Enter).
Hitting the run button sendsthecurrentcell to this notebook’s Python kernel, which runsit and
returnsthe output. Theresult is displayed below thecodecell:

‘Sjupyter Hello Data Science <—" point: minute ago. (unsaved changes) Pogo
Fle Edt Vow Cot Kemet Widgets Hep Tried) 9 [Pyton’3 ©
B+ (x G/B) ATS Rn Cm coe vie
2
In [i]? | print(*Hello Data Science!)
Hello bata Science!
[=o

Jupyter Notebook
You cantry this here:

& Howto Use a Jupyter NoteBook?

* Click on “Click to Launch” g7 button to workandsee the code running live in the
notebook.

+ You can click ZG to openthe Jupyter Notebookin a new tab.

* Go to File and click Download as and then choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

+ A Thenotebooksession expiresafter 15 minutes ofinactivity. It will reset after 15


consecutive minutes.

‘Your app can be found at: https://wa3ere1mk42-live-app.educative.run/notebooks/helloworld.ipynb G

Click to launch app!

Congratulations! Now you haveyour first Jupyter Notebook up and running. You will need it
for the IMDB andend-to-end MLprojectthatarediscussedlaterin this course.

Bonus Tip
Whenworking with other people ona datascience project,it is a good practice to:

1. Haveexplicit rules for naming all the documents.


2. Use a version controlsystem like Git to share and save yourprogress.
Python Libraries

We'll coverthe following a

* Essential PythonLibraries for a Data Scientist


* Numpy
* Pandas
© Scikit Learn
* Matplotlib
Seaborn

As welearnedin thepreviouslesson, oneof the greatest assets of Pythonis its extensive set of
libraries. These are what makethelife of a data scientist easy — thestartof the loveaffair
between Python anddatascientists!

YOUREFLYING!
HOW?

T DUNNO... 1
DYNAMIC TYPING? I JUST TYPED
WHITESPRE? Inport ontigeaity
/ cone soniye THATS 7? [
T LEARNED ITLAST PROGRAMMING .-- I ALSO SAMPLED
NIGHT! EVERYTHING 'S FUN AGAIN! EVERYTHING IN THE
1S $0 SIMPLE! ITS A WHOLE MEDICINE CABINET
! NEW WORLD FOR COMPARISON.
HELLO WORLD IS JUST A UP HERE! i
print "Hello, world)" BUT HOWARE BUT LTHINKTHIS
You FLYING? ISTHE PYTHON.

ImageCedits: https:/Ixked.com

Essential Python Libraries for a Data Scientist


Let’s take a quick tourofthelibraries that a data scientist should really know.

Numpy
+ NumPy (Numerical Python)is a powerful, and extensively used,library for storage and
calculations. It is designed for dealing with numericaldata. It allows data storage and
calculations by providing datastructures, algorithms, andotheruseful utilities. For
example,this library contains basic linear algebra functions, Fourier transforms, and
advanced random numbercapabilities. It can also be used to load data to Python and
exportfrom it.

Bi NumPy

Pandas
« Pandasis a library that you can’t avoid when working with Python on a data science
project. It is a powerful tool for data wrangling, a process required to prepare your data so
that it can actually be consumed for analysis and model building. Pandascontainsa large
variety of functionsfor data import, export, indexing, and data manipulation. It also
provides handy datastructures like DataFrames(series of columnsandrows, andSeries(1-
dimensionalarrays), and efficient methods for handling them. For example, it allows us to
reshape, merge, split, and aggregatedata.

ful pandas

Scikit Learn
* Scikit Learnis an easy to use library for Machine Learning. It comes with a variety of
efficient tools for machine learningandstatistical modeling:it provides classification
models (e.g., Support Vector Machines, Random Forests, Decision Trees), Regression
Analysis(e.g., Linear Regression, Ridge Regression, Logistic Regression), Clustering
methods (e.g, k-means), data reduction methods(e.g., Principal ComponentAnalysis,
feature selection), model tuning, andselection with featureslike grid search, cross-
validation. It also allows for pre-processingof data. If these terms soundforeign to you
right now, don’t worry,wewill get back to all this in detail in the section on machine
learning.

Matplotlib
* Matplotlib is widely used for datavisualization like for plotting histograms, line plots, and
heat plots.
Histogram of 1: = 100, 0 =35

Seaborn
* Seabornis anothergreat library for creating attractive and information rich graphics. Its
goalis to makedata exploration and understanding easier, andit doesit very well.
Seabornis based on Matplotlib whichis its child, basically.

#® Note: Learning to use Python well means using lotoflibraries and functions
which canbe intimidating. But no need to panic —you don’t have to remember them
all by heart! Learning how to Googlethese things efficiently is among the top skills
of a good data scientist!
Learning NumPy- AnIntroduction

We'll coverthe following =A

* Why NumPy
e Lessons Overview

Why NumPy
Data comes in all shapes andsizes. We can haveimagedata,audiodata,text data, numerical
data, etc. We haveall these heterogeneoussources of data but computers understandonly 0’s
and1’s — Atits core, data can be thoughtofas arrays of numbers. In fact, the prerequisite for
performing any data analysis is to convert the data into numerical form. This meansit is
importantto be able to store and manipulatearraysefficiently, and this is where Python’s
NumPy package comesinto picture.

Now, you might be questioning, “When can I use Python’s built-in lists and to doall sorts of
computations and manipulations throughlist comprehensions, for-loops, etc., why should I
bother with NumPy arrays?”You arerightin thinking so because,in someaspects, NumPy
arrays arelike Python’s lists. Their advantageis that they provide moreefficient storage and
data operationsasthearrays growlargerin size. This is the reason NumPy arraysareat the
core ofnearly all data sciencetools in Python.This,in turn,implies that it is essential to know
NumPy well!

Lessons Overview
In this Learning NumPy series, we will start by understandingthebasics of array manipulations
in NumPy. Wewill then proceed to learn about computations, comparisons, and other more
advanced tricks.

Understanding concepts through code-examples hands-on examplesis always better because it


helpsus retain information in long-term memory.Plus,it makes learning easier, fun and more
intuitive. So,it’s a no-brainer that weare goingto learn via interactive code examplesin this
lesson.

It is also importantto recall and apply whateverwelearnby practicing — wewill end with
someexercisesto hit-refresh on the concepts learned. The reason wehaveexercisesat the end
of the entire course rather thanat the endof each lesson is because recalling information after
sometimeis a better wayof learning.

In short,this course is set up to have interactive code-examplesas wego andhit-refresh


exercisesat the end.

Enough talking! Without further ado,let’s dive into the world of NumPy.
NumPyBasics - Creating NumPy Arrays and Array
Attributes

We'll coverthe following a

* 1. Creating Arrays
* a. Arrays From Lists
© b, Arrays From Scratch
© 2. Array Attributes

1. Creating Arrays
Thereare two waysto create arrays in NumPy:

a. Arrays From Lists

Thefirst step when working with packagesis to define the right “imports”. We can import
NumPy like so:

import numpy as np

Notice that np is the standard shorthand for NumPy.

There are multiple waysto create arrays. Let’s start by creating an array from list using the
np.array function.

Wecancreatea one-dimensional(1D) array froma list by passing thelist as input parameters


to the np.array function:

np-array([1, 2, 3, 4])

3D array
2D array axis 2
axis wW
1D array az) eS
(Sle[7) axis 0 >| [o(7.0) 6 axiso->6
Oren

What does1D, 2D,3Darray mean? Image Credits: www.3resource.com

Run the codein the widget below andinspectthe outputvalues. In particular, observe the
type ofthe created array from the result of the print statement.

1 # Import the numpy package


2 import numpy as np
3
4 # Create a 1d integer array from a list
5 arrd = np.array([1, 2, 3, 4])
6
7 # Print the array and its type
8 print(arnt)
9 print(type(arr1))

11234]
<class ‘numpy.ndarray'>

If we wantto explicitly set the data type ofthe resulting array, we can use the dtype keyword.
Someof the most commonly used numpydtypes are: ‘float’, ‘int’, ‘bool’, ‘str’, and ‘object’. Say we
wantto createan array offloats, we candefineit like so:

# Create a 1d float array


onunwner

arr2 = np.array([1, 2, 3, 4], dtype="Ffloat32")


# Print the array and its type
print(type(arr2))
print(arr2)

1.318
<class ‘numpy.ndarray'>
[1. 2. 3. 4.1

In the examples above, we have seen one-dimensional arrays. We canalso define two and
three-dimensionalarrays.

# Create a 2d array from a list of lists


wewne

lists [[0,1,2], [3.4.5], [6,7,.8]]


arr2d np.array(lists)
print(arr2d)

[fo 2 2)
[345]
[67 81)

A keydifference betweenanarray anda listis that arrays allow you to perform vectorized
operations whilea list is not designed to handlevector operations. A vector operation means a
function gets applied to every item in the array.

Say wehavea list and we wantto multiply each item by 2. We cannotdo element-wise
operationsby simply saying “mylist * 2”. However, we can do so on a NumPy array.Let’s see
some code-examples:

arri = np.array([1, 2, 3, 4])


Nouwsunk

print(arrt)
# Vector (element-wise) operations
print(arra * 2)
print(arrd + 2)
print(arra * arri)

11234]
12468]
[345 6]
[1 4 9 16)

Someotherkey differences between Pythonbuilt-in lists and NumPyarraysare:

« Array size cannot be changedafter creation, you will have to create a new array or
overwritethe existing one to changesize.
* Unlikelists, all itemsin the array must be of the samedtype.
« An equivalent NumPy array occupies muchless space than a Pythonlistoflists.

b. Arrays From Scratch

Nowinstead ofusinglists as a starting point, let’s learn to create arrays from scratch. For large
arrays,it is more efficient to create arrays using routines already built into NumPy. Here are
several examples:

# Create an integer array of length 100 filled with zeros


wavanaune

np.zeros(160, dtype=int)
# Create a 3x3 floating-point array filled with 1s
np.ones((3, 3), dtype=float)
# Create an array filled with a linear sequence
# Starting at @, ending at 20, stepping by 3
# (this is similar to the built-in range() function)
10 np.arange(2, 28, 3)
11
12. # Create an array of hundred values evenly spaced between @ and 1
13 np.linspace(@, 1, 100)
14
15 # Create a 3x3 array of uniformly distributed random values between @ and 1
16 np.random.random((3, 3))
17
18 # Create a 3x3 array of random integers in the interval [@, 16)
19 np.random.randint(@, 1, (3, 3))
20
21 # Create a 3x3 array of normally distributed random values
22 # with mean @ and standard deviation 1
23. np.random.normal(@, 1, (3, 3))
24
25 np.random.randint(10, size=6) # One-dimensional array of random integers
26 np.random.randint(10, size=(3, 3)) # Two-dimensional array of random integers
27 np.random.randint(10, size=(3, 3, 3)) # Three-dimensional array of random integers

2. Array Attributes
Each arrayhas the followingattributes:

¢ ndim: the numberof dimensions


* shape: the size of each dimension

© siz thetotal size of the array


* dtype: the data typeof the array

© itemsize: thesize (in bytes) of each array element


* nbytes : the total size (in bytes) of the array

Runthe code below andobservethe output.

import numpy as np
wavanaune

# Create a 3x3 array of random integers in the interval [@, 16)


Xx = np.random.randint(@, 10, (3, 3))
print(“ndim |» X.ndim)
print(shape: x. shape)
print("x size » x.size)
1» x.dtype)
18 x.itemsize, “byte
11 print(“nbytes: » X.nbytes, "bytes")

dtype: inté4
itemsize: 8 bytes
nbytes: 72 bytes
NumPyBasics - Array Indexing and Slicing

We'll coverthe following a

* 3. Array Indexing: Accessing Single Elements


* 4, Array Slicing

3. Array Indexing: Accessing Single Elements


If we wantto get andset the values ofindividual elements in the array, we need to beable to
access single elements, correct? Accessing single elementsis called indexing arrays.

Indexing in NumPy is similar to Python’s standardlist indexing. In a 1D array, we can access


the ith valueby specifying the indexof the element weneed in square brackets. One
importantthing to rememberhereis that indexing in Pythonstarts at zero.

Observetheinputs and outputs (indicated by “#>”as start marker) for the examples given
below.

Note: When running thecode, ifyou wantto view the outputin the console, you can add
print statementsif they are already not there,like at the endofthisfirst code widget. I have
omitted them to removenoisefrom the code so thatyou can fully focus on the importantbits.

# Input array
wavanaune

x1 = np.array([1, 3, 4, 4, 6, 4])
# Assess the first value of x1
xi[e]
pa
# Assess the third value of x1.
xa[2]
10 #4
11
12 # To view the output, you can add print statemtents
13 print(x1[@])

Wecan use negative indices to index from theendofthe array:

# Get the last value of x1


Nouwsunk

xi[-1]
D4
# Get the second last value of x1
xi[-2]
6

V succeeded

If we have a multidimensional array, and wantto access items based on both column androw,
wecan passthe row and column indices at the sametime using a comma-separatedtuple as
shownin the examples below.

1 # In a multidimensional array, we need to specify row and column index. Given input array x2:
2 x2 = np.array([[3, 2, 5, 5].[@, 1, 5, 8], (3, % 5, @]])
3 x2
4 #array([[3, 2, 5, 5],
5s ® [e, 1, 5, 8],
6 [3, ®5, @]])
7
8 # Value in 3rd row and 4th column of x2
9 x2[2,3]
1e me
12. # 3rd row and last value from the 3rd column of x2
13° x2[2,-1]
14 e
16 # Replace value in 1st row and 1st column of x2 with 1
47 x2[@,0] = 1
18 #>array([[1, 2, 5, 5].
19 # [% 1 5, 8],
20 #> [3, @ 5, @]])

V succeeded

4. Array Slicing
Slicing array is a way to access subarrays,i.e., accessing multiple or a range of elements from
an array instead ofindividualitems. In other words, whenyouslice arrays you get and set
smaller subsets of items within largerarrays.

Again, we needto use square brackets to access individual elements. Butthis time, we also
need the slice notation, “:” to access a slice or a range of elementsof a given array, x:

x[start:stop:step]

If we do not specify anything for start, stop, or step, NumPyuses the default values for these
parameters: start=0, stop=size of dimension, and step=1.

Carefully go throughall of the examples given below,and observethe outputvalues for the
different combinationsofslices. As an exercise play with the indices and observe the
outputs.

1 x1 = np.arange(10) # Input array


2 xl
3 # array([@, 1, 2, 3, 4, 5, 6, 7, 8 9])
4
5 # Get the first 5 elements of x
6 x1[:5]
7 #> array([@, 1, 2, 3, 4])
8
9 # Elements after index 4
10 x1[4:]
11 #array([4, 5, 6, 7, 8, 9])
13 # From 4th to 6th position
14 x1[4:7]
15 #> array([4, 5, 6])
17 # Return elements at even place (every other element)
a8 xaf i: 2]
19 #> array([@, 2, 4, 6, 8])

21 #return elements from 1st position step by 2 (every other element starting at index 1)
220 xa[i::2]
23 #> array([1, 3, 5, 7, 9])

Y succeeded

Whatdo you think would happenif wespecify a negative step value? In this case, the defaults
for start and stop are swapped,a handy waytoeasily reverse an array!

#reverse the array


Nouwsunk

xafs:-4]
# array([9, 8, 7, 6 5, 4, 3, 2. 1, ])
# reverse every other element starting from index 5
xa[S::-2]
# array([5, 3, 1])

o succeeded

Wecan use this same approach with multi-dimensionalslices. We can define multipleslices
separated by commas:

x2 = np.array([[@,1,2], [3.4.5], [6,7,8]])


wavanaune

# array([[@, 1, 2],
> (3, 4, 5],
> [6 7, 8]])
x2[:2, :2] # Extract the first two rows and two columns
# array([[ @ 11,
> (3, 411)
10 x2[:3, ::2] # all rows, every other column
11 #array([[@, 21,
12 [3 5)
13 [6 8]])

Y succeeded

Again,try modifying the values and play with multi-dimensional arrays before moving ahead.
Wecanalso perform reverse operations on subarrays:

1 x2[::-1, ] # Reverse only the row positions


2
3 > array([l 6, 7, 8],
42 [3 4 5],
5s ® Le, 4, 21)
6
7 x2[::-4, ::-1] # Reverse the row and column positions
8 # array([l 8, 7, 6],
o® [5 4 31
10 # [2 4 el]

Y succeeded

Note: Arrayslices are not copiesofthe arrays. This means thatif we wantto doa
modification on the array obtained from theslicing operation withoutchanging the
original array, we haveto use the copy() method:

x2_subcopy = x2[::-1, ::-1]-copy()


NumPyBasics - Reshaping and Concatenation

We'll coverthe following a

5, Reshapingof Arrays
6. Concatenation and Splitting of Arrays
* a. Concatenation
° b. Splitting

5. Reshaping of Arrays
Reshapingis about changing the wayitems are arranged within thearrayso that the shape of
the array changesbuttheoverall numberof dimensions stays the same, e.g., you can useit to
convert a 1D array into 2D array.

Reshaping is a very useful operation andit can easily be done using the reshape() method.
Since a picture speaks a thousand words,let's see theeffects of reshaping visually:

v
npseshape (x, (3,2))
v
3
{e[=]s

5
7

Reshaping example (Image credits: www.3resource.com)

Howcan wedothis in code? Say we wantto create a 3x3 grid with numbersfrom 1 to 9. We
first create a 1D array and then convertit to the desired shape as shown below.

Run the codein the widget below and observethe outputsof the print statements to
understand what’s going on.

import numpy as np
reshaped = np.arange(1, 10).reshape((3, 3))
print(reshaped)

[f1 2 3]
[4.5 6]
[7 8 91]

Similarly, we can use reshaping to convert between row vectors and column vectors by simply
specifying the dimensions we want. A row vectorhasonly 1 row.This meansthat after
reshaping,all the values end up as columns. A columnvectorhasonly 1 columnandall the
values end upin rows.

x = np.array([1, 2, 3])
wavanaune

print(x)
# row vector via reshape
x_pv= x.reshape((1, 3))
print(x_rv)
# column vector via reshape
x_ev = x.reshape((3, 1))
18 print(x_cv)

1.218
1123)
[f2 231]
[4]
[2]
(311

6. Concatenation and Splitting of Arrays


Weareoften required to combinedifferentarrays or split one array into multiple arrays.
Insteadof doing this manually, we can use NumPy’s array concatenation andsplitting
operations. This means wecan handle these complextasks easily.

a. Concatenation

The concatenate() method allowsus to put arrays together. Let’s understandthis with some
examples.

The following code-examplefirst shows howto concatenatethree 1D arraysinto a single array


andthenit shows howto concatenate two 2Darraysintoa single array.

# We can concatenate two or more arrays at once.


wavanaune

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
2 = [11,11,11]
np.concatenate([x, y, z])
®array([ 1, 2, 3, 3, 2, 4, 11, 11, 111)
# We can also concatenate 2-dimensional arrays.
1 grid = np.array([[1,2,3] . [4,5,6]])
11 np.concatenate([grid, grid])
12. #> array([[1, 2, 3],
13 [4, 5, 6],
14 t. 2, 31,
15 [4 5, 6]])
16

Y succeeded

Now, what if you are required to combinearraysofdifferent dimensions, e.g., a 2D array


with a 1D array? In such cases, np.concatenate mightnothethebest optionto use. Instead, you
can use np.vstack (vertical stack) or np. hstack (horizontalstack)to finish the task.

3[s(7 3
a
5
5 7 9
7

np.hstack((xy))
| 2
7
ee 9
357] s|7io
— |
np.hstack((x,y))

3} 5 7

5} 7 19

|
np.vstack((x,y))
“| ol|—

3 7
5 9
Le

Vertical Stacking (Image Credits: www.3resource.com)

Runthe codebelowto better understand howto dothis practically. Before looking at the
output, try to visualize the solution in your head.

Reminder: You can add print statement aroundthe outputto see results in your consoleas well.

1 x = np.array([3,4,5])
2 grid = np.array([[1,2,3],[9.1@,11]])
3
4 np.vstack([x,grid]) # vertically stack the arrays
5 # array([l 3, 4, 5],
6 > (1, 2, 3],
7 » [9, 1, 11]])
8
9 z = np.array([[19],[19]])
10 np.hstack([grid,z]) # horizontally stack the arrays
11 #array([[ 1, 2, 3, 19],
12 [9, 1, 41, 19]])

Y succeeded

b. Splitting

Wecan do the opposite of concatenation and split the arrays based on a given position for the
split points.

Splitting of an array (Image Credits: www.3resource.com)

Wecan doitin codelike so:

1 X = np.arange(10)
2 # array([@, 1, 2, 3, 4, 5, 6, 7, 8 9])
3
4 x1, x2, x3 = np.split(x,[3,6])
5 print(x1, x2, x3)
6 #[812] [345] [6789]

1.318
[012] [345] 16789)

Notice thatN split points result in N+ 1 subarrays.

Like concatenation, wecan also perform horizontal andverticalsplitting.

0. 1 2. 3. 0. 7 2. 3. 4.
4, 5. 6. 7. 5. 6. 7. 8. 9,
8. 9. 10.}/ 11. 10. 11. 12. 13. | 14.
12. | 13. 14. 15. 15. 16. 17. 18. 19.
od eee

np.hsplit( a, 2) np.vsplit(a, 2)

0. 1. offifi2iaa
4. 5. 5. 6. 7. 8. 9.
L ee
8.| 9. —a a ord
12.13. 10.|[ 11. [12.13] 14
—— 15.|| 8.
16.|/ 17.1 19.
[1 168..1 7
2.| 3. Sonen en
6.| 7.
10.11.
14] 15.

Horizontal and Vertical Splitting (Image Credits: www.3resource.com)

Let’s see this in action. Run the code below and observethe outputs.

import numpy as np
grid = np.arange(16).reshape((4, 4))
print(grid, “\n")
# Split vertically and print upper and lower arrays
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower, "\n")
41 # Split horizontally and print left and right arrays
12 left, right = np.hsplit(grid, [2])
13 print(left)
14 print(right)

Ifo 1 2 3]
[45 6 7]
[8 910 11)
[22 13 14 15]]

[fo 12 3]
[45671]
If 8 9 10 11]
[22 13 14 15]]

Good job! Weare donewith the basic operations in NumPy.

In addition to the functions we havelearned so far, there are several other very useful
functionsavailable in the NumPylibrary (sum,divide, abs, power, mod,sin,cos, tan, log, var,
min, mean, max, etc.) which can be usedto perform mathematical calculations. There arealso
built-in functions to compute aggregates and functions to perform comparisons. Weare going
to learn these conceptsin the upcoming lessons, so keep going!
NumPy Arithmetic and Statistics - Computations and
Aggregations

We'll coverthe following a

* 1. Computations on NumPy Arrays


* Mathematical Functions
© Universal Function Methods
* 2. Aggregations

1. Computations on NumPyArrays
The reason for NumPy’s importancein the Pythonicdata science worldis its ability to perform
computationsin a fast and efficient manner. Nowthatwearefamiliar with the basic nuts and
bolts of NumPy,weare going to dive into learning it to perform computations.

NumPy provides the so-called universal functions (ufuncs) that can be used to make repeated
calculations on array elements in a very efficient manner. Theseare functionsthat operate on
nD-arrays in an element-by-elementfashion. Rememberthe vectorized operations from earlier.

Mathematical Functions
Whatare someof the most commonanduseful mathematical ufuncs available in the NumPy
package? Let’s explore them with someconcrete examples.

Thearithmetic operators, as shownin the code widget below, are conveniently wrapped around
specific functions built into NumPy;for example, the + operatoris a wrapperfor the add ufunc.

Run thecodein the widget below, tweaktheinputs, and observethe outputsofthe print
statements.

1 import numpy as np
2
3. x = np.arange(10)
4
5 # Native arithmentic operators
6 print(”
7 print("x + 5 5)
8 print("x - 5 - 5)
9 print("x * 5 *5)
1 print("x / 5 7/5)
41 print("x ** 2 =", x ** 2)
12. print("x %2 =", x % 2)
13
14 # OR we can use explicit functions, ufuncs, e.g. “add” instead of "+"
15 print(np.add(x, 5))
16 print(np.subtract(x, 5))
17 print(np.multiply(x, 5))
418 print(np.divide(x, 5))
19 print(np.power(x, 2))
20 print(np.mod(x, 2))

x= [012345678 9)
x+5=[5 6 7 8 91011 12 13 14]
x-5=[-5-4-3-2-1 0 1 2 3 4]
x“ 5= [0 5 10 15 20 25 30 35 40 45]
x/5=[0. 0.20.40.6 0.81. 1.21.4 1.6 1.8]
x [0 1 4 9 16 25 36 49 64 81)
x22 [0101010101]
[5 6 7 8 91011 12 13 14]
[-5 -4-3-2-1 0 1 2 3 4]

Someofthe most useful functions for data scientists are the trigonometric functions. Let’s look
into these. Let’s define an arrayof anglesfirst and then compute sometrigonometric functions
based on those values:

theta np.pi, 4)
wewne

print( theta)
print(” np.sin(theta))
print(” np.cos(theta))
print("tan(theta) np.tan(theta))

theta = [0. 1.04719755 2.0943951 3.14159265]


sin (theta) [0.00000000e+00 8.66025404e-01 8.66025404e-01 1.22464680e-16]
cos (theta) [2.0.5 -0.5 -1. ]
tan (theta) [ 0.00000000e+00 1.73205081e+00 -1.73205081e+00 -1.22464680e-16]

Similarly, we can also obtain logarithms and exponentials.

Note: These might not seem useful to you at the moment, butyouwill see their direct
applicationin ourfinal project.

x= [1 2.3]
wavanaune

print (” x)
print(” np .exp(x))
print(” np -exp2(x))
print ("3x np.power(3, x))

print("In(x) np.log(x))
print(“log2(x) np..10g2(x))
print(“logi@(x) np.10g10(x))

x = [t, 2, 3]
e*x = [ 2.71828183 7.3890561 20.08553692)
2x = (2. 4. 80]
3x = [3 9 27]
In(x) = [0. 0.69314718 1.09861229)
log2(x) [0. 1. 1,5849625]
logiO (x) = [0. 0.30103 9.47712125]

Universal Function Methods


ufuncs provide some methodsthat take two input parameters and return one output parameter.
reduce and accumulate are two ofthe most importantones,solet’s look into those.

a. Calling the reduce method

Say we want to apply someoperation to reduce an arrayto a single value. Wecan use the
reduce() method for this. This method repeatedly applies the given operation to the elements of
an array until only a single result remains. For example, calling reduce on the add functions
returnsthe sum ofall elementsin the array:

x = np.arange(1, 6)
wewne

sum_all = np.add.reduce(x)
print(x)
print(sum_all)

12345]
15

Note: add.reduce() is equivalentto calling sum(. In fact, when the argumentis a NumPy
array, np.sum ultimately calls add.reduce to do the job. This overhead ofhandlingits
argumentanddispatching to add.reduce can make np.sum slower. For moredetails, you can
refer to this answer on StackOverflow.

b. Calling the accumulate method

If we need tostore all the intermediate results of the computation, we can use accumulate()
instead:

Xx = np.arange(1, 6)
wewne

sum_ace = np.add.accumulate(x)
print(x)
print(sum_acc)

12345]
[1 3 6 10 15]

2. Aggregations
When wehavelarge amounts ofdata,asa first step, welike to get an understandingof the data
first by computingits summarystatistics, like mean and standarddeviation.

Note: Wewill look into the theoretical aspects of these statistical conceptsin the “Statistics
for DataScience”section, so don’t worry ifyou don’t remember whatstandarddeviationis,
for instance!

NumPy provides somevery handybuilt-in aggregate functionsthatallow us to summarize our


data, €.g., np.mean(x) gives us the meanvalueofthe array. Understanding with codein a hands-
on wayis alwaysbetter,so let’s explore these aggregates with code.

import numpy as np
wavanaune

x = np.random.random(100)
# Sum of all the values
print("sum of values is np. sum(x))
# Mean value
print("Mean value is: ", mp.mean(x))
1 #For min, max, sum, and several other NumPy aggregates,
11 #a shorter syntax is to use methods of the array object itself,
12. # i.e. instead of np.sum(x), we can use x.sum()
13 print(” ", x.sum())
14 print(” + X.mean())
45 print(” x.max())
16 print(” > x.min())

Sum of values is: 47.853818213265974


Mean value is: 0. 47853818213265975
Sum of values is: 47,853818213265974
Mean value is: 0. 47853818213265975
Max value is: 0,998942680150157
Min value is: 0.013531968734082356

Similarly, we can perform aggregate operations on multi-dimensional arrays as well. Also,if we


wantto compute the minimum row wiseor columnwise, wecan use the np.amin version
instead. Let’s see how:

import numpy as np
wavanaune

grid = np.random.random((3, 4))


print(grid)
print(“Overall su grid. sum())
print("Overall Mi grid.min())
# Row wise and column wise min
18 print("Column wise minimum: ", np.amin(grid, axis=@))
11 print("Row wise minimum: ", np.amin(grid, axis=1))

[[0.79270742 0.58274491 0.68668489 0.7915523 ]


[0.14343324 0.04807954 0.493414 0.789767 ]
[0.85668912 0.52553822 0.13584375 0.55073788]]
overall sum: 6.397239665415485
overall Min: 0.048079540255743125
Column wise minimum: [0.14343324 0.04807954 0.13584375 0.55073788]
Row wise minimum: [0.58274491 0.04807954 0.13584375]

Now weknowhowto perform mathematical operations and aggregations. In the nextlesson, we


will learn about somesubtle data operations like using boolean masks and performing
comparisons.
NumPy Arithmetic and Statistics - Comparison and Boolean
Masks

We'll coverthe following a

* 3. Comparisons and Boolean Masks


* a. Comparisons
* b, Boolean Masks
* Final Thoughts

3. Comparisons and Boolean Masks

a. Comparisons
In this world reignedby thesocial media, the trap of making comparisonsis just about
everywhere.So, staying true to the culture of making comparisons,let’s talk about comparisons
in NumPy

NumPy provides comparison operators suchas less than and greater than as element-wise
functions. The result of these comparison operatorsis always an array with a Boolean data
type, i.e., we get boolean array as output which containsonly True andFalse values depending
on whetherthe elementat that indexlives up to the comparison or not. Let’s see this in action
with some examples.

Runthe codein the widget below andobservetheoutputsof the print statements to


understandis going on.

1 import numpy as np
2
3 x = np.array([1, 2, 3, 4, 5])
4
5 print(x < 2) # less than
6 print(x >= 4) # greater than or equal

[ True False False False False]


[False False False True True]

Wecanalso do an element-by-element comparison of twoarrays and include compound


expressions:

x = np.array([1, 2, 3, 4, 5])
wewne

# Elements for which multiplying by two is the same as the square of the value
(2 * x) == & *2)
# array([False, True, False, False, False], dtype-bool)

V succeeded

Wecanalso countentries in the boolean array that weget as outputs. This can help us perform
otherrelated operations,like getting the total countofvaluesless than 6, np. count_nonzero , OF
checkingif all the valuesin thearray are less than 10, np.al1 and np.any:

import numpy as np
x = np.arange(10)
print(x)
# How many values less than 6?
print(np.count_nonzero(x < 6))
# Are there any values greater than 8?
10 print(np.any(x > 8))
12. # Are all values less than 16?
13 print(np.all(x < 1@))

1.178
[0123456789]
6
True
True

b. Boolean Masks
Amore powerfulpattern than just obtaining a boolean outputarrayis to use boolean arrays as
masks. This meansthat weare selecting particular subsetsof the array that satisfy some given
conditions by indexing the boolean array. We don’t just wantto knowif an index holds a value
less than 10, we want to getall the valuesless than 10 themselves.

Suppose wehavea 3x3 grid with randomintegers from 0 to 10 and we want anarrayofall
values in the original array thatare less than 6. We can achievethis like so:

1 import numpy as np
2
3 # Random integers between [@, 18) of shape 3x3
4 x = np.random.randint(@, 18, (3, 3))
5 print(x)
6
7 # Boolean array
8 print(x < 6)
9
1 # Boolean mask
411 print(x[x < 6])

1.928
[t7 1 21
[8 6 6]
[0 3 91]
[[Palse True True]
[False False False]
[ True True False]]
11203]

Whyarethese operations important?

By combining boolean operations, masking operations, and aggregates, we can very quickly
answera lot of useful questions about ourdataset.

Say weare given a populationdataset, we can answer questionslike:

« Whatis the minimumageof peoplein that dataset?


« Whatis the maximum age?
« How manypeoplein a given country are belowthe age of 18?
« How manypeopleare abovetheage of 25 and unemployed?

In general, wecanselect subsets of the data based on some conditionsofinterest.

Final Thoughts
Congratulations! Weareat the end of our lessons on NumPy f Ofcourse, wewill keep
bumping intoit in the upcoming lessons as well; especially in the Projects section.

For a deeper diveinto all the goodness NumPyhastooffer, hereis their official documentation.

Lastbutnot the least, before moving on with new concepts, make sure to test your NumPy
knowledgeandsolidify the concepts learned so far by completing the exercises in the next
lesson.
Exercises: NumPy

We'll coverthe following a

© TimeToTest YourSkills!

* Q1.Create a null vector(all zeros) of size 10 andsetit in the variable called


Tes

Q2. Create a 1D array of numbers from to 9 andsetit in the variable


called “arr”.

Q3. Create a 3x3x3 array with random values and setit in the variable
called “arr”.

Q4. Create a 10x10 array with random valuescalled “arr4”. Find its
minimum and maximum values andset them in the variablescalled
“min_val” and “max_val” respectively.
Q5.First create a 1D array with numbers from 1 to 9 andthen convert it
into a 3x3 grid. Store the final answerin thevariable called “grid”.

Q6. Replace the maximum valuein thegiven vector, “arr6”, with -1.
Q7. Reverse the rowsofthe given 2Darray, “arr7”.
Q8. Subtract the mean of each row ofthe given 2D array,“arr8”, from the
values in thearray. Setthe updatedarray in “transformed_arr8”.

Time To Test Your Skills!

Q1.Create a null vector (all zeros) of size 10 and setit in the variable
called “Z”.

1 # Your solution goes here (Z = ....)

ag Need Hint? Hide Solution a

Solution eR o

1 Z = np.zeros(1@)

Q2. Create a 1D array of numbersfrom 0 to 9 and setit in the variable


called “arr”.

1 # Your solution goes here

ag Need Hint? Hide Solution a

Solution eR o

1 arr = np.arange(10)

Q3. Create a 3x3x3 array with random valuesand setit in the variable
called “arr”.

1 # Your solution goes here

ag Need Hint? Hide Solution a

Solution eR o

1 arr = np.random.random((3,3,3))

Q4. Create a 10x10 array with random values called “arr4”. Find its
minimum and maximum values and set them in the variables called
“min_val” and “max_val” respectively.

1 # Your solution goes here

ag Need Hint? Hide Solution a

Solution eR o

1 arr4 = np.random.random((10,10))
2° min_val arr4.min()
3) max_val = arr4.max()

Q5. First create a 1D array with numbers from 1 to 9 and then convert
it into a 3x3 grid. Store the final answer in the variable called “grid”.

1 # Your solution goes here

ag Need Hint? Hide Solution a

Solution eR o

1 grid = np.arange(1, 10).reshape((3, 3))

Q6. Replace the maximum value in the given vector, “arr6”, with -1.

# Input
arré = np.arange(10)
# Your solution goes here

ira} Need Hint? Hide Solution a

Solution eR o

1 arré = np.arange(10)
2 arré[arr6.argmax()] = -1

Q7. Reverse the rowsof the given 2D array, “arr7”.

# Input
arr7 = np.arange(9) .reshape(3,3)
# Your solution goes here

iras) Need Hint? Hide Solution ca

Solution eR o

1 # Input
2. arr? = np.arange(9).reshape(3,3)
3
4 # Solution
Sarr? = arr7[::-1]
6
7

Q8. Subtract the mean of each rowof the given 2D array, “arr8”, from
the values in the array. Set the updated array in “transformed_arr8”.

To get the meanalong therow axis, you can use the numpy.mean method, mean(axis=1,
keepdims=True)

# Input
arr8 = np.random.rand(3, 16)
# Your solution goes here

ag Hide Solution a

Solution eR o

1 arr = np.random.rand(3, 10)


2 transformed_arr8 = arr8 - arr8.mean(axis=1, keepdims=True)
Learning Pandas - An Introduction

We'll coverthe following a

* Learning Data Manipulation With Pandas


e Lessons Overview

Learning Data Manipulation With Pandas


There are someingredients,like salt, without which almostnodish is complete. Pandas isjust like
salt — some need more and someless, but almost every data science projectneedsit.

Pandasis a very powerful and popular package built on top of NumPy. It provides an efficient
implementation of data objects built on NumPy arrays and many powerful data operations.
Thesekind of operations are knownas data wrangling — steps required to preparethe data
so thatit can actually be consumedfor extracting insights and modelbuilding.

This might surprise you, but data preparation is whattakes the longest in a data science
project!

THIS KS YOUR MACHINE LEARNING SYSTEM?


YUP! YOU POUR THE DATA INTOTHS BIG
PILE OF UNEAR ALGEBRA, THEN COLLECT
THE ANSWERS ON THE OTHER SIDE.
ETTANSLERS ARE RONG? }
JUSTSTIR THE PILE UNTIL
THEYSTART LOOKING RIGHT

Image Source: https:/ixked.com/1838/

The two primary components of Pandasare the Series and DataFrameobjects. A Series is
essentially a column. And a DataFrameis a multi-dimensional table made upofa collection of
Series; it can consist of heterogeneousdata types and even contain missing data.

At thevery basic level, Pandasobjects can be thoughtof as enhanced versions of NumPyarrays


in whichthe rowsand columnsareidentified with labelsinstead of simple integerindices.

Lessons Overview
Pandasprovides manyuseful tools and methods in addition to the basic data structures. These
tools and methods require familiarity with the core data structures though,so wewill start by
understanding the nuts andbolts of Series and DataFrames. Then wewill diveinto all the good
things that Pandashasto offer by analyzing somereal data — be ready to explore the IMDB-
movies dataset!
Pandas Core Components- The Series Object

coverthe following Aa

e 1.Series From Lists and Arrays

e 2. Series From Dictionaries

1. Series From Lists and Arrays


A Pandas Series is a 1D array of indexeddata essentially a column. It can be created from list
or an array using the pd.Series() method as showninthe code-widget below.

Run the code andobservethe outputto understandthe conceptsin a hands-on example.

Notice thatthe standard shorthandfor importing Pandasis pd.

1 import pandas as pd
2 series = pd.Series([@, 1, 2, 3])
3. print(series)

o
wnro

1
2
3
dtype: inté4

From the previous output, we can see thata Series consists of both a sequenceof values and a
sequence ofindices. The values are simply a NumPyarray, while the indexis an array-like
object of type pd. Index . Values can be accessed with the correspondingindex using the already
familiar square-bracketandslic ing notations:

1 import pandas es pd
2 series = pd.Series([@, 1, 2, 3, 4, 5])
3
4 print("values series. values)
5 print(Indice: series.index, "\n")
6
7 print(series[1], "\n") # Get @ single value
8
9 print(series[1:4]) # Get @ range of values

values: [0123 4 5]
Indices: RangeIndex(start=0, stop=6, step=1)

aoa
2 2
33
dtype: inté4

But why should weuse Series when we have NumPy arrays?

Pandas’ Series are much moregeneral and flexible than the 1D NumPy arrays. The essential
differenceis the presence of the index; while the values in the NumPyarray have an
implicitly defined integer index(to get andsetvalues), the PandasSerieshas an explicitly
defined integer index, which gives theSeries object additional capabilities.

For example, in Pandas, the index doesn’t haveto be an integer— it can consist of values of any
desired type, e.g., we can use strings as an index andthe item access worksas expected. Here is
an exampleofa Series based on non-integer index:

1 import pandas es pd
2 data = pd.Series([12, 24, 13, 54],
3 index=['a', "b', ‘c', ‘d"])
4
5 print(data, “\n")
6 print("Value at index b:", data[‘b'])

a 12
b 24
© 13
a 54
dtype: inté4
Value at index b: 24

2. Series From Dictionaries


Wecansee that Pandas’ Series look much like dictionaries in Python. In fact, we can think of a
PandasSeries like a specialization of a Python dictionary. A dictionary is a structure that maps
arbitrarykeysto a set of arbitrary values, and Series is a structure that mapstyped keys toa
set of typed values. This type information makes them moreefficient compared to standard
dictionaries.

Let’s see how wecan create Series from a dictionary, and then wewill perform indexing
andslicing on it. Say we havea dictionary with keysthat are fruits and values that correspond
to their amount. We wantto use this dictionary to create a Series object and then access values
using the namesofthefruits:

1 import pandas as pd
2
3 fruits_dict = ‘apples’:
4 “oranges’
5 “bananas*
6 “strauberries’: 20}
7
8 fruits = pd.Series(fruits_dict)
9 print("value for apples: ", fruits[‘apples*], “\n")
18
11 # Series also supports array-style operations such as slicing:
12 print(fruits[ bananas’: ‘strauberries"])

value for apples: 10

bananas 3
ozanges, 5
strawberries 20
dtype: inté4
Pandas Core Components - The DataFrame Object

We'll coverthe following a

* The DataFrame Object


» 1. Constructing a DataFrame From a Series Object
» 2. Constructing a DataFrame From a Dictionary
» 3. Constructing a Dataframeby Importing Data From File

The DataFrame Object


In the previouslesson, we learned aboutSeries. The next fundamental structure in Pandas that
wewill learn aboutis the DataFrame. While Seriesis essentially a column, a DataFrameis
a multi-dimensional table madeupof a collection of Series. Dataframesallow us to store
and manipulate tabular data where rowsconsist of observations and columnsrepresent
variables.

Thereare several waysto create a DataFrameusing pd.DataFrame() . For example, we can


create a DataFramebypassing multiple Series into the DataFrame object, we can convert a
dictionary to a DataFrameor wecan import data from csvfile. Let’s look at eachof these in
detail.

1. Constructing a DataFrame From a Series Object


Wecan create a DataFramefrom single Series by passingtheSeries object as inputto the
DataFramecreation method, along with an optional input parameter, column, which allows us
to namethe columns:

1 import pandas es pd
2
3 data_s1 = pd.Series([12, 24, 33, 15],
4 index=[ apples’, ‘bananas’, ‘strawberries’, ‘oranges’ ])
5
6 # ‘quantity’ is the name for our column
7 dataframel = pd.DataFrame(data_s1, columns=[ ‘quantity’ ])
8 print(dataframe1)
9

quantity
apples 12
bananas 24
strauberries 33
oranges 15

2. Constructing a DataFrame From a Dictionary


Wecan construct a DataFrame form anylist of dictionaries. Say we havea dictionary with
countries, their capitals and some othervariable (population,size of that country, numberof
schools, etc.):

1
2
3
4
5 data = pd.DataFrame(dict)
6 print(data)

SomeColumn capital country


o 100 Oslo Norway
1 200 Stockholm Sweden
2 300 Madrid Spain
3 400 Paris France

Wecanalsoconstruct a DataFrame from a dictionary of Series objects. Say we have two


differentSeries; one for the price offruits and one for their quantity. We wantto put all the
fruits related data togetherinto a single table. We can dothislike so:

import pandas es pd
wavanaune

quantity = pd.Series([12, 24, 33, 15],


index=[ apples’, "bananas’, ‘strawberries’, ‘oranges’ ])
price = pd.Series([4, 4.5, 8, 7.5],
index=[ apples’, "bananas’, ‘strawberries’, ‘oranges’ ])
df = pd.DataFrame({ quantity’: quantity,
18 “price’: price})
11 print(df)

price quantity
apples 4.0 12
bananas 4.5 24
strauberries 8.0 33
oranges 7 15

3. Constructing a Dataframe by Importing Data From a


File
It’s quite simple to load data from variousfile formats, e.g., CSV, Excel, json into a DataFrame.
Wewill be importing actual data for analyzing the IMDB-moviesdataset in the nextlesson.
Hereis whatloading data from differentfile formats lookslike in code:

import pandas es pd
@Vounune

# Given we have a file called data1.csv in our working directory:


df = pd.read_csv(‘datal.csv")
#given json data
df = pd.read_json(‘data2.json")

Wehaveonlyjust scratched the surface andlearned howto construct DataFrames.In the next
lessons we will go deeper andlearn-by-doing the many methodsthat we can call on these
powerful objects.
Pandas DataFrame Operations - Read, View and Extract
Information

We'll coverthe following a

* Learning Pandas with IMDB-Movies Dataset


* Important DataFrame Operations
* 1. Reading Data From CSVs
* 2. Viewing the Data
© 3. Getting Information Aboutthe Data
* Jupyter Notebook

Learning Pandas with IMDB-Movies Dataset


Time for somereal fun! Wehavelearned howto create DataFramesalready. Now weare
going to explore the manyoperations that can be performed on them.

To makethis step more engaging and fun, weare going to work with the IMDB Movies
Dataset. The IMDBdataset is a publicly available dataset that contains information about
14,762 movies. Each row consists of a movie and for each movie wehave informationliketitle,
yearofrelease, director, numberof awards, rating, duration etc. Sounds fun to explore, right?
Let’s putour datascientist’s hat on, anddiveinto the world ofthe movies! 9 &

IMDBData from 2006to 2016 (Image Source: Kaggle)

Important DataFrame Operations


Wearegoing to go through the most important DataFrame operations for a datascientist to
know,oneby one:

1. Reading Data From CSVs


Thedataset for these lessonsis here on Kaggle. Once we have downloadedthe data, we can
load it using the DataFramecreation method we mentioned in the previouslesson.

#® Note: Once you have gone through these “IMDB-lessons”, I highly recommend you
downloadthis dataset andplay with it. It is really importantto get your hands dirty;
don’t just read through these lessons!

# You can also find the Juptyter Notebookwith all the code for these “IMDB-lessons”
on myGitprofile, here.

# You can find the live execution of the Jupyter Notebookat the endofthis lesson.

1 import pandas es pd
2
3 # Reading data from the downloaded CSV:
4 moviesdf = pd.read_csv("IMDB-Movie-Data.csv")

Wehavecreated a DataFrame, movies_df, with a default index. Whatif we want to be able to


access each rowbythetitle of the movie, and not by some integer index?

Onewaytosettitles as our index is by passing the column nameas an additional parameter,


index_col, to the read_csv methodatfile load time. The second wayis to do this at a later stage
by explicitly calling the set_index() method onthe created DataFrame. Wecansee this in the
codesnippet below.

Havingan indexallowsfor easy andefficient searches. Looking up rowsbased on index values


is like looking up dictionary values based on a key. For example, whenthetitle is used as an
index, we can quickly fetch the row for a particular movie by simply using its title for lookup
insteadoftryingto find out its row numberfirst.

Note that wearecreating both title-indexed and default DataFrames(we don’t need both), so
that we can understandthe indexing concept better by comparing thetwoas, in steps 2 and 4.

#1, We can set the index at load time


wewne

movies_dftitleindexed = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title’)


# 2. We can set the index after the DataFrame has been created
moviesdftitleindexed = moviesdf.set_index('Title’)

2. Viewing the Data


Once wehavecreated the DataFrame,a helpful first step is to take a sneak peekat thedata.
This helps us create a mentalpictureof the data and become morefamiliar with it.

Wecan use the head() method to visualize the first few rowsof our dataset. This method
outputs the first 5 rows of the DataFramebydefault, but we can pass the numberof rows we
wantas input parameter.

Let’s first view the non-explicitly indexed (default) DataFrame:

1 movies_d¥.head()
2
3 # To ouput the top ten rows
4 movies_df.head(10)

movies_df.head()

Rank Te Genre
escripton Director Runtime rang votes REVEMYE.
Actors Year (untme Revenue metascore

0 + usrdane
alaxy
rego eq
Mofine —ActonAaventreserri
meget
— James Chis Prat Vin Diesel,
Sacooper zoe
$ aot «12181 787074 30013 760
Following
12 Promeneus Adventrentysteysciri mankind cues ygy
‘tommeorgnat Riley NoamRapace,
PERSEGren Logan 2012 «azd_—=7.0 868201284850
tchol Fa
Tree gis re sore anes MeAvoy. Aya
2 3 ‘Split Horror Thriter mani a Shyamalan
“idnapped by 8M. Night =YOt~oy.
Taylor-Joy, Haley Lu 2016 nT Haley
73 187606 138.12 620
dag "
nactyot vate
3 & Sing AnmatonComeayFamiy «(UTNE Cetephe winerepoon,
iccongughyReese
Se a 2016 108=«72«GOSKS «TOS —=—580
rusting ea
45 suisce don Adventure Fata
SEE ActenAdvertreranisy ofS awd Ayer Ayer Wi“Marge
aCe! Oawd Sty, tare
Raibe, olaLet,isa 201820
"MorgetRabbe, a sa802
982 esTaY a0
‘someoth

Nowif weprint the rowsof the DataFrame with the explicit index, we cansee that the nameof.
the indexed column,“Title”, gets printed slightly lowerthantherest of the columns,and it is
displayed in place of the column which was showing row numbersin the previouscase (with
default index):

1 moviesdftitle_indexed.head()

novies_df_title_indexed.head()

Rank Genre Deserption Director Runtime ating votes EVENS aetascore


‘cors Year (Eumme
we
1 petonadvenurescori Ago of dames Chis cova
—-teowacse Prat. Vn Diesel aor gya 7570743904
ements criminals are Gunn Bradley Cooper, Zoe S.. 20" CO oe oi ae
‘ores
Prometieus 2 Adventuentsteysari Fotoweg
mankind
C0081 yy
"""wwongnot RACY NOOR RapceMchosl
araratcrwen Lagan 2012
ra «1247.0 BSH «12040——80
spit 9 Three ora3 gyyy ygge lanes
HorecTnnier wan_inapoedty Avy.Moyes
Tayordoy Aya 206 «17-73 1870061381220
wa 0p scar
Sng 4 AnmatonComedyFaMty suraig hhumanod
MITUCOS
van
Chsopne
COD Namen 2016
nucconaugheyReese
winerapoon, Seth a «10872, 60SAS «2702880

suicide Asecet
orverrent WaSme, Jade,
sulcide 5 ActonAtvertxe Fantasy STEN Dad ayer HSMM JoedLeln. 2016aig 23,62 872732502400
ome ot

After this simple visualization, we are already morefamiliar with our dataset. Now we know
which columns makeup our data and what the values in each column looklike. We can now
see that each rowin ourdataset consists of a movie andfor each movie we have information
like “Rating”, “Revenue”, “Actors” and “Genre”. Each columnisalso called a feature,attribute, or
a variable.

Wecanalso see that each movie hasan associated rank as well and that rowsare ordered by
the “Rank” feature.

Similarly, it can be useful to observethelast rowsof the dataset. Wecan do this by using the
tail() method.Like the head( method,tail() also accepts the numberof rows we want to view
as input parameter.

Let’s lookatthelast three rowsof our dataset (worst movies in termsof rank):

1 moviesdftitle_indexed.tail(3)

movies_df_title_indexed.tail(3)

rank Gone Description Director Actors Year uuntie rating votes RAVE". etascore
we
ELLE
‘Step Up2: om ormauscronane Romante “evespans mn vodr
occur ae
two dance a Robert Heian,
EvanGansBrana
vigan, Cassie 2008 862 Toe enor i 00
Search oop verte. Comacy APM O XtendsHlendgembarkenban Sct Me 2014
ey MtPay, TJThomas -99«86 ABB} NAN 20
y cna missontoreun” ASO) gage
Wine Lives 1000 comeayFamiyFartaay Asuty busiesaman conn
‘inde mseapped .
gary Ken Spacey,
Gomer dente
Rate 206
Tanstch e783 12435 «96H

3. Getting Information About the Data


As first step, it is recommended to get the 10000-foot view of the data. We aregoing to look
at twodifferent methodsfor getting this high-level view: info() and describe() .

a. infoQ:This method allowsusto get someessential details aboutourdataset, like the number
of rows and columns, the numberofindex entries within that index range,thetype ofdata in
each column,the numberof non-null values, and the memory used by the DataFrame:

1 # This should be one of the very first commands you run after loading your data:
2 moviesdftitle_indexed.info()

movies_df_title_indexed.info()
<class ‘pandas.core.frame.DataFrame' >
Index: 10@0 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank 1900 non-null intea
Genre 1000 non-null object
Description 1000 non-null object
Director 1000 non-null object
Actors 1000 non-null object
Year 1000 non-null intea
Runtime (Minutes) 1000 non-null intea
Rating 1900 non-null floatea
votes 1900 non-null intea
Revenue (Millions) 872 non-null floatea
Metascore 936 non-null floates
dtypes: floatea(3), inte4(4), object(4)
memory usage: 93.8+ KB

As wecan see from the snippet above,ourdatasetconsists of 1000 rows and 11 columns,andit
is using about 93KB of memory. An important thingto noticeis that we have two columnswith
missing values: “Revenue” and “Metascore”. Knowing which columns have missing valuesis
importantfor the next steps. Handling missing data is an importantdata preparationstep in
anydata science project; moreoften than not, we need to use machine learning algorithms and
methods for data analysis that are not able to handle missing data themselves.

The outputof the info() methodalso allowsshowsus if we have any columns that we expected
to be integers butareactually strings instead. For example, if the revenue had been recorded
as string type, before doing any numerical analysis on that feature, we would have needed to
convertthe values for revenuefromstringto float.

-shape: This is a fast and useful attribute which outputs a tuple, <rows, columns>, representing
the numberof rows and columnsin the DataFrame. Thisattribute comes in very handy when
cleaning and transformingdata. Say we had filtered the rows based on somecriteria. We can
use shape to quickly check how manyrowsweareleft with in thefiltered DataFrame:

moviesdftitleindexed. shape
# Output: (1088, 11)
## Note: .shape has no parentheses and is a simple tuple of format (rows, columns).
# From the output we can see that we have 100@ rows and 11 columns in our movies DataFrame.

b. describeQ:Thisis a great method for doing a quick analysis of the dataset. It computes
summary statistics of integer/doublevariables andgives us somebasicstatistical details like
percentiles, mean, and standarddeviation:

1 # We can do a quick analysis of any data set using:


2 moviesdf_title_indexed.describe()

movies_df_title_indexed.describe()

Rank Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore


count 1000.000000 1000.000000 1000.000000 1000.000000 1.000000e+03 872.000000 936.000000
mean 500.500000 2012.783000 113.172000 6.723200 1.698083e+05, 82.956376 58985043
std 288.819436 3.206962 18810908 0.945429 1.887626e+05 103.253540 17.194757
min 1.000000 2006.000000 66.000000 1.900000 6.100000e+01 0.000000 —11,000000
25% 250.750000 2010.000000 100,000000 6.200000 3.630900e+04 13.270000 47.0000
50% 500.500000 2014.000000 111.000000 6.800000 1.107990e+05 47985000 §9.500000
75% 750.250000 2016.000000 123,000000 7.400000 2.399098e+05 113.715000 72.0000
max 1000.000000 2016.000000 191.000000 9.000000 1.791916e+06 936.630000 100.000000

Wecansee that wehavea lot of useful high-level insights about our data now. For example, we
can tell that our dataset consists only of movies from 2006 (min Year) to 2016 (max Year). The
maximum revenue generated by any movie during that period was 936.63M USD while the
mean revenue was 82.9M USD. Wecan analyze all the other features as well and extract
important informationlike a breeze!

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Use a Jupyter NoteBook?

* Click on “Click to Launch” g7 button to workandsee the code running live in the
notebook.

+ You can click to openthe Jupyter Notebookin a new tab.

* Go to File and click Download as and then choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

+ A Thenotebooksession expiresafter 15 minutes ofinactivity.

‘Your app can be found at: httos://vvq3ere1mk42-live-app.educative.runnotebooks/Read%2CView%26Extractinformation.ipynb G

Click to launch app!

Note: We will talk aboutthesestatistics concepts in very detail later, so don’t worry
if any of these stats soundsalien toyou.
Pandas DataFrame Operations - Selection, Slicing, and Filtering

We'll coverthe following a

4. DataSelection andSlicing

5. Conditional Data Selection and Filtering


* Jupyter Notebook

4. Data Selection and Slicing


Wehavelearned howto get a high-level view of our data and somebasic data summaries. Nowlet’s
focus on some moreinteresting DataFrame manipulation techniques, performing data selection,
slicing, and extraction. Fora clearer understanding, wewill look at working with columns and
then wewill learn how to manipulate DataFrames row-wise.

Oneimportant thing to rememberhereis that although manyof the methods can be applied to both
DataFrame and Series, these two havedifferent attributes. This means we need to know which type
of object we are working with. Otherwise, we can end up with errors.

a. Working With Columns

Wecan extract a column byusing its label (column name) and the square bracketnotation:

1 genre_col = movies_d¥[’Genre’]

The abovewill return a Series object. If we want to obtain a DataFrameobject as outputinstead,


then weneedto pass the column name(s)as a list (double square brackets), as shown below:

# We can select any column using its label:


wavanaune

# To obtain a Series as output


col_as_series = movies_df['Genre']
# Print the object type and the first 5 rows of the series
print(type(col_as_series))
col_as_series.head()
18
11 # To obtain a dataFrame as output
12 col_as_df = movies_df[['Genre"]]
13
14 # Print the object type and the first 5 rows of the DF
415 print(type(col_as_df))
16 col_as_df.head()

# To obtain a Series as output # To obtain a dataFrame as output


col_as_series = movies_df['Genre'] col_as_df = movies_df[["Genre"]]
print (type(col_as_series)) print (type(col_as_df))
col_as_series.head() col_as_df.head()
<class ‘pandas.core.series.Series'> <class ‘pandas.core.frame.DataFrame’ >
@ Action, Adventure, Sci-Fi
1 Adventure,Mystery, Sci-Fi Genre
2 Horror, Thriller Gaensernl
3 Animation, Comedy, Family
4 Action, Adventure, Fantasy 1 Adventure,Mystery.Sci-Fi
Name: Genre, dtype: object a
AnimationComedy,Family
ActionAdventure,Fantasy

If we wantto extract multiple columns, wecan simply add additional column names tothelist.

#Since it’s just a list, adding another column name is easy:


RUN

extracted_cols = movies_dftitleindexed[[ ‘Genre’, ‘Rating’, ‘Revenue (Millions)']]


extracted_cols.head()

#Since it's just a List, adding another column name is easy:


extracted_cols = movies_df_title_indexed[['Genre’, ‘Rating’, ‘Revenue (Millions)']]
extracted_cols.head()

Genre Rating Revenue (Millions)


Tite
Guardians of the Galaxy ActionAdventure.Sc-Fi 8.1 393.13
Prometheus Adventure.Mystery.Sci-Fi 7.0 126.46
split HororThriler 7.3 138.12
Sing Animation,Comedy.Family 7.2 270.82
Suicide Squad ActionAdventure,Fantasy 6.2 325.02

Notice the difference when weusethe indexed DataFrame(“movies_df_indexed”) vs default-indexed


one (“movies_df”): we havean index ontitle so in thelast snippet, movietitles are getting displayed
instead of row numbers.

b. Working With Rows

Nowlet’s look at how to perform slicing by rows. Here wehaveessentially the following indexers:

* loc: the loc attribute allows indexing andslicing that always referencesthe explicit index,i-e.,
locates by name. For example, in our DataFrameindexedbytitle, we will usethetitle of the
movieto select the required row.
* iloc : theiloc attribute allows indexing andslicing that always references the implicit Python-
style index,i.e., locates by numerical index. In the case of our DataFrame,wewill pass the
numerical index of the movie for which weare interested in fetching data.
* ix: this is a hybrid of the other two approaches. Wewill understandthis better by looking at
some examples.

# With loc we give the explicit index. In our case the title, “Guardians of the Galaxy”
wewne

gog = moviesdftitleindexed.loc["Guardians of the Galaxy"]


# With iloc we give it the numerical index of “Guardians of the Galaxy":
gog = moviesdftitleindexed.iloc[o]

# With Loc we give the explicit index. In our case the title, "Guardians of the Galaxy”:
gog = movies_df_title_indexed.loc["Guardians of the Galaxy"]
g0g
Rank. 1
Genre Action,Adventure, Sci-Fi
Description A group of intergalactic criminals are forced ...
Director James Gunn
Actors Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
Year 2014
Runtime (Minutes) 121
Rating 8.1
Votes 787074
Revenue (Millions) 333.13
Metascore 76
Name: Guardians of the Galaxy, dtype: object

# With iloc we give it the numerical index of “Guardians of the Galaxy”:


gog = movies_df_title_indexed.iloc[9]
Bog
Rank 1
Genre Action,Adventure, Sci-Fi
Description A group of intergalactic criminals are forced ...
Director James Gunn
Actors Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
Year 2014
Runtime (Minutes) 121
Rating 8.1
Votes 757074
Revenue (Millions) 333.13
Metascore 76
Name: Guardians of the Galaxy, dtype: object

Wecan also get slices with multiple rowsin the same manner:

multiple_rows movies_dftitle_indexed.loc[ ‘Guardians of the Galaxy’: ‘Sing’ ]


multiple_rows movies_dftitle_indexed.iloc[@:4]
multiple_rows

multiple_rows = movies_df_title_indexed.loc[ ‘Guardians of the Galaxy’ sing’)


multiplerows = movies_df_title_indexed.iloc(o:4]
multiple_rows

Rank Genre Description Director Runtime Rating Votes fame,


‘Actors Year quumuteny Revenue Metascore
Title
‘Agroup of
Guardians of riue-sc-r _—_‘lergalactc James Chis Prat, Vin Diesel
theGalaxy 1 ActionAdventuraSc-Fi crimpais are Gunn Bradley Cooper, Zoe's... 204 1218.4 TO7OTA 388.13 ree
forced
Following cues to ygy NoorRapece, Logan
Prometheus 2 Adventurenysteysc-fi mecrgnot PHY ygrhan-creen Michael 2012 124—=«7.0. «4868201284865
mankind, ate Fa,
Three gs are yy yjgny James McAvoy. Anya
g,Minont —Tayior-doy, Haley Lu 2016 1177.3. 187606 138.12 «620
spit 3 with a diag...bya SY
HorrozThniler rmanKidnapped Richar.
inacty of Matthew
Sing 4 Animation. Comedy Family humanoida Christophe
animals, econaugheyReese 2016 108 72«G0SAS 270.32 590
fustng tee Lourgelet \witnerspoon, Seth Ma,

If we donot wantto selectall the columns, we can specify both rows and columnsat once;thefirst
indexrefers to rows while the second one(after the coma) to columns:

Remember: thedot notationis start:step:end. If we just have somethinglike:4, it meansthe starting


pointis the Oth index.

1 # Select all rows uptil ‘Sing’ and all columns uptil ‘Director’
2 movies_df_title_indexed.loc[:'Sing’, :'Director’]
3 moviesdftitleindexed.iloc[:4, :3]

movies_df_title_indexed.loc[: ‘Sing’, :‘Director’]

Rank Genre Description Director


Title
Guardians of the Galaxy 1 Action,Adventure.SciFi A groupof intergalactic criminals are forced James Gunn
Prometheus 2 Adventure,Mystery’SciFi Followingclues tothe origin of mankind, ate Ridley Scott
split 3 Horror,Thriler Three gis are kidnapped by aman witha diag... M. Night Shyamalan
Sing 4 Animation,Comedy.Family Ina city of humanoid animals, a hustlingthea... Christophe Lourdelet
movies_df_title_indexed.iloc[: ) 23]
Rank Genre Description
Title
Guardiansofthe Galaxy ‘Action Adventure,Sci-Fi A groupofintergalactic criminals are forced
Prometheus 2 Adventure.Mystery.Sci-Fi Following clues to theorigin of mankind, a te
split 3 Horror,Thriler Threegirls arekidnapped by a man with a diag,
Sing 4 Animation,Comedy,Family In city of humanoid animals, a hustling thea,

Nowlet’s lookat the hybrid approach,ix. It’s just like the other two indexing options, except that we
can use a mix of explicit and implicit indexes:

1 # Select all rows uptil Sing and all columns uptil Director
2 moviesdftitleindexed.ix[:'Sing’, :4]
3 moviesdftitleindexed.ix[:4, :"Director’]

5. Conditional Data Selection and Filtering


Wehavelooked atselecting rows and columnsbasedonspecific indices. But what if we don’t know
the index (implicit or explicit) of the row that we wantto perform dataselection orfiltering based
on someconditions on?

Say we wantto filter our movies DataFrame to show only movies from 2016 or all the movies that
had rating of more than 8.0?

Wecan apply boolean conditionsto the columnsin our DataFrame asfollows:

# We can easily filter rows using the values of a specific row.


onunwner

# For example, for geting all our 216 movies:


movies_df_title_indexed[movies_df_title_indexed[ Year" ] 2016]
# All our movies with a rating higher than 8.0
moviesdf_title_indexed[movies_dftitleindexed[‘Rating’] > 8.0 ]

Nowlet’s look at some more complex filters. We can make our conditions richer with logical
operators like “|” and “&”.

Say we wantto retrieve the latest movies (movies released between 2010 and 2016) that had a
very poorrating (scoreless than 6.0) but were among thehighestearners at the boxoffice
(revenue abovethe 75th percentile). We can write our queryas follows:

1 moviesdftitle_indexed[
2 ((movies_df_title_indexed['Year'] >= 2010) & (movies_df_title_indexed[‘Year’] <= 2016))
3 & (movies_df_title_indexed[ ‘Rating'] < 6.0)
4 & (movies_df_title_indexed[ ‘Revenue (Millions)'] > movies_df_title_indexed[ ‘Revenue (Millions)"].quan
5 ]

movies_df_title_indexed[
((movies_df_titleindexed{'Vear"] >= 2010) & (movies_df_title_indexed[ "Year"] <= 2016))
& (movies_dftitleindexed{ 'Rating'] < 6.0)
& (movies_df_title_indexed{ ‘Revenue (Millions)'] > moviesdf_title_indexed[ ‘Revenue (Millions)'].quantile(@.75))

Rank Genre Description Director Runtime Rating Votes (Ramee


Actors Year quit’ Revenue Metascore
Title
Literature student Sam Dakota Johnson,
64 DramaRomance, Theiler ‘Anastasia Taylor Jamie Dorman, 2016 125 41 Dasara 166.18 460
Steeles life cna... Johnson Jenner Ehle EL
Foltoning a ghost Weise Mecarthy.
Ghostbusters 00 Acton comeayranosy FSBOrata pauirelg ‘Krtenicon, ig Kale 2018 110««89MAT7IT_— 1283800
Tranatonmers 127 Abs
Acton Adventure So-#|_ escape ght must
Fom Mina NeoletzMar ise, ack 2014 16587256488 2484s 20
ee ‘a bounty hunte. ey Reynor, Stan.
TeBreaking
TalghtDawn- 967 Atertnebithof
Adventue,DramaFantasy Renesmee. he
gy)Bil pobertPatineon,
—_Kiston Stewart, 2012 1S 19432929230 «820
Pant Gatos gabe yar
Grown Upe2 295 Atermo%ranie ogg
comecy famiypacrons O28 KemAtom Sande,
romeioun' ok Davecore
iomen 8 2013 10184 Ti4dea S367 100
Clashotthe 575 pctonAsvenureFanisy Peedried ""vsoncromm, L048 am“""llSunecuon,
battles the mini,
Wertingon,
et Ralph Fiennes.Ja.
2010 108 «<8 aos tea19 «380

Theresultstell us that “Fifty Shades of Grey” tops the list of movies with the worst reviewsbut the
highest revenues! In total there are 12 movies that match these criteria.

Note that the 75th percentile was given to us earlier by the .describe() method (it was 113.715M $),
andtheseareall movies with revenue abovethat.

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Usea Jupyter NoteBook?

Click on “Click to Launch” g7 button to work andsee the code running live in the
notebook.

‘You can click ZG to openthe Jupyter Notebookin a new tab.

Go to File and click Downloadas andthen choosethe formatofthefile to download &.


You can choose Notebook(.ipynb) to downloadthefile and worklocally or on your
personal Jupyter Notebook.

A\ The notebooksession expires after 15 minutes ofinactivity.

‘Your app can be found at: https://vvq3ere1mk42-live-app.educative.runinotebooks/Selection%2CSlicing%26Filtering.ipynb G

Click to launch app!


Pandas DataFrame Operations - Grouping and Sorting

We'll coverthe following =A

* 6. Grouping
* 7. Sorting
* Jupyter Notebook

6. Grouping
Thingsstartlooking really interesting when we group rowswith certain criteria and then
aggregate theirdata.

Say we wantto group ourdataset by director and see how much revenue(sum)each director
earnedat the box-office and then also look at the average rating (mean)for each director.
Wecandothis by using the groupby operation on the column ofinterest, followed by the
appropriate aggregate (sum/mean),like so:

1 # Let’s group our dataset by director and see how much revenue each director has
2 movies_df.groupby( ‘Director’ ).sum()
3
4 # Let’s group our dataset by director and see the average rating of each director
5 movies_df.groupby( ‘Director’)[[ ‘Rating ]].mean()

for example, Let's growp ourdataset by director and see how auch revenue ach director has: =aeies_4 af groupby(
growpby("bireckor DEL tating'I].Rating’ 11-mean()
econ
tories.df. grupty(“Dsrector")-sum()
Rank Year Runtime [ainues} Rating Votes. Revenve(ions), Metscore ating

‘amiennan 92 aor ie 8s oar ize eae


Adam non 722018 SS] os ome ‘Adam Leon 6500000
gon exay 910 ea30 4 mmo aneez7 wom7 ze ‘Adam MeKay 7.00000
‘Adem Shankman 1480. 4018 20 128 1687 temas rato paieimeNere
Adam ings 14544050 1 Ne ans no m0 ‘Adam wingard_ 5.900000
oreo Posen 552 2018 a) ox me ‘Monso Poyart 6400000
ising Wan 630 2016 us 78 020 eae ‘isting Walsh 7800000,
ianStayer 157 2016 es ome as ‘Atan Satayey_ 6300000
ua senor 554 2016 er ars sso ‘AtivaSchafer 6.700000
Aan tor 429° 4028 ) m0 ‘Man Tayler 6750000,
Abert Hughes 422 2010 va 69 zea sir mo Atvert ges 6900000

As wecan see, Pandas groupedall the ‘Director’ rows by name into one. And since we used
sum()for aggregation, it added togetherall the numerical columns. The values for each of the
columns nowrepresentthe sum of values in that column for that director.

For example, wecan see that the director Aamir Khan hasa very high averagerating (8.5) but
his revenue is much lower comparedto manyotherdirectors(only 1.20M $). This can be
attributed to thefact that wearelookingat a dataset withinternational movies, and Hollywood
directors/movies have understandably much higher revenues comparedto movies from
internationaldirectors.

In addition to sum() and mean() Pandas provides multiple other aggregation functionslike
min() and max().

Alt! Can you find a problem in the code when we apply aggregation to get the sum?

Thisis not the correct approachforall the columns. Wedo not wantto sum all the ‘Year’ values,
for instance. To make Pandas apply the aggregation on someofthe columnsonly, we can
specify the name ofthe columnsweareinterested in. For example, in the second example, we
specifically passed the ‘Rating’ column,so that the meandid not get applied to all the columns.

Groupingis an easy and extremely powerful data analysis method. Wecan useit to fold
datasets and uncoverinsights from them, while aggregationis oneof the foundationaltools of
statistics. In fact, learning to use groupby() to its full potential can be oneof the greatest
usesof the Pandaslibrary.

7. Sorting
Pandasallowseasy sorting based on multiple columns. Wecan apply sorting on theresult of
the groupby(operation or we can applyit directly to the full DataFrame. Let’s see this in action
via two examples:

1. Say we wantthe total revenueperdirector andto have our results sorted by earnings,
not in alphabeticalorderlike in the previous examples.

Wecanfirst do a groupby() followed by sum() (just like before) and then wecancall
sort_values ontheresults. To sort by revenue, we need to pass the name of that column as
inputto the sorting method; wecan also specify that we wantresults sorted from highest
to lowest revenue:

1 #Let’s group our dataset by director and see who earned the most
2 movies_df.groupby( ‘Director’ )[[ ‘Revenue (Millions)"]].sum().sort_values(['Revenue (Millions)"], asc

#Let’s group our dataset by director and see who earned the most
movies_df.groupby( ‘Director')[['Revenue (Millions)"]].sum()-sort_values([ ‘Revenue (Millions)'], ascending=False)

Revenue(Millions)
Director
J. Abrams 1683.45
David Yates 162051
Christopher Nolan 1515.09
Michael Bay 142132
Francis Lawrence 1290.81
Joss Whedon 1082.27
Jon Favreau 1025.60
Zack Snyder 975.76
Peter Jackson 26045

2. Now, say we wantto see which movies had both the highest revenue andthe highest
rating:

Any guesses?!

1 # Let's sort our movies by revenue and ratings and then get the top 10 results
2 data_sorted = movies_df_title_indexed.sort_values(['Revenue (Millions)', 'Rating’], ascending-False)
3 data_sorted[['Revenue (Millions)*, ‘Rating’ ]].head(1@)

‘Pandas allow easy sorting based on multiple colums as in this example


Hlet's sort our movies by revenue and ratings and then get the top 10 results
data_sorted = movies_éf_indexed.sort_values([ ‘Revenue (Millions)', Rating'], ascending=False)
data_sorted{[*Revenue (Millions)", ‘Rating’ ]] -head(10)
Revenue (Millions) Rating
te
‘StarWare: Episo Force Awakens wees 8
avatar 705178
Jurassic World 652.1870
‘The Avengers. ou28 8
‘The Dark Knight 520290
Rogue One sear 79
Finding Dory 4902074
Avengers: Age of Ultron 4se00 74
The Dark Knight Rises 440.1388
‘The Hunger Games: Catching Fire a2ass 76

Wenow knowthat J.J. Abramsis the director who earned the most$$$at the boxoffice,
andthat Star Warsis the movie with the highest revenue andrating, followed by some
other very popular movies!

Star Wars: The Force Awakens(Image Source: starwars.com)

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Use a Jupyter NoteBook?

* Click on “Click to Launch” g7 button to workandsee the code running live in the
notebook.

+ You can click to openthe Jupyter Notebookin a new tab.

* Go to File and click Download as and then choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

+ A Thenotebooksession expiresafter 15 minutes ofinactivity.

‘Your app can be found at: httos://vvq3ereLmk42-live-app.educative.run/notebooks/Grouping%26Sorting.ipynb G

Click to launch app!


Pandas DataFrame Operations - Dealing With Missing and
Duplicates

We'll coverthe following a

* 8, Dealing With Missing Values


a. Detecting Null Values

b, Dropping Null Values


c. Imputation(Filling Null Values)
* 9. Handling Duplicates
e 10. Creating New Columns From Exi:

* Jupyter Notebook

8. Dealing With Missing Values


Thedifference between fake data andreal-worlddatais thatrealdata is rarely clean and
homogeneous. Oneparticularissue that we need to tackle when working with real datais that
of missing values. Andit’s not just about values being missing,different data sources can
indicate missing values in different waysas well.

Thetwoflavors in which wearelikely to encounter missing or null valuesare:

« None: A Pythonobject that is often used for missing data in Python. Nonecan only be used
in arrays with data type ‘object’ (i.e., arrays of Python objects).
+ NaN (Not a Number): A specialfloating-point valuethatis used to represent missing data.
floating-point type meansthat, unlike with None’s objectarray, we can perform
mathematical operations. However, rememberthat, regardlessofthe operation,the result
of arithmetic with NaN will be another NaN.

Runthe examples in the code widget below to understandthedifference between the two.
Observethat performing arithmetic operationson the array with the Nonetype throwsa run-
timeerrorwhile the code executes without errors for NaN:

import numpy as np
wavanaune

import pandas es pd
# Example with None
None_example np.array([@, None, 2, 3])
print("dtype None_example.dtype)
print(None_example)
# Example with NaN
10 NaN_example = np.array([@, np.nan, 2, 3])
11 print("dtype Natl_example.dtype)
12 print(NaN_example)
13
14 # Math operations fail with None but give NaN as output with NaNs
15 print("Arithmetic Operations”)
16 print("Sum with NaNs:", NaN_example.sum())
17 print("Sum with None:", None_example.sum())

1.548
dtype = object
[0 None 2 3]
dtype = floated
[0. nan 2. 3.]
Arithmetic Operations
Sum with NaNs: nan

Traceback (most recent call last):


File "main.py", line 17, in <module>
print ("Sum with None None_example.sum())
File "/usz/local/1ib/python3.5/dist-packages/numpy/core/_methods.py", line 32, in _sum
return umr_sum(a, axis, dtype, out, keepdins)
TypeError: unsupported operand type(s) for +: ‘int’ and ‘NoneType’

Pandasis built to handle both NaN and None,andittreats the two asessentially
interchangeablefor indicating missing or null values. Pandasalso provides us with many
useful methods for detecting, removing, and replacing null values in Pandasdata structures:
isnull(), notnull(), dropna(), and fillna() . Let’s see all of these in action with some
demonstrations.

a. Detecting Null Values

isnull() and notnull() are two useful methodsfordetecting null data for Pandas data
structures. They return a Boolean maskoverthe data. For example,let’s see if there are any
movies for which we have some missing data:

1 moviesdftitleindexed. isnull()

moviesdftitleindexed. isnull()

Rank Genre Description Director Actors Year Runtime Rating votes


(mas Revenue metascore
Frameee
Tie
Guardians ofthe Galaxy Fake Fase False False False. False False False False Fake False
Prometheus False Fase False False. False False False False False Fae False
‘Spit False Fase False False. False False False False False Fase False
Sing False Fase False False. False False False False False Fale False
Suicide Squad False False False False False. False False False False Fake False
The GreatWall False Fase False False False False False False False Fale False
Lalaland Fase Fase False False False False False False False Fae False
Mindhorn False Fase False False False False False False False Tue False

As wecan see from the snippet of the Boolean mask above, isnull( returns a DataFrame where
eachcell is either True or False depending on that cell’s missing-valuestatus. For example, we
can see that we do not havethe revenueinformation for the movie “Mindhorn”.

Wecan also count the numberofnull valuesin each column using an aggregate function for
summing:

1 moviesdftitleindexed. isnull().sum()

moviesdf titleindexed. isnull().sum()


Rank °
Genre e
Description e
Director e
Actors e
Year e
Runtime (Minutes) °
Rating e
votes e
Revenue (Millions) 128,
Netascore 64
dtype: intea

Nowweknowthat wedo not knowthe revenuefor 128 movies and metascorefor 64.

b. Dropping Null Values

Removing null values is very straightforward. However,it is not always the best approach to
deal with null values. And here comesthe dilemmaof dropping vs imputation,replacing nulls
with somereasonable non-null values.

In general, dropping shouldonly be performed whenwehavea small amountofnull data


because we cannot just drop single values from the DataFrame — dropping means removing
full rowsorfull columns.

dropna() allowsusto very easily drop rowsor columns. Whetherweshould go by rows or


columns depends onthedataset at hand;thereis no rule here.

« By default, this method will dropall rowsin which anynull valueis present and return a
new DataFramewithoutalteringtheoriginal one. If we want to modify our original
DataFrameinplace instead, we can specify inplace=True.

« Alternatively, we can drop all columnscontaining any null values by specifying axis=1.

# Drop all rows with any missing data


wewne

movies_df_title_indexed.dropna()
# Drop all the columns containing any missing data
movies_df_title_indexed.dropna(axis=1)

Whatdoes dropping data meanfor our IMDBdataset?

« Dropping rows would remove 128 rows whererevenueis null and 64 rows where
metascoreis null. This is quite somedata losssince there’s perfectly good data in the other
columns of those dropped rows!
« Dropping columns would removethe revenue and metascore columns — not a smart
moveeither!

To avoidlosingall this good data, we can also chooseto drop rowsor columns based on a
threshold parameter, drop onlyif the majority of data is missing. This can hespecified using
the howorthresh parameters, which allow fine control of the numberofnulls to allow in
through the DataFrame:

# Drop columns where all the values are missing


onunwner

d¥.dropna(axis="columns’, how="all")
# Thresh to specify a minimum number of non-null values
# for the row/column to be kept
df .dropna(axis="rows", thresh=10)

c. Imputation (Filling Null Values)

As wehavejust seen, dropping rowsor columnswith missing data can result in a losing a
significant amountofinteresting data. So often, rather than dropping data, we replace missing
values with a valid value. This new value can be a single number,like zero,or it can be some
sort of imputation orinterpolation from the good values,like the mean or the median value of
that column.For doing this, Pandas providesus with the very handy fillna() method for
doing this.

For example,let’s impute the missing values for the revenue column using the mean revenue:

# Getting the mean value for the column:


wavanaune

revenue = movies_df_title_indexed[ ‘Revenue (Millions) ]


revenue_mean = revenue.mean()
print("Mean Revenue:", revenue_mean)
# Let's fill the nulls with the mean value:
revenue.fillna(revenue_mean, inplace=True)
10 # Let's get the updated status of our DataFrame:
11 movies_df_title_indexed.isnull().sum()

# Getting the mean value for the column:


revenue = movies_df_indexed[ ‘Revenue (Millions)"]
revenue_mean = revenue.mean()

revenue_mean
82.95637614678897

# Let's fill the nulls with the mean value:


revenue. fillna(revenue_mean, inplace=True)
# Let's get the updated status of our DataFrame:
movies_d#_indexed.isnull().sum()
Rank
gLessssscc00

Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
dtype: intea

Wehavenowreplaced all the missing values for revenue with the meanofthe column, and as
wecan observefrom the output,by using inplace=True we have modified theoriginal
DataFrame — it has no morenulls for revenue.

Note: While computing the mean, the aggregate operation did notfail, even if we had missing
values, becausethe dataset has missing revenues denoted by NaN, as showninthe snippet
below:

movies_df_title_indexed. loc[ ‘Mindhorn* ]


Rank 8
Genre Comedy
Description A has-been actor best known for playing the ti...
Director Sean Foley
Actors Essie Davis, Andrea Riseborough, Julian Barrat...
Year 2016
Runtime (Minutes) 89
Rating 6.4
Votes 2498
Revenue (Millions) NaN
Metascore 71
Name: Mindhorn, dtype: object
Revenuevalue before imputation was NaN.

This wasa very simple way of imputing values. Insteadof replacing nulls with the meanofthe
entire column, a smarter approach could have beento be morefine-grained — we could have
replacedthe null values with the mean revenuespecifically for the genreof that movie,instead
of the meanforall the movies.

9. Handling Duplicates
Wedonot have duplicate rowsin our moviesdataset, butthis is not always the case. If we do
have duplicates, we wantto make surethat weare not performing computations,like getting
thetotal revenueperdirector, based on duplicate data.

Pandasallowsusto very easily remove duplicates by using the drop_duplicates() method. This
method returnsa copy of the DataFramewith duplicates removed unless wechooseto specify
inplace=True,just like for the previously seen methods.

Note:It’s a good practiceto use .shape to confirm the changein numberofrowsafter the
drop_duplicates() methodhas been run.

10. Creating New Columns From Existing Columns


Often while analyzing data, wefind ourselves needing to create new columnsfrom existing
ones. Pandas makes this a breeze!

Say we wantto introduce a new columnin our DataFramethat has revenue per minute for
each movie. Wecandividethe revenueby the runtimeandcreate this new columnvery easily
like so:

# We can use ‘Revenue (Millions)* and ‘Runtime (Minutes)' to calculate Revenue per Min for each.mov|
wewne

movies_df_title_indexed[ ‘Revenue per Min’] =


movies_df_title_indexed[ ‘Revenue (Millions)']/movies_d¥title_indexed[ ‘Runtime (Minutes) "]
movies_df_title_indexed.head()

# Let's use the ‘Revenue (MiLLions)' and ‘Runtime (Minutes)’ to catcutate Revenue per Min for each movie
novies_df_titleindexed"Revenue per Min"] = movies_éftitleindexed[ ‘Revenue Millions)" ]/moviesdftitle_indexed{ Runtime (
novies_df_title_indexed.head(),

Rank Genre Description Director funtme rating votes EVEmetascore


Actors ear (lutte Revenue Revense
Revenue
Tite
Guar
wariane
forme 1 AconAdvenurescihi intergaacae
itepalacte
"“crmnas “ZTS ra Prat.
Daal Pra, Bradey
VsVin 2014 121,81 T5TO7A 35919 700 2759140
calaxy aretorces coop Zee 8
Fotowing
Prometheus 2 Advenureaysteyserri mannches tore‘SEY Logan
“ongnor Noonrshat:
Rapace 2012 «1247.0. 485820 Ha848 0.018890
8 Green tacos Fa
ve
Tree ais
Seed James McAvoy, Anya
spit 3 HonecTviter Wisnapped
yamen MANN
Shyamaian “ToveyJoy, Maley
MayLu 2018 «1177.3. 1876061381220. 1.180813
wane
ag.
Inaciyot athe
Sing fumanos Chisaphe MeconauanexFeese
4 AnmatoncomeayFemly srima.a arg 9g 7.2 gost 27032890. 2502069
sing a
thee
A secret sored
5 Acton Adventure Fantasy eoue "agency OavdAjer Letowasmh
government MagetRcboe,
oa 0 2016 123««6? 395727 250200 264249
some ofth

From the snippet above, wecan see that we have a new column atthe end of the DataFrame
with revenueper minute for each movie. This is not necessarily useful information,it was just
an example to demonstrate howto create new columns based onexisting data.

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Use a Jupyter NoteBook?

Click on “Click to Launch” g7 button to work andsee the code running live in the
notebook.

You canclick ZG to openthe Jupyter Notebookin a new tab.

Go to File andclick Downloadas andthen choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

A\ Thenotebooksession expires after 15 minutes ofinactivity.

‘Your app can be found at: httos://vvq3ere 1k42-live-app.educative.run/notebooks/DealMissing%26DuplicateValues.ipynb G

Click to launch app!


Pandas DataFrame Operations - Pivot Tables and Functions

We'll coverthe following a

* 11. Pivot Table


» 12. Applying Functions
* Jupyter Notebook
© Final Thoughts

11. Pivot Table


Wehaveseen how groupinglets us explorerelationships within a dataset. A pivot table is a
similar operation. You have probably encountered it in spreadsheets or some other programs
that operateontables. If you have ever worked with Excel, Pandas can be used to create Excel
style pivot tables.

Thepivottable takes simple column-wise data as input, and groups theentriesinto a two-
dimensionaltable to give a multidimensional summary of the data. Hard to understand?
Let’s understandthe concept with an example!

Say we wantto comparethe $$$ earnedby the variousdirectors per year. We can create a
pivottable using pivot_table ; wecanset index =‘Director’ (row ofthe pivottable) and get the
yearly revenueinformation by setting columns ‘Year’:

1 # Let's calculate the mean revenue per director but by using a pivot table instead of groupby as se
2 moviesdftitle_indexed.pivot_table( ‘Revenue (Millions)', index="Director’,
3 aggfunc='sum’, columns="Year").head()

The aggfuncparametercontrols what type of aggregation is applied (meanby default). As in


groupby,this can be a string representing one of the many commonchoices,like ‘sum’, ‘mean’,
‘count’, ‘min’, ‘max’.

#Let's calculate the mean revenue per director but by using a pivot table instead of groupby as seen previously
movies_df_title_indexed.pivot_table('Revenue (Millions)', index='Director',aggfunc='sum', columns="Vear').head()

Year 2006 2007 2008 2009-2010 2011 2012 2013 2014 2015 2016,
Director
‘AamirKhan NaN 120 NaN NaN NaN NaN NaN NaN NaN NaN NaN
‘Abdellati Kechiche NaN NaN NaN NaN NaN NaN NaN 22 NaN NaN Naty
‘Adam Leon NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
‘Adam MeKay 148.21 NaN 100.47 NaN 11922 NeN NaN NeN NaN 70.24 NaN
‘Adam Shankman NaN 11882 NaN NaN NaN NaN 3851 NaN NaN NaN NaN

From our pivot table, we can observethat for Aamir Khan wehaveonly revenue for 2007. This
can imply twothings: 2007 wastheonly yearin this 10 year period whenanyofhis movies got
released or that we simply do not have completedata for this director. Wecan also see that
Adam McKayhasthe most moviesoverthis ten year period,withhis highest revenuebeing in
2006and lowest in 2015.

With a simple pivottable, we can see the annual trendin revenueperDirector;pivot tables
are indeed a powerful tool for data analysis!

12. Applying Functions


Applying functionsto a dataset, apply() , is very handyforplaying with data and creating new
variables. To apply(a function meansreturning somevalueafter passing each row/column of
a DataFrame through somefunction. The function can bea defaultoneor user-defined.

You might be thinking, “Can’t wedothis by iterating over the DataFrameorSerieslike with
lists?” Yes, you are right, wecan. The problemisthat it would not be anefficient approach,
especially when dealing with large datasets. Pandasutilizes which meansvectorization
(operationsareapplied to wholearrays instead ofindividual elements).

For example, we could use a function to classify movies into four buckets (“great”, “good”,
“average”, “bad”) based on their numericalratings. We can dothis in twosteps:

« First, define a function that when given a rating determinestheright bucket for that
movie.
« Then apply that function to the DataFrame.

This is how wecan doit in code:

1 #41. Let's define the function to put movies into buckets based on their rating
2 def rating_bucket(x):
3 if x >= 8.8:
4 return “great”
5 elif x >= 7.0:
6 return “good”
7 elif x >= 6.0:
8 return “average”
9 else:
18 return “bad”
11
12 #2. Let's apply the function
13 movies_df_title_indexed[ "RatingCategory"] = movies_df_title_indexed["Rating”].apply(rating_bucket)
14
15 #3. Let's see some results
16 movies_df_title_indexed.head(10)[[ ‘Rating’ , "RatingCategory" ]]

# 1. Let's define the function to put movies into buckets based on their rating
def rating_bucket(x):
AF x= 8.0:
return “great
elif x >= 7.0:
return “good”
eli f x
wurn “average”
return “bad”
# 2. Let's apply the function
novies_dftitleindexed[ "RatingCategory"] = moviesdftitleindexed[Rating"].apply(rating bucket)
#3. Let's see some results
movies_dftitleindexed-head(10){{'Rating’ , ‘RatingCategory’)
Rating RatingCategory
Tite
(Guardians ofthe Galmy 8.1 ‘reat
Prometheus 7.0 008
spit 73 00
sing 72 geod
Sulcide Squad 62 average
The Grest Wall 6.1 average
Latatand 03 ‘reat
Mindnorn 64 average
The LostetyofZ 7.1 008
Passengers 70 008

Accordingto our rating method, we can see that “Guardiansof the Galaxy” and “La La
Land” are great movies, while “Suicide Squad” is just an average movie!

La LaLand (Image Source: https:/ww.slashfilm.com/la-la-land-review/)

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Use a Jupyter NoteBook?

* Click on “Click to Launch” g7 button to workandsee the code running live in the
notebook.

+ You can click to openthe Jupyter Notebookin a new tab.

* Go to File and click Download as and then choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

+ A Thenotebooksession expiresafter 15 minutes ofinactivity.

Your app can be found at: httos://vvq3ere1mk42-live-app.educative.runinotebooks/PivotTables%26Functions.ipynb G

Click to launchapp!

Final Thoughts
Whata fun rideit has been! From data exploration and data extraction to data
transformation, we have learned so many Pandas magictricks! And as a bonus, wehavealso
gained manyinteresting movies-insights along the way!

Being well-versedin Pandas operationis one oftheessential skills in data science. It’s
importantto have a good grasp on these fundamentals.If you want to go furtheror learn more
Pandastricks,here is Pandas extensiveofficial documentation.

A handy cheat sheet and someexercises to embedthese conceptsin your long-term


memory are awaitingyou in the next lessons! Keep going —you are doing great! (Q
Pandas:Further Readings and Cheat Sheet

Pandasofficial documentationis very extensive. To make navigating through iteasier, here are
somegood placesfor a broader and/or more detailed overvie

« Essential Basic Functionality


¢ Tutorials
* Cookbook

Belowis a handy Pandascheatsheet!

Data Wrangling angling in pa


with pandas easy EAE EAEEES iy Sewer vcous
sccominensyonasss, KE
Se El
Cheat Sheet nat ttt SS nac e a)
http://pandas.pydata.org Vata uy greene Scomtwesna ef ———
MHA
Sister Sclcranaon,
(¢fsort_values('mpe")
‘eaerrs yascluotoih).
‘fsort_values(‘npg' ascendingetalse)
(eaerrowsbysalescta cohenigh oau).
paanelt(ot) (¢franane(cotums
era teclusos= ('y':"year"))
Diane
» 32, 32]), sortindex()
index « ii2, 3})
‘Specsoreachaun =
Ez) 22 Sontheindes of Dataframe
(efreset_index()
4 = pd.oatarrane( Seat intOatafcamerowmabeymoving
pé.concat(faa¢f2))
evenrnsofDatrames pé.concat({a,ef2],
"Acpendconsatames axtset) “4f-drop(coleense{"Length",
DreamastomOsta Hetph€°1)

crepe
ct asap gatas
tt, pa tania
itynr
te saea
TR
oc a ce
orSon
siepe Sy —Te
Select rows by povtion, +

om
=e. = a. aoe
Soe enough
‘otherpanosmethodcanbeatedtthe
‘es ThGancecen
imgovesvenaofode ora
Fo [fase i co 7]
cannaelena, [owen
oc fociloners
ew = scanbttennd
setae* th ion (i
Sariable®
Sale’: “val “yae’, licen s pene) =< Ser
aectocter ('#ta
>
felesrearesfersaertaang [usmccrmcmera] “fleet e130,0

‘F("w" Jovalve_counts()
Handling Missing Data Enon
a
ropa) an
at
:
Count number of rows with each unique vate of variable
aFaso
Drop rows with any column having A/a data.
ann ot ode
nn
t
aa

ef describe) LEMNOS Stand |


tan cre a tir nhoa L for rob 7 RESIEDD perenne,
‘hows'Left", ons'x2")
22M
a ennaientices
“centsneer
at Volume
oe statesehr
Length"atMelghtut-bepth
EUHEDY persia, right",
et, ane")
agecaame eee
pa.qcut(atcol, ny Tobe
28-EEenmachgontowt
jn
resisretuned aspondsStesforeach in cole eno bute,
um) mint) EIEUIED penmerae(adt, et, »
(iat nt tin, "Nianec Brea | ontaraaa)
‘Court
Su now Arvesof STAIpereecast, at,
wee * howe‘outer, ons")
ednvale ofeach abet 82 ERE onda tanaey, oon
‘quantile((0.28,0.78}) Seren These ntoproducevein ofale tech the © nan
‘Outifeachobject elu or sngSri or teleies, ome
spply(function) ‘Sandadevitonof ech ‘in(axised) Fierions
mpfnctan each bjt bj. mas
Clement we ‘Yo,uppersse) mente min
abs() CE
at dtdaa, tsancbaoa)
‘rows nlhat havematch nb
‘im auer acngutieshls Ableae 2
seyt bye" eal The samplesbelow can sho beaptedto wowpintis cava the) TIE) (NeenoneattanGorotbebe
i mendabt
oo wn
editSorcery, fhmcion apesee prowann, athereturnedett e
en a
{rowedbyeaues ncohimn seotheergothe orgatate
“hoon
cicect*
ES” i
=
meaty,
=
Si = a
Sate Sh SP
Windows
STcane Eo, MR| ror 7EE seem.
om : MydF, 2d#, hows"outer’)

rt
df expanding() df.plot.hist() éf.plot.scatter(x="w',y="h") cs
sta oe :

lee
df rolLing(n) on Andteatoretrve)
Exercises: Pandas

We'll coverthe following a

Timeto Test Your Skills!

* Q1. Create a DataFramefrom thegivendictionary data andindexlabels


andstoreit in the variable called “df”.
Q2. a) Select the column labelled “Listeners” and storeit in the variable
called “col”. b) Select thefirst row andstoreit in the variable called “row”.

Q3. Selectall the rows wherethe Genreis ‘Pop’ and store the result in the
variable “pop_artists”.
Q4. Selectthe artists who have more than 2,000,000listeners and whose
Genreis ‘Pop’ and savethe output in the variable called “top_pop”.
Q5.Perform a grouping by Genre using sum() as the aggregation function
and storethe results in the variable called “grouped”.

Time to Test Your Skills!


Note: Wearegoing to create a DataFramecalled “df”in thefirst exercise, and wewill keep
referring to the same DataFrame as ourinput in therestofthe exercises.

Q1. Create a DataFrame from the given dictionary data and index labels
and storeit in the variable called “df”.

import pandas es pd
wavanaune

# Input
data = {‘Artist': ['Ariana Grande’, ‘Taylor Swift’, ‘Ed Sheeran’, ‘Justin Bieber’, ‘Lady Gaga’, ‘Br
“Genre’: ['Jazz", ‘Rock’, "Jazz", ‘Pop’, “Pop’, ‘Rock"],
‘Listeners’: [1300000, 27@0008, Seageee, 2000000, 3000000, 1108000]}
labels = ['AG', ‘TS’, 'ED', ‘JB‘, "LG", “BM']
1@ # Your solution goes here
11
12 # Uncomment the print statement once done
13° # print(dF)

ag Need Hint? Hide Solution a

Solution eR o

1 df = pd.DataFrame(data, index-labels)
2 print(dF)

Q2. a) Select the column labelled “Listeners” and storeit in the variable
called “col”. b) Select the first row andstoreit in the variable called
“row”.

(Remember, wearestill using the DataFramecalled df.)

# Your solution goes here


onunwner

# Uncomment the print statement once done


# print("Row:", row)
# print("Col:", col)

acrsg Need Hint? Hide Solution co

Solution eR o

dF iloc[@] # or dF.loc[*AG’]
dF‘ Listeners"]

Q3. Select all the rows where the Genreis ‘Pop’ and store the result in
the variable “pop_artists”.

1 # Your solution goes here


2
3 # Uncomment the print statement once done
4 # print(pop_artists)
5

acrsg Need Hint? Hide Solution co

Solution eR o

1 pop_artists = df[df["Genre"] == “Pop” ]

Q4.Select the artists who have more than 2,000,000 listeners and
whose Genreis ‘Pop’ and save the outputin the variable called
“top_pop”.

1 # Your solution goes here


2
3 # Uncomment the print statement once done
4 # print(top_pop)
5

acrsg Need Hint? Hide Solution co

Solution eR o

1 top_pop = d¥[((dF[ Genre] == “Pop") & (df['Listeners’] > 2000000))]

Q5. Perform a grouping by Genre using sum() as the aggregation


function andstore the results in the variable called “grouped”.

# Your solution goes here


Nouwsunk

grouped = d¥.groupby(‘Artist’).sum()
# Uncomment the print statement once done
# print(grouped)

acrsg Need Hint? Hide Solution co

Solution eR o

1 grouped = d¥.groupby(“Genre").sum()
Data Visualization - An Introduction

As a datascientist, yourjobis to tell stories, with data and to communicateinterestinginsights


to stakeholders. Data visualization plays a big rolein this:

« In the earlystagesof a project, creating visualizationshelps in Exploratory Data Analysis


(EDA). It helps in gaininginsights about your data by making things easier and clearer to
understand. When weplace thingsin a visual context patterns, trends, and correlations
that might have otherwise gone undetected, cometo the surface.
« Towardsthe endoftheproject,it’s very important to be ableto presentresults in a clear,
concise, and compelling mannerso that your audience (which is often going to be non-
technical people) can understandthe results as well.

al ad A
| Ae} i ry

3
de Image credits: PolicyViz

Python offers multiple graphing libraries that comepacked with lots ofdifferent features.
Matplotlib is the most popularlibrary for creating visualizations in an easy way, so we are
goingto use it as a basis for learningtheartof data visualization.

In this series of lessons on data visualization, wewill start with general Matplotlib usage tips.
Then wewill go throughthedetails of main visualization techniques and also learn how to
create them with code-examples.

® Before we continue, let me just remindyou of whatI said earlier: Learning to use Python
well means using lotof libraries and functions. Butyou don’t have to remembereverything
by heart — Googleisyourfriend!
Data Visualization - Matplotlib Tips

We'll coverthe following a

© General Matplotlib Tips


© 1. Importing Matplotlib
© 2.Setting Styles
© 3. DisplayingPlots
© 4. Saving Figuresto File

General Matplotlib Tips


Before wediveintothedetails of creating visualizations with Matplotlib, here are a few useful
tips aboutusing this package:

1. Importing Matplotlib
fust as we used the np shorthand for NumPy, and pd for Pandas,pit is the standard shorthand
for Matplotlib:

import matplotlib.pyplot as plt

Note that Matplotlib is a hugelibrary, so weare only importing the pyplotpartofit. This is
useful to save memory and speedupcode. Otherwise we'll be importing gigabytes oflibraries
even when wearejustinterested in using it to perform sometrivial tasks.

2. Setting Styles
plt.style directive can be usedto choosedifferent prettyfying styles for our figures. There a
numberof pre-defined styles provided by Matplotlib. For example, there’s a pre-definedstyle
called ggplot, which tries to copy the look andfeel of ggplot, a popularplotting packagefor R.
Below are some examples, both code andvisual outputs,ofthe available styles; you can go
through theofficial reference sheetfor a complete overview.

# Examples of available styles


plt.style.use('classic')
plt.style.use('ggplot')
plt.style.use('seaborn-whitegrid' )
plt.style.use(['dark_background', ‘presentation’ ])

ny
NW
WER
Nw
QOOOOO}

Examplesof available style. Image Credits: https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html

3. Displaying Plots
a. If you are using Matplotlib from within a script, the function p1t.show() is the way to go. It
triggers an eventthatlooks for all currently active figure objects and opens one or more
interactive windowsto display them.

b. If you are working with a Jupyter notebook, plotting interactively within the notebook can be
done with the %matplotlib command:

© %matplotlib notebook will create interactive plots embedded within the notebook.

© %matplotlib inline will create static images of your plots embedded in the notebook.

4. Saving Figuresto File


Matplotlib provides a handy feature for saving figures in a wide range offormats using the
savefig() command.For example,to savea figure calledfig as a png file, you can run this:

fig. savefig('figure.png')

Notice thatthefile formatis inferred from the extensionofthe given filename.

Useful introductory tips “ We are now ready to learn about somefun visualization
techniques.
Data Visualization Techniques - Scatter, Line, and
Histogram

We'll coverthe following a

Visualization Techniques

e 1. Scatter Plots
e 2.Line Plots

© 3. Histograms

Visualization Techniques

1. Scatter Plots
Scatter plots are deceptively simple and commonly used, butsimple doesn’t meanthat they
aren’t useful!

Ina scatter plot, data points are represented individually with a dot, circle, or some other
shape. Theseplotsare great for showingtherelationship between twovariables as we can
directly see the raw distributionof the data.

To createa scatter plot in Matplotlib we can simply use the scatter method. Let’s see how by
creating a scatter plot with randomly generated data points of manycolors andsizes.

First, let’s generate some random datapoints, x andy, and set random valuesfor colors and
sizes because we wanta pretty plot:

A complete “runnable” exampleis at the end.

# Generating Random Data


wewne

x = np.random.randn (100)
y = np.random.randn(190)
colors = np.random.rand(100)
sizes = 1080 * np.random.rand(1@0)

Nowthat wehaveourtwovariables(x, y) and the colors and sizes of the points, we can call the
scatter methodlike this:

plt.scatter(x, y, c=colors, s=sizes, alpha=0.2, cmap='viridis')

alpha = @.2: The optionalalpha input parameter(optional) of the scatter method allows usto
adjust the transparency level so that we can view overlapping data points. You will understand
this concept better, visually, in the final output.

cmap='viridis' : Matplotlib provides a wide range of colormaps, an optional parameter, for


creating visually appealing plots. For example, you can set cmap=plt.cm.Blues ,
cmap=plt.cm.autumn , OF cmap=plt.cm.gist_earth.

The color argument, c, is automatically mapped to a colorscale, and the size argument, s, is in
pixels. To view the color scale nextto theplot on theright-hand side, we can use the colobar()
command:

plt.colorbar()

Nowlet’s put it all together and run the full codeto see the output. Note that wewill save the
output to a file to display the plot with Educative’s code widget.

As an exercise, try changing the parametervalues to understandtheir effect on the output.

#1, Importing modules


wavanaune

import matplotlib.pyplot as plt


import numpy as np
# 2. Generating some random data
rng = np.random.Randomstate(®)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
10 sizes = 1000 * rng.rand(100)
11
12 #3. Create a scatterplot using the a colormap.
13. plt.scatter(x, y, c=colors, s=sizes, alpha=0.2,cmap="viridis’)
14 plt.colorbar(); # To show the color scale
15
16 plt.savefig(“output/scatter.png”)

Wecan see thatthis scatter plot has given us the ability to simultaneously explore four
different dimensions ofthe data.If this was a realdataset, the x, y location of each point can
be usedto representtwodistinct features. The size of the points can be used to represent a
third feature,and the color mappingcan beused to represent differentclasses or groups of the
data points. Multicolor and multifeature scatterplots like this are very useful both for
exploring and presenting data. Hereis an interesting example:

‘otal_bill

Example ofScatter Plot Analysis

2. Line Plots
Alineplot displays data points connected bystraightline segments,instead of showing them
individually. This type ofplot is useful for finding relationships between datasets .Wecan easily
see if values for one dataset are affected by thevariationsin theother. This will tell us if they
are correlated or not. For examplein theplot below, D and C don’t seem to be going on own
paths independently:

wI601 mGG9 1605 IEwIE-09 mE ITO

Example of Line Plots Analysis

Say we havea datasetwith the population of twocities over time and we wantto see if.
thereis some correlation in their population sizes. We can use the plt.plot() function to
plot population against time. We would puttime onthe x-axis and population on they-axis.

In order to comparethe populationsizes ofthe twocities, we wantthe population data for both
to be represented on the sameplot. To create a single figure with multiple lines, we can just call
the plot function multiple times. And thento distinguish betweenthe two, wecan adjust the
colors using the color keyword.

Finally, to label theplot, title and axis labels, we can simplycall:

plt.title("Some cool title for the plot")


plt.xlabel("Some x-axis label")
plt.ylabel("Some y-axis label");

Let’s lookatthis from startto finish:

1 #1, Importing modules


2 import matplotlib.pyplot as plt
3
4 #2. Getting the Data
5 year = [1978, 1988, 1990, 2000, 2010, 2026]
6 popA = [4.9, 5.3, 7.1, 10.7, 13.5, 15.6]
7 pope = [44.4 55.6, 69.7, 87.1, 95.4, 100.5]
8
9 # 3, Visualising the Date
10 plt.plot(yeer, pop_A, color="g")
11 plt.plot(yeer, pop_8, color="r")
12. plt.xlabel( Countries")
13. plt.ylabel( ‘Population in million’)
14 plt.title(‘Populations over the years")
16 plt.savefig(“output/line. png”)

3. Histograms
Histogramsare useful for understanding thedistribution of data points. Basically, a histogram
is a plot wheredatais split into bins. A bin is a rangeofvaluesorintervals. For each bin, we get
the count of how manyvaluesfall into it. The x-axis represents bin ranges while y-axis shows
frequency. Thebinsare usually specified as consecutive, non-overlappingintervals of a
variable. For example,in the plot below, we have information about how manyflights were
delayed for a rangeoftimeintervals:

Histogram ofArrival Delays


Flights

25 50 75 100 125
Delay (min)

Example Histogram

Wecanplot histograms using the hist() function:

import numpy as np
Nouwsunk

import matplotlib.pyplot as plt


data = np.random.randn(1¢00)
plt.hist(xedata) ;
plt.savefig("output/hist1.png")

histQ has manyoptionsthat can be used to tune boththecalculation andthe display. Let’s see
an example of a more customized histogram for the same data:

1 plt.hist(data, bins=30, normed=True, alpha=0.5,


2 histtype='stepfilled’, color="steelblue’,
3 edgecolor="none");
4
5 plt.savefig("output/hist2.png")

To create multiple histogramsin the sameplot, wecan either call the hist() function multiple
times orpassall the input valuesto it at once. The x parameterofhist() can take both a single
array or a sequenceofarrays as input.
Data Visualization Techniques - Bar and Box Plot

We'll coverthe following a

Visualization Techniques

° 4. Bar Plots

° 5. Box Plots

* Final Thoughts

Visualization Techniques

4. Bar Plots
Barplots are an effective visualization technique when wehavecategorical data thatconsists
ofvarious categories or groups.

For example, wecan use barplots to view test scores categorized by gender. These plots allow
us to easily see the difference between categories (gender) because thesize ofthe barvaries
with the magnitudeofthe represented variable(score), and categories are clearly divided and
color-coded.

However,thereis a catch. Bar plots work well when weare dealing with a few categories.If
there are too manycategories though,the barscan get very cluttered and theplots can quickly
becomehard to understand.

Wecan haveseveralvariationsof bar plots. The image below showsthree differenttypesof bar
plots: regular, grouped, and stacked:

Scores by croup and gender Scoresby grupanegender

reget : [ | | | [ :

Notice that bar plots look very similar to histograms. Howdowedistinguish between the
two? Whatare thedifferences betweenthe two?

A histogram representsthe frequency distribution ofcontinuousvariables while a bar graph


is a diagrammatic comparison ofdiscrete variables. The histogram presents numerical data
whereas the bar graph showscategoricaldata. The histogram is drawn in such a way that
thereis no gap betweenthebars.

Say wehavedatain the formoflists for the numberofusersfor different programming


languages:

# Input data
labels = (‘Python', 'C++', ‘Java’, ‘Perl’, 'C#')
num_users = [12,8,15,4,6]

To make barplot from this data, wefirst need to convertthelabels from string data into
numericaldata thatbe used for plotting purposes. We can use NumPy’s arange method to
generate an array of sequential numbersof the samelength asthelabelsarraylike this:

index = np.arange(len(label))
# index: [@,1,2,3,4]

Nowwecaneasily represent languages on the x-axis and num_userson the y-axis using the
plt.bar() method:

1 import numpy as np
2 import matplotlib.pyplot as plt
3
4 # Input data
5 labels = (‘Python’, "C++", ‘Java’, ‘Perl’, ‘C#")
6 numusers = [12,8,15,4,6]
7 index = np.arange(len(labels))
8
9
10 # Plotting
41 plt.bar(index, num_users, align="center’, alpha-0.5)
12 plt.xticks(index, labels)
13. plt.xlabel(' Language")
14 plt.ylabel(‘Num of Users’)
15 plt.title('Programming language usage")
17 plt.savefig(‘output/barplot.png" )

5. Box Plots
Wepreviously looked at histograms which weregreatfor visualizing the distribution of
variables. But whatif we need moreinformation thanthat? Box plots provideusa statistical
summary of the data. They allow us to answerquestionslike:

« Is the data skewed, whenall the values are concentrated on oneside?

« Is the median different from the mean?


« Whatdoesthe standarddeviation looklike?

Note:If all of these terms soundalien to you at the moment, don’t worry! We will
coverthese terms, andget backto boxplots in moredetail, in the Statistics Lessons.

Ina box plot, the bottom andtop of the box are alwaysthe 1st and 3rd quartiles, 25% and 75%
of the data, and the bandinsidetheboxis always the 2nd quartile, the median. The dashed
lines with thehorizontalbars on theend,or whiskers, extend from the box to showthe range
ofthe data:

10

Thur FA sat sun


cay

Exampleof Box plot Analysis

Wecan use the boxplot() method to create our insight-rich boxplots:

plt.boxplot (data_to_plot)

This method also takes various parameters as input to customize the color and appearanceof
the boxes, whiskers, caps, and median. Let’s see an example based on some randomly
generated data:

1 # numpy is used for creating fake data


2 import numpy as np
3 import matplotlib.pyplot as plt
4
5 # Create data
6 data_group1 = np.random.normal (100, 10, 200)
7 data_group2 = np.random.normal(82, 30, 200)
8 data_group3 = np.random.normal(9e, 20, 2@@)
9
10 # combine these different datasets into a list
11 data_to_plot = [data_group1, data_group2, data_group3]
13 # Create the boxplot
14 # patch_artist must be True to control box fill (color-filled boxes)
15. plt.boxplot(data_to_plot, patch_artist = True)
47 plt.xlabel( Group")
18 plt.ylabel('Scores")
19 plt.title("Some Group Scores")
21 # Save the figure
22 plt.savefig( ‘output/boxplot.png’ )

Final Thoughts
These were examples of some must-know Data Visualization techniques using Python’s most
popularvisualization library, Matplotlib. These are simpleyet powerful visualization
techniques which you canuse them to extractrich insights from your datasets.

Of course,the visualization story doesn’t endhere: notonly arethere other more elaborate
visualizations to extract even deeperinformation from yourdata, like Heat Maps and Density
Plots, but there are many otheruseful libraries for coolvisualizationstoo. Hereis a brief
introductionof the most popular ones:

« Pandas Visualization: easyto use interface, built on Matplotlib


« Seaborn:high-level interface, based on Matplotlib,for attractive and informativeplots
« ggplot: based on R's ggplot2
« Plotly: can createinteractive plots
« Bokeh: interactive visualization library for beautiful visual presentation ofdata in web
browsers

Again, data science is an extensivefield, so you cannot, and should not, expect to learn
and rememberallthe possible libraries, methods, andtheir details!

Wewill get back to creating morerichvisualizations, in the Projects Section. Wewill also
workwith morelibraries, like seaborn, and build upon what wehavelearned so far. Stay
Tuned! @
Data Visualization Cheat Sheet

We'll coverthe following a

Picking the Right Data Visualization

Picking the Right Data Visualization


Pickingtheright visualization techniquecan hetricky. Your data could technically work well
with multiple types of plots, but you need to pick theonethat ensures your messageis clear,
accurate, and concise.

In general, line, bar, and column chartsare goodto represent changeovertime. Pie charts can
showparts-of-a-whole, andscatterplots are niceif you havea lotof data.

To make yourlife easier here is a cool cheat sheet on selecting the right visualization methods
from the Harvard CS-109 extension program:

Chart Suggestions—A Thought-Starter


—e)
== Pram
sue aac) [eam
; [aes ae) =a
] afd all ||
a om Few Catagories
Few ems Cc Daa "NowGplicDatt
‘Many Periods
Single or Fel Capes ManyCrore
‘bew Perils

|___svat
em od
Comparison

: ‘What
like wswould you an
|Distribution
aa
SB Composition
SSO Rn "
om

ou r
he T,
oetas,
=
seer | E
eG | Weel

Pick the Right Data Visualization (Image Credits: Harvard CS-109extension program)
Quiz: Data Visualization

We'll coverthe following =~

© Time To Test YourSkills!

Time To Test Your Skills!

a Suppose wedid an experiment to study the chemicalreaction oftwo substances A and B. We


measuredthe quantities of A and B, and then wealsoobservedthe temperature and color of
the productresulting from their chemicalreaction. Now wewantto visualize all these four
variablesin a meaningful way. Which of the following plots would be a good choiceto get a
goodoverall picture of our experiment?

O AyScatter Plot

OB) BoxPlot

O Histogram

Reset Quiz C Question 1of4


O attempted
Introduction

Statistics is a key componentofa datascientist’s toolbox. Unfortunately, courses and books on


basic statistics rarely cover the topic from a data science perspective. Theidea behindthis
series of lessons is to provide youa practical guide for understanding thestatistical concepts
that shouldbeat the fingertips of every good data scientist.

In the data visualization lessons, we saw thatwecaneasily obtain insights aboutthe data using
various typesof plots. So wheredoesstatisticsfit in?

Ata high-level,statistics is about performing a mathy technicalanalysis ofthe data. It helps us


build on theinsights gained from ourvisualizationsby giving usthe ability to perform fine-
grained andin-depth data analysis. It helps us understandthe structure ofour data. Having
this kind of understandingis important because wecan then choose and apply data science
techniquesthatarethe bestfit for that shape of data. This meanswelet thedata to do the
talking; we reach conclusionsthat are thorough and thoughtoutinstead of based on guess
work.

Withoutfurther ado,let’s dive into the worldofstatistics!


Statistical Features - Basics

We'll coverthe following a

* Basic Concepts
* Mean
* Median
* Standard Deviation
Correlation Coefficient

Basic Concepts
Thefirst step in analyzing data is to get familiar with it. Our good old NumPyprovides a lot of
methods thatcanhelp us dothis easily. We are going to look at someof these methodsin this
lesson. Along the way, we are going to understand the meaning of important statistical terms as
weencounterthem.

The mostbasic yet powerful terms that you could comeacrossare the mean, mode, median,
standarddeviation, and correlation coefficient. Let’s understand these with an example
dataset and using NumPy.

Say wehave a dataset consisting of students’ examsscores andthetimethey invested in


studying for the exam. Whatcan welearn aboutthis data usingstatistics?

Runthe codein the widget below andtry to understand what’s happeningbefore reading the
description thatfollows.

import numpy as np
wavanaune

# The dataset
learninghours = [1, 2, 6, 4, 10]
scores = [3, 4, 6, 5, 6]
# Applying some stats methods to understand the data:
print("Mean learning time: ", np.mean(learning_hours))
print("Mean score: ", np.mean(scores))
10 print("Median learning time: ", np.median(learninghours) )
11 print("Standard deviation: ", np.std(learninghours))
12 print("Correlation between learning hours and scores:", np.corrcoef(learning_hours, scores))

Mean learning time: 4.6


Mean score: 4.8
Median learning time: 4.0
Standard deviation: 3.2
Correlation between learning hours and scores: [[1. 0.88964891)
[0.esceaec2 1. 1

Mean
The meanvalueis the average ofa dataset, the sum the elements divided by the numberof
elements. As the namesays, np.mean() returns the arithmetic meanofthedataset.

Median
The medianis the middle elementof the set of numbers. If the length ofthearrayis odd,
np.median() gives us the middlevalueof a sorted copyof thearray.If the length ofthe array is
even, wegetthe average ofthe two middle numbers.

Standard Deviation
Standarddeviation is a measure of how muchthedatais spreadout,andis returned by the
np.std() method. Morespecifically, standard deviation shows us how muchourdatais spread
out around the mean. Standard deviation could answerthe questions “Areall the scores close to
the average?” or, "Are lots of scores way above or way below the average score?"Using standard.
deviation we havea standard way of knowing what is normal and what is high or extra low.

In mathematical terms, standard deviationis the squarerootof the variance. So now you ask,
“Whatis variance?”
Varianceis defined as the average ofthe squared differences from the mean. Let mebreakthis
downfor you.

To calculate the variance manually we wouldfollow thesesteps:

1. Compute the mean(the simple average of the numbers)


2. Then for each number, subtract the mean andsquaretheresult, i.e., the squared
difference.
3. Then computethe average ofthose squareddifferences.

Let’s calculate the standarddeviation for learning hours manually.First let’s get the mean
value:

= 14+2+64+4+10
1424044410 4
Mean
3
Nowto calculatethe variance,get the difference of each element from the mean, squarethat,
andthen averagetheresult:

(1 — 4.6)? + (2 — 4.6)? + (6 — 4.6)? + (4 — 4.6)? + (10 — 4.6)?


Variance =
5
= 10.24

Finally, the standarddeviation is just the square root of the variance, so:

StandardDeviation = V10.24 = 3.2


Doing these calculations step by step makes us appreciate how easy NumPy makes ourlife. It
allowsusto performstatistical analysis without having to remember any mathy formulas
and/orlong steps — simple and neat!

Correlation Coefficient
Whentwosetsof data are strongly linked together wesay they havea high correlation.

Funfact: the word Correlation is made of Co,meaning “together”, and Relation.

Correlation is positive whenthevaluesfor the twosetsofdata increase together, while


correlation is negative when one value decreasesas the other increases. For example, caloric
intake and weight, or time spent studying and GPA, are highly correlated data; while a person’s
nameandthetype of food theyprefer is an exampleof low correlation.

Acorrelationcoefficient is a way to puta valueto the relationship. Correlation coefficients


have a value of between -1 and 1:

* 1isa perfectpositive correlation


* Ois no correlation meaning the values don’t seemlinked at all
« -1is a perfect negative correlation

np.corrcoef() returns a matrix with thecorrelation coefficients. This method comesin handy
whenwewantto see if thereis a correlation between twoor morevariablesin our dataset.

Correlation Is Not Causation!

Correlation Is Not Causationis a very commonsaying. Whatit really meansis that a


correlation doesnot prove onethingcauses the other.

For example, say an ice-cream shop has data for how manysunglasses weresold bya big store
in the neighborhoodovera period oftime, and they decide to comparesunglassessalesto their
ice cream sales. From their results, they find a high correlation between thesales ofthe two.
Doesthis mean that sunglasses make people wantto buy ice cream? Hmm,no!

So, in layman terms,“Correlation Is Not Causation” tries to remindus that correlation does not
proveone thing/event causesthe other, but:

Oneevent might cause the other


The other mightcausethefirst to happen
They maybelinkedbya different reason
Orthe result could have been random chance

ANOTHER HUGE. STUDY YOURE NOT...THERE ARE 57


FOUND No EVIDENCE. THAT MANY PROBLEMS WITH THAT.
CELL PHONES CAUSE CANCER. JUST To BE SAFE, UNTIL
WHAT WAS THE WHO. THINKING? T S€£ MORE DATA TM
( I THINK THEY Just GOING ToASSUME CANCER
corir —e CAUSES CELL PHONES.

ImageCredits: https:/fxked.com/925/

Getting back to ourfirst example, performinga simplestatistical analysis on our dataset gave
usa lot ofinsights. In summary, wecansee that:

« The meanlearninghoursis 4.6.


« The meanscoreis 4.8.
« The median learning hoursis 4.0.
« Thestandarddeviationfor the learning hoursis 3.2.
« Thereis a high correlation between how mucha studentstudiesin terms of hours and
their final grade! Well, at least based on our made-updataset.
Statistical Features - Working With Box Plots

We'll coverthe following a

* Anatomy ofa BoxPlot


* Five-Number Summary
* Interpreting A Box Plot

Another importantstatistical conceptis that of percentile, so let’s get a good understanding of


this essential feature. Also,let’s learn to interpretstatistical features from plots.

Anatomy of a Box Plot


Do you remember the boxplots from the lessons on data visualization? As we sawearlier, we
can write somevery simple code using Matplotlib’s boxplot() method to obtain statistical
features in the form of box plots:

©oUTLIER Move than 3/2


ee times ofupper quartic

MAXIMUM Greatestvalue,
‘excluding cutirs

LUPPERQUARTILE 25% of
data greater than this value
MEDIAN 50% of data is
‘greater than this value;
idle ofdataset

LOWER QUARTILE 25% of


data less than this value

MINIMUM Leastvalue
‘excludingouters
@—ouTLIERLess than 32
timesof ower quartle

Howto read a box-plot (Image credits: flowingdata.com)

A boxplot is basically a graphthat presents information from a five-number summary. If we


lookat the diagram above,wecansee thatin a boxplot:

* Theendsof the boxarethefirst (lower) and third (upper) quartiles — the box spans the
so-called interquartile range. Thefirst quartile basically represents the 25th percentile,
meaning that 25% of the data pointsfall below thefirst quartile. The third quartile is the
75th percentile, meaning that 75% of the points in the data fall belowthethird quartile.
« The median, markedbya horizontal line inside the box, is the middle value of the dataset,
the 50th percentile. Median is used instead of mean becauseit is more robust to outlier
values (wewill talk about this again later and understand why).
+ The whiskersarethe twolines outside the box that extendto the highest and lowest(or
min/max) observations in our data.

Five-Number Summary
To recap,a five-number summary is madeupof thesefive values: the maximum value, the
minimum value, the lower quartile, the upper quartile, and the median.

Thesevalues are presented together and ordered from lowestto highest:

+ Minimum value

¢ Lowerquartile (Q1/25th Percentile)


¢ Median value (Q2/S0th Percentile)
¢ Upper quartile (Q3/75th Percentile)
¢ Maximum value

Thesefive numbersgive us a summary ofthe data as each valuedescribes a specific part of a


dataset: the medianidentifies the centerof a dataset; the upper and lower quartiles span the
middlehalf of a data set; and the highest and lowest observationsgiveus insights into the
actual dispersionof the data. The five-number summary is a useful measureofspreadin the
dataset.

Interpreting A Box Plot


« Ashortboxplottells us that manyof ourdata points are similar, we have manyvalues ina
small range. On the other hand,a tall box plot implies that muchofthe data points are
quite different, we havevalues that are spread over a wide range.
A medianvaluethat is closer to the bottom tells us that mostof our data points have lower
values. While a median valuecloser to thetoptell us that most of our data has higher
values. Basically, a medianline that is not in the middleofthe boxis an indication of
skewed data.
Whataboutthe length of those whiskers?
Long whiskerstell us that ourdata has a high standarddeviation andvariance,i.e., the
values are spread out and varya lot. If there are long whiskerson one side ofthe box, but
not theother,thenit’s an indicationthat our data varies, but only in onedirection.

Isn't this a lot of useful information from a few simplestatistical features that are easy to
calculate? Rememberto make use of them while doing a preliminary investigation of a large
dataset, when comparing two or moredatasets, and when you need a descriptive analysis
including data skewednessoroutliers of your data.
Basics of Probability

We'll coverthe following a

© WhatIs Probability?
* WhyIs Probability Important?
How DoesProbability Fit in Data Science?
* Calculating Probability of Events
* a. IndependentEvents
* b, DependentEvents
* c, Mutually Exclusive Events
d. Inclusive Events

Conditional Probability

For anyonetakingtheirfirst steps in data science, probability is a must-know concept.In this


lesson, wewill learn this importantpiece of the puzzle by going through each conceptina
simple way.

What Is Probability?
Probability is the numerical chance that somethingwill happen;it tells us how likelyit is that
someeventwill occur.

Probability is oneof thoseintuitive concepts that weuse on a daily basis, withoutnecessarily


realizing that we aretalking probability. Ourlivesare full of uncertainties; unless someone has
superpowersto foresee the future, we don’t know the outcomesofa particularsituation or
eventuntil it actually happens. Will I pass the exam withflying colors? Will it snow today? Will
my favorite team win the match? These are some examples of uncertain events. In statistical
terms, “team won”is the outcome while “my team winning today’s match”is the event.
Probability is the measureof howlikely this outcomeis.

For example,if it is 80% likely that my team will win today, the probability of the outcome“the
team won” for today’s matchis 0.8; while the probability of the opposite outcome, “it lost”, is
0.2, i.e., 1 - 0.8. Probability is represented as a numberbetween0 and 1, where0 indicates
impossibility and 1 indicatescertainty.

Why Is Probability Important?


With all the uncertainty and randomnessthat occursin our daily life, probability helps us
makesenseof these uncertainties. It helps us understandthe chancesofvarious events. This, in
turn, meansthat we can makeinformeddecisions based onestimatesor patterns of data
collected previously. For example,if it is likely to rain, we can grab an umbrella before heading
out. Orif a user is unlikely to check our app without a reminder, we can send them a
notification.

How DoesProbability Fit in Data Science?


Understanding the methods and models needed fordatascience,like logistic regression which
wewill encounterin the Machine Learning section, randomization in A/B testing, or
experimental design, and sampling of data are examplesof use-cases that require a good
understanding of probability.

Calculating Probability of Events


Probability is a type of ratio where we compare how manytimes an outcomecan occur
comparedto all possible outcomes. Simply put:

NumberOfWaysItCanHappen
ProbabilityO fAnEventHappening = TotalNumberO[Outcomes

Example 1
Whatis the probability you get a 6 when youroll a die?

A die has6 sides, 1 side contains the number6. We have1 wanted outcomeoutofthe 6 possible
outcomes, therefore, the probability of gettinga 6 is 1/6.

a. Independent Events
Example 2
Whatis the probability of getting three 6s if we have 3 dice?

« The probability of getting a 6 on onedieis 1/6.

+ Theprobability of gettingthree6s is: P(6, 6,6) = 1/6 * 1/6 * 1/6 = 1/216


This was an example ofprobability of IndependentEvents:

Twoevents are independentwhen the outcomeofthefirst event doesnot influence the


outcomeof the second event — getting a 6 whenrollingthefirst die does not affect the outcome
ofrolling the seconddie.

When wedeterminethe probability of two independentevents we multiply the probability of


thefirst eventby the probability of the second event:

P(X and Y) = P(X) * P(Y)

b. Dependent Events
Example 3
Whatis the probability of choosing twored cardsin a deck of cards?

This is an example of DependentEvents:

Twoevents are dependent whenthe outcomeofthefirst event affects the outcomeof the
secondevent. To determinethe probability of two dependentevents, weuse the following
formula:

P(X and Y) = P(X) * P(Y after X has occurred)

Getting back to our problem:

Since a deck ofcardshas 26 black cards and26 red cards, the probability of randomly choosing
a red cardis:

P(red) = 26/52 = 1/2


However,nowthat wehavealreadytaken outonered card from the deck,the probability of
choosing a red card again from the same deck becomes:

PQnared) = 26/52 * 25/51 = 25/102

c. Mutually Exclusive Events


Twoevents are mutually exclusive whenit is impossible for them to happen together. Turning
left and turning right are mutually exclusive events — you can’t do both at the sametime.
Similarly, getting heads andtails while tossing up a coin are also mutually exclusive. Well,
except in the world of quantum physicists!

P(A and B) = @

The probability that oneof the events occursis the sum oftheir individual probabilities.

P(X or Y) = P(X) + P(Y)

Example 4
a. Whatis the probability of getting a King and a Queen from a deckof cards?

Acard cannot be a King AND a Queenatthe sametime! So the probability of a King and a Queen
is 0 (impossible).

Wecan have a King OR a Queen. In a Deckof 52 Cards:

+ the probability of a King is 1/13, so P(King)=1/13


the probability of a Queenis also 1/13, so P(Queen)=1/13
* When we combine those two Events: P(King or Queen) = (1/13) + (1/13) = 2/13

d. Inclusive Events
Inclusive events are events that can happenat the sametime. To get the probability of an
inclusive event, wefirst add the probabilities of the individual events and then subtract the
probability of the two events occurring together:

P(X or Y) = P(X) + P(Y) ~ P(X and Y)

Example 5
If you choose a card from deck, whatis the probability of getting a Queenor a Heart?

It is possible to get a Queen anda Heartat the sametime, the Queen ofHearts which is the
intersection P(X and Y). So:

P(QueenOrHeart) = P(Queen) + P(Heart) — P(QueenO fHearts) =


= 4/52 +13/52 — 1/52 = 16/52

Conditional Probability
Conditional probability is a measureofthe probability of an eventgiven that another event has
occurred. In other words,it is the probability of one event occurring with somerelationship to
oneor moreotherevents.

Say event X is that it is raining outside, andthere a 0.3 (30%) chanceofrain today. Event Y
mightbe thatyou will need to go outside with a probability of 0.5 (50%).

A conditional probability would lookat these two events, X AND Y, in relationship with one
another.In the previous example this wouldbe the probability thatit is both raining and you
need to gooutside.

Conditional probability is given by:

P(Y|X) = P(X and Y) / P(X)

Example 6
Whatis the probability of drawing 2 Kings from a deckofcards?

« Forthefirst card the chance of drawing a King is 4 outof 52 since there are 4 Kings ina
deck of 52 cards: P(X) = 4/52
« After removing a King from the deck,only 3 of the 51 cardsleft are Kings, meaning the
probability of the 2nd card drawnbeing a King isless likely: P(Y| X) = 3/51
* So, the chanceofgetting 2 Kings is about 0.5%:

P(XandY) = P(X) * P(Y|X) = (4/52) * (3/51) = 12/2652 = 1/221


Avery important, and extensively used, derivation from conditional probability is the famous
Bayes Theorem. Andthat’s what wearegoing to dive-in in the nextlesson. The probability
journey continues;see you in the next lesson!
Bayesian Statistics

We'll coverthe following a

e WhatIs Bayesian Statistics?

* Bayes’ Theorem Example:

What Is Bayesian Statistics?


Whatwehavelearnedso far about probability falls into the category of FrequencyStatistics.
Butthere is another more powerful form ofstatistics as well, andit called Bayesian Statistics,
sometimes called, BayesianInference. Bayesian Statistics is a more general approach to
statistics; it describes the probability of an event based on the previous knowledgeof the
conditions that might be related to the event. It allows us to answerquestionslike:

« Hasthis happened before?


« Isitlikely, based on my knowledgeof the situation, that it will happen?

DID THE SUN JUST EXPLODE?


(ITS NIGHT; 50 WERE NOT SURE.)
“THiS NEUIRND DETECTOR MERGURES
\WHEFERTHE GUN HAG GONE NOR,
akTREN, ITROUS TWO DCE. IF THEY
‘BOTH COFEUPog ITLES TOUS.
re |
ameete
COENORE
X68.

a|
FREQUENTIST STATISTCIAN:
THE PROGAUITYOF TH RESULT
BAYESIAN STFTTOAN:
DAD SY GAME 007 BerYOu $5
<0.05, T. CONKILDE HASNT

yal ImageCredits: https://xkcd.com/

Let’s look at an example. Ever wonder how a spam filter could be designed?

Say an email containing, “You wonthe lottery” gets marked as spam. The question is, how cana
computer understandthat emails containing certain wordsarelikely to be spam? Bayesian
Statistics does the magic here!

Spam filtering based on a blacklist would be too restrictive and it would havea highfalse-
negative rate, spam that goes undetected. Bayesianfiltering can help by allowing the spam
filter to learn from previousinstances of spam. As we analyze the words in a message, we can
computeits probability of being spam using Bayes’ Theorem. Andasthefilter gets trained with
more and more messages,it updates the probabilities that certain words lead to spam
messages. BayesianStatistics takes into account previous evidence.

BayesianStatistics is based on the Bayes’ Theorem:Thisis basically a wayoffinding a


probability when we know certain other probabilities. The magical Bayes’ formula looks
like this:

P(A|B) ~
P(A)P(BIA)
P(B)
Thistells us how often A happensgiven that B happens,written P(A |B), when wehave the
following information:

+ Howoften B happensgiventhat A happens,written P(B| A).


* Howlikely A is on its own, written P(A).

« And howlikely B is on its own, written P(B)

If we mapthis to the spam filter example,we get:

P(spam) P(words|spam)
P(spam|words) = P(words)

Bayes’ Theorem Example:

Say wehavea clinic that tests for allergies and we wantto find outa patient’s probability of
having an allergy. We knowthat our testis not alwaysright:

« Forpeople thatreally do havethe allergy,the test says “Positive” 80% of the time,
(Positive | Allergy)
« The probability of the test saying “Positive” to anyoneis 10%, P(Positive)

If 1% of the population actually hastheallergy, P(Allergy), and a patient’s test says “Positive”,
what is the chancethat thepatientreally does havetheallergy, i.e., P(Allergy | Positive)?

Bayes’ theoremtells us:

P(Allergy)P'(Positive| Allergy) _ 0.01 * 0.8


P(Allergy|Positive) = = 0.08
P(Positive) 10
This meansthe chancethatthepatientactually hastheallergy is only 8%! Soundlike a clinic we
should stay away from!
Probability Distributions - An Introduction

We'll coverthe following a

Introduction

e Random Variables

Typesof Probability Distributions

e Probability Functions

Introduction
Wehavelearnedthat probability gives us the percent chance of an event occurring. Now, what
if we want an understandingof the probabilities ofall the possible values in our experiment?
This is where probability distributions comeinto play.

A probability distribution is a function that representsthe probabilities of all possible values.


Thisis a very important concept in data science, by specifying the relative chanceofall possible
outcomes. Probability distributions allow us to understandthe underlying trendsin ourdata.
For example, if we have somemissingvalues in our dataset, we can understand the
distribution of our data using probability distributions and then replace missing values with
the mostlikely values.

Random Variables
Forthe next couple of lessons, we are going to look at someof the most importantprobability
distributions. But before wediveinto probability distributions, we need to understandthe
different types of data we can encounter.

Thesetof possible values from a random experimentis called a RandomVariable. Random


Variables can be eitherdiscrete or continuous:

« Discrete Data(a.k.a. discrete variables) can only take specified values. For example, when
weroll die, the possible outcomesare1, 2, 3, 4, 5, or 6 and not 1.5 or 2.45.

+ Continuous Data(a.k.a. continuousvariables) can take any value within a range. This
range canbefinite or infinite. Continuous variables are measurementslike height, weight,
and temperature.

Types of Probability Distributions


Since probability distributions describe the distribution of the values of a random variable, the
kind ofvariable determinesthetype of probability distribution weare dealing with. This
meansthat probability distributions can be divided into the following twotypes:

« Discrete probability distributionsfor discrete variables


« Probability density functions for continuous variables

Probability Functions

Thereis just one more conceptweneed to understand before jumping into the different
distributions.

The probability function for a discrete random variableis often called Probability Mass
Function, while for continuous variables we have theso-called Probability Density Function
(a.k.a. Probability Distribution Function).
Bie BIN Blu Be Flo Bo

Theprobability mass function (pmf) p(S).or discrete probability distribution function fordiscrete variables, D, specifies the
probability distribution for the sum of counts from two dice. For example,the figure shows that p(11) = 2/36 = 1/18. The
pmfallows the computation of probabilities of events such as P(S > 9) = 1/12 + 1/18 + 1/36 = 1/6, andall other probabilities
in thedistribution.

For example, we could have a continuous randomvariable that represents possible weights
ofpeoplein a group:
Frequency(8)
25

Boay Weight (bs)

Probability Density Function

The probability density function showsall possible valuesfor Y. For example, the random
variable Y could be 100 lbs, 153.2 lbs or 201.9999lbs.

Whyarethese functions important?

The probability density function can help us to answer things like: Whatis the probability that a
person will weigh between 170lbs and 200lbs?

Nowthat wehavedonethe ground work,in the nextlessons weare goingto cover the most
importantdistributions for both discrete and continuousdata types.
Types of Distributions - Uniform, Bernoulli, and Binomial

We'll coverthe following a

© TypesofDistributions
* 1, Uniform Distribution
* 2. Bernoulli Distribution
3. Binomial Distribution

Types of Distributions
A few words beforewestart:

#® Wewill be looking at the mathematical representationsfor the various


distributions. You do NOTneed to rememberthem by heart! We are going to look at
them so that we can get a completepicture.

#® The importantthing is to be ableto identify distributions from their graphs and


to knowtheir majorproperties and/or distinguishingfeatures. For example, sayyou
plot the distribution ofan interesting variable in your dataset; you should be able to
tell what kindof distribution that variable is following — is it Normal or Poisson or
something else?

&® Pay special attention to the NormalDistribution andits properties; you should
knowthatonereally well as youarelikely to encounterit most frequently.

1. Uniform Distribution
Thisis a basic probability distribution whereall the values have the sameprobability of
occurrencewithin a specified range; all the valuesoutside that range have probability of 0. For
example, when weroll a fair die, the outcomes can only be from to 6 andtheyall have the
sameprobability, 1/6. The probability of getting anything outsidethis range is 0 — youcan’t get
a7.

The graph of a uniform distribution curve looks like this:

0.25
oo Uniform(1,6)

0.15 Uniform(4,12)
at
0.08
at
a 2 4 6 @ wm 12

Unifrom Distribution

Wecan see that the shape of the uniform distribution curveis rectangular. This is the reason
whythisis often called the rectangulardistribution.

Thedensity function,f(X), of a variable X that is uniformly distributed can bewritten as:

1
b-a
where a and Db are the minimum and maximum values ofthe possible range for X.

The mean and variance ofthe variable X can then hecalculated like so:

_atb
Mean = E(X)
2
b—a)2
Variance = V(X) = Ona

2. Bernoulli Distribution
Although the name sounds complicated,this is an easy oneto grasp.

A Bernoulli distributionis a discrete probability distribution. It is used when a random


experimenthas only two outcomes,“success”or “failure”; “win” or “loss”*, and single trial.

For example, the probability, P, of getting Heads (“success”) while flipping a coin is 0.5. The
probability of “failure” is 1 - P, i.e., 1 minus the probability of success. There’s no midway
betweenthe twopossible outcomes.

Arandom variable,X, with a Bernoulli distribution can take value 1 withthe probability of
successp, and the value 0 with theprobability of failure 1-p.

The probabilities of success andfailure donot need to be equallylikely, think about the results
ofa football match. If we are considering a strong team, the chances of winning would be much
higher compared to those of a mediocre one. The probability of success, p, would be much
higherthan theprobability of failure; the two probabilities wouldn’t be the same.

There are many examples of Bernoulli distribution such as whetherit’s going to rain tomorrow
or not(rain in this case would meansuccessandnorain failure) or passing (success) and not
passing (failure) an exam.

Say p=0.3, we can graphically represent the Bernoulli distribution like so:

‘Bernoulli Distribution
&
a
re
iJ
a

0
Scenarios

The probability density function, P(X), of a Bernoulli distributionis given by:

P(x) =p" * (L— p)",


wherex € (0, 1)

This can also be written as:

mao f peed
l-p 2=0

The expectedvalue, F(X), of a random variable,X, having a Bernoulli distribution can be found
as follows:

Mean = E(X) =1*p+0*(1—p) =p


Thevariance of a Bernoulli distribution is calculated as:

Variance = V(X) = E(X*) —[E(X)P = p—p?=


p(l = p)

3. Binomial Distribution
Bernoulli distribution allowedus to represent experimentsthat have two outcomes but only a
single trail. What if we have multiple trials? Say wetoss a coin not one but manytimes. This
is where an extension of Bernoulli distribution comesintoplay, Binomial Distribution.

A BinomialDistribution can be thoughtofas the probability of having successor failure as


outcomein an experimentthat is repeated multiple times. In other words, when only two
outcomes are possible, successorfailure, win or lose, and the probability of success andfailure
is same for all thetrials, the probability of Heads whentossing a coin does not change from one
toss to another.

Again,just like in case of Bernoulli, the outcomes don’t need be equallylikely. Also, eachtrialis
independent — the outcome ofa previous toss doesn’t affect the outcomeof the currenttoss.

Puttingit all together, Binomialdistributions must meet the followingcriteria:

Thereare only two possible outcomes in trial- either successorfailure.


The probability of success(tails, heads,fail, or pass) is exactly the sameforall trials.
The numberof observationsortrials is fixed, a total numberofn identicaltrials. In other
words, wecanonlyfigure out the probability of something happeningif wedoit a certain
numberoftimes. If wetoss a coin once, our probability of getting tails is 50%. If we toss a
coin 20 times, our probability of getting tails is very close to 100%.
Each observation ortrial is independent,none ofthetrials havean effect on the
probability of the nexttrial.

Sinceit’s easier to understand concepts by looking atgraphical representationsrather thanjust


heavy formulas,let’s look at the graphical representation of a Binomial Distribution first for
varying n and constant p andthen vice-versa:

BinomialDistribution PDF
Probabiity

Random Variable

ImageCredits: https:/www.boost.org

BinomialDistribution PDF
—n=20 p=0.1
—n=20 p=0.5
—n=20 p=0.9
Probabiity

Random Variable

ImageCredits: https:/www.boost.org

The meanandvariance of a binomialdistribution are given by:

Mean = E(X)=n*p
Variance = V(X) =n*p*(1—p)

Do you notice something from these formulas? Wecan observethat the Bernoulli Distribution
that wesawearlier was just a special case of Binomialif we set n=1.

Binomial Probability Distribution | N= 10, P=0.5


Probability

P(B <= 2) = 0.055. P(B >= 8) = 0.055.

a
a 3 4 5
Number of Successes
6
Ne
Example of a binomial distribution chart. Image Credits: https:/|ww.spss-tutorials.com/binomial-test/

Wearenot doneyet! We have 3 moredistributionsto cover; and westill have to learn about
the most important continuousdistribution. To the next lesson!
Typesof Distributions - Normal

We'll coverthe following a

* 4, Normal Distribution (Gaussian)

4. Normal Distribution (Gaussian)


Anormaldistribution, the bell curve or Gaussian Distribution,is a distribution that represents
the behaviorin mostsituations. For example, exams scores aretypically a bell curve where
moststudentsgetthe averagescore, C, a small numberofstudents scoresa B or a D, and an
even smaller numberscoresan or an A. This results in a distribution thatlooks like a bell:

Thebell curve is symmetrical, half of the data will fall to the left of the mean valueand half will
fall to the rightofit.

Wesaythe data is "normally distributed":


‘Mean
leyA Y jon has:has:
‘The NormalDisi tribution
le + mean = median = mode
= symmetry about the center
+ 50% of values less than the mean
Eee and 50% greater than the mean

ImageCredits: https:/www.mathsisfun.com

Manythingsfollowthis type of spread,this is whythis is used mostoften; you’d seen anywhere


from businesses to academia to government. Here are some examplesto give an ideaof the
variety covered by normaldistributions:

« Heights of people
« Blood pressure
* IQ scores

Salaries
* Size of objects produced by machines

Somefacts to rememberabout whatpercentageofour datafalls within a certain numberof


standarddeviations from the mean:

* 68% ofvaluesare within 1 standard deviation of the mean


* 95% ofvaluesare within 2 standard deviations of the mean
¢ 99.7% ofvalues are within 3 standarddeviations of the mean

ImageCredits: University ofVirginia.

Whyis it good to know standarddeviations from the mean?

Because wecan saythat anyvalueis:

likely to be within 1 standarddeviation (68 out of 100 should be)


* very likely to be within 2 standarddeviations (95 out of 100 should be)
« almostcertainly within 3 standard deviations (97 out of 100 should be)

The numberof standard deviations from the meanisalso called the standardscoreorz-score.
Z-scores are a way to compareresults from test to a “normal” population.

Say wearelooking at a survey about heights. A z-scorecantell us wherea person’s heightis


comparedto the average population’s mean height.A scoreofzerotells us that the value
exactly matchesthe average, whilea scoreof+3tells us that the value is much higher than
average.

The meanandvariance of a random variable, X, whichis said to be normally distributedis


given by:

Mean = E(X) =
Variance = V(X) =
where is the mean value and the standard deviation.

Thez-scoreis then given by:

rp
2 b

Standardize

950 970 990 1010 1030 1050 1070 -3-2 -1 0 +1 42 43


A NormalDistribution The Standard NormalDistribution

ImageCredits: https:/www.mathsisfun.com/data/standard-normal-distribution.htm|

All normal distributionsdo not necessarily have the same meansandstandarddeviations. A


normal distribution with a mean of0 and a standard deviation of1 is called a standard
normal distribution. If all the valuesin a distribution are transformed toZ scores, then the
distribution will have a mean of 0 and a standard deviationof1. This processoftransforming a
distribution to one with a meanof 0 and standarddeviationof1 is called standardizing the
distribution. Standardizing can help us make decisions aboutour data moreeasily.

Note: This is the mostimportant continuous random distribution. So, make sure you
understandit well before moving on to the nextlesson.
Typesof Distributions - Poisson and Exponential

We'll coverthe following a

5. Poisson Distribution

6. Exponential Distribution

5. Poisson Distribution
Thisdistribution gives us the probability of a given numberof events happeningin a fixed
interval oftime.

Say we have the numberofbreadssold by a bakery every day.If the average number for seven.
days is 500, we can predict the probability of a certain day having moresales, e.g., more on
Sundays. Another example could be the number of phonecalls received bya call center per
hour.Poisson distributions can be used to makeforecasts about the number ofcustomers or
sales on certain days or seasons ofthe year.

Why aresuch forecasts important?

Think aboutit: if more items than necessary arekept in stock, it means loss for the business.
Onthe other hand, under-stocking wouldalso result in loss because customers need to be
turned awaydueto not having enough stock. Poisson can help businesses estimate when
demandis unusually high so that they canplanfor the increase in demandin advance while
keeping wast of resources to a minimum. However,its applications are not only forsales or
specifically business related, somedifferent kinds of examples could be forecasting the number
of earthquakes happening, next month ortraffic flow andideal gap distance.

Fora distribution to be called a Poissondistribution,the following assumptions need to be in


place:

« The numberof successes in twodisjoint time intervals is independent.


« The probability of a success during a small-timeintervalis proportionalto the entire
length of the timeinterval. The probability of success in an interval approaches zero as the
interval becomes smaller.

The probability distribution representing the numberofsuccesses occurringin a given time


interval is given by this formula:

eh
P(X) = zl

where,
x=0,1,2,3..,
e =the natural numbere,
j= mean numberofsuccesses in the given timeinterval,
X, Poisson Random Variable, is the numberofevents in a timeinterval and P(X)is its
probability distribution (probability massfunction).

The mean,the expected numberof occurrences,y, can be calculatedas:

f=At
where,
is the rate at which an eventoccurs,
tis the length ofa time interval

Let’s look at a graphical representation ofa Poisson distribution, and howitvaries with the
changein expected numberof occurrences:
-eer
cee
fl
°
ie8

Thehorizontal axis is the index k, the numberof occurrences. A is the expected numberof occurrences, which need not be an
integer. The vertical axis is the probability of k occurrencesgiven A. The functionis defined only at integer values ofk.

6. Exponential Distribution
Exponential distribution allowsus to go a step further from thePoisson distribution. Say we
are using Poisson to model the numberof accidents in a given time period. Whatif we wanted
to understandthetime interval between the accidents? This is where exponential distribution
comes intoplay; it allows us to modelthetime in between each accident.

Someother examples ofquestions that can be answered by modeling waiting times:

« How muchtimewill go by before a majorearthquakehits a certain area?


« Howlong will a car componentlast beforeit needs replacement?

The probability density function for an exponential distributionis given by:

f(a) =e"
where,
e =the natural numbere,
A= mean time betweenevents,
xX =a randomvariable

A graphical representation ofthe density function for varying values of the mean time between
events looks likethis:
18
probability density
0s 10°
L
0.0

Wecan observethat thegreater the rate of events, the faster the curve drops, andthe lower the
rate, the flatter the curve.

Phew!This wasthelastone. In the nextlesson wearegoingto putall of this together and


recap everything to makesure youfirmly grasp these concepts.
Probability Distributions Recap

We'll coverthe following =A

* Recap With Hints


* Final Thoughts

Recap With Hints


Wehaveseen six major probability distributions. Let’s recap therelations among themsothat
wecanget a better understanding of whatto use and whentouseit.

From left to right: Uniform, Normal, Poisson, Exponential

Thefirst step is to identify whether weare dealing with a continuousor discrete random
variable. Once that’s done, here are somehelpful hints to proceed from there:

+ If we see a Normal (Gaussian) Distribution, we should go forit because there are many
algorithms that, by default, will perform well specifically with this distribution;it is the
mostwidely applicable distribution — it’s called “Normal” for a reason!

« A Poisson Distribution is similar to the Normal distribution, but with an addedfactor of


skewedness. When the skewednessis low, a Poissondistribution will havea relatively
uniform spreadin all directionsjustlike the Normal distribution.

* Binomial distribution when dealing with the numberofsuccesses in trials.

« Bernoulli distributionis a special case of Binomial distribution with singletrial.

* Poisson is about the numberofevents in an interval of space ortime.

« Exponential distribution is handy when dealing with time between events.

* Categorical variables can be easily interpreted using a Uniform distribution.

zy
“T always fee! so normal, so bored, you know. Sometimes I would
like to do something... you know... something... mmm... Poissonian.”

Image Credits: http:/www:thescientificcartoonist.com/

* Outofthesix distributionsthat we have seen, makesure to havea firm grasp over at


least the Uniform and Normal/Gaussiandistributions.

Before westop talking aboutdistributions,let’s see some code examplesforplotting


distributions in Python. Wecaneasily use the distplot() method from the seabornlibrary as
follows:

# Import libraries
wavanaune

import numpy as np
import seaborn as sns
# Create some random fake data
x = np.random.random(size=100)
# Plot the distribution
sns.distplot(x);

Distribution plot for randomly generated data

# Create some data that follows a Normal Distribution


wewne

Xx = np.random.normal(size=100)
# Plot the distribution
sns.distplot(x);

Distribution plot for data that follows a Normaldistribution

04
03
os 02

on
00 025 000 025 050 075 100 125 00 3 2 2 0 if 2 3 4

Distribution plots: Randomly distributed data (left) and Normally distributed data (right)

Final Thoughts
Probability distributionsarea toolthat you musthavein your data scientist’s toolbox, you will
need them at onepointor another! In these lessons we have donea deepdivein six major
distributions and learned abouttheir applications. Now, you do not need to memorizetheir
functionsandall the nitty-gritty details. However,it is importantto be ableto identify,
relate and differentiate among thesedistributions.

Statistical significance is a term thatis often thrown around withoutactually understanding


whatit means. In the nextlesson weare goingto learn aboutStatistical Significance, a
wildly misunderstood concept. So, keep going! We are almostat theendofthis section on
Statistics! @
Statistical Significance

We'll coverthe following a

* WhatIs Statistical Significance?


* Components ofStatistical Significance
* 1. Hypothesis Testing
* 2.Normal Distribution
° 3. P-value
* Example
© Final Thoughts

WhatIs Statistical Significance?


Statistics is guesswork based on mathematics. It is not an exact science, so when dealing with
statistical results, we need to know howcloseourguessis to reality. When someoneclaimsthat
somedata proves their point, we can’tjust accept it, asif all the juggling with complexstatistics
led to results that can’t be questioned! Weneed tousestatistical significance to reach
conclusionsinstead. Statistical significance is a measure of whetherour findings are
meaningfulor just a result of random chance.

As with most skills and conceptsin life, breaking things downinto sub-skills or basic
componentsis a great way to approachlearning. In this lesson weare going to chunkstatistical
significance into its base componentsandthen put all the pieces together to understandthis
conceptin anintuitive bottom-up approach.

Componentsof Statistical Significance


Statistical significance can be broken downinto three base components:

« Hypothesis Testing
¢ NormalDistribution
° P-values

Wewill first understand these three componentstheoretically and then wewill putit all
togetherwith the helpof a practical example.

1. Hypothesis Testing
Hypothesistesting is a technique for evaluating a theory using data. The hypothesis is the
researcher’s initial belief aboutthe situation before the study. The commonlyaccepted factis
knownasthe null hypothesis while the opposite is the alternate hypothesis. The researcher's
taskis to reject, nullify, or disprove the null hypothesis. In fact, the word“null” is meant to
implythat it’s a commonly accepted fact that researchers work to nullify (zero effect).

For example, if we consider a study aboutcell phones and cancerrisk, we might have the
following hypothesis:

« Null hypothesis: “Cell phones have noeffect on cancerrisk.”


« Alternative hypothesis (the one underinvestigation): “Cell phones affect the risk of
cancer.”

tTfol) PNR
cell phone use of cell phone use

Image Credits: Penn State's SC200coursesite, https://sites.psu.edu/siowfa15/2015/09/30/can-cell-phone-usage-cause-


cancer/

Thesestudies can be anything from a medicaltrial to a study evaluating customer retention.


The common goal amongthesestudies is to determine whichofthe two hypothesesis better
supported by the evidence foundfrom our data. This meansthat we needto be able to test
these hypotheses— thetesting part of hypothesistesting. How do wedotest?

There are many hypothesis tests that work by making comparisonseither between two groups
or between one group andtheentire population. Weare goingto look at the most commonly
used z-test.

Does z-test ring anybells? In the previous lessons, we cameacrossthe concept ofz-scores while
learning about normal distributions. Remember?? Thesecondbuildingblockof statistical
significance is built upon normaldistributions andz-scores. Ifyou need a refresher, before
continuingfurther, revisit the section on normal distributions.

2. Normal Distribution
As welearnedearlier, the normaldistributionis used to representthedistribution of our data
andit is defined by the mean,u (centerof the data), and the standard deviation, o (spread in
the data). These are two important measures becauseany point in the data can then be
representedin termsofits standard deviation from the mean:

99.7%ofthe data are within


3 standard deviations ofthe mean. >|
95% within
2 standard deviations
68% within
|—1 standard —|
deviation

a er a

For the normaldistribution,the values less than onestandard deviation away from the mean account for 68% of the set;
while two standard deviations from the mean account for 95%; and three standard deviations accountfor 99.7%. Image
Credits: Wikipedia

Standardizing theresults by using z-scores whereyou subtract the mean from the data point
anddivide by the standard deviationgivesus the standard normal distribution.

From z-scoreto z-test: A z-test is a statistical techniqueto test the Null Hypothesis against the
Alternate Hypothesis. This technique is used when the sampledata is normally distributed and
the population size is greater than 30. Why 30?

Accordingto the Central Limit Theorem as the sample size grows and numberof data points
exceeds 30, the samples are considered to be normally distributed. So wheneversamplesize
exceeds 30, we assumedata is normally distributed and wecanusethez-test.

As the nameimplies, z-tests are based on z-scores, which tell us where the sample meanlies
comparedto the population mean:
&|
als

where,
&: mean of sample,
4 mean of population,
o: standarddeviation of the population,
mn: numberof observations

z-scores on the extremes — higher endor lowerend- indicatethat ourresult is meaningful


becauseitis less likely to have occurredjust by chance.

But what determines howhigh the high should be and howlowthe low should be in orderfor us
to accept the results as meaningful?

To quantify the meaningfulness ofourresults, we need to understand thethird componentof


statistical significance, p-values.

3. P-value
The p-value quantifies the rarenessin our results. It tells us how often we'd see the numerical
results of an experiment (our z-scores)if the null hypothesisis true andthere are no
differences between the groups. This meansthat wecanusep-values to reach conclusions in
significance testing.

Morespecifically, we compare the p-valueto a significancelevel a to make conclusions about


our hypotheses:

If the p-value is very small or lowerthanthesignificance level we chose,it means the


numbers wouldrarely occur by chancealone, and wecanreject the null hypothesis in favor of
the alternative hypothesis. On the other hand,if the p-valueis greater than or equalto the
significancelevel, then wefail to reject the null hypothesis. This doesn’t mean weaccept the
null hypothesis though!

More likely observation


ma
2 P-value
25
3
B Very unlikely Very unrlikely
3 observations observations
g
eé/4 Observed ro
data point
<—________'—____»
Setof possible results

A p-value(shadedgreenarea) is the probability of an observed


(or more extreme) result assuming thatthe null hypothesis is true.

Image Credits: Wikipedia

But where doesthis a come from?

Althoughthe choiceof a dependson the situation, 0.05 is the most widely used valueacrossall
scientific disciplines. This meansthat p<.05 is the threshold beyond whichstudyresults can be
declaredto bestatistically significant,i.e., it’s unlikely the results were a result of random
chance. If we run the experiment100 times, we’d see these same numbers, or more extreme
results, 5 times, assuming the null hypothesisis true.

Again, a p-value ofless than .05 meansthat thereis less than a 5% chanceof seeing our
results, or more extremeresults, in the world wherethe null hypothesisis true.

Note that p<.05 does not meanthere’s less than a 5% chancethat our experimental results are
due to random chance.Thefalse-positive rate for experiments can be muchhigher than 5%!

P-VAWE INTERPRETATION
0.001
0.0102 |HIGHLY SIGNIFICANT
0.
0.03 |
aoe --SeNFIONT
0.0.040590_}— O41 CRAPATRIOENSDO.
0.05) THE EDGE e
doe JOF eurcrnC
007 | HIGHLY SUGGESTIVE,
0.08 |_SGNIRCANTe THE
dor Pow
0.077) Hey wok
0.1 THIS, Hexen
‘SUBGROUP ANALYSIS

ImageCredits: https://xkcd.com/

Note: Sincethis is a tricky concept that most get wrong butis important to understandit
well, again: p-value doesn’t necessarilytell us if our experimentwasa successornot,it
doesn’t prove anything!It just gives us the probability that a result at least as extreme as
that observed would haveoccurredif the null hypothesis is true. The lower the p-value,
the moresignificantthe result becauseit is less likely to be caused bynoise.

From z-scoreto p-value:Thez-scoreis called our test-statistic. Once we havea test-statistic,


wecaneitheruse the old-fashioned approach oflookingat tables or use any programming
language ofour choice to convert z-scores into p-values. For example, in Python, we can use
SciPy library’s scipy.stats module that provides many handystatistical functions.

Now, putting it all together; if the observed p-valueis lower than the chosen threshold a, then
weconcludethat theresultis statistically significant.

As a final note, an importanttake awayis that at the endof the day calculating p-values is not
the hardest parthere! Therealdealis to interpret the p-valuesso that we can reach
sensible conclusions. Does 0.05 work asthe threshold foryour study or should you use0.01 to
reach any conclusions instead? And whatis ourp-valuereally telling us?

Example
Let’s putall the pieces togetherby looking at an example fromstartto finish.

A company claimsthatit has a high hiring bar whichis reflected in its employees having an IQ
above the average. Say a random sample of their 40 employees has a mean IQscoreof 115.Is this
sufficient evidence to support the company’s claim given the mean populationIQ is 100 with a
standarddeviation of 15?

1. State the Null hypothesis: the accepted factis that the population meanis 100 — Hp: u=
100.

2. State the Alternate Hypothesis: the claim is that the employees have above average IQ
scores — Hy: u > 100.

3. State the threshold for the p-value — level: we will stick with the most widely used
value of 0.05.

4. Find thetest statistic using this formula forthez-test:

115 — 100
= 6.32
15/40
The company meanscoreis 115, which is 6.32 standard error units from the population mean
of 100.

5. Get the p-value from thez-score: Using an onlinecalculator for converting z-scoresto p-
values,wesee that the probability of observing a standard normal value below 6.32 is <
.00001.

6. Interpret the p-value: our result is significantat p < 0.05, so we can reject the null
hypothesis — the 40 employees of interest have an unusually higher IQ score comparedto
random samplesofsimilar size from theentire population.

Final Thoughts
There were quite a few concepts in this lesson. To makesurethat our understandingis crystal
clear, wewill engrain these concepts with some exercises.

Also,this was thelast lesson on Statistics, so well-done for having come this far §Q Let’s keep
going, there are fun Machine Learninglessons awaiting us ahead!
Quiz: Statistics

We'll coverthe following a

Timeto Test Your Skills!

° 1. Basics

2.Statistical Significance

Time to Test Your Skills!

1. Basics
Forthegiven list of numbers,stored in a variable data, computeits basic statistical
features andstore the results in the given variables. You can use NumPy to calculate
these values.

1 import numpy as np
2
3. # Input list
4 data = [23, 57, 10, 10, 12, 35, 2, 74, 302, 10]
5
6 # Repalce the “None” values with your solutions
7 # Use NumPy to calculate the values for each variable
8 mean = None
9 median = None
19 standard_deviation = None
12. print("Mean is ", mean)
13 print("Median is ", median)
14 print("SD is ", standard_deviation)

Era) Hide Solution a

Solution eR o

1 mean = np.mean(data)
2 median = np.median(data)
standard_deviation = np.std(data)

‘The HR committee of a company wants to determinethe average number of employees per


teamin their company. There are 50 teams in the company. They dividethe total numberof
employeesby 50 anddeterminethatthe average numberof employeesper team is 4.2. Which
ofthe following mustbe true?

© Ay Thereare total of 210 employeesin the company

© 8) The most common numberof employees per team is 4.2.

©. o©Halfof the teams have more than children.

Reset Quiz C

2. Statistical Significance
Interpreting P-Values

Say weare workingona study thatteststhe impactof smoking onthe duration of pregnancy. Do
womenwho smokeruntherisk of shorter pregnancy and prematurebirth? Ourdatatells usthat
the meanpregnancylength is 266 days and wehavethe following hypothesis:
Null hypothesis, Ho: = 266
Alternate hypothesis, Ha: < 266
Wealso havedata from a random sample of 40 women who smokedduringtheir pregnancy. The
meanpregnancylengthofthis sample is of 260 days with a standard deviation of21 days. The z-
scoretells us thatthe p-value inthis case is 0.03.
Whatprobability does the p=0.03 describe? Based ontheinterpretationof the p-value, select
whether the given statements are Valid or Invalid.

a ‘There is a 3% chance that women who smokewill have a mean pregnancyduration of 266
days.

O Ayvalid

©. 8) invalia

Question 1 of2 (5
Reset Quiz C O attempted
Introduction

Machine learning (ML)is a term that is often thrown aroundas if it is some kind of magic that
once appliedto yourdata, will create wonders! If we lookatall the articles about machine
learning on planet Internet, wewill stumble uponarticles of two types: heavy academic
descriptionsfilled with complicated jargon orfluff talk about machine learning being a magic
pill.

THIS IS YOUR MACHINE LEARNING SYSTEM?


YUP! YOU POUR THE DATA INTO THIS BIG
PILE OF LINEAR ALGEBRA, THEN COLLECT
THE ANSWERS ON THE OTHER SIDE.
WHAT I THE ANGERS ARE WRONG? )
JUSTSTIR THE PILE UNTIL
THEY START LOOKING RIGHT.

ImageCredits: https:/ixked.com

In these series of lessons, weare going to havea simple introductionto the subject so that we
can grasp the fundamentals well. We will diveinto the practical aspects of machine learning
using Python’s Scikit-Learn package via an end-to-endproject.

Pleasenotethatthis is not meant to be a comprehensive introductiontothe field of


machinelearning; thatis a large subject andnecessitates a full course ofits own!

Thegoals ofthis series of lessonsare:

« To introduce the fundamental concepts of machine learning.


« To learn aboutseveralof the mostimportant machinelearning algorithms and develop an
intuition into how they work and whenand wheretheyare applicable.
« To get an understanding of whatarethe necessary steps and how theycan be applied toa
machinelearning project via a real end-to-end example.

But before we continue, you might be asking yourself, "What’s really the difference between
Data Science and Machine Learning?!"

The twofields do havea big overlap, and they often sound interchangeable. However,if we
wereto consider an oversimplified definition of the two, wecouldsaythat:

« Data scienceis used to gain insights and understanding of the data.


« Machinelearning is used to produce predictions.

Thatsaid, the boundary betweenthetwois nota distinct one; mostpractitioners needto be


capable of switching back andforth between the two whichis whywearegoing to diveinto
machinelearning in a data science course.
Understanding Machine Learning

We'll coverthe following a

WhatIs Machine Learning?


* Main Components of Machine Learning
* Machine Learning Applications

What Is Machine Learning?


Machine Learningis essentially about teaching computersto learn from data:

MachineLearningis thefield of study that gives computersthe ability to learn without


being explicitly programmed.

Theidea is thatthere are generic algorithmsthat cantell you somethinginteresting about a


set of data without having to write any custom codespecific to the problem. Instead of
writing explicit code, you feed data to the genericalgorithm andit builds its own logic based
onthedata.

Let’s say we wantto recognize objectsin a picture. In the old days programmers would have
hadto write code for every object they wanted to recognize,e.g., person,cat, vehicles. This is
not a scalable approach. Today, thanks to machinelearning algorithms, one system can learn to
recognize both by just showing it many examples of each. For instance, the algorithm is able to
understandthata cat is a cat by looking at examples of pictures labelled as “this is a cat” or
“this is not a cat”, and by being corrected every timeit makes a wrongguess aboutthe object in
thepicture. Then,if showna seriesof newpictures,it begins to identify cat photos in the new
set just like a child learnsto call a cat a cat and a dog a dog.

This magicis possible because the system learns basedon the propertiesof the object in
question, a.k.a. features.

For example, while learningto distinguish


between apples and oranges in a very
rudimentary way, color could be used as a
feature, andall the red colored fruits would
thenget assigned as “apple” while the ones
with an orangecolor would get labelled as
“orange”.

Your spamfilter is another example of a Spam


machinelearning program. Thereis no emai B
explicit algorithm, but given enough examples SK Machine Leaning
Model
of spam and non-spam emails, the generic ‘Not Spam
machinelearning algorithm can automatically
ee
learn to flag spam emails. This is achieved by
detecting specific patterns, e.g., occurrence of
certain words andphrases in spam emails
compared to non-spam examples. The greater
thevariety in the samples we provide to our
algorithms,theeasieritis to find relevant
patterns andpredict correctresults.

Main Components of Machine Learning


Based on our examples, can you spot the three main components of machine learning?
Basically, we need three componentsto train our machinelearning systems:

« Data: this is why datais being called the newoil! Data can becollected both manually and
automatically. For example,users’ personal details like age and gender,all their clicks, and
purchasehistory are valuable data for an onlinestore. Do yourecall “ReCaptcha” which
forces you to “Selectall the street signs”? That’s an example of some free manual labor!
Data is not always images;it could be tables of data with manyvariables (features), text,
sensor recordings, sound samples etc., depending on the problem at hand.

TO COMPLETE YOUR REGISTRATION, PLEASE. TELL US


WHETHER OR NOT THIS IMAGE. CONTAINS A STOP SIGN:

ANSWER QUICKLY—OUR SELF-DRIVING


CAR IS ALMOST AT THE INTERSECTION.

50 MUCH OF ‘Al’ 5 JUST FIGURING OUT WAYS


TO OFFLOAD WORK ONTO RANDOM STRANGERS.
ImageCredits: xked.com

« Features: featuresare often also called variables or parameters. Theseare essentially the
factors for a machineto look at — theproperties ofthe “object” in question,e.g., users’ age,
stock price, area ofthe rental properties, numberof wordsin a sentence,petallength,size
ofthe cells.
Choosing meaningful features is very important. Continuing with our example of
distinguishing apples from oranges, say we take bad featureslike ripeness and seed count.
Since these are not really distinct properties ofthe fruits, our machine learning system.
won't be ableto do a good job atdistinguishing between apples and oranges based on
these features.
Rememberthat it takes practice and thoughtto figure out what features to use as they are
not alwaysasclear asin this trivial example.

Algorithms: Machinelearningis based on general purposealgorithms. For example, one


kindof algorithm is classification. Classification allowsus to putdata into different groups.
Theinteresting thingis that the sameclassification algorithm used to recognize
handwritten numberscouldalso be used to classify emails into spam and not-spam
without changing a line of code! Howisthis possible? Although the algorithm is the same,
it’s fed different inputdata, so it comes up with differentclassification logic. However,this
is not meantto implythat one algorithm can be used to solveall kinds of problems! The
choice ofthe algorithm is made based onthe type of problem at hand,e.g., are we working
withpredicting stock prices or do we wantto assign labelslike spam or not-spam? Wewill
learn thedetails in the coming sections.
Thechoice ofthe algorithm is importantin determining thequality of the final machine
learning model. However,onevery important thing to rememberisthatif the data is
crappy, even thebest algorithm won’t help. Garbagein, garbage out is what they always
say. This is why acquiring as much dataas possible is a very importantfirst step in getting
started with machinelearning systems.

Machine Learning Applications


Can you think of some examplesof Machine Learning that you use everyday?

Here are somepopular applications:

Virtual Personal Assistants: Siri, Cortana, Alexa, Google Now

Finance: Frauddetection, prediction and execution oftrades at speeds and volumesthat


humanscan’t compete with.
Social Media: Face Recognition, People You May Know,Pages You MightLike.
Retail: Product Recommendations; maximization of revenue by learning customers’
habits.
* Online customer support: Customer supportrepresentatives are being increasingly
replaced by chatbots.
Medicine: Medical diagnosis, drug discovery, understandingofrisk factors for diseases in
large populations.
Search Results: When you search on Google, the backend keeps an eye on whether you
clicked on thefirst result or wenton to the second page — thedatais used to learn from
mistakes so that relevant information can be found quicker nexttime.

Predictive
Political policing -—‘Surveillance
campaigns sige
Optical character
recognition

Recommendation
engines

Google Ads ‘Autonomous(“selt-


Filtering Personal assistants: driving’) vehicles
algorithms) Google Now, Advertisin g
news feeds i soft Cortana,
Micro pe in
Apple Siri, etc. inteligence

Image Credits: Introduction to Machine Learning- Scientific Figure on ResearchGate. Available from:
https:/Awww.researchgatenet/figure/Machine-Learning-Application_fig1_323108787
Types of Machine Learning Algorithms

We'll coverthe following a

* 1.Supervised Learning
* 2. Unsupervised Learning
» 3. Semi-supervised Learning
* 4, ReinforcementLearning
* Final Thoughts

Machine Learningalgorithmscan be broadly categorized into the followingfour groups:

« Supervised Learning
« Unsupervised Learning
« Semisupervised Learning
« Reinforcement Learning

1.Supervised Learning
In Supervised Learning,thetraining data provided as inputto the algorithm includesthefinal
solutions,called labels or class becausethe algorithm learns by “looking” at the examples with
correct answers. In other words,the algorithm has a supervisoror a teacher whoprovides it
with all the answersfirst, like whetherit’s a cat in the picture or not. And the machine uses
these examplesto learn one by one. The spamfilter is another good exampleofthis.

Anothertypical task, of a different type wouldbe to predict a target numeric valuelike housing
prices from setoffeatureslike size, location, numberof bedrooms. Totrain the system, we
again need to provide manycorrect examples of knownhousingprices,including both their
features andtheir labels.

Whilecategorizing emails or identifying whether thepictureis of a cat or a dog was a


supervised learning algorithm oftypeclassification, predicting housing prices is known as
regression. What’s the difference?

In regression the outputis a continuousvalueor a decimal numberlike housingprices. In


classification, the outputis a label like “spam or not-spam” andnot a decimal number;the
outputonly takes valueslike 0 or 1 where wecould have1 for “spam” and0 for “non-spam”.
Basically, the type of algorithm wechoose(classification or regression) depends on the
type of output we want.

Examplesof Supervised Learning Algorithms:

« Linear Regression
« Logistic Regression
« SupportVector Machines
¢ Decision Trees and Random Forests
« k-Nearest Neighbors
¢ Neural networks

While the focusofthis lesson is to learn aboutthe broadcategories, wewill be diving deeper
into each ofthese algorithms individually in the "Machine Learning Algorithms"lesson.

2. Unsupervised Learning
In Unsupervised Learningthe data has nolabels; the goal of the algorithm is to find
relationshipsin the data. This system needsto learn withouta teacher. For instance, say we
havedata about a website’s visitors and we wanttouseit to find groupingsofsimilarvisitors.
Wedon’t know andcan’t tell the algorithm which groupa visitor belongs to;it finds those
connections withouthelp based on somehidden patternsin the data. This customer
segmentation is an example of what is knownasclustering,classification with no predefined
classes and based on some unknownfeatures.

Another well-knownusecaseis image compression. Whensaving an image,if weset the


palette, let’s say, to 32 colors, clusteringwill findall the “blueish” pixels, calculate the “average
blue” andset it for all the blue pixels. This helps us in achieving a lowerfile size.

Examplesof Unsupervised Algorithms:

* Clustering: k-Means
« Visualization and dimensionality reduction

* Principal Component Analysis (PCA), t-distributed


* Stochastic Neighbor Embedding (t-SNE)
« Association rule learning: Apriori

3. Semi-supervised Learning
Semi-supervised learning deals with partially labeled training data, usually a lot of unlabeled
data with somelabeled data. Most semi-supervised learning algorithmsare a combination of
unsupervised and supervised algorithms.

Google photosis a good exampleofthis. In a set of family photos, the unsupervised partof the
algorithm automatically recognizes the photos in which eachof the family members appears.
For example,it can tell that person A appearsin picture 1 and 3 whileperson B appearsin
picture 1 and 2. After this step, all the system needsfrom us is onelabel for each person and
thenthe supervisedpart of the algorithm can nameeveryone in every photo. Bingo!

4. Reinforcement Learning
Reinforcement Learningis a special and more advanced category wherethe learning system or
agentneedsto learn to makespecific decisions. The agent observes the environmentto which it
is exposed,it selects and performsactions, and gets rewardsorpenalties in return. Its goalis to
choose actions which maximize the rewardovertime. So, bytrial anderror, and based on past
experience,the system learnsthebeststrategy, called policy, on its own.

A good exampleof Reinforcement Learning is DeepMind’s AlphaGo. The system learned the
winning policy at the gameofGo by analyzing millions of games andthen playing againstitself.
At the championshipof Go in 2017, AlphaGo wasableto beat the human world championjust
by applying thepolicy it had learned earlierbyitself.

Environment s
@ ov<erve
Select action
using policy

© Action:
Get reward
or penalty

Update policy
{learning step)
erate until an
© optimal poticy is
found

Image Credits: https:/ww.matutitech.com/businesses-reinforcement-learning/

TAXONOMYOF MACHINE LEARNING METHODOLOGIES


Meaningful
Compression structure mage Customer Retention
Disovery Classification
Big Data Fetore
Visualization tation Gomis

Recommender Unsupervised
Machine Ere)
[Advertising
Prediction Populaiy
‘systems
fer 4 Learning en)
forecasting
Targeted Market
Marketing Forecasting
Real-Tine stimating
Decisions Ute Expectangy

Robot
ae SKILAcquiston|

Figure 10: An overview of machine learning techniques; Source: Jha, V.

ImageCredits: DHL, Artificial Intelligencein Logistics, 2018 (Pdf, 45 pp, No opt-in).

Final Thoughts

This wasa gentle introduction to Machine Learning. Hopefully, you are excited to learn more
aboutthis cool subject! Now that weare familiar with the broad types of machinelearning
algorithms,in the next lesson, weare goingto diveinto the specifics of individual machine
learning algorithms.
Machine Learning Algorithms|

We'll coverthe following a

© Introduction
* 1. Linear Regression
* 2. Logistic Regression
* 3. Decision Trees
* 4, Naive Bayes
* 5. Support Vector Machine (SVM)

Introduction
In this lesson, weare going to learn about the most popular machinelearning algorithms. Note
that weare not going to do a technical deep-diveasit would beout ofourscope. Thegoal is to
coverdetails sufficiently enough so that you can navigate through them when needed. The key
is to know aboutthedifferent possibilities so thatyou can then go deeper on a need’s
basis.

In this algorithm tour weare goingto learn about:

1 . Linear Regression

2. . Logistic Regression
3. . Decision Trees

4. . Naive Bayes

5. Support Vector Machines, SVM

6. K-Nearest Neighbors, KNN

7. K-Means
8. Random Forest
9. Dimensionality Reduction
10. Artificial Neural Networks, ANN

1. Linear Regression
LinearRegression is probably the most popular machinelearning algorithm.

Rememberin high school when you hadto plotdata points on a graph with an X-axis and a Y-
axis and then find theline ofbest fit? That was a very simple machinelearningalgorithm,
linear regression. In moretechnical terms, linear regression attempts to representthe
relationship between oneor moreindependent variables (points on X axis) and a numeric
outcomeor dependentvariable (value on axis) by fitting the equation of a line to the data:

Y=axX+b

Example of simple linear regression, which has oneindependent variable (x-axis) and a dependent variable (y-axis)

For example, you mightwantto relate the weights (Y) of individuals to their heights (X) using
linearregression. This algorithm assumesa strong linear relationship between input and
output variables as we would assumethatif height increases then weight also increases
proportionally in a linear way.

The goalhereis to derive optimal valuesfor a and b in the equation above,so that our
estimated values,Y, can be asclose as possible to their correctvalues. Note that we know the
actual values for Y duringthe training phase because wearetrying to learn our equation from.
the labelled examples given in thetraining dataset.

Once our machine learning modelhas learned thelineofbest fit via linear regression,this line
can then be usedto predictvalues for newor unseendatapoints.

Different techniques can heused to learn the linear regression model. The most popular
methods is thatof least squares:

Ordinary least squares: The method of least squares calculates the best-fitting line such that
thevertical distances from each data pointto the line are minimum. Thedistances in green
(figure below) should be kept to a minimumsothat thedata points in red canbe as close as
possible to the blueline(line ofbestfit). If a pointlies on thefitted line exactly then its vertical
distance from thelineis 0.

To be morespecific, in ordinary least squares, the overall distance is the sum ofthe squares of
thevertical distances (greenlines) for all the data points. Theideais to fit a model by
minimizing this squared erroror distance.

In linear regression, the observations(red) are assumed to bethe result of random deviations (green) from an underlying
relationship (blue) between a dependentvariable (y) and anindependent variable (x). While finding theline of best fit, the
goalis to minimize the the distance shownin green-- red points as close as possible to theblueline.

Note: While using libraries like Scikit-Learn, you won’t have to implement any of
thesefunctionsyourself. Scikit Learn provides out of the box modelfitting! We will
see this in action once we reach our Projects section.

Whenwearedealing with only one independentvariable,like in the example above, wecall it


Simple Linear Regression. Whenthere are morethan one independentvariables,(e.g., we want
to predict weight using morevariables than just the person’s height) then this type of
regressionis termed as Multiple Linear Regression. As showninthefigure below,while finding
theline ofbest fit, we can use a polynomial, or a curvedlineinstead of a straightline; this is
called PolynomialRegression.

Linear Regression (green) vs Polynomial Regression(red)

2. Logistic Regression
Logistic regression has the same mainidea as linearregression. Thedifferenceis thatthis
techniqueis used whenthe outputor dependentvariable is binary meaning the outcome can
haveonly twopossible values. For example, let’s say that we wantto predictif age influences
the probability of having a heartattack.In this case, ourpredictionis only a “yes” or “no”, only
twopossible values.

In logistic regression,theline ofbest fit is not a straight line anymore. Theprediction for the
final outputis transformed using a non-linear S-shaped functioncalledthelogistic function, gO.
Thislogistic function mapsthe intermediate outcomevaluesinto an outcomevariable Y with
values ranging from 0 to 1. These 0 to 1 values can then be interpreted as the probability of
occurrenceofY.

In the heart attack example, we havetwoclasslabels to be predicted, “yes” as 1 and “no”as 0.


Say weset a threshold orcut-off valueto 0.5, all instances with a probability higher than this
thresholdwill be classified as instances belonging to class 1 , while all those having probability
belowthe threshold will be assignedto class 0. In other words, the propertiesof the S-shaped
logistic function makelogistic regressionsuitable forclassification tasks.

Probabof passing exam versushours ofstaying

oursoning

Graph ofa logistic regression curve showing probability of passing an exam versus hours studying

3. Decision Trees
Decision Trees also belongto the category of supervised learning algorithms, but they can be
used for solving both regression andclassification tasks.

In this algorithm,thetraining modellearns to predict values of thetarget variable by learning


decision rules with a tree representation. A tree is madeupof nodes corresponding to a feature
or attribute. At each node weask a question about the data based on theavailable features,
e.g., Is it raining or not raining?. Theleft and right branches representthe possible answers.
Thefinal nodes,leaf nodes, correspondto a class label/predicted value. The importance for
each feature is determined in a top-down approach — the higherthe node, the more
importantits attribute/feature. This is easier understood with a visual representation of an
example problem.

Say we wantto predict whetheror not we should waitfora table at a restaurant. Below is an
example decisiontree that decides whetheror not to wait in a given situation based on
differentattributes:

Patrons
SomeFall
No Yes WaitEstimate
<1l0
30460

N Alternate Hungry Yes

No Yes No Yes

Reservation Fri/Sat Yes Alternate


No, Yes % Yes so\

Bar Yes No Yes| |Yes Raining


No i Yes Yes No

No Yes all leaves Yes or NO Ves No

ImageCredits: http:/Awww.cs.bham.ac.uk/~mmk/Teaching/Al/

In this example,our attributes are:

« Alternate: alternative restaurant nearby


¢ Bar: bar area to wait
Fri/Sat: true on Fridays and Saturdays
« Hungry: whether weare hungry

* Patrons: how manypeople in restaurant (none, some,orfull)


« Raining: raining outside

« Wait-Estimate: estimated waiting time (<10,10-30,30-60,>60)

Our data instances are thenclassified into “wait” or “leave”based ontheattributes listed
above. From thevisual representationof the decision tree, we canseethat “wait-estimate” is
moreimportantthan “raining” because it is present ata relatively higher nodein thetree.

4. Naive Bayes
Naive Bayesis a simple yet widely used machinelearningalgorithm based on the Bayes
Theorem Rememberwetalked aboutitin theStatistics section?It is called naive because the
classifier assumes that the inputvariables are independentof each other, quite a strong and
unrealistic assumptionfor real data!. The Bayes theorem is given by the equation below:

Pll) = Es
P(a\c) * P(c)

where,
P(c|x) = probability of the eventofclass c, given the predictorvariable x,
P(x|c) = probability ofx given c,
P(c) = probability of the class,
P(x) = probability of the predictor.

To putit simply, the model is composedoftwotypes of probabilities:

« Theprobability of each class;


« The conditional probability for each class given each value of x

Say wehavea training data set with weatherconditions, x, and the correspondingtarget
variable “Played”, c. We canuse this to obtain the probability of “Players will playif it is rainy”,
P(c|x). Note that evenif the answeris a numerical value ranging from to 1, this is an example
of a classification problem — wecanuse the probabilities to reach a “yes/no” outcome.

Naive Bayesclassifiers are actually a popularstatistical technique of spam e-mail filtering. It


worksbycorrelating the use of tokens typically words, with spam and non-spam e-mails and
thenusing Bayes’ theorem tocalculate a probability that an email is spam ornot.

5. Support Vector Machine (SVM)


SupportVector Machines is a supervised algorithm used mainlyfor classification problems.In
this algorithm,weplot each data item as a pointin n-dimensional space, wheren is the number
of input features. For example,with two inputvariables, we would have a two-dimensional
space. Basedonthese transformations, SVM finds an optimal boundary,called a hyperplane,
that best separates the possible outputsbytheir class label. In a two-dimensional space,this
hyperplanecan hevisualized as a line although notnecessarily a straightline. The task of the
SVM algorithm is to find the coefficients that provide the bestseparation ofclasses by this
hyperplane.

The distance between the hyperplaneandtheclosestclass point is called the margin. The
optimal hyperplaneis one that hasthe largest margin that classifies points in such a way that
the distance betweenthe closest data point from both classes is maximum.

In simple words, SVM tries to draw twolines betweenthedata points with thelargest margin
between them. Say weare given plot oftwoclasses, black and whitedots, on a graph as
shownin the figure below. The job of the SVM classifier wouldthen be to decide the bestline
that can separatethe black dots from the whitedots, as shownin thefigure below:

H1 does not separate the two classes. H2 does, but only with a small margin. H3 separates them with the maximal margin.
Machine Learning AlgorithmsII

We'll coverthe following a

6. K-Nearest Neighbors (KNN)


7. K-Means

8. Random Forest

9. Dimensionality Reduction

10. Artificial Neural Networks (ANN)

Final Thoughts

6. K-Nearest Neighbors (KNN)


KNNalgorithmis a very simple and popular technique. It is based on thefollowing idea from
real life: You are the averageof thefive peopleyou most associate with!

KNNclassifies an object by searching throughtheentire trainingset for the k mostsimilar


instances,the k neighbors, and assigning a common outputvariableto all those k instances.
Thefigure below representsa classification example. Thetest sample (green dot) should be
classified either to blue squares orto red triangles. If k = 3 (solid line circle) it is assigned to the
red triangles becausethereare 2 triangles and only 1 squareinsidetheinnercircle. If k = 5
(dashedlinecircle)it is assigned to the blue squares(3 squares vs. 2 triangles inside the outer
circle):

Example of k-NN classification

Theselectionofk is critical here; a small value can resultin a lot of noise and inaccurate
results, while a large valueis not feasible and defeats the purposeof the algorithm.

Although mostly used forclassification, this techniquecan also be used for regression
problems. For example, when dealing with a regression task, the output variable can be the
meanofthe k instances, while for classification problems this is often the modeclass value.

Thedistance functions for assessing similarity between instances can be Euclidean, Manhattan,
or Minkowski distance. Euclidean distance, the most commonly usedone,is simply an
ordinary straight-line distance between twopoints. To be specific,it is the squareroot of the
sum of the squares ofthe differences between the coordinates of thepoints.

7. K-Means
K-meansis a type of unsupervised algorithm for dataclustering. It follows a simple procedure
to classify a given data set.It tries to find K numberofclustersor groupsin the dataset. Since
weare dealing with unsupervised learning,all we haveis our training data X and the number
ofclusters, K, that we wantto identify, but no labelled traininginstances (i.e., no data with
knownfinal output category that wecoulduse to train our model). For example, K-Means could
be used to segmentusersinto K groups based ontheir purchase history.

Thealgorithm iteratively assigns each data pointto one ofthe K groups basedontheir features.
Initially, it picks k points for each ofthe K-clusters, knownasthe centroid. A new data pointis
putinto thecluster having theclosest centroid based onfeaturesimilarity. As new elements are
added tothecluster, the cluster centroidis re-computed and keeps changing. The new centroid
becomesthe averagelocationofall the data points currently in thecluster. This processis
continuediteratively until the centroids stop changing. At the end, each centroidis a collection
of feature values that definethe resulting group.

41. Kinitial "means"(in this 2. k clusters are created by 3. The centroid of each of the 4. Steps 2 and 3 are repeated
‘case k=3) are randomly associating every observation k clusters becomes the new until convergence has been
generated within the data with the nearest mean. The mean, reached,
‘domain (shown in color). partitions here represent the
Voronoi diagram generated by
the means.

Image Credits: Wikipedia

Continuing with the purchasehistory example,the red cluster mightrepresentusers thatlike


to buy tech gadgets and the blue one mightbe usersinterested in buying sports equipment.
Nowthealgorithm will keep movingthe centroid for each user-segmentuntil it is able to create
K groups. Andit will do so by trying to maximizethe separation between groups andusers
outside ofthe group.

8. Random Forest
Random Forestis oneof the most popular and powerful machinelearning algorithms. It is a
type of ensemble algorithm. The underlying idea for ensemble learning the is wisdom of
crowds,theidea that the collective opinion of manyis morelikely to be accurate than that
of one. The outcomeofeach ofthe models is combined and a prediction is made.

In RandomForest, we have an ensembleofdecisiontrees, seen earlierin, algorithm 3. When


wewantto classify a new object, wetakethevoteof each decision tree and combine the
outcometo makea final decision; majority vote wins.

=
® tet x
Miia node 00) *

Be?
mA
ae a
dod wae
pty see

-— a
ERE ERED BEER
{a) In thetraining process, each decision tree is built based on a bootstrap sample ofthe training set. which contains two
kinds of examples(green labels and red labels). (b) In the classification process, decision fortheinput instanceis based on
the majority voting results among allindividualtrees. Image Source:Scientific Figure on ResearchGate,
https:Awww.researchgate.net/figure/llustration-of-random-forest-a-In-the-training-process-each-decision-tree-is-
built_fig3_317274960

9. Dimensionality Reduction
In thelast years, there has been an exponential increase in the amountof data captured. This
meansthat many machinelearning problems involve thousandsor even millions of features
for each training instance! This not only makestraining extremely slow but makesfinding a
good solution muchharder. This problem is often referred to as the curse of dimensionality.
In real-world problems,it is often possible to reduce the numberoffeatures considerably,
making problemstractable.

For example,in an imageclassification problem,if the pixels on the image borders are almost
always white, these pixels can completely be dropped from thetrainingset withoutlosing
much information.

In simple terms, dimensionality reduction is about assembling specific features into more high-
level ones withoutlosing the most importantinformation. Principal ComponentAnalysis (PCA)
is the most popular dimensionality reduction technique. Geometrically speaking, PCA reduces
the dimensionof a dataset by squashing it onto a lower-dimensional line, or more generally a
hyperplane/subspace, whichretains as muchofthe original data’s salient characteristics as
possible.
zeimeag

Feature 1

Say wehavea set of 2D points as shownin thefigure above. Each dimension correspondsto a
feature weareinterested in. Although thepoints seem to be scattered quite randomly, if we pay
close attention, we can see that wehavea linearpattern (blueline). As wesaid, the key pointin
PCAis Dimensionality Reduction, the process of reducing the numberof the dimensionsof the
given dataset; it does this by findingthe direction along which ourdata varies the most.

In the example above,it is possible to achieve dimensionality reduction by approximating all


the data points to a single line. The projection onto a line reduces the dimensionality of our
datasetfrom 2D to 1D.

Lastbutdefinitely not theleast,let’s look into Artificial Neural Networks, whichareat the very
coreof Deep Learning.

10. Artificial Neural Networks (ANN)


ANNareideal for tackling large and highly complex machine learning tasks, such as
recommending thebest videos to watch to hundreds ofmillions of users every day(e.g.,
YouTube), powering speech recognition services(e.g., Siri, Cortana) or learning to beat the
world championat the gameof Go (DeepMind’s AlphaGo).

ANNrequire a huge amountoftraining data, high computational powerandlong training


timebut, in the end,they are able make very accurate predictions.

The key idea behind ANNis to use the brain’s architecture for inspiration on howto build
intelligent machines.

A
ein star OURPUES
Mysinated axon
Inputs
Neuronwith signalflow from inputs at dendrites to outputs at axonterminals

To train a neural network,a set of neurons are mapped out andassigned a random weight
which determines howthe neurons process newdata, images, text, sounds,etc. The correct
relationship between inputs and outputsis learned from training the neural network on input
data. Since duringthetraining phase the system getsto see the correct answers,if the network
doesn’t accurately identify the input - doesn’t see a face in an image, for example — then the
system adjusts the weights. Eventually, after sufficient training, the neural networkwill
consistently recognize the correct patternsin speech,text or images.

Hidden

Input

Coy

Anartificial neural network is an interconnected groupof nodes,inspired by a simplification of neuronsin a brain. Here, each
circular node represents an artificial neuron and an arrow represents a connection from the output ofone artificial neuron to
theinput of another.

As a neural networkis a essentially a set of interconnected layers with weighted edges and
nodescalled neurons. Betweentheinput and outputlayers wecaninsert multiple hidden
layers. ANN makeuseof only two hidden layers. However,if we increase the depth ofthese
layers then we are dealing with the famous Deep Learning.

v i To=X ty Tr T;
(Label) (Feature/lmage) (Input Layer) (Hidden Layer 1) (Hidden Layer2). (Hidden Layer3)
m=?
(OutputLayer)
Cat \

ImageCredits: https:/www.ibm.com/blogs/research/2019/06/deep-neural-networks

Note:For a deep-dive into Neural Networks, I highly encourageyou to readthisarticle.

Final Thoughts
Now wehavea very good overview of the most commonly used machinelearningalgorithms.
Hope you enjoyed this walk-through!

With thebasics covered, we are nowina good placetolookatthe implementation of these


algorithms in Python. However, before we moveonto the hands-on part, we have a few more
theoreticaltopics to cover. So stay tuned!
Quiz: Machine Learning Algorithms

We'll coverthe following 0

« Time to Test Your Skills!

Time to Test Your Skills!

a Suppose you have the record of numberof rainy days in Octoberfor the last 20 years. Whatis
thebest modelto estimate the numberof rainy days for current October?

© A) LinearRegression

© B) Logistic Regression

. Question 1 of 10
Reset Quiz @
< O attempted >
Evaluating a Model

We'll coverthe following a

© Precision, Recall, and Confusion Matrix


© The Accuracy Trap
© Precision, Recall, and Confusion Matrix
* Worked Example
* AUC-ROC Curve
* ROC Curve Analysis: Example Case Study

Precision, Recall, and Confusion Matrix


Wehavelearned about various ML models, but how do weevaluate them? For regression, we
can use thedifference betweentheactual andthe predicted values — Root Mean SquareError,
RMSE,or ordinary least square method, to be more precise — but whataboutclassification
models?

Onemightthink thataccuracyis a good enough measureto evaluate the goodnessof a model.


Accuracy is a very important evaluation measure,butit might not be the best metricall the
time. Let’s understandthis with an example.

The Accuracy Trap

Say weare building a modelthat predictsif patients have a chronicillness. We knowthatonly


0.5% ofthe patients havethe disease, or are “Positive” cases. Now, a dummy model could
alwaysgive “Negative” as a default result andstill have a high accuracy (99.5%!) because our
datasetis skewed. Outofall the patients only 0.5% have thedisease, so by giving “Negative” as
a default answerfor 100% of the cases, the model is still able to getthe predictions rightin
99.5% of the cases — we have a modelwith a very high accuracy! Butis this of any good?
Absolutely not! Andthis is where someother performance measures comeinto play.

Precision, Recall, and Confusion Matrix

Before wetalk about these measures,let’s understand a few terms:

1. TP / True Positive: the case waspositive, andit was predicted as positive


2. TN / True Negative: the case was negative, and it was predicted as negative
3. FN False Negative: the case waspositive, butit was predicted as negative
4, FP / False Positive: the case was negative, butit was predicted aspositive

Since pictures help us to rememberthingsbetter:

ImageCredits: http:/Awww.info.univ-angers fr

Nowthat we knowthe meaningoffalsepositives, false negatives, true positives, and true


negatives, we can learn about the famous Confusion Matrix.

A confusion matrix has two rowsandtwo columnsthat report the numberoffalse positives,
false negatives, true positives, and true negatives. Basically, it is a summary table showing how,
good our modelis at predicting examplesofvarious classes.

For example,if we havea classification modelthathas been trainedto distinguish between cats
anddogs, a confusion matrix will summarize the results of testing the algorithm on new data.
Assuming a sample of 13 animals — 8 cats and 5 dogs — our confusion matrix wouldlook like
this:

Actualclass
cat Dog
2B, |oat/s |2
3 4
38
E° (dog 3 3

Based on this, we can obtain two important measures:

« Precision: Theratio of correctpositive predictions to the total predicted positives,the


positive predictive value

TP
Precision
Treciston = ———
TP + FP

« Recall: Ratio of correctpositive predictions to the total actual positives examplesin the
dataset, the sensitivity

TP
Recall = LEN

relevantelements
———
false negatives true negatives
ee e ° °

selected elements

How
itemsmany selected
are relevant? Howmany relevant
items are selected?

Precision = Recall = ———

Precision and recall. ImageCredits: Wikipedia

Puttingthis all together, which would bethe correct measureto answerthe following
questions?

1. What percentageof our predictions were correct?

* Accuracy

2. What percentageof the positive cases did weidentify?

« Recall

3. What percentageofpositive predictions were correct?

Precision

In our case of predicting if a person has a chronicillness, it would bebetter to have a high
Recall because wedo not wantto leave any untreated any patients whohavethe disease. It’s
better to have false alarmsrather than missing positive cases, so we might be okay with the
low precision buthigh recall trade-off.

Note: In case our dataset is not skewed,butrather a balanced representationof the twoclasses,
thenitis totally okay to use Accuracy as an evaluation measure:

TP+TN _ TP +TN
Accuracy = Sy = TPL TN PP PN
Worked Example

Before continuing on,let’s look at a completed example:

Suppose thefecal occult blood (FOB)screentest is used in 2030 peopleto look for bowel cancer:

Patients with bowel cancer


(as confirmed on endoscopy)
Condition positive
Test Positive predictive value
Fecal outcome True positive False positive =TP/(TP + FP)
occult (TP) =20 (FP) = 180 = 20/ (20 + 180)
Bees) positive = 10%

screen
test False negative True negative
(FN) = 10 (TN) = 1820

Sensitivity
=TP/(TP + FN)
= 20/ (20 + 10)
= 67%

Image Credits: Wikipedia

AUC-ROC Curve
AUC(Area Underthe Curve) - ROC (Receiver Operating Characteristics) curveis a performance
measurement for a classification model at various classification threshold settings. Basically, it
is a probability curve thattells us how well the modelis capableofdistinguishing between
classes. The higher the AUC valueofour probability curve,the better the model is at predicting
Os as Os and 1s as 1s.

Whatdo we meanby various thresholdsettings?


Say weset the thresholdto 0.9. This meansthat if for any given sample our trained model
predicts a value higher than 0.9, our outputclass will be predicted as positive class; otherwise,
it will be placed in thenegativeclass.

The ROC curveis plotted with True Positive Rate (Recall/Sensitivity) against the False Positive
Rate (FPR, 1 - Specificity) where TPRis on y-axis and FPRis on the x-axis, where:

* Sensitivity, Recall, Hit Rate, or True Positive Rate (TPR)

wpe TP TP
Sensitivity = TPR = - -TP aIN

« Fall-out,(1 - Specificity) or False Positive Rate (FPR)


FP FP
1 — Specificity = FPR = WV " PPaIN

A great model has AUCnearthe1 indicatingit has an excellent measureof separability. On the
other hand, a poor modelhas AUCnearto the 0 meaningitis predicting Os as 1s and 1s as 0s.
And When AUCis0.5, it means the model hasno class separation capacity whatsoeverandit’s
essentially making random predictions.

Let’s understandthis better via an example analysis taken from a medicalresearch journal:

ROC Curve Analysis: Example Case Study

We'veidentified a potential biomarker, Protein “A”, of Alzheimer’s disease thatis elevated in


Alzheimer’s patients comparedto healthy patients (Figure 1 in the image below). We now need
to identify a good threshold valueofthis protein in order to have a modelthatcan be used to
identify Alzheimer’s patients with a good performance.

« Ifthe thresholdis too low:a lot of healthy patients will be wrongly diagnosed
« Ifthe thresholdis too high: a lot of healthy patients will be wrongly diagnosed

AROC curvecanhelpus in identifying the sweetspot, a balance between TPR and FPR.

AROC curveis generated acrossall the threshold settings and the AUC (area under the curve)
value is determined(Figure 3 in the image below).

« Higher AUCvalues indicate a better biomarker.


« Forourfinal mode, we choosea point along the ROC curve(valueforthe threshold) so that
wehavean acceptableor optimal trade-off betweensensitivity and specificity.

Let’s say the black dashedlineis the ROC curvefor ourdata in this example. We could choose X
= 0.1 and Y= 0.8, so that our modelbased onthe given biomarker, Protein A, would have a
specificity of 90% anda sensitivity of 80% in identifying Alzheimer’s patients.

Condition
ProteinA Vise deease
eee Healthy
True
i ars a ,
3]: i| osdacase pe positive Er Yongonttionatert
z ied aor = Poor [Biomarker3)
E3 Healthy negative fe
Sensitivity -
Proteinsn concentration
nce iat rue postiveHssdsese [rue Specificity -
negate / Healthy 0 021 -Specitcty
04 0.6 0.8 1.0
Feetovnpeg hoganttopastins a
(Sectionsonanen igre 2 caastenctsty aspeateny fre. congnonct
‘hengeac Otane atee petbeats
rouerenepedcevette
‘Setewir
Scanspy caocoer hpal ‘hte ema
etere basaetbey
otpetpoCOS)
eed pea
‘ve lowsont bat hepa

ROC-Curve-Analysis. Image Credits: https://raybiotech.com/learning-centerfroc-curve-analysis/


Quiz: Evaluating a Model

We'll coverthe following 0

Timeto Test Your Skills!

Time to Test Your Skills!

Whatis therecall, specificity andprecisionof the confusion matrix below?

Predicted
Apples Oranges

Apples TP =10 FP =35

Actual

Oranges FN= 40 TN= 15

O AyRecall = 20%
Specificity = 30%
Precision = 22%

O By Recall = 30%
Specificity = 20%
Precision = 22%

Reset Quiz C
Key Points to Remember

We'll coverthe following a

© 1, It's Generalization That Counts


* 2. Data Alone Is Not Enough
* 3, Feature Engineering Is the Key
* 4, Learn Many Models, Not Just One
* 5, Correlation Does NotImply Causation

Machinelearning algorithms comewith the promise ofbeing ableto figure out how to perform
importanttasks by learning from data,i.e., generalizing from examples withoutbeing explicitly
told whatto do. This meansthat the higher the amount ofdata, the more ambitious problems
can hetackled by these algorithms. However,developing successful machine learning
applications requires quite some “black art” that is hardto find.

Let’s go through someofthe lessons learned by machine learning researchers and


practitioners (put together in a great research paper by Professor Pedro Domingos), so that we
can avoid someof the majorpitfalls.

|
|
|
i

1. It’s Generalization That Counts


The fundamental goalof machinelearningis to generalize beyond the examplesin thetraining
set. No matter how muchdata wehave,it is very unlikely that we will see those exact examples
again attest time. Doing well on thetraining set is easy. The most common mistake among
beginners is to test on the training data and havetheillusionofsuccess. If the chosen classifier
is then tested on newdata,it is often no better than random guessing. So, set some ofthe data
aside from the beginning, andonly useit to test your chosenclassifier at the very end, followed
by learning your finalclassifier on the whole data.

Ofcourse, holding out data reduces the amountavailable for training. This can be mitigated by
doing cross-validation: randomly dividing your training datainto subsets, holding out each
one while training on therest, testing each learnedclassifier on the unseen examples, and
averagingtheresults to see how well the particular parametersetting does.

2. Data Alone Is Not Enough


Whengeneralizationis the goal, we bumpinto another major consequence: data alone is not
enough, no matter how muchofit you have. Very general assumptions, like similar examples
havingsimilar classes, are a large reason why machinelearning has beenso successful.

Domain knowledge and an understanding of our data are crucial in making the right
assumptions. The need for knowledgein learning shouldnot be surprising. Machine learning
is not magic;it can’t get something from nothing.It doesis get more from less though.
Programming,like all engineering,is a lot of work: wehaveto build everything from scratch.
Learningis morelike farming, whichlets nature do most ofthe work. Farmers combine seeds
with nutrients to grow crops. Learners combine knowledge with data to grow programs.

Image Credits: Machine Learning for Biomedical Applications: From Crowdsourcing to Deep Learning;
http:/mediatum.ub.tum.de/doc/1368117/47614.pdt

3. Feature Engineering Is the Key


At the endof the day, some machinelearning projects succeed and somefail. What makes
the difference? Easily the most importantfactoris the features used. If we have many
independent features that correlate well with the class, learning is easy. On theother hand,if
the class is based on recipe that requires handlingthe ingredients in a complex way before
they can be used, things become harder. Feature engineering is basically about creating
newinputfeatures from your existing ones.

Very often the raw data does not even comeina form readyfor learning. But we can construct
features fromit that can beusedforlearning. In fact, this is typically where mostof the
effort in a machine learning project goes.It is often also one ofthe mostinteresting parts,
whereintuition, creativity and “black art” are as importantasthe technical stuff.

First-timers are often surprised by howlittle time in a machinelearning project is spent


actually doing machinelearning.Butit makes senseif you consider how time-consumingitis
to gatherdata, integrateit, clean it, and pre-processit, and how much trialanderror can go
into feature design. Also, machine learning is not a one-shotprocessof building a dataset and
running a learner, butrather an iterative process of runningthe learner, analyzing the
results, modifying the data and/orthe learner, and repeating. Learningis often the quickest
partofthis, but that’s because we’vealready masteredit pretty well! Feature engineering is
moredifficult because it’s domain-specific, while learners can be largely general-purpose.
Of course,oneofthe holygrails of machine learning is to automate more and moreofthe
feature engineering process.

ot
amo
house_info {
snum_ bedrooms: 3
Fanaa >ue0, Provessofeating
‘streel_name: "Shorebird Way”
vin ooms: I
sementr 9221
232" Is feature engineering,
Col,
|} 0,
Raw data doesntcome 1
tous as featurevectors,

Feature engineering maps raw data to MLfeatures. Image Credits: https://developers.google.com/machine-learning/crash-


course/representation/feature-engineering

4. Learn Many Models, Not Just One

In the early days of machinelearning, people tried manyvariations of different learners but
still only used the bestone. But then researchers noticed that, if insteadofselecting the best
variation found, we combine manyvariations, the results are better often much better and
with onlya little extra effort for the user. Creating such model ensembles is now very common:

« Inthe simplesttechnique, called bagging, weuse the samealgorithm buttrain it on


different subsetsoforiginal data. In the end wejust average answers or combine them by
somevoting mechanism.
« In boosting, learners are trained one by one sequentially. Each subsequent one paying
mostofits attention to data points that were mis-predicted by the previous one. And
continuing until we aresatisfied with the results.
« Instacking,the output of different independentclassifiers becomethe input of a new
classifier which givesthe final predictions.

In the Netflix prize, teams from all over the world competedto build the best video
recommendersystem.As the competition progressed, teams foundthat they obtained the best
results by combining their learners with other teams’, and mergedinto larger andlarger
teams. The winner and runner-up wereboth stacked ensembles ofover 100 learners and
combining the two ensemblesfurther improvedtheresults. Togetheris better!

5. Correlation Does Not Imply Causation

ANOTHER HUGE STUDY YOURE NOT...THERE ARE 50


FOUND No EVIDENCE. THAT MANY PROBLEMS WITH THAT.
CELL PHONES CAUSE CANCER. JUST To BE SAFE, UNTIL
WHAT WAS THE WHO. THINKING? T S€E MORE Daa TM

Pe
nc

RBs
GOING ToASSUME CANCER
2 HIN THe Us CAUSES CELL PHONES.

ImageCredits: https:/ixked.com

Wehaveall heardthat correlation doesnot imply causation butstill people frequently thinkit
does.

Often the goal of learning predictive models is to use them asguidesto action. If we find that
beer anddiapersare often boughttogetherat the supermarket, then perhapsputting beer next
to the diapersectionwill increase sales. But unless we do an actual experimentit’s difficult to
tell if this is true. Correlationis a sign of a potential causal connection, and we can useit as a
guideto further investigation andnotasour final conclusion.

Note: This lesson is an excerptfrom my blogpost onthis topic. You can readthefull
article here.
Machine Learning Project Checklist

Checklist

RA

You have beenhired as a new DataScientist and you have an exciting project to work on!
Howshould yougo aboutit?

In this lesson, weare going to go through a checklist and talk about somebestpractices that
you should consider adopting when working on an end-to-end MLproject.

A checklist to guide you through your machinelearning projects

1. Framethe problem andlookat the big picture:

Understandthe problem,both formally and informally


Figure outtheright questions to ask and howto frame them
Understandthe assumptions based on domain knowledge

. Getthe data: Do NOT forget aboutdata privacy and compliancehere, they are of
N

paramountimportance! Ask questions and engage with stakeholders,if needed.

. Explore the data and extractinsights:


wo

Summarizethe data: find the type ofvariables or map out the underlying data structure,
find correlations amongvariables, identify the most importantvariables, check for
missing values and mistakesin the data etc.
Visualize the data to take a broad lookat patterns, trends, anomalies, and outliers. Use
data summarization and data visualization techniques to understandthestory the data is
telling you.

4. Start simple: Begin with a very simplistic model, like linearorlogistic regression, with
minimal and prominentfeatures (directly observed and reportedfeatures). This will allow
youto gain a good familiarity with the problem at handandalsoset theright direction for
the nextsteps.

. More Feature Engineering: Preparethe data to extract the moreintricate data


a

patterns.Combine and modify existing features to create new features.

. Explore manydifferent models andshort-list the best ones based on comparative


a

evaluation, e.g., compare RMSE or ROC-AUCscoresfor different models.

. Fine-tune the parameters of your models and consider combining them for the best
x

results.

. Presentyoursolution to the stakeholdersin a simple, engaging, and visually appealing


00

manner. Usethe art ofstory-telling, it works wonders!

Rememberto tailor your presentation based on the technical levelof your target
audience. For example, when presenting to non-technical stakeholders, rememberto
convey key insights without using heavy technical jargon. Theyare likely not going to be
interested in hearing about all the cool ML techniques you adopted,but rather on end
results and keyinsights.

. If the scope of yourproject is more than just extracting and presentinginsights from data,
00

proceed with launching, monitoring, and maintaining your system.

Of course,this checklist is just a referencefor getting started. Once youstart working on real
projects, adapt, improvise, andat the endof each projectreflect on the takeaways, learning
from mistakesis essential!

Note: Weare going to go into thetechnical details ofall these steps, and their sub-steps, in
the “Project Lessons”; the purposeofthis checklist is to serve as a very high-level guideline
orbest practices reminder.

With this, wearefinally ready to moveon to ourproject andget our handsdirty! @@


Introduction

We'll coverthe following a

* A Dataset and a Machine Learning Problem, What Should You Do?


* Overview ofthe Main Steps:
© 1. Exploratory Data Analysis
» 2. Prepare the data for machinelearning algorithms
© 3. Transformation Pipelinesin Scikit-Learn
* 4, Assess Machine LearningAlgorithms
* 5, Fine-Tune Your Model
© 6, Presentthe Solution
* 7. Launch, Monitor, and Maintain the System
© Important Preliminary Steps

A Dataset and a Machine Learning Problem,


What Should You Do?
Say you havebeenrecentlyhired asa Data Scientist to work on a project and you have been
given somerealestate data. How can you approach the problem in a systematic and
structured wayrather than ending up with a spaghetti code? Whatarethesteps to follow?

In this section, we are going to deconstructthe main step needed to work on a MLproject via a
real end-to-end example with code. Weare going to work with a challenge based on a Kaggle
Competition.

Overview of the Main Steps:


Thereisn’t a golden approachthatevery data scientist must adopt that would workfor every
single project. However, there are some good practices and steps recommended bythesages of
thefield that one should keepin mind.Forthis project, we are going to adopt an approach that
is an adaptationof advice collected from various books andarticles on the subject (details in
“Study Material”), and from personal experience.

Of course,this is to serve as a reference skeleton to guide you through yourprojects; you


should add ordelete steps based onthe specific needs of your project.

G8). ©
Get Data Train Model Improve

oo" Clean, Prepare


& Manipulate Data
Test Data

Hereare the main steps wearegoing to go through:

1. Exploratory Data Analysis


* Understand thedata structure
« Discover and visualize the data to gain insights
o Explore numerical attributes
© Lookfor correlations among numerical attributes
o Explorecategoricalattributes

2. Prepare the data for machine learning algorithms


« Deal with missing values
* Handle outliers
« Deal with correlated attributes
« Handletext andcategorical attributes
« Feature scaling

3. Transformation Pipelines in Scikit-Learn

4. Assess Machine Learning Algorithms


« Train and evaluate multiple models onthetraining set
* Comparative analysis of the models andtheir errors
« Evaluation Using Cross-Validation

5. Fine-Tune Your Model

6. Present the Solution

7. Launch, Monitor, and Maintain the System

Important Preliminary Steps


Before jumpinginto coding andplaying with the data, if you have been given a real problem
to workon, yourfirst question should be to ask your manager/the stakeholders/the owner of
the project what exactly the businessobjectiveis. Just building a machine learning modelis
not the endgoal, they arelikely interestedin thebenefits from the solution,so get an
understanding of how the companyexpectsto use your work andbenefit from it.

Whyis this an importantstep for a real project?


Because having a clear understanding ofthe business objective will determine:

« how you framethe problem


« whatalgorithmsyou will select
« whatperformance measureyouwill use to evaluate your model
« the amount of effort you should spend tweaking yourfinal model
« how youpresentyoursolution

Whatdoes framing the problem mean?


It meansthat first you need to understandif you are dealing with supervised, unsupervised, or
Reinforcement Learning? Haveyou beengivena classification task, a regression task, or
somethingelse?

Now withoutfurther ado,let’s moveonto the nextlesson and learn by practicing!


Kaggle Challenge - Exploratory Data Analysis

We'll coverthe following a

* 1. Exploratory Data Analysis


Understand the Data Structure

Explore Numerical Attributes

Correlations Among Numerical Attributes

Explore Categorical Attributes


Jupyter Notebook

Our project is based on the Kaggle Housing Prices Competition. In this challenge we are given a
dataset with different attributes for houses and their prices. Ourgoalis to develop a modelthat
can predict the prices of houses based on this data.

Atthis point we already know twothings:

1. We are givenlabeled training examples


2. We are asked to predict a value

What do these tell us in terms of framing our problem? The first point tells us that this is clearly
a typical supervised learningtask, while the second one tells us that this is a typical
regression task.

If we look at thefile with data description, data_description.txt, we can see the kind of
attributes we are expected to have for the houses we are workingwith. Here is a sneak peek
into someof the interesting attributes andtheir description from that file:

« SalePrice - the property’s sale price in dollars. Thisis the target variable that we are
trying to predict.
« MSSubClass: The buildingclass.
« LotFrontage: Linearfeet of street connected to property.
« LotArea: Lot size in square feet. Street: Type of road access.
« Alley: Type of alley access.
« LotShape: General shape of property.
« LandContour: Flatness of the property.
« LotConfig: Lot configuration.
« LandSlope:Slope of property. Neighborhood: Physical locations within Ames city limits.
* Condition1: Proximity to main road or railroad.
« HouseStyle: Style of dwelling.
« OverallQual: Overall material and finish quality.
* OverallCond: Overall condition rating.
« YearBuilt: Original construction date.

#® Note: Before moving further, download thedataset, train.csv, from here. Launch your
Jupyter notebookand thenfollow along! It is important to get your handsdirty; don’t
just read throughthese lessons!

# You can also find the Juptyter notebookwith all the codeforthis project on my Git
profile, here.

# Youcan seethe live execution ofcode in the Jupyter Notebookat the endof the
lesson and can alsoplay with it.

1. Exploratory Data Analysis


Importing modules and getting the data

Let’s start by importing the modules and getting the data. In the codesnippet below,it is
assumed that you have downloaded the csv file and saved it in the working directory as
‘/data/train.csv’.

# Core Modules
@Vounune

import pandas es pd
import numpy as np
# Basic modules for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
Ymatplotlib inline

1 # Load data into a pandas DataFrame from given filepath


2 housing = pd.read_csv("./data/train.csv')

Understand the Data Structure


Wenow haveour DataFrameinplace, so let’s get familiar with it by looking at the columnsit
contains:

1 # Get column names of the df


2 housing.columns

In [3]: M_ housing.columns
Out[3]: Index(["Id", 'MSSubClass’, ‘MSZoning’, ‘LotFrontage’, ‘LotArea’, ‘Street’,
‘Alley’, ‘LotShape', ‘LandContour', ‘Utilities’, ‘LotConfig’,
*LandSlope’, ‘Neighborhood’, ‘Condition1', ‘Condition2", ‘Bldgtype",
‘HouseStyle', ‘OverallQual', ‘OverallCond', ‘YearBuilt’, ‘YearRenodAdd’ ,
*RoofStyle’, ‘RoofMatl', ‘Exteriorist’, ‘Exterior2nd’, ‘MasVnrType’,
“MasVnrArea’, ‘ExterQual’, 'ExterCond’, ‘Foundation’, ‘BsmtQual’,
“BsmtCond", 'BsmtExposure’, ‘BsmtFinTypel’, ‘BsmtFinSF1',
“BsmtFinType2", 'BsmtFinSF2", "BsmtUnfSF', ‘TotalBsmtSF', ‘Heating’,
"Weatinggc’, ‘CentralAir’, ‘Electrical’, ‘1stFirsF’, ‘2ndFirsF’,
“LonQualFinSF’, ‘GrLivarea’, ‘BsmtFullBath’, ‘BsmtHalfBath’, ‘Fullath’,
‘HalfBath’, 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual
“TotRnsAbvGrd', ‘Functional’, ‘Fireplaces’, ‘Fireplacegu’, ‘GarageType’,
*GarageYrBlt', ‘GarageFinish’, ‘GarageCars’, ‘Garagedrea’, ‘GarageQual’,
“GarageCond’, ‘PavedOrive’, ‘WoodDeckSF’, ‘OpenPorchSF*
*EnclosedPorch', ‘3SsnPorch', ‘ScreenPorch’, ‘PoolArea’, ‘PoolQc’,
“Fence’, 'NiscFeature’, 'MiscVal', ‘MoSold’, ‘YrSold’, ‘SaleType’,
*saleCondition’, ‘SalePrice’],
dtype=‘ object")

Howmanyattributes do wehavein total? Of course, we are not going to count them ourselves
from the results above! Let’s get the number of rows and columnsin our DataFrame by calling
shape.

1 # Get the Shape of data


2 housing. shape

In [4]: WM # Get the shape of data


housing. shape
out[4]: (1460, 81)

There are only 1460 training examples in the dataset, which meansthat it is small by machine
learning standards. The shape of the dataset also tells is that we have81 attributes. Of the 81
attributes, one is the Id for the houses — not useful as a feature - and oneis the target variable,
SalePrice, that the model should predict. This means that we have 79 attributes that have the
potentialto be used to train ourpredictive model.

Now let’stake a look at the top five rows using the DataFrame’s head() method.

1 # Get the top 5 rows


2 housing.head()

In [5]: W_housing-head()
outs}:
|d_MSSubCiass MSZoning Lotfrontage LotArea Sweet Alley LotShape LandContour Utilities. PoolArea PoolC Fence MiscFeature MiscVal
on o Re 50-650 Pave NON Reg AP NaN NaN MeN
12 2 RL 2009600 Pave NaN Reg ta ANP NaN NaN nen oo
2a © RL 20 11280 Pave NN IRI La AIP NaN NaN Nan 0
a4 70 RL 600 9550 Pave NaN IRI La ANP NaN NaN NN 0
as © RL 24014200 Pave NN IR La AIP NaN NaN NaN 0
5 rows * 81 columns

Each row represents one house. Wecansee that we have both numerical(e.g., LotFrontage)
and categoricalattributes (e.g., LotShape). Wealso notice that we have many missing values
(NaN) as not all the houses have values set for all the attributes.

Wehavea columncalled Jd whichis not usefulas an attribute. We can either omit it or use it as
an indexfor our DataFrame. Weare going to drop that column becauseindexes of houses are
not relevant for this problem anyway.

The info() method is usefulto get a quick descriptionof the data, in particularthe total
number of rows, each attribute’s type and number of non-null values. So let’s drop the Id
column andcall the info() method:

1 housing housing.drop("Id", axis=1)


2 housing.info()

‘ee san-nl sbjece

The info() method tells us that we have 37 numericalattributes, 3 float64 and 34 int64, and 43
categorical columns. Notice that we have manyattributes that are not set for most of the
houses. For example, the Alley attribute has only 91 non-null values, meaning that all other
houses are missingthis feature. We will need to take care of thislater.

Here wehave a mix of numerical andcategorical attributes. Let’s look at these separately and
also use the describe() method to get their statistical summary.

Note that wecan distinguish between numericaland categoricalattributes byfiltering based


on their dtypes.

Numerical Attributes

1 # List of numerical attributes


2 housing. select_dtypes(exclude=[ ‘object’ ]).columns

In [7]: WM # List of numerical attributes


housing. select_dtypes(exclude=[ "object*]).columns
Out[7]: Index(['MSSubClass’, “LotFrontage’, ‘LotArea’, ‘OverallQual’, ‘OverallCond",
*YearBuilt’, "YearRemodAdd", ‘MasVnrArea’, “BsmtFinSF1", ‘BsmtFinSF2',
“BsmtUnfSF", ‘TotalBsmtSF*, ‘1stFIrsF’, ‘2ndFirsF*, ‘LowQualFinsF’,
‘GrLivarea’, ‘BsmtFullBath’, ‘BsmtHalfBath’, ‘FullBath", ‘HalfBath’,
“BedroomAbvGr’, ‘KitchenAbvGr', ‘TotRmsAbvGrd", ‘Fireplaces’,
“GarageYrBlt’, "GarageCars", ‘GarageArea’, "WoodDeckSF’, ‘OpenPorchsF",
*EnclosedPorch’, "3SsnPorch", ‘ScreenPorch’, "PoolArea’, ‘MiscVal’,
‘MoSold", ‘YrSold", *SalePrice’],
dtype="object’)

1 # Get the data summary with upto 2 decimals and call transpose() for a better view of the results
2 housing.select_dtypes(exclude=[ ‘object’ ]).describe() .round(decimals=2).transpose()

In [9]: MW housing. select_dtypes(exclude=[ ‘object’ ]).describe().round(decimals=2).transpose()


out [9]:
count mean std min 25% 50% 75% max
MsSubClass 1460.0 5690 4230 200 20.00 50.0 70.00 1900
LotFrontage 1201.0 70.05 2428 21.0 59.00 690 80.00 313.0
LotArea 1460.0 10516.83 9981.26 1300.0 7553.60 9478.5 1601.50 2152450
Overaliqual 1460.0 6.10 1.38 1.0 5.00 60 7.00 100
Overalicond 1460.0 5.58 11 1.0 5.00 50 6.00 90
YearBuilt 1460.0 1971.27 30.20 1872.0 1954.00 1973.0 2000.00 2010.0
YearRemodAdd 1460.0 1984.87 20.65 1950.0 1967.00 1994.0 2004.00 2010.0

The count, mean, min, and max columnsare self-explanatory. Note that the null values are
ignored; for example, the count of LotFrontageis 1201, not 1460. The std column showsthe
standard deviation which measures howdispersed the values are.

The 25%, 50%, and 75% columnsshow the corresponding percentiles: a percentile indicates the
value belowwhicha given percentage of observations in a groupofobservationsfalls. For
example, 25%of the houses have YearBuilt lower than 1954, while 50%are lower than 1973
and 75%are lower than 2000.

Recall from thestatistics lessons that the 25th percentile is also knownas the 1st quartile, the
50thpercentile is the median, and the 75th percentile is also knownas the 3rd quartile.

Categorical Attributes

# Get the categorical attributes


wewne

housing. select_dtypes(include=['‘ object’ ]).columns


#Get the sumamry of categorical attributes
housing. select_dtypes(include=[ ‘object ']).describe().transpose()

In [10]: M_ housing. select_dtypes( include=[object"]).coluns


Out[1@]: Index({'MsZoning’, ‘Street’, ‘Alley', ‘LotShape’, ‘Landcontour’, ‘Utilities’,
“Lotconfig', ‘LandSlope’, ‘Neighborhood", ‘Condition’, ‘Condition2",
“BldgType’, ‘HouseStyle’, 'RoofStyle’, ‘RoofMatl', ‘Exteriorist’,
‘exterior2nd’, 'NasVnrType', "ExterQual’, 'ExterCond", ‘Foundation’,
BsmtQual’, "BsmtCond", ‘BsmtExposure’, 'BsmtFinTypel', ‘BsmtFinType2",
‘Heating’, ‘Heating@c’, ‘CentralAir', ‘Electrical’, ‘KitchenQual
‘Functional’, "Fireplacegu', ‘GarageType', ‘GarageFinish', ‘Garagegual’,
‘“GarageCond", "PavedDrive’, ‘Poolgc', ‘Fence’, ‘MiscFeature’,
‘SaleType’, ‘SaleCondition’},
dtype= object")
There are 43 categorical columns with the following characteristics:
In [11]: M_ housing. select_dtypes(includes[ ‘object’ ]).describe().transpose()
out(aa):
count unique top freq
MSzoning 14005 RL 1161
Street 14602 Pave 1454
Alley 912 G8
Lotshape 1460 4 Rog 925
Landcontour 14604 \w isit
uuiites 1460-2 APU 1489
Lotcontig 1460 5 Inside 1062
Landsiope 14003 ou 1982
Neighborhood 1460 25--NAmes 225,
Condition! 1460 9 Norm 1260
Condition2 1460 8 Nom 1445
Bidgtype 14605 tam 1220
Housestyle 1460 8 1Story 726
RoofStyle 1460 6 ~—Gable. 1141,
RootMatl 1460 8 CompShg 1434

Note that for categorical attributes wedo not get a statistical summary. But we can get some
important information like number of uniquevalues and top values for eachattribute. For
example, wecansee that wecan have 8 types of HouseStyle, with 1Story houses being the most
frequent type.

Explore Numerical Attributes


Lookingat data distributions

Let’s have a detailed look at our target variable, SalePrice.

1 # Descriptive statistics summary


2 housing[ 'SalePrice’ ].describe()

In [12]: M_ # Statistics summary


housing[ "SalePrice’].describe()
out[12]: count 1460. 000000
mean ——-'180921.195890
std 79442.502883
min 34900. 000000
25% 129975000000
50% 163000, 000000
73% 214000.000000
max 755000. 000000
Name: SalePrice, dtype: floates

1 # Get the distribution plot


2 sns.distplot(housing[ ‘SalePrice’]);

In [13]: DM. sns.distplot(housing[ "SalePrice’]);

0.000008,
0.000007
0.000006
0.000008,
0.000004
0.000003,
0.000002
0.000001
0.000000
‘© 100000 200000300000.400000 500000600000 700000800000
Salerce

The distribution plot tells us that we have a skewed variable. In fact from the statistical
summary, we already saw that the meanprice is about 181K while 50%of the houses weresold
for less than 163K.

When dealing with skewed variables, it is a good practice to reduce the skew ofthe dataset
becauseit can impact the accuracyof the model. This is an important step if weare going to use
linear regression modeling; other algorithms,like tree-based RandomForests can handle
skewed data. Wewill understand this in detail later under “Feature Scaling”. For now,let’s look
at the updated distribution ofour target variable once we applya log transformation toit.
Applying a log transformation meansto simply take the log of the skewed variable to improve
the fit by altering the scale and makingthe variable more normally distributed.

# Take the log to make the distribution more normal


wewne

sns.distplot(np.log(housing[ "SalePrice’]))
plt.title( Distribution of Log-transformed SalePrice’)
plt.xlabel(‘log(SalePrice)")
plt.show()

In [14]: sns.distplot(np.og(housing[ ‘SalePrice’ ]))


plt.title( "Distribution of Log-transformed SalePrice*)
plt.xlabel(*log(SalePrice)")
plt.show()
Distribution of Log-transformed SalePrice
2
10
os
06
os
02
00
wo WS no US wo WS BO BS
log(SalePrice)

Wecan clearly see that the log-transformed variable is more normally distributed and we have
managed to reduce the skew.

What about all the other numerical variables? What dotheir distributions look like? We can
plot the distributionsof all the numerical variables by calling the distplot() method ina for
loop, like so:

## What about the distribution of all the other numerical variables?


wavanaune

num_attributes = housing.select_dtypes(exclude='object’).drop(['SalePrice’], axi:


# Print num of variables to make sure we didn't mess up in the last step
print(len(num_attributes.columns))
fig = plt.figure(Figsize=(12,18))
for i in range(len(num_attributes.columns)):
fig.add_subplot(9,4,i+1)
18 sns.distplot(num_attributes. iloc[:,i].dropna(), hist = False, rug = True)
a plt.xlabel(num_attributes.columns[i])
12
13. plt.tight_leyout()
14 plt.show()
In [15]: M/## what about the distribution of all the other numerical vartables?
rnum_attributes = housing. select_dtypes(exclude="object") .drop({'SalePrice'], axis=1) .copy()
print(len(num_attributes.columns))
fig = plt.Figure(Figsize=(12,18))
for i in range(len(nun_attributes.colums)):
fig.add_subplot(9,4,1+1)
sns.distplot(num_attributes.iloc[:,4].dropna(), hist = False, rug = True)
plt.xlabel (num_attributes.colums{i])
plt.tight_Layout()
plt.show()
36
01s. on os
00010
010 con 02
0.005, 200008 a
000 | obo m0 000 ow m0 m0 200000 ‘100000 200000 oo % 5 ®
ssubciass Lotrrentage toearea Overattual
bo ons 003 00075
e010 002 20050
os 005 00 0025
oo 6 = 0 2000 1900-2000 000 1980 1975. 2000 2005 00000 ‘© s00 1000 1500
Overton ‘eorsust ‘eorRemoakod Masvovarea
0.0006 0010
0010 ° 20010
0005 0002 00005 2.0005
0000 ee) 2.0000 © $00 1000 3500 00000 1000 2000, 00000 0 m0 00 00
Samtrinsf Bamernsr2 Bameunst Toealbsmes
2002 000075
ooo cont |

Notice how varying the distributions andscales for the different variablesare,this is the
reason weneedto do feature scaling before wecan usethese features for modeling. For
example, we can clearly see how skewedLotAreais. It is in dire need of somepolishing before
it can be used for learning.

EB Wewill get back to all the needed transformations and “applyingthefixes” later. In this
exploratory analysis steps, we are just taking notes on whatweneedto take care ofin order to
create a goodpredictive model.

Looking for Outliers

In thestatistics lesson, we learnedthat boxplots give us a good overviewofour data. From the
distribution of observationsin relation to the upper and lower quartiles, we can spotoutliers.
Let’s see this in action with the boxplot() method anda for looptoplotall the attributes in one
go:

fig = plt.figure(Figsize=(10, 15))


@Vounune

for i in range(len(num_attributes.columns)):
ig.add_subplot(9, 4, i+1)
sns.boxplot(y=num_attributes.iloc[:,i])
plt.tight_layout()
plt.show()

In [16]: WM fig = plt.figure(figsize-(10, 15))


for i in range(len(num_attributes.colums)):
fig.add_subplot(9, 4, i+1)
‘sns.boxplot(y=numattributes. iloc{:,4])
plt.tight_layout()
plt.show()
EAE:
weaed

Pi
overatQual
otares

i”
0

ww
we :
vos oop
e.ebid

5foo
go
j
§ 1000 i~
f= 5

B From the boxplots wecan see thatfor instance LotFrontage values above 200 and LotArea
above 150000 can be marked asoutliers. However,insteadof relying on our own “visual sense”
to spotpatterns and definethe range for outliers, when doingdata cleaning, wewill use the
knowledgeofpercentiles to be more accurate. For now our takeawayfromthisanalysis is that
weneedto take care ofoutliers in the data cleaning phase.

Just-for-fun plot

Ourbrains are very goodat spotting patterns on pictures, but sometimes we needto play
aroundwith visualization parametersandtry out different kindof plots to make those patterns
standout. Let’s create a fun exampleplot for learning to play with visualizations, especially
whenwewantto analyze relations among multiple variables at once.

Weare going to look attheprices. The radius ofeachcircle represents GrLivArea(option s),
and the colorrepresents theprice (option c). We will use a predefined color map (option cmap)
called jet, which rangesfrom blue (low values)to red(high prices).

1 housing.plot(kind="scatter", x="Overallqual”, y="YearBuilt”, s-housing["GrLivarea"], label="6riiva


2 alpha=0.3, figsize=(10,7), c="SalePrice", cmap=plt.get_cmap("jet"), colorbar=True)
3
4 pit. legend()

In [73]: M "Yearsuilt", sshousing["GrLivarea"], label="GrLivarea”, alpha:0.3,


figsize=(10,7), c="SalePrice”, cmap=plt.get_cmap("jet"), colorbar=True
)
plt-legend()
‘ut[73]: <matplotlib.legend.Legend at @x2632Fc95630>

Theplotabovetells us that the housingprices are very much related to the YearBuilt (y-axis)
and OverallQual (x-axis). Newerand higher quality houses mean more expensiveprices.This is
shownincreasing red going towards upper-right endofthe plotandvice versa.Prices are also
related to the GrLivArea,radiusofthecircle.

Correlations Among Numerical Attributes


Correlationtells us the strength of the relationship betweenpairsofattributes.In an ideal
situation, we would havean independentsetoffeatures/attributes, butreal data is not ideal. It
is useful to know whethersomepairsofattributes are correlated and by how much becauseit
is a goodpractice to removehighly correlated features.

Wecan use the corr() methodtoeasily getthe correlations and then visualize them using the
heatmap() method - Python does feel like magic often, isn’t it?!

The corr() methodreturns pairsofall attributes and their correlation coefficients in range[-1;
1], where1 indicatespositive correlation, -1 negative correlation and 0 meansnorelationship
betweenvariablesatall.

# Correlation of numerical attributes


wavanaune

corr = housing.corr()
# Using mask to get triangular correlation matrix
¥, ax = plt.subplots(figsiz
mask = np.zeros_like(corr, dtyp.
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax,

In [74]: M|# Correlation of numerical attributes


core = housing.core()
# Using mask to get triangular correLation matrix
, ax plt.subplots(figsize=(12, 1€))
mask = np.zeros_like(corr, dtype=np.bool)
mmask[np.triu_indices_from(mask)] = True
sns-heatmap(corr, mask-mask, cmap-sns.divergingpalette(220, 10, as_cmap-True), square-True, ax-ax, vmin = -1.0, vmax = 1.0,
Out{74]: cmatplotlib.axes. subplots. AxesSubplot at @x2632fbeaf28>

a0

From the heatmap,wecaneasily see that we have somevariables that are highly correlated
with price (darker red) andthat therearevariables highly correlated among themselves as
well. The heatmapis useful for a first high-level overview.Let’s get a sorted list of correlations
among all theattributes andthetargetvariable, SalePrice, for a deeper understanding of what’s
going on.

In [21]: M_ corr['SalePrice’].sort_values (ascending:


Out[21]: SalePrice 1.000000
Overallqual 8.790982
GrLivarea 0.708624
GarageCars 8.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
AstF1rsF 8.605852
FullBath 0.560664
TotRmsAbvGrd 8.533723
YearBuilt 0.522897
YearRemodAdd 0.507101
GarageYrB1t 0.486362
MasVnrArea 0.477493
Fireplaces 8.466929
BsmtFinSF1 8.386420
LotFrontage 8.351799
WoodDecksF 0.324413
2ndFirsF 8.319334
OpenPorchSF 8.315856
HalfBath 8.284108
LotArea 0.263843
BsmtFullBath 0.227122
BsmtUnfSF 8.214479
BedroomAbvGr 0.168213
ScreenPorch 0.111447
PoolArea 8.092404
MoSold 0.046432
3SsnPorch 0.044584
BsmtFinsF2 -0.011378
BsmtHalfBath -0.016844
MiscVal -0.021190
LowQualFinsF -@.025606
Yrsold -@.028923
Overallcond -0.077856
MsSubClass -0.084284
EnclosedPorch -@.128578
KitchenAbvGr —_-@.135907
Name: SalePrice, dtype: floated

From thesevalues, we can see that OverallQual and GrLivAreahave the most impacton price,
while attributes like PoolArea and MoSold are not related toit.

Pair-wisescatter matrix

Wehavea lot of uniquepairsof variables i.e. N(N - 1)/2. Joint distribution can be used to look
for a relationship betweenall of the possible pairs, two ata time.

Forthe sake of completeness, we might wantto display a rough jointdistribution plot for each
pair ofvariables. This can be doneby using pairplot() from sns. Since we havea fairly big N,
so weare going to create scatter plots for only someofthe interesting attributesto get a visual
feel for these correlations.

1 col = [‘SalePrice’, ‘Overallqual’, ‘GrLivarea’, ‘VearBuilt’]


2. sns.pairplot(housing[col])

In [22]: M col = ['SalePrice’, ‘Overallqual', ‘GrlivArea', 'YearBuilt']


sns.pairplot(housing[col])
Out[22]: <seaborn.axisgrid.PairGrid at @x2632f535e48>
so0000 . « .
éo0000
§ 00000
a
2o0000
°
w] © memes e com os . oe
| emensee 6 somone 1. we omen
6| sane Somowe © O° Secemmeneen
seme sneave sememeennes so
ao . rts
Yearauit

1875
w 1900 1950 2000
SalePrice OveraliQusl YearBuilt

From thepairplots, we can clearly see how with an increase in GrLivAreatheprice increases as
well. Play aroundwith otherattributes as well.

In orderto train our “creatingplots muscle”, let’s look at other typesofplots that can make the
relationship for the highest correlated variables, OverallQual, with the targetvariable,
SalePrice, really standout.

Let’s see what the barplot() and boxplot() methodsgive us.

1 sns.barplot(housing.Overallqual, housing.SalePrice)

In [25]: M_ sns.barplot(housing.OverallQual, housing.SalePrice)


Out[25]: <matplotlib.axes._subplots .AxesSubplot at 0x26330a26048>

1 # Boxplot
2 pit. figure(figsize=(18, 8))
3 sns.boxplot(x-housing.Overallqual, y-housing.SalePrice)

In [26]: W_ #boxplot
plt.figure(figsize=(18, 8))
sns.boxplot(xshousing.Overallqual, yshousing-SalePrice)
Out(26]: <matplotlib.axes._subplots.AKesSubplot at 0x26330d37518>

- eettT
Wecansee that we have manyhighlycorrelatedattributes andthese results confirm our
commonsenseanalysis.

B Wecan take somenotes hereforthe feature selection phase whereweare going to drop the
highly correlated variables. For example, GarageCars and GarageAreaare highly correlated but
since GarageCars hasa higher correlation with thetarget variable, SalePrice, weare going to
keep GarageCars and drop GarageArea. Wewill also droptheattributes that have almost no
correlation with price, like MoSold, 3SsnPorch and BsmtFinSF2.

Explore Categorical Attributes


Let’s print again the namesofthe categorical columnsagain and then handpick someofthe
interesting ones for visual analysis.

1 cat_columns = housing. select_dtypes(include=" object’ ).columns


2 print(cat_columns)

In [27]: WM cat_columns = housing. select_dtypes(include='object').colunns.


print(cat_columns)
Index(['MSZoning’, ‘Street", ‘Alley’, ‘LotShape’, ‘LandContour', ‘Utilities’,
‘LotConfig', ‘LandSlope’, ‘Neighborhood’, ‘Condition’, ‘Condition2',
“Bldgtype', 'HouseStyle", 'RoofStyle', ‘RoofMatl', ‘Exteriorist’,
‘exteriorand’, ‘MasVarType', ‘ExterQual’, ‘ExterCond’, ‘Foundation’,
“BsmtQual', “BswtCond", "BswtExposure’, ‘BsmtFinTypel', ‘BsmtFinType2",
‘Heating’, ‘HeatingQc', ‘CentralAir', ‘Electrical’, ‘KitchenQual',
‘Functional’, 'FireplaceQu', ‘GarageType', ‘GarageFinish', ‘GarageQual’,
“GarageCond", ‘Paveddrive', ‘PoolQc’, ‘Fence’, ‘MiscFeature’,
‘SaleType", ‘SaleCondition'],
dtypes‘object")

Say we wantto lookatthe impactof KitchQual on price:

var = housing[ ‘kitchenQual"]


RUN

¥, ax = plt.subplots(figsize=(10,6))
sns.boxplot(y=housing.SalePrice, x=var)
plt.show()

In [28]: WM. var = housing[ 'kitchenQual']


f, ax = plt.subplots(Figsize=(1¢,6))
‘sns.boxplot (yshousing-SalePrice, x=var)
plt-show()

'
700000
20000 ‘ :
00000
& «00000
3 +
200000
200000
100000
°
@ a = 7
KachenQual

Wecan nowsee that Ex seemsto be the more expensive option while Fa brings the prices
down.

Whataboutthestyle of the houses? Which styles do we have and howdothey impact prices?

1 f, ax = plt.subplots(figsize=(12,8))
2 sns.boxplot(y=housing.SalePrice, x-housing.HouseStyle)
3. pit.xticks(rotation=40)
4 plt.show()

In [31]: M| f, ax = plt.subplots(Figsize:
ssns..boxplot(y=housing.SalePric ousing.HouseStyle)
plt-xticks(rotation=48)
plt.show()

'

0000 . °

wm} .
in | :
300000 ’ . —
' ’

~_ =. ™
—_—" L
:
“yf * # > SF #

Wecan see that 2Story houses havethe highestvariability in prices and they also tend to be
more expensive, while 1.5Unf arethe cheapestoption.

Say we wantto get the frequency for eachofthese types, we can use the countplot() method
from sns like so:

1 # Count of categories within HouseStyle attribute


2 fig = plt.figure(Figsize=(12, 4))
3. sns.countplot(x="HouseStyle’, datashousing)
4 plt.xticks(rotation=98)
5 plt.ylabel(‘Frequency’)
6 plt.show()

In [32]: M_ ## Count of categories within Housestyle attribute


fig = plt. figure(figsize=(12.5,4))
sns.countplot(x='HouseStyle', datashousing)
plt.xticks (rotation=90)
plt.ylabel( ‘Frequency’ )
plt. show()

F é4 54 é8 3“ ; :
a 5 vous a a

Now weknowthat most of the housesare 1Story type houses. Say we do not wanta frequency
distribution plot, but only the exact countfor each category, we canget that easily from the
DataFramedirectly:

1 housing["HouseStyle"].value_counts()

In [33]: Ml housing["Housestyle" ].value_counts()


out[33}: astory 726
2story 445,
1.SFin 154
stvi 65
SFoyer 37,
a.sUnf 14
2.Unf AL
2.5Fin 8
Name: HouseStyle, dtype: intéd

Wearealso curiousto see if the style of the houses has changed overthe years,solet’s plot the
two variables against each other.

1 pit.scatter(housing['VearBuilt’ ], housing[ ‘HouseStyle’ ])

In [30]: DM plt-scatter(housing[ ‘YearBuilt" ],housing[ ‘HouseStyle"])


Out[30]: <matplotlib.collections.PathCollection at @x2632F1714e0>
asref* ©
2sunf oe ome ow
su
Soyer
150
sre

188015001570 1840 1960 1980 2000

Now weknowthat2Story and 1Story havebeen therefor ages andthey continueto be built
while SFoyer and SLvl arerelatively newerstyles. We canalso notice that 2.5Fin, 2.5Unf and
1.5Unf are deprecated styles.

Jupyter Notebook
You can see theinstructions running in the Jupyter Notebook below:

Howto Use a Jupyter NoteBook?

Click on “Click to Launch” 7 button to work andsee the code runninglive in the
notebook.

cick
Youcanclic! [7 to open
pen the the Jupyter
Jupyt Notebook
‘ebook iin a new tab. tab.

Goto File andclick Download as and then choose the formatofthefile to download.
&. You can choose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

+ A Thenotebooksession expires after 15 minutesofinactivity.

Your app can be found at: https: /ivvq3ere1mk42-live-app.educative.runinotebooks/ExploratoryDataAnalysis.ipynb G

i Click to launch app!

Aaand weare donewith theinitial Exploratory Analysis! We will move onto Data Preprocessing
in the nextlesson.
Kaggle Challenge - Data Preprocessing

coverthe following A

* 2. Data Preprocessing - Prepare the Data for Machine Learning Algorithms


Deal With Missing Values

Deal With Outliers

Deal With Correlated Attributes

Handle Text And Categorical Attributes

Feature Scaling
Jupyter Notebook

2. Data Preprocessing - Prepare the Data for


Machine Learning Algorithms
Wetook ournotesin the exploratory phase,nowit’s time to act on them and prepareour data
for the machinelearning algorithms. Instead of just doing this manually, wewill also learn how
to write functions wherepossible.

Deal With Missing Values


Let’s get a sorted count of the missing valuesfor all the attributes.

1 housing. isnull().sum().sort_values(ascending-False)

In [34]: M_ housing. isnull().sum().sort_values(ascending=False)


Out[34]: Pooloc 1453
MiscFeature 1406
Alley 1369
Fence 1179
FireplaceQu 690
LotFrontage 259
GarageType 81
GarageCond 81
GarageFinish 1
GarageQual al
GarageYrelt gl
BsmtFinType2 38
BsmtExposure 38
BsmtQual 37
BsmtCond 37
BsmtFinTypel 37
MasVnrArea 8
MasVnrType 8
Electrical 1
RoofMat] e
Exteriorist e
RoofStyle e
ExterQual e
Exterior2nd e
YearBuilt e
ExterCond ®
Foundation e
YearRemodAdd e
salePrice e
OverallCond e
GarageArea e
PavedDrive e
WoodDeckSF e
OpenPorchsF e
3SsnPorch e
BsmtUnfSF @
ScreenPorch e
PoolArea e
MiscVal e
MoSold e

From the results above we can assumethat PoolQC to Bsmtattributes are missing for the
houses that do not havethese facilities (houses without pools, basements, garageetc.).
Therefore, the missing values couldhefilled in with “None”. MasVnrType and MasVnrArea
both have 8 missing values, likely houses without masonry veneer.

Whatshould wedowith all this missing data?

Most machinelearning algorithms cannot work with missing features, so we needto take care
ofthem.Essentially, we have three options:

* Getrid of the corresponding houses.

Getrid of the whole attribute or removethe whole column.

« Set the missing values to somevalue (zero, the mean,the median,etc.).

Wecan accomplish these easily using DataFrame’s dropna() , drop() ,and fillna() methods.

# Note: Whenever you choosethethird option, say imputing valuesusing the median,
you should compute the median value onthetraining set, anduseit to fill the missing
valuesin the trainingset. But you should also rememberto later replace missing values in
thetest set using the same median value when you wantto evaluate your system, and
also once the model gets deployedto replace missing values in new unseendata.

Wearegoing to apply different approachesto fix our missing values, so that we can various
approaches in action:

« Wearegoingto replace valuesfor categorical attributes with None.


« For LotFrontage, we are going to go bit fancy and computethe median LotFrontage for
all the houses in the same neighborhood,insteadof the plain medianforthe entire
column,andusethatto impute on a neighborhoodby neighborhood basis.
« Wearegoingto replace missing values for most of the numerical columns with zero and
one with the mode.
« Wearegoing to drop onenon-interesting column,Utilities.

Right now,we are goingto lookat howto dothese fixes by explicitly writing the nameof the
columnin the code. Later, in the upcomingsection on transformationpipelines, wewill learn
howto handle them in an automated manner aswell.

1 # Imputing Missing Values


2
3 housingprocessed = housing
4
5 # Categorical columns:
6 cat_cols_fill_none ["PoolQc’, ‘MiscFeature’, ‘Alley’, ‘Fence’, ‘FireplaceQu’,
7 “GarageCond’, ‘GarageQual’, ‘GaregeFinish’, “GarageType’,
8 “BsmtFinType2", “BsmtExposure’, “BsmtFinType1’, “BsmtQual’, ‘BsmtCond’,
9 “MasvarType" ]
18
11 # Replace missing values for categorical columns with None
12 for cat in cat_cols_fill_none:
13 housing_processed[cat] = housingprocessed[cat].fillna("None")
14
15 # Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
16 housing_processed[ ‘LotFrontage"] = housing_processed.groupby( Neighborhood”) ["LotFrontage"].transfo,
a7 lambda x: x.fillna(x.median()))
18
19 # Garage: GarageYrBlt, GarageArea and GarageCars these are numerical columns, replace with zero
20 for col in ['GarageYrBlt', ‘GarageArea’, ‘GarageCars’]:
2 housing_processed[col] = housing_processed[col].fillna(int(@))
22
23. # MasVnrArea : replace with zero
24 housing_processed[ ‘MasVnrarea’] = housing_processed[ ‘MasVnrérea’].fillna(int(@))
25
26 # Use the mode value
27 housingprocessed[ ‘Electrical’] = housing_processed[ ‘Electrical’ ].fillna(housing_processed[‘Electri
28
29 # There is no need of Utilities so let's just drop this column
30 housingprocessed = housingprocessed.drop(['Utilities'], axis=1)
31

In [35]: W# Inputing Missing Values


housingprocessed = housing
# Categorical columns:
cat_cols_fill_none = ["PoolQ", "MiscFeature’, ‘Alley’, *Fence’, “Fineplacegu',
"GarageCond’, "GarageQual’, "GarageFinish', ‘“GarageType",
‘BemtFintype2", "BsmtExposure’, "BamtFintypel", “BsmtQual’, “BemtCond’,
“masvneType" }
# Replace missing values for categorical columns with None
for cat in cat_cols_fill_none:
housing_processed{cat] = housingprocessed{ cat] .fillna("None")
‘ Group by neighborhood and FALL in missing value by the median Lotfrontage of all the netghborhood
housingprocessed{ LotFrontage’] = housingprocessed. groupby(“Neighborhood")["LotFrontage"].transform(
Lambda x: x. Fillna(x.median()))
4 GarageYr@lt, GorageArea ond GarageCars these are numerical columns, replace with zero
for col in ["Garageyr8lt', 'Garagedrea’, 'GarageCars*]
housing_processed{col] = housingprocessed{ col] .fillna(int(@))
masvnrarea replace with zero
housing_processed{ ‘MasVnrArea'] = housingprocessed| "KasVnrArea’ }.fillna(int(®))
tse the mode value
housingprocessed{ Electrical") = housingprocessed| "Electrical" }.fillna(housing_processed| “Electrical” }) .node()[0]
‘Athere 1s no need of Utilities so Let's just drop this column
housingprocessed = housingprocessed.drop({ "Utilities" ], axise1)
In [36]: W/# Get the count again to verify that we do not have any more missing values
housing_processed.isnul}().apply(sum) .max()
out(35): @

Deal With Outliers


To removenoisy data, weare going to remove houses where wehave someattributethat is
abovethe 0.999 quantile, highly abnormaldatapoint. Wecando this by invoking the
quantile() method on the DataFrameand thenfiltering based on the knowledgeofthe
quantiles for each attribute,like so:

num_attributes = housing_processed.select_dtypes(exclude=' object’)


@Vounune

high_quant = housing_processed.quantile(.999)
for i in num_attributes.columns:
housing_processed = housing_processed.drop(housing_processed[i][housing_processed[i]>high_quant
housing_processed. info()

In [37]: W|numattributes = housingprocessed. select_dtypes(exclude="object")


high_quant = housingprocessed. quantile(.999)
for 1 in numattributes. coluans:
housingprocessed = housing_processed.drop(housing_processed| i] [housing_processed[i]>high_quant[i]]-index)
housingprocessed. info(),
<class 'pandas.core.frane-DataFrane'>
Int64Index: 1422 entries, 0 to 1458
Data columns (total 79 columns):
NssubClass 1422 non-null inte
MSzoning 1422 non-null object
Lotfrontage 1422 non-null float6a
Lotares 1422 non-null intéa
Street 1422 non-null object
Alley 2422 non-null object
LotShape 1422 non-null object
LandContour 1422 non-null object
Lotconfig 1422 non-null object
LandSlope 1422 non-null object
Neighborhood 1422 non-null object
Condition 1422 non-null object
Condition? 1422 non-null object
BlegType 2422 non-null object
HouseStyle 2422 non-null object
Overallqual 1422 non-null intes

Invoking the info() methodon the updated DataFrametells us that weare left with 1422 rows
now.

Deal With Correlated Attributes


Using highly-correlated features when creating machine learning models can impact
performancenegatively. As we saw in the numerical analysis section, we have quite a few
correlated attributes. For example, we concluded that we can drop GarageArea becauseit is
highly correlated with GarageCarsand thereason for preferring GarageCarsis because it is
more correlated with price thanarea. (Pull out your notes from exploratory analysis at this
step.)

###% Remove highly correlated features


onunwner

# Remove attributes that were identified for excluding when viewing scatter plots & corr values
attributes_drop = ["Miscval’, ‘MoSold', ‘YrSold', "BsmtFinSF2’, "BsmtHalfBath’,, ‘MSSubCless",
“Garagesrea’, ‘GarageYrBlt", '3SsnPorch"]
housing_processed = housing_processed.drop(attributes_drop, axis=1)

In [38]: Mew" Renove highly correlated features


# Remove attributes that were identified for excluding when viewing scatter plots & corr values
attributes_drop = [‘Miscval', 'MoSold', ‘YrSold', 'BsntFinsF2", 'SsmtHalfeath' ,‘MSSubClass',
ar + ‘GarageyrBlt', *3SsnPorch*)
housingprocessed = housingprocessed. drop(attributes_drop, axis=1)

Handle Text And Categorical Attributes


Most MachineLearning algorithms need numbersasinput, so let’s convertall the categories
from text to numbers.

Acommonapproachto deal with textual datais to create one binary attribute for each
category of the feature: for example, for type of houses, we would haveoneattribute equal to 1
whenthecategory is 1Story (and 0 otherwise), anotherattribute equal to 1 when the category is
2Story (and 0 otherwise), and so on. Thisis called one-hot encoding, because only oneattribute
will be equalto 1 (hot), while the otherswill be 0 (cold). The new attributes are also known as
dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert categorical values
into one-hot vectors:

Scikit-Learnis the most widely usedlibrary for working on machinelearning/data science


projects.It is simple, easy to use andit provides manyefficient tools for data mining, data
analysis and modeling.In short,it is awesome!

###% Transforming Cat variables


wewne

from sklearn.preprocessing import OneHotencoder


cat_encoder = OneHotencoder()
housing_processed_ihot = cat_encoder. fit_transform(housing_processed)
housing_processed_thot

In [39]: M#### Transforming Cat variables


from sklearn.preprocessing tmport OneHotEncoder
cat_encader = Onetiotencoder()
hougingprocessed_ahot = catencoder.fit_transfora(housingprocessed)
housingprocessed Ihot
‘out{39]: <1422%7333 sparse matrix of type "<class “numpy.float6s">"
with 99540 stored elements in Compressed Sparse Row format>

Notice thatas a result of creating new one-hot attributes our total numberofattributes has
jumpedto 7333! We have a 1422x7333 matrix whichis mostly sparse(zeros).

Feature Scaling
FeatureScaling is one of the most important transformations we need to apply to our data. As
wesaidearlier, machine learning algorithms mostly do not perform well if they are fed
numericalattributes with very different scales as input. Thisis the case for the housing data. If
you go backandlookat thedistribution plots that wecreated in the very beginning, wenotice
that LotArea rangesfrom 0 to 200000, while GarageCarsranges only from to 4.

There are two commonwaysto get all attributes to have the samescale: min-max scaling and
standardization.

« Min-max scaling (also knownas normalization): this is a simple technique. Values are
shifted and rescaled sothatthey end up ranging from 0 to 1. This can be done by
subtracting the min value anddividing by the max minusthe min,but fortunately Scikit-
Learn providesa transformer(wewill talk about transformersin a bit) called mMinMaxScaler
to do this in a hassle-free manner.This transformer also provides the feature_range
hyperparameterso that we can change the rangeif for some reason wedon’t wantthe0 to
1 scale.

(X = Xmin)
Mee Xe Xmnin
« Standardization: this is a more sophisticated approach. Rememberthelessons from.
statistics? Standardization is doneby first subtracting the meanvalue(so standardized
values always have a 0 mean), andthen dividing by the standard deviation so thatthe
resulting distribution has unitvariance. Sinceit only cares about“fixing” the mean and
variance, standardization does not limit values to a specific range, which may be
problematic for some algorithms (e.g., neural networks often expect an input value
ranging from 0 to 1). However,standardization is muchless affected by outliers. Say Bill
Gates walks into a bar, suddenly the median incomefor people in the bar would shoot up
to the moon,so min-max scaling would bea poor choice for scaling here. On the other
hand, standardization would not be muchaffected. Scikit-Learn provides a transformer
called standardScaler for standardization.

gatik
o

Insteadof applyingthesescaling transformations on a column-by-columnbasis like we have


been handling data preparationso far, in the nextlesson,weare going to understand how to
use transformation pipelinesin orderto do all this work in a more automated and cleaner
fashion.

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Use a Jupyter NoteBook?

Click on “Click to Launch” g7 button to workandsee the code running live in the
notebook.

You canclick ZG to openthe Jupyter Notebookin a new tab.

Go to File andclick Downloadas and then choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

A\ Thenotebooksession expires after 15 minutes of inactivity.

‘Your app can be found at: httos://vvq3ere1mk42-live-app.educative.runinotebooks/DataPreprocessing.ipynb G

——

Click to launch app!


Kaggle Challenge - Data Transformation

We'll coverthe following a

3. Transformation Pipelines

* Jupyter Notebook

3. Transformation Pipelines
As you cansee, from imputing missing values to featurescaling to handling categorical
attributes, we have manydata transformationstepsthat need to be executedin the rightorder.
Fortunately, Scikit-Learn is here to makeourlife easier: Scikit-Learn provides the Pipeline
class to help with such sequencesof transformations.

#® Note:Creating transformationpipelinesis optional.It is handy whendealing with a


large numberofattributes, so it is a good-to-knowfeatureof Scikit-Learn.In fact, at this
pointwecoulddirectly moveonto create our machine learning model. However,for
learning howthings are done, weare going to look at working with pipelines.

SomeScikit-Learn terminology:

« Estimators: An object that can estimate some parameters based on dataset, e.g., an
imputer is an estimator). Theestimationitself is performed bysimply calling the fit()
method.

Transformers: Someestimators (such as an imputer) can also transform dataset; these


are called transformers. The transformation is performed by the handy andeasy to use
transform() method with the dataset to transform as a parameter.

Predictors: Someestimators are capable of makingpredictionsgiven a dataset; they are


called predictors. For example, the LinearRegression modelis a predictor. A predictor has
a predict() methodthattakes a datasetof new instances andreturns a dataset of
correspondingpredictions. It also has a score() method that measures the quality of the
predictionsgiven a testset.

Based on someofthe data preparation steps we haveidentified so far, we are going to create a
transformation pipeline based on simpleImputer (*) and StandardScalar Classes for the
numericalattributes and OneHotEncoder for dealing with categorical attributes.

(*)Scikit-Learn provides a very handyclass, simpleImputer to take care of missing values. You
just tell it the type of imputation,e.g. by median,andvoila,the job is done. We have already
talked abouttheothertwoclasses.

First, we will look at a simple examplepipelineto impute and scale numerical attributes. Then
wewill create a full pipeline to handle both numerical andcategorical attributes in one go.

The numericalpipeline:

1 # Import modules
2 from sklearn.pipeline import Pipeline
3. from sklearn.preprocessing import Standardscaler
4 from sklearn.compose import ColumnTransformer
5 from sklearn.impute import SimpleImputer
6
7 # Separate features and target variable
8 housing_X = housing_processed.drop("SalePrice", axis=1)
9 housingy = housing_processed["SalePrice"].copy()
11 # Get the list of names for numerical and categorical attributes separately
12 num_attributes = housingX.select_dtypes(exclude=" object’)
13 cat_attributes = housingX.select_dtypes(include=" object’)
15 num_attribs ist(num_attributes)
16 cat_attribs = ist(cat_attributes)
18 # Numerical Pipeline to impute any missing values with the median and scale attributes
19 num_pipeline = Pipeline([
20 (imputer’, SimpleImputer(strategy="median")),
2 ('std_scaler’, StandardScaler()),
22 D

In [40]: Mf Inport modules


from sklearn.pipeline import Pipeline
from sklearn-preprocessing import Standardscaler
from sklearn.compose import Columntransformer
from sklearn.impute inport SimpleImputer
4 Separate features and target varioble
housing_X = housingprocessed.drop("SalePrice", axise1)
housing_y = housing_processed{SalePrice”].copy()
# Get the List of names for numerical and categorical attributes separately
num_attributes = housingX.select_dtypes(exclude='object')
cat_attributes = housingX.selectdtypes(includes"object')
rnum_attribs = list (numattributes)
cat_attribs = List(cat_attributes)
# Numerical Pipeline to impute any missing values with the median and scale attributes
rium_pipeline = Pipeline(|
(Cimputer', Sinpletmputer(strategy="sedian")),
(std_scaler", Standardscaler(
»

Note that we haveseparated the SalePrice attribute into a separatevariable, because for
creating the machinelearning model, weneed to separateall the features, housing_X, from the
target variable, housing_y.

The Pipeline constructor takesa list of name/estimatorpairs defining a sequenceof steps. The
namescan be whatever we wantas longas they are unique and without double underscores,

Thepipelineis run sequentially, one transformerat a time, passing the outputof each call as
the parameterto the nextcall. In this example,the last estimator is a StandardScaler(a
transformer), andthe pipelineapplies all the transformsto the data in sequence.

So far, we have handled categorical and numerical attributes separately.It is more convenient
andclean to havea single transformer handleall columns, applying the appropriate
transformationsto each column.Scikit-Learn comes to the rescue again by providing the
ColumnTransformer for the very purpose. Let’s useit to apply all the transformationsto our data
andcreate a completepipeline.

(num_pipelineis the numerical pipeline from the previous step)

full_pipeline = Columntransformer([
num_pipeline, num_attribs),
", OneHotEncoder(), cat_attribs),

# Description before applying transforms


print(housing_y.describe())
10 # Apply log-transform to SalePrice
11 housingyprepared = np.log(housing_y)
13 # Run the transformation pipeline on all the other attributes
14 housingX_prepared = full_pipeline.fit_transform(housing_X)
16 # Description before applying transforms
417 print(housing_y_prepared.describe())
19 housing_X_prepared

In [41]: M. full_pipeline = ColunaTransforser({


(Cnun", nus_pipeline, numsttribe),
(eat, Onellotencoder(), cat_attribs),
»
1 Description before applying transform=
print(housingy.describe())
# Apply Log-transform to SalePrice
housingy_prepared = np.log(housing_y)
# Run the transformation pipeline on all the other attributes
housing_Xprepared = full_pipeline.fit_transform(housingXx)
# Description before applying transforms
print(housing_y_propared.escribo())
housing_X_prepared
count _1422.@08000
ean 178405.042897
sta 74506926127
in 25211.000000
25% _-129600.000000
Sex 1615ee.609000
75% -211750.00000
max 611657.000000
Name: SalePrice, ctype: Floated
count 1422.080000
rnean 12,014792
ote 2.389598,
in 19.471950
25% 11.772287
50% 11.992260
75% 121263160
sax 131222027
Name: SalePrice, dtype: floates
‘ut{41}: <1422x281 sparse matrix of type ‘<class ‘nunpy.floatoa’>”
with 98118 stored elements in Compressed Sparse Row fornat>

Whatis happening in the ColumnTransformer?

1. We import the ColumnTransformer class.


2. Wegetthelist of numerical column namesandthelist of categorical column names.
3. Weconstruct a ColumnTransformer. The constructor requiresa list of tuples, where each
tuple contains a name,a transformer, anda list of namesof columnsthat the transformer
should be applied to.

In this example, wespecify that the numerical columnsshould be transformed using the
num_pipeline that wedefined earlier, and the categorical columnsshould be transformed
using a OneHotEncoder.Finally, we apply this ColumnTransformerto the housing data
using fit_transform(.

Andthat’s it! We have a preprocessing pipeline that takes the housing data and applies the
appropriate transformations to each column.

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Use a Jupyter NoteBook?

* Click on “Click to Launch” g7 button to workandsee the code running live in the
notebook.

+ You can click ZG to openthe Jupyter Notebookin a new tab.

* Go to File and click Download as and then choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

+ A Thenotebooksession expiresafter 15 minutes of inactivity.

‘Your app can be found at: httos://vvq3ere1mk42-live-app.educative.run/notebooks/Data Transformation pyn G

——_ «

Click to launch app!


Kaggle Challenge - Machine Learning Models

We'll coverthe following a

* 4. Create and Assess Machine Learning Models


© Train and Evaluate Multiple Models onthe Training Set
Comparative analysis of the models andtheirerrors
* Evaluation Using Cross-Validation
* Jupyter Notebook

4. Create and Assess Machine Learning


Models

Train and Evaluate Multiple Models on the Training Set


Atlast! We framed the problem,wegot the data, exploredit, prepared thedata, and wrote
transformationpipelinesto clean up the data for machine learning algorithms automatically.
Weare nowready for the mostexcitingpart: to select and train a machine learning model.

Thegreat newsis that thanks toall the previoussteps,things are going to be way simpler than
you mightthink! Scikit-learn makesit all very easy!

Create a Test Set

As a first step weare goingto split ourdata into twosets: training set andtest set. We are going
to train our modelonly on part ofthe data because weneedto keep someofit aside in order to
evaluate the quality of our model.

Creating a test set is quite simple: the most commonapproach is to pick some instances
randomly, typically 20% of the dataset, and set them aside. The simplest function for doing this
Scikit-learn’s train_test_split() .

It is a commonconvention to namethefeatureset with X in the name, X_train and X_test, and


the data with the variable to be predicted with y in the name,y_train and y_test:

# Split data into train and test formate


RUN

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, ytest = train_test_split(housingXprepared, housingyprepared, test_si

In [42]: W_# Split dato into train ond test formate


from sklearn.nodel_selection import train_test_spit
X.train, Xtest, y_train, ytest = traintestsplit(housingXprepared, housingyprepared, test_size=0.2, randomstate=7)

With thetraining andtestdata in hand,creating a modelis really easy. Say we wantto create a
Linear Regression model.In general, this is whatit lookslike:

# Import modules
wavanaune

from sklearn.linear_model import LinearRegression


# Train the model on training data
model = LinearRegression()
model.fit(X_train, ytrain)
# Evaluate the model on test data
print("Accuracy%:", model.score(X_test, y_test)*100)

Andthat’s it! There you have a linear regression modelin threelines of code!

Nowwewantto create and compare multiple models, so weare goingto storetheresults from
the evaluation of each modelin variable. Since weare dealing with a regression problem, we
are also going to use RMSEas the main performance measureto assess the quality of our
models.

RMSE(Root Mean SquareError)is a typical performance measureforregression problems. It


gives an idea of how mucherrorthe system typically makesin its predictions by measuring the
differences between values predicted by the modeland theactualvalues, actual prices vs
predicted prices. It is the standard deviation of the prediction errors, a measure of how spread
outtheseerrors are from thelineofbest fit.

The equation for RMSEis simple: we sum the squareofall the errors between predicted values
andactual values, wedivide bythe total numberoftest examples and then wetake the square
root ofthe results:

Again,not to worry about implementing formulas, because we are going to measure RMSEof
our regression models using Scikit-learn’s mean_squared_error function.

One morething to rememberis that wetook thelogofourtarget variable, SalePrice. This


meansthat before evaluating RMSE, we needto convertprices backto their original values.
Inverse ofthe log meansto simply take the exponentialofthe logvalues,i.e., we will simply
call np.exp() . And since weneedto get the inverse multiple times, we are going to write a
function as a goodcoding practice,like so:

def inv_y(y):
return np.exp(y)

Let’s train our models:

from sklearn.metrics import mean_squared_error


wavanaune

from sklearn.linear_model import Lasso


from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Elastichet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
10 from sklearn.ensemble import GradientBoostingRegressor
11 from sklearn.ensemble import AdaBoostRegressor
12 from sklearn.tree import DecisionTreeRegressor
13 from xgboost import XGBRegressor
14 import xgboost
15
16 # Invert the log-transformed value
47 def inv_y(transformed_y):
18 return np.exp(transformed_y)
19
20 # Series to collect RMSE for the different algorithms: “algorithm name + rmse”
21 rmse_compare = pd.Series()
22. rmse_compare.index.name = "Model"
23
24 # Series to collect accuracy scores for the different algorithms: “algorithm name + score”
25 scores_compare pd.Series()
26 scores_compare.index.name = ‘Model"
27
28 # Model 1: Linear Regression
29 linear_model = LinearRegression()
30 linear_model.fit(x_train, ytrain)
31

In [43]: from sklearn.metrics sport mean_squared_error


from sklearn.linear_nodel import Lasso
from sklearn.1inear_wodel import LinearRegression
from sklearn.linear_rodel import Ridge
from sklearn.linear_nodel import Elasticlet
from sklearn.neighbors import KNei ghborsRegressor
from sklearn.sva import SVR
fron sklearn.ensenble import RandonForestRegressor
from sklearn.ensenble import
from sklearn.ensenble inport GradientBoosting
AdaBoostRegressor
from sklearn.troe Anport DecisionTreekogressor
from xgboost import XGBRegressor
import xgboost
# Invert the log-transformed value
of inv_y(transformed_y):
return np.exp(transformed_y)
# Series to collect RNSE for the different algorithns: “algorithm name + rnse”
rnse_conpare pd.Series()
Pmse_conpare.index.nane = ‘Model
# Series to collect accuracy scores for the different algorithms: “algorithm nane + score"
scores_conpare = pd.Series()
scoresconpare.index.nane = ‘Nodel"
# Model 1:
Linear_nodel = LinearRegression()
Linearswodel. fit(X_train, y_train)
Linear_val_predictions = Linear_vodel.predict(Xtest)
Linear_val_rmse nean_squared_error(iny_y(1inear_val_predictions), inv_y(y_test))
np.sqrt(1inear_val_rase)
conpare['Linearftegression’] = 1inear_val_rase
Linear_model.score(Xtest, y_test)*10¢
scores_conpare[ ‘LinearRegression'] = 1r_score
In [44]: | # Model 2: Decision Trees. Define the model.
tree_nodel = Decision ‘ossor(randon_state:
dtree_nodel.Fit(X train, y_train)
y_val_predictions = dtre
tase.conpare(‘Decisiontree') = deree
dtree_score = dtree_rodel.score(Xtest, y_test)*100
scores_conpare[ ‘OecisionTree'] = dtreescore

In [45]: W_# Model 3: Random Forest. Define the model


rf_model = RandoaForestRegressor(randoa_state=5)
rflnodel.fit(Xtrain, y_train)
Pf_val_predictions = rfnodel.predict(Xtest)
Pflval_nase mean_squared_error(inv_y(r#_val_predictions), inv_y(y_test))
Pflval_pase
rse_conpare[
Pf_score = rf_nodt score(Xtest, y_test)*100
scorescompare jonForest"] rf
In [46]: W# Model 4: Gradient Boosting Regression
gor_nodel = GradientBoostingRegressor(n.
jpthes, random_statess)
gbr_nodel.fit(X train, ytrain)
{gbr_val_predictions = gbr_nodel.predict(Xtest)
br_val_rase = mean_squared_error(iny_y(gbr_val_predictions), inv_y(y_test))
Bbr_val_rase = np.sqrt(gbr_val_rase)
se_conpare[ ‘Gradient8oosting™] = gbr_val_rase
br_score = gbr_nodel-score(Xtest, _test)*100)
Scores_compare[ ‘Gradient8oosting"} {gor_score

Wehavetrained four different models. As you cansee,training from one modelto another
just meansthat youjust select a different one from Scikit-Learn’s library and change a single
line of code!

Comparative analysis of the models and their errors

Nowlet’s get the performance measures for our models in sorted order, from best to worst:

1 print("RMSE values for different algorithms: *)


2 rmse_compare. sort_values(ascending=True) .round()

1 print("Accuracy scores for different algorithms: ')


2 scores_compare.sort_values (ascending = False).round(3)

In [47]: W print(‘RMSE values for different algorithes:')


ase_compare.sort_values(ascending=True) .round()
RUSE values for different algorithas:
out[47]: Moder
LinearRegression 24637.0
GradientBoosting 27212.0
RandosForest 31091.0
DecisionTree 37872.0
type: Floste
In [48]: W) print(‘Accuracy scores for different algorithas:*)
Scores_conpare.sort_values(ascending = False)-round(3)
Accuracy scores for different algorithes:
out[48): Model
LinearRegression 89.591
GradientBoosting 89.567
RandosForest 84.796
DecisionTree 72.908
type: floates

The simplest model, Linear Regression, seemsto be performing thebest, with predicted prices
that are off by about 24K. This might or might not be an acceptable amountof deviation
dependingon the desired level of accuracy or the metric wearetrying to optimize based on
our business objective.

General Notes

largeprediction error usually means an exampleof a model underfitting the training data.
Whenthis happensit can mean thatthe features do not provide enough information to make
good predictions,or that the modelis not powerful enough. The main ways to fix underfitting
are to select a more powerful model, to feed the training algorithm with better features, or to
reduce the constraints on the model.

In this case, we havetrained more powerful models, capable offinding complex nonlinear
relationshipsin thedata,like a DecisionTreeRegressoras well. However, the more powerful
model seems to be performing worse! The Decision Tree modelis overfitting badly enough to
perform even worse than the simpler Linear Regression model.

Possible solutions to deal with overfitting are to simplify the model, constrain it, or get more
training data.

Random Forests workby training many Decision Trees on random subsetsof the features, then
averaging outtheir predictions. Building a modelon top of many other modelsis called
Ensemble Learning,andit is used to improvethe performanceofthealgorithms. In fact, we
cansee that Random Forests are performing much betterthan Decision Trees.

Evaluation Using Cross-Validation

Onewayto evaluate modelsisto split the training set into a smallertraining set and a
validationset, then train the modelsagainst the smaller training set and evaluate them against
the validationset. Thisis called cross-validation. Wecan useScikit-Learn’s cross-validation
feature, cross_val_score, forthis.

Let’s perform a K-fold cross-validation on our best model: the cross-validation function
randomlysplits the training set into K distinct subsets orfolds, then it trains and evaluates the
model K times, pickinga differentfold for evaluation every timeand trainingon theother 9
folds. The result is an array containing theK evaluation scores:

from sklearn.model_selection import cross_val_score


wavanaune

# Perform K fold cross-validation, where K=:


scores = cross_val_score(linear_model, X_train, ytrain,
scoring="neg_mean_squared_error” cve10)
linear_rmse_scores = np.sqrt(-scores)
# Display results
def display_scores(scores):
18 print("Scores:", scores)
a print("Mean:", scores.mean())
12 print("Standard deviation:", scores.std())
13
14 display_scores(linear_rmse_scores)

In [49]: W. from sklearn.model_selection import cross_val_score


scores = cross_val_score(linear_nodel, X_train, y_train,
ieg.mean_squared_error”, cvs10)
Linear_rase_scores = np.sart(-scores)
dof display_scores(scores):
print("Scores:", scores)
print(*Mean:", Scores.mean())
print("Standard deviation:", scores.sta())
éisplay_scores(1inear_rase_scores)
Scores: [0.11254073 @,13883274 0,10564025 @.12889019 ,10899275 0. 11501349
(0, 10957026 0.11747952 0.13242401 @,11532405)
Mean: @,11857079770059828
Standard deviation: ©,010669351420601209
In [50]: W_ scores = cross_val_score(rf_nodel, Xtrain, y_train,
‘scoringe"neg_moansquared_error", ev#i0)
rf_onse_scores = np.sant(-scores)
cores(scores):
Scores:", scores)
ean:", Seores.mean())
‘Standard deviation:", scores.std())
display_scores(rf_rmse_scores)
Scores: [0.13156435 @,17088973 0.12085582 @.16862507 @.13179427 @.14031469
(0.14391635 0.11671486 9.15015448 0,14648264)
Moan: @.14173122731360774
Standaré deviation: 0,016548003161403323,

From theresults, we notice thatcross-validation gives us the mean andstandarddeviation for


thescoresas well. But cross-validation comes at thecostof training the model several times, so
it is not alwaysthe most viable choice.

#® Note:In general, save your models so that you can comeback to any model you want.
Make sureto save the hyperparameters,the trained parameters, andalso the evaluation
scores. Why? Becausethis will allow you to easily comparescoresacross model types and
comparethetypesoferrors they make. This will especially be useful whenthe problem is
complex, your notebookis huge and/or modeltraining timeis very large.

Scikit-learn models can be saved easily using the pickle module, or using
sklearn.externals.joblib , whichis moreefficientat serializing large NumPyarrays:

from sklearn.externals import joblib

# Save model
joblib.dump(my_model, "my_model.pkl"

# Load saved model


my_model_loaded = joblib.load("my_model.pk1")

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Use a Jupyter NoteBook?

Click on “Click to Launch” g7 button to workandsee the code running live in the
notebook.

You canclick ZG to openthe Jupyter Notebookin a new tab.

Go to File andclick Downloadas and then choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

A\ Thenotebooksession expires after 15 minutes of inactivity.

‘Your app can be found at: https://vvq3ereImk42-live-app.educative.run/notebooks/MachineLearningModels.ipynb G

——

Click to launch app!


—_——_— — ——
Kaggle Challenge - Fine Tune Parameters

We'll coverthe following a

* 5, Fine-Tune Your Model


© Grid Search
* Evaluate Using the Fine-Tuned Model
Some More Waysto Perform Fine-Tuning
» 1,Randomized Search
* 2. Ensemble Methods
* Jupyter Notebook

5. Fine-Tune Your Model


Say ourbest performing model was the RandomForestRegressor.This is a modelthat has many
input hyperparametersthat can be tweakedfor improving performance. For example, we
could have a forest with 100 or 1000 trees, or wecould use 10 or 50 features during random
selection. Whatare the best values for these hyperparametersto passas inputto the model for
training?

Grid Search
Should wefiddle with all the possible values manually and then compareresults to find the
best combination of hyperparameters? This would bereally tedious work, and we would end
up exploring only a few possible combinations.

Luckily, we can use Scikit-learn’s GridSearchcv to do this tedious search workfor us. All we
need todois tell it which hyperparameters we wouldlike to explore and which valuesto try
out, andit will evaluateall the possible combinations of hyperparametervalues, using cross-
validation.

For example,let’s see how tosearch for the best combination of hyperparametervalues for the
RandomForestRegressor:

1 from sklearn.model_selection import GridSearchcv


2
3 # Define the parameters for exploration
4 param_grid = [
5 {'nestimators': [1@, 50, 100, 150], ‘max_features’: [10, 20, 38, 48, 50, 100, 150]},
6 {‘bootstrap’: [False], ‘n_estimators’: [10, 5@, 100, 150], ‘max_features’: [10, 20, 30, 40, 50,
7 1
8
9 # The model for which we are finding params values
10 forest_reg = RandomForestRegressor()
12 grid_search = GridSearchCv(forest_reg, param_grid, cv=5,
13 neg_mean_squared_error’,
14 return_train_score=True)
15
16 grid_search.fit(x_train, ytrain)

In [51]: M from sklearn.model_selection import GridSearchCV


paramgrid = [
{Tnestimators’: [10, 50, 100, 150], ‘max_feature: 18, 20, 30, 40, 58, 100, 150]},
{bootstrap': [False], ‘nestimators': (10, 50, 1 150], ‘maxfeatures': [16, 20, 30, 40, 58, 100, 150]},
forest RandonForestRegressor()
grid_search = Gridsearchcv(forest_ri param_grid, eves,
scorings'neg_mean_squared_error',
return_train_score=True)
arid_search.fit(Xtrain, y_train)
‘ut[5i]: Gridsearchcv(eves, error_score="raise-deprecating’,
‘estinator=RandonForestRegressor(bootstrap=True, criterion='ase', max_depthsone,
mmax_featurese'auto’, max_leaf_nodessNone,
mmin_Anpurity,
mminasanples.
+ verbose=d, warmstart=False),
Fit_paramselione, Lid="warn' n_jobs=None,
[J®, 50, 100, 150], ‘maxfeatures": (10, 2, 20, 49, S@, 100, 15@)), (*bootstrap': [Fals
e], ‘nestinators| nax_features": [10, 28, 38, 40, 52, 100, 150]}],
je, return_train_score=True,
rbose-e)

Note:If we are clueless about which value a hyperparametershould have,a simple


approach is to pass consecutive powers of 10, or a smaller numberif you want a more
fine-grained search to GridSearch.

Wecan use best_params_ to visualize thebest values for the passed hyperparameters, and
best_estimator_ to get thefine-tuned model:

# Best values
wewne

grid_search.best_params_
# Model with best values
grid_search.best_estimator_

In [52]: M grid_search.best_parans_
ut{s2]: ("bootstrap': False, 'max_features': 58, ‘n_estinators': 150)
In [53]: W_ grid_search.best_estimator_
ut{53]: RandonForestRegressor(bootstrapsFalse, criterions'nse', max_depth=None,
inax_features=50, max_leaf_nodessNone, min_smpurity_decrease-0.0,
snin“impurity_splitelione, min_samples_leaf=1,1
in_sanples_splite2, min_weight_fraction_le2f8.0,
n_estimators=158, n_jobs-None, o0b_score-False,
randon_statesNone, verbose=@, wara_starteFalse)

Evaluate Using the Fine-Tuned Model

Nowthat we knowtheoptimal values for the hyperparameters(‘bootstrap’: False,


‘max_features’: 50, ‘n_estimators’: 150), let’s plug them in andsee if our Random Forest model
has improved comparedto the vanilla Random Forest modelthat wetrained earlier when we
trained multiple modelsat once:

rf_model_final = RandomForestRegressor(bootstrap-False,max_features=5, n_estimators=150, random_st


wavanaune

rf_model_final.fit(x_train, ytrain)
rf_final_val_predictions = rf_model_final.predict(X_test)
# Get RMSE
rf_final_val_rmse = mean_squared_error(inv_y(rf_final_val_predictions), inv_y(y_test))
np.sqrt(rf_final_val_rmse)
10 # Get Accuracy
11 rf_model_final.score(x_test, y_test)*100

tn [62]: MPFmodel_final = RandonForestRegreszor(bootstrap-Falee,qax_features-S0, 9_estimators-152, random_state=5)


rfmodel_final.fit(Xtrain, ytrain)
rf_final_val_predictions rfmodel_final predict(Xtest)
Pf_finalval_pmse mean_squared_error(inv.y(rf_final_val_predictions), invy(ytest)
np-sart(rf_final_val_rase)
out[s3]: 2e801.#19287572634
In [64]: W Pf_model_final.score(Xtest, y_test)*100
out 64): 87.81784897877596

Wow!Our accuracyhas gone up from about 84.8 to 87.8 while the RMSEhasdecreased from
31491 to 28801. This is a significant improvement!

In this example, we obtained a muchbetter solution by setting the max_features


hyperparameterto 50, and the n_estimators hyperparameter to 150. However,notice that
LinearRegressionis still performing better so weare going to chooseit as our final model.

Some More Waysto Perform Fine-Tuning


There are many waysto perform fine-tuning. Among which twoarediscussed here:

1. Randomized Search

Thegrid search approachis acceptable when weareexploring relatively few combinations, but
whenthe numberof combinations of the hyperparametersis large,it is often preferable to use
RandomizedSearchcv. Thisis similar to GridSearchCV class,but insteadof trying outall possible
combinations,it evaluates a given numberof random combinationsat every iteration.

2. Ensemble Methods

Another wayto fine-tuneis to try to combine the models that perform best. The group, or
ensemble,will often perform better than thebestindividual model,just like Random Forests
perform better thanthe individualDecision Treestheyrely on, especially if the individual
models makevery different types of errors.

Jupyter Notebook
You cansee the instructions running in the Jupyter Notebook below:

& Howto Use a Jupyter NoteBook?

* Click on “Click to Launch” g7 button to workandsee the code running live in the
notebook.

+ You can click ZG to openthe Jupyter Notebookin a new tab.

* Go to File and click Download as and then choosethe formatofthefile to download


>. You canchoose Notebook(.ipynb) to downloadthefile and worklocally or on
your personal Jupyter Notebook.

+ A Thenotebooksession expiresafter 15 minutes of inactivity.

‘Your app can be found at: httos://vvq3ereImk42-live-app.educative.run/notebooks/FineTuneModelParameters.ipynb G

——

Click to launch app!


—— <—_- ——
Kaggle Challenge - Present, Launch and Maintain the
System

We'll coverthe following a

6. Present the Solution

e 7. Launch, Monitor, and Maintain the System

6. Present the Solution


Now comes the phase where you need to showcase your results, to present your solution. Some
handy pointers:

« Highlight what you havelearned


« Whatworked and what did not
« Whatassumptions you made
¢ Your models limitations
* Create compelling presentations: useclear visualizations, easy to rememberstatements,
avoidlots of text and use the powerofstory-telling!
« Use terminology that is tailored to the technicallevelof the audience.

# Side Note: Say this housing example wasa real project. Thefinal performanceof the
model could be used to understandif ML based solution can be usedto replace human
experts in the loop. Automating thesetasks is useful because it meansthatthe experts get
to have morefree time which they can dedicate to moreinteresting and productivetasks.

7. Launch, Monitor, and Maintain the System


Say we were working on real project and after your awesomepresentation, you got the
approvalto deploy yoursolution for production. Now you'd need toget it ready for production.
You canstart doingthis by plugging in production data as inputto your model and writing
tests.

When ML models are in production,it is crucial to have monitoring in placein order to check
the system’s performanceat regularintervals andtrigger alerts when things go bananas.

Finally, you will likely need to train your models at regularintervals using fresh data. In order
to avoid doing the sametasks over andoveragain,strive to automatethis process as much as
possible. Automating meansthat you can run updates at exact intervals without
procrastination issues and your system will stay up-to-date and show badfluctuations over
time.

Of course,thesesteps are not needed if you arejust building a model, say for a Kaggle
competition. In that case youcanstopat fine-tuning!

Congratulations, you have now successfully learned to perform exploratory analysis,


preparethe data, create and evaluate multiple machine learning modelsandfine-tune
your best model! })
Assignment

We'll coverthe following a

e Practice MakesPerfect: Working With Real Data

© Open Datasets

Practice Makes Perfect: Working With Real Data


Now you havea goodidea of what a machinelearning/data science projectlookslike. You have
gained familiarity with greattools and techniques that you can useto train ML models. Asit
should be pretty obvious by now, much ofthe workis in the data preparationstep. In fact,first-
timersare often surprised by howlittle time in a machinelearningproject is spentactually
doing machinelearning. But it makes sense if you consider how time-consumingit is to gather
data,integrateit, cleanit, and pre-processit, and how muchtrial anderror can gointo feature
design. Machinelearningis not a one-shotprocess of building a dataset and runninga learner,
butrather an iterative process of running the learner, analyzing the results, modifying the data
and/orthelearner, and repeating.

The machinelearningalgorithms are importantbut, whengettingstarted,it is recommended


to be comfortable with the overall processfirst and learn just a few algorithmswell, rather
than spending all yourtime in learning advanced algorithmsatthe cost of ignoring the overall
process.

Your Turn Now!

1. First, try to improvethe performanceofthe modelfor the housing dataset by using


different models, selecting different features, replacing GridSearchCV with
RandomizedSearchCV, trying out a differentset of algorithms,etc.
2. Thenselect a dataset from a domain ofyour interest and go through the whole process
fromstartto end. Thekeyis to practice, practice and then some morepractice!

Open Datasets

Thereare thousandsof open datasets, ranging acrossall sorts of domains, just waiting for you.
Hereare a few popularplaces you canlookatto get lots of open data:

« Kaggle datasets

« Amazon's AWSdatasets

« Wikipedia’s list of Machine Learningdatasets

I would recommendyoustart on Kaggle because you will have a gooddataset to tackle, a clear
goal, and people to share yourexperience with.

Lookingforwardto hearing aboutall your great projects, and progress! @


Further Study Material

Data scienceis a vastfield. Thereis always moreto learn andexplore. Especially, if you start
browsing andlooking for resources aroundthe Internet, it won’t take long before you get
information overload. The keyis to not become overwhelmed.

Chooseone or twocourses or booksat a time. Read, learn, understand, and apply the
concepts before jumping on to the next, new, shinything.It is importantto apply the
concepts as you learn them,as we did throughoutthis course. It can be tempting to buy every
bookandstart every course, but then you will more thanlikely neverfinish any of them. So
rather than going all over the place, rememberto focus andto learn by practicing.

Henceto keepit short and sweet, I am notgoing to give you a list of 100 resources! Here are my
top two recommendationsfor you to go deeper and/or widerinto the topics we have covered
(andnot covered):

@ Data Science: Python Data Science Handbook

@ Machine Learning: Hands-On MachineLearning with Scikit-Learn and TensorFlow


Howto Get That High-Paying Job

You have mastered the mostessential concepts in Data Science now. Obviously, there are a
gazillion technologies, techniques, algorithms and everything in betweenthat a DataScientist
“must” know.Should you try to learn everything before you feel qualified to apply for a Data
Scientist job? Definitely not! Don’t get trappedin theblack hole of attempting to tick every
check box in the world. Chances are you will stay stuck ticking check boxes. It might eventually
work, but it wouldbe aninefficient process. Thereis a good reason whythereis the conceptof
learning on thejob. Let me give you two smart approaches to launch your career as a Data
Scientist based on yourpersonality type.

Whatkindof person are you?

1. Youlike to stick to more conventional ways.


2. You like to go against conventions and hack yourwayto your goals, with a “no-matter-
what-it-takes”attitude.

If your answeris 1, pick routeA.If you chose 2,pick route B. If you are in between,read both
anddecide based on whatsoundsbest to you.

Route A: The Public Portfolio

1. Think about your dreamjobas a DataScientist. Which industry wouldit be in? Do you see
yourself in finance, sports, health,fitness, beers, cookies, oceans? Whatarethethings that
pique your curiosity?
. Have you identified your kick? Good! Nowfind a public dataset from thatfield and press
N

the fast-forward button: imagine thatyou aregiving a presentation about your awesome
project. Whatstories from your data would you be telling your audience? How would you
be providing valueto the people listening? Based onthis vision of your future
presentation, reverse engineer the problem, formulate interesting questions, and define
the endgoal for your project.
. Once you are donewith your exciting project, put the bait on the hookandthrowit in the
wo

water. Build a PUBLIC portfolio and MAKE SOME NOISE! Use the powerof LinkedIn to
reach outto recruiterslet your portfolio do the talking instead of some boring
conventional CV. Go to meetups andtalk to people aboutyourprojects. Use the powerof
networking events to find potential employers.

Your portfolio will already unlock manydoors. Butif on top ofthat, your topic and findings are
valuableto a large audience, you'll be receiving more incomingrecruiting calls than you can
imagine.

If it wasn’t obvious enough,you will end up with a lot more than a compelling portfolio, you
will have learnedthelatest tools and techniquesin a fun,curiosity-driven,andlasting way.

Route B: The Uncommon Way

This is the uncommonpath,notfor the faint of heart.

Tam going to give you an uncommon approachthat can makeyou instantly standout from the
crowdand get youthat high-payingjob. It is a deceptively simple approach,but only suitable
for those whocantakeondifficult challenges and do whateverit takes to reachtheir goals. Are
you readyforit? So hereis goes:

Find a companythat you wouldlike to work at. Reach outto the decision makersin the
companyusing LinkedIn, references, or even just walk in. Andoffer to work for FREE for 2
months and while being assessed on the performance. Yes,insist that you don’t wanta salary,
or even a stipend.Say that, “I’m hereto learn and assessthesuitability of this role for me”. Then
buckle downto workfor those 2 monthsandfocus on providing VALUE.You will have become
a very valuable resourceby the end andyouwill be surprised how muchthey’ll want you to
stay with a high-paying offer after those 2 months. That’s it. Two monthsandthejob will be
yours.

Route A or RouteB, you havethe key to unlock that high-payingjob as a Data Scientist at
your dream company.

PS. If you spend even a couple weeksdoingthis, it will change your life.
Imposter Syndrome

“It’s only a matterof time until I’m called out. I’m just a fraud.”

“It’s a fluke that I gotthis job interview.”

“I have been preparingfor weeks, butI still don’t know anything.

Everhad anyfeelings of failure and pretending? Don’t worry. You arenotalone.

An estimated 70% ofus are likely to experience at somepointor anotherthesefeelings of


inadequacyandfakeness;the feeling like you’rejust on theverge of being exposed for what
youreally are — an impostor,a fraud.It’s a battle that most of us areoften fightinginternally,
butvery few havethe courageto talk aboutit. Becauseof course,if we do, wewill be
discovered and the mask will comeoff. This internal battle has a name, Imposter Syndrome.

If you end up with these feelings just before or during an interview, it can BLOW UPeverything
that you had been workingfor. This is why I decidedto talk aboutthis with you. I do not want
youto give upatthe very last moment. I do not wantyouto stop asking questions for the fear
ofbeing “discovered”, I do not wantyou to shut your mouth andstay away from speaking up
becausethereis a voice in your headsayingthat you do notbelonghere. I want you to
understand wherethatvoice is coming from and how you candeal with it. Because let metell
you onething: you are NOT an imposter.

| have written 11 books but each time| think ‘Uh-oh, they're goingto find out
now.I’ve run a game on everybody, and they're going to find me out.

—MayaAngelou

Every time | was called on in class, | was sure that | was about to embarrass
myself. Every time | took a test, | was surethatit had gone badly. And every time |
didn’t embarrass myself — or even excelled — | believed that | had fooled
everyone yet again. One day soon, the jig would be up ... This phenomenon of
capable people being plagued by self-doubt has a name — the impostor
syndrome. Both men and women are susceptible to the impostor syndrome, but
womentend to experience it moreintensely and be more limited byit.

—Sheryl Sandberg, Lean In

lam nota writer.I’ve been fooling myself and other people.

—John Steinbeck

Ultra-successful people are also plagued with these doubts andfeelings. Almostno oneis totally
immuneto the aweful Imposter Syndrome. But whatis really behind it?

Wherearethese feelings coming from?

Welive with cognitive-biases about knowledgeandlearning. Wehavethis notion in our heads


of the things that we SHOULD know. Weshould knowall the algorithms, we shouldbe pros at
all the latest tools and technologies, we should be proficientin this and that.

Fieldslike Data Science aresovastthat you can’t possibly know everything.Also, the more you
learn, the more things you will find to learn further. The knowledgegap will seem to widen
more and more.Andthis can result in making you feellike crap; like someone whoisn’t just
able to keep up with everything they mustlearn.

Thereis another hugefactor behindit, our urge to compare ourselves with others. Wefeel as if
all the people around us have way more knowledgethan wedo,they are waybetter than us,
they belong here while wejust got theticket by pure luck. Soundslike a devastating state of
mindto be in? Well, itis.

Luckily, like most syndromes,there are cures for Imposter Syndromeaswell.

Steps you can take to deal with Imposter Syndrome

Once wehaveidentified the symptomsand diagnosed ourselves as “Imposter Syndrome


Positive”, one of the first steps is to ACKNOWLEDGEthe thoughts and putthemin perspective.
Observethemsilently and ask yourself, ‘Does that thought help or hinder me?’ Then redirect
your FOCUS.

Imposter Syndrome makesyoufocuson all the things you do not know, especially before an
interview. Whenyoudetect the red alarm,redirect your focus from thelimitless possibilities of
things you do not knowtoall the things you do know,andaregoodat.It’s not the kind of
thoughts you wantbeforeaninterview. Remindyourself that you have your ownpositive
strengths, and you do not need to compare yourself with Tom and Harry. Tom and Harry might
be greatatskills “x,y,z” but you might be awesomeat “a,b,c” — we all have our own unique
strengths and weaknesses.

It’s good to rememberthat people who don’t feellike impostorsare no more intelligent or
competentor capablethan therest ofus.

Weall have knowledge-gaps, we cannotpossibly learn EVERYTHINGand answerALLthe


interview questionsperfectly. Let’s notlet ourselves be deceived bythesefeelings.

A big warninghere: I’m NOT telling you to adopt an attitude of arrogance. Farfrom it! We all
havea lot to learn,andthis is where the beauty of Continuous Learning and Growth Mindset
comeinto play. Keep learning, stay humble, and BELIEVE in yourself.
Final Thoughts

Give yourself a PAT ON THE BACKfor having madeit successfully to the end. Many
congratulations on completing all the lessons.

Before saying goodbye, hereare twothingsthat I really want you to rememberas you continue
on your journeyas a DataScientist:

“The path to becoming a great Data Scientist is not a sprint, but a marathon.”

“Don’t’ be a know-itall; be a learn-it-all.”

P.S. [hope you enjoyed this course. Let me know howit went.I'll wait to hear back from you.
Best of luck with the nextsteps in your journey to becominga great DataScientist!

You might also like