What is NumPy?
NumPy is a Python library used for working with arrays.It also has functions for working in
domain of linear algebra, Fourier transform, and matrices.NumPy was created in 2005 by Travis
Oliphant. It is an open source project and you can use it freely.NumPy stands for Numerical
Python.
Why Use NumPy?
In Python we have lists that serve the purpose of arrays, but they are slow to process.NumPy
aims to provide an array object that is up to 50x faster than traditional Python lists.The array
object in NumPy is called ndarray, it provides a lot of supporting functions that make working
with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very important.
Why is NumPy Faster Than Lists?
NumPy arrays are stored at one continuous place in memory unlike lists, so processes can access
and manipulate them very efficiently.
This behavior is called locality of reference in computer science.
This is the main reason why NumPy is faster than lists. Also it is optimized to work with latest
CPU architectures.
Which Language is NumPy written in?
NumPy is a Python library and is written partially in Python, but most of the parts that require
fast computation are written in C or C++.
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
We can create a NumPy ndarray object by using the array() function.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
type(): This built-in Python function tells us the type of the object passed to it. Like in above
code it shows that arr is numpy.ndarray type.
To create an ndarray, we can pass a list, tuple or any array-like object into the array() method,
and it will be converted into an ndarray:
Example
Use a tuple to create a NumPy array:
import numpy as np
arr = np.array((1, 2, 3, 4, 5))
print(arr)
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
nested array: are arrays that have arrays as their elements.
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
Create a 0-D array with value 42
import numpy as np
arr = np.array(42)
print(arr)
1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
These are the most common and basic arrays.
Example
Create a 1-D array containing the values 1,2,3,4,5:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent matrix or 2nd order tensors.
NumPy has a whole sub module dedicated towards matrix operations called numpy.mat
Example
Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called 3-D array.
These are often used to represent a 3rd order tensor.
Example
Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and
4,5,6:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Check Number of Dimensions?
NumPy Arrays provides the ndim attribute that returns an integer that tells us how many
dimensions the array have.
Example
Check how many dimensions the arrays have:
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
Higher Dimensional Arrays
An array can have any number of dimensions.
When the array is created, you can define the number of dimensions by using the ndmin
argument.
Example
Create an array with 5 dimensions and verify that it has 5 dimensions:
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('number of dimensions :', arr.ndim)
In this array the innermost dimension (5th dim) has 4 elements, the 4th dim has 1 element that is the
vector, the 3rd dim has 1 element that is the matrix with the vector, the 2nd dim has 1 element that is
3D array and 1st dim has 1 element that is a 4D array.
NumPy Array Indexing
Access Array Elements
Array indexing is the same as accessing an array element.
You can access an array element by referring to its index number.
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.
Example
Get the first element from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
Example
Get the second element from the following array.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1])
Example
Get third and fourth elements from the following array and add them.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
Example
Access the 2nd element on 1st dim:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st dim: ', arr[0, 1])
Example
Access the 5th element on 2nd dim:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('5th element on 2nd dim: ', arr[1, 4])
Access 3-D Arrays
To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.
Example
Access the third element of the second array of the first array:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
0 1 0 1
0, 1, 2 0, 1,2 0 , 1 ,2 0 ,1 ,2
print(arr[0, 1, 2])
Example Explained
arr[0, 1, 2] prints the value 6.
And this is why:
The first number represents the first dimension, which contains two arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]
The second number represents the second dimension, which also contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]
The third number represents the third dimension, which contains three values:
4
5
6
Since we selected 2, we end up with the third value:
6
Negative Indexing
Use negative indexing to access an array from the end.
Example
Print the last element from the 2nd dim:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])
Slicing arrays
Slicing in python means taking elements from one given index to another given index.
We pass slice instead of index like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start its considered 0
If we don't pass end its considered length of array in that dimension
If we don't pass step its considered 1
Example
Slice elements from index 1 to index 5 from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
Note: The result includes the start index, but excludes the end index.
Example
Slice elements from index 4 to the end of the array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])
Pandas is a Python library.
Pandas is used to analyze data.
Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contains plain text and is a well know format that can be read by everyone including
Pandas.
In our examples we will be using a CSV file called 'data.csv'.
Download data.csv. or Open data.csv
Example
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('desktop/data.csv')
print(df.to_string())
Tip: use to_string() to print the entire DataFrame.
By default, when you print a DataFrame, you will only get the first 5 rows, and the last 5 rows:
Example
Print a reduced sample:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Pandas Read JSON
Read JSON
Big data sets are often stored, or extracted as JSON.
JSON is plain text, but has the format of an object, and is well known in the world of
programming, including Pandas.
In our examples we will be using a JSON file called 'data.json'.
Open data.json.
Example
Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
Tip: use to_string() to print the entire DataFrame.
Dictionary as JSON
JSON = Python Dictionary
JSON objects have the same format as Python dictionaries.
If your JSON code is not in a file, but in a Python Dictionary, you can load it into a DataFrame
directly:
Example
Load a Python Dictionary into a DataFrame:
import pandas as pd
data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}
df = pd.DataFrame(data)
print(df)
Pandas - Analyzing DataFrames
Viewing the Data
One of the most used method for getting a quick overview of the DataFrame, is the head()
method.
The head() method returns the headers and a specified number of rows, starting from the top.
Example
Get a quick overview by printing the first 10 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
In our examples we will be using a CSV file called 'data.csv'.
Download data.csv, or open data.csv in your browser.
Note: if the number of rows is not specified, the head() method will return the top 5 rows.
Example
Print the first 5 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the
bottom.
Example
Print the last 5 rows of the DataFrame:
print(df.tail())
Info About the Data
The DataFrames object has a method called info(), that gives you more information about the
data set.
Example
Print information about the data:
print(df.info())
Result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 169 non-null int64
1 Pulse 169 non-null int64
2 Maxpulse 169 non-null int64
3 Calories 164 non-null float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
Example
Return a new Data Frame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
In our cleaning examples we will be using a CSV file called 'dirtydata.csv'.
Download dirtydata.csv. or Open dirtydata.csv
Note: By default, the dropna() method returns a new DataFrame, and will not change the
original.
If you want to change the original DataFrame, use the inplace = True argument:
Example
Remove all rows with NULL values:
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
Note: Now, the dropna(inplace = True) will NOT return a new DataFrame, but it will
remove all rows containg NULL values from the original DataFrame.
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
Example
Replace NULL values with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
What is Matplotlib?
Matplotlib is a low level graph plotting library in python that serves as a visualization utility.
Matplotlib was created by John D. Hunter.
Matplotlib is open source and we can use it freely.
Matplotlib is mostly written in python, a few segments are written in C, Objective-C and
Javascript for Platform compatibility.
import matplotlib.pyplot as plt
Now the Pyplot package can be referred to as plt.
Example
Draw a line in a diagram from position (0,0) to position (6,250):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
Linestyle
You can use the keyword argument linestyle, or shorter ls, to change the style of the plotted
line:
Example
Use a dotted line:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, linestyle = 'dotted')
plt.show()
Creating Bars
With Pyplot, you can use the bar() function to draw bar graphs:
Example
Draw 4 bars:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x,y)
plt.show()
The bar() function takes arguments that describes the layout of the bars.
The categories and their values represented by the first and second argument as arrays.
Example
x = ["APPLES", "BANANAS"]
y = [400, 350]
plt.bar(x, y)
Horizontal Bars
If you want the bars to be displayed horizontally instead of vertically, use the barh() function:
Example
Draw 4 horizontal bars:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.barh(x, y)
plt.show()
Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two arrays of the same
length, one for the values of the x-axis, and one for values on the y-axis:
Example
A simple scatter plot:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
Colors
You can set your own color for each scatter plot with the color or the c argument:
Example
Set your own color of the markers:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color = 'hotpink')
x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y, color = '#88c999')
plt.show()
Matplotlib Histograms
Histogram
A histogram is a graph showing frequency distributions.
It is a graph showing the number of observations within each given interval.
Example: Say you ask for the height of 250 people, you might end up with a histogram like this:
You can read from the histogram that there are approximately:
2 people from 140 to 145cm
5 people from 145 to 150cm
15 people from 151 to 156cm
31 people from 157 to 162cm
46 people from 163 to 168cm
53 people from 168 to 173cm
45 people from 173 to 178cm
28 people from 179 to 184cm
21 people from 185 to 190cm
4 people from 190 to 195cm
Create Histogram
In Matplotlib, we use the hist() function to create histograms.
The hist() function will use an array of numbers to create a histogram, the array is sent into the
function as an argument.
For simplicity we use NumPy to randomly generate an array with 250 values, where the values
will concentrate around 170, and the standard deviation is 10. Learn more about Normal Data
Distribution in our Machine Learning Tutorial.
Example
A Normal Data Distribution by NumPy:
import numpy as np
x = np.random.normal(170, 10, 250)
print(x)
The hist() function will read the array and produce a histogram:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.normal(170, 10, 250)
plt.hist(x)
plt.show()
Plotting x and y points
The plot() function is used to draw points (markers) in a diagram.
By default, the plot() function draws a line from point to point.
The function takes parameters for specifying points in the diagram.
Parameter 1 is an array containing the points on the x-axis.
Parameter 2 is an array containing the points on the y-axis.
If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3, 10] to the
plot function.
Example
Draw a line in a diagram from position (1, 3) to position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
Markers
You can use the keyword argument marker to emphasize each point with a specified marker:
Example
Mark each point with a circle:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, marker = 'o')
plt.show()
Linestyle
You can use the keyword argument linestyle, or shorter ls, to change the style of the plotted
line:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, linestyle = 'dotted')
plt.show()
Add Grid Lines to a Plot
With Pyplot, you can use the grid() function to add grid lines to the plot.
Example
Add grid lines to the plot:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid()
plt.show()
Result:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid(axis = 'y')
plt.show()
Matplotlib Pie Charts
Labels
Add labels to the pie chart with the label parameter.
The label parameter must be an array with one label for each wedge:
Example
A simple pie chart:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(y, labels = mylabels)
plt.show()
Python – seaborn.pairplot() method
Last Updated : 15 Jul, 2020
Prerequisite: Seaborn Programming Basics
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics. Seaborn helps resolve the
two major problems faced by Matplotlib; the problems are ?
Default Matplotlib parameters
Working with data frames
As Seaborn compliments and extends Matplotlib, the learning curve is quite gradual. If you
know Matplotlib, you are already half way through Seaborn.
seaborn.pairplot() :
To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function.
This shows the relationship for (n, 2) combination of variable in a DataFrame as a matrix of plots
and the diagonal plots are the univariate plots.
seaborn.pairplot( data, \*\*kwargs )
Seaborn.pairplot uses many arguments as input, main of which are described below in form of
table:
Arguments Description Value
Tidy (long-
form)
dataframe
where each
data column is a DataFrame
variable and
each row is
an
observation.
Variable in
“data“ to map
hue plot aspects string (variable name), optional
to different
colors.
Set of colors
for mapping
the “hue“
variable. If a
dict, keys
should be
palette values in the dict or seaborn color palette
“hue“
variable.
vars : list of
variable
names,
optional
Variables
within “data“
to use
separately for
the rows and
{x, y}_vars lists of variable names, optional
columns of
the figure; i.e.
to make a
non-square
plot.
dropna Drop missing boolean, optional
values from
the data
before
plotting.
Below is the implementation of above method:
Example 1:
# importing packages
import seaborn
import matplotlib.pyplot as plt
############# Main Section ############
# loading dataset using seaborn
df = seaborn.load_dataset('tips')
# pairplot with hue sex
seaborn.pairplot(df, hue ='sex')
# to show
plt.show()
Output :
Example 2:
# importing packages
import seaborn
import matplotlib.pyplot as plt
############# Main Section ############
# loading dataset using seaborn
df = seaborn.load_dataset('tips')
# pairplot with hue day
seaborn.pairplot(df, hue ='day')
# to show
plt.show()
#.
Output :
A Box Plot is also known as Whisker plot is created to display the summary of the set of data
values having properties like minimum, first quartile, median, third quartile and maximum. In
the box plot, a box is created from the first quartile to the third quartile, a verticle line is also
there which goes through the box at the median. Here x-axis denotes the data to be plotted while
the y-axis shows the frequency distribution.
Creating Box Plot
The matplotlib.pyplot module of matplotlib library provides boxplot() function with the
help of which we can create box plots.
Syntax:
matplotlib.pyplot.boxplot(data, notch=None, vert=None, patch_artist=None, widths=None)
Parameters:
Attribute Value
data array or aequence of array to be plotted
notch optional parameter accepts boolean values
optional parameter accepts boolean values false and true for horizontal and vertical
vert
plot respectively
bootstrap optional parameter accepts int specifies intervals around notched boxplots
optional parameter accepts array or sequnce of array dimension compatible with
usermedians
data
positions optional parameter accepts array and sets the position of boxes
widths optional parameter accepts array and sets the width of boxes
patch_artist optional parameter having boolean values
labels sequence of strings sets label for each dataset
meanline optinal having boolean value try to render meanline as full width of box
zorder optional parameter sets the zorder of the boxplot
The data values given to the ax.boxplot() method can be a Numpy array or Python list or
Tuple of arrays. Let us create the box plot by using numpy.random.normal() to create some
random data, it takes mean, standard deviation, and the desired number of values as arguments.
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(10) #The seed() method is used to initialize the random number generator.
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize =(10, 7))
# Creating plot
plt.boxplot(data)
# show plot
plt.show()
numpy.random.normal(loc , scale , size ) : creates an array of specified shape and fills it
with random values which is actually a part of Normal(Gaussian)Distribution. This is
Distribution is also known as Bell Curve because of its characteristics shape.
Output:
Customizing Box Plot
The matplotlib.pyplot.boxplot() provides endless customization possibilities to the box
plot. The notch = True attribute creates the notch format to the box plot, patch_artist =
True fills the boxplot with colors, we can set different colors to different boxes.The vert = 0
attribute creates horizontal box plot. labels takes same dimensions as the number data sets.
Example 1:
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(10)
data_1 = np.random.normal(100, 10, 200)
data_2 = np.random.normal(90, 20, 200)
data_3 = np.random.normal(80, 30, 200)
data_4 = np.random.normal(70, 40, 200)
data = [data_1, data_2, data_3, data_4]
fig = plt.figure(figsize =(10, 7))
# Creating axes instance
ax = fig.add_axes([0, 0, 1, 1])
# Creating plot
bp = ax.boxplot(data)
# show plot
plt.show()
Output: