Python Pandas
Introduction
Python pandas
pandas is an open-source Python library: a simple yet powerful and expressive tool for data manipulation and analysis.
Applications of Pandas
● Natural Language Processing
● Statistics
● Analytics
● Big Data
● Recommendation Engine
● Stock Prediction
● Data Science
Pandas Vs. Numpy
NumPy:
● Low-level data structure (np.array).
● Support for large multidimensional arrays and matrices.
● A wide range of mathematical array operations.
Pandas:
● High-level data structures (DataFrame). It provides an in-memory 2D table object called a DataFrame.
● More streamlined handling of tabular data, and rich time series functionality.
● Data alignment, handling of missing data, and groupby, merge, and join methods.
Installation
Open a terminal (on macOS) or the command prompt (on Windows) and install pandas using one of the following commands:
conda install pandas
Or
pip install pandas
● Alternatively, you can install pandas from within a Jupyter notebook using the code below:
!pip install pandas
● To use pandas, import it under a shorter alias:
import pandas as pd
Components of pandas
● Series and DataFrame are the two primary components of pandas.
● A Series is typically a single column, and a DataFrame is a two-dimensional table made up of a group of Series.
● Series: one-dimensional
● DataFrame: two-dimensional
● Panel: three-dimensional panel data (deprecated and since removed from pandas)
Pandas Series
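A Series can be created using the following constructor:
pandas.Series(data, index=..., dtype=..., copy=...)
● data: takes various forms, such as an ndarray or a list
● index: labels must be hashable and the same length as data; they do not have to be unique
● dtype: the data type
● copy: whether to copy the data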
Creating a Series
A pandas Series can be created from a Python list or a NumPy array.
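The two approaches might look like the following minimal sketch. The values and the variable names `s_from_list` and `s_from_array` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Build a Series from a plain Python list (illustrative values).
s_from_list = pd.Series([10, 20, 30, 40])

# Build a Series from a NumPy array; the dtype follows the array.
s_from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

print(s_from_list)
print(s_from_array)
```

In both cases pandas assigns a default integer index starting at 0.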
We can supply our own index values when creating a Series by passing the index parameter.
The index labels can also be strings.
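A short sketch of passing the index parameter, once with integer labels and once with string labels. The labels and values here are hypothetical.

```python
import pandas as pd

# Custom integer labels supplied through the index parameter.
s_custom = pd.Series([10, 20, 30], index=[101, 102, 103])

# String labels used as the row index.
s_labels = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s_custom[102])   # look up by the custom integer label
print(s_labels["b"])   # look up by the string label
```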
● A Series can also be created from a Python dictionary.
● Each key becomes a row index label, and the corresponding value becomes the value at that label.
When a dictionary value is a list, the list items stay together as a single element under one row index.
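As a rough illustration, the sketch below builds one Series from a dictionary of scalars and another from a dictionary whose values are lists. The keys and values are invented, and the second case is an assumption about what the original example showed.

```python
import pandas as pd

# Keys become the row index; values become the data.
s_dict = pd.Series({"apples": 3, "bananas": 5, "cherries": 7})

# Each list is kept whole as one element under its key.
s_lists = pd.Series({"even": [2, 4, 6], "odd": [1, 3]})

print(s_dict)
print(s_lists)
```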
To display the index labels and the values of a Series, use .index and .values respectively.
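For instance, with a small made-up Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s.index)    # the index labels
print(s.values)   # the underlying values as a NumPy array
```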
Accessing a Series
Accessing elements of a Series
Use the index operator [] to access elements in a Series.
● Retrieve the first five elements
● Retrieve the last five elements
● Use an index label to access a single element
● Retrieve multiple elements using a list of index labels
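The access patterns above could be sketched as follows. The Series contents are illustrative, and slicing is assumed to be the way the first and last five elements were retrieved.

```python
import pandas as pd

s = pd.Series([5, 10, 15, 20, 25, 30, 35], index=list("abcdefg"))

first_five = s[:5]             # slice: first five elements
last_five = s[-5:]             # slice: last five elements
single = s["c"]                # one element by its index label
several = s[["a", "d", "g"]]   # several elements by a list of labels

print(first_five)
print(last_five)
print(single)
print(several)
```

s.head() and s.tail() return the same first and last five elements by default.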
Filtering a Series
Filter all the values that are greater than 15.
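A minimal sketch of this filter using a boolean mask, with made-up values:

```python
import pandas as pd

s = pd.Series([5, 12, 18, 15, 22])

# s > 15 yields a boolean Series; indexing with it keeps only the True rows.
filtered = s[s > 15]

print(filtered)
```

Note that 15 itself is excluded because the comparison is strict.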
Arithmetic Operations
● Multiply each element in the Series by 2: use the '*' operator to perform multiplication.
● Add corresponding elements of two Series: use the '+' operator to perform addition.
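Both operations might look like this, using two small hypothetical Series that share the same default index:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([10, 20, 30])

doubled = s1 * 2    # element-wise multiplication by a scalar
total = s1 + s2     # element-wise addition, aligned on the index

print(doubled)
print(total)
```

Addition matches elements by index label, so two Series with different labels would produce missing values where a label appears in only one of them.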
Ranking and Sorting
● Ranking a Series returns the rank of each value in the underlying data.
● Sort the Series in ascending order.
● Sort the Series in descending order.
● Sort the Series based on its index.
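One way to sketch these four operations, assuming the rank(), sort_values() and sort_index() methods are the ones intended and using invented data:

```python
import pandas as pd

s = pd.Series([30, 10, 20], index=["c", "a", "b"])

ranks = s.rank()                              # rank of each value (1 = smallest)
ascending = s.sort_values()                   # sort by value, ascending
descending = s.sort_values(ascending=False)   # sort by value, descending
by_index = s.sort_index()                     # sort by index label

print(ranks)
print(ascending)
print(descending)
print(by_index)
```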
Check for Null Values
Check null values using .isnull()
True indicates that the value is null
Check null values using .notnull()
False indicates that the value is null
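A small sketch of both checks on a Series with one missing value (the data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

is_null = s.isnull()     # True where the value is missing
not_null = s.notnull()   # False where the value is missing

print(is_null)
print(not_null)
```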
Pandas DataFrame
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular manner (rows and columns).
Features of DataFrame:
● Columns can be of different types
● Size is mutable
● Labeled axes (rows and columns)
● Arithmetic operations can be performed on rows and columns
Reading data from different sources
Reading data from csv file
Use the read_csv() function from pandas to read data from a csv file.
Reading data from xlsx file
Use the read_excel() function from pandas to read data from an xlsx file.
Reading data from zip file
1. Read the zip file.
2. Open the csv file inside it.
3. Read the csv file.
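The three steps could be sketched with the standard-library zipfile module, which is an assumption about the package used. To keep the example self-contained, it first builds a small archive in memory; the file name `market.csv` and its contents are hypothetical.

```python
import io
import zipfile

import pandas as pd

# Build a tiny zip archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("market.csv", "store,percentage\nA,72.5\nB,64.0\n")
buffer.seek(0)

# 1. Read the zip file, 2. open the csv inside it, 3. read the csv.
with zipfile.ZipFile(buffer) as zf:
    with zf.open("market.csv") as csv_file:
        df_zip = pd.read_csv(csv_file)

print(df_zip)
```

With a real archive, a file path would be passed to zipfile.ZipFile in place of the in-memory buffer.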
Reading data from text file
Use the read_csv() function from pandas to read data from a text file.
Reading data from json file
Use the read_json() function from pandas to read data from a json file.
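The reader functions above can be tried without any files on disk by wrapping text in io.StringIO, as in this sketch. The data, the tab delimiter for the text file, and the records layout of the JSON are all assumptions made for illustration.

```python
import io

import pandas as pd

csv_text = "store,percentage\nA,72.5\nB,64.0\n"
tab_text = "store\tpercentage\nA\t72.5\nB\t64.0\n"
json_text = '[{"store": "A", "percentage": 72.5}, {"store": "B", "percentage": 64.0}]'

# Comma-separated data: read_csv with its default separator.
df_csv = pd.read_csv(io.StringIO(csv_text))

# Plain text file with a different delimiter: pass sep explicitly.
df_txt = pd.read_csv(io.StringIO(tab_text), sep="\t")

# JSON laid out as a list of records.
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_csv)
print(df_txt)
print(df_json)
```

With real files, a path such as a csv or json file name replaces the StringIO object. read_excel() follows the same pattern but needs an Excel engine such as openpyxl to be installed.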
Reading data from xml file
1. Import a package to read the xml file.
2. Parse or extract the xml file.
3. Assign the column names of the output DataFrame.
4. Use a for loop to extract all the data.
5. Append each observation in the data to 'rows'.
6. Create a DataFrame 'xml_df'.
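The steps above can be sketched as follows. The package is assumed to be the standard-library xml.etree.ElementTree, and the XML content and column names are hypothetical. Only the names 'rows' and 'xml_df' come from the steps.

```python
import xml.etree.ElementTree as ET   # step 1: package to read xml

import pandas as pd

xml_text = """<data>
  <student><name>Asha</name><score>18</score></student>
  <student><name>Ben</name><score>11</score></student>
</data>"""

root = ET.fromstring(xml_text)       # step 2: parse the xml
cols = ["name", "score"]             # step 3: column names of the output
rows = []

for student in root:                 # step 4: loop over every record
    rows.append({                    # step 5: append each observation
        "name": student.find("name").text,
        "score": int(student.find("score").text),
    })

xml_df = pd.DataFrame(rows, columns=cols)   # step 6: build the DataFrame
print(xml_df)
```

For an xml file on disk, ET.parse(path).getroot() would take the place of ET.fromstring.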
Reading data from html file
Use the read_html() function from pandas to read data from an html file.
● Whichever of the previously mentioned ways is used to import the data into Python, the data ends up in a pandas DataFrame.
● Let us now see some operations and manipulations on DataFrames.
Creating DataFrames
● Creating a DataFrame using a single list: first create a list, then convert it into a DataFrame.
● Creating a DataFrame using a list of lists
● Creating a DataFrame from a dictionary of ndarrays
● Creating a DataFrame using arrays, with a list of index labels
● Creating a DataFrame using a list of dictionaries
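Each of these construction routes might look like the sketch below. All names, column labels and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Single list: one column.
values = [10, 20, 30]
df_single = pd.DataFrame(values, columns=["value"])

# List of lists: each inner list is a row.
df_lol = pd.DataFrame([["Asha", 18], ["Ben", 11]], columns=["Name", "Score"])

# Dictionary of ndarrays: each key is a column.
data = {"Name": np.array(["Asha", "Ben"]), "Score": np.array([18, 11])}
df_nd = pd.DataFrame(data)

# Same arrays, with a list of index labels.
df_idx = pd.DataFrame(data, index=["s1", "s2"])

# List of dictionaries: missing keys become NaN.
df_dicts = pd.DataFrame([{"Name": "Asha", "Score": 18}, {"Name": "Ben"}])

print(df_single, df_lol, df_nd, df_idx, df_dicts, sep="\n\n")
```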
Read first five rows of the data
DataFrame.head() displays the first five rows of the data.
Read last five rows of the data
DataFrame.tail() displays the last five rows of the data.
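A quick sketch on a hypothetical ten-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(1, 11)})

first_rows = df.head()   # first five rows by default
last_rows = df.tail()    # last five rows by default

print(first_rows)
print(last_rows)
```

Both methods accept a row count, for example df.head(3).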
Shape of DataFrame
Know more about data
● Check the dimensions of the data
● Check the data types
● Use DataFrame.info() to get information on the shape of the data, the data types, and the non-null counts of each variable
● Here we see that 'df_market' has 3 variables with 25 observations in each
● These are all non-null observations
● There are 2 categorical variables and one numeric variable
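The checks above could be sketched as follows. The original 'df_market' data is not available, so this stand-in with 25 rows, two text columns and one numeric column is entirely hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df_market: 25 observations, 3 variables.
df_market = pd.DataFrame({
    "store": [f"S{i:02d}" for i in range(25)],
    "city": ["Pune", "Delhi", "Mumbai", "Chennai", "Kolkata"] * 5,
    "percentage": [50 + i for i in range(25)],
})

print(df_market.shape)    # dimensions as (rows, columns)
print(df_market.dtypes)   # data type of each column
df_market.info()          # shape, dtypes and non-null counts together
```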
Indexing DataFrames
Dealing with rows and columns
● Indexing is frequently required when working with a DataFrame. It may serve the purpose of cross tables or pivot tables.
● We can use the .iloc[] indexer, the .loc[] indexer, or conditions.
● .iloc[] retrieves rows and columns by integer position, and .loc[] retrieves them by row label (index) and column name.
Example: create a new DataFrame and access the value that is at index 0 in column 'Name'.
Selecting with .iloc[]:
● Select a row by position
● Select the 4th and 6th rows
● Select the first three columns by column number
● Select the first and third columns
Selecting with .loc[]:
● .loc[] selects data by the labels of the rows and columns
● Access the value that is at index 1 in column 'Score'
● Select multiple values by row label and column label
Select two columns from the DataFrame.
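The selections above can be sketched on a small stand-in DataFrame. Only the column names 'Name' and 'Score' come from the text; the other columns and all values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev", "Esha", "Farid"],
    "Score": [12.5, 9.0, 16.5, 14.0, 8.0, 19.0],
    "Attempts": [1, 3, 2, 3, 2, 1],
    "Qualify": ["yes", "no", "yes", "no", "no", "yes"],
})

first_name = df["Name"][0]                     # value at index 0 in 'Name'

row_0 = df.iloc[0]                             # one row by position
rows_4_6 = df.iloc[[3, 5]]                     # 4th and 6th rows (0-based 3 and 5)
first_three_cols = df.iloc[:, 0:3]             # first three columns
first_third_cols = df.iloc[:, [0, 2]]          # first and third columns

score_1 = df.loc[1, "Score"]                   # row label 1, column 'Score'
multi = df.loc[[1, 3], ["Name", "Score"]]      # several row and column labels

two_cols = df[["Name", "Score"]]               # two columns by name

print(first_name, score_1)
print(rows_4_6)
print(multi)
```

Positions in .iloc[] are zero-based, so the 4th and 6th rows are positions 3 and 5.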
Conditional Subsetting
● Subset students who have marks greater than 12.
● Subset students who either have more than two attempts or qualify for the exam.
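Both subsets might be written with boolean conditions, as in this sketch. The DataFrame, its column names ('Score', 'Attempts', 'Qualify') and the "yes"/"no" encoding are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev", "Esha", "Farid"],
    "Score": [12.5, 9.0, 16.5, 14.0, 8.0, 19.0],
    "Attempts": [1, 3, 2, 3, 2, 1],
    "Qualify": ["yes", "no", "yes", "no", "no", "yes"],
})

# Students with marks greater than 12.
high_marks = df[df["Score"] > 12]

# Students with more than two attempts OR who qualify.
# Each condition needs its own parentheses when combined with |.
attempts_or_qualify = df[(df["Attempts"] > 2) | (df["Qualify"] == "yes")]

print(high_marks)
print(attempts_or_qualify)
```

Use & instead of | when both conditions must hold.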
Sorting DataFrames
● Sort the DataFrame based on the values of a column.
● Sort the DataFrame based on the values of a column in descending order.
● Sort the DataFrame based on the values of multiple columns.
● Note that when sorting by multiple columns, sort_values() sorts by the first variable and then uses the next variable to order rows that tie on the first.
● In this case, the function first sorted by the variable 'percentage' and then by the variable 'store'.
● To sort by the index of the DataFrame, use .sort_index().
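A sketch of these sorts on a small stand-in DataFrame. The column names 'percentage' and 'store' follow the text; the values are made up so that two rows tie on 'percentage'.

```python
import pandas as pd

df_market = pd.DataFrame({
    "store": ["B", "A", "C", "A"],
    "percentage": [64.0, 72.5, 64.0, 81.0],
})

by_pct = df_market.sort_values(by="percentage")                        # ascending
by_pct_desc = df_market.sort_values(by="percentage", ascending=False)  # descending
by_both = df_market.sort_values(by=["percentage", "store"])            # ties broken by 'store'
by_index = by_pct_desc.sort_index()                                    # back to index order

print(by_pct)
print(by_pct_desc)
print(by_both)
print(by_index)
```

In by_both, the two rows with percentage 64.0 are ordered by 'store'.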
Ranking DataFrames
● Rank the DataFrame in ascending order.
● Rank the DataFrame in descending order.
● Rank by the minimum value of the rank: rank in descending order of percentage, and if two percentages are the same, assign the minimum rank to both.
● Rank by the maximum value of the rank: rank in descending order of percentage, and if two percentages are the same, assign the maximum rank to both.
● Rank by dense rank: rank in descending order of score, and if two scores are the same, assign the same rank to both. Dense rank does not skip any rank (with min and max, ranks are skipped).
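The ranking variants can be compared side by side in a sketch like the one below. The data is hypothetical, with a deliberate tie at 72.5, and a single 'percentage' column is used for every variant.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev"],
    "percentage": [72.5, 64.0, 72.5, 81.0],
})

pct = df["percentage"]
df["rank_asc"] = pct.rank()                                    # ascending, ties get the average rank
df["rank_desc"] = pct.rank(ascending=False)                    # descending, ties get the average rank
df["rank_min"] = pct.rank(ascending=False, method="min")       # ties share the lowest rank
df["rank_max"] = pct.rank(ascending=False, method="max")       # ties share the highest rank
df["rank_dense"] = pct.rank(ascending=False, method="dense")   # ties share a rank, no gaps

print(df)
```

With the default method, tied values receive the average of the ranks they span, which is why the first two columns contain 2.5.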
Thank You