Python Pandas
iteration
• The behavior of basic iteration over Pandas objects
depends on the type.
• When iterating over a Series, it is regarded as
array-like, and basic iteration produces the values.
• Other data structures, like DataFrame (and Panel, which
was removed in pandas 0.25), follow the dict-like
convention of iterating over the keys of the objects.
• In short, basic iteration (for i in object) produces −
• Series − values
• DataFrame − column labels
• Panel − item labels (Panel was removed in pandas 0.25)
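The difference between the two conventions can be seen in a minimal sketch. The Series and DataFrame below are small made-up objects used only for illustration.

```python
import pandas as pd

# Illustrative objects (not from the examples in this document)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# Iterating a Series yields its values
series_items = [v for v in s]

# Iterating a DataFrame yields its column labels
frame_items = [c for c in df]

print(series_items)
print(frame_items)
```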
Iterating a DataFrame
Iterating a DataFrame gives column names.
import pandas as pd
import numpy as np
N = 20
df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
    'x': np.linspace(0, stop=N-1, num=N),
    'y': np.random.rand(N),
    'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist()})
# Iterating a DataFrame yields its column labels
for col in df:
    print(col)
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud)
for col in df:
    print(col)
Functions
• To iterate over the columns or rows of the DataFrame,
we can use the following functions −
• items() − to iterate over the columns as (key,value)
pairs (named iteritems() before pandas 2.0)
• iterrows() − iterate over the rows as
(index,series) pairs
• itertuples() − iterate over the rows as
namedtuples
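items() (formerly iteritems())
Iterates over each column as a key, value pair with the label
as key and the column values as a Series object. The older name
iteritems() was deprecated in pandas 1.5 and removed in 2.0.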
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud)
# items() replaces iteritems(), which was removed in pandas 2.0
for key,value in df.items():
    print(key,value)
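Since each value is a full Series, column-wise iteration is convenient for per-column summaries. The following is a minimal sketch that collects the maximum of every numeric column; the dictionary name col_max is illustrative, and the numeric check uses pd.api.types.is_numeric_dtype.

```python
import pandas as pd

stud = {'Name': ['P', 'R', 'A', 'J', 'B'],
        'Eng': [67, 76, 75, 88, 92],
        'IP': [99, 99, 98, 97, 98],
        'Maths': [98, 99, 97, 98, 90]}
df = pd.DataFrame(stud)

# Visit each column as a (label, Series) pair and
# record the maximum of the numeric ones
col_max = {}
for label, column in df.items():
    if pd.api.types.is_numeric_dtype(column):
        col_max[label] = column.max()

print(col_max)
```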
iterrows()
iterrows() returns an iterator yielding each index
value along with a Series containing the data in
that row.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud)
for row_index,row in df.iterrows():
    print(row_index,row)
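One consequence of receiving each row as a single Series is that values from columns of different dtypes are converted to a common dtype. The sketch below illustrates this with a small made-up frame; the column names n and x are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical frame: one integer column and one float column
df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# Take the first (index, Series) pair from the iterator
_, first_row = next(df.iterrows())

# The column itself is integer-typed, but the row Series has to
# hold both values, so the integer is upcast to float
print(df['n'].dtype)
print(first_row.dtype)
```

When the original dtypes matter, itertuples() is usually the safer choice, since it does not pack a row into one Series.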
itertuples()
itertuples() method will return an iterator yielding a
named tuple for each row in the DataFrame. The first
element of the tuple will be the row’s corresponding
index value, while the remaining values are the row
values.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud)
for row in df.itertuples():
    print(row)
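Because each row is a namedtuple, the column values can be read as attributes and the row label as row.Index. A minimal sketch that adds up the three marks for each student (the dictionary name totals is illustrative):

```python
import pandas as pd

stud = {'Name': ['P', 'R', 'A', 'J', 'B'],
        'Eng': [67, 76, 75, 88, 92],
        'IP': [99, 99, 98, 97, 98],
        'Maths': [98, 99, 97, 98, 90]}
df = pd.DataFrame(stud)

# Each column is available as an attribute of the row tuple
totals = {}
for row in df.itertuples():
    totals[row.Name] = row.Eng + row.IP + row.Maths

print(totals)
```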
Sorting
• There are two kinds of sorting available in
Pandas. They are −
• By label
• By Actual Value
By Label
• Using the sort_index() method, a DataFrame can be sorted
by its labels; the axis and ascending arguments select which
labels are sorted and in which order.
• By default, sorting is done on row labels in ascending order.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df=df.sort_index()
print(sorted_df)
Order of Sorting
• By passing a Boolean value to the ascending
parameter, the order of the sorting can be
controlled.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df = df.sort_index(ascending=False)
print(sorted_df)
Sort the Columns
• By passing the axis argument with a value of 1,
the sorting is done on the column labels.
• By default, axis=0, which sorts by row labels.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df=df.sort_index(axis=1)
print(sorted_df)
By Value
• Like index sorting, sort_values() is the method for
sorting by values.
• It accepts a 'by' argument naming the column
(or list of columns) of the DataFrame whose
values determine the sort order.
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df = df.sort_values(by='IP')
print(sorted_df)
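The 'by' argument can also take a list of column names, with a matching list of Booleans for ascending. As a sketch on the same data, the following sorts by IP in descending order and breaks ties on Maths in ascending order:

```python
import pandas as pd

stud = {'Name': ['P', 'R', 'A', 'J', 'B'],
        'Eng': [67, 76, 75, 88, 92],
        'IP': [99, 99, 98, 97, 98],
        'Maths': [98, 99, 97, 98, 90]}
df = pd.DataFrame(stud, index=['P', 'R', 'A', 'J', 'B'])

# IP descending first; rows with equal IP are ordered by Maths ascending
sorted_df = df.sort_values(by=['IP', 'Maths'], ascending=[False, True])
print(sorted_df)
```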
Sorting Algorithm
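• sort_values() lets you choose the sorting algorithm
through the kind argument: mergesort, heapsort or
quicksort (the default). Of these three, mergesort is the
only stable algorithm. For a DataFrame, kind is only
applied when sorting by a single column.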
import pandas as pd
import numpy as np
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud,index=['P','R','A','J','B'])
sorted_df = df.sort_values(by='IP',kind='mergesort')
print(sorted_df)
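Stability matters when the sort column contains ties. In this data IP has two rows with 99 and two with 98; a stable sort keeps tied rows in their original relative order. A small sketch that checks this on the resulting index:

```python
import pandas as pd

stud = {'Name': ['P', 'R', 'A', 'J', 'B'],
        'Eng': [67, 76, 75, 88, 92],
        'IP': [99, 99, 98, 97, 98],
        'Maths': [98, 99, 97, 98, 90]}
df = pd.DataFrame(stud, index=['P', 'R', 'A', 'J', 'B'])

sorted_df = df.sort_values(by='IP', kind='mergesort')

# Tied rows keep their original order: 'A' stays ahead of 'B'
# (both 98) and 'P' stays ahead of 'R' (both 99)
order = list(sorted_df.index)
print(order)
```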
head() and tail() functions
• The head() function fetches the first 'n' rows from
a Series or DataFrame. By default it returns the first 5
rows.
• Ex. df.head() or df.head(2)
• The tail() function fetches the last 'n' rows from
a Series or DataFrame. By default it returns the last 5
rows.
• Ex. df.tail() or df.tail(2)
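A short sketch of both calls on the student data used earlier (the variable names first_two and last_two are illustrative):

```python
import pandas as pd

stud = {'Name': ['P', 'R', 'A', 'J', 'B'],
        'Eng': [67, 76, 75, 88, 92],
        'IP': [99, 99, 98, 97, 98],
        'Maths': [98, 99, 97, 98, 90]}
df = pd.DataFrame(stud)

first_two = df.head(2)   # rows for 'P' and 'R'
last_two = df.tail(2)    # rows for 'J' and 'B'

print(first_two)
print(last_two)
```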
Boolean Indexing in Pandas
• In boolean indexing, we will select subsets of
data based on the actual values of the data in
the DataFrame and not on their row/column
labels or integer locations. In boolean
indexing, we use a boolean vector to filter the
data.
Boolean Indexing in DataFrame
• Boolean indexing is a type of indexing which uses
actual values of the data in the DataFrame. In
boolean indexing, we can filter data in four
ways –
• Accessing a DataFrame with a boolean index
• Applying a boolean mask to a dataframe
• Masking data based on column value
• Masking data based on index value
Accessing a DataFrame with a boolean index :
In order to access a dataframe with a boolean
index, we have to create a dataframe whose
index contains boolean values, that is,
“True” or “False”.
Example
# importing pandas as pd
import pandas as pd
# dictionary of lists
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud, index = [True, False, True, False,True])
print(df)
In order to access a dataframe with a boolean index using .loc[], we simply pass a
boolean value (True or False) to the .loc[] indexer, which returns the rows labelled with that value.
# importing pandas as pd
import pandas as pd
# dictionary of lists
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud, index = [True, False, True, False,True])
# accessing a dataframe using .loc[] function
print(df.loc[True])
Applying a boolean mask to a dataframe :
We can apply a boolean mask to a dataframe
using the [] accessor (which calls __getitem__).
The mask is a list of True and False values of the
same length as the number of rows in the
dataframe.
Applying the mask returns only the rows for
which the mask value is True.
Example
# importing pandas as pd
import pandas as pd
# dictionary of lists
stud = {'Name':['P','R','A','J','B'],
'Eng':[67,76,75,88,92],
'IP':[99,99,98,97,98],
'Maths':[98,99,97,98,90]}
df = pd.DataFrame(stud, index = [0, 1, 2, 3,4])
print(df[[True, False, True, False,True]])
Masking data based on column value
In a dataframe we can filter data based on a
column value. To do so, we apply a condition to
a column using a comparison operator like
==, >, <, <=, >=.
Applying such an operator to a column
produces a Series of True and False values.
Example
# importing pandas as pd
import pandas as pd
# dictionary of lists
# named data rather than dict, to avoid shadowing the built-in
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["BCA", "BCA", "M.Tech", "BCA"],
'score':[90, 40, 80, 98]}
# creating a dataframe
df = pd.DataFrame(data)
# using a comparison operator for filtering of data
print(df['degree'] == 'BCA')
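The Series of True and False values becomes useful once it is passed back to the dataframe, which keeps only the rows marked True. A minimal sketch follows, including a combined condition with &; the variable names bca and bca_high are illustrative.

```python
import pandas as pd

data = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["BCA", "BCA", "M.Tech", "BCA"],
        'score': [90, 40, 80, 98]}
df = pd.DataFrame(data)

# Keep only the rows where the mask is True
bca = df[df['degree'] == 'BCA']

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
bca_high = df[(df['degree'] == 'BCA') & (df['score'] > 50)]

print(bca)
print(bca_high)
```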
Masking data based on index value :
In a dataframe we can filter data based on the
index values. To do so, we create a mask from
the index using an operator like ==, >, <, etc.
Example
# importing pandas as pd
import pandas as pd
# dictionary of lists
# named data rather than dict, to avoid shadowing the built-in
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["BCA", "BCA", "M.Tech", "BCA"],
'score':[90, 40, 80, 98]}
df = pd.DataFrame(data, index = [0, 1, 2, 3])
mask = df.index == 0
print(df[mask])
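Other comparisons on the index work the same way. As a sketch, the following selects rows whose index is greater than 1, and rows whose index appears in a given list using Index.isin(); the variable names later_rows and picked are illustrative.

```python
import pandas as pd

data = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["BCA", "BCA", "M.Tech", "BCA"],
        'score': [90, 40, 80, 98]}
df = pd.DataFrame(data, index=[0, 1, 2, 3])

# Rows whose index label is greater than 1
later_rows = df[df.index > 1]

# Rows whose index label is one of the listed values
picked = df[df.index.isin([0, 3])]

print(later_rows)
print(picked)
```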