
Ln 1 - Data Handling
using Pandas –I

Big Picture
• Introduction to Python libraries- Pandas, Matplotlib.
• Data structures in Pandas - Series and Data Frames.
• Series: Creation of Series from – ndarray, dictionary, scalar value;
mathematical operations; Head and Tail functions; Selection, Indexing and
Slicing.
• Data Frames:
• Text/CSV files
• Operations on rows and columns: add, select, delete, rename;
• Head and Tail functions;
• Indexing using Labels, Boolean Indexing;
• Importing/Exporting Data between CSV files and Data Frames.

PRETEST
1. In lists, you can change the elements of a list in place.
(True/False)
2. The _______ brackets are used to enclose the values of a list.
3. l1= list(‘ClassXI’) returns :
4. The position of each element in the list is considered as
___________.
5. The property which changes the element of a list in place
but not changes the memory address is known as
__________.

Computer Science has been a field of continuous evolution and regular
advancements in terms of software efficiency, programming
methodologies and applications.
With the advent of data science and data analytics, it has become
easier and more efficient to handle very large volumes of data.
Data science is a large field covering data collection,
cleaning, standardization, analysis, visualization and reporting.
INTRODUCTION

DATA PROCESSING
Data processing is an important part of analyzing the data because the
data is not always available in the desired format.
Various processing steps, such as cleaning, restructuring or merging,
are required before the data can be analyzed.
NumPy, SciPy, Cython and Pandas are tools available in Python which
can be used for fast processing of data.

DATA LIFE CYCLE
1. Data warehouse-
Data is stored in different formats- .csv file, an excel file, html file etc.
This data is converted into a single format and stored in a data warehouse.
It is a repository that collects data from various data sources of an organization and
arranges it into a structured format.
2. Data Analysis -
After storing data, we can perform analysis on it, i.e., join and merge data, search for
data etc.
Data Analysis is the process of bringing order and structure to collected data which is
then processed to information.
3. Data Visualization-
After analysis we can plot this data in the form of a graph.
Data visualization is the process of putting data into a chart, graph, or other visual
format.
All these operations can be easily and effectively done by Python and its libraries.

DATA LIFE CYCLE
[Diagram: several DATA sources feeding Data warehousing, Data Analysis and Data Visualization]

A Python library is a collection of functions and methods which can be
used to perform common tasks without writing the code yourself.
Pandas builds on two core Python libraries: matplotlib for
data visualization and NumPy for mathematical operations.
Pandas acts as a wrapper over these libraries, allowing you to
access many of matplotlib's and NumPy's methods with less code.
PYTHON LIBRARIES

Pandas is a high-performance open source library for data
analysis in Python, developed by Wes McKinney in 2008.
The name Pandas is derived from "panel data", an
econometrics term for multidimensional data sets.
It makes importing and analyzing data easier.
It is one of the most popular Python packages for data science and offers
powerful and flexible data structures that make data analysis and
manipulation easy.
PYTHON PANDAS INTRODUCTION

Pandas builds on packages like NumPy and matplotlib to give us a
single, convenient place for data analysis and visualization work.
Its key data structure is called the DataFrame.
Python with Pandas is used in a wide range of academic and
commercial domains, including finance, economics,
statistics and analytics.
PYTHON PANDAS

Fast and efficient DataFrame object with default and customized indexing.
Selecting particular rows and columns from data sets
Arranging data in ascending or descending order
Flexible reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.
Summarising data by classification variable
Merging and concatenating two data sets
Key Features of Pandas

Right click command prompt  Run as Administrator
Click on YES on the USER ACCESS Window to open administrator
window
Make sure of the file path before you install with pip
Change your path to the folder python 3.6
Move to installation scripts folder
When you explore the folder you will see a file pip.exe
Type pip install pandas
Note-
• A package contains all the files you need for a module.
• Modules are Python code libraries you can include in your project.
• pip is the standard package manager for Python. It allows you to install
and manage additional packages that are not part of the Python standard
library.
Installing Pandas

Testing Pandas at Command Prompt
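One quick way to confirm the installation is sketched below. It assumes Python and pandas are reachable from the prompt. The variable name `version` is only illustrative, and the version string printed depends on the machine.

```python
# Check that pandas can be imported and report which version is installed.
import pandas as pd

version = pd.__version__
print("pandas version:", version)
```

If the import fails with ModuleNotFoundError, pandas is not installed for the Python interpreter being used.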

Pandas Datatypes :
Pandas dtype | Python type | NumPy type | Usage
object | str | string_, unicode_ | Text
int64 | int | int, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers
float64 | float | float, float16, float32, float64 | Floating point numbers
bool | bool | bool | True / False
datetime64 | NA | datetime64[ns] | Date & Time values

Pandas Data structures :
A data structure is a collection of data values and operations
that can be applied to that data
Pandas deals with the following three data structures −
• Series : It is a one-dimensional structure storing
homogeneous data.
• DataFrame : It is a two-dimensional structure storing
heterogeneous data.
• Panel: It is a three dimensional way of storing items (deprecated, and removed from pandas in version 0.25).
These data structures are built on top of Numpy array, which
means they are fast.

Series
The Series is the primary building block of Pandas.
It is a one-dimensional labelled array capable of holding homogeneous data of any
type (integer, string, float, etc.).
For example, the following series is a collection of integers 10, 23, 56,
…
The Series data values are mutable (can be changed) but the size of
Series data is immutable.

Series
It contains a sequence of values and an associated sequence of data
labels called its index.
It can also be described as an ordered dictionary with mapping of
index values to data values.
Index Data
0 22
1 -14
2 52
3 100
Index Data
Jan 31
Feb 28
Mar 31
Apr 20
Index Data
‘Sun’ 1
‘Mon’ 2
‘Tue’ 3
‘Wed’ 4

Creation of Series
A Series in Pandas can be created using the Series() constructor.
Any list or dictionary data can be converted into a series this way.
The constructor has the following form −
pandas.Series( data, index, dtype, copy)
A series can be created using various input data like −
• Array
• Dict
• Scalar value or constant
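The three kinds of input listed above can be sketched side by side. The variable names (`from_array`, `from_dict`, `from_scalar`) and the sample values are illustrative only and are not part of the lesson.

```python
import numpy as np
import pandas as pd

# From an ndarray: a default index 0..n-1 is generated.
from_array = pd.Series(np.array([10, 20, 30]))

# From a dict: the keys become the index labels.
from_dict = pd.Series({"x": 1, "y": 2})

# From a scalar: the value is repeated for every index label.
from_scalar = pd.Series(7, index=["a", "b", "c"])

print(from_array)
print(from_dict)
print(from_scalar)
```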

The most basic series that can be created is an Empty Series.
Example - [Here ‘s’ is the Series Object]
import pandas as pd
s = pd.Series()
print(s)
Its output is as follows −
Series([], dtype: float64)
Note –
• Series() with no data displays an empty list along with its default data type
(float64 in older pandas versions, object in recent ones).
• pd is an alias given to the Pandas module. Its significance is that we
can use ‘pd’ instead of typing pandas every time we need to use it.
• The import statement loads the Pandas module into memory so that it
can be used.
Creation of Empty Series

Creating DataSeries with a list
Syntax:
<Series Object>=pandas.Series([data],index=[index])
Eg:-
import pandas as pd
s=pd.Series( [ 2,4,6,8,10])
print(s)
• s is a Series variable
• Series() creates a Series from the list; printing it displays the values
along with the default data type
• pd is the alias given to the pandas module
• The import statement loads the pandas module into memory so that it
can be used

Program- DataSeries
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
>>> s
Output:
a 3
b -5
c 7
d 4
dtype: int64

>>> st = pd.Series([20, 70, 10], index=['frog', 'fish', 'hawk'])
>>> st
frog 20
fish 70
hawk 10
dtype: int64
>>> st.index.name = 'Animals'
>>> st
Animals
frog 20
fish 70
hawk 10
dtype: int64
Program

Activity
• Create a series having names of any five famous
monuments of India and assign their States as
index values.

Think and Reflect
• While importing Pandas, is it mandatory to
always use pd as an alias name? What would
happen if we give any other name?
• Try it and write your explanation in the
notebook.

Program
import pandas as pd
Months=['Jan','Feb','Mar','Apr','June','July']
S=pd.Series(Months)
>>> S
0 Jan
1 Feb
2 Mar
3 Apr
4 June
5 July
dtype: object

Accessing Series index and values
#Index and values are attributes of Series.
>>> Months=['Jan','Feb','Mar','Apr','June', 'July']
>>> Months
['Jan', 'Feb', 'Mar', 'Apr', 'June', 'July']
>>> a=pd.Series(Months)
>>> a.index
RangeIndex(start=0, stop=6, step=1)
>>> a.values
array(['Jan', 'Feb', 'Mar', 'Apr', 'June', 'July'], dtype=object)
>>> a.values.tolist()
['Jan', 'Feb', 'Mar', 'Apr', 'June', 'July']

Program
import pandas as ps
games_list = ['Cricket', 'Volleyball', 'Judo', 'Hockey']
abc= ps.Series(games_list)
print(abc)
OUTPUT
0 Cricket
1 Volleyball
2 Judo
3 Hockey
dtype: object

Starter
• Create a list of 7 emirates and create a series from
that with index values showing it from 1 to 7 .

Think ?
• Is it possible to create a series from dictionary and
how?
• What will be the index value of that series ?

Creation of series from Dictionary
• Dictionary keys can be used to construct an index for a
Series.

Attributes of Series
If N is a series object,
• N.index will display the index of the series
• N.values will display the values of the series
• N.axes will display the range of index
• N.size will display the length of the series
The arrow on the image displays “axis 0” and its direction for the Series object.
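The attributes listed above can be tried on a small Series. This is a minimal sketch; the colour values and the labels 'a', 'b', 'c' are made up for illustration.

```python
import pandas as pd

N = pd.Series(["Red", "Green", "Yellow"], index=["a", "b", "c"])

print(N.index)   # the index labels
print(N.values)  # the underlying array of values
print(N.axes)    # a list holding the single row-axis index
print(N.size)    # number of elements
```

Note that the attribute names are written in lower case; Python is case-sensitive, so N.Index or N.Values would raise an AttributeError.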

In Python, one-dimensional structures are displayed as a row of
values. On the contrary, here we see that Series is displayed as a
column of values.
Each cell in Series is accessible via index value along the “axis 0”. For
our Series object indexes are: 0, 1, 2, 3, 4. Here is an example of
accessing different values:
import pandas as pd
N=pd.Series(['Red', 'Green', 'Yellow', 'Orange', 'Blue'])
print(N[0])
print (N.axes)
Red
[RangeIndex(start=0, stop=5, step=1)]
Axis in Series

ACCESSING ROWS USING HEAD () AND TAIL() FUNCTION
Series.head() function will display the top 5 rows in the series.
Series.tail() function will display the last 5 rows in the series
In both the functions, if a number is passed as parameter Pandas will
print the specified number of rows.
Eg:-
>>> a=pd.Series([2,4,6,8,10,12,14,16])
>>> a.head()
0 2
1 4
2 6
3 8
4 10
dtype: int64

To print only the first 3 rows,
>>> a.head(3)
To print the last 5 rows,
>>> a.tail()
To print the last 3 rows,
>>> a.tail(3)
Activity: Create a series of 6 countries with their capitals as index, and perform
the operations above on it.

Vector operations in Series
• Series support vector operations.
• Any operation gets performed on every single element.
Eg:-
import pandas as pd
List = [5, 2, 3,7]
s1= pd.Series (List)
Guess the output of these statements:
print (List *2)
print (s1*2)
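The difference between the two statements can be checked with a short sketch. It uses the name `values` for the list (an illustrative choice, avoiding confusion with Python's built-in `list`); `repeated` and `doubled` are likewise illustrative names.

```python
import pandas as pd

values = [5, 2, 3, 7]
s1 = pd.Series(values)

repeated = values * 2   # list repetition: the list is concatenated with itself
doubled = s1 * 2        # element-wise: every value is multiplied by 2

print(repeated)
print(doubled)
```

The list grows to eight elements, while the Series keeps its four elements and each one is doubled.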

Binary operations in Series
We can perform binary operations on series, such as addition,
subtraction and many others.
Binary operations can be written with the arithmetic operators
(+, -, *, /) or with the equivalent methods such as .add() and .sub().
The two series are aligned on their index labels. Any label for which
one or the other does not have an entry is marked by NaN, or
“Not a Number”, which is how Pandas marks missing data.

Binary operations in Series
>>> import numpy as np
>>> s = pd.Series([1,2,3,4,np.nan,5,np.nan])
>>> s
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
5 5.0
6 NaN
dtype: float64

Write a Pandas program to add, subtract, multiply and divide
two Pandas Series.
Program
import pandas as pd
ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 9])
ds = ds1 + ds2
print("Sum of Series:\n", ds)
ds = ds1 - ds2
print("Subtraction of Series:\n", ds)
ds = ds1 * ds2
print("Product of two Series:\n", ds)
ds = ds1 / ds2
print("Quotient of the Series:\n", ds)

# importing pandas module
import pandas as pd
# creating a series
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
# creating a series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
# add two series using .add() function.
data.add(data1)
Program
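The program above adds two series whose indexes only partly overlap. The sketch below shows what happens to the unmatched labels. The `fill_value` argument is a real parameter of Series.add(), but it is not used in the lesson, and the names `plain_sum` and `filled_sum` are illustrative.

```python
import pandas as pd

data = pd.Series([5, 2, 3, 7], index=["a", "b", "c", "d"])
data1 = pd.Series([1, 6, 4, 9], index=["a", "b", "d", "e"])

# Labels present in only one Series ('c' and 'e') produce NaN.
plain_sum = data.add(data1)

# fill_value substitutes 0 for the missing side before adding.
filled_sum = data.add(data1, fill_value=0)

print(plain_sum)
print(filled_sum)
```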

Write a Pandas program to compare the elements of the two
Pandas Series.
Program
import pandas as pd
ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 10])
print("Compare the elements of the said Series:")
print("Equals:")
print(ds1 == ds2)
print("Greater than:")
print(ds1 > ds2)
print("Less than:")
print(ds1 < ds2)

Program – To sort values
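>>> abc=pd.Series(['M','A','N','G','O','E','S'],index=[10,20,30,
40,50,60,70])
>>> abc.sort_values() # returns a new Series sorted by value; abc itself is unchanged
20 A
60 E
40 G
10 M
30 N
50 O
70 S
dtype: object
>>> abc.sort_index() # returns a new Series sorted by index label
10 M
20 A
30 N
40 G
50 O
60 E
70 S
dtype: object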

Create series from ndarray
• An array of values can be passed to a Series.
• If data is an ndarray, index must be the same
length as data.
• If no index is passed, one will be created having
values [0, ..., len(data) - 1].

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print (s)
0 a
1 b
2 c
3 d
dtype: object
Note- We did not pass any index, so by default, it assigned the indexes ranging
from 0 to len(data)-1, i.e., 0 to 3.

import pandas as pd
import numpy as np
abc = np.array(['a','b','c','d'])
s = pd.Series(abc , index=[100,101,102,103])
print (s)
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed
values in the output.

# To add 5 marks to each student in the series
#creating a series from array and specified index
import pandas as pd
import numpy as np
Marks=np.array([455,478,477,405])
M1=pd.Series(Marks, index=["Annie", "Resmi", "Sana", "Haya"])
print(M1)
for i, j in M1.items(): # i - index , j - values
    M1.at[i] = j+5 # increase each value
print (M1)
#at - Access a single value for a row/column label pair.
Program – Mathematical operations
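Because Series support vector operations, the same result can also be obtained without a loop. This is a minimal alternative sketch, not the method used on the slide; `M2` is an illustrative name.

```python
import numpy as np
import pandas as pd

marks = np.array([455, 478, 477, 405])
M1 = pd.Series(marks, index=["Annie", "Resmi", "Sana", "Haya"])

# Adding a scalar applies to every element, so no explicit loop is needed.
M2 = M1 + 5
print(M2)
```

Here M1 is left unchanged and the increased marks are held in a new Series.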

import pandas as pd
import numpy as np
a=np.random.randn(5)
>>> a
array([-0.63206378, -0.19692941, 0.3883878 , 0.35998536, 0.1873882 ])
>>> b=pd.Series(a)
>>> b
0 -0.632064
1 -0.196929
2 0.388388
3 0.359985
4 0.187388
dtype: float64
numpy.random.randn()
Returns an array of the given shape, filled with random floating-point
samples drawn from the standard normal distribution.
Program – random.randn

• A dictionary can be passed as input to a Series.
• Dictionary keys are used to construct index.
d = {'a': 1, 'b': 0, 'c': 2}
a=pd.Series(d)
print(a)
Output-
a 1
b 0
c 2
dtype: int64
Create a Series from dictionary

>>> d1 = {'a': 100, 'b': 200, 'c':300, 'd':400, 'e':800}
>>> d2=pd.Series(d1)
>>> d2
a 100
b 200
c 300
d 400
e 800
dtype: int64

>>> d3=pd.Series(d1,index=[20,30,40,50,60])
>>> d3
20 NaN
30 NaN
40 NaN
50 NaN
60 NaN
dtype: float64
>>> d4=pd.Series(d1,index=['b','a','c','e','d'])
>>> d4
b 200
a 100
c 300
e 800
d 400
dtype: int64

import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print (s)
a 0.0
b 1.0
c 2.0
dtype: float64
Observe − Dictionary keys are used to construct index.

Programs-
Write a Python program to convert a dictionary to a Pandas series.
The dictionary named Students must contain-
Key : Name, RollNo, Class ,Marks , Grade
Value : Your name, rollNo, class,marks and grade
Students={'Name':'ABC','RollNo':80978,'Class':'XII','Marks':87
,'Grade':'A'}
>>> s=pd.Series(Students)
>>> s
Name ABC
RollNo 80978
Class XII
Marks 87
Grade A
dtype: object

Traversing the dictionary:
Python dictionaries are composed of key-value pairs, so in each loop,
there are two elements we need to access (the key and the value).
To loop over both keys and the corresponding values for each key-value
pair we need to call the .items() method.
The Series.items() function iterates over the (index, value) pairs of
the given series object.
The .items() method of a dictionary likewise generates a key and a value
for each iteration.
import pandas as pd
Students={'Name':'ABC','RollNo':80978,'Class':'XII','Marks':87,'Grade':'A'}
s=pd.Series(Students)
for i,j in Students.items():
    print(i+ " : " +str(j))

>>> pers = {'color': 'blue', 'fruit': 'apple', 'pet': 'dog'}
>>> p = pers.items()
>>> p # Here p is a view of the dictionary's items
dict_items([('color', 'blue'), ('fruit', 'apple'), ('pet', 'dog')])
>>> for item in pers.items():
...     print(item)
('color', 'blue')
('fruit', 'apple')
('pet', 'dog')
Traversing a dictionary

for key, value in pers.items():
    print(key, '->', value)
color -> blue
fruit -> apple
pet -> dog
ab ={"brand": "Ford", "model": "Mustang", "year": 1964}
for x, y in ab.items():
    print(x, y)
brand Ford
model Mustang
year 1964
Traversing a dictionary

Eg. Consider the series created with names of students as index
and Marks as data using dictionary
import pandas as pd
d1={"Raj":234,"Gilbert":345}
m1=pd.Series(d1)
print(m1)
for i,j in m1.items():
    m1.at[i]=j+5
print(m1)
Mathematical operations on Series

When a scalar is passed, all the elements of the series are
initialized to the same value.
The value will be repeated to match the length of index.
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
s
0 5
1 5
2 5
3 5
dtype: int64
Create a Series from Scalar

Create a series with scalar value 7 and index as ‘A’,’B’,’C’,’D’
s = pd.Series(7, index=['A','B','C','D'])
>>> s
A 7
B 7
C 7
D 7
dtype: int64
Create a Series from Scalar

Create a Series from a scalar string value
>>> ab = pd.Series('Welcome to India', index=['A','B','C','D'])
>>> ab
A Welcome to India
B Welcome to India
C Welcome to India
D Welcome to India
dtype: object

Accessing Elements of a Series
(A)Indexing
Indexes are of two types: positional index and labelled
index. Positional index takes an integer value that
corresponds to its position in the series starting from 0,
whereas labelled index takes any user-defined label as
index

Positional Index
• The following example shows the use of a positional index
for accessing a value from a Series:
the value 30 is displayed for the positional index 2.
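The example the slide refers to was an image and is not reproduced in the text. A minimal sketch consistent with the description (value 30 at position 2) could look like this; the name `series_num` and the other two values are assumptions.

```python
import pandas as pd

series_num = pd.Series([10, 20, 30])

# With the default index 0, 1, 2 the label and the position coincide.
print(series_num[2])
# .iloc always selects by position, whatever the index labels are.
print(series_num.iloc[2])
```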

• More than one element of a series can be accessed using a
list of positional integers or a list of index labels as shown in
the following examples:
>>> seriesCapCntry = pd.Series(['NewDelhi', 'WashingtonDC',
'London', 'Paris'], index=['India', 'USA', 'UK', 'France'])
>>> seriesCapCntry.iloc[[3,2]] # positional access; plain [[3,2]] is deprecated on a labelled index
France Paris
UK London
dtype: object

>>> seriesCapCntry[['UK','USA']]
UK London
USA WashingtonDC
dtype: object

Labelled Index
• In the following example, the value 3 is displayed for the labelled index Mar.
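The slide's own example was an image and is missing from the text. A sketch matching the description (value 3 for the label Mar) might be the following; the name `series_months` and the neighbouring labels and values are assumptions.

```python
import pandas as pd

series_months = pd.Series([2, 3, 4], index=["Feb", "Mar", "Apr"])

print(series_months["Mar"])      # access by label
print(series_months.loc["Mar"])  # .loc makes the label-based access explicit
```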

ACTIVITY
• Write the statement to get NewDelhi as output using
positional index.

Indexing and slicing in Series
• In a series we can access any position values based on the
index number.
• Slicing is used to retrieve subsets of data by position.
• A slice object is built using a syntax of start:end:step, the
segments representing the first item, last item, and the
increment between each item that you would like as the step.

Accessing Data from Series with indexing and slicing
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
>>> s.iloc[0] # positional access; plain s[0] is deprecated on a labelled index
1
>>> s[:3]
a 1
b 2
c 3
dtype: int64
>>> s[-3:]
c 3
d 4
e 5
dtype: int64

>>> fruits = ['apples', 'oranges', 'cherries', 'pears']
>>> S = pd.Series([20, 33, 52, 10], index=fruits)
>>> S
apples 20
oranges 33
cherries 52
pears 10
dtype: int64
>>> S['apples']
20
>>> S.iloc[0] # positional access; plain S[0] is deprecated on a labelled index
20

Find out the following-
>>> num=[000,100,200,300,400,500,600,700,800,900]
>>> idx=['A','B','C','D','E','F','G','H','I','J']
>>> AB=pd.Series(num,index=idx)
AB
AB[2:4]
AB[1:6:2]
AB[ :6]
AB[4:]
AB[:4:2] # positions 0 and 2: values 0, 200
AB[4::2] # positions 4, 6 and 8: values 400, 600, 800
AB[::-1]

Create a series using 2 different lists
>>> import pandas as pd
>>> m=['jan','feb']
>>> n=[23,34]
>>> s=pd.Series(m,index=n)
>>> s
23 jan
34 feb
dtype: object

Printing the slices with the values of the label index
>>> M = pd.Series([400,500,345,450],index=['Amit','Raj','Kris','Shon'])
>>> M
Amit 400
Raj 500
Kris 345
Shon 450
dtype: int64
>>> M['Kris']
345
M[['Raj','Kris','Shon']]
Raj 500
Kris 345
Shon 450
dtype: int64
M['Raj':'Shon']
Raj 500
Kris 345
Shon 450
dtype: int64

Displaying the data using Boolean indexing
# Eg. To select marks more than 400
>>> M = pd.Series([400,500,345,450],index=['Amit','Raj','Kris','Shon'])
>>> M
Amit 400
Raj 500
Kris 345
Shon 450
dtype: int64
>>> M>400
Amit False
Raj True
Kris False
Shon True
dtype: bool
>>> M[M>400] #Will display the names of students who got marks >400
Raj 500
Shon 450
dtype: int64

Using range() to specify index in series
>>> S=pd.Series(5,index=range(4))
>>> S
0 5
1 5
2 5
3 5
dtype: int64
>>> S=pd.Series([1,2,3,4],index=range(4))
>>> S
0 1
1 2
2 3
3 4
dtype: int64

Using range() to specify index in series –for loop
>>> S=pd.Series(range(1,15,3),index=[x for x in 'abcde'])
>>> S
a 1
b 4
c 7
d 10
e 13
dtype: int64
>>> S=pd.Series([1,2,3,4.0],index=range(4))
>>> S
0 1.0
1 2.0
2 3.0
3 4.0
dtype: float64

NaN - Creating a series using missing values
In certain situations, we need to create a series object whose size is defined but
in which some elements are missing. This is handled with NaN (Not a Number)
values, which the NumPy library provides as np.nan.
A missing value is defined by writing np.nan (or None) in place of the data.

NaN - Creating a series using missing values
import pandas as pd
import numpy as np
data = pd.Series([1, np.nan, 2, None, 3],
index=list('abcde'))
>>> data
a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64
>>> d3=pd.Series(d1,index=[20,30,40,50,60])
>>> d3
20 NaN
30 NaN
40 NaN
50 NaN
60 NaN
dtype: float64
>>> s = pd.Series(np.nan, index=[49,3, 4, 5])
>>> s
49 NaN
3 NaN
4 NaN
5 NaN
dtype: float64

Program
Write a python program to create a series of odd numbers.
odd=pd.Series(range(1, 10, 2))
>>> odd
0 1
1 3
2 5
3 7
4 9
dtype: int64

Program
Create a series with names of any 7 colours :
• Display the first element
• Display the third element
• Display the first 3 elements (Using Slicing)
• Display the element starting from 2nd till 3rd (Using
Slicing)
• Display last 2 elements (Using Slicing)

Handling floating point values to generate a series
import pandas as pd
ab=pd.Series([2,4,6,7.5])
ab
0 2.0
1 4.0
2 6.0
3 7.5
dtype: float64
Since 7.5 is a float value, the integer values are converted to float,
so the result is a float series.

Indexing and accessing can also be done using iloc and loc.
iloc - It is used for indexing or selecting based on position, i.e.,
by row number and column number. It refers to position-
based indexing.
Syntax is-
<object>.iloc[<row number range>, <col number range>]
loc - It is used to index or select based on name, i.e., by row
name and column name. It refers to name-based indexing.
Syntax is-
<object>.loc[<list of row names>, <list of col names>]
So, we can filter the data using loc in Pandas even
if the indices in our dataset are not integers.
Note- By default, index is assigned from 0 to len-1.
iloc and loc

import pandas as pd
a=pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
>>> a.iloc[1:4] # Displays data by position (end position excluded)
b 2
c 3
d 4
dtype: int64
>>> a.loc['b':'e'] # Displays data by label (end label included)
b 2
c 3
d 4
e 5
dtype: int64
loc and iloc

>>> s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> s.iloc[:3] # slice the first three rows
49 NaN
48 NaN
47 NaN
>>> s.loc[:3] # slice up to and including label 3
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
loc and iloc

loc vs. iloc in Pandas
loc
• Purely label-location based indexer for selection by label.
• It is primarily label based, but may also be used with a
boolean array.
• Allowed inputs are:
  - A single label, e.g. 5 or 'a'.
  - A list or array of labels, e.g. ['a', 'b', 'c'].
  - A slice object with labels, e.g. 'a':'f' (note that contrary to usual
    python slices, both the start and the stop are included!).
  - A boolean array.
  - A callable function with one argument (the calling Series or
    DataFrame) that returns valid output for indexing (one of
    the above).
• Note : .loc will raise a KeyError when the items are not
found

iloc-
• .iloc is primarily integer position based (from 0 to length-
1 of the axis), but may also be used with a boolean array.
• .iloc will raise IndexError if a requested indexer is out-of-
bounds, except slice indexers which allow out-of-bounds
indexing.
• Allowed inputs are:
  - An integer, e.g. 5
  - A list or array of integers, e.g. [4, 3, 0]
  - A slice object with ints, e.g. 1:7
  - A boolean array
  - A callable function with one argument
loc vs. iloc in Pandas
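The error behaviour described for loc and iloc can be observed with a small sketch. The Series and the names `loc_error`, `iloc_error` and `tail_slice` are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# .loc with a label that does not exist raises KeyError.
try:
    s.loc["z"]
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# .iloc with an out-of-range integer raises IndexError ...
try:
    s.iloc[10]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

# ... but an out-of-range slice is simply cut off at the end of the data.
tail_slice = s.iloc[1:10]

print(loc_error, iloc_error)
print(tail_slice)
```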

It is a two-dimensional data structure, just like any table (with
rows & columns).
Basic Features of DataFrame
• Columns may be of different types
• Size can be changed (Mutable)
• Labelled axes (rows / columns)
• Can perform arithmetic operations on rows and columns
Create DataFrame
It can be created with the following-
Lists , dict , Series , Numpy arrays , Another DataFrame
Dataframes

Structure of a Dataframe
Pandas DataFrame consists of three principal components,
the data, rows, and columns.
You can think of it as an SQL table or a spreadsheet data representation.

Dataframe Creation
Dataframes can be created using constructor in pandas.
Syntax: pd.DataFrame( data, index, columns, dtype, copy)
Sr.No Parameter & Description
1 data - data takes various forms like ndarray, series, map, lists, dict,
constants and also another DataFrame.
2 index - Row labels for the resulting frame. Optional; defaults to
np.arange(n) if no index is passed.
3 columns - Column labels for the resulting frame. Optional; defaults to
np.arange(n) if no column labels are passed.
4 dtype - Data type of each column.
5 copy - Whether to copy the input data. Older pandas versions default to
False; check the documentation of the installed version.

Creating an empty Dataframe
A basic DataFrame, which can be created is an Empty Dataframe.
>>> d=pd.DataFrame()
>>> d
Empty DataFrame
Columns: []
Index: []

Series vs Dataframe
A Series is essentially a column, and a DataFrame is a two-dimensional
table made up of a collection of Series.

Creating a Dataframe from lists with values only
The DataFrame can be created using a single list or a list of lists.
CREATING A DATAFRAME FROM SINGLE LIST
Example1:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)

Creating a Dataframe from lists of lists (multidimensional list)
CREATE A DATAFRAME FROM A LIST OF LISTS
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13

import pandas as pd
>>> a=[12,13,14,15]
>>> b=[20,30,40,50]
>>> c=pd.DataFrame(a,index=b,columns=['Numbers'],dtype='float')
>>> c
Numbers
20 12.0
30 13.0
40 14.0
50 15.0

Example
>>> data = [[0, 1, 2],[3, 4, 5]]
>>> df = pd.DataFrame(data)
>>> df
0 1 2
0 0 1 2
1 3 4 5

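Using multi-dimensional list with column names and dtype
specified.
import pandas as pd
lst = [['tom', 'reacher', 25], ['krish', 'pete', 30],
['nick', 'wilson', 26], ['juli', 'williams', 22]]
df = pd.DataFrame(lst, columns =['FName', 'LName', 'Age'])
# dtype=float on the constructor fails on the text columns in recent
# pandas versions, so only the numeric column is converted here
df = df.astype({'Age': float})
df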

Program
Display the following details in a dataframe.
Name Marks Index
Vijaya 80 B1
Rahul 92 A2
Meghna 67 C
Radhika 95 A1
Shaurya 97 A1

Displaying index and col
>>> df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], index=['row1',
'row2'],columns=['col1', 'col2', 'col3'])
>>> df
col1 col2 col3
row1 0 1 2
row2 3 4 5
>>> print(df.index)
Index(['row1', 'row2'], dtype='object')
>>> print(df.columns)
Index(['col1', 'col2', 'col3'], dtype='object')

Creating DataFrames from Series
• DataFrames are a 2 dimensional representation of Series.
• When we represent 2 or more series in the form of rows and columns,
it becomes a dataframe.
• Let's create 2 series and pass them into a dataframe.
>>> p={'one':pd.Series([1,2,3], index=['a','b','c']), 'two':pd.Series
([11,22,33,44], index=['a','b','c','d'])}
>>> q=pd.DataFrame(p)
>>> q
one two
a 1.0 11
b 2.0 22
c 3.0 33
d NaN 44

>>> p=pd.Series([10,20,30],index=['a','b','c'])
>>> q=pd.Series([40,50,60],index=['a','b','c'])
>>> r=pd.DataFrame({'Set1':p , 'Set2':q})
>>> r
Set1 Set2
a 10 40
b 20 50
c 30 60

# To create dataframe from 2 series of student data
import pandas as pd
stud_marks=pd.Series([89,94,93,83,89],index=['Anuj','Deepak','Sohail','Tresa','Hima'])
stud_age=pd.Series([18,17,19,16,18],index=['Anuj','Deepak','Sohail','Tresa','Hima'])
>>> stud=pd.DataFrame({'Marks':stud_marks,'Age':stud_age})
>>> stud
Marks Age
Anuj 89 18
Deepak 94 17
Sohail 93 19
Tresa 83 16
Hima 89 18

Sorting data in DataFrames
We can sort the data inside a dataframe using sort_values().
Here 2 arguments are passed: the sorting field and the order of sorting (ascending
or descending).
The by keyword defines the name of the field or column on which
the data is to be sorted; ascending=False sorts in descending order.
>>> stud.sort_values(by=['Marks'])
Marks Age
Tresa 83 16
Anuj 89 18
Hima 89 18
Sohail 93 19
Deepak 94 17
>>> stud.sort_values(by=['Marks'],ascending=False)
Marks Age
Deepak 94 17
Sohail 93 19
Anuj 89 18
Hima 89 18
Tresa 83 16

Creating DataFrame from Dictionary (Dictionary of Lists)
• A dictionary of lists can be passed as input data to create a dataframe.
• The dictionary keys are, by default, taken as column names.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print (df)
Name Age
0 Tom 28
1 Jack 34
2 Steve 29
3 Ricky 42

Program
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'Price': [22000,25000,27000,35000] }
df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
print (df)
Brand Price
0 Honda Civic 22000
1 Toyota Corolla 25000
2 Ford Focus 27000
3 Audi A4 35000
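df = pd.DataFrame(cars, columns = ['Brand', 'Price'], index = ['car1', 'car2', 'car3', 'car4'])
print (df)
Brand Price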
car1 Honda Civic 22000
car2 Toyota Corolla 25000
car3 Ford Focus 27000
car4 Audi A4 35000

import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank 1', 'rank 2', 'rank 3', 'rank 4'])
print(df)
Output
Name Age
rank 1 Tom 28
rank 2 Jack 34
rank 3 Steve 29
rank 4 Ricky 42
Program - Create an indexed DataFrame

Create a program that shows the month and number of days in a
month.
Day Month
0 31 Jan
1 30 Apr
2 31 Mar
3 30 June
Program

DataFrame.set_index (<ColumnName>, inplace=True)
– This method selects the column specified as the row index
DataFrame.reset_index(inplace=True)
- The method will reset the row index to the default index as
0,1,2,3… etc.
Setting a column of dataframe as row index & resetting to
default row index

Suppose we want to make one of the columns as row index:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df.set_index('Name',inplace=True)
print (df)
Age
Name
Tom 28
Jack 34
Steve 29
Ricky 42
Example
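The example above covers set_index() only. A minimal sketch of the round trip, reusing the same data, shows reset_index() restoring the default index.

```python
import pandas as pd

data = {"Name": ["Tom", "Jack", "Steve", "Ricky"], "Age": [28, 34, 29, 42]}
df = pd.DataFrame(data)

df.set_index("Name", inplace=True)   # 'Name' becomes the row index
print(df)

df.reset_index(inplace=True)         # 'Name' returns as a column; index is 0..3 again
print(df)
```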

CREATE A DATAFRAME FROM DICTIONARY OF LIST TO DISPLAY THE
FOLLOWING OUTPUT
Program
Events Ruby Emerald Sapphire
Cat_1 Skipping 30 20 20
Cat_2 BasketBall 40 30 20
Cat_3 Running 40 20 30

# Create a DataFrame from List of Dictionaries
import pandas as pd
data1 = [{'x': 1, 'y': 2},{'x': 5, 'y': 4, 'z':5}]
df1 =pd.DataFrame(data1)
x y z
0 1 2 NaN
1 5 4 5.0
Note − Observe, NaN (Not a Number) is appended in missing
areas.
Program

Create a DataFrame with a list of dictionaries, row indices, and
column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b', 'c'])
>>> df1
a b c
first 1 2 NaN
second 5 10 20.0
Program

 Create one series from a list containing authors' names and another from a list
containing the number of articles written.
 Create a dataframe from the two series using a dictionary with the keys
“Author” and “Article”.
 The following output must be obtained:
Program
import pandas as pd
a=["Jitender","Purnima","Arpit","Jyoti"]
b=[210,211,114,178]
s = pd.Series(a)
s1= pd.Series(b)
df=pd.DataFrame({"Author":s,"Article":s1})
df

• To retrieve a range of records (rows) from a dataframe, we can use the slice
operation.
• Slicing displays the records that fall within the defined range.
import pandas as pd
student={'Name':['Rinku','Ritu','Ajay','Pankaj','Aditya'], 'English':[84,56,89,
78,36], 'Economics':[96,56,89,45,95], 'IP':[83,85,88,92,97], 'Accounts':
[77,75,63,89,85]}
>>> df=pd.DataFrame(student)
>>> df
Name English Economics IP Accounts
0 Rinku 84 96 83 77
1 Ritu 56 56 85 75
2 Ajay 89 89 88 63
3 Pankaj 78 45 92 89
4 Aditya 36 95 97 85
Selecting & Accessing from DataFrame

df[1:4] # Rows at positions 1 to 3 are displayed (the stop value 4 is excluded)
Name English Economics IP Accounts
1 Ritu 56 56 85 75
2 Ajay 89 89 88 63
3 Pankaj 78 45 92 89
Note- A single row cannot be accessed as df[1]; that looks for a column labelled 1.
Use a one-row slice such as df[1:2] instead.
To display a whole column,
>>> df['Name']
To display more than one column,
>>> df[['Name','IP']]
>>> df['Name'][0:3]
0 Rinku
1 Ritu
2 Ajay
Name: Name, dtype: object
Selecting & Accessing from DataFrame
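The selections above can be tried together in one short script. The sketch below reuses the student data from the slide; the variable names one_row, two_cols and first_names are illustrative only.

```python
import pandas as pd

student = {'Name': ['Rinku', 'Ritu', 'Ajay', 'Pankaj', 'Aditya'],
           'English': [84, 56, 89, 78, 36],
           'Economics': [96, 56, 89, 45, 95],
           'IP': [83, 85, 88, 92, 97],
           'Accounts': [77, 75, 63, 89, 85]}
df = pd.DataFrame(student)

# A one-row slice returns a DataFrame holding just the row at position 2.
one_row = df[2:3]
print(one_row)

# A list of column names returns a DataFrame with only those columns.
two_cols = df[['Name', 'IP']]
print(two_cols)

# Select a column first, then slice its rows.
first_names = df['Name'][0:3]
print(first_names.tolist())
```

A row slice always returns a dataframe, even when it holds only one row, whereas selecting a single column returns a series.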

• Pandas lets us rename any column of a dataframe.
• To rename a single column-
df.rename(columns={'Name':'Emp_Name'}, inplace=True)
• Consider a list of ages of students-
a1=[20,30,25,26,15]
Create a dataframe from a1 and name its column ‘Age’
>>> a1=[20,30,25,26,15]
>>> a1
[20, 30, 25, 26, 15]
Renaming column in DataFrame
>>> df=pd.DataFrame(a1)
>>> df
0
0 20
1 30
2 25
3 26
4 15
>>> df.columns=['Age']
>>> df
Age
0 20
1 30
2 25
3 26
4 15

• To add a new column to an existing dataframe, the syntax is-
dfobject['colname'] = new_value
>>> df['Age2']=45 # the entire column is filled up with 45
>>> df
Age Age2
0 20 45
1 30 45
2 25 45
3 26 45
4 15 45
Adding column to a DataFrame
df['Age3']=pd.Series([42,35,44,50,60])
df
Age Age2 Age3
0 20 45 42
1 30 45 35
2 25 45 44
3 26 45 50
4 15 45 60
df['Total']=df['Age']+df['Age2']+df['Age3']
df
Age Age2 Age3 Total
0 20 45 42 107
1 30 45 35 110
2 25 45 44 114
3 26 45 50 121
4 15 45 60 120

• We can update a column's values by using arithmetic operators.
• We can also copy the values of one column into another with the assignment
operator.
• To increase Total by 10 for all students and copy it into a new column Updated_Age,
>>> df['Total']=df['Total']+10
>>> df
Age Age2 Age3 Total
0 20 45 42 117
1 30 45 35 120
2 25 45 44 124
3 26 45 50 131
4 15 45 60 130
>>> df['Updated_Age']=df['Total']
>>> df
Age Age2 Age3 Total Updated_Age
0 20 45 42 117 117
1 30 45 35 120 120
2 25 45 44 124 124
3 26 45 50 131 131
4 15 45 60 130 130
Adding column to a DataFrame

1. Create a dataframe from the dictionary of list.
Name Height Qualification
0 Jai 5.1 Msc
1 Princi 6.2 MA
2 Gaurav 5.1 Msc
3 Anuj 5.2 Msc
2. Add a column address to the dataframe with values:
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
Sample Question-

>>> import pandas as pd
>>> data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],'Height': [5.1, 6.2, 5.1, 5.2],'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
>>> df=pd.DataFrame(data)
>>> df['address']=['Delhi', 'Bangalore', 'Chennai', 'Patna']
>>> df
Name Height Qualification address
0 Jai 5.1 Msc Delhi
1 Princi 6.2 MA Bangalore
2 Gaurav 5.1 Msc Chennai
3 Anuj 5.2 Msc Patna
Sample Question-

Based on the given table, create a dataframe
from a dictionary of lists and perform the following:
• Rename the column ‘Marks’ to ‘Eng_Marks’
• Add another column ‘IP_Marks’ with a series of values
(56,78,89,77,99)
• Create a column ‘TotalMarks’ which stores the total of Eng & IP
marks
• Display the dataframe with all the columns
Sample Question-
Name Subject Marks
0 Rahul Math 75
1 Sahil Science 80
2 Muskan Computer 69
3 Aryan SST 94
4 Vansh English 79

• DataFrame.index- The index (row labels) of the DataFrame. If the row
index has default values, a RangeIndex such as RangeIndex(start=0, stop=4, step=1) is
displayed.
• DataFrame.columns- Returns the column labels as an Index object, shown with
its dtype.
• DataFrame.dtypes- Returns a Series giving the data type of each column in the
DataFrame.
• DataFrame.size- Returns an int representing the number of elements
(rows × columns) in the DataFrame.
• DataFrame.shape- Returns a tuple representing the dimensionality of
the DataFrame, i.e., the number of rows and columns.
Properties of DataFrame-

>>>df
Name Height Qualification address
0 Jai 5.1 Msc Delhi
1 Princi 6.2 MA Bangalore
2 Gaurav 5.1 Msc Chennai
3 Anuj 5.2 Msc Patna
>>> df.size
16
>>> df.shape
(4, 4)
>>> df.dtypes
Name object
Height float64
Qualification object
address object
dtype: object
Properties of DataFrame-

What will the following attributes return:
a) df.columns
b) df.index
c) df.shape
d) df.size
Sample Question-

import pandas as pd
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011',
'13/2/11'], 'Event':['Music', 'Poetry', 'Theatre', 'Comedy'], 'Cost':
[10000, 5000, 15000, 2000]})
print (df)
print(df.index)
print(df.columns)
print(df.dtypes)
print(df.size)
print(df.shape)
Sample Question-

• The method of selecting / accessing a column of a dataframe is
similar to slicing a series.
• Pandas provides 3 methods to access dataframe column(s)
 Using square brackets with the name of the
column passed as a string value, like
df_object['column_name']
 Using the dot notation df_object.column_name
 Using numeric indexing and the iloc attribute, like
df_object.iloc[:,<column_number>]
• Here, i stands for integer: iloc selects rows and columns by
their integer positions, not by their labels.
SELECTING A COLUMN FROM A DATAFRAME

• Example-
df['Total'] and df.Total will give the same output.
SELECTING A COLUMN FROM A DATAFRAME
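The three ways of selecting a column can be compared side by side. The sketch below assumes a small two-column dataframe made up for illustration; the variable names are not from the slides.

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 30, 25], 'Total': [107, 110, 114]})

by_brackets = df['Total']     # square brackets with the column name
by_dot = df.Total             # dot notation
by_position = df.iloc[:, 1]   # all rows, column at integer position 1

# All three return the same Series.
print(by_brackets.equals(by_dot))
print(by_brackets.equals(by_position))
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing attribute such as size or shape, so square brackets are the safer general choice.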

Consider the dataframe as shown:
• A DataFrame has two ordered axes.
• One goes across the top, the other goes down the left side.
 The index value: This is the label you see when you display a DataFrame
(the bold values on the vertical and horizontal axes).
 The index position: This is not displayed and simply represents the
ordering of the rows or columns.
USING iLOC TO RETRIEVE COLUMNS

Vertical Index Values: [0, 1, 2, 3, 4]
Vertical Index Positions: [0, 1, 2, 3, 4]
Horizontal Index Values: ['fruit_name', 'price', 'color', 'sweetness']
Horizontal Index Positions: [0, 1, 2, 3]

 iloc allows us to index a DataFrame in the same way that we
can index a list; based on index position.
 The difference is that a DataFrame has a two-dimensional
index, so we need to pass in slicers for the rows first and
then for the columns.
 There are 4 possible types of slicers we can use on the
table given:
• Scalar positions (eg:- 0,3,4)
• Range of positions (eg:- 0:1, 1:4)
• All positions (:)
• List of positions (eg:- [0,3] , [1,4])

If we want to select the data in row 2 and column 0 (i.e., row
index 2 and column index 0) we’ll use the following code:
df.iloc[2,0]
USING iloc- Integer locate

Example - USING iloc- Integer locate
>>> df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011',
'13/2/11'], 'Event':['Music', 'Poetry', 'Theatre', 'Comedy'], 'Cost':[10000, 5000,
15000, 2000]})
>>> df
Date Event Cost
0 10/2/2011 Music 10000
1 11/2/2011 Poetry 5000
2 12/2/2011 Theatre 15000
3 13/2/11 Comedy 2000
>>> df.iloc[1,1]
'Poetry'
>>> df.iloc[-1,0]
'13/2/11'
>>> df.iloc[2,2]
15000

>>> df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'], 'Cost':[10000, 5000, 15000, 2000]})
>>> df
Date Event Cost
0 10/2/2011 Music 10000
1 11/2/2011 Poetry 5000
2 12/2/2011 Theatre 15000
3 13/2/11 Comedy 2000
>>> df.iloc[2:4,0:3]
Date Event Cost
2 12/2/2011 Theatre 15000
3 13/2/11 Comedy 2000

• When we “slice” our data, we take multiple rows or multiple columns
• Keep in mind that the row number specified by the stop index value
is not included.

>>> df
Date Event Cost
0 10/2/2011 Music 10000
1 11/2/2011 Poetry 5000
2 12/2/2011 Theatre 15000
3 13/2/11 Comedy 2000
>>> df.iloc[:,:]
Date Event Cost
0 10/2/2011 Music 10000
1 11/2/2011 Poetry 5000
2 12/2/2011 Theatre 15000
3 13/2/11 Comedy 2000
>>> df.iloc[:,0:2]
Date Event
0 10/2/2011 Music
1 11/2/2011 Poetry
2 12/2/2011 Theatre
3 13/2/11 Comedy
>>> df.iloc[0:2,:]
Date Event Cost
0 10/2/2011 Music 10000
1 11/2/2011 Poetry 5000

>>> df
Date Event Cost
0 10/2/2011 Music 10000
1 11/2/2011 Poetry 5000
2 12/2/2011 Theatre 15000
3 13/2/11 Comedy 2000
>>> df.iloc[[0,3],[0,1]]
Date Event
0 10/2/2011 Music
3 13/2/11 Comedy
>>> df.iloc[[0,3],[0,2]]
Date Cost
0 10/2/2011 10000
3 13/2/11 2000

To display only columns Date and Cost,
>>> df[['Date','Cost']]
Date Cost
0 10/2/2011 10000
1 11/2/2011 5000
2 12/2/2011 15000
3 13/2/11 2000
>>> df.iloc[:,[0,2]]
Date Cost
0 10/2/2011 10000
1 11/2/2011 5000
2 12/2/2011 15000
3 13/2/11 2000

Program
Write a code to retrieve the column and rows highlighted
in the table.

DELETING A COLUMN OR ROW FROM A DATAFRAME
• Using del keyword
• Using pop method
• Using drop method

DELETING A COLUMN FROM A DATAFRAME
• Using del keyword – [ONLY FOR COLUMNS, 1 column at a time]
del df['<column name>']
This deletes the column in place and prints nothing, so we have to display the
dataframe to see the change.
>>> del df['Date']
>>> df
Event Cost
0 Music 10000
1 Poetry 5000
2 Theatre 15000
3 Comedy 2000

DELETING A COLUMN FROM A DATAFRAME
• Using pop method –
df.pop('<Column Name>')-
It deletes the column from the dataframe and returns the removed column as a Series.
>>> df.pop('Cost')
0 10000
1 5000
2 15000
3 2000
Name: Cost, dtype: int64
>>> df
Event
0 Music
1 Poetry
2 Theatre
3 Comedy

DELETING A ROW OR COLUMN FROM A DATAFRAME
• Using drop method – drop(labels, axis=0)
It returns a new dataframe with the given rows or columns removed. axis=1 means column
and axis=0 means row. By default it is 0.
To remove any row,
>>> df.drop([0]) OR >>> df.drop([0],axis=0)
Date Event Cost
1 11/2/2011 Poetry 5000
2 12/2/2011 Theatre 15000
3 13/2/11 Comedy 2000
To remove any column,
>>> df.drop(['Date'],axis=1)
Event Cost
0 Music 10000
1 Poetry 5000
2 Theatre 15000
3 Comedy 2000
To remove a row or column permanently from your dataframe
you will need to provide one more parameter,
inplace=True.

DELETING A ROW OR COLUMN FROM A DATAFRAME
• To delete multiple columns :
df.drop(['Column1', 'Column2'], axis=1, inplace=True)
OR
df.drop(columns=['Column1', 'Column2'], inplace=True)
To drop rows :
df.drop(['row1', 'row2'], axis=0, inplace=True)
OR
df.drop(index=['row1', 'row2'], inplace=True)
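The difference between drop() with and without inplace=True can be sketched as follows. The example uses a shortened version of the Date/Event/Cost data; the name trimmed is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011'],
                   'Event': ['Music', 'Poetry'],
                   'Cost': [10000, 5000]})

# Without inplace=True, drop() returns a new dataframe and leaves df untouched.
trimmed = df.drop(columns=['Date'])
print(df.columns.tolist())
print(trimmed.columns.tolist())

# With inplace=True, df itself is changed and nothing is returned.
df.drop(index=[0], inplace=True)
print(df.index.tolist())
```

Assigning the result back, as in df = df.drop(columns=['Date']), has the same effect as inplace=True.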

DELETING A COLUMN - Practical Implementation
• Create a simple dataframe with a dictionary of lists, and column
names: name, year, orders, town.
• Remove the column orders from the dataframe using del df[]
• Remove the column ‘name’ using df.pop( )
• Remove the column town using df.drop ()

Accessing elements using loc
loc –
The loc[] indexer is used to access a group of rows and columns
by label(s). Unlike iloc, a loc slice includes its stop label.

>>>df = pd.DataFrame({"A":[12, 4, 5, None, 1],"B":[7, 2, 54, 3, None],
"C":[20, 16, 11, 3, 8], "D":[14, 3, None, 2, 6]})
>>> df.iloc[0,2]
20
>>> df.loc[0,'B']
7.0
>>> df.iloc[0:2,0:2]
A B
0 12.0 7.0
1 4.0 2.0
>>> df.loc[0:2,"A":"C"]
A B C
0 12.0 7.0 20
1 4.0 2.0 16
2 5.0 54.0 11

>>> df.iloc[:,0:2]
A B
0 12.0 7.0
1 4.0 2.0
2 5.0 54.0
3 NaN 3.0
4 1.0 NaN
>>> df.loc[:,"A":"C"]
A B C
0 12.0 7.0 20
1 4.0 2.0 16
2 5.0 54.0 11
3 NaN 3.0 3
4 1.0 NaN 8
>>> df.iloc[[1,3],[2,1]]
C B
1 16 2.0
3 3 3.0
>>> df.loc[[1,3],["A","C"]]
A C
1 4.0 16
3 NaN 3
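The outputs above show that the same-looking slice returns different amounts of data with iloc and loc. A minimal sketch using the same A to D dataframe makes the rule explicit; the names by_position and by_label are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [7, 2, 54, 3, None],
                   "C": [20, 16, 11, 3, 8],
                   "D": [14, 3, None, 2, 6]})

# iloc slices by position and excludes the stop position.
by_position = df.iloc[0:2, 0:2]
print(by_position.shape)

# loc slices by label and includes the stop label.
by_label = df.loc[0:2, "A":"C"]
print(by_label.shape)
```

Here the row labels happen to equal the row positions, which is why the two indexers are easy to confuse; with a custom index such as names, only loc accepts the labels.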

Head and Tail in DataFrame
The method head() gives the first 5 rows by default and tail() gives the last 5.
import pandas as pd
emp={'id':[100,101,102,103,105,106,107],
'name':['Raj','Sini','Flora','Leena','Priya','Denny','Kevin'],
'Sal':[12000,5000,2200,3200,23000,8700,15000]}
df=pd.DataFrame(emp)
print(df)
print(df.head())
print(df.tail())
print(df.head(2))
print(df.tail(3))
id name Sal
0 100 Raj 12000
1 101 Sini 5000
2 102 Flora 2200
3 103 Leena 3200
4 105 Priya 23000
5 106 Denny 8700
6 107 Kevin 15000
id name Sal
0 100 Raj 12000
1 101 Sini 5000
2 102 Flora 2200
3 103 Leena 3200
4 105 Priya 23000
id name Sal
2 102 Flora 2200
3 103 Leena 3200
4 105 Priya 23000
5 106 Denny 8700
6 107 Kevin 15000
id name Sal
0 100 Raj 12000
1 101 Sini 5000
id name Sal
4 105 Priya 23000
5 106 Denny 8700
6 107 Kevin 15000

Transpose
T:- Transposes the dataframe (rows are converted into columns & columns into
rows).
>>> x
month sales1 sales2
0 jan 5 3
1 feb 7 5
2 mar 6 8
>>> x.T
0 1 2
month jan feb mar
sales1 5 7 6
sales2 3 5 8

reindex
reindex() rearranges the rows to follow the given order of index labels.
>>> x=pd.DataFrame({'month':['jan','feb','mar'], 'sales1':[5,7,6],'sales2':[3,5,8]})
>>> x
month sales1 sales2
0 jan 5 3
1 feb 7 5
2 mar 6 8
>>> y=x.reindex([2,1,0])
>>> y
month sales1 sales2
2 mar 6 8
1 feb 7 5
0 jan 5 3
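reindex() can also be given a label that does not exist in the original index. The sketch below reuses the month/sales dataframe; label 3 is deliberately absent, so pandas fills that row with NaN.

```python
import pandas as pd

x = pd.DataFrame({'month': ['jan', 'feb', 'mar'],
                  'sales1': [5, 7, 6],
                  'sales2': [3, 5, 8]})

# Rows come back in the requested order; label 3 has no data, so it is all NaN.
y = x.reindex([2, 0, 3])
print(y)

# reindex() returns a new dataframe; x keeps its original order.
print(x.index.tolist())
```

Because NaN is a float, the numeric columns of y are converted to float when a missing row is introduced.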

Binary operations
Pandas provides the methods add(), sub(), mul(), div() for carrying out binary
operations on dataframes.
Since each of these operations acts upon 2 dataframes (two operands), they are called
binary (‘bi’ means ‘two’).
>>> S1=pd.DataFrame({'UT-1':[23,20,21,19,25],'UT-2':[20,23,12,16,23]})
>>> S2=pd.DataFrame({'UT-1':[13,21,22,10,21],'UT-2':[24,23,11,12,24]})
>>> S1.add(S2)
UT-1 UT-2
0 36 44
1 41 46
2 43 23
3 29 28
4 46 47

Binary operations
>>> S1.div(S2)
UT-1 UT-2
0 1.769231 0.833333
1 0.952381 1.000000
2 0.954545 1.090909
3 1.900000 1.333333
4 1.190476 0.958333
The reverse versions radd() and rsub() are also available; they swap the operands, e.g. S1.rsub(S2) computes S2 - S1.
>>> S1.sub(S2)
UT-1 UT-2
0 10 -4
1 -1 0
2 -1 1
3 9 4
4 4 -1
>>> S1.mul(S2)
UT-1 UT-2
0 299 480
1 420 529
2 462 132
3 190 192
4 525 552
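The binary methods give the same results as the arithmetic operators, and the reverse methods simply swap the operands. A small sketch, using only the first two rows of S1 and S2 to keep it short:

```python
import pandas as pd

S1 = pd.DataFrame({'UT-1': [23, 20], 'UT-2': [20, 23]})
S2 = pd.DataFrame({'UT-1': [13, 21], 'UT-2': [24, 23]})

# add() is the method form of the + operator.
print(S1.add(S2).equals(S1 + S2))

# rsub() reverses the operands: S1.rsub(S2) is S2 - S1.
reverse_diff = S1.rsub(S2)
print(reverse_diff)
print(reverse_diff.equals(S2 - S1))
```

The method forms are mainly useful when extra options are needed, which the plain operators cannot accept.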

1.Write the purpose of the following statement:
mtns_df.set_index('name', inplace=True)
2. Write the output of the statement:
a. mtns.loc[:, 'summited']
b. mtns.loc['K2', :]
c. mtns.loc['K2', 'summited']
d. mtns.loc[['K2', 'Lhotse'], :]
e. mtns.loc[:, 'height': 'summited']
f. mtns.loc[mtns.loc[:, 'summited'] > 1954, :]
g. mtns.iloc[0, :]
h. mtns.iloc[:, 2]
i. mtns.iloc[0, 2]
j. mtns.iloc[[1, 3], :]
k. mtns.iloc[:, 0:2]

Accessing a DataFrame with a boolean index
• Pandas DataFrames also support Boolean indexing, so we can search our
data directly based on True or False index values.
• We use loc[ ] for this purpose.
• In order to access a dataframe with a boolean index, we have to create a
dataframe whose index contains boolean values, that is,
True or False.
import pandas as pd
data= {'name':["Mohak", "Freya", "Roshni"], 'degree': ["MBA", "BCA", "M.Tech"],
'score':[90, 40, 80]}
df= pd.DataFrame(data, index = [True, False, True])
print(df.loc[True])

Accessing a DataFrame with a boolean index
import pandas as pd
data1={ 'rollno' : [101,102,103,104],
'name' : ['ram','mohan','sohan','rohan'] }
student1 = pd.DataFrame(data1,
index = [True, False, True, False],
columns=['rollno' , 'name']
)
print(student1)
Output rollno name
True 101 ram
False 102 mohan
True 103 sohan
False 104 rohan
print(student1.loc[True] )
Output rollno name
True 101 ram
True 103 sohan
-----------------------
print(student1.loc[False] )
Output rollno name
False 102 mohan
False 104 rohan
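A closely related and very common form of Boolean indexing builds the True/False values from a condition on a column instead of storing them in the index. The sketch below assumes the name/degree/score data from the earlier slide and a hypothetical pass mark of 50.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Mohak', 'Freya', 'Roshni'],
                   'degree': ['MBA', 'BCA', 'M.Tech'],
                   'score': [90, 40, 80]})

# The comparison produces one True/False value per row.
mask = df['score'] > 50
print(mask.tolist())

# loc[] keeps only the rows where the mask is True.
passed = df.loc[mask]
print(passed)
```

The default index 0, 1, 2 is kept in the result, so the surviving rows can still be traced back to the original dataframe.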

Iteration on rows and columns
• If we want to access a dataframe one row or one column at a time,
iteration is used.
• Pandas provides 2 functions to perform iterations-
1. iterrows()
2. items() (called iteritems() before pandas 2.0, where the old name was removed)

iterrows
• It is used to access the data row wise.
import pandas as pd
ab= [{'Name':'Arya','Age':20},{'Name':'Shane','Age':19}]
df=pd.DataFrame(ab)
for (i, j) in df.iterrows():
    print(j)
Name Arya
Age 20
Name: 0, dtype: object
Name Shane
Age 19
Name: 1, dtype: object

items (formerly iteritems)
• It is used to access the data column wise.
import pandas as pd
ab= [{'Name':'Arya','Age':20},{'Name':'Shane','Age':19}]
df=pd.DataFrame(ab)
for (i, j) in df.items():
    print(j)
0 Arya
1 Shane
Name: Name, dtype: object
0 20
1 19
Name: Age, dtype: int64

Basic functions
>>> x=pd.DataFrame({ 'month':
['jan','feb', 'mar'], 'sales1':[5,7,6],
'sales2':[3,5,8]})
>>> x
month sales1 sales2
0 jan 5 3
1 feb 7 5
2 mar 6 8
>>> x.count()
month 3
sales1 3
sales2 3
dtype: int64
>>> x.max()
month mar
sales1 7
sales2 8
dtype: object
>>> x.min()
month feb
sales1 5
sales2 3
dtype: object
>>> x.sum()
month janfebmar
sales1 18
sales2 16
dtype: object

Basic functions
Using the functions row and column wise-
>>> x.sum(axis=0)
month janfebmar
sales1 18
sales2 16
dtype: object
>>> x.sum(axis=1, numeric_only=True) # the text column 'month' is skipped
0 8
1 12
2 14
dtype: int64

To fill NaN with desired data in a particular column
import pandas as pd
import numpy as np
data1={'rollno' : [101, 102, 103, 104],
'name' : ['ram','mohan','sohan',
np.nan]}
student1 = pd.DataFrame(data1,
columns=['rollno' , 'name'] )
print(student1)
O/p-
rollno name
0 101 ram
1 102 mohan
2 103 sohan
3 104 NaN
>>> student1['name'] = student1['name'].fillna('rohit')
>>> student1
rollno name
0 101 ram
1 102 mohan
2 103 sohan
3 104 rohit
Calling fillna() on the whole dataframe fills NaN in every column.
Applied to the original student1 (with the NaN still present):
student1.fillna(999, inplace = True)
print(student1)
rollno name
0 101 ram
1 102 mohan
2 103 sohan
3 104 999

Adding a new row using - append() method
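import pandas as pd
data1={'rollno' : [101,102],
'name' : ['ram','mohan']}
student1 = pd.DataFrame(data1)
print(student1)
rollno name
0 101 ram
1 102 mohan
#to add a new row to an existing DataFrame
#(append() works only in pandas versions before 2.0, where it was removed;
#newer versions use pd.concat() instead)
student1 = student1.append({ 'rollno' : 103, 'name': 'sohan' }, ignore_index=True)
print(student1)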
rollno name
0 101 ram
1 102 mohan
2 103 sohan
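On pandas 2.0 and later, where append() is no longer available, the same row can be added with pd.concat(). This is a minimal sketch of that replacement, not the method shown on the original slide; the name new_row is illustrative.

```python
import pandas as pd

student1 = pd.DataFrame({'rollno': [101, 102], 'name': ['ram', 'mohan']})

# Wrap the new record in a one-row dataframe, then join it below the old rows.
new_row = pd.DataFrame([{'rollno': 103, 'name': 'sohan'}])
student1 = pd.concat([student1, new_row], ignore_index=True)
print(student1)
```

ignore_index=True renumbers the rows 0, 1, 2 so that the new row does not repeat the label 0.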

Handling missing values (NaN) – dropping Using dropna() method
>>> import pandas as pd
>>> import numpy as np
>>> data1={'rollno' : [101, 102, 103, 104],'name' : ['ram','mohan','sohan', np.nan]}
>>> student1 = pd.DataFrame(data1)
>>> print(student1)
rollno name
0 101 ram
1 102 mohan
2 103 sohan
3 104 NaN
#to drop all rows that contain NaN (the default, axis=0)
student1.dropna(inplace = True)
student1
rollno name
0 101 ram
1 102 mohan
2 103 sohan
#to drop all columns that contain NaN, use axis=1
student1.dropna(axis=1, inplace = True)
print(student1)
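Before dropping anything, it often helps to count the missing values and to compare what each axis would remove. The sketch below works on copies rather than using inplace=True, so the original dataframe stays available; the names rows_kept and cols_kept are illustrative.

```python
import numpy as np
import pandas as pd

student1 = pd.DataFrame({'rollno': [101, 102, 103, 104],
                         'name': ['ram', 'mohan', 'sohan', np.nan]})

# Count the NaN values in each column.
print(student1.isna().sum())

# Default (axis=0): rows containing NaN are removed.
rows_kept = student1.dropna()
print(rows_kept.shape)

# axis=1: columns containing NaN are removed.
cols_kept = student1.dropna(axis=1)
print(cols_kept.columns.tolist())
```

Dropping by column removes the whole name column because of a single missing value, so dropping by row usually loses less data.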

To check if zero exists
data1 = {'rollno' : [101, 102, 103, 104],'name' :
['ram', 'mohan', 'sohan', 'rohan']}
student = pd.DataFrame(data1,
columns=['rollno','name'])
print(student)
rollno name
0 101 ram
1 102 mohan
2 103 sohan
3 104 rohan
>>> student.all()
rollno True
name True
dtype: bool
>>> student.all(axis=1)
0 True
1 True
2 True
3 True
dtype: bool
>>> data1 = {'rollno' : [0, 102, 103, 104],'name' :
['ram', 0, 'sohan', 'rohan']}
student = pd.DataFrame(data1,
columns=['rollno','name'])
print(student)
rollno name
0 0 ram
1 102 0
2 103 sohan
3 104 rohan
>>> student.all()
rollno False
name False
dtype: bool
>>> student.all(axis=1)
0 False
1 False
2 True
3 True
dtype: bool
all() returns whether all elements are True over the requested axis.
A value of zero counts as False.

Sorting data in DataFrames
sort_values() - Seen earlier
sort_index() - To sort by index
>>> student.sort_index()
rollno name
0 10.0 ram
1 NaN 110
2 103.0 sohan
3 104.0 rohan
>>> student.sort_index(ascending=False)
rollno name
3 104.0 rohan
2 103.0 sohan
1 NaN 110
0 10.0 ram
>>> student.sort_index(axis=1)
name rollno
0 ram 10.0
1 110 NaN
2 sohan 103.0
3 rohan 104.0

Create DataFrame from csv
 CSV (Comma Separated Values) is a simple file format used to
store tabular data, such as a spreadsheet or database.
 A CSV file stores tabular data (numbers and text) in plain text.
 Each line of the file is a data record.
 Each record consists of one or more fields, separated by
commas.
 The use of the comma as a field separator is the source of the
name for this file format.

 For working with CSV files in Python, there is an in-built
module called csv.
 Files of this format are generally used to exchange data,
usually when there is a large amount, between different
applications.

Advantages of CSV format
• A simple and compact format for data storage.
• A common format for data interchange.
• It can be opened in popular spreadsheet packages like MS
Excel, Open Office-Calc, etc.
• Nearly all spreadsheets and databases support import/export
to CSV format.

 A CSV is a text file, so it can be created and edited using any
text editor.
 A file is to be created and saved in the same folder where our
programs are saved.
 To create a DataFrame from the file we need to first import
data from the csv file.
 pd.read_csv( ) is the method used to read a csv file; give the full path
if the file is in another location.

Using MS excel
 Let us create a CSV file using Microsoft Excel on the basis of
“Employee” table.

Using MS excel
1. Launch Microsoft Excel.
2. Type the data given in the above Table in the Excel sheet .
You will also notice that some cell values are missing to represent missing
values (NaN) in Pandas dataframe.

Using MS excel
3. Save the file with a proper name by clicking File -> Save or Save As or
press Ctrl + S to open the Save As window .
4. Type the name of the file as Employee and select file type as CSV
(Comma delimited) (*.csv) from the drop-down arrow.
5. Click on Save button. Excel will ask for confirmation to select CSV format.
6. Click on OK.

Using MS excel
• Lastly, click on Yes to retain and save the Excel file in CSV format.
• To view this CSV file, open any Text Editor (Notepad preferably) and
explore the folder containing Employee.csv file.
• If you open the file in a Notepad editor, you will observe that each
column is separated by a comma (,) delimiter and each new line
indicates a new row/record.

Open csv file using Pandas DataFrame
After creating a simple “Employee” CSV file, it can be read using read_csv()
function in Pandas once you know the path of your file.
The read_csv method loads the data into a Pandas dataframe ‘df’.
pd.read_csv("path") fetches the data from the csv file; printing the result displays all records
at the command prompt.
Syntax for read_csv() method is:
import pandas as pd
<df>=pd.read_csv(<FilePath>)

Creating a csv from .txt file
 Create a text file with comma separated values.
 First entry being ‘the names of columns’
 Example:
#Creating a dataframe from a text file
import pandas as pd
df=pd.read_csv("sample.txt")
print(df)
print (df.columns)
If the file was saved along with its index values, an Unnamed: 0 column gets displayed
automatically. To avoid this column, use the argument index_col=0
with the read_csv() method.

More commands
• To display the shape (number of rows and columns) of the CSV file, use df.shape
>>> df.shape
(7, 5)
Reading CSV file with specific/selected columns-
• This can be done by using the “usecols” argument along with read_csv().
>>> df=pd.read_csv("Employee.csv",usecols=['Name','Age'])
Reading CSV file with specific/selected rows-
• Use the “nrows” argument with read_csv(). nrows means number of
rows.
>>> df=pd.read_csv("Employee.csv",nrows=5)
• Here only the first 5 rows are read. NaN values are also included, if present.

More commands
Reading CSV file without header
• If the file has no header row, use the header=None option; the columns
are then labelled 0, 1, 2, …
>>> df=pd.read_csv("Employee.csv",header=None)
Reading CSV file without index
• To use the first column of the file as the row index instead of the default index numbers, use index_col=0.
>>> df=pd.read_csv("Employee.csv",index_col=0)

UPDATING/MODIFYING CONTENTS IN A CSV FILE
Reading CSV file with new column names
• Use the skiprows option to skip the header row if it exists. Specify the new
names with the names option.
df=pd.read_csv("Employee.csv",skiprows=1,names=['a','b','c','d','e'])
Replace any contents of the dataframe with NaN values-
• Done by using na_values option along with read_csv method
>>> df=pd.read_csv("Employee.csv",na_values=[26])
Here wherever the value 26 is seen, it gets updated to NaN.
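The read_csv() options above can be tried without creating a file by reading from text held in memory. In the sketch below, csv_text is a made-up stand-in for Employee.csv (its names and values are hypothetical), and io.StringIO makes the text behave like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for Employee.csv.
csv_text = "Name,Age,City\nAsha,26,Delhi\nRavi,31,Pune\nMeena,26,Patna\n"

# Plain read: the first line becomes the column names.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)

# Only two columns and only the first two data rows.
two_cols = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Age'], nrows=2)
print(two_cols)

# header=None treats every line as data, so the header line becomes row 0.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.shape)

# Skip the header line and supply new column names.
renamed = pd.read_csv(io.StringIO(csv_text), skiprows=1, names=['a', 'b', 'c'])
print(renamed.columns.tolist())

# Treat every 26 as a missing value.
with_nan = pd.read_csv(io.StringIO(csv_text), na_values=[26])
print(with_nan)
```

Replacing io.StringIO(csv_text) with a file name such as "Employee.csv" gives the calls shown on the slides.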

Exporting data from DataFrame to csv
• To create a CSV file from a dataframe, the to_csv() method is
used.
• 2 methods-
 Create a dataframe. Transfer this to a csv file.
 Copying the contents of the original CSV file to another file.
• To export a dataframe into a csv file, first create a dataframe, say
df1, and use the df1.to_csv('path') method to export df1
into a new csv.
>>> df1=pd.DataFrame(df)
>>> df1.to_csv("Employee12.csv")
• Here the contents of df are copied to df1, and df1 is written to Employee12.csv.

Example
import pandas as pd
cars = {'Brand': ['Honda Civic','ToyotaCorolla',
'FordFocus','AudiA4'],'Price': [22000,25000,27000,35000]}
df= pd.DataFrame(cars, columns= ['Brand', 'Price'])
df.to_csv('export_dataframe.csv', index = False, header=True)
#Open the notepad with export_dataframe file.
pd.read_csv('export_dataframe.csv')

Example
#To create a new CSV file by copying the contents of Employee.csv.
import pandas as pd
df= pd.read_csv("Employee.csv")
df.to_csv('Employee_new.csv')
print(df)
• Employee_new.csv file shall be created containing the same contents
as Employee.csv with default index values.
• If you open this file in a spreadsheet like MS Excel, you will get the
Employee data in the form of rows/records and columns.
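The effect of the index option of to_csv() can be seen by exporting to text instead of a file. The sketch below relies on to_csv() returning the CSV text when no path is given, and uses a shortened version of the cars data; the variable names are illustrative.

```python
import io
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla'], 'Price': [22000, 25000]}
df = pd.DataFrame(cars)

# With no path, to_csv() returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading back a file that contains the index gives an extra 'Unnamed: 0' column...
extra_col = pd.read_csv(io.StringIO(with_index))
print(extra_col.columns.tolist())

# ...unless index_col=0 tells pandas to use that column as the row index.
reloaded = pd.read_csv(io.StringIO(with_index), index_col=0)
print(reloaded.columns.tolist())
```

Writing with index=False avoids the extra column altogether when the row labels carry no meaning.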
