Unit 1
Data Handling using Pandas-1
Module: A module is a file that contains Python functions and
statements. It is a .py file that holds executable Python code.
Package: A package is a namespace that contains multiple
packages or modules. It is a directory that contains a special
file, __init__.py.
The __init__.py file tells Python to treat the directory that
contains it as a package.
Library: A library is a collection of various packages. Conceptually,
there is little difference between a package and a Python library.
Framework: A framework is a collection of various libraries that
also defines the overall structure and flow of the code.
Pandas:
Pandas is the most popular open-source Python library used for
data analysis.
Pandas provides two main data structures for analyzing data:
● Series
● Dataframes
Series:
A Series is a one-dimensional labelled array defined in pandas that
can store data of any type.
Syntax:
<Series Name>=<pd>.Series(<list name>, ...)
Example:
5 15 16 4 34
Properties of Series:
• A Series holds data of a single (homogeneous) data type.
• The size of a series is immutable.
• The values in a series are mutable.
Creation of Series:
We can create a pandas series in the following ways:
● From arrays
● From Lists
● From Dictionaries
● From scalar value
From Lists:
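A minimal sketch of creating a series from a list, reusing the values 5, 15, 16, 4, 34 from the example above. The variable names and the letter labels are assumptions chosen for illustration.

```python
import pandas as pd

values = [5, 15, 16, 4, 34]      # a plain Python list

# Without an index argument, pandas assigns the default index 0, 1, 2, ...
s = pd.Series(values)
print(s)

# The same values with custom index labels (illustrative labels).
s_labeled = pd.Series(values, index=["a", "b", "c", "d", "e"])
print(s_labeled)
```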
From arrays:
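A short sketch of creating a series from a NumPy array. The array contents are arbitrary sample values, not data from the original slides.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])   # sample one-dimensional ndarray
s = pd.Series(arr)                 # default index 0..3
print(s)
```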
From Dictionary:
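A small sketch of creating a series from a dictionary: the keys become the index labels and the values become the data. The dictionary below is a made-up sample.

```python
import pandas as pd

d = {"x": 1, "y": 2, "z": 3}   # sample dictionary
s = pd.Series(d)               # keys -> index labels, values -> data
print(s)
```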
From Scalar Value:
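A sketch of creating a series from a scalar value. An index is supplied so that pandas knows how many times to repeat the scalar; the value 7 and the labels are arbitrary.

```python
import pandas as pd

# The scalar 7 is repeated once for every index label.
s = pd.Series(7, index=["a", "b", "c"])
print(s)
```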
Mathematical Operations on Series:
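A minimal sketch of arithmetic on series, using made-up sample values. Operations between two series are aligned on index labels, so a label missing from either series gives NaN unless a fill value is supplied.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["a", "b", "d"])

scaled = s1 * 2                     # element-wise operation with a scalar
total = s1 + s2                     # aligned on labels; "c" and "d" become NaN
filled = s1.add(s2, fill_value=0)   # treat a missing label as 0 instead

print(scaled)
print(total)
print(filled)
```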
Head and Tail functions on Series:
The head() and tail() functions return the first and last n rows respectively.
Syntax:
<Series name>.head(n)
<Series name>.tail(n)
n is the number of rows.
The default value of n is 5.
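A quick sketch of head() and tail() on a sample series of ten values (the data is arbitrary).

```python
import pandas as pd

s = pd.Series(range(10, 110, 10))   # 10, 20, ..., 100

first_five = s.head()     # n defaults to 5
first_three = s.head(3)
last_two = s.tail(2)

print(first_five)
print(first_three)
print(last_two)
```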
Selection, Indexing and Slicing on Series:
Selection: We can select a value from a series by using its
corresponding index.
Syntax:
<Series name>[<index>]
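A minimal sketch of selecting values from a series. It also shows .loc (label-based) and .iloc (position-based), which make the intent explicit; the sample data is illustrative.

```python
import pandas as pd

s = pd.Series([20, 30, 52, 10],
              index=["pencils", "notebooks", "scales", "erasers"])

by_label = s["scales"]        # select by index label
by_loc = s.loc["erasers"]     # explicit label-based selection
by_position = s.iloc[0]       # explicit position-based selection

print(by_label, by_loc, by_position)
```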
Indexing:
Series.index attribute is used to get or set the index labels for the
given series.
Syntax:
<Series name>.index
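A short sketch of reading and replacing the index labels of a series through the index attribute. The values and labels are made up; note that the new list of labels must have the same length as the series.

```python
import pandas as pd

s = pd.Series([101, 102, 103])
print(s.index)                 # the default index

# Assign new labels; the list length must equal len(s).
s.index = ["x", "y", "z"]
print(s)
```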
Slicing:
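A slicing operation extracts a part of the series based on the given
parameters.
Syntax:
<Series name>[<start>:<stop>:<step>]
Note: start, stop and step are optional.
Default values: start=0, stop=n (the length of the series; the stop
position itself is excluded), step=1
Note: slicing with integers works on positions (the default 0-based
index), not on custom index labels.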
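A minimal sketch of slicing with start, stop and step on a sample series. The stop position is excluded from the result.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

middle = s[1:4]         # positions 1, 2, 3 (stop is excluded)
every_second = s[::2]   # positions 0, 2, 4
reversed_s = s[::-1]    # negative step walks backwards

print(middle)
print(every_second)
print(reversed_s)
```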
1. What is the significance of the Pandas library?
2. Name some common data structures of Python's pandas
library.
3. Write the syntax and description of the min, sum, describe
and idxmax functions of a pandas Series.
4. What will be the output produced by the following code?
Stationary = ['pencils', 'notebooks', 'scales', 'erasers']
S=pd.Series([20,30,52,10],index=Stationary)
S2=pd.Series([17,13,32,21],index=Stationary)
print(S+S2)
S=S+S2
print(S+S2)
5. Find the error in the following code fragment:
S2=pd.Series([101,102,102,104])
S2.index=[0,1,2,3,4,5]
S2[5]=220
print(S2)
Write a Pandas program to multiply and divide two Pandas Series. Sample Series:
[2, 4, 8, 10], [1, 3, 7, 9]
import pandas as pd
ds1 = pd.Series([2, 4, 8, 10])
ds2 = pd.Series([1, 3, 7, 9])
print("Multiply two Series:")
ds = ds1 * ds2
print(ds)
print("Divide Series1 by Series2:")
ds = ds1 / ds2
print(ds)
Write a Pandas program to convert a dictionary to a Pandas series. Sample
dictionary: d1 = {'a': 100, 'b': 200, 'c':300}
import pandas as pd
d1 = {'a': 100, 'b': 200, 'c':300}
print("Original dictionary:")
print(d1)
new_series = pd.Series(d1)
print("Converted series:")
print(new_series)
Write a Pandas program to sort a given Series. Sample Series:
400, 300.12, 100, 200
import pandas as pd
s = pd.Series([400, 300.12, 100, 200])
print("Original Data Series:")
print(s)
# sort_values() returns a new Series sorted in ascending order
new_s = s.sort_values()
print("Sorted Data Series:")
print(new_s)
Data Frames
Data Frames:
A Data Frame is a two-dimensional (2-D) data structure defined in
pandas which consists of rows and columns.
A Data Frame stores an ordered collection of columns, and different
columns can store data of different types.
Example:
S.No. Name Age Marks
1 Ravi 25 99
2 Kunal 26 98
Characteristics of Data Frames:
➢ It has two indices (two axes)
○ Row index (axis=0) ->known as index
○ Column index (axis=1) ->known as column-name
➢ Value in the Data Frame will be identifiable by the
combination of row index and column index.
➢ Indices can be of any type
➢ Different columns can hold data of different types.
➢ Value is mutable
➢ Size is mutable
Creation of Data Frames:
Syntax:
<Data Frame Name> =
pandas.DataFrame(
<2D data structure>,
columns=<column sequence>,
index=<index sequence>, ...)
We can create a Data Frame in many ways, such as:
(i) Two dimensional dictionaries
(ii) Two dimensional ndarrays(NumPy arrays)
(iii) Series type object
(iv) Another Dataframe object
(v) Text/CSV files
Creating Data frame from List:
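A minimal sketch of building a data frame from a list of lists, reusing the Ravi and Kunal rows from the example table above. Each inner list is one row, and the column names are passed separately.

```python
import pandas as pd

rows = [["Ravi", 25, 99],
        ["Kunal", 26, 98]]

# Each inner list becomes one row of the data frame.
df = pd.DataFrame(rows, columns=["Name", "Age", "Marks"])
print(df)
```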
Creating Data frame from array:
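A sketch of building a data frame from a two-dimensional NumPy array. The numbers and the row and column labels are arbitrary sample values.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

df = pd.DataFrame(arr, columns=["A", "B", "C"], index=["r1", "r2"])
print(df)
```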
Creating Data frame from Series:
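A sketch of building a one-column data frame from a single series. Giving the series a name is an assumption made here so that the resulting column has a readable label.

```python
import pandas as pd

marks = pd.Series([99, 98], name="Marks")

# The series name becomes the column name of the data frame.
df = pd.DataFrame(marks)
print(df)
```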
Creating Data frame from another Data frame:
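A sketch of creating a data frame from an existing one. Passing a data frame to the constructor is one option; copy() is shown as well because it is a clear way to get an independent copy that can be changed without touching the original.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Ravi", "Kunal"], "Marks": [99, 98]})

df2 = pd.DataFrame(df1)   # new data frame built from df1
df3 = df1.copy()          # independent copy of df1

df3.loc[0, "Marks"] = 50  # changes df3 only
print(df1)
print(df3)
```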
(i) Two dimensional dictionaries
We can create a Dataframe from two-dimensional dictionaries in the following ways:
➢ Creating Dataframe from list of dictionaries
➢ Creating Dataframe from dictionary of Series
Creating Dataframe from list of dictionaries:
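A minimal sketch of building a data frame from a list of dictionaries. Each dictionary is one row and its keys are column names; a key missing from a row is filled with NaN. The records are sample data.

```python
import pandas as pd

records = [{"Name": "Ravi", "Age": 25},
           {"Name": "Kunal", "Marks": 98}]

df = pd.DataFrame(records)   # missing keys become NaN
print(df)
```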
Creating Data frame from dictionary of Series:
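A sketch of building a data frame from a dictionary of series. The dictionary keys become column names and the series are aligned on their index labels. The label "Asha" is a hypothetical extra row added to show how a missing value appears.

```python
import pandas as pd

data = {
    "Age": pd.Series([25, 26], index=["Ravi", "Kunal"]),
    # "Asha" is a hypothetical label that has no Age entry.
    "Marks": pd.Series([99, 98, 97], index=["Ravi", "Kunal", "Asha"]),
}

df = pd.DataFrame(data)   # rows are the union of the index labels
print(df)
```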
(v) Text/CSV files:
We can create a Dataframe from Text/CSV files by using the
read_csv() function.
Syntax:
<data frame name>
=pandas.read_csv(filepath_or_buffer, sep=',',
delimiter=None, header='infer', names=None,
index_col=None, usecols=None, …)
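A minimal sketch of read_csv(). To keep the example self-contained, the CSV text is held in a string and wrapped in io.StringIO instead of being read from a file on disk; with a real file, the file path would be passed as the first argument. The rows come from the example table above.

```python
import io
import pandas as pd

csv_text = "S.No.,Name,Age,Marks\n1,Ravi,25,99\n2,Kunal,26,98\n"

# index_col makes the S.No. column the row index.
df = pd.read_csv(io.StringIO(csv_text), index_col="S.No.")
print(df)

# usecols loads only the listed columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Marks"])
print(subset)
```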
Accessing values in dataframe:
Accessing a particular value:
<Data frame name>[<column name>][<index>]
Accessing a group of values:
<Data frame name>.loc[<index>, <column name>]
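A short sketch of accessing values in a data frame, using the example table above with S.No. as the row index. The .loc accessor takes the row label first and the column name second, and either can be a list.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Kunal"],
                   "Age": [25, 26],
                   "Marks": [99, 98]},
                  index=[1, 2])

one_value = df["Marks"][1]                   # column first, then row label
same_value = df.loc[1, "Marks"]              # row label, then column name
group = df.loc[[1, 2], ["Name", "Marks"]]    # several rows and columns
whole_row = df.loc[2]                        # one full row as a Series

print(one_value, same_value)
print(group)
print(whole_row)
```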
NaN in Python:
NaN, standing for "not a number", is a special floating-point value
used to represent a value that is undefined or unrepresentable. For
example, 0/0 is undefined as a real number, so NumPy and pandas
represent such a result by NaN. Pandas also uses NaN to mark missing data.
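A small sketch of how NaN behaves in a pandas series, using made-up values. NaN is never equal to itself, so isna() is the reliable way to detect it.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

missing = s.isna()               # True where the value is NaN
nan_equals_nan = (np.nan == np.nan)   # NaN never compares equal to itself
total = s.sum()                  # sum() skips NaN by default
filled = s.fillna(0)             # replace NaN with 0

print(missing)
print(nan_equals_nan, total)
print(filled)
```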
Iteration on Dataframes:
We can iterate over a pandas Dataframe in two ways:
● Iterating over rows
● Iterating over columns
Iterating over rows:
To iterate over a DataFrame, we can use the following functions.
iterrows() and itertuples() go row by row, while items() goes
column by column:
● iterrows(): iterate over the rows as (index, Series) pairs
● items(): iterate over the columns as (column name, Series) pairs.
This function was called iteritems() in older versions;
iteritems() was removed in pandas 2.0.
● itertuples(): iterate over the rows as namedtuples
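A minimal sketch of the three iteration functions on a small sample data frame. It uses items(), the current name of the column-wise iterator; the list names are assumptions added only to collect the results.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Kunal"], "Marks": [99, 98]})

# iterrows(): each step gives the row index and the row as a Series.
rows_seen = []
for idx, row in df.iterrows():
    rows_seen.append((idx, row["Name"], row["Marks"]))

# items(): each step gives a column name and that column as a Series.
columns_seen = []
for col_name, col in df.items():
    columns_seen.append((col_name, col.tolist()))

# itertuples(): each row is a namedtuple with Index plus one field per column.
tuples_seen = []
for t in df.itertuples():
    tuples_seen.append((t.Index, t.Name, t.Marks))

print(rows_seen)
print(columns_seen)
print(tuples_seen)
```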
Iterating over columns: To iterate over columns, we create a list
of the data frame's column names and then iterate through that
list to pull out each column.
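The approach described above can be sketched as follows, with a small sample data frame. The dictionary column_values is an assumption used only to collect what the loop visits.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Kunal"], "Marks": [99, 98]})

column_names = list(df.columns)   # list of the column names

column_values = {}
for name in column_names:
    # df[name] pulls out one column as a Series.
    column_values[name] = df[name].tolist()
    print(name, column_values[name])
```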
Operations on rows and columns:
● Add
● Select
● Delete
● Rename
Column selection:
Column addition:
Column Deletion:
Column Rename:
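A minimal sketch of the four column operations on a sample data frame. The "Grade" column and the new name "Score" are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Kunal"],
                   "Age": [25, 26],
                   "Marks": [99, 98]})

# Selection: one column gives a Series, a list of columns gives a DataFrame.
ages = df["Age"]
name_marks = df[["Name", "Marks"]]

# Addition: assign a list with one value per row (hypothetical column).
df["Grade"] = ["A", "A"]

# Deletion: drop() returns a new data frame without the column.
df = df.drop(columns=["Age"])

# Rename: map the old column name to the new one.
df = df.rename(columns={"Marks": "Score"})
print(df)
```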
Row selection:
Row Addition:
Row Deletion:
Row Rename:
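A minimal sketch of the four row operations on a sample data frame. The row labels r1, r2, r3 and the added row for "Asha" are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Kunal"], "Marks": [99, 98]},
                  index=["r1", "r2"])

# Selection: .loc uses the row label, .iloc uses the row position.
first = df.loc["r1"]
second = df.iloc[1]

# Addition: assigning to a new label appends a row (hypothetical data).
df.loc["r3"] = ["Asha", 97]

# Deletion: drop() returns a new data frame without the row.
df = df.drop(index="r2")

# Rename: map the old row label to the new one.
df = df.rename(index={"r1": "row1"})
print(df)
```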
Head and Tail functions in Data Frames:
head(n):
Returns the first n rows.
tail(n):
Returns last n rows.
Default value for n is 5
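A quick sketch of head() and tail() on a sample data frame with eight rows (the column name n is arbitrary).

```python
import pandas as pd

df = pd.DataFrame({"n": list(range(1, 9))})   # rows 1..8

top = df.head()       # first 5 rows by default
bottom = df.tail(2)   # last 2 rows

print(top)
print(bottom)
```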
Indexing using Labels in Data Frames: We can use one of
the columns as the row index labels of the data frame by using the
function set_index().
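A short sketch of set_index(), using the example table above. set_index() returns a new data frame, so the result is stored in a separate variable (by_name is an assumed name).

```python
import pandas as pd

df = pd.DataFrame({"S.No.": [1, 2],
                   "Name": ["Ravi", "Kunal"],
                   "Marks": [99, 98]})

by_name = df.set_index("Name")          # Name values become the row labels
kunal_marks = by_name.loc["Kunal", "Marks"]

print(by_name)
print(kunal_marks)
```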
Boolean indexing in Data Frames: Boolean indexing helps us
to select the data from the Data Frames using a boolean vector.
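A minimal sketch of boolean indexing on the example table. A comparison on a column produces a boolean series, and passing it to the data frame keeps only the rows marked True. Conditions are combined with & (and) or | (or), each wrapped in parentheses.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Kunal"],
                   "Age": [25, 26],
                   "Marks": [99, 98]})

mask = df["Marks"] > 98        # boolean vector, one entry per row
top = df[mask]                 # rows where the mask is True

both = df[(df["Age"] > 24) & (df["Marks"] < 99)]

print(mask)
print(top)
print(both)
```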
Joining, Merging and Concatenation on Data Frames:
Merge:
The pandas.merge() function is used for merging two data frames.
Its main arguments are:
● The two data frames to merge
● how - the type of merge: 'left', 'right', 'inner' (the default)
or 'outer'
● on - the name of the common column to merge on
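A minimal sketch of pandas.merge() with different values of how. The two data frames are sample data; S.No. 3 ("Asha") and S.No. 4 are hypothetical rows added so that the merge types give different results.

```python
import pandas as pd

students = pd.DataFrame({"S.No.": [1, 2, 3],
                         "Name": ["Ravi", "Kunal", "Asha"]})
marks = pd.DataFrame({"S.No.": [1, 2, 4],
                      "Marks": [99, 98, 90]})

# inner: only S.No. values present in both data frames.
inner = pd.merge(students, marks, on="S.No.", how="inner")

# left: every row of students; missing Marks become NaN.
left = pd.merge(students, marks, on="S.No.", how="left")

# outer: every S.No. from either data frame.
outer = pd.merge(students, marks, on="S.No.", how="outer")

print(inner)
print(left)
print(outer)
```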
Join: The join() method combines data frames using their indexes.
Use <dataframe 1>.join(<dataframe 2>) to join.
Concatenation: Use pandas.concat(<list of data frames>) to
concatenate data frames.
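A short sketch of join() and pandas.concat() on sample data frames. join() matches rows by index label, while concat() stacks data frames one below the other by default.

```python
import pandas as pd

ages = pd.DataFrame({"Age": [25, 26]}, index=["Ravi", "Kunal"])
marks = pd.DataFrame({"Marks": [99, 98]}, index=["Ravi", "Kunal"])

# join(): columns of marks are added to ages, matched on the index.
joined = ages.join(marks)

top = pd.DataFrame({"Name": ["Ravi"], "Marks": [99]})
bottom = pd.DataFrame({"Name": ["Kunal"], "Marks": [98]})

# concat(): rows are stacked; ignore_index renumbers them 0, 1, ...
stacked = pd.concat([top, bottom], ignore_index=True)

print(joined)
print(stacked)
```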
Importing/Exporting Data between CSV files and Data
Frames:
Import data from CSV file to Data Frame: We can import data
from a CSV file into a Data Frame by using the read_csv() function.
Export data from Data Frame to CSV File: We can export data
from a Data Frame to a CSV file by using the to_csv() function.
Syntax:
<data frame name>.to_csv(<File Path>,.....)
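A minimal sketch of a round trip: a sample data frame is written with to_csv() and read back with read_csv(). The file name students.csv is hypothetical, and a temporary directory is used so the example leaves no file behind.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Kunal"], "Marks": [99, 98]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "students.csv")   # hypothetical file name

    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)

    loaded = pd.read_csv(path)

print(loaded)
```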