Data Analysis with Pandas
Prof. Rahul Borate
G H Raisoni College of Engineering and Management , Pune
Agenda
• DataFrame Basics
• Importing and exporting data (CSV, Excel, JSON)
• Data Cleaning: handling missing values and duplicates
• Data wrangling: filtering, grouping, and merging datasets
Why pandas?
• One of the most popular libraries used by data scientists
• Labeled axes to avoid misalignment of data
• Missing values or special values may need to be removed or replaced
        Height  Weight  Weight2  Age  Gender
Amy        160     125      126   32       2
Bob        170     167      155   -1       1
Chris      168     143      150   28       1
David      190     182       NA   42       1
Ella       175     133      138   23       2
Frank      172     150      148   45       1

        Salary  Credit score
Alice    50000           700
Bob         NA           670
Chris    60000            NA
David   -99999           750
Ella     70000           685
Tom      45000           660
Overview
• Created by Wes McKinney in 2008, now maintained by
many others.
• Powerful and productive Python data analysis and management library
• It is an open-source project.
Overview
• Python Library to provide data analysis features similar to:
R, MATLAB, SAS
• Rich data structures and functions that make working with structured data fast, easy, and expressive.
• It is built on top of NumPy
• Key components provided by Pandas:
• DataFrame
pandas.DataFrame
pandas.DataFrame(data, index, columns, dtype, copy)
• data
• data can take various forms: an ndarray, Series, map, list, dict, constants, or another DataFrame.
• index
• Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
• columns
• Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
• dtype
• Data type of each column.
• copy
• Whether to copy data from the inputs. Defaults to False.
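A minimal sketch of the constructor with these parameters (assumes pandas is installed; the labels and values are illustrative only):

import pandas as pd

# 2x2 frame from a list of lists, with explicit row labels,
# column labels, and a dtype
df = pd.DataFrame(data=[[1, 2], [3, 4]],
                  index=['r1', 'r2'],
                  columns=['c1', 'c2'],
                  dtype=float)
print(df)
#      c1   c2
# r1  1.0  2.0
# r2  3.0  4.0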
DataFrame Basics
• A DataFrame is a tabular data structure made up of rows and columns, similar to a spreadsheet or database table.
• It can be treated as an ordered collection of columns
• Each column can be a different data type
• It has both row and column indices
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
#output
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
DataFrame – specifying columns and indices
• Order of columns/rows can be specified.
• Columns not in data will have NaN.
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
D 2001 Nevada 2.4 NaN
E 2002 Nevada 2.9 NaN
Columns appear in the order specified; the 'debt' column, which is not in data, is initialized with NaN.
DataFrame – index, columns, values
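frame3 is not constructed on this slide; one plausible construction (an assumption, chosen to match the values shown below) builds it from a nested dict, with an explicit index to fix the row order:

from pandas import DataFrame

pop = {'Nevada': {2001: 2.9, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
# Outer dict keys become the columns, inner keys become the row labels
frame3 = DataFrame(pop, index=[2000, 2001, 2002])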
frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)
state Nevada Ohio
year
2000 NaN 1.5
2001 2.9 1.7
2002 2.9 3.6
print(frame3.index)
Int64Index([2000, 2001, 2002], dtype='int64', name='year')
(older pandas output; newer versions print Index([2000, 2001, 2002], dtype='int64', name='year'))
print(frame3.columns)
Index(['Nevada', 'Ohio'], dtype='object', name='state')
print(frame3.values)
[[nan 1.5]
[2.9 1.7]
[2.9 3.6]]
DataFrame – retrieving a column
• A column in a DataFrame can be retrieved as a Series using dict-like notation or as an attribute
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame['state'])
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
Name: state, dtype: object
print(frame.state)
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
Name: state, dtype: object
Activity 1
• Download the following CSV file and load it into your Python script, or pass the URL directly to pd.read_csv(url), which reads it into a DataFrame
• https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/values.csv
• Calculate the average and standard deviation (std) of the column factor_1 and display the results.
• Hint: pandas mean() and std()
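A possible solution sketch, assuming the file is reachable at the URL above and contains a factor_1 column:

import pandas as pd

url = 'https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/values.csv'
df = pd.read_csv(url)           # read the CSV directly from the URL
print(df['factor_1'].mean())    # average of factor_1
print(df['factor_1'].std())     # standard deviation of factor_1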
DataFrame – getting rows
• loc is label-based; iloc is position-based
• loc gets rows (or columns) with particular labels from the index.
• iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
D 2001 Nevada 2.4 NaN
E 2002 Nevada 2.9 NaN
print(frame2.loc['A'])
year 2000
state Ohio
pop 1.5
debt NaN
Name: A, dtype: object
print(frame2.loc[['A', 'B']])
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
print(frame2.loc['A':'E', ['state','pop']])
state pop
A Ohio 1.5
B Ohio 1.7
C Ohio 3.6
D Nevada 2.4
E Nevada 2.9
print(frame2.iloc[1:3])
year state pop debt
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
print(frame2.iloc[:,1:3])
state pop
A Ohio 1.5
B Ohio 1.7
C Ohio 3.6
D Nevada 2.4
E Nevada 2.9
DataFrame – modifying columns
frame2['debt'] = 0
print(frame2)
year state pop debt
A 2000 Ohio 1.5 0
B 2001 Ohio 1.7 0
C 2002 Ohio 3.6 0
D 2001 Nevada 2.4 0
E 2002 Nevada 2.9 0
frame2['debt'] = range(5)
print(frame2)
year state pop debt
A 2000 Ohio 1.5 0
B 2001 Ohio 1.7 1
C 2002 Ohio 3.6 2
D 2001 Nevada 2.4 3
E 2002 Nevada 2.9 4
from pandas import Series
val = Series([10, 10, 10], index=['A', 'C', 'D'])
frame2['debt'] = val
print(frame2)
year state pop debt
A 2000 Ohio 1.5 10.0
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 10.0
D 2001 Nevada 2.4 10.0
E 2002 Nevada 2.9 NaN
Rows or individual elements can be modified similarly, using loc or iloc.
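For example (a sketch reusing frame2 from above; the assigned values are illustrative):

frame2.loc['B', 'debt'] = 20                  # single element, by labels
frame2.iloc[0, 3] = 15                        # single element, by positions (row 0, column 3 = 'debt')
frame2.loc['E'] = [2003, 'Nevada', 3.0, 5]    # whole row, by label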
DataFrame – removing columns
del frame2['debt']
print(frame2)
year state pop
A 2000 Ohio 1.5
B 2001 Ohio 1.7
C 2002 Ohio 3.6
D 2001 Nevada 2.4
E 2002 Nevada 2.9
Removing rows/columns
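The frame used on this slide is not constructed here; one way to build it (an assumption matching the values shown) is:

import numpy as np
from pandas import DataFrame

# 3x3 frame holding the integers 0..8, with labeled rows and columns
frame = DataFrame(np.arange(9).reshape(3, 3),
                  index=['r1', 'r2', 'r3'],
                  columns=['c1', 'c2', 'c3'])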
print(frame)
c1 c2 c3
r1 0 1 2
r2 3 4 5
r3 6 7 8
print(frame.drop(['r1']))
c1 c2 c3
r2 3 4 5
r3 6 7 8
print(frame.drop(['r1','r3']))
c1 c2 c3
r2 3 4 5
print(frame.drop(['c1'], axis=1))
c2 c3
r1 1 2
r2 4 5
r3 7 8
drop() returns a new object; the original frame is unchanged unless you reassign the result or pass inplace=True.
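A sketch of keeping the change:

# Either keep the returned object...
frame = frame.drop(['r1'])
# ...or drop in place instead:
# frame.drop(['r1'], inplace=True)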
Handling missing data
•Why fill in the missing data?
It is necessary to handle missing values in a dataset because most of the machine learning models you will want to use raise an error if you pass NaN values to them.
The simplest way to handle missing data in Python is to fill the gaps with 0, but it is essential to note that this approach can significantly reduce model accuracy.
Handling missing data
•How to know if the data has missing values?
Missing values are usually represented as NaN, null, or None in the dataset.
 df.info() can be used to get information about the dataset, including insight into missing data in Python.
 It is one of the most used functions for data analysis: it lists the column names and the number of non-null values in each column.
 It also displays the data type of each column. From the non-null counts we can see which columns contain null values, and from the data types we can decide what to replace those nulls with when addressing missing data in Python.
Handling missing data
•How to Know If the Data Has Missing Values?
 The second way of finding whether we have null values in the
data is by using the isnull() function.
 print(df.isnull().sum())
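A small sketch of both checks, on an illustrative DataFrame with a few missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3],
                   'Age': [25, np.nan, 32],
                   'City': ['Pune', 'Mumbai', None]})
df.info()                   # non-null counts and dtypes per column
print(df.isnull().sum())    # number of missing values per column
# Id      0
# Age     1
# City    1
# dtype: int64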
Handling missing data
•Different Methods of Dealing With Missing Data
1. Deleting the column with missing data
updated_df = df.dropna(axis=1)
updated_df.info()
2. Deleting the row with missing data
If a row contains missing data, you can delete that entire row, with all the features in it.
axis=1 is used to drop columns with NaN values.
axis=0 is used to drop rows with NaN values.
updated_df = df.dropna(axis=0)
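A quick sketch of the axis semantics, on an illustrative df where Id is complete and Age and City each have one missing value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3],
                   'Age': [25, np.nan, 32],
                   'City': ['Pune', 'Mumbai', None]})
print(df.dropna(axis=1))    # keeps only 'Id' (the column with no missing values)
print(df.dropna(axis=0))    # keeps only the first row (the row with no missing values)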
Handling missing data
•Different Methods of Dealing With Missing Data
3. Filling the Missing Values – Imputation
The possible ways to do this are:
Filling the missing data with the mean or median value if it’s a
numerical variable.
Filling the missing data with mode if it’s a categorical value.
Filling the numerical value with 0 or -999, or some other number
that will not occur in the data. This can be done so that the machine
can recognize that the data is not real or is different.
Filling the categorical value with a new type for the missing values.
You can use the fillna() function to fill the null values in the dataset.
updated_df = df.copy()
updated_df['Age'] = updated_df['Age'].fillna(updated_df['Age'].mean())
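A sketch of the categorical cases, assuming an illustrative 'City' column with missing entries:

# Mode imputation: fill with the most frequent value
updated_df['City'] = updated_df['City'].fillna(updated_df['City'].mode()[0])

# Or introduce a new category for the missing entries
updated_df['City'] = updated_df['City'].fillna('Unknown')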
Editor's Notes

  • #12 obj = pd.read_csv('values.csv')
  • #15 df2 = frame.drop('r1', axis=0)  # axis=0 for rows, axis=1 for columns
  • #17 to #21 s.dropna(inplace=True)