Python Pandas 1
Python Pandas 1
Data Cleaning
with Python Pandas
Rajesh Jogi
1/86
Table of Contents
1 Different Things on Series with Pandas
1.1 Slicing Series
1.2 Append Series
1.3 Operation on Series
2 DataFrame
2.1 Create DataFrame
2.2 Dataframe of Random Numbers with Date Indices
2.3 Delete Column in DataFrame
2.4 Data Selection in Dataframe
2.5 Set Value
2.6 Dealing with NULL Values
2.7 Descriptive Statistics
3 Apply function on Dataframe
3.1 Merge Dataframes
3.2 Importing multiple CSV files in DataFrame
4 LIKE OPERATION IN PANDAS
5 Regex in Pandas dataframe
6 Replace values in dataframe
7 Group By
8 Loading Data in Chunks
9 Stack & unstack in Pandas
10 PIVOT Tables
11 Hierarchical indexing
12 SWAP Columns in Hierarchical indexing
13 Crosstab in Pandas
14 Row & Column Bind
14.1 Row Bind
14.2 Column Bind
2/86
In [3]:▾ # Create series from numpy array
v = np.array([1,2,3,4,5,6,7])
s1 = pd.Series(v)
s1
Out[3]: 0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int32
Out[4]: dtype('int32')
D:\Software_installed\python 3.8.3\lib\site-packages\ipykernel_launcher.py:2: F
utureWarning: Series.itemsize is deprecated and will be removed in a future ver
sion
Out[5]: 4
Out[6]: 28
Out[7]: (7,)
Out[8]: 1
Out[9]: 7
In [10]: s1.count()
Out[10]: 7
3/86
In [11]: s1.size
Out[11]: 7
Out[12]: a 1
b 2
c 3
dtype: int64
Out[13]: a 1
b 2
c 3
d 4
e 5
f 6
g 7
dtype: int32
In [15]: ind2,v2
Out[16]: a1 10
a2 20
a3 30
a4 40
dtype: int64
4/86
In [17]: pd.Series(99, index = [0,1,2,3,4,5])
Out[17]: 0 99
1 99
2 99
3 99
4 99
5 99
dtype: int64
Out[18]: 0 0.255723
1 0.333076
2 0.524627
3 0.956131
4 0.938360
5 0.280243
6 0.325282
7 0.667926
8 0.928240
9 0.229452
dtype: float64
Out[19]: 0 0.255723
1 0.333076
2 0.524627
3 0.956131
4 0.938360
5 0.280243
6 0.325282
7 0.667926
8 0.928240
9 0.229452
dtype: float64
Out[20]: 0 0.255723
1 0.333076
2 0.524627
dtype: float64
5/86
In [21]:▾ # Last element of the Series
s[-1:]
Out[21]: 9 0.229452
dtype: float64
Out[22]: 0 0.255723
1 0.333076
2 0.524627
3 0.956131
dtype: float64
In [23]:▾ # Return all elements of the series except last two elements.
s[:-2]
Out[23]: 0 0.255723
1 0.333076
2 0.524627
3 0.956131
4 0.938360
5 0.280243
6 0.325282
7 0.667926
dtype: float64
Out[24]: 0 0.255723
1 0.333076
2 0.524627
3 0.956131
4 0.938360
5 0.280243
6 0.325282
7 0.667926
8 0.928240
dtype: float64
Out[25]: 8 0.928240
9 0.229452
dtype: float64
Out[26]: 9 0.229452
dtype: float64
6/86
In [27]: s[-3:-1]
Out[27]: 7 0.667926
8 0.928240
dtype: float64
Out[28]: a 1
b 2
c 3
d 4
e 5
f 6
g 7
dtype: int32
In [29]: s3
Out[29]: a1 10
a2 20
a3 30
a4 40
dtype: int64
Out[30]: a 1
b 2
c 3
d 4
e 5
f 6
g 7
a1 10
a2 20
a3 30
a4 40
dtype: int64
7/86
In [31]:▾ # When "inplace=False" it will return a new copy of data with the operation per
s4.drop('a4', inplace=False)
Out[31]: a 1
b 2
c 3
d 4
e 5
f 6
g 7
a1 10
a2 20
a3 30
dtype: int64
In [32]: s4
Out[32]: a 1
b 2
c 3
d 4
e 5
f 6
g 7
a1 10
a2 20
a3 30
a4 40
dtype: int64
Out[33]: a 1
b 2
c 3
d 4
e 5
f 6
g 7
a1 10
a2 20
a3 30
dtype: int64
8/86
In [34]: s4 = s4.append(pd.Series({'a4':7}))
s4
Out[34]: a 1
b 2
c 3
d 4
e 5
f 6
g 7
a1 10
a2 20
a3 30
a4 7
dtype: int64
Out[35]: (0 10
1 20
2 30
dtype: int32, 0 1
1 2
2 3
dtype: int32)
Out[36]: 0 11
1 22
2 33
dtype: int32
Out[37]: 0 9
1 18
2 27
dtype: int32
9/86
In [38]:▾ # Subtraction of two series
s1.subtract(s2)
Out[38]: 0 9
1 18
2 27
dtype: int32
Out[39]: 0 19
1 29
2 39
dtype: int32
Out[40]: 0 10
1 40
2 90
dtype: int32
Out[41]: 0 10
1 40
2 90
dtype: int32
Out[42]: 0 10000
1 20000
2 30000
dtype: int32
In [43]:▾ # Division
s1.divide(s2)
Out[43]: 0 10.0
1 10.0
2 10.0
dtype: float64
10/86
In [44]:▾ # Division
s1.div(s2)
Out[44]: 0 10.0
1 10.0
2 10.0
dtype: float64
Out[45]: 30
Out[46]: 10
In [47]:▾ # Average
s1.mean()
Out[47]: 20.0
In [48]:▾ # median
s1.median()
Out[48]: 20.0
Out[49]: 10.0
Out[50]: False
In [51]: s4 =s1
Out[52]: True
11/86
In [53]: s5 = pd.Series([1,1,2,2,3,3], index = [0,1,2,3,4,5])
s5
Out[53]: 0 1
1 1
2 2
3 2
4 3
5 3
dtype: int64
Out[54]: 3 2
2 2
1 2
dtype: int64
2 DataFrame
Out[55]:
Out[56]:
0
0 Java 1
Python 2C
3C++
12/86
In [57]:▾ # Add column in the Dataframe
rating = [1,2,3,4]
df[1] = rating
df
Out[57]:
0 1
0 Java 1 1
Python 2C 2
3C++ 3
In [59]: df
Out[59]:
Language Rating
0 Java 1
1 Python 2
2 C 3
3 C++ 4
In [61]: df2
Out[61]:
a b c
0 1 2 NaN
1 5 10 20.0
13/86
In [62]: df3
Out[62]:
a b
row 1 2
1 5 10
row
In [63]: df4
Out[63]:
a b c
row 1 2 NaN
1 5 10 20.0
row
2
In [64]: df5
Out[64]:
a b c d
1 5 10 20.0 NaN
row
2
In [65]:▾ # Create Dataframe from Dictionary
df0 = pd.DataFrame({'ID' :[1,2,3,4], 'Name' :['Aryan','Nayan','John','Rose']})
df0
Out[65]:
ID Name
0 1 Aryan
1 2 Nayan
2 3 John
3 4 Rose
14/86
In [66]:▾ # Create a DataFrame from Dictionary of Series
▾ dict = {'A': pd.Series([1,2,3,], index = ['a','b','c']),
'B': pd.Series([1,2,3,4], index = ['a','b','c','d'])}
df1 = pd.DataFrame(dict)
df1
Out[66]:
A B
a1.0
1
b2.0
2
c3.0
3
dNaN
4
15/86
In [70]: M = np.random.random((7,7))
M
Out[71]:
0 1 2 3 4 5 6
16/86
In [72]:▾ #Changing Column Names
dframe.columns = ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7']
dframe
Out[72]:
C1 C2 C3 C4 C5 C6 C7
Out[75]: C1 float64
C2 float64
C3 float64
C4 float64
C5 float64
C6 float64
C7 float64
dtype: object
17/86
In [76]:▾ # Sort Dataframe by Column 'C1' in Ascending Order
dframe.sort_values(by='C1')
Out[76]:
C1 C2 C3 C4 C5 C6 C7
Out[77]:
C1 C2 C3 C4 C5 C6 C7
Out[78]:
A B
a1.0
1
b2.0
2
c3.0
3
dNaN
4
18/86
In [79]: del df1['B']
In [80]: df1
Out[80]:
A
a1.0
b2.0
c3.0
dNaN
In [81]: df5
Out[81]:
a b c d
1 5 10 20.0 NaN
row
2
In [82]:▾ # Delete Column using pop()
df5.pop('c')
In [83]: df5
Out[83]:
a b d
row 1 2 NaN
1 5 10 NaN
row
19/86
In [84]: df
Out[84]:
Language Rating
0 Java 1
1 Python 2
2 C 3
3 C++ 4
Out[85]:
Language Rating
1 Java 1
2 Python 2
3 C 3
4 C++ 4
In [87]: df.iloc[1]
In [88]: df.loc[1:2]
Out[88]:
Language Rating
1 Java 1
2 Python 2
20/86
In [89]: df.iloc[1:2]
Out[89]:
Language Rating
2 Python 2
Out[90]:
Language Rating
3 C 3
4 C++ 4
In [91]: df1
Out[91]:
A
a1.0
b2.0
c3.0
dNaN
Out[92]: A 1.0
Name: a, dtype: float64
In [93]:▾ # df1.iloc['a']# This will throw error because iloc will not work on labels
21/86
In [94]: dframe
Out[94]:
C1 C2 C3 C4 C5 C6 C7
Out[95]:
C1 C2 C3 C4 C5 C6 C7
Out[96]:
C1 C7
22/86
In [97]:▾ #row & column label based selection
dframe.loc['2020-01-20':'2020-01-22',['C1','C7']]
Out[97]:
C1 C7
Out[98]:
C1 C2 C3 C4 C5 C6 C7
Out[99]:
C1 C2 C3 C4 C5 C6 C7
Out[100]: 0.6809677686418316
23/86
In [101]:▾ # # Select all rows & first three columns
dframe.iloc[:,0:3]
Out[101]:
C1 C2 C3
In [102]: dframe.iloc[0][0] = 10
Out[103]:
C1 C2 C3 C4 C5 C6 C7
Out[104]:
C1 C2 C3 C4 C5 C6 C7
24/86
In [105]:▾ # Set value of 777 for first three rows in Column 'C6'
dframe.at[0:3,'C6']=777
dframe
Out[105]:
C1 C2 C3 C4 C5 C6 C7
Out[106]:
C1 C2 C3 C4 C5 C6 C7
25/86
In [107]: dframe.iloc[0,2] = 555
dframe
Out[107]:
C1 C2 C3 C4 C5 C6 C7
In [108]:▾ # Create Copy of the calling objects data along with indices.
# Modifications to the data or indices of the copy will not be reflected in the
dframe1 = dframe.copy(deep=True)
In [110]: dframe1[dframe1['C1']==0]
Out[110]:
C1 C2 C3 C4 C5 C6 C7
26/86
In [111]:▾ # Replace zeros in Column C1 with 99
dframe1[dframe1['C1'].isin([0])]=99
dframe1
Out[111]:
C1 C2 C3 C4 C5 C6 C7
99.000000 99.000000 99.000000 99.000000 99.000000
2020-01-20 99 99.0
99.000000 99.000000 99.000000 99.000000 99.000000
2020-01-21 99 99.0
0.918548 0.615602 0.498748 0.764083 0.855754
2020-01-22 888 777.0
99.000000 99.000000 99.000000 99.000000 99.000000
2020-01-23 99 99.0
99.000000 99.000000 99.000000 99.000000 99.000000
2020-01-24 99 99.0
99.000000 99.000000 99.000000 99.000000 99.000000
2020-01-25 99 99.0
99.000000 99.000000 99.000000 99.000000 99.000000
2020-01-26 99 99.0
In [112]: dframe
Out[112]:
C1 C2 C3 C4 C5 C6 C7
Out[113]:
C1 C2 C3 C4 C5 C6 C7
27/86
2.6 Dealing with NULL Values
In [114]: dframe.at[0:8,'C7']=np.NaN
dframe.at[0:2,'C6']=np.NaN
dframe.at[5:6,'C5']=np.NaN
dframe
Out[114]:
C1 C2 C3 C4 C5 C6 C7
Out[115]:
C1 C2 C3 C4 C5 C6 C7
e e e e
e e e e
e e e e
e e e e
28/86
In [116]:▾ # Detect Missing or NULL Values
# It will return True for NULL values and False for NOT-NULL values
dframe.isna()
Out[116]:
C1 C2 C3 C4 C5 C6 C7
Tru
Tru
In [117]:▾ # Fill all NULL values with 1020
dframe= dframe.fillna(1020) e
dframe Tru
Out[117]: e
C1 C2 C3 C4 C5 C6 C7
1020.0
2020-01-20 888 0.689842 555.000000 0.830028 0.134124 1020.000000
1020.0
2020-01-21 888 0.324641 0.858381 0.856608 0.781996 1020.000000
1020.0
2020-01-22 888 0.918548 0.615602 0.498748 0.764083 777.000000
1020.0
2020-01-23 888 0.514615 0.378123 0.829159 0.135199 0.278564
1020.0
2020-01-24 888 0.207862 0.892584 0.891020 0.922977 0.735454
1020.0
2020-01-25 888 0.167508 0.211469 0.969901 1020.000000 0.752311
1020.0
2020-01-26 888 0.943613 0.742235 0.663735 0.431511 0.553264
29/86
In [118]: dframe.at[0:5 , 'C7'] = np.NaN
dframe.at[0:2 , 'C6'] = np.NaN
dframe.at[5:6 , 'C5'] = np.NaN
dframe
Out[118]:
C1 C2 C3 C4 C5 C6 C7
Out[119]:
C1 C2 C3 C4 C5 C6 C7
30/86
In [120]:▾ #Replace first NULL value in Column C7 with 789
dframe.fillna(value={'C7':789}, limit=1)
Out[120]:
C1 C2 C3 C4 C5 C6 C7
Out[121]:
C1 C2 C3 C4 C5 C6 C7
0.431511 1020.0
2020-01-26 888 0.943613 0.742235 0.663735 0.553264
Out[122]:
C1 C2 C3 C4
31/86
In [123]: dframe
Out[123]:
C1 C2 C3 C4 C5 C6 C7
Out[124]:
C1 C2 C3 C4 C5 C6 C7
32/86
In [125]:▾ # Fill NULL values with 55
dframe.fillna(55, inplace=True)
dframe
Out[125]:
C1 C2 C3 C4 C5 C6 C7
Out[126]: C1 888.000000
C2 0.538090
C3 79.814056
C4 0.791314
C5 8.309984
C6 127.045656
C7 330.714286
dtype: float64
Out[127]: C1 888.000000
C2 0.943613
C3 555.000000
C4 0.969901
C5 55.000000
C6 777.000000
C7 1020.000000
dtype: float64
Out[128]: C1 888.000000
C2 0.167508
C3 0.211469
C4 0.498748
C5 0.134124
C6 0.278564
C7 55.000000
dtype: float64
33/86
In [129]:▾ # Median
dframe.median()
Out[129]: C1 888.000000
C2 0.514615
C3 0.742235
C4 0.830028
C5 0.764083
C6 0.752311
C7 55.000000
dtype: float64
Out[130]: C1 0.000000
C2 0.322676
C3 209.537453
C4 0.158588
C5 20.590770
C6 287.748820
C7 470.871785
dtype: float64
In [131]:▾ # Variance
dframe.var()
Out[131]: C1 0.000000
C2 0.104120
C3 43905.944333
C4 0.025150
C5 423.979805
C6 82799.383192
C7 221720.238095
dtype: float64
Out[132]: C1 888.000000
C2 0.266251
C3 0.496863
C4 0.746447
C5 0.283355
C6 0.644359
C7 55.000000
Name: 0.25, dtype: float64
34/86
In [133]:▾ #Second Quartile / Median
dframe.quantile(0.5)
Out[133]: C1 888.000000
C2 0.514615
C3 0.742235
C4 0.830028
C5 0.764083
C6 0.752311
C7 55.000000
Name: 0.5, dtype: float64
Out[134]: C1 888.000000
C2 0.804195
C3 0.875483
C4 0.873814
C5 0.852487
C6 55.000000
C7 537.500000
Name: 0.75, dtype: float64
Out[135]: C1 0.000000
C2 0.537944
C3 0.378620
C4 0.127367
C5 0.569132
C6 54.355641
C7 482.500000
dtype: float64
Out[136]: C1 6216.000000
C2 3.766630
C3 558.698394
C4 5.539198
C5 58.169890
C6 889.319594
C7 2315.000000
dtype: float64
35/86
In [137]:▾ # GENERATES DESCRIPTIVE STATS
dframe.describe()
Out[137]:
C1 C2 C3 C4 C5 C6 C7
Out[138]: C1 0.000000
C2 0.198711
C3 2.645743
C4 -1.172444
C5 2.644451
C6 2.602402
C7 1.229634
dtype: float64
Out[139]: C1 0.000000
C2 -1.897991
C3 6.999968
C4 1.040791
C5 6.994764
C6 6.820734
C7 -0.840000
dtype: float64
36/86
In [140]:▾ # Correlation
# https://www.youtube.com/watch?v=qtaqvPAeEJY&list=PLblh5JKOoLUK0FLuzwntyYI10UQ
dframe.corr()
Out[140]:
C1 C2 C3 C4 C5 C6 C7
In [141]:▾ # Covariance
# https://www.youtube.com/watch?v=xZ_z8KWkhXE&list=PLblh5JKOoLUK0FLuzwntyYI10UQ
dframe.cov()
Out[141]:
C1 C2 C3 C4 C5 C6 C7
37/86
In [142]: import statistics as st
dframe.at[3:6,'C1'] = 22
dframe
Out[142]:
C1 C2 C3 C4 C5 C6 C7
In [143]:▾ # Average
st.mean(dframe['C1'])
# dframe['C1'].mean()
Out[143]: 516.8571428571429
Out[144]: 49.69186046511628
In [145]:▾ #Returns average of the two middle numbers when length is EVEN
arr = np.array([1,2,3,4,5,6,7,8])
st.median(arr)
Out[145]: 4.5
Out[146]: 4
Out[147]: 5
Out[148]: 55.0
38/86
In [149]:▾ # Sample Variance
st.variance(dframe['C1'])
Out[149]: 214273.14285714287
Out[150]: 183662.69387755104
Out[151]: 462.89647099231905
Out[152]: 428.5588569584708
Out[153]:
C1 C2 C3 C4 C5 C6 C7
Out[154]: C1 888.000000
C2 0.943613
C3 555.000000
C4 0.969901
C5 55.000000
C6 777.000000
C7 1020.000000
dtype: float64
39/86
In [155]:▾ # Finding minimum value in Columns
dframe.apply(min)
Out[155]: C1 22.000000
C2 0.167508
C3 0.211469
C4 0.498748
C5 0.134124
C6 0.278564
C7 55.000000
dtype: float64
Out[156]: C1 3618.000000
C2 3.766630
C3 558.698394
C4 5.539198
C5 58.169890
C6 889.319594
C7 2315.000000
dtype: float64
Out[157]: C1 3618.000000
C2 3.766630
C3 558.698394
C4 5.539198
C5 58.169890
C6 889.319594
C7 2315.000000
dtype: float64
40/86
In [159]:▾ # Square root of all values in dataframe
dframe.applymap(np.sqrt)
Out[159]:
C1 C2 C3 C4 C5 C6 C7
Out[160]:
C1 C2 C3 C4 C5 C6 C7
41/86
In [161]: dframe.applymap(float)
Out[161]:
C1 C2 C3 C4 C5 C6 C7
Out[162]: C1 22.000000
C2 0.167508
C3 0.211469
C4 0.498748
C5 0.134124
C6 0.278564
C7 55.000000
dtype: float64
Out[163]:
C1 C2 C3 C4 C5 C6 C7
42/86
In [164]: daf1 = pd.DataFrame({'Id': ['1','2','3','4','5'], 'Name': ['Aryan','Rose','Bran
daf1
Out[164]:
Id Name
0 1 Aryan
1 2 Rose
2 3 Bran
3 4 Shiv
4 5 Joy
Out[165]:
Id Score
0 1 40
1 2 60
2 6 80
3 7 90
4 8 70
Out[166]:
Id Name Score
0 1 Aryan 40
1 2 Rose 60
43/86
In [167]:▾ # Full Outer Join
pd.merge(daf1,daf2, on= 'Id', how = 'outer')
Out[167]:
Id Name Score
0 1 Aryan 40.0
1 2 Rose 60.0
2 3 Bran NaN
3 4 Shiv NaN
4 5 Joy NaN
5 6 NaN 80.0
6 7 NaN 90.0
7 8 NaN 70.0
Out[168]:
Id Name Score
0 1 Aryan 40.0
1 2 Rose 60.0
2 3 Bran NaN
3 4 Shiv NaN
4 5 Joy NaN
Out[169]:
Id Name Score
0 1 Aryan 40
1 2 Rose 60
2 6 NaN 80
3 7 NaN 90
4 8 NaN 70
# csv1.to_csv('/Data_File1.csv')
# csv2.to_csv('/Data_File2.csv')
# csv3.to_csv('/Data_File3.csv')
Out[172]:
Case-
# Active Admin2 Attack Fatality_Ratio Combined_Key Confirmed Country_R
3 rows × 28 columns
Out[173]:
Case-
# Active Admin2 Attack Fatality_Ratio Combined_Key Confirmed Count
3 rows × 28 columns
45/86
In [174]:▾ # Reading columns
covid['Country_Region'].head(3)
Out[174]: 0 US
1 US
2 US
Name: Country_Region, dtype: object
Out[175]:
Country_Region Province_State Confirmed Last_Update
2020-07-08 05:33:48
0 US South Carolina 134.0
2020-07-08 05:33:48
1 US Louisiana 1068.0
2020-07-08 05:33:48
2 US Virginia 1042.0
2020-07-08 05:33:48
3 US Idaho 3252.0
2020-07-08 05:33:48
4 US Iowa 16.0
Out[176]:
Country_Region Province_State Confirmed Last_Update
46/86
In [177]:▾ #Filter data
df1.loc[df1['Country_Region']=='India']
Out[177]:
Country_Region Province_State Confirmed Last_Update
Andaman and Nicobar Islands 2020-07-08 05:33:48
3141 India 147.0
Andhra Pradesh 2020-07-08 05:33:48
3142 India 21197.0
Arunachal Pradesh 2020-07-08 05:33:48
3156 India 276.0
Assam 2020-07-08 05:33:48
3157 India 12522.0
Bihar 2020-07-08 05:33:48
3179 India 12570.0
... ...
... ... ...
Tripura 2020-07-06 04:33:57
11174 India 1568.0
Unknown 2020-07-06 04:33:57
11187 India 4913.0
Uttar Pradesh 2020-07-06 04:33:57
11193 India 27707.0
Uttarakhand 2020-07-06 04:33:57
11194 India 3124.0
West Bengal 2020-07-06 04:33:57
11217 India 22126.0
47/86
In [179]:▾ #Sort Data Frame
display('Sorted Data Frame', df1.sort_values(['Country_Region'], ascending=Fals
In [180]:▾ #Sort Data Frame - Ascending on "Country" & descending on "Last update"
display('Sorted data Frame', df1.sort_values(['Country_Region', 'Last_Update'],
Country_Region Indonesia
Confirmed 66226
Name: 3704, dtype: object
Country_Region Indonesia
Confirmed 64958
Name: 7506, dtype: object
Country_Region Indonesia
Confirmed 63749
Name: 11306, dtype: object
48/86
In [182]:▾ #Unique Values
covid['Country_Region'].drop_duplicates(keep='first').head(10)
Out[182]: 0 US
3124 Italy
3125 Brazil
3126 Russia
3127 Mexico
3128 Japan
3131 Canada
3136 Colombia
3137 Peru
3140 Spain
Name: Country_Region, dtype: object
49/86
In [183]:▾ # Countries impacted with Coronavirus
countries= covid['Country_Region'].unique()
type(countries), countries
Out[183]: (numpy.ndarray,
array(['US', 'Italy', 'Brazil', 'Russia', 'Mexico', 'Japan', 'Canada',
'Colombia', 'Peru', 'Spain', 'India', 'United Kingdom', 'China',
'Chile', 'Netherlands', 'Australia', 'Pakistan', 'Germany',
'Sweden', 'Ukraine', 'Denmark', 'France', 'Afghanistan', 'Albania',
'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina',
'Armenia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana',
'Brunei', 'Bulgaria', 'Burkina Faso', 'Burma', 'Burundi',
'Cabo Verde', 'Cambodia', 'Cameroon', 'Central African Republic',
'Chad', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus',
'Czechia', 'Diamond Princess', 'Djibouti', 'Dominica',
'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini', 'Ethiopia',
'Fiji', 'Finland', 'Gabon', 'Gambia', 'Georgia', 'Ghana', 'Greece',
'Grenada', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana',
'Haiti', 'Holy See', 'Honduras', 'Hungary', 'Iceland', 'Indonesia',
'Iran', 'Iraq', 'Ireland', 'Israel', 'Jamaica', 'Jordan',
'Kazakhstan', 'Kenya', 'Korea, South', 'Kosovo', 'Kuwait',
'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia',
'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'MS Zaandam',
'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
'Mauritania', 'Mauritius', 'Moldova', 'Monaco', 'Mongolia',
'Montenegro', 'Morocco', 'Mozambique', 'Namibia', 'Nepal',
'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'North Macedonia',
'Norway', 'Oman', 'Panama', 'Papua New Guinea', 'Paraguay',
'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania', 'Rwanda',
'Saint Kitts and Nevis', 'Saint Lucia',
'Saint Vincent and the Grenadines', 'San Marino',
'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia',
'Somalia', 'South Africa', 'South Sudan', 'Sri Lanka', 'Sudan',
'Suriname', 'Switzerland', 'Syria', 'Taiwan*', 'Tajikistan',
'Tanzania', 'Thailand', 'Timor-Leste', 'Togo',
'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Uganda',
'United Arab Emirates', 'Uruguay', 'Uzbekistan', 'Venezuela',
'Vietnam', 'West Bank and Gaza', 'Western Sahara', 'Yemen',
'Zambia', 'Zimbabwe', nan], dtype=object))
50/86
In [184]:▾ #https://data.world/data-society/pokemon-with-stats
df2 = pd.read_csv('Pokemon.csv')
df2.head(5)
Out[184]:
Type Type Sp. Sp.
# Name 1 2 Total HP Attack Defense Atk De Speed
f
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 4
Out[185]:
Type Type Sp. Sp.
# Name 1 2 Total HP Attack Defense Atk De Speed
f
0 1 Bulbasaur Grass Poison 94 45 49 49 65 65 4
51/86
In [186]:▾ # Sum of Columns
df2['Total'] = df2.iloc[:,5:11].sum(axis=1)
df2.head()
Out[186]:
Type Type Sp. Sp.
# Name 1 2 Total HP Attack Defense Atk De Speed
f
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 4
cols = list(df2.columns)
Out[187]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Total
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 31
52/86
In [188]:▾ #Shifting "Legendary" column - Its Index location -1 or 12
cols = list(df2.columns)
Out[188]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Legen
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
Out[189]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
53/86
In [192]: df2.head(7)
Out[192]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
54/86
In [196]:▾ #Filtering using loc
df2.loc[df2['Type 2']=='Dragon']
Out[196]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gen
f
7 6 CharizardMega Fire Dragon 78 130 111 130 85 100
Charizard X
Zweilous n
694 633 Dark Drago 52 65 50 45 50 38
Hydreigon Drago
695 634 Dark n 72 85 70 65 70 58
Dragalge n
696 635 Dark Drago 92 105 90 125 90 98
Tyrunt
761 691 Poison n 65 75 90 97 123 44
Tyrantrum
766 696 Rock Drago 58 89 77 45 45 48
Noibat
767 697 Rock n 82 121 119 69 59 71
Noivern
790 714 Flying Drago 40 30 35 45 40 55
791 n 85 70 80 97 80 123
715 Flying
Drago
Drago
Drago
Drago
55/86
In [197]:▾ #Filtering using loc
df3 = df2.loc[(df2['Type 2']=='Dragon') & (df2['Type 1']=='Dark')]
df3
Out[197]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Generatio
f
694 633 Deino 634 Dar Drago 52 65 50 45 50 38
k n
Dar Drago
k n
In [198]:▾ #Reset index for Dataframe df3 keeping old index column
df4 = df3.reset_index()
df4
Out[198]:
Type Type Sp. Sp.
index # Name 1 2 HP Attack Defense Atk De Speed G
f
0 694 633 Deino 634 Dar Drago 52 65 50 45 50 38
k n
Dar Drago
Out[199]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Generation
f
0 633 Deino 634 Dar Drago 52 65 50 45 50 38
1 Zweilous 635 k n 72 85 70 65 70 58
k n
Dar Drago
k n
56/86
In [200]: df2.head(5)
Out[200]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
Out[201]: 0 False
1 False
2 False
3 False
4 False
5 False
6 False
Name: Name, dtype: bool
57/86
In [202]:▾ # Display all rows containing Name "rill"
df2.loc[df2.Name.str.contains("rill")]
Out[202]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Genera
f
18 15 Beedrill Bug Poison 65 90 40 45 80 75
Out[203]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
58/86
In [204]:▾ #Display all rows with Type-1 as "Grass" and Type-2 as "Poison"
df2.loc[df2['Type 1'].str.contains("Grass") & df2['Type 2'].str.contains("Poiso
Out[204]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gene
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
59/86
In [205]: df2.loc[df2['Type 1'].str.contains('Grass|Water',regex = True)].head(7)
Out[205]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
Wate
Out[206]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Generation
f
Out[207]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
Wate
r
60/86
In [208]:▾ # To ignore case we can use "Flags = re.I"
df2.loc[df2['Type 1'].str.contains("grass|water", flags = re.I, regex=True)].he
Out[208]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
Wate
r
5 Regex in Pandas dataframe
In [209]:▾ #Get all rows with name starting with "wa"
df2.loc[df2.Name.str.contains('^Wa', flags =re.I, regex = True)].head(7)
Out[209]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Generation
8 Wartortle f
10 Water NaN 59 63 80 65 80 58
Wailmer
350 320 Water NaN 130 70 35 70 35 60
Wailord
351 321 Water NaN 170 90 45 90 45 60
Walrein
400 365 Ice Water 110 80 90 95 90 65
Watchog
564 505 Normal NaN 60 85 69 60 69 77
61/86
In [210]:▾ #Get all rows with name starting with "wa" followed by any letter between a-l
df2.loc[df2.Name.str.contains('^wa[a-l]+',flags = re.I, regex= True)].head(7)
Out[210]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Generation
f
350 320 Wailmer Water NaN 130 70 35 70 35 60
Out[211]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Genera
f
46 41 Zubat Poison Flying 40 45 35 30 40 55
707 644 Zekrom Dragon Electric 100 150 120 120 100 90
62/86
In [212]:▾ # Extracting first 3 characters from "Name" column
df2['Name2'] = df2.Name.str.extract(r'(^\w{3})')
df2.head(5)
Out[212]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
In [213]:▾ # Return all rows with "Name" starting with character 'B or b'
df2.loc[df2.Name.str.match(r'(^[B|b].*)')].head(5)
Out[213]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
63/86
In [214]: df2.head(7)
Out[214]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
Out[215]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Medo Poison 45 49 49 65 65 45
1 2 Ivysaur w Poison 60 62 63 80 80 60
64/86
In [216]: df2['Type 2'] = df2['Type 2'].replace({"Poison": "Venom"})
df2.head()
Out[216]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gener
f
0 1 Bulbasaur Medo Veno 45 49 49 65 65 45
1 2 Ivysaur w m 60 62 63 80 80 60
Medo Veno
w m
Out[217]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gen
f
0 1 Bulbasaur Medow DANGER 45 49 49 65 65 45
65/86
In [218]: df2.loc[df2['Type 2'] == 'DANGER', 'Name2'] =np.NaN
df2.head(7)
Out[218]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gen
f
0 1 Bulbasaur Medow DANGER 45 49 49 65 65 45
Out[219]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gen
f
0 1 Bulbasaur Medow DANGER 45 49 49 65 65 45
66/86
In [220]: df2.loc[df2['Total'] > 100, ['Legendary','Name2']] = ['Aleart-1','Aleart-2']
df2.head(7)
Out[220]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gen
f
0 1 Bulbasaur Medow DANGER 45 49 49 65 65 45
7 Group By
In [221]: df = pd.read_csv('poke_updated.csv')
df.head()
Out[221]:
Unnamed: Type Type Sp. Sp.
0 # Name 1 2 HP Attack Defense Atk De Sp
f
0 0 1 Bulbasaur Grass Poison 45 49 49 65 65
67/86
In [222]: df.groupby(['Type 1']).mean().head(7)
Out[222]:
Unnamed:
0 # HP Attack Defense Sp. Atk Sp. Def Speed
Type
1
Out[223]:
Unnamed:
0 # HP Attack Defense Sp. Atk Sp. Def Speed
Type
1
Dragon 521.843750 474.375000 83.312500 112.125000 86.375000 96.843750 88.843750 83.031250 400.444444
Fighting 363.851852 69.851852 96.777778 65.925926 53.111111 64.703704 66.074074 392.312500 356.281250
Ground 73.781250 95.750000 84.843750 56.468750 62.750000 63.906250 431.840909 392.727273 65.363636
Rock 92.863636 100.795455 63.340909 75.477273 55.909091 486.296296 442.851852 65.222222 92.703704
Steel 126.370370 67.518519 80.629630 55.259259 507.387097 461.354839 66.806452 88.387097 70.225806
Dark 74.645161 69.516129 76.161290 360.942308 327.403846 69.903846 84.769231 67.769231 88.980769
68/86
In [224]: df.groupby(['Type 1']).mean().sort_values('Defense' , ascending = False).head(1
Out[224]:
Unnamed:
0 # HP Attack Defense Sp. Atk Sp. Def Speed
Type
1
Steel 486.296296 442.851852 65.222222 92.703704 126.370370 67.518519 80.629630 55.259259 431.840909
Rock 392.727273 65.363636 92.863636 100.795455 63.340909 75.477273 55.909091 521.843750 474.375000
Dragon 83.312500 112.125000 86.375000 96.843750 88.843750 83.031250 392.312500 356.281250 73.781250
Ground 95.750000 84.843750 56.468750 62.750000 63.906250 536.281250 486.500000 64.437500 73.781250
Ghost 81.187500 79.343750 76.468750 64.343750 333.312500 303.089286 72.062500 74.151786 72.946429
Water 74.812500 70.517857 65.964286 465.666667 423.541667 72.000000 72.750000 71.416667 77.541667
Ice 76.291667 63.458333 380.414286 344.871429 67.271429 73.214286 70.800000 77.500000 70.428571
Grass 61.928571 368.072464 334.492754 56.884058 70.971014 70.724638 53.869565 64.797101 61.681159
Dark
Out[225]:
Unnamed:
0 # HP Attack Defense Sp. Atk Sp. Def Speed
Type
1
69/86
In [226]: df.sum()
Out[227]:
Unnamed: Sp. Sp.
0 # HP Attack Defense Atk De Speed Generation Lege
f
Type
2
In [228]: df.count()
70/86
In [229]: df['count1'] = 0
df.groupby(['Type 2']).count()['count1']
Out[229]: Type 2
Bug 3
Dark 20
Dragon 18
Electric 6
Fairy 23
Fighting 26
Fire 12
Flying 97
Ghost 14
Grass 25
Ground 35
Ice 14
Normal 4
Poison 34
Psychic 33
Rock 14
Steel 22
Water 14
Name: count1, dtype: int64
In [230]: df['count1'] = 0
df.groupby(['Type 1']).count()['count1']
Out[230]: Type 1
Bug 69
Dark 31
Dragon 32
Electric 44
Fairy 17
Fighting 27
Fire 52
Flying 4
Ghost 32
Grass 70
Ground 32
Ice 24
Normal 98
Poison 28
Psychic 57
Rock 44
Steel 27
Water 112
Name: count1, dtype: int64
71/86
In [231]: df['count1'] = 0
df.groupby(['Type 1','Type 2','Legendary']).count()['count1']
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed
\
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 60 62 63 80 80 60
2 3 Venusaur Grass Poison 80 82 83 100 100 80
72/86
In [233]: df
Out[233]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Genera
HoopaHoopa f
798 720 Unbound Psychic Dark 80 160 60 170 130 80
Volcanion
799 721 Fire Water 80 110 120 130 90 70
Out[234]:
Type Type Sp. Sp.
# Name 1 2 HP Attack Defense Atk De Speed Gene
f
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45
CharizardMega
8 6 Charizard Y Fire Flying 78 104 78 159 115 100
Squirtle
10 8 Blastoise r NaN 59 63 80 65 80 58
14 11 Bug
Wate NaN 50 20 55 25 25 30
data = ([[80,7,88,6],[90,8,92,7],[89,7,91,8],[87,6,93,8]])
Out[235]:
2010 2015
India 80 7 88 6
USA 90 8 92 7
Russia 89 7 91 8
China 87 6 93 8
Out[236]:
2010 2015
India GDP 7 6
Literacy 80 88
USA GDP 8 7
Literacy 90 92
Russia GDP 7 8
Literacy 89 91
China GDP 6 8
Literacy 87 93
74/86
In [237]:▾ #Unstacks the row to columns
unst_df = st_df.unstack()
unst_df
Out[237]:
201 201
0 Literacy 5 Literacy
USA 8 90 7 92
Russia 7 89 8 91
China 6 87 8 93
Out[239]:
India USA Russia China
2010 GDP 7 8 7 6
Literacy 80 90 89 87
2015 GDP 6 7 8 8
Literacy 88 92 91 93
10 PIVOT Tables
75/86
In [240]: data = { 'Country':['India','USA','Russia' , 'China','India','USA' ,
'Russia','China','I 'Year':['2010','2010','2010' , '2010','2010','2010' ,
'2015','2015','2015' , '2 'Literacy/GDP':['GDP' , 'GDP' , 'GDP' ,
'GDP','Literacy' , 'Literacy', 'Literac 'Value':[7,8,7,6,80,90,89,87,6,7]} df7 =
pd.DataFrame(data,columns=['Country','Year','Literacy/GDP','Value']) df7
Out[240]:
Country Year Literacy/GDP Value
Out[241]:
Value
Year Literacy/GDP
2010 GDP 28
Literacy 170
2015 GDP 13
Literacy 176
76/86
In [242]:▾ #Pivot table with MEAN aggregation
pd.pivot_table(df7 , index= ['Year' , 'Literacy/GDP'] , aggfunc='mean')
Out[242]:
Value
Year Literacy/GDP
Literacy 85.0
Literacy 88.0
11 Hierarchical indexing
In [243]: df7.head()
Out[243]:
Country Year Literacy/GDP Value
77/86
In [244]: df8 = df7.set_index(['Year', 'Literacy/GDP'])
df8
Out[244]:
Country Value
Year Literacy/GDP
GDP USA 8
GDP Russia 7
GDP China 6
Literacy India 80
Literacy USA 90
Literacy China 87
GDP India 6
GDP USA 7
In [245]: df8.index
78/86
In [246]: df8.loc['2010']
Out[246]:
Country Value
Literacy/GDP
GDP India 7
GDP USA 8
GDP Russia 7
GDP China 6
Literacy India 80
Literacy USA 90
In [247]: df8.loc[['2010']]
Out[247]:
Country Value
Year Literacy/GDP
GDP USA 8
GDP Russia 7
GDP China 6
Literacy India 80
Literacy USA 90
In [248]: df8.loc['2015','Literacy']
D:\Software_installed\python 3.8.3\lib\site-packages\ipykernel_launcher.py:1: P
erformanceWarning: indexing past lexsort depth may impact performance.
"""Entry point for launching an IPython kernel.
Out[248]:
Country Value
Year Literacy/GDP
Literacy China 87
79/86
In [249]: df8 = df7.set_index(['Year', 'Literacy/GDP', 'Country'])
df8
Out[249]:
Value
USA 8
Russia 7
China 6
Literacy India 80
USA 90
China 87
India 6
GDP
USA 7
Out[250]:
Country Year Literacy/GDP Value
80/86
In [251]: df8=df7.set_index(['Year', 'Literacy/GDP'])
df8
Out[251]:
Country Value
Year Literacy/GDP
GDP USA 8
GDP Russia 7
GDP China 6
Literacy India 80
Literacy USA 90
Literacy China 87
GDP India 6
GDP USA 7
Out[252]:
Country Value
Literacy/GDP Year
2010 USA 8
2010 Russia 7
2010 China 6
2010 USA 90
2015 Russia 89
2015 China 87
2015 India 6
GDP
2015 USA 7
81/86
In [253]:▾ # Swaping the columns in Hierarchical index
df9 = df9.swaplevel('Year', 'Literacy/GDP')
df9
Out[253]:
Country Value
Year Literacy/GDP
GDP USA 8
GDP Russia 7
GDP China 6
Literacy India 80
Literacy USA 90
Literacy China 87
GDP India 6
GDP USA 7
13 Crosstab in Pandas
In [254]: df7.head()
Out[254]:
Country Year Literacy/GDP Value
82/86
In [255]: pd.crosstab(df7['Literacy/GDP'],df7.Value, margins =True)
Out[255]:
Value 6 7 8 80 87 89 90 All
Literacy/GDP
GDP 2 3 1 0 0 0 0 6
Literacy 0 0 0 1 1 1 1 4
All 2 3 1 1 1 1 1 10
Out[256]:
Literacy/GDP GDP Literacy All
Year
2010 4 2 6
2015 2 2 4
All 6 4 10
Out[257]:
Country China India Russia USA All
Year Literacy/GDP
2010 GDP 1 1 1 1 4
Literacy 0 1 0 1 2
2015 GDP 0 1 0 1 2
Literacy 1 0 1 0 2
All 2 3 2 3 10
83/86
In [258]: df8 = pd.DataFrame({'ID' :[1,2,3,4], 'Name':['Aryan' , 'Basitro' , 'Rose' , 'Jo
df8
Out[258]:
ID Name Score
0 1 Aryan 99
1 2 Basitro 66
2 3 Rose 44
3 4 John 33
Out[259]:
ID Name Score
0 5 Michelle 78
1 6 Ramiro 55
2 7 Vignesh 77
3 8 Damon 87
Out[260]:
ID Name Score
0 1 Aryan 99
1 2 Basitro 66
2 3 Rose 44
3 4 John 33
0 5 Michelle 78
1 6 Ramiro 55
2 7 Vignesh 77
3 8 Damon 87
84/86
In [261]:▾ # Row Bind with append() function
df8.append(df9)
Out[261]:
ID Name Score
0 1 Aryan 99
1 2 Basitro 66
2 3 Rose 44
3 4 John 33
0 5 Michelle 78
1 6 Ramiro 55
2 7 Vignesh 77
3 8 Damon 87
Out[262]:
ID Name
0 1 Aryan
1 2 Basitro
2 3 Rose
3 4 John
Out[263]:
Age Score
0 20 99
1 30 66
2 35 44
3 40 33
85/86
In [264]: pd.concat([df10,df11] , axis = 1)
Out[264]:
ID Name Age Score
0 1 Aryan 20 99
1 2 Basitro 30 66
2 3 Rose 35 44
3 4 John 40 33
86/86