PYTHON PANDAS
Creating DataFrame:
Syntax: DF.DataFrame(data,columns,index)
1. Creating Empty DataFrame
DF=pd.DataFrame( )
2. Creating DataFrame from 2D lists (List of Lists)
L=[ [ 10, 20, 30, 40], [12, 12, 13,14 ] ] Created a list L
DF=pd.DataFrame( L ) Passed it to DataFrame( )
3. Creating DataFrame from 2D dictionary (dictionary of dictionaries)
D={ 1: { 'a':10, 'b':20, 'd':120 },2 : { 'a':30, 'b':60, 'c':90} } Created a nested (2D) dictionary
DF=pd.DataFrame(D) Passed it to DataFrame( )
4. Creating DataFrame from ndarray
a1=np.array([91,2,3,4,5]) Created array a1
a2=np.array([10,20,30,40]) Created array a2
>>> DF=pd.DataFrame([a1,a2]) passed the arrays to DataFrame ( ) (in the form of list)
5. Creating DataFrame from List of Dictionaries
A=[{ 'a':10, 'b':20, 'd':120 },{ 'a':30, 'b':60, 'c':90} ]
>>>>DF=pd.DataFrame(A)
6. Creating DataFrame from Dictionary of Lists
A= {'Maths':[80,85,90],'Science':[78,89,88],'English':[85,74,65]}
>>>>DF=pd.DataFrame(A)
7. Creating DataFrame from Dictionary of Series
DF1=pd.DataFrame({'Maths':pd.Series([80,85,90],index=['a','b','c']),'Science':pd.Series([78,89,88],index=['a','b','c'
]),'English':pd.Series([85,74,65],index=['a','b','c'])})
8. Creating DataFrame from another DataFrame
Attributes of DataFrame:
1. index 6. shape
2. columns 7.values
3. axes 8. ndim
4. dtype 9. empty
5. size 10. T
Selecting/Accessing DataFrame:
1. Accessing a column DF[‘col_name’] or DF.col_name
2. Accessing Multiple columns DF[ [ list of column names] ]
3. Accessing a row DF.loc[ ‘row_label’ ]
4. Accessing Multiple rows DF.loc[ [ list of row_labels] ]
5. Accessing a subset of Dataframe DF.loc [ index , columns ]
index = list of row labels or slice (only labels or all i.e. : ) of row labels
columns = list of column labels or slice (only labels or all i.e. : ) of column labels
OR
DF.iloc[ row_start_index : row_end_index , col_start_index : col_end_index)
Note: When we use iloc[ ], then end index is excluded but while using loc[ ], the end index is included.
6. Selecting/Accessing individual value DF.colmn_name[ ‘row_label’ or index ]
Ex. DF.population[‘Delhi’] or DF.population[2]
Filtering Dataframe using operators:
1. Using index label: DataFrameName[DataFrameName[ col_name] operator value]]
Example DF[DF[‘Marks’]>90]]
DF[ (DF[‘Marks’]>90]) & (DF[‘Gender’] == “Female”]) ]
2. Using dot notation: DataFrameName[DataFrameName. col_name operator value]
Example DF[DF.Marks>90]
DF[(DF.Marks>90) & (DF.Gender == “Female”)]
3. Using .query( ) function: DataFrameName.query( ‘ col_name operator value’ )
Example DF.query( ‘ Maths > 90 & Science > 85’ )
Adding/Modifying Columns of a DataFrame:
DF[‘col_name’]= [List of values] will add/overite a column with given values
DF[‘new_col_name’]= Single_Value will add/overwrite a column with all values as same
DF.assign(colum_name=[ list of values]) will add/overite a column with given values
DF.loc[ : , ‘col_name’] = [ List of Values ] or a Single Value
Note: you may also use dot( . ) notation of modify/add a column Ex. DF.col_name = [list of values]
It the column name does not exist in the DataFrame, it will add a new column but if a column name already exist
with the same name, it will overwrite the values of that particular column.
If you are assigning a list of values, the values in the column required must match the number of values you have
assigned otherwise it will raise an error i.e. ValueError.
assign( ) function can be used to add/modify the values for multiple columns. (the column_name must be
written without quotation marks)
Adding/Modifying Rows of a DataFrame: use .loc( )
DF.loc[‘row_label’]= [ List of values ] will add/overite a row values with given values
DF.loc[‘row_label’]= Single_Value will add/overwrite a row values with all values as same
DF.at[‘row_label’] = [ List of values ]
DF.loc [ ‘row_label’ , start : end ] = [ List of values ] will only modify the given slice of row
Note: To update a subset of DataFrame: DF.loc[ row , columns ]
DF.loc[ list of row_labels/Slice, list of column_names or Slice]
Deleting Columns/Row of a DataFrame:
1. del DF[‘Col_Name’] it can only delete column
It will delete the column and update the values in DF. (the original DataFrame will be updated)
It can be used to drop the complete dataframe Ex. del DF
2. DF.drop[[list of columns/indexes],axis= 0/1) By default the axis=0
Drop function returns us the datframe after deleted values so you may store it in some variable. It will not affect
the original DF.
Renaming Rows/Columns:
DF.rename(columns={old_name:new_name,old_name:new_name,……},inplace=True/False) For columns
DF.rename(index={old_name:new_name,old_name:new_name,……},inplace=True/False) For Rows
Note: To rename both rows and columns DF.rename( {columns={ ……….}, index = { ………….} )
Boolean Indexing: For this first of all you need to create the DF with Boolean indexing.
To display rows with False index DF.loc[ 0 ] or DF.loc[ False ]
To display rows with True index DF.loc[ True ] or DF.loc[ 1 ]
Importing and Exporting Data between CSV files and DataFrame
Importing Data from CSV files: read_csv( ) function is used
read_csv( “ file path”, sep, header, names)
To read a CSV named ‘resultsheet.csv’ and store in a dataframe named ‘DF’
DF=pd.read_csv(“E:/Python programs/resultsheet.csv”, sep= “,”, header=0, names=[‘RNo’, ‘Name’, ‘Eng’, ‘Hin’])
Note: The default value of sep is space (Means the data items will be separated by space character)
Exporting Data from DataFrame to CSV File: to_csv( ) function is used
DF.to_csv( “ file path”, sep, header, index)
To create a CSV file ‘resultsheetnew’ from a dataframe named ‘DF’
DF.to_csv(“E:/Python programs/resultsheetnew.csv”, sep= “,”, header=True/False, index=True/False)
Note: By default the value of hearer and index attributes is True. If you don’t want column labels/ row index to be
stored in CSV file, you can make it False.
Series V/s Numpy Array:
Series Numpy Array
In series we can define our own labeled index to NumPy arrays are accessed by their integer
access elements of an array. These can be position using numbers only.
numbers or letters.
The elements can be indexed in descending The indexing starts with zero for the first
order also. element and the index is fixed.
If two series are not aligned, NaN or missing There is no concept of NaN values and if there
values are generated are no matching values in arrays, alignment
fails.
Series require more memory. NumPy occupies lesser memory.
.
*********************
PANDAS – 2
Descriptive Statistics:
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22 21 18 20 21
1 Raman 2 21 20 17 22 24
2 Raman 3 14 19 15 24 23
3 Zuhaire 1 20 17 22 24 19
4 Zuhaire 2 23 15 21 25 15
5 Zuhaire 3 22 18 19 23 13
6 Ashravy 1 23 19 20 15 22
7 Ashravy 2 24 22 24 17 21
8 Ashravy 3 12 25 19 21 23
9 Mishti 1 15 22 25 22 22
10 Mishti 2 18 21 25 24 23
11 Mishti 3 17 18 20 25 20
Calculating Maximum Value: df.max(numeric_only=True/False, axis=0/1)
To calculate maximum value in each column. By default, all these functions extract a value for each column i.e. axis=0.
However to find the descriptive statistics for each row, we need to specify axis=1 as its argument.
Example. To print maximum marks obtained in each subject in Unit Test 2.
>>> DFUT2=df[df.UT==2] or DFUT2=df[df[‘UT’]==2]
>>> DFUT2.max(numeric_only=True)
Calculating Minimum Value: df.min(numeric_only=True/False, axis=0/1)
Example. To display minimum marks obtained by Raman in all unit tests in Hindi.
dfRaman=df[df.Name=='Raman'] Extract the records of Raman only and store in
dfRamandfRamanHndi= dfRaman.Hindi Extract the record of Hindi from Raman’s
print(dfRamanHndi.max()) Apply function on finally extracted record
Calculating Sum of Values: df.sum(numeric_only=True/False, axis=0/1)
Example: Write python statement to print the total marks obtained by Raman in each subject.
>>> dfRaman=df[df.Name=='Raman']
>>> dfRaman[['Maths','Science','S.St','Hindi','Eng']].sum()
Example: Write python statement to print the total marks obtained by Raman in all subjects in each Unit Test.
>>> dfRaman=df[df.Name=='Raman']
>>> dfRaman[['Maths','Science','S.St','Hindi','Eng']].sum(axis=1)
Calculating Number of Values: df.count(numeric_only=True/False, axis=0/1)
Calculating mean (average): df.mean(numeric_only=True/False, axis=0/1)
mean() will display the average of the values of each column of a DataFrame. It is only applicable for numeric values.
Calculating median (Middle Value): df.median(numeric_only=True/False, axis=0/1)
median() will display the middle values of each column of a DataFrame. It is only applicable for numeric values.
Calculating Mode (Most appeared): df.mode(numeric_only=True/False, axis=0/1)
Mode is defined as the value that appears the most number of times in a data set.
Calculating Quartile: df.quantile( q , numeric_only, axis )
quantile() function is used to get the quartiles. It will output the quartile of each column or row of the DataFrame in four
parts i.e. the first quartile is 25% (parameter q = .25), the second quartile is 50% (Median), the third quartile is 75%
(parameter q = .75). By default, it will display the second quantile (median) of all numeric values.
position = (n-1) × q + 1 to locate the quartile position (Sorting of the dataset is required to calculate it)
Example: >>> Series1=pd.Series([15,18,10,22,23,42,41,36,80,75])
>>> Series1.quantile(.25)
19.0
N=10 (total number of values in data set) q = .25
(10-1) × .25 + 1 = 3.25 position
Arrange data in ascending order 10 15 18 22 23 36 41 42 75 80
3rd 4th positions
rd th
Difference 3 and 4 position values is 4, so fractional part will be multiplied by 4 (i.e. 18 + (4 × .25) = 19)
Example: Write the statement to display the first and third quartiles of all subjects.
>>> df[['Maths','Science','S.St','Hindi','Eng']].quantile([.25,.75])
Maths Science S.St Hindi Eng
0.25 16.50 18.00 18.75 20.75 19.75
0.75 22.25 21.25 22.50 24.00 23.00
Calculating Variance: df.var(numeric_only, axis)
Variance is the average of squared differences from the mean.
Calculating Standard Deviation: df.std((numeric_only, axis)
Standard deviation is calculated as the square root of the variance.
Data Aggregations: Aggregation means to transform the dataset and produce a single numeric value from an array.
Aggregation can be applied to one or more columns together. Aggregate functions are max(),min(), sum(), count(), std(),
var().
>>> df.aggregate('max') will calculate max for each column
>>> df.aggregate(['max','count']) will calculate max and total items for each column
>>> df['Maths'].aggregate(['max','min'],axis=0) will calculate max and min value of Maths column
>>> df[['Maths','Science']].aggregate('sum',axis=1) will calculate sum of Maths and Science in each row.
Sorting a DataFrame: df.sort_values( by = ‘colname’, axis = 0/1, ascending = True/False)
>>> df.sort_values(by=['Name']) will sort the df in ascending order of
Name
>>> dfUT2 = df[df.UT == 2] will sort the
>>> print(dfUT2.sort_values(by=['Science']))