Pandas Cheat Sheet
By Ammar gamal
AI Student at Kafrelsheikh University
Focus: data sets using Python
---------------------------
Pandas Introduction
- What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and Data processing.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was
created by Wes McKinney in 2008.
- Why Use Pandas?
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
- What Can Pandas Do?
Pandas gives you answers about the data. Like:
Is there a correlation between two or more columns? What is average value? Max value? Min
value? Pandas are also able to delete rows that are not relevant, or contains wrong values, like
empty or NULL values. This is called cleaning the data.
- Where is the Pandas Codebase?
The source code for Pandas is located at this github repository
https://github.com/pandas-dev/pandas
Pandas Getting Started
- Installation of Pandas
If you have Python and PIP already installed on a system, then installation of Pandas is very easy.
Install it using this command:
C:\Users\Your Name>pip install pandas
If this command fails, then use a python distribution that already has Pandas installed like,
Anaconda, Spyder etc.
Import Pandas as pd
Pandas is usually imported under the pd alias.
import pandas as pd # type: ignore
# Checking Pandas Version
# The version string is stored under __version__ attribute.
print ( pd.__version__)
2.2.2
methods pandas
print(dir(pd))
['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype',
'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex',
'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags',
'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index',
'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype',
'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA',
'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex',
'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta',
'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype',
'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__',
'__doc__', '__docformat__', '__file__', '__git_version__',
'__loader__', '__name__', '__package__', '__path__', '__spec__',
'__version__', '_built_with_meson', '_config', '_is_numpy_dev',
'_libs', '_pandas_datetime_CAPI', '_pandas_parser_CAPI', '_testing',
'_typing', '_version_meson', 'annotations', 'api', 'array', 'arrays',
'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut',
'date_range', 'describe_option', 'errors', 'eval', 'factorize',
'from_dummies', 'get_dummies', 'get_option', 'infer_freq',
'interval_range', 'io', 'isna', 'isnull', 'json_normalize',
'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna',
'notnull', 'offsets', 'option_context', 'options', 'pandas',
'period_range', 'pivot', 'pivot_table', 'plotting', 'qcut',
'read_clipboard', 'read_csv', 'read_excel', 'read_feather',
'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json',
'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss',
'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata',
'read_table', 'read_xml', 'reset_option', 'set_eng_float_format',
'set_option', 'show_versions', 'test', 'testing', 'timedelta_range',
'to_datetime', 'to_numeric', 'to_pickle', 'to_timedelta', 'tseries',
'unique', 'util', 'value_counts', 'wide_to_long']
Pandas Series
What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensionalarray holding data of any type.
#pd.Series( data = list , index = list)
std_mark = [399,380,389,345]
x = pd.Series(std_mark)
x
0 399
1 380
2 389
3 345
dtype: int64
-Labels
If nothing else is specified, the values are labeled with their index number. First value has index
0, second value has index 1 etc.
This label can be used to access a specified value.
# Example
# Return the first value of the Series:
print(x[0])
399
- Create Labels
With the index argument, you can name your own labels.
student = ["Mahmoud","mohamed","mohsen","hamda"]
x= pd.Series(data = std_mark, index= student)
x
Mahmoud 399
mohamed 380
mohsen 389
hamda 345
dtype: int64
# When you have created labels, you can access an item by referring to
the label.
print(x["hamda"])
345
-Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
day1 420
day2 380
day3 390
dtype: int64
Note: The keys of the dictionary become the labels.
# To select only some of the items in the dictionary,
# use the index argument and specify only the items you want to
include in the Series.
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
day1 420
day2 380
dtype: int64
Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with
rows and columns.
print(pd.DataFrame())
Empty DataFrame
Columns: []
Index: []
#data column row
students = { 'std_id' :[1204,1205,1206,1207],
'std_name' :['Ammar','Gamal','Mone','Ahmed'],
'std_marks' :[400,405,408,350],
'percentage':[97.56,98.78,99.5,85.36]}
df = pd.DataFrame(students)
df
std_id std_name std_marks percentage
0 1204 Ammar 400 97.56
1 1205 Gamal 405 98.78
2 1206 Mone 408 99.50
3 1207 Ahmed 350 85.36
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas use the loc() attribute to return one or more specified row(s)
#refer to the row index:
print(df.loc[0])
std_id 1204
std_name Ammar
std_marks 400
percentage 97.56
Name: 0, dtype: object
# Example
# Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
# Note: When using [], the result is a Pandas DataFrame.
std_id std_name std_marks percentage
0 1204 Ammar 400 97.56
1 1205 Gamal 405 98.78
data = {
#column Row
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df
calories duration
day1 420 50
day2 380 40
day3 390 45
#refer to the named index:
df.loc["day2"]
calories 380
duration 40
Name: day2, dtype: int64
Pandas Read CSV
Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contains plain text and is a well know format that can be read by everyone including
Pandas.
In our examples we will be using a CSV file called 'data.csv'.
df = pd.read_csv('D:/Ai Diploma/insurance.csv')
df.head()
# if the number of rows is not specified, the head() method will
return the top 5 rows.
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
1 18.0 male 33.770 1 no southeast 1725.55230
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
Info About the Data
The DataFrames object has a method called info() , that gives you more information about the
data set.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1337 non-null float64
1 sex 1337 non-null object
2 bmi 1337 non-null float64
3 children 1338 non-null int64
4 smoker 1337 non-null object
5 region 1338 non-null object
6 charges 1337 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 73.3+ KB
Pandas - Cleaning Data
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates In this tutorial you will learn how to deal with all of them.
Pandas - Cleaning Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.isnull().sum()
age 1
sex 1
bmi 2
children 0
smoker 4
region 0
charges 2
dtype: int64
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.head(11)
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
1 18.0 male 33.770 1 no southeast NaN
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
5 31.0 female NaN 0 no southeast 3756.62160
6 46.0 female 33.440 1 no southeast 8240.58960
7 37.0 female 27.740 3 no northwest 7281.50560
8 37.0 male 29.830 2 no northeast 6406.41070
9 60.0 NaN 25.840 0 no northwest 28923.13692
10 25.0 male 26.220 0 NaN northeast 2721.32080
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
# ExampleGet
# Return a new Data Frame with no empty cells:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
new_df = df.dropna()
new_df.head(25)
# remove 1,5,9,10,11,12 ,23,31
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
6 46.0 female 33.440 1 no southeast 8240.58960
7 37.0 female 27.740 3 no northwest 7281.50560
8 37.0 male 29.830 2 no northeast 6406.41070
13 56.0 female 39.820 0 no southeast 11090.71780
14 27.0 male 42.130 0 yes southeast 39611.75770
15 19.0 male 24.600 1 no southwest 1837.23700
16 52.0 female 30.780 1 no northeast 10797.33620
17 23.0 male 23.845 0 no northeast 2395.17155
18 56.0 male 40.300 0 no southwest 10602.38500
19 30.0 male 35.300 0 yes southwest 36837.46700
20 60.0 female 36.005 0 no northeast 13228.84695
21 30.0 female 32.400 1 no southwest 4149.73600
22 18.0 male 34.100 0 no southeast 1137.01100
24 37.0 male 28.025 2 no northwest 6203.90175
25 59.0 female 27.720 3 no southeast 14001.13380
26 63.0 female 23.085 0 no northeast 14451.83515
27 55.0 female 32.775 2 no northwest 12268.63225
28 23.0 male 17.385 1 no northwest 2775.19215
29 31.0 male 36.300 2 yes southwest 38711.00000
30 22.0 male 35.600 0 yes southwest 35585.57600
32 19.0 female 28.600 5 no southwest 4687.79700
Note: By default, the dropna() method returns a new DataFrame, and
will not change the original.
If you want to change the original DataFrame, use the inplace = True argument:
df.dropna(inplace=True)
df.head(5)
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
6 46.0 female 33.440 1 no southeast 8240.58960
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.fillna(130, inplace = True)
df.head(15)
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
1 18.0 male 33.770 1 no southeast 130.00000
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
5 31.0 female 130.000 0 no southeast 3756.62160
6 46.0 female 33.440 1 no southeast 8240.58960
7 37.0 female 27.740 3 no northwest 7281.50560
8 37.0 male 29.830 2 no northeast 6406.41070
9 60.0 130 25.840 0 no northwest 28923.13692
10 25.0 male 26.220 0 130 northeast 2721.32080
11 62.0 female 26.290 0 130 southeast 27808.72510
12 23.0 male 34.400 0 130 southwest 1826.84300
13 56.0 female 39.820 0 no southeast 11090.71780
14 27.0 male 42.130 0 yes southeast 39611.75770
Replace Only For Specified Columns
The example above replaces all empty cells in the whole Data Frame.
To only replace empty values for one column, specify the column name for the DataFrame:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.fillna({"charges": 1300}, inplace = True) #1
df.fillna({"sex" :"male"}, inplace = True) #9
df.fillna({"smoker" :"no"}, inplace = True) #11,12
df.fillna({"bmi" :33.23}, inplace = True) #5
df.head(20)
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
1 18.0 male 33.770 1 no southeast 1300.00000
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
5 31.0 female 33.230 0 no southeast 3756.62160
6 46.0 female 33.440 1 no southeast 8240.58960
7 37.0 female 27.740 3 no northwest 7281.50560
8 37.0 male 29.830 2 no northeast 6406.41070
9 60.0 male 25.840 0 no northwest 28923.13692
10 25.0 male 26.220 0 no northeast 2721.32080
11 62.0 female 26.290 0 no southeast 27808.72510
12 23.0 male 34.400 0 no southwest 1826.84300
13 56.0 female 39.820 0 no southeast 11090.71780
14 27.0 male 42.130 0 yes southeast 39611.75770
15 19.0 male 24.600 1 no southwest 1837.23700
16 52.0 female 30.780 1 no northeast 10797.33620
17 23.0 male 23.845 0 no northeast 2395.17155
18 56.0 male 40.300 0 no southwest 10602.38500
19 30.0 male 35.300 0 yes southwest 36837.46700
Replace Using Mean, Median, or Mode
A common way to replace empty cells, is to calculate the mean, median or mode value of the
column.
Pandas uses the mean() median() and mode() methods to calculate the respective values for a
specified column:
Mean = the average value (the sum of all values divided by number of values).
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].mean()
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) #col 1 charge become mean
13260.776618008234
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.924000
1 18.0 male 33.770 1 no southeast 13260.776618
2 28.0 male 33.000 3 no southeast 4449.462000
3 33.0 male 22.705 0 no northwest 21984.470610
4 32.0 male 28.880 0 no northwest 3866.855200
Median = the value in the middle, after you have sorted all values ascending.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].median()
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) #col 1 charge become median
9382.033
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
1 18.0 male 33.770 1 no southeast 9382.03300
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
Mode = the value that appears most frequently.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].mode()[0]
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) #col 1 charge become mode
1639.5631
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
1 18.0 male 33.770 1 no southeast 1639.56310
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
To fix it, you have two options: remove the rows, or convert all cells in the columns into the same
format.
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
print(df.isnull().sum())
df.head(10)
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
age sex bmi children smoker region charges
0 19 female 27.90 0 yes southwest 16884.92400
1 6845 male 33.77 1 no mk 1725.55230
2 28 male 33.00 56 no southeast 4449.46200
3 254 male 36596.00 0 no 135 21984.47061
4 32 male 28.88 0 5 northwest 5.65510
5 7488 female 25.74 0 no kjguki 3756.62160
6 46 female 33.44 1 no southeast 8240.58960
7 37 female 27.74 3 no northwest 7281.50560
8 37 male 29.83 2 no northeast 6406.41070
9 60 female 25.84 236 no northwest 28923.13692
Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", and we
could just insert "45" in row 7:
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
df.loc[1,"age"] = 20
df.head(2)
age sex bmi children smoker region charges
0 19 female 27.90 0 yes southwest 16884.9240
1 20 male 33.77 1 no mk 1725.5523
For small data sets you might be able to replace the wrong data one by one, but not for big data
sets.
To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries
for legal values, and replace any values that are outside of the boundaries.
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
m = int( df["age"].mean())
print(m)
for x in df.index:
if df.loc[x, "age"] > 90:
df.loc[x, "age"] = m
df.head(6)
50
age sex bmi children smoker region charges
0 19 female 27.90 0 yes southwest 16884.92400
1 50 male 33.77 1 no mk 1725.55230
2 28 male 33.00 56 no southeast 4449.46200
3 50 male 36596.00 0 no 135 21984.47061
4 32 male 28.88 0 5 northwest 5.65510
5 50 female 25.74 0 no kjguki 3756.62160
Removing Rows
Another way of handling wrong data is to remove the rows that contains wrong data.
This way you do not have to find out what to replace them with, and there is a good chance you
do not need them to do your analyses.
for x in df.index:
if df.loc[x, "bmi"] > 50:
df.drop(x, inplace = True)
df.head() #3
age sex bmi children smoker region charges
0 19 female 27.90 0 yes southwest 16884.9240
1 50 male 33.77 1 no mk 1725.5523
2 28 male 33.00 56 no southeast 4449.4620
4 32 male 28.88 0 5 northwest 5.6551
5 50 female 25.74 0 no kjguki 3756.6216
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
By taking a look at our test data set, we can assume that row 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean values for each row:
df.duplicated().sum()
df.drop_duplicates(inplace = True)
df.duplicated().sum()
Pandas - Data Correlations
Finding Relationships A great aspect of the Pandas module is the corr() method.
The corr() method calculates the relationship between each column in your data set.
The examples in this page uses a CSV file called: 'data.csv'.
# ExampleGet
# Show the relationship between the columns:
df=pd.read_csv("D:/Ai Diploma/data.csv")
df.corr() #[-1,1]
Duration Pulse Maxpulse Calories
Duration 1.000000 -0.155408 0.009403 0.922717
Pulse -0.155408 1.000000 0.786535 0.025121
Maxpulse 0.009403 0.786535 1.000000 0.203813
Calories 0.922717 0.025121 0.203813 1.000000
Note: The corr() method ignores "not numeric" columns.
Result Explained
The Result of the corr() method is a table with a lot of numbers that represents how well the
relationship is between two columns.
The number varies from -1 to 1.
• 1 means that there is a 1 to 1 relationship (a perfect correlation), and for this data
set, each time a value went up in the first column, the other one went up as well.
• 0.9 is also a good relationship, and if you increase one value, the other will probably
increase as well.
• 0.9 would be just as good relationship as 0.9, but if you increase one value, the other
will probably go down.
• 0.2 means NOT a good relationship, meaning that if one value goes up does not
mean that the other will
Pandas - Plotting
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library to visualize the diagram on the screen.
Read more about Matplotlib The source code for Matplotlib is located at this
github repository https://github.com/matplotlib/matplotlib
import matplotlib.pyplot as plt
df = pd.read_csv('D:/Ai Diploma/data.csv')
df.plot()
plt.show()
Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x- and a y-axis.
In the example below we will use "Duration" for the x-axis and "Calories" for the y-axis.
Include the x and y arguments like this:
x = 'Duration', y = 'Calories'
df = pd.read_csv('D:/Ai Diploma/data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Remember: In the previous example, we learned that the correlation between "Duration" and
"Calories" was 0.922721, and we concluded with the fact that higher duration means more
calories burned.
By looking at the scatterplot, I will agree.
Let's create another scatterplot, where there is a bad relationship between the columns, like
"Duration" and "Maxpulse", with the correlation 0.009403:
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
plt.show()
Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
A histogram needs only one column.
A histogram shows us the frequency of each interval, e.g. how many workouts lasted between
50 and 60 minutes?
In the example below we will use the "Duration" column to create the histogram:
df["Duration"].plot(kind = 'hist')
<Axes: ylabel='Frequency'>
conclusion
A Pandas cheat sheet is an invaluable tool for data scientists, analysts, and developers working
with data in Python. By summarizing essential functions and commands, it helps users quickly
recall syntax, methods, and operations, improving productivity and efficiency. Key areas typically
covered in a Pandas cheat sheet include data import/export, DataFrame and Series
manipulation, filtering, sorting, aggregation, and handling missing data. With a cheat sheet on
hand, users can navigate complex data transformation tasks with greater ease, reinforcing their
knowledge of Pandas and streamlining their workflow. Whether you're a beginner or an
experienced user, a Pandas cheat sheet can be a quick reference to master data manipulation
and speed up your data analysis tasks.