Pandas Cheat Sheet
by Justin1209 (Justin1209) via cheatography.com/101982/cs/21202/
Import the Pandas Module

import pandas as pd

Create a DataFrame

# Method 1
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe'],
    'address': ['13 Main St.', '46 Maple Ave.'],
    'age': [34, 28]
})
# Method 2
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '9 Broadway', 51]
], columns=['name', 'address', 'age'])

Inspect a DataFrame

df.head(5)  First 5 rows
df.info()   Statistics of columns (row count, null values, datatype)

Converting Datatypes

# Convert argument to a numeric type
pd.to_numeric(arg, errors="raise")
# errors:
# "raise"  -> raise an exception on invalid parsing
# "coerce" -> invalid parsing will be set as NaN

Reshape (for Scikit)

import numpy as np
nums = np.array(range(1, 11))
# -> [ 1  2  3  4  5  6  7  8  9 10]
nums = nums.reshape(-1, 1)
# -> [[ 1]
#     [ 2]
#     ...
#     [10]]

You can think of reshape() as rotating this array: rather than one big row of numbers, nums is now a big column of numbers - there's one number in each row.
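A runnable sketch of the reshape above: same 1-10 range, with shape checks added purely for illustration.

```python
import numpy as np

# One-dimensional array of ten numbers - shape (10,)
nums = np.array(range(1, 11))

# Column vector for scikit-learn - shape (10, 1).
# The -1 tells NumPy to infer the row count from the data.
col = nums.reshape(-1, 1)

print(nums.shape)  # (10,)
print(col.shape)   # (10, 1)
```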
DataFrame for Select Columns / Rows

df = pd.DataFrame([
    ['January', 100, 100, 23, 100],
    ['February', 51, 45, 145, 45],
    ['March', 81, 96, 65, 96],
    ['April', 80, 80, 54, 180],
    ['May', 51, 54, 54, 154],
    ['June', 112, 109, 79, 129]
], columns=['month', 'east', 'north', 'south', 'west'])

Select Columns

# Select one column
clinic_north = df.north
# Reshape values for scikit-learn:
clinic_north.values.reshape(-1, 1)
# Select multiple columns
clinic_north_south = df[['north', 'south']]

Make sure that you have a double set of brackets [[ ]], or this command won't work!

Loading and Saving CSVs

# Load a CSV file into a DataFrame
df = pd.read_csv('my-csv-file.csv')
# Save a DataFrame to a CSV file
df.to_csv('new-csv-file.csv')
# Load a DataFrame in chunks (for large datasets)
# Initialize the reader object:
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
# Get the first DataFrame chunk:
df_urb_pop = next(urb_pop_reader)
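A self-contained sketch of chunked reading. The file contents below are invented so the example runs on its own; in practice you would point read_csv at your real dataset.

```python
import pandas as pd

# Write a tiny CSV first (illustrative data, not a real dataset)
with open('ind_pop_data.csv', 'w') as f:
    f.write('country,population\n')
    f.write('A,100\nB,200\nC,300\nD,400\nE,500\n')

# Read two rows at a time instead of loading the whole file
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=2)

# Each chunk is an ordinary DataFrame, so aggregates can be
# accumulated without holding all rows in memory
total = 0
for chunk in urb_pop_reader:
    total += chunk['population'].sum()

print(total)  # 1500
```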
By Justin1209 (Justin1209), cheatography.com/justin1209/
Published 23rd November, 2019. Last updated 31st January, 2020.
Select Rows

# Select one row
march = df.iloc[2]
# Select multiple rows
jan_feb_march = df.iloc[:3]
feb_march_april = df.iloc[1:4]
may_june = df.iloc[-2:]
# Select rows with logic
january = df[df.month == 'January']
# -> <, >, <=, >=, !=, ==
march_april = df[(df.month == 'March') | (df.month == 'April')]
# -> &, |
january_february_march = df[df.month.isin(['January', 'February', 'March'])]
# -> column_name.isin([" ", " "])

Selecting a subset of a DataFrame often results in non-consecutive indices. Using .reset_index() will create a new DataFrame and move the old indices into a new column called index.
Use .reset_index(drop=True) if you don't need the index column.
Use .reset_index(inplace=True) to prevent a new DataFrame from being created.
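The row-selection patterns above, sketched on a small invented DataFrame (the month and east columns mirror the example table):

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April'],
    'east': [100, 51, 81, 80],
})

# Boolean mask from a comparison
january = df[df.month == 'January']

# Combine conditions with | (each condition in parentheses)
march_april = df[(df.month == 'March') | (df.month == 'April')]

# isin() for membership, then renumber the indices
first_three = df[df.month.isin(['January', 'February', 'March'])]
first_three = first_three.reset_index(drop=True)

print(len(march_april))         # 2
print(list(first_three.index))  # [0, 1, 2]
```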
Adding a Column

df = pd.DataFrame([
    [1, '3 inch screw', 0.5, 0.75],
    [2, '2 inch nail', 0.10, 0.25],
    [3, 'hammer', 3.00, 5.50],
    [4, 'screwdriver', 2.50, 3.00]
], columns=['Product ID', 'Description',
            'Cost to Manufacture', 'Price'])
# Add a column with specified row values
df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']
# Add a column with the same value in every row
df['Is taxed?'] = 'Yes'
# Add a column with a calculation
df['Revenue'] = df['Price'] - df['Cost to Manufacture']

Performing Column Operations

df = pd.DataFrame([
    ['JOHN SMITH', 'john.smith@gmail.com'],
    ['Jane Doe', 'jdoe@yahoo.com'],
    ['joe schmo', 'joeschmo@hotmail.com']
], columns=['Name', 'Email'])
# Change a column with an operation
df['Name'] = df.Name.apply(str.lower)
# -> str.lower, str.upper
# Perform a lambda operation on a column
get_last_name = lambda x: x.split(" ")[-1]
df['last_name'] = df.Name.apply(get_last_name)

Performing an Operation on Multiple Columns

df = pd.DataFrame([
    ['Apple', 1.00, 'No'],
    ['Milk', 4.20, 'No'],
    ['Paper Towels', 5.00, 'Yes'],
    ['Light Bulbs', 3.75, 'Yes']
], columns=['Item', 'Price', 'Is taxed?'])
# Lambda function over whole rows
df['Price with Tax'] = df.apply(lambda row:
    row['Price'] * 1.075
    if row['Is taxed?'] == 'Yes'
    else row['Price'],
    axis=1
)

We apply a lambda to rows, as opposed to columns, when we want to perform functionality that needs to access more than one column at a time.
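A minimal, runnable version of the row-wise lambda above, with a two-item invented table:

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['Apple', 'Paper Towels'],
    'Price': [1.00, 5.00],
    'Is taxed?': ['No', 'Yes'],
})

# axis=1 hands the lambda a whole row, so it can read
# both 'Price' and 'Is taxed?' at once
df['Price with Tax'] = df.apply(
    lambda row: row['Price'] * 1.075
    if row['Is taxed?'] == 'Yes'
    else row['Price'],
    axis=1,
)

print(df['Price with Tax'].round(3).tolist())  # [1.0, 5.375]
```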
Rename Columns

# Method 1
df.columns = ['NewName_1', 'NewName_2', 'NewName_3', '...']
# Method 2
df.rename(columns={
    'OldName_1': 'NewName_1',
    'OldName_2': 'NewName_2'
}, inplace=True)

Using inplace=True lets us edit the original DataFrame.

Column Statistics

Mean (average)           df.column.mean()
Median                   df.column.median()
Minimum value            df.column.min()
Maximum value            df.column.max()
Number of values         df.column.count()
Number of unique values  df.column.nunique()
Standard deviation       df.column.std()
List of unique values    df.column.unique()

Calculating Aggregate Functions

# Group by
grouped = df.groupby(['col1', 'col2']).col3.measurement().reset_index()
# -> group by column1 and column2, calculate values of column3
# Percentile
high_earners = (df.groupby('category').wage
                  .apply(lambda x: np.percentile(x, 75))
                  .reset_index())
# np.percentile can calculate any percentile over an array of values

Don't forget reset_index() at the end of a groupby operation.
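The groupby-plus-reset_index pattern, sketched with invented category/wage data (np.percentile computes the 75th percentile per group, as in the snippet above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'a', 'b', 'b'],
    'wage': [10, 20, 30, 50],
})

# Simple aggregate: mean wage per category;
# reset_index() turns the result back into a plain DataFrame
mean_wage = df.groupby('category').wage.mean().reset_index()
print(mean_wage.wage.tolist())  # [15.0, 40.0]

# Custom aggregate via apply(): 75th percentile per category
high_earners = (df.groupby('category').wage
                  .apply(lambda x: np.percentile(x, 75))
                  .reset_index())
```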
Series vs. DataFrames

print(type(clinic_north))
# <class 'pandas.core.series.Series'>
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
print(type(clinic_north_south))
# <class 'pandas.core.frame.DataFrame'>

In pandas:
- a Series is a one-dimensional object that can contain any type of data.
- a DataFrame is a two-dimensional object that can hold multiple columns of different types of data.
A single column of a DataFrame is a Series, and a DataFrame is a container of one or more Series objects.

Pivot Tables

orders = pd.read_csv('orders.csv')
shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()
shoe_counts_pivot = shoe_counts.pivot(
    index='shoe_type',
    columns='shoe_color',
    values='id').reset_index()

We have to build a temporary table where we group by the columns we want to include in the pivot table.

Merge (Same Column Name)

sales = pd.read_csv('sales.csv')
targets = pd.read_csv('targets.csv')
men_women = pd.read_csv('men_women_sales.csv')
# Method 1
sales_targets = pd.merge(sales, targets, how=" ")
# how: "inner" (default), "outer", "left", "right"
# Method 2 (method chaining)
all_data = sales.merge(targets).merge(men_women)
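A runnable sketch of merging on a shared column, with two invented two-row tables standing in for the CSVs:

```python
import pandas as pd

sales = pd.DataFrame({'month': ['Jan', 'Feb'],
                      'revenue': [100, 110]})
targets = pd.DataFrame({'month': ['Jan', 'Feb'],
                        'target': [105, 100]})

# pd.merge joins on the shared 'month' column; how="inner"
# (the default) keeps only months present in both tables
sales_targets = pd.merge(sales, targets, how="inner")

# Method chaining performs the same inner join
chained = sales.merge(targets)

print(list(sales_targets.columns))  # ['month', 'revenue', 'target']
```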
Inner Merge (Different Column Name)

orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')
# Method 1: rename columns
orders_products = pd.merge(orders,
    products.rename(columns={'id': 'product_id'}),
    how=" ").reset_index()
# how: "inner" (default), "outer", "left", "right"
# Method 2: left_on / right_on
orders_products = pd.merge(orders, products,
    left_on="product_id",
    right_on="id",
    suffixes=["_orders", "_products"])

Method 2: If we use this syntax, we'll end up with two columns called id. Pandas won't let you have two columns with the same name, so it will change them to id_x and id_y. We can make them more useful by using the keyword suffixes.

Assert Statements

# Test if country is of type object
assert gapminder.country.dtypes == object
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64
# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()
# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

Melt

pd.melt(DataFrame, id_vars, value_vars, var_name, value_name='value')
id_vars: column(s) to use as identifier variables.
value_vars: column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name: name to use for the 'variable' column.
value_name: name to use for the 'value' column.

Unpivots a DataFrame from wide to long format, optionally leaving identifiers set.
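A sketch of melt() on an invented wide table (east/north mirror the clinic columns used earlier):

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February'],
    'east': [100, 51],
    'north': [100, 45],
})

# Wide -> long: one row per (month, clinic) combination
long_df = pd.melt(df, id_vars='month',
                  value_vars=['east', 'north'],
                  var_name='clinic', value_name='visits')

print(list(long_df.columns))  # ['month', 'clinic', 'visits']
print(len(long_df))           # 4
```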
Concatenate

bakery = pd.read_csv('bakery.csv')
ice_cream = pd.read_csv('ice_cream.csv')
menu = pd.concat([bakery, ice_cream])
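The concat above, runnable with two invented one-row menus; ignore_index=True is an optional extra that renumbers the combined rows:

```python
import pandas as pd

bakery = pd.DataFrame({'item': ['scone'], 'price': [3.50]})
ice_cream = pd.DataFrame({'item': ['cone'], 'price': [2.00]})

# Stack the rows; ignore_index=True renumbers them 0..n-1
menu = pd.concat([bakery, ice_cream], ignore_index=True)

print(len(menu))         # 2
print(list(menu.index))  # [0, 1]
```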