Pandas Cheat Sheet
by Justin1209 (Justin1209) via cheatography.com/101982/cs/21202/
Import the Pandas Module

import pandas as pd

Create a DataFrame

# Method 1
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe'],
    'address': ['13 Main St.', '46 Maple Ave.'],
    'age': [34, 28]
})
# Method 2
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '9 Broadway', 51]],
    columns=['name', 'address', 'age'])

Loading and Saving CSVs

# Load a CSV file into a DataFrame
df = pd.read_csv('my-csv-file.csv')
# Save a DataFrame to a CSV file
df.to_csv('new-csv-file.csv')
# Load a DataFrame in chunks (for large datasets)
# Initialize the reader object:
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
# Get the first DataFrame chunk:
df_urb_pop = next(urb_pop_reader)

Converting Datatypes

# Convert argument to a numeric type
pd.to_numeric(arg, errors="raise")
# errors:
#   "raise"  -> raise an exception on invalid parsing
#   "coerce" -> invalid parsing will be set as NaN

Inspect a DataFrame

df.head(5)   First 5 rows
df.info()    Statistics of columns (row count, null values, datatype)

Reshape (for Scikit)

import numpy as np
nums = np.array(range(1, 11))
# -> [ 1  2  3  4  5  6  7  8  9 10]
nums = nums.reshape(-1, 1)
# -> [[ 1]
#     [ 2]
#     ...
#     [10]]

You can think of reshape() as rotating this array: rather than one big row of numbers, nums is now a big column of numbers, with one number in each row.

DataFrame for Select Columns / Rows

df = pd.DataFrame([
    ['January', 100, 100, 23, 100],
    ['February', 51, 45, 145, 45],
    ['March', 81, 96, 65, 96],
    ['April', 80, 80, 54, 180],
    ['May', 51, 54, 54, 154],
    ['June', 112, 109, 79, 129]],
    columns=['month', 'east', 'north', 'south', 'west'])

Select Columns

# Select one column
clinic_north = df.north
# Reshape values for scikit-learn:
clinic_north.values.reshape(-1, 1)
# Select multiple columns
clinic_north_south = df[['north', 'south']]

Make sure that you have a double set of brackets [[ ]], or this command won't work!
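The chunked-loading pattern from the Loading and Saving CSVs section can be sketched end to end. Since 'ind_pop_data.csv' is only an example file name, this sketch builds a small CSV in memory instead:

```python
import io
import pandas as pd

# Stand-in for a file on disk (the column names here are made up)
csv_data = io.StringIO("country,pop\nA,10\nB,20\nC,30\nD,40\nE,50\n")

# Read two rows at a time and aggregate chunk by chunk, so the
# full dataset never has to fit in memory at once
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['pop'].sum()

print(total)  # 150
```

Each iteration yields an ordinary DataFrame, so any of the operations on this sheet can be applied per chunk.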
By Justin1209 (Justin1209), cheatography.com/justin1209/. Published 23rd November, 2019. Last updated 31st January, 2020.
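To make the errors argument of pd.to_numeric from the Converting Datatypes section concrete, a minimal sketch with made-up values:

```python
import pandas as pd

# A column read from a messy file often mixes numbers and junk strings
s = pd.Series(['1', '2', 'three', '4'])

# errors="coerce" turns unparseable entries into NaN instead of raising
nums = pd.to_numeric(s, errors='coerce')

print(nums.isna().sum())  # 1 -- only 'three' failed to parse
```

With the default errors="raise", the same call would throw a ValueError on 'three'.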
Select Rows

# Select one row
march = df.iloc[2]
# Select multiple rows
jan_feb_march = df.iloc[:3]
feb_march_april = df.iloc[1:4]
may_june = df.iloc[-2:]
# Select rows with logic
january = df[df.month == 'January']
# -> <, >, <=, >=, !=, ==
march_april = df[(df.month == 'March') | (df.month == 'April')]
# -> &, |
january_february_march = df[df.month.isin(['January', 'February', 'March'])]
# -> column_name.isin([" ", " "])

Selecting a subset of a DataFrame often results in non-consecutive indices. Using .reset_index() will create a new DataFrame and move the old indices into a new column called index. Use .reset_index(drop=True) if you don't need the index column. Use .reset_index(inplace=True) to prevent a new DataFrame from being created.

Adding a Column

df = pd.DataFrame([
    [1, '3 inch screw', 0.5, 0.75],
    [2, '2 inch nail', 0.10, 0.25],
    [3, 'hammer', 3.00, 5.50],
    [4, 'screwdriver', 2.50, 3.00]],
    columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price'])
# Add a column with specified row values
df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']
# Add a column with the same value in every row
df['Is taxed?'] = 'Yes'
# Add a column with a calculation
df['Revenue'] = df['Price'] - df['Cost to Manufacture']

Performing Column Operations

df = pd.DataFrame([
    ['JOHN SMITH', 'john.smith@gmail.com'],
    ['Jane Doe', 'jdoe@yahoo.com'],
    ['joe schmo', 'joeschmo@hotmail.com']],
    columns=['Name', 'Email'])
# Change a column with an operation
df['Name'] = df.Name.apply(str.lower)
# -> str.lower, str.upper
# Perform a lambda operation on a column
get_last_name = lambda x: x.split(" ")[-1]
df['last_name'] = df.Name.apply(get_last_name)

Performing an Operation on Multiple Columns

df = pd.DataFrame([
    ["Apple", 1.00, "No"],
    ["Milk", 4.20, "No"],
    ["Paper Towels", 5.00, "Yes"],
    ["Light Bulbs", 3.75, "Yes"]],
    columns=["Item", "Price", "Is taxed?"])
# Lambda function applied row by row
df['Price with Tax'] = df.apply(lambda row:
        row['Price'] * 1.075
        if row['Is taxed?'] == 'Yes'
        else row['Price'],
    axis=1)

We apply a lambda to rows, as opposed to columns, when we want to perform functionality that needs to access more than one column at a time.
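The row-wise lambda pattern above can be checked end to end on a two-row frame (items and prices are made up):

```python
import pandas as pd

df = pd.DataFrame([
    ["Apple", 1.00, "No"],
    ["Light Bulbs", 4.00, "Yes"]],
    columns=["Item", "Price", "Is taxed?"])

# axis=1 hands each row to the function, so it can read
# several columns at once ('Price' and 'Is taxed?' here)
df['Price with Tax'] = df.apply(
    lambda row: row['Price'] * 1.075 if row['Is taxed?'] == 'Yes'
    else row['Price'],
    axis=1)

print(df['Price with Tax'].tolist())
```

Only the taxed row is scaled by 1.075; the untaxed price passes through unchanged.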
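A minimal illustration of the .reset_index() notes from the Select Rows section, using made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'month': ['January', 'February', 'March', 'April'],
                   'east': [100, 51, 81, 80]})

# Filtering keeps the original row labels,
# so the indices become non-consecutive
subset = df[df.east > 79]
print(subset.index.tolist())  # [0, 2, 3]

# drop=True discards the old index instead of keeping it as a column
subset = subset.reset_index(drop=True)
print(subset.index.tolist())  # [0, 1, 2]
```

Without drop=True, the old labels would survive as a new column named index.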
Rename Columns

# Method 1
df.columns = ['NewName_1', 'NewName_2', 'NewName_3', ...]
# Method 2
df.rename(columns={
    'OldName_1': 'NewName_1',
    'OldName_2': 'NewName_2'
}, inplace=True)

Using inplace=True lets us edit the original DataFrame.

Column Statistics

Mean (average)            df.column.mean()
Median                    df.column.median()
Minimum value             df.column.min()
Maximum value             df.column.max()
Number of values          df.column.count()
Number of unique values   df.column.nunique()
Standard deviation        df.column.std()
List of unique values     df.column.unique()

Series vs. Dataframes

print(type(clinic_north))
# <class 'pandas.core.series.Series'>
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
print(type(clinic_north_south))
# <class 'pandas.core.frame.DataFrame'>

In pandas, a Series is a one-dimensional object that can hold any type of data. A DataFrame is a two-dimensional object that can hold multiple columns of different types of data. A single column of a DataFrame is a Series, and a DataFrame is a container of one or more Series objects.

Calculating Aggregate Functions

# Group by
grouped = df.groupby(['col1', 'col2']).col3.measurement().reset_index()
# -> group by col1 and col2, calculate values of col3
# Percentile
high_earners = df.groupby('category').wage \
    .apply(lambda x: np.percentile(x, 75)) \
    .reset_index()
# np.percentile can calculate any percentile over an array of values

Don't forget reset_index() at the end of a groupby operation.

Pivot Tables

orders = pd.read_csv('orders.csv')
shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()
shoe_counts_pivot = shoe_counts.pivot(
    index='shoe_type',
    columns='shoe_color',
    values='id').reset_index()

We have to build a temporary table where we group by the columns we want to include in the pivot table.

Merge (Same Column Name)

sales = pd.read_csv('sales.csv')
targets = pd.read_csv('targets.csv')
men_women = pd.read_csv('men_women_sales.csv')
# Method 1
sales_targets = pd.merge(sales, targets, how=" ")
# how: "inner" (default), "outer", "left", "right"
# Method 2 (method chaining)
all_data = sales.merge(targets).merge(men_women)
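Since orders.csv is only an example file name, the groupby-then-pivot pattern from the Pivot Tables section can be verified on a small stand-in frame (values made up):

```python
import pandas as pd

# Stand-in for orders.csv
orders = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'shoe_type': ['boot', 'boot', 'sandal', 'sandal'],
    'shoe_color': ['black', 'brown', 'black', 'black'],
})

# Temporary table: group by the columns the pivot will use, then count
shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()

# Spread shoe_color out into columns, one row per shoe_type
pivot = shoe_counts.pivot(index='shoe_type',
                          columns='shoe_color',
                          values='id').reset_index()

print(pivot)
```

Combinations that never occur (brown sandals here) show up as NaN in the pivoted table.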
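To make the how options of pd.merge concrete, a small sketch comparing inner and outer joins (keys and values made up):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'key': [2, 3, 4], 'b': ['p', 'q', 'r']})

# "inner" (the default) keeps only keys present in both frames
inner = pd.merge(left, right, how='inner')

# "outer" keeps every key from both frames and fills gaps with NaN
outer = pd.merge(left, right, how='outer')

print(len(inner), len(outer))  # 2 4
```

With no on= argument, pandas merges on the columns the two frames share, here just key.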
Inner Merge (Different Column Name)

orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')
# Method 1: rename columns so they match
orders_products = pd.merge(orders,
    products.rename(columns={'id': 'product_id'}),
    how=" ").reset_index()
# how: "inner" (default), "outer", "left", "right"
# Method 2: merge on differently named columns
orders_products = pd.merge(orders, products,
    left_on="product_id",
    right_on="id",
    suffixes=["_orders", "_products"])

With Method 2 we would otherwise end up with two columns called id. Pandas won't let you have two columns with the same name, so it changes them to id_x and id_y; the keyword suffixes lets us pick more useful names.

Melt

pd.melt(frame, id_vars, value_vars, var_name, value_name='value')

id_vars: column(s) to use as identifier variables.
value_vars: column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name: name to use for the 'variable' column.
value_name: name to use for the 'value' column.

melt() unpivots a DataFrame from wide to long format, optionally leaving identifiers set.

Assert Statements

# Test if country is of type object
assert gapminder.country.dtypes == object
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64
# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()
# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

Concatenate

bakery = pd.read_csv('bakery.csv')
ice_cream = pd.read_csv('ice_cream.csv')
menu = pd.concat([bakery, ice_cream])
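The melt() signature above can be tried on a tiny wide-format frame (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'month': ['January', 'February'],
                   'east': [100, 51],
                   'west': [100, 45]})

# Unpivot the two clinic columns into long format: one row per
# (month, clinic) pair, with the counts in a single 'visits' column
long_df = pd.melt(df, id_vars='month',
                  value_vars=['east', 'west'],
                  var_name='clinic', value_name='visits')

print(long_df.shape)  # (4, 3)
```

Two rows times two value_vars gives four rows, with columns month, clinic, and visits.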
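One detail worth knowing about pd.concat: by default it keeps each source frame's indices, so the result can contain duplicate row labels. A small sketch with made-up menu data showing ignore_index=True:

```python
import pandas as pd

# Stand-ins for bakery.csv and ice_cream.csv
bakery = pd.DataFrame({'item': ['bagel', 'scone'], 'price': [2.0, 3.0]})
ice_cream = pd.DataFrame({'item': ['cone'], 'price': [4.0]})

# ignore_index=True renumbers the result 0..n-1 instead of
# keeping each frame's original index labels
menu = pd.concat([bakery, ice_cream], ignore_index=True)

print(menu.index.tolist())  # [0, 1, 2]
```

Without ignore_index=True, the index here would be [0, 1, 0], since ice_cream starts its own numbering at 0.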