Introducing
DataFrames
D ATA M A N I P U L AT I O N W I T H PA N D A S
Richie Co on
Learning Solutions Architect at
DataCamp
What's the point of pandas?
Data Manipulation skill track
Data Visualization skill track
DATA MANIPULATION WITH PANDAS
Course outline
Chapter 1: DataFrames Chapter 3: Slicing and Indexing Data
Sorting and subse ing Subse ing using slicing
Creating new columns Indexes and subse ing using indexes
Chapter 2: Aggregating Data Chapter 4: Creating and Visualizing Data
Summary statistics Plo ing
Counting Handling missing data
Grouped summary statistics Reading data into a DataFrame
DATA MANIPULATION WITH PANDAS
pandas is built on NumPy and Matplotlib
DATA MANIPULATION WITH PANDAS
pandas is popular
1 h ps://pypistats.org/packages/pandas
DATA MANIPULATION WITH PANDAS
Rectangular data
Name Breed Color Height (cm) Weight (kg) Date of Birth
Bella Labrador Brown 56 25 2013-07-01
Charlie Poodle Black 43 23 2016-09-16
Lucy Chow Chow Brown 46 22 2014-08-25
Cooper Schnauzer Gray 49 17 2011-12-11
Max Labrador Black 59 29 2017-01-20
Stella Chihuahua Tan 18 2 2015-04-20
Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
pandas DataFrames
print(dogs)
name breed color height_cm weight_kg date_of_birth
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
4 Max Labrador Black 59 29 2017-01-20
5 Stella Chihuahua Tan 18 2 2015-04-20
6 Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
Exploring a DataFrame: .head()
dogs.head()
name breed color height_cm weight_kg date_of_birth
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
4 Max Labrador Black 59 29 2017-01-20
DATA MANIPULATION WITH PANDAS
Exploring a DataFrame: .info()
dogs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
# Column Non-Null Count Dtype
-- ------ -------------- -----
0 name 7 non-null object
1 breed 7 non-null object
2 color 7 non-null object
3 height_cm 7 non-null int64
4 weight_kg 7 non-null int64
5 date_of_birth 7 non-null object
dtypes: int64(2), object(4)
memory usage: 464.0+ bytes
DATA MANIPULATION WITH PANDAS
Exploring a DataFrame: .shape
dogs.shape
(7, 6)
DATA MANIPULATION WITH PANDAS
Exploring a DataFrame: .describe()
dogs.describe()
height_cm weight_kg
count 7.000000 7.000000
mean 49.714286 27.428571
std 17.960274 22.292429
min 18.000000 2.000000
25% 44.500000 19.500000
50% 49.000000 23.000000
75% 57.500000 27.000000
max 77.000000 74.000000
DATA MANIPULATION WITH PANDAS
Components of a DataFrame: .values
dogs.values
array([['Bella', 'Labrador', 'Brown', 56, 24, '2013-07-01'],
['Charlie', 'Poodle', 'Black', 43, 24, '2016-09-16'],
['Lucy', 'Chow Chow', 'Brown', 46, 24, '2014-08-25'],
['Cooper', 'Schnauzer', 'Gray', 49, 17, '2011-12-11'],
['Max', 'Labrador', 'Black', 59, 29, '2017-01-20'],
['Stella', 'Chihuahua', 'Tan', 18, 2, '2015-04-20'],
['Bernie', 'St. Bernard', 'White', 77, 74, '2018-02-27']],
dtype=object)
DATA MANIPULATION WITH PANDAS
Components of a DataFrame: .columns and .index
dogs.columns
Index(['name', 'breed', 'color', 'height_cm', 'weight_kg', 'date_of_birth'],
dtype='object')
dogs.index
RangeIndex(start=0, stop=7, step=1)
DATA MANIPULATION WITH PANDAS
pandas Philosophy
There should be one -- and preferably only one -- obvious way to do it.
- The Zen of Python by Tim Peters, Item 13
1 h ps://www.python.org/dev/peps/pep-0020/
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Sorting and
subsetting
D ATA M A N I P U L AT I O N W I T H PA N D A S
Richie Co on
Learning Solutions Architect at
DataCamp
Sorting
dogs.sort_values("weight_kg")
name breed color height_cm weight_kg date_of_birth
5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
Sorting in descending order
dogs.sort_values("weight_kg", ascending=False)
name breed color height_cm weight_kg date_of_birth
6 Bernie St. Bernard White 77 74 2018-02-27
4 Max Labrador Black 59 29 2017-01-20
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
5 Stella Chihuahua Tan 18 2 2015-04-20
DATA MANIPULATION WITH PANDAS
Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm"])
name breed color height_cm weight_kg date_of_birth
5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
name breed color height_cm weight_kg date_of_birth
5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
0 Bella Labrador Brown 56 24 2013-07-01
2 Lucy Chow Chow Brown 46 24 2014-08-25
1 Charlie Poodle Black 43 24 2016-09-16
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
Subsetting columns
dogs["name"]
0 Bella
1 Charlie
2 Lucy
3 Cooper
4 Max
5 Stella
6 Bernie
Name: name, dtype: object
DATA MANIPULATION WITH PANDAS
Subsetting multiple columns
dogs[["breed", "height_cm"]] cols_to_subset = ["breed", "height_cm"]
dogs[cols_to_subset]
breed height_cm
0 Labrador 56 breed height_cm
1 Poodle 43 0 Labrador 56
2 Chow Chow 46 1 Poodle 43
3 Schnauzer 49 2 Chow Chow 46
4 Labrador 59 3 Schnauzer 49
5 Chihuahua 18 4 Labrador 59
6 St. Bernard 77 5 Chihuahua 18
6 St. Bernard 77
DATA MANIPULATION WITH PANDAS
Subsetting rows
dogs["height_cm"] > 50
0 True
1 False
2 False
3 False
4 True
5 False
6 True
Name: height_cm, dtype: bool
DATA MANIPULATION WITH PANDAS
Subsetting rows
dogs[dogs["height_cm"] > 50]
name breed color height_cm weight_kg date_of_birth
0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
Subsetting based on text data
dogs[dogs["breed"] == "Labrador"]
name breed color height_cm weight_kg date_of_birth
0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20
DATA MANIPULATION WITH PANDAS
Subsetting based on dates
dogs[dogs["date_of_birth"] > "2015-01-01"]
name breed color height_cm weight_kg date_of_birth
1 Charlie Poodle Black 43 24 2016-09-16
4 Max Labrador Black 59 29 2017-01-20
5 Stella Chihuahua Tan 18 2 2015-04-20
6 Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
Subsetting based on multiple conditions
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"
dogs[is_lab & is_brown]
name breed color height_cm weight_kg date_of_birth
0 Bella Labrador Brown 56 24 2013-07-01
dogs[ (dogs["breed"] == "Labrador") & (dogs["color"] == "Brown") ]
DATA MANIPULATION WITH PANDAS
Subsetting using .isin()
is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
dogs[is_black_or_brown]
name breed color height_cm weight_kg date_of_birth
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
4 Max Labrador Black 59 29 2017-01-20
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
New columns
D ATA M A N I P U L AT I O N W I T H PA N D A S
Richie Co on
Learning Solutions Architect at
DataCamp
Adding a new column
dogs["height_m"] = dogs["height_cm"] / 100
print(dogs)
name breed color height_cm weight_kg date_of_birth height_m
0 Bella Labrador Brown 56 24 2013-07-01 0.56
1 Charlie Poodle Black 43 24 2016-09-16 0.43
2 Lucy Chow Chow Brown 46 24 2014-08-25 0.46
3 Cooper Schnauzer Gray 49 17 2011-12-11 0.49
4 Max Labrador Black 59 29 2017-01-20 0.59
5 Stella Chihuahua Tan 18 2 2015-04-20 0.18
6 Bernie St. Bernard White 77 74 2018-02-27 0.77
DATA MANIPULATION WITH PANDAS
Doggy mass index
BMI = weight in kg/(height in m)2
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2
print(dogs.head())
name breed color height_cm weight_kg date_of_birth height_m bmi
0 Bella Labrador Brown 56 24 2013-07-01 0.56 76.530612
1 Charlie Poodle Black 43 24 2016-09-16 0.43 129.799892
2 Lucy Chow Chow Brown 46 24 2014-08-25 0.46 113.421550
3 Cooper Schnauzer Gray 49 17 2011-12-11 0.49 70.803832
4 Max Labrador Black 59 29 2017-01-20 0.59 83.309394
DATA MANIPULATION WITH PANDAS
Multiple manipulations
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)
bmi_lt_100_height[["name", "height_cm", "bmi"]]
name height_cm bmi
4 Max 59 83.309394
0 Bella 56 76.530612
3 Cooper 49 70.803832
5 Stella 18 61.728395
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Summary statistics
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
Summarizing numerical data
.median() , .mode()
dogs["height_cm"].mean()
.min() , .max()
49.714285714285715 .var() , .std()
.sum()
.quantile()
DATA MANIPULATION WITH PANDAS
Summarizing dates
Oldest dog:
dogs["date_of_birth"].min()
'2011-12-11'
Youngest dog:
dogs["date_of_birth"].max()
'2018-02-27'
DATA MANIPULATION WITH PANDAS
The .agg() method
def pct30(column):
return column.quantile(0.3)
dogs["weight_kg"].agg(pct30)
22.599999999999998
DATA MANIPULATION WITH PANDAS
Summaries on multiple columns
dogs[["weight_kg", "height_cm"]].agg(pct30)
weight_kg 22.6
height_cm 45.4
dtype: float64
DATA MANIPULATION WITH PANDAS
Multiple summaries
def pct40(column):
return column.quantile(0.4)
dogs["weight_kg"].agg([pct30, pct40])
pct30 22.6
pct40 24.0
Name: weight_kg, dtype: float64
DATA MANIPULATION WITH PANDAS
Cumulative sum
dogs["weight_kg"] dogs["weight_kg"].cumsum()
0 24 0 24
1 24 1 48
2 24 2 72
3 17 3 89
4 29 4 118
5 2 5 120
6 74 6 194
Name: weight_kg, dtype: int64 Name: weight_kg, dtype: int64
DATA MANIPULATION WITH PANDAS
Cumulative statistics
.cummax()
.cummin()
.cumprod()
DATA MANIPULATION WITH PANDAS
Walmart
sales.head()
store type dept date weekly_sales is_holiday temp_c fuel_price unemp
0 1 A 1 2010-02-05 24924.50 False 5.73 0.679 8.106
1 1 A 2 2010-02-05 50605.27 False 5.73 0.679 8.106
2 1 A 3 2010-02-05 13740.12 False 5.73 0.679 8.106
3 1 A 4 2010-02-05 39954.04 False 5.73 0.679 8.106
4 1 A 5 2010-02-05 32229.38 False 5.73 0.679 8.106
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Counting
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
Avoiding double counting
DATA MANIPULATION WITH PANDAS
Vet visits
print(vet_visits)
date name breed weight_kg
0 2018-09-02 Bella Labrador 24.87
1 2019-06-07 Max Labrador 28.35
2 2018-01-17 Stella Chihuahua 1.51
3 2019-10-19 Lucy Chow Chow 24.07
.. ... ... ... ...
71 2018-01-20 Stella Chihuahua 2.83
72 2019-06-07 Max Chow Chow 24.01
73 2018-08-20 Lucy Chow Chow 24.40
74 2019-04-22 Max Labrador 28.54
DATA MANIPULATION WITH PANDAS
Dropping duplicate names
vet_visits.drop_duplicates(subset="name")
date name breed weight_kg
0 2018-09-02 Bella Labrador 24.87
1 2019-06-07 Max Chow Chow 24.01
2 2019-03-19 Charlie Poodle 24.95
3 2018-01-17 Stella Chihuahua 1.51
4 2019-10-19 Lucy Chow Chow 24.07
7 2019-03-30 Cooper Schnauzer 16.91
10 2019-01-04 Bernie St. Bernard 74.98
(6 2019-06-07 Max Labrador 28.35)
DATA MANIPULATION WITH PANDAS
Dropping duplicate pairs
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
print(unique_dogs)
date name breed weight_kg
0 2018-09-02 Bella Labrador 24.87
1 2019-03-13 Max Chow Chow 24.13
2 2019-03-19 Charlie Poodle 24.95
3 2018-01-17 Stella Chihuahua 1.51
4 2019-10-19 Lucy Chow Chow 24.07
6 2019-06-07 Max Labrador 28.35
7 2019-03-30 Cooper Schnauzer 16.91
10 2019-01-04 Bernie St. Bernard 74.98
DATA MANIPULATION WITH PANDAS
Easy as 1, 2, 3
unique_dogs["breed"].value_counts() unique_dogs["breed"].value_counts(sort=True)
Labrador 2 Labrador 2
Schnauzer 1 Chow Chow 2
St. Bernard 1 Schnauzer 1
Chow Chow 2 St. Bernard 1
Poodle 1 Poodle 1
Chihuahua 1 Chihuahua 1
Name: breed, dtype: int64 Name: breed, dtype: int64
DATA MANIPULATION WITH PANDAS
Proportions
unique_dogs["breed"].value_counts(normalize=True)
Labrador 0.250
Chow Chow 0.250
Schnauzer 0.125
St. Bernard 0.125
Poodle 0.125
Chihuahua 0.125
Name: breed, dtype: float64
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Grouped summary
statistics
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
Summaries by group
dogs[dogs["color"] == "Black"]["weight_kg"].mean()
dogs[dogs["color"] == "Brown"]["weight_kg"].mean()
dogs[dogs["color"] == "White"]["weight_kg"].mean()
dogs[dogs["color"] == "Gray"]["weight_kg"].mean()
dogs[dogs["color"] == "Tan"]["weight_kg"].mean()
26.0
24.0
74.0
17.0
2.0
DATA MANIPULATION WITH PANDAS
Grouped summaries
dogs.groupby("color")["weight_kg"].mean()
color
Black 26.5
Brown 24.0
Gray 17.0
Tan 2.0
White 74.0
Name: weight_kg, dtype: float64
DATA MANIPULATION WITH PANDAS
Multiple grouped summaries
dogs.groupby("color")["weight_kg"].agg([min, max, sum])
min max sum
color
Black 24 29 53
Brown 24 24 48
Gray 17 17 17
Tan 2 2 2
White 74 74 74
DATA MANIPULATION WITH PANDAS
Grouping by multiple variables
dogs.groupby(["color", "breed"])["weight_kg"].mean()
color breed
Black Chow Chow 25
Labrador 29
Poodle 24
Brown Chow Chow 24
Labrador 24
Gray Schnauzer 17
Tan Chihuahua 2
White St. Bernard 74
Name: weight_kg, dtype: int64
DATA MANIPULATION WITH PANDAS
Many groups, many summaries
dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
weight_kg height_cm
color breed
Black Labrador 29 59
Poodle 24 43
Brown Chow Chow 24 46
Labrador 24 56
Gray Schnauzer 17 49
Tan Chihuahua 2 18
White St. Bernard 74 77
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Pivot tables
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
Group by to pivot table
dogs.groupby("color")["weight_kg"].mean() dogs.pivot_table(values="weight_kg",
index="color")
color
Black 26 weight_kg
Brown 24 color
Gray 17 Black 26.5
Tan 2 Brown 24.0
White 74 Gray 17.0
Name: weight_kg, dtype: int64 Tan 2.0
White 74.0
DATA MANIPULATION WITH PANDAS
Different statistics
import numpy as np
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)
weight_kg
color
Black 26.5
Brown 24.0
Gray 17.0
Tan 2.0
White 74.0
DATA MANIPULATION WITH PANDAS
Multiple statistics
dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])
mean median
weight_kg weight_kg
color
Black 26.5 26.5
Brown 24.0 24.0
Gray 17.0 17.0
Tan 2.0 2.0
White 74.0 74.0
DATA MANIPULATION WITH PANDAS
Pivot on two variables
dogs.groupby(["color", "breed"])["weight_kg"].mean()
dogs.pivot_table(values="weight_kg", index="color", columns="breed")
breed Chihuahua Chow Chow Labrador Poodle Schnauzer St. Bernard
color
Black NaN NaN 29.0 24.0 NaN NaN
Brown NaN 24.0 24.0 NaN NaN NaN
Gray NaN NaN NaN NaN 17.0 NaN
Tan 2.0 NaN NaN NaN NaN NaN
White NaN NaN NaN NaN NaN 74.0
DATA MANIPULATION WITH PANDAS
Filling missing values in pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)
breed Chihuahua Chow Chow Labrador Poodle Schnauzer St. Bernard
color
Black 0 0 29 24 0 0
Brown 0 24 24 0 0 0
Gray 0 0 0 0 17 0
Tan 2 0 0 0 0 0
White 0 0 0 0 0 74
DATA MANIPULATION WITH PANDAS
Summing with pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed",
fill_value=0, margins=True)
breed Chihuahua Chow Chow Labrador Poodle Schnauzer St. Bernard All
color
Black 0 0 29 24 0 0 26.500000
Brown 0 24 24 0 0 0 24.000000
Gray 0 0 0 0 17 0 17.000000
Tan 2 0 0 0 0 0 2.000000
White 0 0 0 0 0 74 74.000000
All 2 24 26 24 17 74 27.714286
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Explicit indexes
D ATA M A N I P U L AT I O N W I T H PA N D A S
Richie Co on
Learning Solutions Architect at
DataCamp
The dog dataset, revisited
print(dogs)
name breed color height_cm weight_kg
0 Bella Labrador Brown 56 25
1 Charlie Poodle Black 43 23
2 Lucy Chow Chow Brown 46 22
3 Cooper Schnauzer Gray 49 17
4 Max Labrador Black 59 29
5 Stella Chihuahua Tan 18 2
6 Bernie St. Bernard White 77 74
DATA MANIPULATION WITH PANDAS
.columns and .index
dogs.columns
Index(['name', 'breed', 'color', 'height_cm', 'weight_kg'], dtype='object')
dogs.index
RangeIndex(start=0, stop=7, step=1)
DATA MANIPULATION WITH PANDAS
Setting a column as the index
dogs_ind = dogs.set_index("name")
print(dogs_ind)
breed color height_cm weight_kg
name
Bella Labrador Brown 56 25
Charlie Poodle Black 43 23
Lucy Chow Chow Brown 46 22
Cooper Schnauzer Grey 49 17
Max Labrador Black 59 29
Stella Chihuahua Tan 18 2
Bernie St. Bernard White 77 74
DATA MANIPULATION WITH PANDAS
Removing an index
dogs_ind.reset_index()
name breed color height_cm weight_kg
0 Bella Labrador Brown 56 25
1 Charlie Poodle Black 43 23
2 Lucy Chow Chow Brown 46 22
3 Cooper Schnauzer Grey 49 17
4 Max Labrador Black 59 29
5 Stella Chihuahua Tan 18 2
6 Bernie St. Bernard White 77 74
DATA MANIPULATION WITH PANDAS
Dropping an index
dogs_ind.reset_index(drop=True)
breed color height_cm weight_kg
0 Labrador Brown 56 25
1 Poodle Black 43 23
2 Chow Chow Brown 46 22
3 Schnauzer Grey 49 17
4 Labrador Black 59 29
5 Chihuahua Tan 18 2
6 St. Bernard White 77 74
DATA MANIPULATION WITH PANDAS
Indexes make subsetting simpler
dogs[dogs["name"].isin(["Bella", "Stella"])]
name breed color height_cm weight_kg
0 Bella Labrador Brown 56 25
5 Stella Chihuahua Tan 18 2
dogs_ind.loc[["Bella", "Stella"]]
breed color height_cm weight_kg
name
Bella Labrador Brown 56 25
Stella Chihuahua Tan 18 2
DATA MANIPULATION WITH PANDAS
Index values don't need to be unique
dogs_ind2 = dogs.set_index("breed")
print(dogs_ind2)
name color height_cm weight_kg
breed
Labrador Bella Brown 56 25
Poodle Charlie Black 43 23
Chow Chow Lucy Brown 46 22
Schnauzer Cooper Grey 49 17
Labrador Max Black 59 29
Chihuahua Stella Tan 18 2
St. Bernard Bernie White 77 74
DATA MANIPULATION WITH PANDAS
Subsetting on duplicated index values
dogs_ind2.loc["Labrador"]
name color height_cm weight_kg
breed
Labrador Bella Brown 56 25
Labrador Max Black 59 29
DATA MANIPULATION WITH PANDAS
Multi-level indexes a.k.a. hierarchical indexes
dogs_ind3 = dogs.set_index(["breed", "color"])
print(dogs_ind3)
name height_cm weight_kg
breed color
Labrador Brown Bella 56 25
Poodle Black Charlie 43 23
Chow Chow Brown Lucy 46 22
Schnauzer Grey Cooper 49 17
Labrador Black Max 59 29
Chihuahua Tan Stella 18 2
St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Subset the outer level with a list
dogs_ind3.loc[["Labrador", "Chihuahua"]]
name height_cm weight_kg
breed color
Labrador Brown Bella 56 25
Black Max 59 29
Chihuahua Tan Stella 18 2
DATA MANIPULATION WITH PANDAS
Subset inner levels with a list of tuples
dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]
name height_cm weight_kg
breed color
Labrador Brown Bella 56 25
Chihuahua Tan Stella 18 2
DATA MANIPULATION WITH PANDAS
Sorting by index values
dogs_ind3.sort_index()
name height_cm weight_kg
breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Controlling sort_index
dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])
name height_cm weight_kg
breed color
Poodle Black Charlie 43 23
Labrador Black Max 59 29
Brown Bella 56 25
Chow Chow Brown Lucy 46 22
Schanuzer Grey Cooper 49 17
Chihuahua Tan Stella 18 2
St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Now you have two problems
Index values are just data
Indexes violate "tidy data" principles
You need to learn two syntaxes
DATA MANIPULATION WITH PANDAS
Temperature dataset
date city country avg_temp_c
0 2000-01-01 Abidjan Côte D'Ivoire 27.293
1 2000-02-01 Abidjan Côte D'Ivoire 27.685
2 2000-03-01 Abidjan Côte D'Ivoire 29.061
3 2000-04-01 Abidjan Côte D'Ivoire 28.162
4 2000-05-01 Abidjan Côte D'Ivoire 27.547
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Slicing and
subsetting with .loc
and .iloc
D ATA M A N I P U L AT I O N W I T H PA N D A S
Richie Co on
Learning Solutions Architect at
DataCamp
Slicing lists
breeds = ["Labrador", "Poodle", breeds[2:5]
"Chow Chow", "Schnauzer",
"Labrador", "Chihuahua",
['Chow Chow', 'Schnauzer', 'Labrador']
"St. Bernard"]
breeds[:3]
['Labrador',
'Poodle',
'Chow Chow', ['Labrador', 'Poodle', 'Chow Chow']
'Schnauzer',
'Labrador', breeds[:]
'Chihuahua',
'St. Bernard']
['Labrador','Poodle','Chow Chow','Schnauzer',
'Labrador','Chihuahua','St. Bernard']
DATA MANIPULATION WITH PANDAS
Sort the index before you slice
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
print(dogs_srt)
name height_cm weight_kg
breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Slicing the outer index level
dogs_srt.loc["Chow Chow":"Poodle"] Full dataset
name height_cm weight_kg name height_cm weight_kg
breed color breed color
Chow Chow Brown Lucy 46 22 Chihuahua Tan Stella 18 2
Labrador Black Max 59 29 Chow Chow Brown Lucy 46 22
Brown Bella 56 25 Labrador Black Max 59 29
Poodle Black Charlie 43 23 Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
The nal value "Poodle" is included
St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Slicing the inner index levels badly
dogs_srt.loc["Tan":"Grey"] Full dataset
Empty DataFrame name height_cm weight_kg
Columns: [name, height_cm, weight_kg] breed color
Index: [] Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Slicing the inner index levels correctly
dogs_srt.loc[ Full dataset
("Labrador", "Brown"):("Schnauzer", "Grey")]
name height_cm weight_kg
name height_cm weight_kg breed color
breed color Chihuahua Tan Stella 18 2
Labrador Brown Bella 56 25 Chow Chow Brown Lucy 46 22
Poodle Black Charlie 43 23 Labrador Black Max 59 29
Schnauzer Grey Cooper 49 17 Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Slicing columns
dogs_srt.loc[:, "name":"height_cm"] Full dataset
name height_cm name height_cm weight_kg
breed color breed color
Chihuahua Tan Stella 18 Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 Chow Chow Brown Lucy 46 22
Labrador Black Max 59 Labrador Black Max 59 29
Brown Bella 56 Brown Bella 56 25
Poodle Black Charlie 43 Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Slice twice
dogs_srt.loc[ Full dataset
("Labrador", "Brown"):("Schnauzer", "Grey"),
"name":"height_cm"] name height_cm weight_kg
breed color
name height_cm Chihuahua Tan Stella 18 2
breed color Chow Chow Brown Lucy 46 22
Labrador Brown Bella 56 Labrador Black Max 59 29
Poodle Black Charlie 43 Brown Bella 56 25
Schanuzer Grey Cooper 49 Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74
DATA MANIPULATION WITH PANDAS
Dog days
dogs = dogs.set_index("date_of_birth").sort_index()
print(dogs)
name breed color height_cm weight_kg
date_of_birth
2011-12-11 Cooper Schanuzer Grey 49 17
2013-07-01 Bella Labrador Brown 56 25
2014-08-25 Lucy Chow Chow Brown 46 22
2015-04-20 Stella Chihuahua Tan 18 2
2016-09-16 Charlie Poodle Black 43 23
2017-01-20 Max Labrador Black 59 29
2018-02-27 Bernie St. Bernard White 77 74
DATA MANIPULATION WITH PANDAS
Slicing by dates
# Get dogs with date_of_birth between 2014-08-25 and 2016-09-16
dogs.loc["2014-08-25":"2016-09-16"]
name breed color height_cm weight_kg
date_of_birth
2014-08-25 Lucy Chow Chow Brown 46 22
2015-04-20 Stella Chihuahua Tan 18 2
2016-09-16 Charlie Poodle Black 43 23
DATA MANIPULATION WITH PANDAS
Slicing by partial dates
# Get dogs with date_of_birth between 2014-01-01 and 2016-12-31
dogs.loc["2014":"2016"]
name breed color height_cm weight_kg
date_of_birth
2014-08-25 Lucy Chow Chow Brown 46 22
2015-04-20 Stella Chihuahua Tan 18 2
2016-09-16 Charlie Poodle Black 43 23
DATA MANIPULATION WITH PANDAS
Subsetting by row/column number
print(dogs.iloc[2:5, 1:4]) Full dataset
breed color height_cm name breed color height_cm weight_kg
2 Chow Chow Brown 46 0 Bella Labrador Brown 56 25
3 Schnauzer Grey 49 1 Charlie Poodle Black 43 23
4 Labrador Black 59 2 Lucy Chow Chow Brown 46 22
3 Cooper Schnauzer Grey 49 17
4 Max Labrador Black 59 29
5 Stella Chihuahua Tan 18 2
6 Bernie St. Bernard White 77 74
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Working with pivot
tables
D ATA M A N I P U L AT I O N W I T H PA N D A S
Richie Co on
Learning Solutions Architect at
DataCamp
A bigger dog dataset
print(dog_pack)
breed color height_cm weight_kg
0 Boxer Brown 62.64 30.4
1 Poodle Black 46.41 20.4
2 Beagle Brown 36.39 12.4
3 Chihuahua Tan 19.70 1.6
4 Labrador Tan 54.44 36.1
.. ... ... ... ...
87 Boxer Gray 58.13 29.9
88 St. Bernard White 70.13 69.4
89 Poodle Gray 51.30 20.4
90 Beagle White 38.81 8.8
91 Beagle Black 33.40 13.5
DATA MANIPULATION WITH PANDAS
Pivoting the dog pack
dogs_height_by_breed_vs_color = dog_pack.pivot_table(
"height_cm", index="breed", columns="color")
print(dogs_height_by_breed_vs_color)
color Black Brown Gray Tan White
breed
Beagle 34.500000 36.4500 36.313333 35.740000 38.810000
Boxer 57.203333 62.6400 58.280000 62.310000 56.360000
Chihuahua 18.555000 NaN 21.660000 20.096667 17.933333
Chow Chow 51.262500 50.4800 NaN 53.497500 54.413333
Dachshund 21.186667 19.7250 NaN 19.375000 20.660000
Labrador 57.125000 NaN NaN 55.190000 55.310000
Poodle 48.036000 57.1300 56.645000 NaN 44.740000
St. Bernard 63.920000 65.8825 67.640000 68.334000 67.495000
DATA MANIPULATION WITH PANDAS
.loc[] + slicing is a power combo
dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]
color Black Brown Gray Tan White
breed
Chow Chow 51.262500 50.480 NaN 53.4975 54.413333
Dachshund 21.186667 19.725 NaN 19.3750 20.660000
Labrador 57.125000 NaN NaN 55.1900 55.310000
Poodle 48.036000 57.130 56.645 NaN 44.740000
DATA MANIPULATION WITH PANDAS
The axis argument
dogs_height_by_breed_vs_color.mean(axis="index")
color
Black 43.973563
Brown 48.717917
Gray 48.107667
Tan 44.934738
White 44.465208
dtype: float64
DATA MANIPULATION WITH PANDAS
Calculating summary stats across columns
dogs_height_by_breed_vs_color.mean(axis="columns")
breed
Beagle 36.362667
Boxer 59.358667
Chihuahua 19.561250
Chow Chow 52.413333
Dachshund 20.236667
Labrador 55.875000
Poodle 51.637750
St. Bernard 66.654300
dtype: float64
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Visualizing your
data
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
Histograms
import matplotlib.pyplot as plt
dog_pack["height_cm"].hist()
plt.show()
DATA MANIPULATION WITH PANDAS
Histograms
dog_pack["height_cm"].hist(bins=20) dog_pack["height_cm"].hist(bins=5)
plt.show() plt.show()
DATA MANIPULATION WITH PANDAS
Bar plots
avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
print(avg_weight_by_breed)
breed
Beagle 10.636364
Boxer 30.620000
Chihuahua 1.491667
Chow Chow 22.535714
Dachshund 9.975000
Labrador 31.850000
Poodle 20.400000
St. Bernard 71.576923
Name: weight_kg, dtype: float64
DATA MANIPULATION WITH PANDAS
Bar plots
avg_weight_by_breed.plot(kind="bar") avg_weight_by_breed.plot(kind="bar",
plt.show() title="Mean Weight by Dog Breed")
plt.show()
DATA MANIPULATION WITH PANDAS
Line plots
sully.head() sully.plot(x="date",
y="weight_kg",
date weight_kg kind="line")
0 2019-01-31 36.1 plt.show()
1 2019-02-28 35.3
2 2019-03-31 32.0
3 2019-04-30 32.9
4 2019-05-31 32.0
DATA MANIPULATION WITH PANDAS
Rotating axis labels
sully.plot(x="date", y="weight_kg", kind="line", rot=45)
plt.show()
DATA MANIPULATION WITH PANDAS
Scatter plots
dog_pack.plot(x="height_cm", y="weight_kg", kind="scatter")
plt.show()
DATA MANIPULATION WITH PANDAS
Layering plots
dog_pack[dog_pack["sex"]=="F"]["height_cm"].hist()
dog_pack[dog_pack["sex"]=="M"]["height_cm"].hist()
plt.show()
DATA MANIPULATION WITH PANDAS
Add a legend
dog_pack[dog_pack["sex"]=="F"]["height_cm"].hist()
dog_pack[dog_pack["sex"]=="M"]["height_cm"].hist()
plt.legend(["F", "M"])
plt.show()
DATA MANIPULATION WITH PANDAS
Transparency
dog_pack[dog_pack["sex"]=="F"]["height_cm"].hist(alpha=0.7)
dog_pack[dog_pack["sex"]=="M"]["height_cm"].hist(alpha=0.7)
plt.legend(["F", "M"])
plt.show()
DATA MANIPULATION WITH PANDAS
Avocados
print(avocados)
date type year avg_price size nb_sold
0 2015-12-27 conventional 2015 0.95 small 9626901.09
1 2015-12-20 conventional 2015 0.98 small 8710021.76
2 2015-12-13 conventional 2015 0.93 small 9855053.66
... ... ... ... ... ... ...
1011 2018-01-21 organic 2018 1.63 extra_large 1490.02
1012 2018-01-14 organic 2018 1.59 extra_large 1580.01
1013 2018-01-07 organic 2018 1.51 extra_large 1289.07
[1014 rows x 6 columns]
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Missing values
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
What's a missing value?
Name Breed Color Height (cm) Weight (kg) Date of Birth
Bella Labrador Brown 56 25 2013-07-01
Charlie Poodle Black 43 23 2016-09-16
Lucy Chow Chow Brown 46 22 2014-08-25
Cooper Schnauzer Gray 49 17 2011-12-11
Max Labrador Black 59 29 2017-01-20
Stella Chihuahua Tan 18 2 2015-04-20
Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
What's a missing value?
Name Breed Color Height (cm) Weight (kg) Date of Birth
Bella Labrador Brown 56 ? 2013-07-01
Charlie Poodle Black 43 23 2016-09-16
Lucy Chow Chow Brown 46 22 2014-08-25
Cooper Schnauzer Gray 49 ? 2011-12-11
Max Labrador Black 59 29 2017-01-20
Stella Chihuahua Tan 18 2 2015-04-20
Bernie St. Bernard White 77 74 2018-02-27
DATA MANIPULATION WITH PANDAS
Missing values in pandas DataFrames
print(dogs)
name breed color height_cm weight_kg date_of_birth
0 Bella Labrador Brown 56 NaN 2013-07-01
1 Charlie Poodle Black 43 24.0 2016-09-16
2 Lucy Chow Chow Brown 46 24.0 2014-08-25
3 Cooper Schnauzer Gray 49 NaN 2011-12-11
4 Max Labrador Black 59 29.0 2017-01-20
5 Stella Chihuahua Tan 18 2.0 2015-04-20
6 Bernie St. Bernard White 77 74.0 2018-02-27
DATA MANIPULATION WITH PANDAS
Detecting missing values
dogs.isna()
name breed color height_cm weight_kg date_of_birth
0 False False False False True False
1 False False False False False False
2 False False False False False False
3 False False False False True False
4 False False False False False False
5 False False False False False False
6 False False False False False False
DATA MANIPULATION WITH PANDAS
Detecting any missing values
dogs.isna().any()
name False
breed False
color False
height_cm False
weight_kg True
date_of_birth False
dtype: bool
DATA MANIPULATION WITH PANDAS
Counting missing values
dogs.isna().sum()
name 0
breed 0
color 0
height_cm 0
weight_kg 2
date_of_birth 0
dtype: int64
DATA MANIPULATION WITH PANDAS
Plotting missing values
import matplotlib.pyplot as plt
dogs.isna().sum().plot(kind="bar")
plt.show()
DATA MANIPULATION WITH PANDAS
Removing missing values
dogs.dropna()
name breed color height_cm weight_kg date_of_birth
1 Charlie Poodle Black 43 24.0 2016-09-16
2 Lucy Chow Chow Brown 46 24.0 2014-08-25
4 Max Labrador Black 59 29.0 2017-01-20
5 Stella Chihuahua Tan 18 2.0 2015-04-20
6 Bernie St. Bernard White 77 74.0 2018-02-27
DATA MANIPULATION WITH PANDAS
Replacing missing values
dogs.fillna(0)
name breed color height_cm weight_kg date_of_birth
0 Bella Labrador Brown 56 0.0 2013-07-01
1 Charlie Poodle Black 43 24.0 2016-09-16
2 Lucy Chow Chow Brown 46 24.0 2014-08-25
3 Cooper Schnauzer Gray 49 0.0 2011-12-11
4 Max Labrador Black 59 29.0 2017-01-20
5 Stella Chihuahua Tan 18 2.0 2015-04-20
6 Bernie St. Bernard White 77 74.0 2018-02-27
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Creating
DataFrames
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
Dictionaries
my_dict = { my_dict = {
"key1": value1, "title": "Charlotte's Web",
"key2": value2, "author": "E.B. White",
"key3": value3 "published": 1952
} }
my_dict["key1"] my_dict["title"]
value1 Charlotte's Web
DATA MANIPULATION WITH PANDAS
Creating DataFrames
From a list of dictionaries From a dictionary of lists
Constructed row by row Constructed column by column
DATA MANIPULATION WITH PANDAS
List of dictionaries - by row
name breed height (cm) weight (kg) date of birth
Ginger Dachshund 22 10 2019-03-14
Scout Dalmatian 59 25 2019-05-09
list_of_dicts = [
{"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
"weight_kg": 10, "date_of_birth": "2019-03-14"},
{"name": "Scout", "breed": "Dalmatian", "height_cm": 59,
"weight_kg": 25, "date_of_birth": "2019-05-09"}
]
DATA MANIPULATION WITH PANDAS
List of dictionaries - by row
name breed height (cm) weight (kg) date of birth
Ginger Dachshund 22 10 2019-03-14
Scout Dalmatian 59 25 2019-05-09
new_dogs = pd.DataFrame(list_of_dicts)
print(new_dogs)
name breed height_cm weight_kg date_of_birth
0 Ginger Dachshund 22 10 2019-03-14
1 Scout Dalmatian 59 25 2019-05-09
DATA MANIPULATION WITH PANDAS
Dictionary of lists - by column
date dict_of_lists = {
name breed height weight of
birth "name": ["Ginger", "Scout"],
"breed": ["Dachshund", "Dalmatian"],
2019-
Ginger Dachshund 22 10
03-14 "height_cm": [22, 59],
2019- "weight_kg": [10, 25],
Scout Dalmatian 59 25 05- "date_of_birth": ["2019-03-14",
09
"2019-05-09"]
}
new_dogs = pd.DataFrame(dict_of_lists)
Key = column name
Value = list of column values
DATA MANIPULATION WITH PANDAS
Dictionary of lists - by column
name breed height (cm) weight (kg) date of birth
Ginger Dachshund 22 10 2019-03-14
Scout Dalmatian 59 25 2019-05-09
print(new_dogs)
name breed height_cm weight_kg date_of_birth
0 Ginger Dachshund 22 10 2019-03-14
1 Scout Dalmatian 59 25 2019-05-09
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Reading and writing
CSVs
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
What's a CSV file?
CSV = comma-separated values
Designed for DataFrame-like data
Most database and spreadsheet programs can use them or create them
DATA MANIPULATION WITH PANDAS
Example CSV file
new_dogs.csv
name,breed,height_cm,weight_kg,d_o_b
Ginger,Dachshund,22,10,2019-03-14
Scout,Dalmatian,59,25,2019-05-09
DATA MANIPULATION WITH PANDAS
CSV to DataFrame
import pandas as pd
new_dogs = pd.read_csv("new_dogs.csv")
print(new_dogs)
name breed height_cm weight_kg date_of_birth
0 Ginger Dachshund 22 10 2019-03-14
1 Scout Dalmatian 59 25 2019-05-09
DATA MANIPULATION WITH PANDAS
DataFrame manipulation
new_dogs["bmi"] = new_dogs["weight_kg"] / (new_dogs["height_cm"] / 100) ** 2
print(new_dogs)
name breed height_cm weight_kg date_of_birth bmi
0 Ginger Dachshund 22 10 2019-03-14 206.611570
1 Scout Dalmatian 59 25 2019-05-09 71.818443
DATA MANIPULATION WITH PANDAS
DataFrame to CSV
new_dogs.to_csv("new_dogs_with_bmi.csv")
new_dogs_with_bmi.csv
name,breed,height_cm,weight_kg,d_o_b,bmi
Ginger,Dachshund,22,10,2019-03-14,206.611570
Scout,Dalmatian,59,25,2019-05-09,71.818443
DATA MANIPULATION WITH PANDAS
Let's practice!
D ATA M A N I P U L AT I O N W I T H PA N D A S
Wrap-up
D ATA M A N I P U L AT I O N W I T H PA N D A S
Maggie Matsui
Senior Content Developer at DataCamp
Recap
Chapter 1 Chapter 3
Subse ing and sorting Indexing
Adding new columns Slicing
Chapter 2 Chapter 4
Aggregating and grouping Visualizations
Summary statistics Reading and writing CSVs
DATA MANIPULATION WITH PANDAS
More to learn
Joining Data with pandas
Streamlined Data Ingestion with pandas
Analyzing Police Activity with pandas
Analyzing Marketing Campaigns with
pandas
DATA MANIPULATION WITH PANDAS
Congratulations!
D ATA M A N I P U L AT I O N W I T H PA N D A S