KEMBAR78
Introduction To Pandas | PDF | Computers | Technology & Engineering
0% found this document useful (0 votes)
91 views26 pages

Introduction To Pandas

The document introduces pandas, a Python library for data analysis. It provides two main data structures: Series, which is a one-dimensional labeled array; and DataFrames, which are two-dimensional labeled data structures. The document demonstrates how to create Series and DataFrames from various data types like lists, dictionaries, and how to perform basic operations on them like selecting columns, inserting new columns, and deleting columns.

Uploaded by

Okewunmi Paul
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
91 views26 pages

Introduction To Pandas

The document introduces pandas, a Python library for data analysis. It provides two main data structures: Series, which is a one-dimensional labeled array; and DataFrames, which are two-dimensional labeled data structures. The document demonstrates how to create Series and DataFrames from various data types like lists, dictionaries, and how to perform basic operations on them like selecting columns, inserting new columns, and deleting columns.

Uploaded by

Okewunmi Paul
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 26

5/9/2019 Introduction to Pandas

Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and
transformation operations that are critical in working with data in Python.

pandas build upon numpy and scipy providing easy-to-use data structures and data manipulation
functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to
these two data structures and data ingestion, the key features of pandas this notebook covers are:

Generating descriptive statistics on data


Data cleaning using built in pandas functions
Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
Merging multiple datasets using dataframes

Additional Recommended Resources:

Python for Data Analysis by Wes McKinney


Python Data Science Handbook by Jake VanderPlas

Let's get started with our first pandas notebook!

Import Libraries
In [2]: import pandas as pd

Introduction to pandas Data Structures


pandas has two main data structures it uses, namely, Series and DataFrames.

pandas Series
pandas Series one-dimensional labeled array.

In [3]: ser = pd.Series([100, 'foo', 300, 'bar', 500], ['paul', 'bob', 'wale', 'dan', 'bo

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 1/26


5/9/2019 Introduction to Pandas

In [4]: ser

Out[4]: paul 100


bob foo
wale 300
dan bar
bola 500
dtype: object

In [5]: ser.index

Out[5]: Index(['paul', 'bob', 'wale', 'dan', 'bola'], dtype='object')

In [6]: ser.loc[['wale','bob']]

Out[6]: wale 300


bob foo
dtype: object

In [7]: ser[[4, 3, 1]]

Out[7]: bola 500


dan bar
bob foo
dtype: object

In [8]: ser.iloc[2]
Out[8]: 300

In [9]: 'bob' in ser

Out[9]: True

In [10]: ser

Out[10]: paul 100


bob foo
wale 300
dan bar
bola 500
dtype: object

In [11]: ser * 2
Out[11]: paul 200
bob foofoo
wale 600
dan barbar
bola 1000
dtype: object

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 2/26


5/9/2019 Introduction to Pandas

In [12]: ser[['bola', 'paul']] ** 2


Out[12]: bola 250000
paul 10000
dtype: object

pandas DataFrame
pandas DataFrame is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

In [13]: d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),


'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill

In [14]: df = pd.DataFrame(d)
print(df)
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0

In [15]: df.index
Out[15]: Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [16]: df.columns
Out[16]: Index(['one', 'two'], dtype='object')

In [17]: pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

Out[17]:
one two

dancy NaN 4444.0

ball 200.0 222.0

apple 100.0 111.0

In [18]: pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])


Out[18]:
two five

dancy 4444.0 NaN

ball 222.0 NaN

apple 111.0 NaN

Create DataFrame from list of Python dictionaries


localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 3/26
5/9/2019 Introduction to Pandas

In [19]: data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [20]: pd.DataFrame(data)
Out[20]:
alex alice dora ema joe

0 1.0 NaN NaN NaN 2.0

1 NaN 20.0 10.0 5.0 NaN

In [21]: pd.DataFrame(data, index=['orange', 'red'])

Out[21]:
alex alice dora ema joe

orange 1.0 NaN NaN NaN 2.0

red NaN 20.0 10.0 5.0 NaN

In [22]: pd.DataFrame(data, columns=['joe', 'dora','alice'])

Out[22]:
joe dora alice

0 2.0 NaN NaN

1 NaN 10.0 20.0

Basic DataFrame operations

In [23]: df

Out[23]:
one two

apple 100.0 111.0

ball 200.0 222.0

cerill NaN 333.0

clock 300.0 NaN

dancy NaN 4444.0

In [24]: df['one']

Out[24]: apple 100.0


ball 200.0
cerill NaN
clock 300.0
dancy NaN
Name: one, dtype: float64

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 4/26


5/9/2019 Introduction to Pandas

In [25]: df['three'] = df['one'] * df['two']


df
Out[25]:
one two three

apple 100.0 111.0 11100.0

ball 200.0 222.0 44400.0

cerill NaN 333.0 NaN

clock 300.0 NaN NaN

dancy NaN 4444.0 NaN

In [26]: df['flag'] = df['one'] > 250


df
Out[26]:
one two three flag

apple 100.0 111.0 11100.0 False

ball 200.0 222.0 44400.0 False

cerill NaN 333.0 NaN False

clock 300.0 NaN NaN True

dancy NaN 4444.0 NaN False

In [27]: three = df.pop('three')

In [28]: three
Out[28]: apple 11100.0
ball 44400.0
cerill NaN
clock NaN
dancy NaN
Name: three, dtype: float64

In [29]: df

Out[29]:
one two flag

apple 100.0 111.0 False

ball 200.0 222.0 False

cerill NaN 333.0 False

clock 300.0 NaN True

dancy NaN 4444.0 False

In [30]: del df['two']

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 5/26


5/9/2019 Introduction to Pandas

In [31]: df
Out[31]:
one flag

apple 100.0 False

ball 200.0 False

cerill NaN False

clock 300.0 True

dancy NaN False

In [32]: df.insert(2, 'copy_of_one', df['one'])


df
Out[32]:
one flag copy_of_one

apple 100.0 False 100.0

ball 200.0 False 200.0

cerill NaN False NaN

clock 300.0 True 300.0

dancy NaN False NaN

In [33]: df['one_upper_half'] = df['one'][:2]


df
Out[33]:
one flag copy_of_one one_upper_half

apple 100.0 False 100.0 100.0

ball 200.0 False 200.0 200.0

cerill NaN False NaN NaN

clock 300.0 True 300.0 NaN

dancy NaN False NaN NaN

Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files:

ratings.csv : userId,movieId,rating, timestamp


tags.csv : userId,movieId, tag, timestamp
movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 6/26


5/9/2019 Introduction to Pandas

In [34]: movies = pd.read_csv('movies.csv')


print(type(movies))
movies.head(15)
<class 'pandas.core.frame.DataFrame'>

Out[34]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 2 Jumanji (1995) Adventure|Children|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance

4 5 Father of the Bride Part II (1995) Comedy

5 6 Heat (1995) Action|Crime|Thriller

6 7 Sabrina (1995) Comedy|Romance

7 8 Tom and Huck (1995) Adventure|Children

8 9 Sudden Death (1995) Action

9 10 GoldenEye (1995) Action|Adventure|Thriller

10 11 American President, The (1995) Comedy|Drama|Romance

11 12 Dracula: Dead and Loving It (1995) Comedy|Horror

12 13 Balto (1995) Adventure|Animation|Children

13 14 Nixon (1995) Drama

14 15 Cutthroat Island (1995) Action|Adventure|Romance

In [35]: tags = pd.read_csv('tags.csv')


tags = tags.drop(["timestamp"], axis = 1)
tags.head()
Out[35]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 7/26


5/9/2019 Introduction to Pandas

In [36]: ratings = pd.read_csv('ratings.csv', parse_dates=['timestamp'])


ratings = ratings.drop(['timestamp'], axis = 1)
ratings.head(20)
Out[36]:
userId movieId rating

0 1 1 4.0

1 1 3 4.0

2 1 6 4.0

3 1 47 5.0

4 1 50 5.0

5 1 70 3.0

6 1 101 5.0

7 1 110 4.0

8 1 151 5.0

9 1 157 5.0

10 1 163 5.0

11 1 216 5.0

12 1 223 3.0

13 1 231 5.0

14 1 235 4.0

15 1 260 5.0

16 1 296 3.0

17 1 316 3.0

18 1 333 5.0

19 1 349 4.0

Data Structures

Series

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 8/26


5/9/2019 Introduction to Pandas

In [37]: tags.head()
Out[37]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

In [38]: #Extract 0th row: notice that it is infact a Series

row_0 = tags.iloc[0]
type(row_0)
Out[38]: pandas.core.series.Series

In [39]: print(row_0)

userId 2
movieId 60756
tag funny
Name: 0, dtype: object

In [40]: row_0.index

Out[40]: Index(['userId', 'movieId', 'tag'], dtype='object')

In [41]: row_0['userId']

Out[41]: 2

In [42]: 'rating' in row_0

Out[42]: False

In [43]: row_0.name
Out[43]: 0

In [44]: row_0 = row_0.rename('first_row')


row_0.name
Out[44]: 'first_row'

DataFrames

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 9/26


5/9/2019 Introduction to Pandas

In [45]: tags.head()

Out[45]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

In [46]: tags.index
Out[46]: RangeIndex(start=0, stop=3683, step=1)

In [47]: tags.columns

Out[47]: Index(['userId', 'movieId', 'tag'], dtype='object')

In [48]: # Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]
Out[48]:
userId movieId tag

0 2 60756 funny

11 18 431 gangster

2000 474 5450 women

Descriptive Statistics
Let's look how the ratings are distributed!

In [49]: ratings['rating'].describe()

Out[49]: count 100836.000000


mean 3.501557
std 1.042529
min 0.500000
25% 3.000000
50% 3.500000
75% 4.000000
max 5.000000
Name: rating, dtype: float64

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 10/26


5/9/2019 Introduction to Pandas

In [50]: ratings.describe()
Out[50]:
userId movieId rating

count 100836.000000 100836.000000 100836.000000

mean 326.127564 19435.295718 3.501557

std 182.618491 35530.987199 1.042529

min 1.000000 1.000000 0.500000

25% 177.000000 1199.000000 3.000000

50% 325.000000 2991.000000 3.500000

75% 477.000000 8122.000000 4.000000

max 610.000000 193609.000000 5.000000

In [51]: ratings['rating'].mean()
Out[51]: 3.501556983616962

In [52]: ratings.mean()

Out[52]: userId 326.127564


movieId 19435.295718
rating 3.501557
dtype: float64

In [53]: ratings['rating'].min()
Out[53]: 0.5

In [54]: ratings['rating'].max()
Out[54]: 5.0

In [55]: ratings['rating'].std()

Out[55]: 1.0425292390605359

In [56]: ratings['rating'].mode()

Out[56]: 0 4.0
dtype: float64

In [57]: ratings.corr()
Out[57]:
userId movieId rating

userId 1.000000 0.006773 -0.049348

movieId 0.006773 1.000000 -0.004061

rating -0.049348 -0.004061 1.000000

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 11/26


5/9/2019 Introduction to Pandas

In [58]: filter_1 = ratings['rating'] > 5


#print(filter_1)
filter_1.any()
Out[58]: False

In [59]: filter_2 = ratings['rating'] > 0


filter_2.all()
Out[59]: True

Data Cleaning: Handling Missing Data


In [60]: movies.shape

Out[60]: (9742, 3)

In [61]: #is any row NULL ?

movies.isnull().any()
Out[61]: movieId False
title False
genres False
dtype: bool

Thats nice ! No NULL values !

In [62]: ratings.shape

Out[62]: (100836, 3)

In [63]: #is any row NULL ?

ratings.isnull().any()
Out[63]: userId False
movieId False
rating False
dtype: bool

Thats nice ! No NULL values !

In [64]: tags.shape
Out[64]: (3683, 3)

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 12/26


5/9/2019 Introduction to Pandas

In [65]: #is any row NULL ?

tags.isnull().any()
Out[65]: userId False
movieId False
tag False
dtype: bool

We have some tags which are NULL.

In [66]: tags = tags.dropna()

In [67]: #Check again: is any row NULL ?

tags.isnull().any()
Out[67]: userId False
movieId False
tag False
dtype: bool

In [68]: tags.shape
Out[68]: (3683, 3)

Thats nice ! No NULL values ! Notice the number of lines have reduced.

Data Visualization

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 13/26


5/9/2019 Introduction to Pandas

In [69]: %matplotlib inline

ratings.hist(column='rating', figsize=(15,10))
Out[69]: array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000096CE153CC0>]],
dtype=object)

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 14/26


5/9/2019 Introduction to Pandas

In [70]: ratings.boxplot(column='rating', figsize=(15,20))


Out[70]: <matplotlib.axes._subplots.AxesSubplot at 0x96ce192a58>

Slicing Out Columns

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 15/26


5/9/2019 Introduction to Pandas

In [71]: tags['tag'].head()

Out[71]: 0 funny
1 Highly quotable
2 will ferrell
3 Boxing story
4 MMA
Name: tag, dtype: object

In [72]: tags.head()

Out[72]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

In [ ]:

In [73]: movies[['title','genres']].head()
Out[73]:
title genres

0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 Jumanji (1995) Adventure|Children|Fantasy

2 Grumpier Old Men (1995) Comedy|Romance

3 Waiting to Exhale (1995) Comedy|Drama|Romance

4 Father of the Bride Part II (1995) Comedy

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 16/26


5/9/2019 Introduction to Pandas

In [74]: ratings[-10:]

Out[74]:
userId movieId rating

100826 610 162350 3.5

100827 610 163937 3.5

100828 610 163981 3.5

100829 610 164179 5.0

100830 610 166528 4.0

100831 610 166534 4.0

100832 610 168248 5.0

100833 610 168250 5.0

100834 610 168252 5.0

100835 610 170875 3.0

In [75]: tags["tag"].head(15)

Out[75]: 0 funny
1 Highly quotable
2 will ferrell
3 Boxing story
4 MMA
5 Tom Hardy
6 drugs
7 Leonardo DiCaprio
8 Martin Scorsese
9 way too long
10 Al Pacino
11 gangster
12 mafia
13 Al Pacino
14 Mafia
Name: tag, dtype: object

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 17/26


5/9/2019 Introduction to Pandas

In [76]: tag_counts = tags['tag'].value_counts()


tag_counts[0:15]
Out[76]: In Netflix queue 131
atmospheric 36
thought-provoking 24
superhero 24
surreal 23
Disney 23
funny 23
religion 22
quirky 21
dark comedy 21
sci-fi 21
psychology 21
suspense 20
crime 19
twist ending 19
Name: tag, dtype: int64

In [77]: tag_counts[:10].plot(kind='bar', figsize=(15,10))

Out[77]: <matplotlib.axes._subplots.AxesSubplot at 0x96ce4bd7b8>

Filters for Selecting Rows

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 18/26


5/9/2019 Introduction to Pandas

In [78]: is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]
Out[78]:
userId movieId rating

36 1 608 5.0

38 1 661 5.0

40 1 733 4.0

43 1 804 4.0

44 1 919 5.0

45 1 923 5.0

46 1 940 5.0

47 1 943 4.0

48 1 954 5.0

50 1 1023 5.0

51 1 1024 5.0

52 1 1025 5.0

53 1 1029 5.0

55 1 1031 5.0

56 1 1032 5.0

57 1 1042 4.0

58 1 1049 5.0

59 1 1060 4.0

60 1 1073 5.0

61 1 1080 5.0

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 19/26


5/9/2019 Introduction to Pandas

In [79]: is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]
Out[79]:
movieId title genres

322 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX

483 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical

488 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy

506 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical

511 Snow White and the Seven Dwarfs


594 Animation|Children|Drama|Fantasy|Musical
(1937)

512 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX

513 596 Pinocchio (1940) Animation|Children|Fantasy|Musical

522 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi

527 616 Aristocats, The (1970) Animation|Children

534 631 All Dogs Go to Heaven 2 (1996) Adventure|Animation|Children|Fantasy|Musical|R...

In [80]: movies[is_animation].head(15)

Out[80]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

12 13 Balto (1995) Adventure|Animation|Children

44 48 Pocahontas (1995) Animation|Children|Drama|Musical|Romance

205 239 Goofy Movie, A (1995) Animation|Children|Comedy|Romance

272 313 Swan Princess, The (1994) Animation|Children

322 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX

483 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical

488 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy

506 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical

511 Snow White and the Seven Dwarfs


594 Animation|Children|Drama|Fantasy|Musical
(1937)

512 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX

513 596 Pinocchio (1940) Animation|Children|Fantasy|Musical

522 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi

527 616 Aristocats, The (1970) Animation|Children

534 631 All Dogs Go to Heaven 2 (1996) Adventure|Animation|Children|Fantasy|Musical|R...

Group By and Aggregate


localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 20/26
5/9/2019 Introduction to Pandas

In [81]: ratings_count = ratings[['movieId','rating']].groupby('rating').count()


ratings_count
Out[81]:
movieId

rating

0.5 1370

1.0 2811

1.5 1791

2.0 7551

2.5 5550

3.0 20047

3.5 13136

4.0 26818

4.5 8551

5.0 13211

In [82]: average_rating = ratings[['movieId','rating']].groupby('movieId').mean()


average_rating.head()
Out[82]:
rating

movieId

1 3.920930

2 3.431818

3 3.259615

4 2.357143

5 3.071429

In [83]: movie_count = ratings[['movieId','rating']].groupby('movieId').count()


movie_count.head()

Out[83]:
rating

movieId

1 215

2 110

3 52

4 7

5 49

Merge Dataframes
localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 21/26
5/9/2019 Introduction to Pandas

In [84]: tags.head()

Out[84]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

In [85]: movies.head()

Out[85]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 2 Jumanji (1995) Adventure|Children|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance

4 5 Father of the Bride Part II (1995) Comedy

In [86]: t = movies.merge(tags, on='movieId', how='inner')


t.head()
Out[86]:
movieId title genres userId tag

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 336 pixar

1 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 474 pixar

2 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 567 fun

3 2 Jumanji (1995) Adventure|Children|Fantasy 62 fantasy

4 2 Jumanji (1995) Adventure|Children|Fantasy 62 magic board game

Combine aggreagation, merging, and filters to get useful


analytics

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 22/26


5/9/2019 Introduction to Pandas

In [87]: avg_ratings = ratings.groupby('movieId', as_index=False).mean()


del avg_ratings['userId']
avg_ratings.head()
Out[87]:
movieId rating

0 1 3.920930

1 2 3.431818

2 3 3.259615

3 4 2.357143

4 5 3.071429

In [88]: movies.head()
Out[88]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 2 Jumanji (1995) Adventure|Children|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance

4 5 Father of the Bride Part II (1995) Comedy

In [89]: box_office = movies.merge(avg_ratings, on='movieId', how='inner')


box_office.head()

Out[89]:
movieId title genres rating

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.920930

1 2 Jumanji (1995) Adventure|Children|Fantasy 3.431818

2 3 Grumpier Old Men (1995) Comedy|Romance 3.259615

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.357143

4 5 Father of the Bride Part II (1995) Comedy 3.071429

In [90]: is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]
Out[90]:
movieId title genres rating

9713 191005 Gintama (2017) Action|Adventure|Comedy|Sci-Fi 4.5

9716 193571 Silver Spoon (2014) Comedy|Drama 4.0

9717 193573 Love Live! The School Idol Movie (2015) Animation 4.0

9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0

9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 23/26


5/9/2019 Introduction to Pandas

In [91]: is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]
Out[91]:
movieId title genres rating

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.920930

2 3 Grumpier Old Men (1995) Comedy|Romance 3.259615

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.357143

4 5 Father of the Bride Part II (1995) Comedy 3.071429

6 7 Sabrina (1995) Comedy|Romance 3.185185

In [92]: box_office[is_comedy & is_highly_rated][-5:]

Out[92]:
movieId title genres rating

9708 190209 Jeff Ross Roasts the Border (2017) Comedy 4.0

9713 191005 Gintama (2017) Action|Adventure|Comedy|Sci-Fi 4.5

9716 193571 Silver Spoon (2014) Comedy|Drama 4.0

9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0

9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0

Vectorized String Operations


In [93]: movies.head()

Out[93]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 2 Jumanji (1995) Adventure|Children|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance

4 5 Father of the Bride Part II (1995) Comedy

Split 'genres' into multiple columns

In [94]: movie_genres = movies['genres'].str.split('|', expand=True)

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 24/26


5/9/2019 Introduction to Pandas

In [95]: movie_genres[:10]

Out[95]:
0 1 2 3 4 5 6 7 8 9

0 Adventure Animation Children Comedy Fantasy None None None None None

1 Adventure Children Fantasy None None None None None None None

2 Comedy Romance None None None None None None None None

3 Comedy Drama Romance None None None None None None None

4 Comedy None None None None None None None None None

5 Action Crime Thriller None None None None None None None

6 Comedy Romance None None None None None None None None

7 Adventure Children None None None None None None None None

8 Action None None None None None None None None None

9 Action Adventure Thriller None None None None None None None

Add a new column for comedy genre flag

In [96]: movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [97]: movie_genres[:10]

Out[97]:
0 1 2 3 4 5 6 7 8 9 isComedy

0 Adventure Animation Children Comedy Fantasy None None None None None True

1 Adventure Children Fantasy None None None None None None None False

2 Comedy Romance None None None None None None None None True

3 Comedy Drama Romance None None None None None None None True

4 Comedy None None None None None None None None None True

5 Action Crime Thriller None None None None None None None False

6 Comedy Romance None None None None None None None None True

7 Adventure Children None None None None None None None None False

8 Action None None None None None None None None None False

9 Action Adventure Thriller None None None None None None None False

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 25/26


5/9/2019 Introduction to Pandas

In [98]: movies['genres'].contains() == 'comedy'

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-98-ac8ed1f9139a> in <module>()
----> 1 movies['genres'].contains() == 'comedy'

~\Anaconda3\anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(se
lf, name)
3079 if name in self._info_axis:
3080 return self[name]
-> 3081 return object.__getattribute__(self, name)
3082
3083 def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'contains'

In [99]: movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId 9742 non-null int64
title 9742 non-null object
genres 9742 non-null object
dtypes: int64(1), object(2)
memory usage: 228.4+ KB

In [ ]:

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 26/26

You might also like