Introduction to Data Analysis in Python
Session 3
Hands-On Data Analysis: Movie Dataset
● From Raw Data to Insights
Load the Data
● File: tmdb-movies.csv
● Contains 10,000+ movies
● Columns: title, budget, revenue, genre, release year, etc.
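A minimal sketch of the loading step. The real file would be read with pd.read_csv("tmdb-movies.csv"); here a tiny in-memory stand-in is used so the example runs on its own. The column names (original_title, genres, release_date) are assumptions about the file's header.

```python
import io
import pandas as pd

# Stand-in for tmdb-movies.csv (assumed column names, made-up rows).
# With the real file: df = pd.read_csv("tmdb-movies.csv")
sample_csv = io.StringIO(
    "original_title,budget,revenue,genres,release_date\n"
    "Movie A,1000000,5000000,Action|Adventure,2015-06-09\n"
    "Movie B,0,0,Drama,2014-01-20\n"
)

df = pd.read_csv(sample_csv)
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
```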
Check Data Info
● df.info() prints a summary: the number of rows and columns, each
column’s data type, and its non-null count, which reveals missing
values. This helps us judge the quality of the data. For example, if
many budget values are missing, we’ll need to handle that.
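A small sketch of this check on made-up data. isnull().sum() is added here as a companion to df.info(); it is not named on the slide, but it gives the missing-value count per column directly.

```python
import numpy as np
import pandas as pd

# Made-up rows with some missing values (NaN).
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [1_000_000, np.nan, 250_000],
    "revenue": [5_000_000, 2_000_000, np.nan],
})

df.info()                    # row count, column dtypes, non-null counts
missing = df.isnull().sum()  # number of missing values per column
print(missing)
```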
Clean the Data
● Drop unused columns: homepage, tagline, keywords
● Remove rows with zero budget/revenue
● Not all columns are useful for our analysis. We remove columns like
homepage and tagline because they don’t help us answer financial or
popularity questions. We also remove movies with zero budget or
revenue, because those zeros most likely stand for missing values, not
real figures.
Drop Columns
The drop() method removes columns. inplace=True means the DataFrame is
modified directly instead of a new copy being returned. We remove columns
that don’t add value to our analysis (e.g., keywords are too detailed for
our needs).
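A minimal sketch of the drop step, using a made-up DataFrame that has the three columns named above.

```python
import pandas as pd

# Made-up rows; only the column names matter for this step.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "budget": [1_000_000, 250_000],
    "homepage": ["http://example.com/a", None],
    "tagline": ["A tagline", None],
    "keywords": ["x|y", "z"],
})

# inplace=True changes df itself; nothing is returned.
df.drop(columns=["homepage", "tagline", "keywords"], inplace=True)
print(df.columns.tolist())
```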
Remove Zero Budget/Revenue
Some movies have a budget or revenue of 0, which doesn’t make sense. We
filter the DataFrame to keep only movies with positive budget and revenue.
This makes our analysis more accurate.
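The filter could look like the sketch below (made-up numbers). Each condition sits in its own parentheses and the two are combined with &.

```python
import pandas as pd

df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget": [1_000_000, 0, 500_000, 200_000],
    "revenue": [5_000_000, 300_000, 0, 800_000],
})

# Keep only rows where both budget and revenue are positive.
df = df[(df["budget"] > 0) & (df["revenue"] > 0)]
print(df)
```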
Add Profit Column
Profit is not in the dataset, but we can calculate it! This is a key
skill in data analysis: creating new columns from existing ones.
Now we can analyze which movies were the most profitable.
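A sketch of the new column, assuming profit is defined as revenue minus budget (the slide text does not spell out the formula).

```python
import pandas as pd

df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "budget": [1_000_000, 500_000],
    "revenue": [5_000_000, 300_000],
})

# Assumed definition: profit = revenue - budget, computed row by row.
df["profit"] = df["revenue"] - df["budget"]
print(df)
```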
Convert Release Date
The release_date is stored as text. We convert it to a datetime type so
Python can understand it. Then we extract just the year into a new column
called year. This will help us analyze trends over time.
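A minimal sketch of the conversion. The sample dates are written as YYYY-MM-DD, which is an assumption; if the real file uses another layout, pd.to_datetime may need a format argument.

```python
import pandas as pd

# Dates stored as text (assumed YYYY-MM-DD layout).
df = pd.DataFrame({"release_date": ["2015-06-09", "2009-12-10"]})

df["release_date"] = pd.to_datetime(df["release_date"])  # text -> datetime
df["year"] = df["release_date"].dt.year                  # keep only the year
print(df)
```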
Top 5 Highest Revenue Movies
We use sort_values() to arrange movies from highest to lowest revenue.
.head(5) shows only the top 5. This helps us quickly identify blockbusters
like Avatar or Star Wars.
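A sketch of the sort on made-up revenue figures. ascending=False puts the highest value first.

```python
import pandas as pd

df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C",
                       "Movie D", "Movie E", "Movie F"],
    "revenue": [300, 900, 100, 700, 500, 200],
})

# Sort from highest to lowest revenue, then keep the first 5 rows.
top_revenue = df.sort_values("revenue", ascending=False).head(5)
print(top_revenue)
```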
Top 5 Most Profitable Movies
Try it!
Lowest Budget Movie
This code finds the movie with the smallest budget. We use a condition
inside the DataFrame: df['budget'] == df['budget'].min(). This returns
only the row(s) where the budget is the minimum.
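The condition described above, sketched on made-up budgets:

```python
import pandas as pd

df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [5_000_000, 100_000, 2_000_000],
})

# The comparison yields True/False per row; df[...] keeps the True rows.
lowest_budget = df[df["budget"] == df["budget"].min()]
print(lowest_budget)
```

If several movies share the minimum budget, all of them are returned.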
Most Popular Movie
Popularity is a score given by TMDb. We find the movie with the highest
popularity using .max(). This might be a recent hit like Jurassic World,
which had massive online engagement.
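A sketch of the same pattern with .max(), assuming the score lives in a column called popularity (the values are made up).

```python
import pandas as pd

df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "popularity": [3.2, 32.9, 9.1],
})

# Keep the row(s) whose popularity equals the column maximum.
most_popular = df[df["popularity"] == df["popularity"].max()]
print(most_popular)
```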
Average Budget & Revenue
.mean() calculates the average. We use f-string formatting to
display the number with commas (e.g., $50,000,000). This
gives us a sense of what a "typical" movie costs and earns.
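A minimal sketch of the averages and the formatting. In the format spec :,.0f the comma adds thousands separators and .0f drops the decimals.

```python
import pandas as pd

df = pd.DataFrame({
    "budget": [10_000_000, 30_000_000],
    "revenue": [40_000_000, 80_000_000],
})

avg_budget = df["budget"].mean()
avg_revenue = df["revenue"].mean()

budget_text = f"${avg_budget:,.0f}"
revenue_text = f"${avg_revenue:,.0f}"
print("Average budget:", budget_text)
print("Average revenue:", revenue_text)
```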
Distribution of Movie Genres
Genres are stored as "Action|Adventure|Sci-Fi". We combine all
genre strings into one big string with str.cat(sep='|'), then split it
into a list. Now we can count how many times each genre
appears.
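The join-then-split idea can be sketched as follows, assuming the column is called genres.

```python
import pandas as pd

df = pd.DataFrame({"genres": ["Action|Adventure", "Drama", "Action|Drama"]})

# Join every row into one long string, using the same '|' separator...
all_genres = df["genres"].str.cat(sep="|")
# ...then split that string back into individual genre names.
genre_list = all_genres.split("|")
print(genre_list)
```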
Count Genres
Counter counts how many times each genre appears.
.most_common(5) shows the top 5. This tells us which genres
are most popular among filmmakers (e.g., Drama, Comedy).
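A short sketch of Counter on a made-up genre list. most_common(n) returns (genre, count) pairs, highest count first.

```python
from collections import Counter

# A made-up genre list, as produced by splitting the genre strings.
genre_list = ["Drama", "Action", "Drama", "Comedy", "Drama", "Action"]

genre_counts = Counter(genre_list)
top_genres = genre_counts.most_common(2)  # use 5 on the full dataset
print(top_genres)
```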
Bar Chart of Genres
A bar chart makes it easy to compare genre counts.
plt.xticks(rotation=45) tilts the labels so they don’t overlap.
Visuals help us see patterns faster than raw numbers.
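A sketch of the bar chart with made-up counts. The Agg backend line is an assumption that lets the example run without a display; in a notebook, remove it and finish with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Made-up genre counts.
genre_counts = {"Drama": 3, "Action": 2, "Comedy": 1}

plt.figure()
bars = plt.bar(list(genre_counts.keys()), list(genre_counts.values()))
plt.xticks(rotation=45)  # tilt the labels so they do not overlap
plt.title("Number of movies per genre")
plt.tight_layout()
```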
Profit Over Time
groupby('year') groups all movies by release year. .sum() adds up
the profit for each year. Plotting it shows trends — for example,
profits may have spiked in 2015 due to big releases.
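The grouping step, sketched on made-up numbers. The result is a Series indexed by year, which can then be plotted.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2014, 2015, 2015, 2014],
    "profit": [100, 400, 250, -50],
})

# One row per year, holding the sum of that year's profits.
profit_by_year = df.groupby("year")["profit"].sum()
print(profit_by_year)
```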
Best Year for Profit
.idxmax() returns the index (year) with the highest profit.
This answers the question: "In which year did movies
make the most money overall?"
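A minimal sketch of .idxmax() on a made-up profit-per-year Series. It returns the index label (here the year), not the profit itself.

```python
import pandas as pd

# Made-up yearly totals, indexed by year.
profit_by_year = pd.Series({2013: 120, 2014: 50, 2015: 650})

best_year = profit_by_year.idxmax()      # label of the largest value
best_profit = profit_by_year[best_year]  # the largest value itself
print(best_year, best_profit)
```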
Runtime Analysis
A histogram shows how movie lengths are distributed. Most
movies are around 90–120 minutes. Very short or very long
movies are rare. This shows which lengths filmmakers typically choose.
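A sketch of the histogram, assuming a runtime column in minutes. The values and the bin edges are made up for illustration; the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"runtime": [85, 92, 95, 101, 104, 110, 118, 125, 160]})

plt.figure()
# bins lists the edges of each bar: 60-90, 90-120, 120-150, 150-180 minutes.
counts, edges, _ = plt.hist(df["runtime"], bins=[60, 90, 120, 150, 180])
plt.xlabel("Runtime (minutes)")
plt.ylabel("Number of movies")
```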
Average Runtime
The average runtime tells us the "typical"
length of a movie. Rounding to zero
decimals (:.0f) makes it cleaner. This is
useful for producers planning new films.
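A short sketch of the rounding, on made-up runtimes.

```python
import pandas as pd

df = pd.DataFrame({"runtime": [90, 100, 121]})

avg_runtime = df["runtime"].mean()
# :.0f rounds to zero decimals when the number is displayed.
runtime_text = f"Average runtime: {avg_runtime:.0f} minutes"
print(runtime_text)
```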
Directors with Most Movies
.value_counts() counts how many movies
each director made. The top directors
(like Woody Allen) are very productive.
This shows who has been most active in
the industry.
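A sketch of .value_counts(), assuming a director column. The names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "director": ["Director A", "Director B", "Director A",
                 "Director C", "Director A", "Director B"],
})

# Counts per director, sorted from most to fewest movies.
top_directors = df["director"].value_counts().head(2)
print(top_directors)
```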
Filter Successful Movies
We define "successful" as any movie that
made above-average profit. This creates
a new DataFrame called successful with
only the top-performing films.
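The above-average filter can be sketched like this (made-up profits).

```python
import pandas as pd

df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "profit": [100, 900, -50, 250],
})

# Keep only movies whose profit is above the average profit (300 here).
successful = df[df["profit"] > df["profit"].mean()]
print(successful)
```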
Genres of Successful Movies
Now we analyze only successful movies.
Are the most common genres the same as
in all movies? We might find that Action
and Adventure dominate here.
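A sketch that repeats the genre count on the successful subset only. The data and the genres column name are assumptions.

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Action|Sci-Fi", "Comedy|Drama"],
    "profit": [900, 50, 700, -100],
})

# Above-average profit, as in the previous step.
successful = df[df["profit"] > df["profit"].mean()]

# Same join, split, and count as for the full dataset.
successful_counts = Counter(successful["genres"].str.cat(sep="|").split("|"))
print(successful_counts.most_common(5))
```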
Compare Budget vs Revenue
A scatter plot shows the relationship
between budget and revenue. An upward
trend means that higher budgets tend to
come with higher revenue. Setting
alpha=0.5 makes points semi-transparent,
so overlapping points stay visible.
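A sketch of the scatter plot on made-up numbers. The Agg backend line is only there so the example runs without a display; in a notebook, remove it and finish with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [25, 30, 80, 90],
})

plt.figure()
# One point per movie; alpha=0.5 makes each point half-transparent.
points = plt.scatter(df["budget"], df["revenue"], alpha=0.5)
plt.xlabel("Budget")
plt.ylabel("Revenue")
```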
Correlation Check
Correlation measures how closely two
variables move together. A value near 1
means higher budgets go with higher
revenue. A value near 0 means no clear
linear link. This helps us check whether
spending more tends to pay off; it does
not prove that it guarantees success.
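A minimal sketch using the Series method .corr(), which computes the Pearson correlation by default. The numbers are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [25, 30, 80, 90],
})

# Value between -1 and 1; closer to 1 means a stronger upward relationship.
correlation = df["budget"].corr(df["revenue"])
print(f"Correlation between budget and revenue: {correlation:.2f}")
```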
Save Results to CSV
After analysis, we can save our results!
This line exports the top movies to a new
CSV file. index=False means we don’t
save the row numbers. Now the data can
be shared or used in Excel.
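A sketch of the export. The file name top_movies.csv is hypothetical, and the file is written to the system's temporary folder here so the example leaves the working directory untouched.

```python
import os
import tempfile
import pandas as pd

top_movies = pd.DataFrame({
    "original_title": ["Movie B", "Movie D"],
    "revenue": [900, 700],
})

# Hypothetical output path; any writable location works.
out_path = os.path.join(tempfile.gettempdir(), "top_movies.csv")
top_movies.to_csv(out_path, index=False)  # index=False: no row-number column

saved = pd.read_csv(out_path)  # read it back to check what was written
print(saved)
```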
Automate with a Function
We wrap our analysis into a function.
Now we can reuse it for any movie
dataset! This is the power of automation
— one function can analyze hundreds of
files.
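One possible shape for such a function. The name analyze_movies and the summary it returns are illustrative choices, not the version shown in the session.

```python
import pandas as pd

def analyze_movies(df):
    """Clean a movie DataFrame and return a few summary numbers."""
    # Same cleaning as before: keep rows with positive budget and revenue.
    cleaned = df[(df["budget"] > 0) & (df["revenue"] > 0)].copy()
    cleaned["profit"] = cleaned["revenue"] - cleaned["budget"]
    return {
        "movies": len(cleaned),
        "average_budget": cleaned["budget"].mean(),
        "total_profit": cleaned["profit"].sum(),
    }

# Made-up data; any DataFrame with budget and revenue columns works.
movies = pd.DataFrame({
    "budget": [100, 0, 300],
    "revenue": [400, 50, 200],
})
summary = analyze_movies(movies)
print(summary)
```

To reuse it on another file, load that file with pd.read_csv and pass the resulting DataFrame to the function.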
Congratulations!
● You’ve completed your first data analysis project!
● Next: Try with your own dataset
● Practice, Explore, Automate!
Thank You
WWW.Undercontrolrt.com
7724 0580
@undercontroloman