Programming

Data analysis by python

Uploaded by

akonftto
Copyright
© All Rights Reserved

Introduction to Data Analysis in Python
Session 3
Hands-On Data Analysis: Movie Dataset

● From Raw Data to Insights


Load the Data
● File: tmdb-movies.csv
● Contains 10,000+ movies
● Columns: title, budget, revenue, genre, release year, etc.
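Loading could look like the sketch below. It is only an illustration: the three-row CSV text stands in for tmdb-movies.csv, and the column names (original_title, genres, release_date) are assumptions about that file.

```python
import io
import pandas as pd

# Tiny stand-in for tmdb-movies.csv; column names and values are assumed.
sample_csv = io.StringIO(
    "original_title,budget,revenue,genres,release_date\n"
    "Movie A,1000000,5000000,Action|Adventure,2015-06-09\n"
    "Movie B,0,0,Drama,2014-01-15\n"
    "Movie C,2000000,1500000,Comedy|Drama,2015-11-20\n"
)

# With the real file this would be: df = pd.read_csv("tmdb-movies.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
```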
Check Data Info

● df.info() gives us a summary: the number of rows and columns, the
type of each column, and how many non-missing values it holds. This
helps us understand the quality of the data. For example, if many
budget values are missing, we’ll need to handle that.
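A minimal sketch of this check, using a small made-up DataFrame with one missing budget (the data and column names are assumptions, not the real file):

```python
import numpy as np
import pandas as pd

# Made-up sample; the second movie has no recorded budget.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [1_000_000, np.nan, 2_000_000],
    "revenue": [5_000_000, 300_000, 1_500_000],
})

df.info()                  # prints column names, non-null counts and dtypes
missing = df.isna().sum()  # number of missing values per column
print(missing)
```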
Clean the Data
● Drop unused columns: homepage, tagline, keywords
● Remove rows with zero budget/revenue
● Not all columns are useful for our analysis. We remove columns like
homepage and tagline because they don’t help us answer financial or
popularity questions. We also remove movies with zero budget or
revenue, since a zero most likely means the value was never recorded.
Drop Columns

The drop() function removes columns. inplace=True means the change is
saved directly to the DataFrame. We remove columns that don’t add
value to our analysis (e.g., keywords are too detailed for our needs).
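As a sketch, dropping the three columns might look like this. The sample rows are invented for illustration.

```python
import pandas as pd

# Made-up sample with the three columns we want to remove.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "budget": [1_000_000, 2_000_000],
    "homepage": ["http://example.com/a", None],
    "tagline": ["Tagline A", "Tagline B"],
    "keywords": ["k1|k2", "k3"],
})

# inplace=True modifies df itself instead of returning a new DataFrame.
df.drop(columns=["homepage", "tagline", "keywords"], inplace=True)
print(df.columns.tolist())
```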
Remove Zero Budget/Revenue

Some movies have a budget or revenue of 0, which doesn’t make sense. We
filter the DataFrame to keep only movies with positive budget and revenue.
This makes our analysis more accurate.
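One way to write this filter, shown on made-up data:

```python
import pandas as pd

# Made-up sample; Movie B and Movie C each contain a zero.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget": [1_000_000, 0, 2_000_000, 500_000],
    "revenue": [5_000_000, 300_000, 0, 700_000],
})

# Keep only rows where both budget and revenue are positive.
df = df[(df["budget"] > 0) & (df["revenue"] > 0)]
print(df)
```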
Add Profit Column

Profit is not in the dataset, but we can calculate it! This is a key
skill in data analysis: creating new columns from existing ones.
Now we can analyze which movies were the most profitable.
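A minimal sketch, assuming profit is simply revenue minus budget (the slide does not spell out the formula):

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie C"],
    "budget": [1_000_000, 2_000_000],
    "revenue": [5_000_000, 1_500_000],
})

# New column computed row by row from two existing columns.
df["profit"] = df["revenue"] - df["budget"]
print(df)
```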
Convert Release Date

The release_date is stored as text. We convert it to a datetime type so
Python can understand it. Then we extract just the year into a new column
called year. This will help us analyze trends over time.
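A sketch of the conversion. The sample dates use the YYYY-MM-DD format as an assumption; the real file may store dates differently and need a format argument.

```python
import pandas as pd

# Made-up sample; dates are text in an assumed YYYY-MM-DD format.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "release_date": ["2015-06-09", "2009-12-10"],
})

df["release_date"] = pd.to_datetime(df["release_date"])  # text -> datetime
df["year"] = df["release_date"].dt.year                  # keep only the year
print(df.dtypes)
print(df)
```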
Top 5 Highest Revenue Movies

We use sort_values() with ascending=False to arrange movies from highest
to lowest revenue. .head(5) shows only the top 5. This helps us quickly
identify blockbusters like Avatar or Star Wars.
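For illustration, on six made-up movies:

```python
import pandas as pd

# Made-up sample of six movies.
df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E", "F"],
    "revenue": [100, 900, 300, 700, 500, 200],
})

# Sort from highest to lowest revenue, then keep the first five rows.
top5 = df.sort_values("revenue", ascending=False).head(5)
print(top5)
```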
Top 5 Most Profitable Movies

Try it!
Lowest Budget Movie

This code finds the movie with the smallest budget. We use a condition
inside the DataFrame: df['budget'] == df['budget'].min(). This returns
only the row(s) where the budget is the minimum.
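The condition can be tried on a small made-up table. Two movies tie for the minimum here, so both rows come back.

```python
import pandas as pd

# Made-up sample; B and C share the smallest budget.
df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [500, 100, 100, 900],
})

# The comparison builds a True/False mask; indexing with it keeps the True rows.
lowest = df[df["budget"] == df["budget"].min()]
print(lowest)
```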
Most Popular Movie

Popularity is a score given by TMDb. We find the highest score with
.max() and keep the row that matches it. This might be a recent hit like
Jurassic World, which had massive online engagement.
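A sketch with invented popularity scores (the column name popularity is assumed):

```python
import pandas as pd

# Made-up sample with an assumed "popularity" column.
df = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "popularity": [3.2, 28.4, 11.7],
})

# Keep the row(s) whose popularity equals the maximum score.
most_popular = df[df["popularity"] == df["popularity"].max()]
print(most_popular)
```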
Average Budget & Revenue

.mean() calculates the average. We use f-string formatting to
display the number with commas (e.g., $50,000,000). This
gives us a sense of what a "typical" movie costs and earns.
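A small sketch of the formatting, with made-up figures. The ",.0f" format spec adds thousands separators and drops the decimals.

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "budget": [40_000_000, 60_000_000],
    "revenue": [100_000_000, 150_000_000],
})

avg_budget = df["budget"].mean()
avg_revenue = df["revenue"].mean()

budget_text = f"${avg_budget:,.0f}"
revenue_text = f"${avg_revenue:,.0f}"
print("Average budget:", budget_text)
print("Average revenue:", revenue_text)
```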
Distribution of Movie Genres

Genres are stored as "Action|Adventure|Sci-Fi". We combine all
genre strings into one big string with str.cat(sep='|'), then split it
into a list. Now we can count how many times each genre appears.
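The two steps might look like this on three made-up rows (the column name genres is assumed):

```python
import pandas as pd

# Made-up sample; each row holds one or more genres separated by "|".
df = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Comedy|Drama"],
})

all_genres = df["genres"].str.cat(sep="|")  # one long "|"-separated string
genre_list = all_genres.split("|")          # list with one entry per genre
print(genre_list)
```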
Count Genres

Counter counts how many times each genre appears.
.most_common(5) shows the top 5. This tells us which genres
appear most often in the dataset (e.g., Drama, Comedy).
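A sketch of the counting step, using a short made-up genre list:

```python
from collections import Counter

# Made-up list, as it might look after splitting the genre strings.
genre_list = ["Action", "Adventure", "Drama", "Comedy", "Drama", "Drama", "Comedy"]

genre_counts = Counter(genre_list)    # maps each genre to its count
top = genre_counts.most_common(5)     # (genre, count) pairs, largest first
print(top)
```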
Bar Chart of Genres

A bar chart makes it easy to compare genre counts.
plt.xticks(rotation=45) tilts the labels so they don’t overlap.
Visuals help us see patterns faster than raw numbers.
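A possible version of the chart, with made-up counts. The "Agg" backend line is an assumption so the sketch also runs without a display; in a notebook it can be dropped and plt.show() called at the end.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt

# Made-up genre counts.
genre_counts = {"Drama": 3, "Comedy": 2, "Action": 1}

plt.figure()
bars = plt.bar(list(genre_counts.keys()), list(genre_counts.values()))
plt.xticks(rotation=45)  # tilt the labels so they do not overlap
plt.title("Number of movies per genre")
plt.ylabel("Count")
plt.tight_layout()
```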
Profit Over Time

groupby('year') groups all movies by release year. .sum() adds up
the profit for each year. Plotting it shows trends. For example,
profits may have spiked in 2015 due to big releases.
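The grouping step, sketched on made-up numbers. The plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Made-up sample with the year and profit columns created earlier.
df = pd.DataFrame({
    "year": [2014, 2015, 2015],
    "profit": [1_000_000, 2_000_000, 3_000_000],
})

# One total per year; the result is a Series indexed by year.
profit_by_year = df.groupby("year")["profit"].sum()
print(profit_by_year)

# To draw the trend: profit_by_year.plot(kind="line")
```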
Best Year for Profit

.idxmax() returns the index label (here, the year) with the highest
profit. This answers the question: "In which year did movies
make the most money overall?"
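For illustration, with invented yearly totals:

```python
import pandas as pd

# Made-up yearly profit totals, indexed by year.
profit_by_year = pd.Series(
    [1_000_000, 5_000_000, 3_000_000],
    index=[2014, 2015, 2016],
)

best_year = profit_by_year.idxmax()  # index label of the largest value
print(best_year)
```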
Runtime Analysis

A histogram shows how movie lengths are distributed. Most
movies are around 90–120 minutes. Very short or very long
movies are rare. This shows which lengths are most common in
the dataset.
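A sketch of the histogram on seven made-up runtimes. The bin edges and the "Agg" backend are choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt

# Made-up runtimes in minutes.
runtimes = [85, 95, 100, 110, 118, 150, 200]

plt.figure()
# 30-minute bins from 60 to 210; counts holds the number of movies per bin.
counts, edges, patches = plt.hist(runtimes, bins=[60, 90, 120, 150, 180, 210])
plt.xlabel("Runtime (minutes)")
plt.ylabel("Number of movies")
print(counts)
```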
Average Runtime

The average runtime tells us the "typical" length of a movie.
Rounding to zero decimals (:.0f) makes it cleaner. This is
useful for producers planning new films.
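A short sketch with made-up runtimes (the column name runtime is assumed):

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({"runtime": [90, 100, 121]})

avg_runtime = df["runtime"].mean()
runtime_text = f"{avg_runtime:.0f}"  # round to zero decimals for display
print(f"Average runtime: {runtime_text} minutes")
```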
Directors with Most Movies

.value_counts() counts how many movies each director made.
The top directors (like Woody Allen) are very productive.
This shows who has been most active in the industry.
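On placeholder director names, the count might be computed like this:

```python
import pandas as pd

# Made-up sample; names are placeholders.
df = pd.DataFrame({
    "director": ["Director A", "Director B", "Director A",
                 "Director C", "Director A", "Director B"],
})

# Counts per director, sorted from most to fewest movies.
director_counts = df["director"].value_counts()
print(director_counts.head(5))
```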
Filter Successful Movies

We define "successful" as any movie that made above-average profit.
This creates a new DataFrame called successful with only the
top-performing films.
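A sketch of this filter on three made-up movies:

```python
import pandas as pd

# Made-up sample; the average profit here is 40.
df = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "profit": [10, 20, 90],
})

# Keep only the movies whose profit is above the average.
successful = df[df["profit"] > df["profit"].mean()]
print(successful)
```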
Genres of Successful Movies

Now we analyze only successful movies. Are the most common genres
the same as in all movies? We might find that Action and Adventure
dominate here.
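This step reuses the earlier filter and genre count. A sketch on invented data, where the result says nothing about the real dataset:

```python
from collections import Counter
import pandas as pd

# Made-up sample; the average profit here is 50.
df = pd.DataFrame({
    "genres": ["Drama", "Comedy|Drama", "Action|Adventure", "Action|Thriller"],
    "profit": [10, 20, 90, 80],
})

successful = df[df["profit"] > df["profit"].mean()]
genre_list = successful["genres"].str.cat(sep="|").split("|")
successful_counts = Counter(genre_list)
print(successful_counts.most_common(5))
```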
Compare Budget vs Revenue

A scatter plot shows the relationship between budget and revenue.
If the points trend upward, higher budgets tend to go with higher
revenue. The alpha=0.5 makes points semi-transparent, so
overlapping points stay visible.
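A possible version of the plot with four made-up movies. The "Agg" backend is an assumption so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [15, 50, 40, 120],
})

plt.figure()
points = plt.scatter(df["budget"], df["revenue"], alpha=0.5)  # one dot per movie
plt.xlabel("Budget")
plt.ylabel("Revenue")
plt.title("Budget vs Revenue")
```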
Correlation Check

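Correlation measures how closely two variables move together along
a straight line. It ranges from -1 to 1. A value near 1 means movies
with higher budgets tend to have higher revenue. A value near 0
means no clear linear link. This helps us see how strongly spending
and revenue are related, but even a high value does not guarantee
that any single big-budget movie succeeds.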
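A sketch using the pandas corr() method, which computes the Pearson correlation by default. The slide does not show which call it used, so this is one reasonable choice, on made-up data.

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [15, 50, 40, 120],
})

# Pearson correlation between the two columns, a number between -1 and 1.
r = df["budget"].corr(df["revenue"])
print(f"Correlation: {r:.2f}")
```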
Save Results to CSV

After analysis, we can save our results! This line exports the top
movies to a new CSV file. index=False means we don’t save the row
numbers. Now the data can be shared or used in Excel.
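A sketch of the export. The file name top_movies.csv is hypothetical, and the temporary folder is used only so the example leaves no files behind.

```python
import os
import tempfile
import pandas as pd

# Made-up result table.
top_movies = pd.DataFrame({
    "original_title": ["B", "D"],
    "revenue": [900, 700],
})

# Hypothetical output file inside a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), "top_movies.csv")
top_movies.to_csv(out_path, index=False)  # index=False: no row-number column

reloaded = pd.read_csv(out_path)          # read it back to check the contents
print(reloaded)
```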
Automate with a Function

We wrap our analysis into a function. Now we can reuse it for any
movie dataset with the same columns! This is the power of
automation: one function can analyze hundreds of files.
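The slide's own function is not shown, so the sketch below is a hypothetical analyze_movies helper that bundles a few of the earlier steps. Its name, return values, and the column names it expects are all assumptions.

```python
import pandas as pd

def analyze_movies(df):
    """Hypothetical helper: clean the data and return a few summary values."""
    # Keep rows with positive budget and revenue, then add a profit column.
    clean = df[(df["budget"] > 0) & (df["revenue"] > 0)].copy()
    clean["profit"] = clean["revenue"] - clean["budget"]
    return {
        "n_movies": len(clean),
        "avg_budget": clean["budget"].mean(),
        "avg_revenue": clean["revenue"].mean(),
        "top_profit_title": clean.loc[clean["profit"].idxmax(), "original_title"],
    }

# Made-up sample; with a real file: analyze_movies(pd.read_csv("tmdb-movies.csv"))
sample = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [100, 0, 200],
    "revenue": [500, 50, 250],
})
summary = analyze_movies(sample)
print(summary)
```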
Congratulations!

● You’ve completed your first data analysis project!


● Next: Try with your own dataset
● Practice, Explore, Automate!
Thank You
WWW.Undercontrolrt.com

7724 0580

@undercontroloman
