
PROGRAMMING IN DATA SCIENCE WITH PYTHON

CSDX136

----------ASSIGNMENT – 01----------

Work done by,

JENIFER Y
220071601098
CSE – B
1. What is Web Scraping?

Web Scraping is the process of automatically extracting data from websites using computer programs instead of manually copying information. It is done using specialized libraries and tools such as BeautifulSoup, Scrapy, Selenium, and Playwright.

It allows us to collect large amounts of data in a structured format (such as CSV, Excel, or databases) for analysis, research, and automation.
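The core idea can be shown with a minimal sketch. A hard-coded HTML snippet stands in for a downloaded page here (so the example runs without a network connection); the tag names and classes are made up for illustration:

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a downloaded page.
# A real scraper would obtain this with requests.get(url).text.
html = """
<html><body>
  <h2 class="product">Laptop</h2><span class="price">$999</span>
  <h2 class="product">Phone</h2><span class="price">$499</span>
</body></html>
"""

# Parse the markup, then pull out the pieces we care about by CSS selector.
soup = BeautifulSoup(html, "html.parser")
names = [tag.get_text() for tag in soup.select("h2.product")]
prices = [tag.get_text() for tag in soup.select("span.price")]

print(list(zip(names, prices)))  # [('Laptop', '$999'), ('Phone', '$499')]
```

The result is structured rows (name, price) extracted from unstructured markup, ready to be written to CSV or a database.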

2. How Web Scraping Evolved & Why

• Early Days (Manual Copying) → Data was copied manually from websites.
• Automation Scripts (2000s) → Developers started using simple scripts with urllib and regex to fetch data.
• Libraries like BeautifulSoup → Made parsing HTML much easier.
• Advanced Frameworks (Scrapy, Selenium, Playwright) → Enabled large-scale crawling, handling JavaScript, and automating browsers.

👉 The need for evolution came from:

• Explosion of online data (e-commerce, news, research).
• Companies requiring real-time insights (prices, competitors, trends).
• Need for faster, automated collection instead of manual work.
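The early regex-based style mentioned above can be sketched as follows (the HTML snippet is invented for illustration). The pattern works on this exact markup, but unlike a real HTML parser it breaks as soon as the page adds an attribute or changes whitespace, which is exactly why libraries like BeautifulSoup took over:

```python
import re

# A tiny invented page fragment with two links.
html = '<a href="/movie/1">Inception</a><a href="/movie/2">Dune</a>'

# 2000s-style scraping: pull the link text out with a regular expression.
# Fragile: any change to the markup (extra attributes, newlines) breaks it.
titles = re.findall(r'<a href="[^"]*">([^<]+)</a>', html)

print(titles)  # ['Inception', 'Dune']
```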

3. Use Cases of Web Scraping

o Collecting product prices from e-commerce websites.
o Gathering movie reviews and ratings from IMDb.
o Extracting job postings from LinkedIn/Indeed.
o Collecting news headlines for sentiment analysis.
o Extracting weather data from forecast sites.
4. Applications of Web Scraping

• Business Intelligence – Competitor analysis, price tracking.
• Data Science / Machine Learning – Creating datasets for model training.
• Academic Research – Collecting data for studies.
• Digital Marketing – Lead generation, SEO analysis.
• News Aggregation – Bringing together headlines from multiple sites.

Application Area            | Description                                                              | Example
E-commerce & Price Tracking | Helps track product prices across multiple websites for comparison and monitoring. | Amazon price comparison tools
Market Research             | Collects data on customer reviews, ratings, and trends for business insights.      | Competitor analysis
Job Portals                 | Extracts job postings, salary data, and skill requirements.              | LinkedIn, Indeed scrapers
Travel & Hospitality        | Gathers flight fares, hotel prices, and travel reviews.                  | MakeMyTrip, Booking.com
Social Media Monitoring     | Tracks posts, hashtags, and engagement data for sentiment analysis.      | Twitter sentiment scraping
News & Media                | Extracts articles, headlines, and breaking news updates.                 | Google News scrapers
Real Estate                 | Scrapes property listings, prices, and locations.                        | Zillow, 99acres
Academic Research           | Collects datasets, publications, and citations.                          | Google Scholar scraping
Sports Analytics            | Gathers live scores, player stats, and match data.                       | ESPN, Cricbuzz
Financial Data              | Extracts stock prices, crypto values, and financial reports.             | Yahoo Finance, CoinMarketCap

5. Advantages of Web Scraping

✅ Automates data collection
✅ Saves time & effort
✅ Provides real-time data
✅ Handles large-scale datasets
✅ Enables deeper analytics

6. Disadvantages of Web Scraping

❌ Some websites block bots or change structure
❌ Legal/ethical issues (if terms of service are violated)
❌ Dynamic JavaScript pages need advanced tools (Selenium/Playwright)
❌ Large-scale scraping can be resource-heavy
7. Code Explanation

# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os

# Ask user for folder path
save_path = input("Enter the folder path to save files: ").strip()
os.makedirs(save_path, exist_ok=True)

# Fetch IMDB Top Movies page
url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

# Parse HTML using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
rows = soup.select("li.ipc-metadata-list-summary-item")

# Extract Title, Year, Rating
movies_list = []
for row in rows:
    title_tag = row.select_one("h3")
    title = title_tag.get_text(strip=True) if title_tag else "N/A"

    year_tag = row.select_one("span.ipc-title__subtext")
    year = year_tag.get_text(strip=True) if year_tag else "N/A"

    rating_tag = row.select_one("span.ipc-rating-star--rating")
    rating = rating_tag.get_text(strip=True) if rating_tag else "N/A"

    movies_list.append([title, year, rating])

# Save to CSV in chosen folder
csv_file = os.path.join(save_path, "imdb_top_movies.csv")
df = pd.DataFrame(movies_list, columns=["Title", "Year", "Rating"])
df.to_csv(csv_file, index=False, encoding="utf-8")

# Print Top 10 Movies
for i, row in df.head(10).iterrows():
    print(f"{i+1}. {row['Title']} ({row['Year']}) — Rating: {row['Rating']}")

# Word Count on Titles
all_words = " ".join(df["Title"].tolist()).lower().split()
word_counts = Counter(all_words)
print("\nTop 10 words in movie titles:")
for word, count in word_counts.most_common(10):
    print(f"{word}: {count}")

# Generate Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(df["Title"]))
img_file = os.path.join(save_path, "imdb_wordcloud.png")
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.savefig(img_file)
plt.show()

🔎 How the IMDb Web Scraping Program Works

1. Importing Libraries

• The program first loads the necessary Python libraries:
o requests to fetch HTML content from IMDb.
o BeautifulSoup to parse and extract specific elements from the page.
o pandas to organize the extracted data and save it in CSV format.
o WordCloud & matplotlib to generate a word cloud and visualize text data.

2. Fetching the Web Page

• The program sends a request to the IMDb Top 250 Movies page.
• It downloads the HTML response of that page.
• This response contains all the information (movie titles, years, ratings, etc.).
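This fetching step can be sketched as a small helper. The `fetch_html` name is illustrative, not part of the program above; the timeout and the `raise_for_status()` call are defensive additions worth having in any scraper:

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page while identifying as a regular browser.

    A User-Agent header is sent because some sites (IMDb included)
    serve an error page to clients that identify as scripts.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return response.text

# Usage (requires a network connection):
# html = fetch_html("https://www.imdb.com/chart/top/")
```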
3. Parsing HTML with BeautifulSoup

• BeautifulSoup reads the raw HTML and converts it into a navigable structure (like a tree).
• Specific tags and classes are identified:
o Movie name
o Release year
o IMDb rating
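The parsing step can be tried in isolation on a simplified stand-in for one entry of the IMDb list (the real page's class names are shown here but may change at any time, which is a common cause of scraper breakage):

```python
from bs4 import BeautifulSoup

# A simplified, hand-written stand-in for one IMDb list entry.
html = """
<li class="ipc-metadata-list-summary-item">
  <h3>1. The Shawshank Redemption</h3>
  <span class="ipc-title__subtext">1994</span>
  <span class="ipc-rating-star--rating">9.3</span>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree: find the entry, then each field inside it.
row = soup.select_one("li.ipc-metadata-list-summary-item")
title = row.select_one("h3").get_text(strip=True)
year = row.select_one("span.ipc-title__subtext").get_text(strip=True)
rating = row.select_one("span.ipc-rating-star--rating").get_text(strip=True)

print(title, year, rating)
```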

4. Extracting Movie Data

• The program loops through each movie entry.
• For each one, it extracts:
o The title of the movie.
o The year it was released.
o Its IMDb rating.
• This information is stored in a Python list.

5. Saving Data into a CSV File

• The extracted list is converted into a pandas DataFrame (like an Excel table).
• The user specifies a folder where the file should be saved.
• The DataFrame is exported to a CSV file (e.g., imdb_top_movies.csv).
• This makes the data reusable in Excel, Python, or other tools.
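This step can be exercised on its own with a couple of hypothetical rows (written to a temporary folder here rather than a user-chosen path); reading the file back confirms the data survives the round trip:

```python
import os
import tempfile
import pandas as pd

# Hypothetical rows in the same [title, year, rating] shape the scraper collects.
movies_list = [
    ["The Shawshank Redemption", "1994", "9.3"],
    ["The Godfather", "1972", "9.2"],
]

df = pd.DataFrame(movies_list, columns=["Title", "Year", "Rating"])

# Export to CSV, then read it back to show the data is reusable elsewhere.
csv_file = os.path.join(tempfile.gettempdir(), "imdb_top_movies.csv")
df.to_csv(csv_file, index=False, encoding="utf-8")

reloaded = pd.read_csv(csv_file)
print(reloaded.shape)  # (2, 3)
```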

6. Word Count and Word Cloud

• The program combines all movie titles into a single text string.
• It counts how frequently each word appears across all titles.
• A word cloud image is generated, where bigger words represent higher frequency.
• Example: words like “The”, “Of”, or “Man” may appear larger.
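The counting half of this step needs only the standard library; with a few made-up sample titles:

```python
from collections import Counter

# Sample titles, invented for illustration.
titles = ["The Godfather", "The Dark Knight", "The Lord of the Rings"]

# Join every title into one string, lowercase it, and split into words,
# mirroring what the program does before building the word cloud.
words = " ".join(titles).lower().split()
counts = Counter(words)

print(counts.most_common(1))  # [('the', 4)]
```

Filtering out common stopwords like "the" and "of" before counting often gives a more informative word cloud; that refinement is left out of the program above.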
7. Output Clarification

• CSV File: Contains structured data → Movie Name, Release Year, IMDb Rating.
• Top 10 Movies: Displayed separately in the console.
• Word Cloud: A visual representation of the most common words in movie titles.

8. Before scraping, the webpage is just a normal site you see in the browser.

Scraping the website:

9. File saved to the folder as scrappingresults:

WORD CLOUD:

10. CONCLUSION

Web scraping has become a powerful tool in today’s data-driven world. It allows us to extract structured information from unstructured web pages, transforming raw HTML into meaningful datasets. Over time, it has evolved from simple copy-paste scripts to advanced frameworks and automation tools that can handle complex websites.

With wide applications in research, business, e-commerce, finance, and machine learning, web scraping helps in decision-making and insights generation. However, it also comes with challenges such as legal concerns, website restrictions, and potential ethical issues.

Overall, web scraping acts as a bridge between the vast unorganized data on the internet and the structured information needed for analysis, making it an indispensable skill in data science and modern digital solutions.
