
PROGRAMMING IN DATA SCIENCE WITH PYTHON

CSDX136

----------ASSIGNMENT – 01----------

Work done by,

JENIFER Y
220071601098
CSE – B
1. What is Web Scraping?

Web Scraping is the process of automatically extracting data from websites using computer programs instead of manually copying information. It is done using specialized libraries and tools such as BeautifulSoup, Scrapy, Selenium, and Playwright.

It allows us to collect large amounts of data in a structured format (such as CSV, Excel, or databases) for analysis, research, and automation.
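The core idea can be shown with a minimal sketch. A hard-coded HTML snippet stands in for a downloaded page here (so the example runs without a network connection); the tag names and classes are made up for illustration:

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a downloaded page.
# A real scraper would obtain this with requests.get(url).text.
html = """
<html><body>
  <h2 class="product">Laptop</h2><span class="price">$999</span>
  <h2 class="product">Phone</h2><span class="price">$499</span>
</body></html>
"""

# Parse the markup, then pull out the pieces we care about by CSS selector.
soup = BeautifulSoup(html, "html.parser")
names = [tag.get_text() for tag in soup.select("h2.product")]
prices = [tag.get_text() for tag in soup.select("span.price")]

print(list(zip(names, prices)))  # [('Laptop', '$999'), ('Phone', '$499')]
```

The result is structured rows (name, price) extracted from unstructured markup, ready to be written to CSV or a database.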

2. How Web Scraping Evolved & Why

• Early Days (Manual Copying) → Data was copied manually from websites.
• Automation Scripts (2000s) → Developers started using simple scripts with urllib and regex to fetch data.
• Libraries like BeautifulSoup → Made parsing HTML much easier.
• Advanced Frameworks (Scrapy, Selenium, Playwright) → Enabled large-scale crawling, handling JavaScript, and automating browsers.

👉 The need for evolution came from:

• Explosion of online data (e-commerce, news, research).
• Companies requiring real-time insights (prices, competitors, trends).
• Need for faster, automated collection instead of manual work.
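The early regex-based style mentioned above can be sketched as follows (the HTML snippet is invented for illustration). The pattern works on this exact markup, but unlike a real HTML parser it breaks as soon as the page adds an attribute or changes whitespace, which is exactly why libraries like BeautifulSoup took over:

```python
import re

# A tiny invented page fragment with two links.
html = '<a href="/movie/1">Inception</a><a href="/movie/2">Dune</a>'

# 2000s-style scraping: pull the link text out with a regular expression.
# Fragile: any change to the markup (extra attributes, newlines) breaks it.
titles = re.findall(r'<a href="[^"]*">([^<]+)</a>', html)

print(titles)  # ['Inception', 'Dune']
```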

3. Use Cases of Web Scraping

o Collecting product prices from e-commerce websites.
o Gathering movie reviews and ratings from IMDb.
o Extracting job postings from LinkedIn/Indeed.
o Collecting news headlines for sentiment analysis.
o Extracting weather data from forecast sites.
4. Applications of Web Scraping

• Business Intelligence – Competitor analysis, price tracking.
• Data Science / Machine Learning – Creating datasets for model training.
• Academic Research – Collecting data for studies.
• Digital Marketing – Lead generation, SEO analysis.
• News Aggregation – Bringing together headlines from multiple sites.

Application Area            | Description                                                              | Example
E-commerce & Price Tracking | Helps track product prices across multiple websites for comparison and monitoring. | Amazon price comparison tools
Market Research             | Collects data on customer reviews, ratings, and trends for business insights.      | Competitor analysis
Job Portals                 | Extracts job postings, salary data, and skill requirements.              | LinkedIn, Indeed scrapers
Travel & Hospitality        | Gathers flight fares, hotel prices, and travel reviews.                  | MakeMyTrip, Booking.com
Social Media Monitoring     | Tracks posts, hashtags, and engagement data for sentiment analysis.      | Twitter sentiment scraping
News & Media                | Extracts articles, headlines, and breaking news updates.                 | Google News scrapers
Real Estate                 | Scrapes property listings, prices, and locations.                        | Zillow, 99acres
Academic Research           | Collects datasets, publications, and citations.                          | Google Scholar scraping
Sports Analytics            | Gathers live scores, player stats, and match data.                       | ESPN, Cricbuzz
Financial Data              | Extracts stock prices, crypto values, and financial reports.             | Yahoo Finance, CoinMarketCap

5. Advantages of Web Scraping

✅ Automates data collection
✅ Saves time & effort
✅ Provides real-time data
✅ Handles large-scale datasets
✅ Enables deeper analytics

6. Disadvantages of Web Scraping

❌ Some websites block bots or change structure
❌ Legal/ethical issues (if terms of service are violated)
❌ Dynamic JavaScript pages need advanced tools (Selenium/Playwright)
❌ Large-scale scraping can be resource-heavy
7. Code Explanation

# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os

# Ask user for folder path
save_path = input("Enter the folder path to save files: ").strip()
os.makedirs(save_path, exist_ok=True)

# Fetch IMDB Top Movies page
url = "https://www.imdb.com/chart/top/"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

# Parse HTML using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
rows = soup.select("li.ipc-metadata-list-summary-item")

# Extract Title, Year, Rating
movies_list = []
for row in rows:
    title_tag = row.select_one("h3")
    title = title_tag.get_text(strip=True) if title_tag else "N/A"

    year_tag = row.select_one("span.ipc-title__subtext")
    year = year_tag.get_text(strip=True) if year_tag else "N/A"

    rating_tag = row.select_one("span.ipc-rating-star--rating")
    rating = rating_tag.get_text(strip=True) if rating_tag else "N/A"

    movies_list.append([title, year, rating])

# Save to CSV in chosen folder
csv_file = os.path.join(save_path, "imdb_top_movies.csv")
df = pd.DataFrame(movies_list, columns=["Title", "Year", "Rating"])
df.to_csv(csv_file, index=False, encoding="utf-8")

# Print Top 10 Movies
for i, row in df.head(10).iterrows():
    print(f"{i+1}. {row['Title']} ({row['Year']}) — Rating: {row['Rating']}")

# Word Count on Titles
all_words = " ".join(df["Title"].tolist()).lower().split()
word_counts = Counter(all_words)
print("\nTop 10 words in movie titles:")
for word, count in word_counts.most_common(10):
    print(f"{word}: {count}")

# Generate Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(df["Title"]))
img_file = os.path.join(save_path, "imdb_wordcloud.png")
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.savefig(img_file)
plt.show()

🔎 How the IMDb Web Scraping Program Works

1. Importing Libraries

• The program first loads the necessary Python libraries:
o requests to fetch HTML content from IMDb.
o BeautifulSoup to parse and extract specific elements from the page.
o pandas to organize the extracted data and save it in CSV format.
o WordCloud & matplotlib to generate a word cloud and visualize text data.

2. Fetching the Web Page

• The program sends a request to the IMDb Top 250 Movies page.
• It downloads the HTML response of that page.
• This response contains all the information (movie titles, years, ratings, etc.).
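This fetching step can be sketched as a small helper. The `fetch_html` name is illustrative, not part of the program above; the timeout and the `raise_for_status()` call are defensive additions worth having in any scraper:

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page while identifying as a regular browser.

    A User-Agent header is sent because some sites (IMDb included)
    serve an error page to clients that identify as scripts.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return response.text

# Usage (requires a network connection):
# html = fetch_html("https://www.imdb.com/chart/top/")
```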
3. Parsing HTML with BeautifulSoup

• BeautifulSoup reads the raw HTML and converts it into a navigable structure (like a tree).
• Specific tags and classes are identified:
o Movie name
o Release year
o IMDb rating
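The parsing step can be tried in isolation on a simplified stand-in for one entry of the IMDb list (the real page's class names are shown here but may change at any time, which is a common cause of scraper breakage):

```python
from bs4 import BeautifulSoup

# A simplified, hand-written stand-in for one IMDb list entry.
html = """
<li class="ipc-metadata-list-summary-item">
  <h3>1. The Shawshank Redemption</h3>
  <span class="ipc-title__subtext">1994</span>
  <span class="ipc-rating-star--rating">9.3</span>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree: find the entry, then each field inside it.
row = soup.select_one("li.ipc-metadata-list-summary-item")
title = row.select_one("h3").get_text(strip=True)
year = row.select_one("span.ipc-title__subtext").get_text(strip=True)
rating = row.select_one("span.ipc-rating-star--rating").get_text(strip=True)

print(title, year, rating)
```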

4. Extracting Movie Data

• The program loops through each movie entry.
• For each one, it extracts:
o The title of the movie.
o The year it was released.
o Its IMDb rating.
• This information is stored in a Python list.

5. Saving Data into a CSV File

• The extracted list is converted into a pandas DataFrame (like an Excel table).
• The user specifies a folder where the file should be saved.
• The DataFrame is exported to a CSV file (e.g., imdb_top_movies.csv).
• This makes the data reusable in Excel, Python, or other tools.
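This step can be exercised on its own with a couple of hypothetical rows (written to a temporary folder here rather than a user-chosen path); reading the file back confirms the data survives the round trip:

```python
import os
import tempfile
import pandas as pd

# Hypothetical rows in the same [title, year, rating] shape the scraper collects.
movies_list = [
    ["The Shawshank Redemption", "1994", "9.3"],
    ["The Godfather", "1972", "9.2"],
]

df = pd.DataFrame(movies_list, columns=["Title", "Year", "Rating"])

# Export to CSV, then read it back to show the data is reusable elsewhere.
csv_file = os.path.join(tempfile.gettempdir(), "imdb_top_movies.csv")
df.to_csv(csv_file, index=False, encoding="utf-8")

reloaded = pd.read_csv(csv_file)
print(reloaded.shape)  # (2, 3)
```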

6. Word Count and Word Cloud

• The program combines all movie titles into a single text string.
• It counts how frequently each word appears across all titles.
• A word cloud image is generated, where bigger words represent higher frequency.
• Example: words like “The”, “Of”, or “Man” may appear larger.
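The counting half of this step needs only the standard library; with a few made-up sample titles:

```python
from collections import Counter

# Sample titles, invented for illustration.
titles = ["The Godfather", "The Dark Knight", "The Lord of the Rings"]

# Join every title into one string, lowercase it, and split into words,
# mirroring what the program does before building the word cloud.
words = " ".join(titles).lower().split()
counts = Counter(words)

print(counts.most_common(1))  # [('the', 4)]
```

Filtering out common stopwords like "the" and "of" before counting often gives a more informative word cloud; that refinement is left out of the program above.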
7. Output Clarification

• CSV File: Contains structured data → Movie Name, Release Year, IMDb Rating.
• Top 10 Movies: Displayed separately in the console.
• Word Cloud: A visual representation of the most common words in movie titles.

8. Before scraping, the webpage is just a normal site you see in the browser.

Scraping the website:

9. File saved to the folder as scrappingresults:

WORD CLOUD:

10. CONCLUSION

Web scraping has become a powerful tool in today’s data-driven world. It allows us to extract structured information from unstructured web pages, transforming raw HTML into meaningful datasets. Over time, it has evolved from simple copy-paste scripts to advanced frameworks and automation tools that can handle complex websites.

With wide applications in research, business, e-commerce, finance, and machine learning, web scraping helps in decision-making and insights generation. However, it also comes with challenges such as legal concerns, website restrictions, and potential ethical issues.

Overall, web scraping acts as a bridge between the vast unorganized data on the internet and the structured information needed for analysis, making it an indispensable skill in data science and modern digital solutions.
