Module 5
Analyzing a Web Page
Before scraping data, it's important to understand the structure of the web page you are working
with. Web pages are typically made up of HTML, CSS, and JavaScript.
Key Aspects of Analyzing a Web Page:
● HTML Structure: Web pages are built using HTML, which defines the structure and
content of the page. To scrape data, you need to identify the relevant HTML elements,
such as <div>, <span>, <a>, <p>, and <table>, where the data resides.
● CSS Selectors: These are patterns used to select elements within the HTML structure.
They are crucial for pinpointing the specific elements you want to scrape. You can use
tools like Chrome Developer Tools to inspect the page and find the CSS selectors (see
the short example after this list).
● JavaScript: Some websites dynamically load data using JavaScript, so you need to
handle such pages differently. Tools like Selenium or Puppeteer can help interact with
JavaScript-heavy pages.
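For example, here is a minimal sketch of using CSS selectors with BeautifulSoup (the URL and the
'product' class are placeholders, not taken from a real page):
from bs4 import BeautifulSoup
import requests
# Fetch the page (example.com is a placeholder URL)
html = requests.get('https://example.com').text
soup = BeautifulSoup(html, 'html.parser')
# CSS selector: every link inside an element with the hypothetical class 'product'
for link in soup.select('div.product a'):
    print(link.get('href'), link.get_text(strip=True))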
Example - Inspecting a Web Page Using Chrome DevTools:
1. Open the website in Google Chrome.
2. Right-click on the page and select "Inspect" to open DevTools.
3. Use the "Elements" tab to browse through the HTML structure of the page.
4. Look for the data you want to scrape, and note its tag names, classes, or IDs.
5. Network Tab: The Network tab in Chrome DevTools shows the HTTP requests made by
the page. By analyzing the requests, you can identify API calls, request parameters, and
response data.
6. XHR (XMLHttpRequest): Websites often use XHR requests to fetch data dynamically
in the background. These requests can be captured and replicated in your web scraping
script to fetch data directly from the API without scraping the HTML.
7. Performance (Timeline) Tab: The Performance tab (formerly called Timeline) in DevTools
shows the sequence of network activities, script execution, and page rendering. It's useful
for debugging slow-loading elements or identifying resource bottlenecks.
Steps to Analyze Network Requests:
1. Open the Network tab in Chrome DevTools.
2. Refresh the webpage to start capturing network traffic.
3. Look at each request's method (GET, POST) and type (XHR, Fetch, Document).
4. Filter requests by type, such as XHR, to see API calls that load data in the background.
5. Examine the request headers, parameters, and response data to determine how to fetch
similar data.
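Once you have identified such an endpoint, you can often call it directly instead of parsing HTML.
The sketch below assumes a hypothetical JSON endpoint and query parameters discovered in the
Network tab:
import requests
# Hypothetical API endpoint and parameters observed in the Network tab
url = 'https://example.com/api/items'
params = {'page': 1, 'per_page': 20}
response = requests.get(url, params=params)
if response.status_code == 200:
    data = response.json()  # many XHR endpoints return JSON
    for item in data.get('items', []):
        print(item)
else:
    print('Request failed:', response.status_code)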
Interacting with JavaScript
Many modern websites use JavaScript to render content dynamically. This means the data you
need may not be directly available in the initial HTML response but instead loaded through
JavaScript after the page is rendered.
Tools for Interacting with JavaScript:
● Selenium: Selenium is a browser automation tool that allows you to control a web browser
programmatically. It can render pages, execute JavaScript, and wait for dynamic content
to load before extracting data.
● Puppeteer: Puppeteer is a Node.js library that provides a high-level API for controlling
headless Chrome or Chromium. It’s widely used for scraping JavaScript-heavy websites.
Selenium Example:
from selenium import webdriver
from selenium.webdriver.common.by import By
# Set up the driver (ensure you have a WebDriver like ChromeDriver installed)
driver = webdriver.Chrome()
# Open a website
driver.get("https://example.com")
# Set an implicit wait so element lookups retry for up to 10 seconds
driver.implicitly_wait(10)
# Extract the page source after JavaScript is rendered
page_source = driver.page_source
# Find elements by XPath or CSS Selector
element = driver.find_element(By.XPATH, "//div[@class='example-class']")
# Extract data from the element
data = element.text
print(data)
# Close the browser
driver.quit()
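The example above relies on an implicit wait; for dynamic content it is often more reliable to wait
explicitly for a specific element. A minimal sketch using Selenium's WebDriverWait (the selector
is a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://example.com")
# Block until the element appears, or raise a TimeoutException after 10 seconds
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.example-class"))
)
print(element.text)
driver.quit()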
In-Depth Analysis of a Page
In-depth analysis of a page involves understanding both the static and dynamic content, how data
is loaded, and how elements are structured. By analyzing a page thoroughly, you can identify the
best way to scrape it, whether it involves scraping static HTML, fetching data from an API, or
executing JavaScript to reveal dynamic content.
Steps for In-Depth Analysis:
1. Inspect the HTML: Use Chrome DevTools to inspect the HTML structure and identify
the location of the data.
2. Check for Dynamic Content: Look for AJAX or XHR requests in the Network tab to
determine if content is being loaded dynamically.
3. JavaScript Rendering: Identify any JavaScript that needs to be executed to load the
content. Use Selenium or Puppeteer if necessary.
4. Look for API Endpoints: Some sites load data from RESTful APIs, which can be easier
to scrape than rendering JavaScript. Check the network traffic for such endpoints.
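A quick way to combine steps 1 and 2 is to fetch the raw HTML with requests and check whether
the data you saw in DevTools is already there; if it is not, the content is probably rendered by
JavaScript or loaded from an API. A rough sketch (the URL and marker text are placeholders):
import requests
url = 'https://example.com'
html = requests.get(url).text
# If text visible in DevTools is missing from the raw HTML,
# it is likely loaded dynamically (via XHR/JavaScript)
marker = 'Example product name'
if marker in html:
    print('Content is in the static HTML - plain requests + parsing is enough')
else:
    print('Content is likely loaded dynamically - check the Network tab or use Selenium')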
Getting Pages
To scrape data from a website, you must first retrieve the HTML of the page. This is typically
done using libraries like requests in Python or axios in JavaScript.
import requests
# Send a GET request to the website
url = 'https://example.com'
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    html_content = response.text  # Get the HTML content of the page
    print(html_content)
else:
    print("Failed to retrieve the page")
Reading a Web Page
After retrieving the web page, you need to parse the HTML content to extract the required data.
This is done using HTML parsing libraries that help you navigate and extract elements from the
page.
Tools for Reading and Parsing HTML:
● BeautifulSoup (Python): A popular library for parsing HTML and XML documents. It
makes it easy to extract data using tags, attributes, and CSS selectors.
BeautifulSoup Example:
from bs4 import BeautifulSoup
import requests
# Fetch the page
url = 'https://example.com'
response = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find an element by tag name and CSS class
element = soup.find('div', class_='example-class')
# Extract the text from the element
data = element.text
print(data)
Simple Web Scraper: Write a Python program using requests and BeautifulSoup to scrape all
the headings (<h1>, <h2>, etc.) from a webpage.
import requests
from bs4 import BeautifulSoup
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all headings
headings = soup.find_all(['h1', 'h2', 'h3'])
for heading in headings:
    print(heading.text.strip())
Dynamic Content Scraper: Use Selenium to extract product names and prices from a
JavaScript-rendered page (the URL and class names below are placeholders).
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://example-ecommerce.com')
driver.implicitly_wait(10)  # Retry element lookups for up to 10 seconds
# Extract product names and prices
products = driver.find_elements(By.CLASS_NAME, 'product-name')
prices = driver.find_elements(By.CLASS_NAME, 'product-price')
for product, price in zip(products, prices):
    print(f'Product: {product.text}, Price: {price.text}')
driver.quit()
Browser-Based Parsing
For websites with content loaded dynamically via JavaScript, a plain HTTP request is not enough,
because the content is only rendered once the browser executes the page's scripts. In such cases,
browser-based parsing tools like Selenium or Puppeteer are used.
Screen Reading with Selenium:
Selenium is a powerful browser automation tool that allows you to interact with web browsers
programmatically. It can be used to control a browser (like Chrome or Firefox), simulate user
interactions, and retrieve the content after the page has fully loaded, including
JavaScript-generated content.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
# Set up the Selenium WebDriver
driver = webdriver.Chrome()
# Open a webpage
driver.get('https://example.com')
# Pause for a fixed time so JavaScript has a chance to render
time.sleep(5)
# Find elements on the page (e.g., extracting text from an element with a specific ID)
element = driver.find_element(By.ID, 'example-id')
print(element.text)
# Extract links from all anchor tags
links = driver.find_elements(By.TAG_NAME, 'a')
for link in links:
print(link.get_attribute('href'))
# Close the browser
driver.quit()
Ghost (ghost.py) is another tool used for headless browser scraping, similar to Selenium but with a
lighter, more efficient architecture. It works well for scraping JavaScript-rendered content and
handling websites that require minimal browser interaction. The example below uses Pyppeteer, a
Python port of Puppeteer that fills a similar role.
Pyppeteer Example Program:
import asyncio
from pyppeteer import launch
async def main():
    # Launch a headless browser
    browser = await launch(headless=True)
    page = await browser.newPage()
    # Navigate to the URL
    await page.goto('https://example.com')
    # Wait for the page to load (waiting for a specific element to appear)
    await page.waitForSelector('h1')
    # Extract content (e.g., text from an element)
    content = await page.evaluate('document.querySelector("h1").innerText')
    print(content)
    # Close the browser
    await browser.close()
# Run the async main function
asyncio.get_event_loop().run_until_complete(main())
Key Features of Ghost/Pyppeteer:
● Headless Browsing: Ghost and Pyppeteer allow you to run a web browser without the
graphical interface (headless mode).
● JavaScript Rendering: Just like Selenium, Ghost allows you to interact with and scrape
content generated by JavaScript.
● Asynchronous Operations: Pyppeteer operates asynchronously using asyncio, making it
highly efficient for tasks like scraping multiple pages in parallel.
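As a rough sketch of that parallelism, the snippet below opens several tabs in one headless browser
and scrapes their titles concurrently (the URLs are placeholders):
import asyncio
from pyppeteer import launch
async def fetch_title(browser, url):
    # Each coroutine gets its own tab in the shared browser
    page = await browser.newPage()
    await page.goto(url)
    title = await page.evaluate('document.title')
    await page.close()
    return url, title
async def main():
    browser = await launch(headless=True)
    urls = ['https://example.com', 'https://example.org']  # placeholder URLs
    # Run the page fetches concurrently
    results = await asyncio.gather(*(fetch_title(browser, u) for u in urls))
    for url, title in results:
        print(url, '->', title)
    await browser.close()
asyncio.run(main())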
Introduction to Scrapy
Scrapy is an open-source and collaborative web crawling framework for Python. It allows you to
extract data from websites, process it, and store it in your desired format. Scrapy handles many of
the complexities of web scraping, such as managing requests and responses, retries, redirects, and
concurrent requests.
Key Features of Scrapy:
● Fast and Efficient: Scrapy uses asynchronous networking and is designed for
performance.
● Powerful Spidering: It provides tools to navigate through websites, scrape data, and even
follow links to other pages.
● Data Pipelines: Scrapy includes a built-in mechanism for processing and storing scraped
data.
● Built-in Support for Handling Requests: Scrapy handles things like HTTP requests,
retries, redirects, and following links.
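A minimal spider sketch, using the structure of Scrapy's tutorial site (https://quotes.toscrape.com)
and runnable with scrapy runspider quotes_spider.py -o quotes.json:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']
    def parse(self, response):
        # Extract each quote block using CSS selectors
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }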
Crawling Whole Websites with Scrapy
To crawl an entire website, you can configure the spider to follow links from the initial pages to
subsequent pages. This allows Scrapy to recursively scrape multiple pages and links across a site.
Crawling Multiple Pages:
When scraping a website, it’s often necessary to follow links within the site to collect data from
multiple pages. You can do this by using the response.follow method to follow links from one
page to the next.
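A sketch of following pagination links with response.follow, again assuming the quotes site used
in the previous example:
import scrapy
class CrawlingQuotesSpider(scrapy.Spider):
    name = 'crawling_quotes'
    start_urls = ['https://quotes.toscrape.com/']
    def parse(self, response):
        # Scrape items on the current page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # Follow the "next page" link, if any, and parse it with the same callback
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)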