Module 5

Analyzing a Web Page

Before scraping data, it's important to understand the structure of the web page you are working
with. Web pages are typically made up of HTML, CSS, and JavaScript.

Key Aspects of Analyzing a Web Page:

●​ HTML Structure: Web pages are built using HTML, which defines the structure and
content of the page. To scrape data, you need to identify the relevant HTML elements,
such as <div>, <span>, <a>, <p>, and <table>, where the data resides.
● CSS Selectors: These are patterns used to select elements within the HTML structure. They are crucial for pinpointing the specific elements you want to scrape. You can use tools like Chrome Developer Tools to inspect the page and find the CSS selectors (a short example follows this list).
●​ JavaScript: Some websites dynamically load data using JavaScript, so you need to
handle such pages differently. Tools like Selenium or Puppeteer can help interact with
JavaScript-heavy pages.
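
For instance, the selector div.price > span matches <span> elements that are direct children of a <div> with class price. A minimal sketch of using such a selector in Python, assuming BeautifulSoup (introduced later in this module) and placeholder HTML and class names:

from bs4 import BeautifulSoup

# Placeholder HTML snippet for illustration
html = '<div class="price"><span>19.99</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: <span> elements directly inside a <div> with class "price"
for span in soup.select('div.price > span'):
    print(span.text)  # 19.99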

Example - Inspecting a Web Page Using Chrome DevTools:

1.​ Open the website in Google Chrome.


2.​ Right-click on the page and select "Inspect" to open DevTools.
3.​ Use the "Elements" tab to browse through the HTML structure of the page.
4. Look for the data you want to scrape, and note the tag names, classes, or IDs.
5.​ Network Tab: The Network tab in Chrome DevTools shows the HTTP requests made by
the page. By analyzing the requests, you can identify API calls, request parameters, and
response data.
6.​ XHR (XMLHttpRequest): Websites often use XHR requests to fetch data dynamically
in the background. These requests can be captured and replicated in your web scraping
script to fetch data directly from the API without scraping the HTML.
7.​ Timeline Tab: The Timeline tab in DevTools can show the sequence of network
activities, script execution, and page rendering. It's useful for debugging slow-loading
elements or identifying resource bottlenecks.

Steps to Analyze Network Requests:

1.​ Open the Network tab in Chrome DevTools.


2.​ Refresh the webpage to start capturing network traffic.
3. Look at each request's method (such as GET or POST) and type.
4.​ Filter requests by type, such as XHR, to see API calls that load data in the background.
5.​ Examine the request headers, parameters, and response data to determine how to fetch
similar data.
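
Once you have identified an XHR/API request in the Network tab, you can often replicate it directly with requests instead of parsing HTML. A minimal sketch, where the endpoint, parameters, and headers are placeholders to be copied from the captured request:

import requests

# Hypothetical API endpoint discovered in the Network tab (XHR filter)
api_url = 'https://example.com/api/items'
params = {'page': 1, 'per_page': 20}      # query parameters seen in DevTools
headers = {'User-Agent': 'Mozilla/5.0'}   # copy any required headers from the captured request

response = requests.get(api_url, params=params, headers=headers)
if response.status_code == 200:
    data = response.json()  # many XHR endpoints return JSON
    print(data)
else:
    print('Request failed:', response.status_code)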

Interacting with JavaScript


Many modern websites use JavaScript to render content dynamically. This means the data you need may not be directly available in the initial HTML response but is instead loaded by JavaScript after the page is rendered.

Tools for Interacting with JavaScript:

● Selenium: Selenium is a browser automation tool that allows you to control a web browser programmatically. It can render pages, execute JavaScript, and wait for dynamic content to load before extracting data.
● Puppeteer: Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome or Chromium. It's widely used for scraping JavaScript-heavy websites.

Selenium Example:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up the driver (ensure you have a WebDriver like ChromeDriver installed)
driver = webdriver.Chrome()

# Open a website
driver.get("https://example.com")

# Wait up to 10 seconds when locating elements, so dynamic content can load
driver.implicitly_wait(10)

# Extract the page source after JavaScript is rendered
page_source = driver.page_source

# Find elements by XPath or CSS Selector
element = driver.find_element(By.XPATH, "//div[@class='example-class']")

# Extract data from the element
data = element.text
print(data)

# Close the browser
driver.quit()
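
implicitly_wait sets a global timeout that applies whenever elements are looked up. When a specific element appears only after JavaScript finishes running, an explicit wait is often more reliable. A minimal sketch using Selenium's WebDriverWait (the class name is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Block for up to 10 seconds until the element is present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "example-class"))
)
print(element.text)

driver.quit()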

In-Depth Analysis of a Page

In-depth analysis of a page involves understanding both the static and dynamic content, how data
is loaded, and how elements are structured. By analyzing a page thoroughly, you can identify the
best way to scrape it, whether it involves scraping static HTML, fetching data from an API, or
executing JavaScript to reveal dynamic content.

Steps for In-Depth Analysis:

1.​ Inspect the HTML: Use Chrome DevTools to inspect the HTML structure and identify
the location of the data.
2. Check for Dynamic Content: Look for AJAX or XHR requests in the Network tab to determine if content is being loaded dynamically (a quick check is sketched after this list).
3.​ JavaScript Rendering: Identify any JavaScript that needs to be executed to load the
content. Use Selenium or Puppeteer if necessary.
4.​ Look for API Endpoints: Some sites load data from RESTful APIs, which can be easier
to scrape than rendering JavaScript. Check the network traffic for such endpoints.
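
A quick way to carry out step 2 is to fetch the raw HTML with requests and check whether the data you saw in the browser is present. A minimal sketch, with a placeholder URL and search string:

import requests

url = 'https://example.com'  # placeholder URL
html = requests.get(url).text

# Text copied from the rendered page in the browser (placeholder)
target_text = 'Example product name'

if target_text in html:
    print('Data is in the static HTML - requests plus an HTML parser is enough')
else:
    print('Data is loaded dynamically - check the Network tab or use a browser-based tool')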

Getting Pages

To scrape data from a website, you must first retrieve the HTML of the page. This is typically
done using libraries like requests in Python or axios in JavaScript.

import requests

# Send a GET request to the website
url = 'https://example.com'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text  # Get the HTML content of the page
    print(html_content)
else:
    print("Failed to retrieve the page")

Reading a Web Page

After retrieving the web page, you need to parse the HTML content to extract the required data.
This is done using HTML parsing libraries that help you navigate and extract elements from the
page.

Tools for Reading and Parsing HTML:

●​ BeautifulSoup (Python): A popular library for parsing HTML and XML documents. It
makes it easy to extract data using tags, attributes, and CSS selectors.

BeautifulSoup Example:

from bs4 import BeautifulSoup
import requests

# Fetch the page
url = 'https://example.com'
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find elements by tag name or CSS class
element = soup.find('div', class_='example-class')

# Extract the text from the element
data = element.text
print(data)

Simple Web Scraper: Write a Python program using requests and BeautifulSoup to scrape all
the headings (<h1>, <h2>, etc.) from a webpage.

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all headings
headings = soup.find_all(['h1', 'h2', 'h3'])

for heading in headings:
    print(heading.text.strip())

Dynamic Web Scraper: Write a Python program using Selenium to scrape product names and prices from an e-commerce page that renders its content with JavaScript.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example-ecommerce.com')
driver.implicitly_wait(10)  # Wait for page load

# Extract product names and prices
products = driver.find_elements(By.CLASS_NAME, 'product-name')
prices = driver.find_elements(By.CLASS_NAME, 'product-price')

for product, price in zip(products, prices):
    print(f'Product: {product.text}, Price: {price.text}')

driver.quit()

Browser-Based Parsing

For websites with content loaded dynamically via JavaScript, the traditional HTTP request won't
work because the content is rendered by the browser. In such cases, browser-based parsing tools
like Selenium or Puppeteer are used.

Screen Reading with Selenium:

Selenium is a powerful browser automation tool that allows you to interact with web browsers programmatically. It can be used to control a browser (like Chrome or Firefox), simulate user interactions, and retrieve the content after the page has fully loaded, including JavaScript-generated content.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Set up the Selenium WebDriver
driver = webdriver.Chrome()

# Open a webpage
driver.get('https://example.com')

# Wait for the page to load and for JavaScript to render
time.sleep(5)

# Find elements on the page (e.g., extracting text from an element with a specific ID)
element = driver.find_element(By.ID, 'example-id')
print(element.text)

# Extract links from all anchor tags
links = driver.find_elements(By.TAG_NAME, 'a')
for link in links:
    print(link.get_attribute('href'))

# Close the browser
driver.quit()

Ghost (Ghost.py) is another tool used for headless browser scraping, similar to Selenium but with a lighter, more efficient architecture. It works well for scraping JavaScript-rendered content and handling websites that require minimal browser interaction. The example below uses Pyppeteer, a Python port of Puppeteer that fills the same role.

Example Program (Pyppeteer):

import asyncio
from pyppeteer import launch

async def main():
    # Launch a headless browser
    browser = await launch(headless=True)
    page = await browser.newPage()

    # Navigate to the URL
    await page.goto('https://example.com')

    # Wait for the page to load (waiting for a specific element to appear)
    await page.waitForSelector('h1')

    # Extract content (e.g., text from an element)
    content = await page.evaluate('document.querySelector("h1").innerText')
    print(content)

    # Close the browser
    await browser.close()

# Run the async main function
asyncio.get_event_loop().run_until_complete(main())

Key Features of Ghost/Pyppeteer:

●​ Headless Browsing: Ghost and Pyppeteer allow you to run a web browser without the
graphical interface (headless mode).
●​ JavaScript Rendering: Just like Selenium, Ghost allows you to interact with and scrape
content generated by JavaScript.
●​ Asynchronous Operations: Pyppeteer operates asynchronously using asyncio, making it
highly efficient for tasks like scraping multiple pages in parallel.
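
A minimal sketch of that parallelism with Pyppeteer, scraping the titles of several pages concurrently (the URLs are placeholders):

import asyncio
from pyppeteer import launch

async def fetch_title(browser, url):
    # Each page gets its own tab
    page = await browser.newPage()
    await page.goto(url)
    title = await page.evaluate('document.title')
    await page.close()
    return title

async def main():
    browser = await launch(headless=True)
    urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

    # Run all page fetches concurrently
    titles = await asyncio.gather(*(fetch_title(browser, url) for url in urls))
    print(titles)

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())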

Introduction to Scrapy

Scrapy is an open-source and collaborative web crawling framework for Python. It allows you to extract data from websites, process it, and store it in your desired format. Scrapy handles many of the complexities of web scraping, such as managing requests and responses, retries, redirection, and concurrent requests.

Key Features of Scrapy:

●​ Fast and Efficient: Scrapy uses asynchronous networking and is designed for
performance.
●​ Powerful Spidering: It provides tools to navigate through websites, scrape data, and even
follow links to other pages.
●​ Data Pipelines: Scrapy includes a built-in mechanism for processing and storing scraped
data.
●​ Built-in Support for Handling Requests: Scrapy handles things like HTTP requests,
retries, redirects, and following links.
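
A minimal sketch of a Scrapy spider illustrating these features (the spider name, start URL, and selectors are placeholders); assuming Scrapy is installed and the file is saved as, say, heading_spider.py, it can be run with scrapy runspider heading_spider.py -o headings.json:

import scrapy

class HeadingSpider(scrapy.Spider):
    name = 'headings'
    start_urls = ['https://example.com']  # placeholder start URL

    def parse(self, response):
        # Scrapy passes each downloaded page to parse() as a response object
        for heading in response.css('h1::text, h2::text').getall():
            yield {'heading': heading.strip()}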

Crawling Whole Websites with Scrapy

To crawl an entire website, you can configure the spider to follow links from the initial pages to
subsequent pages. This allows Scrapy to recursively scrape multiple pages and links across a site.

Crawling Multiple Pages:

When scraping a website, it’s often necessary to follow links within the site to collect data from
multiple pages. You can do this by using the response.follow method to follow links from one
page to the next.
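
A minimal sketch of a spider that follows "next page" links with response.follow (the start URL and selectors are placeholders):

import scrapy

class CrawlPagesSpider(scrapy.Spider):
    name = 'crawl_pages'
    start_urls = ['https://example.com/page/1']  # placeholder start URL

    def parse(self, response):
        # Scrape items on the current page
        for title in response.css('h2.title::text').getall():
            yield {'title': title.strip()}

        # Follow the link to the next page, if any, and parse it with the same callback
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)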
