Web Scraping using Python | Web Screen Scraping
Python has become the most popular language for web scraping for many
reasons. These include its flexibility, ease of coding, dynamic typing, a
large collection of libraries for manipulating data, and support for the
most common scraping tools, such as Scrapy, Beautiful Soup, and Selenium.
What is Web Scraping?
Web scraping is a software technique for extracting data from websites.
It focuses on transforming unstructured data on the web (typically HTML)
into structured data that can be stored and analyzed.
Why Do We Scrape?
• Web Pages Contain a Wealth of Data, Designed Mostly for Human Consumption
• Static Websites
• Interfacing with Third Parties That Offer No API Access
• Websites Are More Important than APIs
• The Data is Already Available
• No Rate Limiting
• Anonymous Access
Fetch The Data
• Involves Finding the Endpoint: the URL or URLs
• Sending an HTTP Request to the Server
• Using the Requests Library:
    import requests

    # Fetch the page; .content holds the raw response body as bytes
    data = requests.get('http://google.com/')
    html = data.content
Processing
• Avoid Using Regular Expressions (Regex) to Parse HTML
• Reasons Not to Use Them:
1. They Are Fragile
2. They Are Really Hard to Maintain
3. Improper HTML & Encoding Handling
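The fragility point can be illustrated with a small sketch. It assumes a hypothetical link whose attribute order and quoting change between two versions of a page, and it uses the standard library's `html.parser` as a stand-in for a real HTML parser; the markup, the pattern, and the `LinkCollector` name are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Two versions of the same link; only the attribute order and quoting differ.
OLD_HTML = '<a class="item" href="/p/1">First</a>'
NEW_HTML = "<a href='/p/1' class='item'>First</a>"

# A regex tied to one exact markup layout.
LINK_RE = re.compile(r'<a class="item" href="([^"]+)">')


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whatever the attribute order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(LINK_RE.findall(OLD_HTML))    # the regex matches the old markup
print(LINK_RE.findall(NEW_HTML))    # and silently finds nothing in the new one
print(links_with_parser(OLD_HTML))  # the parser handles both versions
print(links_with_parser(NEW_HTML))
```

The regex does not raise an error when the markup changes; it simply returns an empty list, which is what makes this kind of breakage hard to notice and maintain.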
Use Beautiful Soup For Parsing
• Provides Simple Methods to Search, Navigate, and Select
• Deals with Broken Web Pages Really Well
• Auto-Detects Encoding
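A minimal sketch of the search, navigate, and select methods, assuming the `beautifulsoup4` package is installed. The HTML snippet is hypothetical and deliberately leaves its `<li>` tags unclosed to stand in for a broken page.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately broken markup: the <li> tags are never closed.
BROKEN_HTML = """
<html><body>
<h1>Products</h1>
<ul id="products">
  <li class="item"><a href="/p/1">First</a>
  <li class="item"><a href="/p/2">Second</a>
</ul>
</body></html>
"""

# "html.parser" is the parser bundled with Python; lxml or html5lib
# can be used instead if they are installed.
soup = BeautifulSoup(BROKEN_HTML, "html.parser")

# Search: find a single element by tag name.
heading = soup.find("h1").get_text(strip=True)

# Select: CSS selectors return a list of matching tags.
links = [a["href"] for a in soup.select("ul#products li.item a")]

# Navigate: move from one element to a related one.
first_link = soup.find("a")
parent_id = first_link.find_parent("ul")["id"]

print(heading, links, parent_id)
```

Different parsers may build slightly different trees from broken markup, so selectors that depend on exact nesting are worth checking against real pages.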
Export The Data
• Database (Relational or Non-Relational)
• File (XML, YAML, CSV, JSON, etc.)
• APIs
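As one possible illustration of the file option, the sketch below serializes scraped records to CSV and JSON with the standard library. The `rows` data and the helper names `to_csv` and `to_json` are hypothetical.

```python
import csv
import io
import json

# Hypothetical scraped records; in practice these come from the parsing step.
rows = [
    {"title": "First", "url": "/p/1"},
    {"title": "Second", "url": "/p/2"},
]


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize a list of dicts to a JSON array."""
    return json.dumps(records, indent=2)


csv_text = to_csv(rows, ["title", "url"])
json_text = to_json(rows)
print(csv_text)
print(json_text)
```

To write to disk instead of a string, the same `csv.DictWriter` can be given a file opened with `open(path, "w", newline="")`.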
Challenges
• External Sites Can Change Without Warning
  • Figuring Out the Frequency of Changes is Difficult
  • Changes Can Break Scrapers Easily
• Bad HTTP Status Codes
  • Example: Using 200 OK to Signal an Error
  • Cannot Always Trust Your HTTP Library's Default Behavior
• Messy HTML Markup
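One way to guard against a site that answers 200 OK for an error page is to check the body as well as the status code. The sketch below is an assumption-laden example: the `ERROR_MARKERS` strings and the `looks_valid` helper are hypothetical and would need tuning for each target site.

```python
# Markers that suggest an error page; these strings are assumptions and
# would need tuning for each target site.
ERROR_MARKERS = ("page not found", "access denied", "something went wrong")


def looks_valid(status_code, body, required_text):
    """Return True only if the response passes status and content checks."""
    if not 200 <= status_code < 300:
        return False
    lowered = body.lower()
    if any(marker in lowered for marker in ERROR_MARKERS):
        return False
    # A page missing the content we came for is treated as a failure too.
    return required_text.lower() in lowered


print(looks_valid(200, "<h1>Products</h1>", "Products"))
print(looks_valid(200, "<h1>Page Not Found</h1>", "Products"))
```

With the Requests library, this could be called as `looks_valid(response.status_code, response.text, "Products")`.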
Scrapy – A Framework For Web Scraping
• Uses XPath to Select Elements
• Provides an Interactive Shell
• Using Scrapy:
1. Define a Model to Store Items
2. Create Your Spider to Extract Items
3. Write a Pipeline to Store Them