The document explains web scraping as a method for extracting large volumes of data from websites into local files, emphasizing its utility for various applications. It details the three main steps of web scraping: getting content, parsing the response, and preserving the data, while outlining tools and libraries available like BeautifulSoup and Scrapy. Additionally, it addresses challenges, ethical considerations, and offers examples of practical applications, stressing the importance of conforming to a site's terms of use.
What is web scraping?
Web scraping is a technique for extracting large amounts of
data from websites, whereby the data is extracted and
saved to a local file on your computer.
The data can then be used for many purposes, such as displaying it on
your own website or application, performing data analysis,
or anything else.
Why should you scrape?
- An API may not provide what you need
- No rate limit
- Take what you really want!
- Reduces manual effort
- Swag!
How is it done?
Broadly, a three-step process:
1. Getting the content (in most cases HTML)
2. Parsing the response.
3. Improving performance and preserving the data
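Those three steps can be sketched end to end with nothing but Python's standard library; the HTML snippet, class names, and CSV fields below are illustrative assumptions, not taken from any of the talk's examples:

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (getting the content) is simulated with an inline HTML snippet;
# in practice this would come from an HTTP GET request.
HTML = """
<ul>
  <li class="post"><a href="/p/1">First post</a></li>
  <li class="post"><a href="/p/2">Second post</a></li>
</ul>
"""

# Step 2: parse the response, collecting link targets and titles.
class PostParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.posts = []      # (href, title) pairs
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and data.strip():
            self.posts.append((self._href, data.strip()))
            self._href = None

parser = PostParser()
parser.feed(HTML)

# Step 3: preserve the data, here as CSV in an in-memory buffer
# (a real scraper would write to a file or database).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["url", "title"])
writer.writerows(parser.posts)
print(buf.getvalue())
```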
GETTING THE CONTENT
● Using modules like urllib, urllib2, requests, mechanize, and selenium.
● Involves GET/POST request to the server.
● The response contains the information to be extracted.
● Sometimes not as easy as it may seem.
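A minimal sketch of this step, using the standard library's urllib.request against a throwaway local server so it runs without touching a real site (with the requests module, the fetch itself would be a one-liner, requests.get(url)):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A tiny local server standing in for the target website.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body><h1>Hello, scraper</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The GET request; identifying your scraper with a User-Agent is good practice.
url = f"http://127.0.0.1:{server.server_port}/"
req = urllib.request.Request(url, headers={"User-Agent": "example-scraper/0.1"})
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8")

server.shutdown()
print(html)
```

The response body in `html` is what the later parsing step works on.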
Extracting the Data
1. Using regular expressions and basic Python
Tricky, complex, and somewhat fragile.
2. Using Parsing Libraries
❏ Two different approaches are possible: simple parsing and search-tree
parsing.
❏ Some popular libraries are BeautifulSoup, lxml, and html5lib.
❏ Each library has its own techniques, and thus its own pros and
trade-offs.
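A small sketch of why parsers beat regular expressions, using only the standard library's html.parser (BeautifulSoup or lxml would make this shorter, but they are third-party installs); the HTML and class names are invented for illustration:

```python
import re
from html.parser import HTMLParser

html = '<p class="price">$19.99</p><p class = "price" >$5.00</p>'

# Regex approach: brittle -- this pattern misses the second <p>
# because of the extra whitespace around its attribute.
regex_prices = re.findall(r'<p class="price">([^<]+)</p>', html)

# Parser approach: tolerant of whitespace and attribute variations.
class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data)

parser = PriceParser()
parser.feed(html)
print(regex_prices)   # only one match
print(parser.prices)  # both prices
```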
Examples
Example 1: Scraping tweets from Twitter using BeautifulSoup
and Python's requests module
Code
Example 2: Scraping top Stack Overflow posts using Scrapy
Code
Example 3: Using Selenium to log in and fetch library
details from a university library site that uses dynamic
HTML.
WHAT TO USE WHERE
1. Handling dynamically generated HTML
Solutions: Selenium or SpiderMonkey
2. Cookie-based authentication
Solution: the requests module (its Session object persists cookies).
3. Simple scraping
Solutions: BeautifulSoup+Requests, Scrapy, Selenium
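For the cookie-based authentication case, requests.Session() carries cookies across requests automatically. The sketch below shows the same idea with only the standard library's http.cookiejar, against a toy local server whose /login and /private endpoints are invented for illustration:

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy server: /login sets a session cookie; /private requires it.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/login":
            self.send_response(200)
            self.send_header("Set-Cookie", "session=abc123")
            self.end_headers()
        elif self.path == "/private":
            if "session=abc123" in (self.headers.get("Cookie") or ""):
                body, code = b"secret data", 200
            else:
                body, code = b"forbidden", 403
            self.send_response(code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# An opener with a cookie jar persists cookies across requests,
# much like requests.Session() does.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.open(base + "/login")                  # receives the cookie
page = opener.open(base + "/private").read()  # sends it back automatically
server.shutdown()
print(page)
```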
Scraping hacks
1. Overcoming CAPTCHAs
Lookup tables, one-time manual entry, Death By Captcha (a paid service)
2. Per-IP-address query limits
Using tsocks, ssh -D (a dynamic SOCKS proxy), and socks monkey.
3. Improving performance
Multiprocessing, gevent, and grequests (the successor to the long-removed requests.async module).
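As a rough sketch of the performance idea, here is the thread-pool variant from the standard library (concurrent.futures); the URLs are placeholders and the fetch function just sleeps to simulate network latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

URLS = [f"https://example.com/page/{i}" for i in range(8)]  # placeholder URLs

def fetch(url):
    # Stand-in for a real HTTP request (urllib/requests);
    # the sleep simulates network latency.
    time.sleep(0.1)
    return f"content of {url}"

# Sequentially this would take ~0.8 s for 8 pages;
# with a thread pool the network waits overlap.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, URLS))
elapsed = time.monotonic() - start

print(f"fetched {len(results)} pages in {elapsed:.2f}s")
```

Threads (or gevent's green threads) suit I/O-bound scraping; multiprocessing pays off mainly when parsing itself is CPU-bound.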
Example 3
Automating My College Library
Problems :
1. Authentication
2. Dynamically Generated <iframe> tag
Solution
Selenium with a headless browser such as PhantomJS (now unmaintained; headless Chrome or Firefox are the usual choice today)
Alternative: Mechanize
Code
Ethics of Scraping
Exceeding authorized use of the site
Means doing anything that is prohibited in the Terms of Use
(See CFAA, breach of contract, unjust enrichment, trespass
to chattels, and various state laws similar to CFAA)
Copyright Issues
If the material you are scraping is not factual, but
something that required some amount of creativity to create,
you have copyright to worry about.
Quick tip: conform to the site's robots.txt file.
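Python ships a robots.txt parser in the standard library; a minimal sketch (the robots.txt content and user-agent string below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt (illustrative content, not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
# Against a live site you would instead do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()

print(rp.can_fetch("my-scraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-scraper", "https://example.com/private/page"))  # False
print(rp.crawl_delay("my-scraper"))                                    # 10
```

Checking can_fetch() before each request, and honoring the crawl delay, keeps a scraper on the polite side of the line.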
● The brute-force way to get the information you need.
● Legal in many cases, but mind the terms-of-use and copyright issues above.
● Not always that easy.