Getting started with Web Scraping in Python

Scrapingtotherescue
(Webscrapingusingpython)
By : Satwik Kansal and Pradhvan Bisht

Whatiswebscraping ?
Web scraping is a technique to extract large amounts of
data from websites whereby the data is extracted and
saved to a local file in your computer.
The data can be used for several purposes like displaying on
your own website and application, performing data analysis
or for any other reason.

whyshouldyouscrape
- API may not provide what you need
- No rate limit
- Take what you really want!
- Reduces manual effort
- Swag!

Thingsthatmightcomehandy
-HTML
-CSS
-XPATH
-Regular Expressions

Howit’sdone?
Broadly a Three Step Process
1. Getting the content (in most cases HTML)
2. Parsing the response.
3. Optimizing/Improving the performance and preserving the data

GETTINGTHECONTENT
● Using modules like urllib, urllib2, requests, mechanize and selenium.
● Involves GET/POST request to the server.
● The response contains the information to be extracted.
● Sometimes not as easy as it may seem.

ExtractingTheData
1. Using Regular Expression and Basic python
Tricky, complex and kind of fragile.
2. Using Parsing Libraries
❏ Two different approaches possible -- Simple Parsing and Search Tree
parsing.
❏ Some popular libraries are BeautifulSoup, Lxml, and html5lib.
❏ Each modules has its own techniques and thus its own pros and trade-
offs

ComparingParsers
BEAUTIFUL SOUP
LXML
SCRAPY
HTML5LIB

PreservingTheData
1. Writing to a file.
2. Exporting as csv or excel file.
3. Storing in a database.

Examples
Example 1 : Scraping Tweets from Twitter using BeautifulSoup
and python’s Requests module
Code
Example 2 : Scraping top Stackoverflow posts using Scrapy
Code
Example 3 : Using Selenium to Log in and fetch library
details from a university library site which uses Dynamic
HTML.

WHATTOUSEWHERE
1. Handling dynamically generated html
Solutions: Selenium or Spidermonkey
2. Cookie based Authentication
Solution : Requests module.
3. Simple scraping
Solutions: BeautifulSoup+Requests, Scrapy, Selenium

Scrapinghacks
1. Overcoming captchas
Lookup tables, One time manual entry , Death By Captchas (paid service)
2. Per IP address query limit
Using tsocks, ssh_D and socks monkey.
3. Improving performance
Multiprocessing , gevent and requests.async() method.

Example3
Automating My College Library
Problems :
1. Authentication
2. Dynamically Generated <iframe> tag
Solution
Selenium with headless Browser like PhantomJS
Alternative: Mechanize
Code

EthicsOfScraping
Exceeding authorized use of the site
Means doing anything that is prohibited in the Terms of Use
(See CFAA, breach of contract, unjust enrichment, trespass
to chattels, and various state laws similar to CFAA)
Copyright Issues
If the material you are scraping is not factual, but
something that required some amount of creativity to create,
you have copyright to worry about.
QuickTip -- Conform to the the robots.txt file.

● The brute-force way to get the information required.
● Absolutely Legal
● Not always that easy.

Getting started with Web Scraping in Python

In this document

More Related Content

What's hot

Similar to Getting started with Web Scraping in Python

Recently uploaded

Getting started with Web Scraping in Python