Pydata-Python tools for webscraping
Python tools for
webscraping
José Manuel Ortega
@jmortegac
SpeakerDeck space
https://speakerdeck.com/jmortega
Github repository
https://github.com/jmortega/pydata_webscraping
Agenda
Scraping techniques
Introduction to webscraping
Python tools for webscraping
Scrapy project
Scraping techniques
 Screen scraping
 Report mining
 Web scraping
 Spiders /Crawlers
Screen scraping
 Selenium
 Mechanize
 Robobrowser
Selenium
 Open Source framework for automating
browsers
 Python-Module
http://pypi.python.org/pypi/selenium
 pip install selenium
 Firefox-Driver
Selenium
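 find_element_by_* methods
by_link_text('text'): finds a link by its visible text
by_css_selector: takes a CSS selector, just like lxml's CSS support
by_tag_name: e.g. 'a'; find_element_ returns the first link, find_elements_ returns all links
by_xpath: takes an XPath expression
by_class_name: CSS related; matches elements of any type that share the given class
 Selenium 4 replaces these methods with find_element(By.<STRATEGY>, value)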
Selenium youtube
Selenium youtube search
Report mining
Miner
Webscraping
Python tools
 Requests
 Beautiful Soup 4
 Pyquery
 Webscraping
 Scrapy
Spiders /crawlers
 A Web crawler is an Internet bot that
systematically browses the World Wide Web,
typically for the purpose of Web indexing. A
Web crawler may also be called a Web
spider.
https://en.wikipedia.org/wiki/Web_crawler
Spiders /crawlers
scrapinghub.com
Requests http://docs.python-requests.org/en/latest
Requests
Web scraping with Python
1. Download the web page with requests
2. Parse the page with BeautifulSoup/lxml
3. Select elements with regular expressions, XPath or CSS selectors
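The three steps above can be sketched as follows. This is only a minimal sketch, assuming the `requests` and `beautifulsoup4` packages are installed. The function names `download` and `extract_links` and the sample HTML are illustrative and do not come from the talk.

```python
from bs4 import BeautifulSoup


def download(url):
    """Step 1: download the page and return its HTML as text."""
    import requests  # imported here so the parsing code below also works offline

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Steps 2 and 3: parse the HTML, then select elements with a CSS selector."""
    # 'html.parser' ships with Python; 'lxml' can be used instead if installed.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a[href]")]


# Hypothetical sample page, so the example runs without network access.
SAMPLE_HTML = """
<html><body>
  <a href="/schedule/">Schedule</a>
  <a href="http://scrapy.org">Scrapy</a>
  <a name="no-href">Anchor without a link</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

For a live page, `extract_links(download(url))` chains the three steps. Keeping the download separate from the parsing makes the selection logic easy to try on saved HTML.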
Xpath selectors
Expression            Meaning
name                  matches all nodes on the current level with the specified name
name[n]               matches the nth element (counting from 1) on the current level with the specified name
/                     selects from the root node
//                    selects matching nodes at any depth, not only direct children
*                     matches all nodes on the current level
. or ..               selects the current / parent node
@name                 the attribute with the specified name
[@key='value']        all elements with an attribute that matches the specified key/value pair
name[@key='value']    all elements with the specified name and an attribute that matches the specified key/value pair
[text()='value']      all elements with the specified text
name[text()='value']  all elements with the specified name and text
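A few of these expressions can be tried with the standard library alone. The sketch below uses `xml.etree.ElementTree`, which supports only a subset of XPath: paths must be relative (`.//td` rather than `//td`), and the text test is written `[.='value']` (Python 3.7+) instead of `[text()='value']`. The markup is a made-up example loosely modelled on a schedule table; lxml accepts the full syntax from the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not taken from a real page.
DOC = """
<table>
  <tr>
    <td class="slot slot-talk">Python tools for webscraping</td>
    <td class="slot slot-break">Coffee</td>
  </tr>
  <tr>
    <td class="slot slot-talk">Scrapy project</td>
  </tr>
</table>
"""

root = ET.fromstring(DOC)

# name: child elements of the current node with that name
rows = root.findall("tr")

# name[n]: the nth matching child (positions start at 1)
second_row = root.find("tr[2]")

# //name combined with [@key='value']: matching elements at any depth
talks = root.findall(".//td[@class='slot slot-talk']")

# *: all child elements on the current level
first_row_cells = root.findall("tr[1]/*")

# ElementTree spelling of name[text()='value']
coffee = root.findall(".//td[.='Coffee']")

if __name__ == "__main__":
    print(len(rows), [td.text for td in talks], coffee[0].get("class"))
```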
BeautifulSoup
 Parsers support lxml,html5lib
 Installation
 pip install lxml
 pip install html5lib
 pip install beautifulsoup4
 http://www.crummy.com/software/BeautifulSoup
BeautifulSoup
from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc, 'lxml')
 Print all: print(soup.prettify())
 Print text: print(soup.get_text())
BeautifulSoup functions
 find_all('a') returns all links
 find('title') returns the first <title> element
 get('href') returns the value of the href attribute
 (element).text returns the text inside an element
for link in soup.find_all('a'):
    print(link.get('href'))
External/internal links
http://pydata.org/madrid2016
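One possible way to separate internal from external links, once the `href` values have been collected (for instance with `soup.find_all('a')`), is to compare host names. This is a sketch under that assumption, not the code shown in the talk; the function name `split_links` and the sample links are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def split_links(base_url, hrefs):
    """Return (internal, external) lists of absolute http(s) URLs."""
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar schemes
        if parsed.netloc == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


if __name__ == "__main__":
    # Hypothetical href values for illustration.
    links = ["/madrid2016/schedule/", "http://scrapy.org", "mailto:info@pydata.org"]
    print(split_links("http://pydata.org/madrid2016", links))
```

Comparing `netloc` exactly treats subdomains as external; whether that is wanted depends on the site being scraped.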
Webscraping
pip install webscraping
from webscraping import download, xpath
# Download instance
D = download.Download()
# get page
html = D.get('http://pydata.org/madrid2016/schedule/')
# get the elements where the talk information is located
xpath.search(html, '//td[@class="slot slot-talk"]')
Pydata agenda code structure
Extract data from pydata agenda
PyQuery
Scrapy installation
pip install scrapy
Scrapy
Uses a selection mechanism based on XPath
expressions, called XPath Selectors.
Uses the lxml parser to find elements
Uses Twisted for asynchronous operations
Scrapy advantages
 Faster than mechanize because it
uses asynchronous operations (Twisted).
 Scrapy has better support for html
parsing.
 Scrapy has better support for unicode
characters, redirections, gzipped
responses, encodings.
 You can export the extracted data directly
to JSON, XML and CSV.
Architecture
Scrapy Shell
scrapy shell <url>
from scrapy.selector import Selector
hxs = Selector(response)
info = hxs.xpath('//div[@class="slot-inner"]')
Scrapy Shell
scrapy shell http://scrapy.org
Scrapy project
$ scrapy startproject <project_name>
scrapy.cfg: the project configuration file.
<project_name>/: the project’s Python module.
items.py: the project’s items file.
pipelines.py: the project’s pipelines file.
settings.py: the project’s settings file.
spiders/: the spiders directory.
Pydata conferences
Spider generating
$ scrapy genspider -t basic <SPIDER_NAME> <DOMAIN>
$ scrapy list
Spiders list
Pydata spider
Pipelines
 ITEM_PIPELINES = {
    'pydataSchedule.pipelines.PyDataSQLitePipeline': 100,
    'pydataSchedule.pipelines.PyDataJSONPipeline': 200,
}
 pipelines.py
Pydata SQLitePipeline
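The pipeline code from the slide is not reproduced here, so the following is only a sketch of what a SQLite pipeline might look like. The class name comes from the `ITEM_PIPELINES` setting above; the database file name, the `talks` table and the `title` / `speaker` fields are assumptions made for illustration. A Scrapy item pipeline is a plain class, and Scrapy calls `open_spider`, `process_item` and `close_spider` on it.

```python
import sqlite3


class PyDataSQLitePipeline:
    """Stores each scraped item as a row in a SQLite table."""

    def __init__(self, db_path="pydata.db"):
        self.db_path = db_path  # hypothetical default file name
        self.connection = None

    def open_spider(self, spider):
        # Called by Scrapy once, when the spider starts.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS talks (title TEXT, speaker TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item. Returning the item passes it on to the
        # next pipeline in ITEM_PIPELINES (here, the JSON one).
        self.connection.execute(
            "INSERT INTO talks (title, speaker) VALUES (?, ?)",
            (item.get("title"), item.get("speaker")),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy once, when the spider finishes.
        self.connection.close()
```

The numbers in `ITEM_PIPELINES` set the order: lower values run first, so with 100 and 200 the SQLite pipeline sees each item before the JSON pipeline does.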
Execution
$ scrapy crawl <spider_name>
$ scrapy crawl <spider_name> -o items.json -t json
$ scrapy crawl <spider_name> -o items.csv -t csv
$ scrapy crawl <spider_name> -o items.xml -t xml
Pydata conferences
Launch spiders without the scrapy command
Scrapy Cloud
http://doc.scrapinghub.com/scrapy-cloud.html
https://dash.scrapinghub.com
>>pip install shub
>>shub login
>>Insert your ScrapingHub API Key:
Scrapy Cloud /scrapy.cfg
# Project: demo
[deploy]
url =https://dash.scrapinghub.com/api/scrapyd/
#API_KEY
username = ec6334d7375845fdb876c1d10b2b1622
password =
#project identifier
project = 25767
Scrapy Cloud
$ shub deploy
Scrapy Cloud Scheduling
curl -u APIKEY: https://dash.scrapinghub.com/api/schedule.json \
  -d project=PROJECT -d spider=SPIDER
References
 http://www.crummy.com/software/BeautifulSoup
 http://scrapy.org
 https://pypi.python.org/pypi/mechanize
 http://docs.webscraping.com
 http://docs.python-requests.org/en/latest
 http://selenium-python.readthedocs.org/index.html
 https://github.com/REMitchell/python-scraping
Books
Thank you!
