Scraping
without
programming
© Samantha Sunne
What's scraping?
How do journalists get data?
...from humans? ...from computers?
ay :) ay :(
Easy w Hard w
Ask
FOIA Download Scrape
nicely
ay :) ay :(
Easy w Hard w
Scraping is...
catching, collecting or coaxing data off the web.
Web scrapers are often called:
EXTRACTORS CRAWLERS
if they pull data from or if they pull data from
a single webpage multiple webpages
You'll need a basic
understanding of HTML
HTML is meant for computers, but some of it is understandable to humans.
Sometimes the code itself is interesting.
Jeb Bush’s
campaign site
Summary of the
movie “Die Hard”
Browser HTML
You can see it by right-clicking and clicking View Source.
This is called looking “under the hood."
Technique 1:
import HTML with Google Sheets
HTML content is wrapped in tags that look like this:
<>content content</>
There are a lot of tags, and a lot of them have different abbreviations, so you'll need to refer to an
HTML dictionary.
Some of them are intuitive, like tables:
<table>
Here is my table, everything between these table tags.
</table>
Webpages are actually made up of tons of "elements" - images, text chunks, headers, you name it.
You can grab any of them!
Video from Firefox
Elements can be inside each other. But any word you see immediately after a "<" is the name of the
element.
<table>
<tr>
A "table Inside a "table"
<td>Hi! I'm text inside a table.</td> Inside a "table
data" element
row" element
</tr> element
</table>
What we have here is an element inside an element inside an element.
But no worries, you can grab any of them! The larger elements will just include the smaller ones.
Version for humans Version for computers
<table>
<tr>
Hi! I'm text inside a table. <td>Hi! I'm text inside a table.
</td>
</tr>
</table>
How to scrape an HTML table with Google Sheets
You can use this Google Sheets formula to scrape a table or a list:
=ImportHTML(“url”, “element”, number of element)
The "url" is the link you are scraping.
The "element" is the HTML tag.
The number is the order the element is in.
tutorial
ed? Try this
Still confus
Example 1
=importHTML("sample.com",”table”,0)
This would import the first table on sample.com
(If it’s the first element, it’s number 0. I know, weird, it’s a computer thing.)
You can use View Source to figure out what element and number you want.
tutorial
ed? Try this
Still confus
This scrapes
<table>'s
For example, the IRE
Job Center
I used this formula
to scrape the IRE
jobs board:
https://www.ire.org
/jobs/
It also scrapes lists
(<ol>'s and <ul>'s)
Again, refer to the HTML
dictionary if you need to know the
HTML abbreviation for lists
This time, I scraped a list.
Remember, this isn’t just
any old list, but the first
“list” HTML element on the
page.
If you aren't sure if there is
a List element, check the
source code. (Right click >
View Source > Ctrl+F for
"<li>")
The importHTML formula can
only scrape lists and tables.
Technique 2:
import XML with Google Sheets
Browser XML
HTML
XML is also source code, but it's
in more of a "tree" format.
You can see
it by right
clicking and
clicking
"Inspect."
With XML, you can find the "XPath"
An XPath is like an address, or a map, to a very specific spot in the webpage’s code.
It looks like this:
This webpage, in the "wikitable" table, in the first row, in the fourth column
//table[@class='wikitable']/tr[1]/td[4]
On this the with the class in the first the fourth
page column
table “wikitable” row
It's easiest to highlight the element you want and Right Click > Inspect on it.
How to scrape XML data with Google Sheets
You can scrape everything at a particular XPath using this Google Sheets formula:
=importxml(“url”,”XPath”)
For example,
=importxml(“sample.com”,”//table[2]//tr”)
returns all the table rows ("tr") from the second table ("table[2]") in sample.com
ed? Try this tutorial
Still confus
XPaths open up a whole new world
all bold text: //b
all large text (known as headers): //h1
all headers with the word 'Parish': //h1[contains(.,'parish')]
all headers with the class 'Parish_name': //h1[@class='parish_name']
me tutorial!
r an aweso
See here fo
Here, I scraped:
the first column (<td>)
from all the table rows (<tr>)
in the table (<table>)
on ire.org/jobs
But what if your
data isn't in a
<table>?
Like the Talking Biz News
job board
Here, the page source
told me that all the job
titles are headers.
So I only scraped <h2>'s
For instance, the
dates on the
Talking Biz News
job board are
paragraph (<p>)
elements with
the class
'post-date'
Technique 3:
import RSS feeds with Google Sheets
This job board is
available as an RSS
feed
The Arab Reporters for
Investigative Journalism
How to scrape an RSS feed with Google Sheets
Google Sheets has a third import function you can use for scraping:
=importFeed(“source”,”items”)
It scrapes data from an RSS feed,
which is useful if you want to scrape something at regular intervals.
But we’re not going to do that in this session.
If you’re interested, check out this tutorial.
Here, I filled the first
column of my Google
Sheet with the item
titles - which in this
case, are the job
positions
Technique 4:
Writing code!
What a simple scrape would look like in Ruby
require 'Nokogiri'
require 'open-uri'
require 'csv'
url = "https://ire.org/jobs/"
html = Nokogiri::HTML(open(url))
csv = CSV.open("ire_jobs.csv", "w",{:col_sep => ",", :quote_char => '\'', :force_quotes => true})
html.xpath('//table//tr').each do |row|
tarray = []
row.xpath('td').each do |cell|
tarray << cell.text
end
csv << tarray
t here...
end
is scrip
csv.close
Copy th
Technique 5:
Point-and-click apps
OutWit Hub OutWit Hub is a desktop app that can
identify each HTML element on a
webpage and scrape it.
OutWit Hub is a desktop
app that can identify each
HTML element on a
webpage and scrape it.
(The free version limits how
many elements you can scrape.)
Some other good tools
● DownThemAll (Firefox add-on)
○ Highlights a whole list of links (or images/media) and downloads them
○ Not as robust as OutWit Hub
● Zapier and IFTTT (websites)
○ These automation tools can help keep track of new data, such as new Instagram posts
or tweets
○ Try: save RSS feed content to a Google Sheet or get emailed tweets from a certain
area
● Web Scraper (Chrome extension)
○ Has you build a sitemap for a site to be crawled
I wouldn’t recommend...
● Import.io (desktop app)
○ $300 a month! (But it does have a seven-day free trial.)
● Scraper (Chrome extensions)
○ Extremely basic; can only handle the simplest of tables. You’d be better off with
Google Sheets
● Helium Scraper (desktop app)
○ Works like other scraping tools, but costs a lot of money. (Although it does have a free
trial.)
● InfoExtractor (website)
○ Doesn't work.
If you want to delve further...
How to Feel Like You’re Hacking
My tutorial on getting data and metadata off the web
Using the web inspector for complex scrapes
Eric Sagara’s tutorial on using Python or Ruby for more advanced scraping
For more tools
I reviewed OutWit Hub and
DownThemAll for my newsletter, Tools
for Reporters. You can find new, useful
tools there - scraping and otherwise -
every week.
For questions, contact me at
samanthasunne@gmail.com or
@samanthasunne. If I don’t know an
answer, I can point you to one.
xkcd