Distil Networks Ebook Web Scraping
Web scraping involves using automated software (bots) to extract data from websites. This can have both positive and negative consequences. Positively, it allows content aggregation sites to operate and businesses to monitor competitors. However, it can also undermine businesses through competitive intelligence gathering, search engine optimization penalties if content is copied, and in rare cases even business destruction if a company's inventory or data is stolen. Motives for web scraping range from desirable aggregation to questionable competitive monitoring to outright damaging practices like digital directory abuse or business attacks.


Table of Contents

Introduction - the wide open web
What exactly is web scraping?
Potential damage of web scraping
What are the motives for web scraping?
Is web scraping legal?
How is web scraping perpetrated?
Mitigating web scrapers
Turning the tables on web scrapers
Conclusion
About Quocirca
References
Need Help?
Introduction - the wide open web

The world wide web is estimated to hold almost 50 billion indexed pages1. These are pages that are accessible via search engines and therefore open to all. Needless to say, the indexing needs to be automated, a continuous process that ensures the most recent content can be found. The indexing is carried out by web crawlers that continuously check for page changes and additions. Web crawlers are one of the earliest examples of software or web robots (usually referred to simply as bots).

[Infographic: 46% of web activity is bots]
If all the information contained on those web-sites can be found by web crawlers, it can be accessed by anyone else. This means businesses can keep an eye on competitors as never before; in fact, whole new businesses, such as price comparison sites, now exist that rely on nothing more than aggregating information from multiple other sites that deal with a particular commodity. The process of doing this also relies on bots; these are called web scrapers.

[Figure: search engine crawlers versus web scraping]

It is estimated that 46% of all web activity is now down to bots2. Some, such as web crawlers, are generally considered good bots. Others, for example those used to mount denial of service attacks, are definitely bad bots. Web scrapers are grey; in some cases a site owner will want its content lifted and aggregated elsewhere; in other cases, doing so is tantamount to theft. This e-book looks at the issues around web scrapers and what can be done to control their activity.
What exactly is web scraping?

The concepts behind web scraping pre-date the web itself. With the advent of client-server computing in the early 1990s, the use of graphical user interfaces (GUIs) became commonplace (chiefly through Microsoft Windows). However, many organisations had legacy applications, often written in Cobol and running on mainframe computers, whose output was designed for visual display units (VDUs), sometimes referred to as dumb terminals. There was a need to re-present the VDU output for the new GUIs. This led to the birth of screen-scraping; the idea was later adapted to serve up the content of legacy applications for web browsers.

It was also realised that the concept could be reversed. Web pages themselves have similarities to VDU screens: regardless of how either is generated, the user sees an area of screen real estate populated by data fields with uninteresting space in between. Web scraping extracts data in a way it can be understood and re-used.

"WEB SCRAPING EXTRACTS DATA IN A WAY IT CAN BE UNDERSTOOD AND RE-USED."
In many cases the target web pages will need input to stimulate the filling of fields, for example the desired destination on a travel site. So, as is the case with many bots, web scrapers need to mimic human activity, which can make them hard to differentiate from real users.

"WEB SCRAPERS ARE ONE OF TWENTY TYPES OF AUTOMATED THREATS (BAD BOTS) THAT ARE DESCRIBED IN A NEW TAXONOMY FROM OWASP"

Web scrapers are one of twenty types of automated threats (bad bots) that are described in a new taxonomy from OWASP3 (the Open Web Application Security Project). Each is given an OAT (OWASP Automated Threat) code. Web scraping is OAT-011: "Collect application content and/or other data for use elsewhere."
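To illustrate how a scraper supplies the input a page needs and mimics a human user, here is a minimal Python sketch against a hypothetical travel-search endpoint. The URL, parameter names and CSS classes are assumptions for illustration only, not a real site's API.

```python
# A minimal sketch of a scraper supplying the input a page needs before it
# will render data. The endpoint, form fields and markup are hypothetical.
import requests
from bs4 import BeautifulSoup

def search_fares(destination: str) -> list[str]:
    # Mimic the form a human would submit, including a browser-like
    # User-Agent, which is what makes such bots hard to tell from real users.
    response = requests.get(
        "https://travel.example.com/search",        # hypothetical endpoint
        params={"dest": destination, "adults": 1},  # assumed form fields
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assumed markup: each result row carries a class of "fare".
    return [row.get_text(strip=True) for row in soup.select(".fare")]

if __name__ == "__main__":
    print(search_fares("Barcelona"))
```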
Potential damage of web scraping

For many businesses, their website has become their primary shop window, where their offerings are displayed along with pricing information and current offers. This information is open to all: prospects and customers of course, but also competitors and free-loaders who deploy web scrapers to regularly harvest data from sites they are interested in.

Web scraping ranges from being an unwanted activity that can undermine or even destroy your business to a wanted one that may supplement it. However, even with the latter, there can be too much of a good thing; server and network infrastructure can become over-stressed, and costs can be incurred from random spikes in traffic due to aggressive web scraping. The trick is to be able to recognise web scrapers amongst all the other bots and block, limit or allow them according to policy.

There are other possible unwanted impacts. Search engine optimisation (SEO) can be negatively affected. For example, as content is copied around the internet by web scrapers, Google's search engine assumes that sites hosting the original content are trying to game its search algorithms, and so lowers the ranking of pages with the original and copied content. The entity that copied the information can sometimes even end up with a higher ranking than the originating site; that may even be the goal.
[Figure: OAT-011 web scraping and SEO. Step 1: Original Site; Step 2: Duplicate Content; Step 3: SEO Penalised; Step 4: Revenue Drops]
What are the motives for web scraping?

There is a wide range of motives for web scraping. For the target sites, these range from the desirable, through the questionable, to the damaging. Some of the primary ones are:

1. Content aggregation
Price comparison sites, such as Gocompare and Moneysupermarket, and service aggregators, such as the travel sites TripAdvisor and Expedia, rely on web scraping. The sites they target may value the exposure provided by some aggregators but are unable to control which are allowed and what data their web scrapers can access. Furthermore, unchecked aggregators can lead to unexpected costs; for example, the Canadian travel site Red Label Vacations had bots executing searches on its site, which triggered third-party API calls with associated fees4.

2. Competitive intelligence
Web scrapers make it possible for unscrupulous competitors to collect online pricing and product information in near real time, enabling price matching as soon as new offers and promotions are announced. Data aggregated in this way can be fed onto aggregation sites, ceding any advantage the targeted retailer hoped to achieve in the first place.

3. Multiple listing services (MLS)
MLSs are widely used in the property market, but also for the buying and selling of other assets. They are a treasure trove of leads for all involved in a given sales process: bankers, brokers, removal companies etc., who often deploy web scrapers to find new leads. In the US, MLS rules now require that site operators take steps to ensure their data is not harvested by web scrapers.
4. Digital directory abuse
Organisations like CrunchBase and Manta provide online business contact information. There would be no point in doing so if the aim was not for other businesses to find and make use of their data. However, it can also be harvested in bulk by unscrupulous organisations using web scrapers to drive spamming operations. Not only is the data being misused, but the target site's performance can be affected as web scrapers return time and again to search directories and build new lists.

5. Data mining/inventory theft
Imagine a business that has invested in compiling an extensive specialist inventory of goods in a certain category and put in place the processes to keep it up to date. An unscrupulous would-be competitor then sets up its own website, scrapes the first business's inventory each day and, with lower costs, undercuts the pricing. In some circumstances, the originating site could even be penalised for ostensibly leaking confidential information if it is provided under contract in the first place. For example, B&H Photo Video invested in building an inventory with 300,000 items and subsequently became a victim of data mining5.

6. Business destruction
Retailers have even been driven out of business. Diapers.com was once a successful online retailer of baby products, a market Amazon wanted to enter. Amazon ran a web scraper against Diapers.com to track its products and pricing, eventually mimicking the product line and applying real-time price updates. Within a year Amazon had launched Amazon Mom and shortly after acquired Diapers.com. This was far cheaper and easier than building a baby products business from scratch6.
Is web scraping legal?

Web scraping content from a competitor's website might be considered fair game, since the data is in the public domain. However, there have been legal challenges, and web scraping is currently a legal grey area. The history is well described by Distil Networks7.

In Europe, legal actions have successfully been used to prevent web scrapers from what amounts to invasions of privacy, but in the United States web scraping still appears to be considered an acceptable risk in the hypercompetitive world of online business. Ultimately, it is better to control web scraping in the first place than to mount expensive rearguard legal actions.
How is web scraping perpetrated?

The barrier to entry for web scraping is low and the tools are easily available online, for example from Import.io, webhose.io or Outwit. The tools typically have easy-to-use drag-and-drop interfaces which enable them to be programmed to fetch target data.

"THE BARRIER TO ENTRY FOR WEB SCRAPING IS LOW AND THE TOOLS ARE EASILY AVAILABLE ONLINE..."

Once programmed with the necessary parameters, the web scraper is launched at the target web-site to periodically collect data and re-present it in a desired format, for example a spreadsheet.

For more technical detail, see Distil Networks' Web Scraping: Everything You Wanted to Know (but were afraid to ask)7.
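As a rough illustration of what such tools automate, the following Python sketch performs the same collect-and-re-present cycle by hand: fetch a page, extract target fields, and write them to a spreadsheet-friendly CSV. The URL and CSS selectors are hypothetical placeholders, not a real site.

```python
# A minimal sketch of the scrape-and-re-present cycle described above.
# The target URL and the markup it assumes are illustrative only.
import csv
import requests
from bs4 import BeautifulSoup

TARGET_URL = "https://shop.example.com/catalogue"  # hypothetical target site

def scrape_to_csv(out_path: str = "prices.csv") -> None:
    html = requests.get(TARGET_URL, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "price"])
        # Assumed markup: one <div class="product"> per item, with child
        # elements holding the name and price.
        for item in soup.select("div.product"):
            name = item.select_one(".name").get_text(strip=True)
            price = item.select_one(".price").get_text(strip=True)
            writer.writerow([name, price])

if __name__ == "__main__":
    scrape_to_csv()  # run periodically (e.g. via cron) to keep data fresh
```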
Mitigating web scrapers

There are a number of measures that can be taken to control web scrapers, some more effective than others:

1. Robot exclusion standard
This is one thing that will not work, as it relies on etiquette. The robot exclusion standard/protocol, or simply robots.txt, is used by websites to communicate with web crawlers and good bots, providing information about which areas of the website should not be processed or scanned. However, web scrapers and other bad bots need not cooperate.
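The voluntary nature of the standard is easy to see in code. In the Python sketch below (the site URL is a placeholder), honouring robots.txt amounts to a check the client chooses to make before fetching; a web scraper simply never makes it.

```python
# A short sketch of why robots.txt is etiquette rather than enforcement:
# compliance is a voluntary check made by the client. The URL is a placeholder.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A well-behaved crawler asks permission for each path...
if rp.can_fetch("GoodBot/1.0", "https://www.example.com/private/listings"):
    print("allowed - fetch the page")
else:
    print("disallowed - a polite bot moves on")

# ...whereas a bad bot never calls can_fetch() at all and fetches regardless.
```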
2. Manual
Dealing with web scrapers manually involves many man-hours spent on a game of whack-a-mole: trawling through logs, identifying tell-tale behaviour, blocking IP addresses and rewriting firewall rules (see next point). All this effort may work for a short while before the web scrapers are back, with new IP addresses or hiding behind new proxies. Blocking IP addresses can also affect legitimate users coming via the same service provider as the web scrapers.

3. Web application firewalls (WAF)
WAFs are designed to protect web applications from being exploited due to the presence of common software vulnerabilities. Web scrapers are not targeting vulnerabilities but aiming to mimic real users. Therefore, other than being programmed to block manually identified IP addresses (see last point), WAFs are of little use for controlling web scraping.
4. Login enforcement
Some web-sites require login to access the most valued data; however, this is no protection from web scrapers, as it is easy for the perpetrators to create their own accounts and program their web scrapers accordingly. Strong authentication or CAPTCHAs (see next point) can be deployed, but these introduce more inconvenience for legitimate users, whose initial casual interest may be dispelled by the commitment of account creation.

5. Are you a human?
One obvious way to vet web scraping is to ask users to prove they are human. This is the aim of CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). They annoy some users who find them hard to interpret and, needless to say, workarounds have been developed. One of the bad-bot activities described by OWASP is CAPTCHA Bypass (OAT-009)3. There are also CAPTCHA farms, where the test posed by the CAPTCHA is outsourced to teams of low-cost humans via sites on the dark web.

6. Geo-fencing
Geo-fencing means web-sites are only exposed within the geographic locations in which they conduct business. This will not stop web scraping per se, but will mean the perpetrators have to go to the extra effort of seeming to run their web scrapers within a specific geographic location. This may simply involve using a VPN link to a local point of presence (PoP).

7. Flow enforcement
Enforcing the route legitimate users take through a web-site can ensure they are validated at each step of the way. Web scrapers are often hardwired to go after high-value targets and encounter problems if forced to follow a typical user's predetermined flow.

8. Direct bot detection and mitigation
The aim here is the direct detection of scrapers through a range of techniques, including behavioural analysis and digital fingerprinting, using specific bot detection and control technology designed for the task. Across multiple customers, suppliers of such technologies are able to improve their understanding of web scrapers and other bots through machine learning, to the benefit of all. Once web scrapers are identified, it can be decided, based on provenance and other factors, whether their activity should be allowed, controlled or blocked, and appropriate action taken.
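As a simplified illustration of one behavioural signal such technology builds on, the Python sketch below flags clients whose request rate exceeds a human-plausible threshold. Real products combine many signals (device fingerprints, interaction telemetry, machine learning models); the window and threshold here are arbitrary assumptions.

```python
# A toy illustration of request-rate analysis, one behavioural signal among
# many used by bot detection tools. Thresholds are arbitrary assumptions.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 20  # humans rarely exceed this; scrapers often do

_history: dict[str, deque] = defaultdict(deque)

def looks_like_a_bot(client_id: str, now: float | None = None) -> bool:
    """Record one request and flag clients whose rate exceeds the threshold."""
    now = time.monotonic() if now is None else now
    requests = _history[client_id]
    requests.append(now)
    # Drop requests that have fallen out of the sliding window.
    while requests and now - requests[0] > WINDOW_SECONDS:
        requests.popleft()
    return len(requests) > MAX_REQUESTS_PER_WINDOW

# Example: a client making 30 requests within one second gets flagged.
for i in range(30):
    flagged = looks_like_a_bot("203.0.113.7", now=float(i) / 30)
print(flagged)  # True -> policy decides whether to allow, limit or block
```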
Distil Networks is a provider of bot detection and mitigation tools; more information can be found in its product overview8. It also has customer stories that underline the problems around web scraping and how purpose-built tools help to solve them. These include Funda, a Dutch property sales site, which integrated Distil with AWS and F5 to prevent web scraping of leads9; Easyjet, which put in place a 24-hour automated system for detecting and acting on web scraping activity; and Move, which spent over a year building its own web scraping prevention tools before realising Distil could adapt more rapidly to the changing bot-scape.
Turning the tables on web scrapers

With direct bot mitigation in place it is even possible to turn the tables on unwanted web scrapers and get them to work to your advantage. For example, if you are about to change prices, temporarily turn off bot protection and allow web scrapers access to price-match the old prices, then turn protection back on before making the changes. Web scrapers can also be fed false information by directing bots to bogus web pages not seen by customers.

"WITH DIRECT BOT MITIGATION IN PLACE IT IS EVEN POSSIBLE TO TURN THE TABLES ON UNWANTED WEB SCRAPERS AND GET THEM TO WORK TO YOUR ADVANTAGE."
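As a rough sketch of how serving decoy data might look, the hypothetical Flask endpoint below returns bogus prices to suspected bots while customers see the real ones. The User-Agent heuristic merely stands in for a proper bot detection layer, and all routes, names and figures are illustrative assumptions.

```python
# A minimal sketch of feeding suspected scrapers decoy data while humans
# see the real page. Routes and prices are hypothetical placeholders.
from flask import Flask, request, jsonify

app = Flask(__name__)

REAL_PRICES = {"widget": 19.99}
DECOY_PRICES = {"widget": 24.99}  # stale or bogus figures for scrapers

def is_suspected_bot() -> bool:
    # Placeholder heuristic; in practice this decision would come from a
    # dedicated bot detection layer, not a User-Agent string.
    ua = request.headers.get("User-Agent", "").lower()
    return "bot" in ua or "python-requests" in ua

@app.route("/prices")
def prices():
    # Bots are quietly served the decoy catalogue; customers never see it.
    return jsonify(DECOY_PRICES if is_suspected_bot() else REAL_PRICES)

if __name__ == "__main__":
    app.run()
```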
Conclusion

Unlike some bots that are always bad and others that are primarily good, web scrapers are grey. In some cases, their activity is wanted; in other cases it can be damaging. This makes the policy-based limits that are possible with automated bot detection tools invaluable for putting your organisation back in control of its own online resources and co-operating with, blocking or misleading those who value your data as much as you do.

About Quocirca

Quocirca is a research and analysis company with a primary focus on the European market. Quocirca produces free-to-market content aimed at IT decision makers and those that influence them in businesses of all sizes and public sector organisations.
References

1. http://www.worldwidewebsize.com/
2. Quantifying the Risk and Economic Impact of Bad Bots, Aberdeen Group, April 2016. http://resources.distilnetworks.com/h/i/261110682-aberdeen-risk-and-economic-impact-of-bad-bots
3. OWASP Automated Threat List. https://www.owasp.org/images/3/33/Automated-threat-handbook.pdf
4. Distil Networks, Red Label Vacations case study. http://resources.distilnetworks.com/h/i/53821906-how-online-travel-site-red-label-vacations-stopped-web-scraping-bots-in-their-tracks/181642
5. Optimizing Web Application Security for the New Bad Bot Threat Landscape. https://resources.distilnetworks.com/white-papers-and-data-sheets/optimizing-web-application-security-to-fight-bad-bots
6. The Time Jeff Bezos Went Thermonuclear on Diapers.com, Slate.com, Oct 2013. http://www.slate.com/blogs/future_tense/2013/10/10/amazon_book_how_jeff_bezos_went_thermonuclear_on_diapers_com.html
7. Web Scraping: Everything You Wanted to Know (but were afraid to ask). http://resources.distilnetworks.com/h/i/111901208-scraping-everything-you-wanted-to-know-but-were-afraid-to-ask/181642
8. Distil Networks product overview. http://resources.distilnetworks.com/h/i/71010768-distil-product-overview-data-sheet
Need Help?
For help mitigating account takeovers and other automated threats, please visit
www.distilnetworks.com, call 1-415-423-0831 or email sales@distilnetworks.com
to speak with one of our security experts.

Thanks to Distil Networks for underwriting the Quocirca Ultimate Guide to Preventing Cyber Attacks Series.

