
Data Aggregation by Web Scraping Using Python

Web scraping is a technique to extract large amounts of data from websites in an automated fashion. It involves inspecting web pages to find relevant data and structures, parsing the content, and saving it in a local database or spreadsheet. Some common uses of web scraping include data science tasks like machine learning, competitor monitoring, social media analysis, and more.

Uploaded by

saniyasalwa965

A

Mini Project Report


on
DATA AGGREGATION BY WEB SCRAPING
Submitted to
Jawaharlal Nehru Technological University, Hyderabad
in partial fulfillment of the requirements for the award of Degree of
Bachelor of Technology
in
Computer Science & Engineering
by

S. SRAVYA (206Y1A0587)
SANIYA SALWA (206Y1A0589)
T. SRIVANI (206Y1A0597)
A. PRATHIMA (206Y1A05A6)

Under the guidance


of
Mr. M. MRUTHYUNJAYA
Asst. Professor

Department of Computer Science & Engineering

2023-2024
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the project entitled “DATA AGGREGATION BY WEB SCRAPING” is
submitted by S. Sravya (206Y1A0587), Saniya Salwa (206Y1A0589), T. Srivani (206Y1A0597)
and A. Prathima (206Y1A05A6) in partial fulfillment of the requirements for the award of
the degree of Bachelor of Technology in Computer Science and Engineering during the
academic year 2023-24.

Mr. M. MRUTHYUNJAYA Dr. E. SUDARSHAN


Project Guide Head of the Department

External Examiner
ACKNOWLEDGEMENT
We wish to take this opportunity to express our deep gratitude to all the people who have
extended their cooperation in various ways during our project work. It is our pleasure to
acknowledge the help of all those individuals.
We would like to thank our project guide Mr. M. Mruthyunjaya, Asst. Prof., Computer
Science and Engineering Department for his guidance and help throughout the development
of this project work by providing us with the required information. With his guidance,
cooperation and encouragement we learnt many new things during our project tenure.
We would like to thank our project coordinator Mr. V. SRINIVAS, Asst. Prof. Computer
Science and Engineering Department for his continuous coordination throughout the project
tenure.
We specially thank Dr. E. SUDARSHAN, Professor and Head of The Department,
Computer Science and Engineering Department for his continuous encouragement and
valuable guidance in bringing shape to this dissertation.
We specially thank Dr. I. RAJASRI REDDY, Principal, Sumathi Reddy Institute of
Technology for Women for his encouragement and support.
In completing this project successfully, all our faculty members have given us excellent
cooperation by guiding us in every aspect. We also thank our lab faculty and librarians.

S. SRAVYA (206Y1A0587)
SANIYA SALWA (206Y1A0589)
T. SRIVANI (206Y1A0597)
A. PRATHIMA (206Y1A05A6)
ABSTRACT
Web scraping automates the process of extracting and saving large amounts of data from
different websites with ease and in a small amount of time. Web scraping is a technique to
fetch data from websites. Web scraping collects and categorizes all the required data in one
accessible location. Most of this data is unstructured data in an HTML format which is then
converted into structured data in a spreadsheet or a database so that it can be used in various
applications. Web scraping has many uses at both the professional and personal level: it
can be used for brand monitoring and competition analysis, machine learning, financial
data analysis, social media analysis, SEO monitoring, etc.
CONTENTS
S.NO Topics
1. INTRODUCTION
1.1 What is Web scraping?
1.2 Who is using web scraping?
1.3 Why Web Scraping for data science?
1.4 Why Python for Web scraping?
1.5 Different types of Web scraping
2. LITERATURE SURVEY
3. SYSTEM ANALYSIS
3.1 Existing system
3.2 Problem statement
3.3 Proposed system
3.4 Mathematical model
4. METHODOLOGY
4.1 Inspect your data source
4.2 Scrape HTML content from page
4.3 Parse HTML code with Beautiful Soup
5. DESIGN
5.1 System Requirements Specification (SRS)
5.2 UML diagrams
5.3 System Study
6. TECHNOLOGIES LEARNT
6.1 Python web scraping tools
6.2 Installation of Python packages
6.3 HTTP headers
7. TESTING
8. RESULTS
9. CONCLUSION AND FUTURE SCOPE
10. BIBLIOGRAPHY
LIST OF FIGURES

Sl. No. Figure Name

1.1.1 What is Web Scraping?
1.3.1 How does Web Scraping work?
4.1.1 Explore the website
4.1.2 The HTML on the right represents the structure of the page you can see on the left
5.2.1 Class diagram
5.2.2 Use case diagram
5.2.3 Activity diagram
9 Dependency of Web scraping on different fields
CHAPTER 1
INTRODUCTION
Suppose you need some information from a website, say a paragraph on Donald Trump. What
do you do? You can simply copy and paste the information from Wikipedia into your own
file. But what if you need to get a large amount of information from a website as quickly
as possible, for example, a large amount of data to train a machine learning algorithm?
In such a situation, copying and pasting will not work. That is when you need to use
web scraping.
1.1 WHAT IS WEB SCRAPING

Fig. 1.1.1: What is Web Scraping?


Web scraping, also called web harvesting or web data extraction, is data scraping used
for extracting data from websites. Web scraping software may access the World Wide Web
directly using the Hypertext Transfer Protocol or through a web browser. While web
scraping can be done manually by a software user, the term usually refers to automated
processes implemented using a bot or web crawler. It is a form of copying in which
specific data is gathered and copied from the web, typically into a central local
database or spreadsheet, for later retrieval or analysis.
Scraping a web page involves fetching it and extracting data from it. Fetching is the
downloading of a page (which a browser does when a user views a page). Web crawling is
therefore a main component of web scraping, used to fetch pages for later processing.
Once a page has been fetched, extraction can take place. The content of a page may be
parsed, searched, and reformatted, and its data copied into a spreadsheet or loaded into
a database. Web scrapers typically take something out of a page in order to use it for
another purpose somewhere else. An example is finding and copying names and telephone
numbers, or companies and their URLs, or e-mail addresses to a list (contact scraping).

1.2 WHO IS USING WEB SCRAPING
There are many practical applications of accessing and gathering data on the web, many
of which fall in the realm of data science. The following list outlines some interesting
real-life use cases:
• Many of Google's products have benefited from Google's core business of crawling the
web. Google Translate, for example, uses text stored on the web to train and improve
itself.
• Scraping is being applied a lot in HR and employee analytics. The San Francisco-based
startup hiQ specializes in selling employee analyses by collecting and examining
public profile data, for example from LinkedIn (which was unhappy about this but has
so far been unable to prevent the practice following a court case; see
https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss).
• Digital marketers and digital artists often use data from the web for all sorts of
interesting and creative projects. "We Feel Fine" by Jonathan Harris and Sep Kamvar,
for example, scraped various blog sites for phrases starting with "I feel," the
results of which could then visualize how the world was feeling throughout the day.
• In another study, messages scraped from Twitter, blogs, and other social media were
used to construct a data set for building a predictive model to identify patterns of
depression and suicidal thoughts. This might be an invaluable tool for aid providers,
though of course it warrants a thorough consideration of privacy-related issues as
well (see
https://www.sas.com/en_ca/insights/articles/analytics/using-big-data-to-predict-suicide-risk-canada.html).
• Emmanuel Sales also scraped Twitter, here with the goal of making sense of his own
social circle and timeline of posts (see https://emsal.me/blog/4). An interesting
observation is that the author first considered using Twitter's API, but found that
Twitter heavily rate limits this: a user's follow list can only be requested a
limited number of times in a given period, which is quite unwieldy to work with.
• In a paper titled "The Billion Prices Project: Using Online Prices for Measurement and
Research" (see http://www.nber.org/papers/w22111), web scraping was used to collect a
data set of online price information that was used to construct a robust daily price
index for multiple countries.

• Sociopolitical scientists are scraping social websites to track population sentiment
and political orientation. A famous article called "Dissecting Trump's Most Rabid
Online Following" (see
https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/)
analyzes user discussions on Reddit using semantic analysis to characterize the online
followers and fans of Donald Trump.
• One researcher was able to train a deep learning model based on scraped images from
Tinder and Instagram together with their "likes" to predict whether an image would be
deemed "attractive" (see http://karpathy.github.io/2015/10/25/selfie/). Smartphone
makers are already incorporating such models in their photo apps to help you improve
your photos.
• In "The Girl with the Brick Earring," Lucas Woltmann sets out to scrape Lego brick
information from https://www.bricklink.com to determine the best selection of Lego
pieces (see
http://lucaswoltmann.de/art'n'images/2017/04/08/the-girl-with-the-brick-earring.html)
to represent an image (one of the co-authors of this book is an avid Lego fan, so we
had to include this example).
• In "Analyzing 1000+ Greek Wines With Python," Florents Tselai scrapes information
about a thousand wine varieties from a Greek wine shop (see
https://tselai.com/greek-wines-analysis.html) to analyze their origin, rating, type,
and strength (one of the co-authors of this book is an avid wine fan, so we had to
include this example).
• Lyst, a London-based online fashion marketplace, scraped the web for semi-structured
information about fashion products and then applied machine learning to present this
information cleanly and elegantly to consumers from one central website. Other data
scientists have done similar projects to cluster similar fashion products (see
http://talks.lystit.com/dsl-scraping-presentation/).
• We've supervised a study where web scraping was used to extract information from job
sites, to get an idea of the popularity of different data science- and
analytics-related tools in the workplace (spoiler: Python and R were both rising
steadily).
• Another study from our research group involved using web scraping to monitor news
outlets and web forums to track public sentiment.
Whatever your field of interest, there is almost always a use case to improve or enrich
your practice based on data. "Data is the new oil," so the common saying goes, and the
web has a lot of it.

1.3 WHY WEB SCRAPING FOR DATA SCIENCE
When surfing the web using a normal web browser, you've probably encountered many sites
where you considered gathering, storing, and analyzing the data presented on the site's
pages. Especially for data scientists, whose "raw material" is data, the web exposes a
lot of interesting opportunities:
• There might be an interesting table on a Wikipedia page (or pages) you want to
retrieve to perform some statistical analysis.
• Perhaps you want to get a list of reviews from a movie site to perform text
mining, create a recommendation engine, or build a predictive model to spot fake
reviews.
• You might wish to get a listing of properties on a real-estate site to build an
appealing geo-visualization.
• You'd like to gather additional features to enrich your data set based on
information found on the web, say, weather information to forecast, for example,
soft drink sales.
• You might be wondering about doing social network analytics using profile data
found on a web forum.
• It might be interesting to monitor a news site for trending new stories on a
particular topic of interest.
Web browsers are very good at showing images, displaying animations, and laying out
websites in a way that is visually appealing to humans, but they do not expose a simple
way to export their data, at least not in most cases. Instead of viewing the web page by
page through your browser's window, wouldn't it be nice to be able to automatically
gather a rich data set? This is exactly where web scraping enters the picture. If you
know your way around the web a bit, you'll probably be wondering: "Isn't this exactly
what Application Programming Interfaces (APIs) are for?" Indeed, many websites nowadays
provide an API that gives the outside world a way to access their data repository in a
structured way, meant to be consumed and accessed by computer programs, not humans
(although the programs are written by humans, of course). Twitter, Facebook, LinkedIn,
and Google, for example, all provide such APIs in order to search and post tweets, get a
list of your friends and their likes, see who you're connected with, and so on. So why,
then, would we still need web scraping? The point is that APIs are a great means of
accessing data sources, provided the website at hand offers one to begin with and the
API exposes the functionality you want. The general rule of thumb is to look for an API
first and use that if you can, before setting out to build a web scraper to gather the
data.
For instance, you can easily use Twitter's API to get a list of recent tweets, instead
of reinventing the wheel yourself. Nevertheless, there are still various reasons why web
scraping might be preferable to the use of an API:
• The website you want to extract data from does not provide an API.
• The API provided is not free (whereas the website is).
• The API provided is rate limited, meaning you can only access it a certain number
of times per second, per day, …
• The API does not expose all the data you wish to obtain (whereas the website does).
In all of these cases, web scraping can come in handy. The fact remains that if you can
view some data in your web browser, you will be able to access and retrieve it through a
program. If you can access it through a program, the data can be stored, cleaned, and
used in any way.

HOW WEB SCRAPERS WORK

Fig. 1.3.1: How does Web Scraping work

Web scrapers can extract all the data on particular sites or only the specific data that
a user wants. Ideally, you should specify the data you want so that the web scraper
extracts only that data quickly. For example, you might want to scrape an Amazon page
for the types of juicers available, but you might only want the data about the models of
the different juicers and not the customer reviews.

So when a web scraper needs to scrape a site, it is first given the URLs of the required
sites. It then loads all the HTML code for those sites, and a more advanced scraper
might even extract all the CSS and JavaScript elements as well. The scraper then obtains
the required data from this HTML code and outputs this data in the format specified by
the user. Mostly, this is an Excel spreadsheet or a CSV file, but the data can also be
saved in other formats, such as a JSON file.
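
The workflow described above (load the HTML, pick out only the required data, write it
out as CSV) can be sketched in a few lines of Python. This is only an illustrative
sketch under stated assumptions: the HTML snippet, the class names "product" and
"review", and the helper names ProductParser and scrape_to_csv are hypothetical and do
not come from any real site, and the standard-library html.parser module is used here
so that the example runs without installing anything.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML standing in for a page that has already been downloaded.
SAMPLE_HTML = """
<ul>
  <li class="product">Juicer A</li>
  <li class="review">Great value</li>
  <li class="product">Juicer B</li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the text of every <li class="product"> element, skipping reviews."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "product":
            self.in_product = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

    def handle_data(self, data):
        if self.in_product and data.strip():
            self.products.append(data.strip())


def scrape_to_csv(html):
    """Extract the product names from the HTML and return them as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["model"])          # header row
    for name in parser.products:
        writer.writerow([name])
    return buffer.getvalue()


print(scrape_to_csv(SAMPLE_HTML))
```

In a real scraper the HTML would come from an HTTP request rather than a string, and the
CSV text would be written to a file instead of being printed.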

1.4 WHY PYTHON FOR WEB SCRAPING


The popular programming language Python is a great tool for creating web scraping
software. Since websites are constantly being modified, web content changes over time.
For example, the website’s design may be modified or new page components may be
added. A web scraper is programmed according to the specific structure of a page. If the
structure of the page is changed, the scraper must be updated. This is particularly easy to
do with Python.
Python is also effective for text processing and web resource retrieval, both of which
form the technical foundations for web scraping. Furthermore, Python is an established
standard for data analysis and processing. In addition to its general suitability, Python has a
thriving programming ecosystem. This ecosystem includes libraries, open-source projects,
documentation, and language references as well as forum posts, bug reports, and blog
articles.
There are multiple sophisticated tools for performing web scraping with Python. Here we
will introduce you to three popular tools: Scrapy, Selenium, and BeautifulSoup. For some
hands-on experience, you can use our tutorial on web scraping with Python based on
BeautifulSoup. This will allow you to directly familiarize yourself with the scraping
process.
There are many reasons why developers choose Python for web scraping over any other
language:
Simple Syntax— Python is one of the simplest programming languages to understand.
Even beginners can understand and write scraping scripts due to the clear and easy-to-read
syntax.
Extreme Performance — Python provides many powerful libraries for web scraping,
such as Requests, Beautiful Soup, Scrapy, Selenium, etc. These libraries can be used for
making high-performance and robust scrapers.

Adaptability — Python provides a couple of great libraries that can be utilized for various
conditions. You can use Requests for making simple HTTP requests and, on the other end,
Selenium for scraping dynamically rendered content.

1.5 DIFFERENT TYPES OF WEBSCRAPERS


Web scrapers can be classified according to many different criteria, including
self-built or pre-built web scrapers, browser extension or software web scrapers, and
cloud or local web scrapers.
You can build your own web scraper, but that requires advanced knowledge of programming.
And if you want more features in your web scraper, you need even more knowledge. On the
other hand, pre-built web scrapers are previously created scrapers that you can download
and run easily. These also have more advanced options that you can customize.
Browser extension web scrapers are extensions that can be added to your browser.
These are easy to run as they are integrated with your browser, but at the same time
they are also limited because of this. Any advanced features that are outside the scope
of your browser are impossible to run on browser extension web scrapers.
Software web scrapers, however, do not have these limitations, as they can be
downloaded and installed on your computer. These are more complex than browser extension
web scrapers, but they also have advanced features that are not limited by the scope of
your browser.
Cloud web scrapers run on the cloud, which is an off-site server usually provided by
the company that you buy the scraper from. These allow your computer to focus on other
tasks, as its resources are not required to scrape data from websites. Local web
scrapers, on the other hand, run on your computer using local resources. So if the web
scraper requires more CPU or RAM, your computer can become slow and unable to perform
other tasks.

CHAPTER 2
LITERATURE SURVEY
To understand how far the data extraction process has evolved, one must understand the
methods involved in it. Web scraping has been around almost as long as the web. The
motivation behind commercial web scraping has always been to gain an easy commercial
advantage, and it includes things like undercutting a competitor's promotional pricing,
stealing leads, hijacking marketing campaigns, redirecting APIs, and the outright theft
of information. The first aggregators and comparison engines appeared hot on the heels
of the e-commerce boom and operated largely unchallenged until the legal challenges of
the mid-2000s. Early scraping tools were very basic, literally copying and pasting
anything visible from the site. Today, however, it is a completely different story: web
scraping is big business, with powerful tools and services to match.
Extraction and analysis of data are mostly used by digital publishers and directories,
travel, real estate, and e-commerce. Then again, analysis and computing go back a long
way, along with the advances in storage components and the development of relational
databases: data came to be seen and managed as information to be prepared for data
analysis. The key turning point was the arrival of the RDB (Relational Database) in the
1980s, which enabled users to write SQL to retrieve data from the database. For users,
the benefit of RDB and SQL is the ability to analyze their data on demand. It made the
process of getting data simple and spread database use. Data warehouse: the difference
from standard relational databases is that data warehouses are usually optimized for
response time to queries. The development of data mining was made possible thanks to
database and data warehouse technologies, which enable companies to store more data and
still analyze it in a reasonable manner. A general business trend emerged, where
companies started to "predict" customers' potential needs based on analysis of their
historical purchasing patterns.

CHAPTER 3
SYSTEM ANALYSIS
3.1 Existing System
The existing system is the manual web data extraction process, which has two major
problems. Firstly, its costs are hard to measure and control, and they can escalate very
quickly. Data collection costs increase as more data is collected from each website. To
conduct a manual extraction, businesses need to hire a large number of staff, which
increases the cost of labour significantly. Secondly, manual extraction is known to be
error prone. Further, if a business process is very complex, cleaning up the data can
become expensive and time consuming. The figure below illustrates the errors and data
cleanup problems associated with the manual method.

3.2 Problem Statement


The world of retail is changing rapidly. Many brick and mortar locations are closing and
being replaced by online stores, direct to consumer brands, and subscription services.
However, while the breadth of assortment is something that drives customers to a
website, many e-commerce platforms fail to sell through a high percentage of their
merchandise.

3.3 Proposed System


Web scraping (web harvesting or web data extraction) is a computer software technique
for extracting information from websites. Usually, such software programs simulate human
exploration of the World Wide Web by either implementing the low-level Hypertext
Transfer Protocol (HTTP) or embedding a fully fledged web browser, such as Internet
Explorer or Mozilla Firefox. Web scraping is closely related to web indexing, which
indexes information on the web using a bot or web crawler and is a universal technique
adopted by most search engines. In contrast, web scraping focuses more on the
transformation of unstructured data on the web, typically in HTML format, into
structured data that can be stored and analyzed in a central local database or
spreadsheet. The stress detection module scans the binary image from the extreme top
left to record the coordinates of the eyebrow. The offline displacement calculation
sub-module calculates the shift of the eyebrow using the obtained eyebrow coordinates,
followed by a variance calculation of the displacement. The classifier sub-module, which
is trained offline, is employed to determine the presence of emotion. The integrated
decision over individual frames eventually determines the level of stress involved. Web
scraping is a technique to extract structured data from websites. WSAPI is a platform
that enables an organization to extend its existing web-based system as a well-designed
set of services for creating new channels, developer integration, or partner
integration.

3.4 Mathematical Model


1. Logistic Regression: This is a classification algorithm used when the output is
binary, i.e., the result belongs to one class or the other (e.g., 0 or 1). Logistic
regression should only be used when the target variable is discrete. It is a powerful
machine learning algorithm that uses a sigmoid function. It is best suited to binary
classification problems, but it can also be applied to multiclass classification
problems using the "one vs all" method. The sigmoid function is:
S(z) = 1 / (1 + e^{-z})
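
As a small illustration of how the sigmoid function S(z) = 1 / (1 + e^{-z}) turns a real
number into a binary decision, consider the following sketch. The function names and
the 0.5 threshold are assumptions made for this example and are not taken from the
project's own code.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(z, threshold=0.5):
    """Return class 1 if the sigmoid output reaches the threshold, else class 0."""
    return 1 if sigmoid(z) >= threshold else 0


# Negative scores fall below 0.5, positive scores above it.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), predict_class(z))
```

In logistic regression, z is a weighted sum of the input features, and the sigmoid
output is interpreted as the probability of belonging to class 1.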

2. Linear Regression: Linear regression is one of the simplest and most popular machine
learning algorithms. It is a statistical method used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such as sales,
salary, age, product price, and so on. The linear regression algorithm shows a linear
relationship between a dependent variable (y) and one or more independent variables (x),
hence the name linear regression. Because linear regression shows a linear relationship,
it finds how the value of the dependent variable changes according to the value of the
independent variable. The linear regression model gives a sloped straight line
representing the relationship between the variables. Mathematically, a linear regression
can be represented as:
y = a0 + a1x + ε
where a0 is the intercept, a1 is the slope (regression coefficient), and ε is the random
error term.
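
To make the model y = a0 + a1x + ε concrete, the sketch below estimates a0 and a1 with
the ordinary least squares formulas for a single independent variable. The function
name fit_line and the sample data are hypothetical, chosen only for illustration; the
data is noise-free so that the fitted line is easy to check by hand.

```python
def fit_line(xs, ys):
    """Estimate intercept a0 and slope a1 of y = a0 + a1*x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical data generated from y = 1 + 2x with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

With real, noisy data the points do not lie exactly on the line, and ε accounts for the
difference between each observed y and the value predicted by a0 + a1x.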

CHAPTER 4
METHODOLOGY
Web scraping mainly follows four steps, and we use the Python programming language to
implement the scraping process. The four main steps are:
1) Inspect Your Data Source
2) Scrape HTML Content from a Page
3) Parse HTML Code with Beautiful Soup
4) Generate a CSV from the Data
Let's walk through the methodology using the example of a web scraper that fetches
Software Developer job listings from the Monster job aggregator site, which is a static
website. Your web scraper will parse the HTML to pick out the relevant pieces of
information and filter that content for specific words. You can scrape any site on the
Internet that you can look at, but the difficulty of doing so depends on the site.
4.1 INSPECT YOUR DATA SOURCE
The first step is to go to the site you want to scrape using your favorite browser.
You'll need to understand the site structure to extract the information you're
interested in.
Explore the website

Fig. 4.1.1: Explore the website

Click through the site and interact with it just like any normal user would. For
example, you could search for Software Developer jobs in Australia using the site's
native search interface.
You can see that there's a list of jobs returned on the left side, and there are more
detailed descriptions of the selected job on the right side. When you click on any of
the jobs on the left, the content on the right changes. You can also see that when you
interact with the website, the URL in your browser's address bar changes as well.

Decipher the Information in URLs


A lot of information can be encoded in a URL. Your web scraping journey will be much
easier if you first become familiar with how URLs work and what they're made of. Try to
pick apart the URL of the site you're currently on:
https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia
You can deconstruct the above URL into two main parts:
1. The base URL represents the path to the search functionality of the website. In the
example above, the base URL is https://www.monster.com/jobs/search/.
2. The query parameters represent additional values that can be declared on the page.
In the example above, the query parameters are
?q=Software-Developer&where=Australia.
Any job you search for on this website will use the same base URL. However, the query
parameters will change depending on what you're looking for. You can think of them as
query strings that get sent to the database to retrieve specific records.
Query parameters generally consist of three things:
1. Start: The beginning of the query parameters is denoted by a question mark (?).
2. Information: The pieces of information constituting one query parameter are
encoded in key-value pairs, where related keys and values are joined together by an
equals sign (key=value).
3. Separator: Every URL can have multiple query parameters, which are separated
from each other by an ampersand (&).
Equipped with this information, you can pick apart the URL's query parameters into two
key-value pairs:
1. q=Software-Developer selects the type of job you’re looking for.
2. where=Australia selects the location you’re looking for.
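
The same decomposition can be done programmatically. The following sketch uses Python's
standard urllib.parse module to split the example URL into its base URL and its query
parameters; the variable names are chosen only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia"

# urlsplit separates the URL into scheme, host, path, query string, and fragment.
parts = urlsplit(url)
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs turns "key=value&key=value" into a dict mapping keys to lists of values.
params = parse_qs(parts.query)

print(base_url)
print(params)
```

Each key maps to a list because a query string may repeat the same key several times.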

Try to change the search parameters and observe how that affects your URL. Feel free to
enter new values in the search bar at the top.

Change these values to observe the changes in the URL. Next, try to change the values
directly in your URL. See what happens when you paste the following URL into your
browser's address bar:
https://www.monster.com/jobs/search/?q=Programmer&where=New-York
You'll notice that changes in the search box of the site are directly reflected in the
URL's query parameters and vice versa. If you change either of them, you'll see
different results on the website. When you explore URLs, you can get information on how
to retrieve data from the website's server.
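
Because the search is driven entirely by the query parameters, a script can build such
URLs itself instead of typing them into the address bar. The sketch below does this with
the standard urllib.parse.urlencode function; the helper name build_search_url is
hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.monster.com/jobs/search/"


def build_search_url(query, location):
    """Join the base URL with the q and where parameters as a query string."""
    return BASE_URL + "?" + urlencode({"q": query, "where": location})


print(build_search_url("Programmer", "New-York"))
```

urlencode also escapes characters that are not allowed in a URL, so a value containing a
space, such as "Software Developer", is encoded as "Software+Developer".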

Inspect the Site Using Developer Tools


Next, you'll want to learn more about how the data is structured for display. You'll
need to understand the page structure to pick what you want from the HTML response that
you'll collect in one of the upcoming steps.
Developer tools can help you understand the structure of a website. All modern browsers
come with developer tools installed. In Chrome, you can open up the developer tools
through the menu View → Developer → Developer Tools. You can also access them by
right-clicking on the page and selecting the Inspect option, or by using a keyboard
shortcut.
Developer tools allow you to interactively explore the site’s DOM (Document Object
Model) to better understand the source that you’re working with. To dig into your page's
DOM, select the Elements tab in developer tools. You'll see a structure with clickable
HTML elements.

Fig.4.1.2: The HTML on the right represents the structure of the page you can see on the left

You can think of the text displayed in your browser as the HTML structure of that page.
When you right-click elements on the page, you can select Inspect to jump to their
location in the DOM. You can also hover over the HTML text on your right and see the
corresponding elements light up on the page. Play around and explore!
The better you get to know the page you're working with, the easier it will be to scrape
it. However, don't get too overwhelmed by all that HTML text. You'll use the power of
programming to step through this maze and pick out only the interesting parts with
Beautiful Soup.

4.2 SCRAPE HTML CONTENT FROM A PAGE


Now that you have an idea of what you're working with, it's time to start using Python.
First, you'll want to get the site's HTML code into your Python script so that you can
interact with it. For this task, you'll use Python's requests library.
Type the following in your terminal to install it:
$ pip3 install requests
Then open a new file in your favorite text editor; this project uses Google Colab. All you
need to retrieve the HTML are a few lines of code:
import requests
URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)
This code performs an HTTP request to the given URL. It retrieves the HTML data that
the server sends back and stores that data in a Python object.
If you take a look at the downloaded content, you'll notice that it looks very similar to
the HTML you were inspecting earlier with developer tools. To view it in your console
output, you can print the object's content attribute with print().
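In practice a request can fail or hang, so it may help to wrap the download in a small function. The following is a minimal sketch, not the project's actual code: the function name fetch_html, the 10-second timeout, and the optional session parameter are assumptions made for illustration.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Return the HTML text of url, raising an exception on HTTP errors."""
    if session is None:
        session = requests.Session()
    # A timeout keeps the script from hanging if the server never answers.
    response = session.get(url, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(URL) with the search URL from above would return the same HTML as page.text, while failing loudly if the server answers with an error status.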

Static Websites
The website you're scraping in this example serves static HTML content. In this
scenario, the server that hosts the site sends back HTML documents that already contain
all the data you'll get to see as a user.
When you inspected the page with developer tools earlier on, you discovered that a job
posting consists of the following long and messy-looking HTML:
<section class="card-content" data-jobid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c"
onclick="MKImpressionTrackingMouseDownHijack(this, event)">
<div class="flex-row">

<div class="mux-company-logo thumbnail"></div>

<div class="summary">

<header class="card-header">

<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW"
data-m_impr_j_cid="4" data-m_impr_j_coc="" data-m_impr_j_jawsid="371676273"
data-m_impr_j_jobid="0" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3"
data-m_impr_j_lat="30.1882" data-m_impr_j_lid="619" data-m_impr_j_long="-95.6732"
data-m_impr_j_occid="11838" data-m_impr_j_p="3"
data-m_impr_j_postingid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c"
data-m_impr_j_pvc="4496dab8-a60c-4f02-a2d1-6213320e7213" data-m_impr_s_t="t"
data-m_impr_uuid="0b620778-73c7-4550-9db5-df4efad23538"
href="https://job-openings.monster.com/python-developer-woodlands-wa-us-lancesoft-inc/4755ec59-d0db-4ce9-8385-b4df7c1e9f7c"
onclick="clickJobTitle('plid=619&amp;pcid=4&amp;poccid=11838','Software Developer','');
clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;Python Developer&quot;,&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSR2CW&quot;,&quot;eVar26&quot;:&quot;_LanceSoft Inc&quot;,&quot;eVar31&quot;:&quot;Woodlands_WA_&quot;,&quot;prop24&quot;:&quot;2019-07-02T12:00&quot;,&quot;eVar53&quot;:&quot;1500127001001&quot;,&quot;eVar50&quot;:&quot;Aggregated&quot;,&quot;eVar74&quot;:&quot;regular&quot;}')">Python Developer
</a></h2>
</header>
<div class="company">
<span class="name">LanceSoft Inc</span>
<ul class="list-inline">
</ul>

</div>
<div class="location">
<span class="name"> Woodlands, WA
</span>
</div>
</div>
<div class="meta flex-col">
<time datetime="2017-05-26T12:00">2 days ago</time>
<span class="mux-tooltip applied-only" data-mux="tooltip" title="Applied">
<i aria-hidden="true" class="icon icon-applied"></i>
<span class="sr-only">Applied</span>

It can be difficult to wrap your head around such a long block of HTML code. To make it
easier to read, you can use an HTML formatter to clean it up automatically. Good
readability helps you better understand the structure of any code block. While it may or
may not improve the formatting of the HTML, it's always worth a try. Keep in mind that
every website looks different. That's why it's necessary to inspect and understand the
structure of the site you're currently working with before moving forward.
The HTML above definitely has a few confusing parts in it. For example, the <a> element
carries a large number of attributes. Luckily, the class names on the elements that
you're interested in are relatively straightforward:
• class="title": the title of the job posting
• class="company": the company that offers the position
• class="location": the location where you’d be working.

If you ever get lost in a large pile of HTML, remember that you can always go back to
your browser and use developer tools to explore the HTML structure interactively.
By now, you've successfully harnessed the power and user-friendly design of Python's
requests library. With only a few lines of code, you managed to scrape the static HTML
content from the web and make it available for further processing. However, there are a
few more challenging situations you might encounter when you're scraping websites.
Before you begin using Beautiful Soup to pick the relevant information from the HTML
that you just scraped, take a quick look at two of these situations.

Hidden Websites
Some pages contain information that’s hidden behind a login. That means you’ll need an
account to be able to see (and scrape) anything from the page. The process to make an
HTTP request from your Python script is different than how you access a page from your
browser. That means that just because you can log in to the page through your browser,
that doesn’t mean you’ll be able to scrape it with your Python script.
However, there are some advanced techniques that you can use with requests to access the
content behind logins. These techniques allow you to log in to websites while making the
HTTP request from within your script.
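One common technique is to submit the login form through a requests.Session, which remembers the cookies set by the server for later requests. The sketch below is illustrative only: the function name login_and_fetch and the form field names "username" and "password" are assumptions, and real sites often require additional fields such as CSRF tokens.

```python
import requests

def login_and_fetch(login_url, page_url, username, password, session=None):
    """Log in with a form POST, then fetch a protected page with the same session."""
    if session is None:
        session = requests.Session()
    # The field names below are assumptions; check the real login form in
    # developer tools for the names the site expects.
    payload = {"username": username, "password": password}
    login_response = session.post(login_url, data=payload, timeout=10)
    login_response.raise_for_status()
    # The session keeps the cookies set during login and sends them again here.
    page_response = session.get(page_url, timeout=10)
    page_response.raise_for_status()
    return page_response.text
```

Whether this works depends on how the site implements its login, so it should be treated as a starting point rather than a general solution.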

Dynamic Websites
Static sites are easier to work with because the server sends you an HTML page that
already contains all the information as a response. You can parse an HTML response
with Beautiful Soup and begin to pick out the relevant data.
On the other hand, with a dynamic website the server might not send back any HTML at
all. Instead, you'll receive JavaScript code as a response. To offload work from the
server to the clients' machines, many modern websites avoid crunching numbers on their
servers whenever possible. Instead, they'll send JavaScript code that your browser will
execute locally to produce the desired HTML.
As mentioned before, what happens in the browser is not related to what happens in your
script. Your browser will diligently execute the JavaScript code it receives from a
server and create the DOM and HTML for you locally. However, making a request to a
dynamic website in your Python script will not provide you with the HTML page content.
When you use requests, you'll only receive what the server sends back. In the case of a
dynamic website, you'll end up with some JavaScript code, which you won't be able to
parse using Beautiful Soup. The only way to go from the JavaScript code to the content
you're interested in is to execute the code, just like your browser does. The requests
library can't do that for you, but there are other solutions that can.
For example, requests-html is a project created by the author of the requests library
that allows you to render JavaScript using syntax that's similar to the syntax in
requests. It also includes capabilities for parsing the data by using Beautiful Soup
under the hood.
Another popular choice for scraping dynamic content is Selenium. You can think of
Selenium as a slimmed-down browser that executes the JavaScript code for you before
passing the rendered HTML response on to your script. In this project, however, we
scrape only static websites using Beautiful Soup.
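A quick way to tell the two cases apart is to check whether the HTML returned by the server already contains the elements you want. The sketch below assumes Beautiful Soup is installed; the helper has_static_results and both HTML strings are made up for illustration, with "ResultsContainer" being the container id used in the job-listing example of this chapter.

```python
from bs4 import BeautifulSoup

def has_static_results(html, container_id):
    """Return True if the container exists and already holds child elements."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    # find(True) returns the first nested tag, or None if there is none.
    return container is not None and container.find(True) is not None

# Made-up responses: one pre-rendered, one filled in later by JavaScript.
static_html = '<div id="ResultsContainer"><section class="card-content"></section></div>'
dynamic_html = '<div id="ResultsContainer"></div><script>renderJobs()</script>'

print(has_static_results(static_html, "ResultsContainer"))
print(has_static_results(dynamic_html, "ResultsContainer"))
```

If the check fails for a page that clearly shows results in the browser, the content is probably rendered by JavaScript and a tool such as Selenium is needed.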

4.3 PARSE HTML CODE WITH BEAUTIFUL SOUP


You've successfully scraped some HTML from the Internet, but when you look at it now, it
just seems like a huge mess. There are tons of HTML elements here and there, thousands
of attributes scattered around, and wasn't there some JavaScript mixed in as well? It's
time to parse this lengthy code response with Beautiful Soup to make it more accessible
and pick out the data that you're interested in.
Beautiful Soup is a Python library for parsing structured data. It allows you to interact
with HTML in a similar way to how you would interact with a web page using developer
tools. Beautiful Soup exposes a couple of intuitive functions you can use to explore the
HTML you received. To get started, use your terminal to install the Beautiful Soup
library:

$ pip3 install beautifulsoup4

Then, import the library and create a Beautiful Soup object:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

When you add the bs4 import and the final line of code, you're creating a Beautiful Soup
object that takes the HTML content you scraped earlier as its input. When you
instantiate the object, you also instruct Beautiful Soup to use the appropriate parser.

FIND ELEMENTS BY ID
In an HTML web page, every element can have an id attribute assigned. As the name
already suggests, that id attribute makes the element uniquely identifiable on the page.
You can begin to parse your page by selecting a specific element by its ID.
Switch back to developer tools and identify the HTML object that contains all of the job
postings. Explore by hovering over parts of the page and using right-click to Inspect.
At the time of this writing, the element you're looking for is a <div> with an id
attribute that has the value "ResultsContainer". It has a couple of other attributes as
well, but below is the gist of what you're looking for:
<div id="ResultsContainer">
<!-- all the job listings -->
</div>

Beautiful Soup allows you to find that specific element easily by its ID:
results = soup.find(id='ResultsContainer')

For easier viewing, you can .prettify() any Beautiful Soup object when you print it out.
If you call this method on the results variable that you just assigned above, then you
should see all the HTML contained within the <div>:
print(results.prettify())
When you use the element's ID, you're able to pick one element out from among the rest
of the HTML. This allows you to work with only this specific part of the page's HTML. It
looks like the soup just got a little thinner! However, it's still quite dense.
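Since the container id may change whenever the website is redesigned, it can be worth checking the result of find() before using it. The sketch below runs on a tiny stand-in HTML string rather than the live page; the string and the id "NoSuchContainer" are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the downloaded page (an assumption for illustration).
html = """
<html><body>
  <div id="ResultsContainer"><section class="card-content">Job A</section></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

results = soup.find(id="ResultsContainer")
missing = soup.find(id="NoSuchContainer")

# find() returns None when nothing matches, so guard before using the result.
if results is None:
    raise SystemExit("Results container not found; the page layout may have changed.")

print(results.prettify())
print(missing)
```

Failing early with a clear message is usually easier to debug than an AttributeError raised several lines later.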

FIND ELEMENTS BY HTML CLASS NAME
You've seen that every job posting is wrapped in a <section> element with the class
card-content. Now you can work with your new Beautiful Soup object called results and
select only the job postings. These are, after all, the parts of the HTML that you're
interested in! You can do this in one line of code:
job_elems = results.find_all('section', class_='card-content')
Here, you call .find_all() on a Beautiful Soup object, which returns an iterable
containing all the HTML for all the job postings displayed on that page.
Take a look at all of them:
for job_elem in job_elems:
    print(job_elem, end='\n'*2)
That's already pretty neat, but there's still a lot of HTML! You've seen earlier that your
page has descriptive class names on some elements. Let's pick out only those:

for job_elem in job_elems:
    # Each job_elem is a Beautiful Soup Tag object.
    # You can use the same methods on it as you did before.
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem)
    print(company_elem)
    print(location_elem)
    print()

Great! You're getting closer and closer to the data you're actually interested in. Still,
there's a lot going on with all those HTML tags and attributes floating around:

<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW"
data-m_impr_j_cid="4" data-m_impr_j_coc="" data-m_impr_j_jawsid="371676273"
data-m_impr_j_jobid="0" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3"
data-m_impr_j_lat="30.1882" data-m_impr_j_lid="619" data-m_impr_j_long="-95.6732"
data-m_impr_j_occid="11838" data-m_impr_j_p="3"
data-m_impr_j_postingid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c"
data-m_impr_j_pvc="4496dab8-a60c-4f02-a2d1-6213320e7213" data-m_impr_s_t="t"
data-m_impr_uuid="0b620778-73c7-4550-9db5-df4efad23538"
href="https://job-openings.monster.com/python-developer-woodlands-wa-us-lancesoft-inc/4755ec59-d0db-4ce9-8385-b4df7c1e9f7c"
onclick="clickJobTitle('plid=619&amp;pcid=4&amp;poccid=11838','Software Developer','');
clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;Python Developer&quot;,&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSR2CW&quot;,&quot;eVar26&quot;:&quot;_LanceSoft Inc&quot;,&quot;eVar31&quot;:&quot;Woodlands_WA_&quot;,&quot;prop24&quot;:&quot;2019-07-02T12:00&quot;,&quot;eVar53&quot;:&quot;1500127001001&quot;,&quot;eVar50&quot;:&quot;Aggregated&quot;,&quot;eVar74&quot;:&quot;regular&quot;}')">Python Developer
</a></h2>

<div class="company">

<span class="name">LanceSoft Inc</span>

<ul class="list-inline">

</ul>

</div>

<div class="location">

<span class="name"> Woodlands, WA</span>

</div>

EXTRACT TEXT FROM HTML ELEMENTS


For now, you only want to see the title, company, and location of each job posting. And
behold! Beautiful Soup has got you covered. You can add .text to a Beautiful Soup object
to return only the text content of the HTML elements that the object contains:
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem.text)
    print(company_elem.text)
    print(location_elem.text)
    print()

Run the above code snippet and you'll see the text content displayed. However, you'll
also get a lot of whitespace. Since you're now working with Python strings, you can
.strip() the superfluous whitespace. You can also apply any other familiar Python string
methods to further clean up your text. The web is messy and you can't rely on a page
structure to be consistent throughout. Therefore, you'll often run into errors while
parsing HTML.
When you run the above code, you might encounter an AttributeError:
AttributeError: 'NoneType' object has no attribute 'text'
If that's the case, then take a step back and inspect your previous results. Were there
any items with a value of None? You might have noticed that the structure of the page is
not entirely uniform. There could be an advertisement in there that displays differently
from the normal job postings, which may return different results. For this example, you
can safely ignore the problematic element and skip it while parsing the HTML:
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    print()
Feel free to explore why one of the elements is returned as None. You can use the
conditional statement you wrote above to print() out and inspect the relevant element in
more detail. What do you think is going on there?
After you complete the above steps, try running your script again. The results finally
look much better:

Python Developer
LanceSoft Inc
Woodlands, WA

Senior Engagement Manager
Zuora
Sydney, NSW
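The skip-if-missing logic can also be packaged into a helper that returns clean tuples, which is convenient when the data is to be stored later. This is a sketch under stated assumptions: the helper name parse_job and the stand-in HTML (one regular posting and one ad-like card) are made up for illustration, and get_text(strip=True) is used in place of .text.strip().

```python
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption): one regular posting and one ad-like card
# that lacks the company and location elements.
html = """
<div id="ResultsContainer">
  <section class="card-content">
    <h2 class="title"><a href="https://example.com/job/1">Python Developer
    </a></h2>
    <div class="company"><span class="name">LanceSoft Inc</span></div>
    <div class="location"><span class="name"> Woodlands, WA
    </span></div>
  </section>
  <section class="card-content"><h2 class="title">Sponsored</h2></section>
</div>
"""

def parse_job(job_elem):
    """Return (title, company, location), or None if any part is missing."""
    elems = [
        job_elem.find("h2", class_="title"),
        job_elem.find("div", class_="company"),
        job_elem.find("div", class_="location"),
    ]
    if any(elem is None for elem in elems):
        return None
    # get_text(strip=True) drops the surrounding whitespace in one step.
    return tuple(elem.get_text(strip=True) for elem in elems)

soup = BeautifulSoup(html, "html.parser")
jobs = [parse_job(el) for el in soup.find_all("section", class_="card-content")]
jobs = [job for job in jobs if job is not None]
print(jobs)
```

Only the complete posting survives; the ad-like card is dropped instead of raising an AttributeError.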

FIND ELEMENTS BY CLASS NAME AND TEXT CONTENT
At this point, you've cleaned up the list of jobs that you saw on the website. While that's
pretty neat already, you can make your script more useful. However, not all of the job
listings seem to be developer jobs that you'd be interested in as a Python developer. So
instead of printing out all of the jobs from the page, you'll first filter them for some
keywords.
You know that job titles in the page are kept within <h2> elements. To filter only for
specific ones, you can use the string argument:
python_jobs = results.find_all('h2', string='Python Developer')

This code finds all <h2> elements where the contained string matches 'Python Developer'
exactly. Note that you're directly calling the method on your first results variable. If
you go ahead and print() the output of the above code snippet to your console, then you
might be disappointed because it will probably be empty:
>>> print(python_jobs)
[]
There was definitely a job with that title in the search results, so why is it not
showing up? When you use string= as you did above, your program looks for exactly that
string. Any differences in capitalization or whitespace will prevent the element from
matching. In the next section, you'll find a way to make the string more general.

PASS A FUNCTION TO A BEAUTIFUL SOUP METHOD
In addition to strings, you can often pass functions as arguments to Beautiful Soup
methods. You can change the previous line of code to use a function instead:

python_jobs = results.find_all('h2', string=lambda text: 'python' in text.lower())

Now you're passing an anonymous function to the string= argument. The lambda function
looks at the text of each <h2> element, converts it to lowercase, and checks whether the
substring 'python' is found anywhere in there. Now you have a match:
>>> print(len(python_jobs))
1
Your program has found a match!
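One caveat: an <h2> that contains several child elements has no single string, and depending on the Beautiful Soup version the filter function may then be called with None instead of text. A named function with a guard avoids an AttributeError in that case. The sketch below uses made-up HTML; the function name mentions_python is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two ordinary titles and one heading with several children.
html = """
<div id="ResultsContainer">
  <h2 class="title"><a href="https://example.com/job/1">Python Developer
  </a></h2>
  <h2 class="title"><a href="https://example.com/job/2">Senior Engagement Manager</a></h2>
  <h2 class="title"><span>Sponsored</span> <a href="#">listing</a></h2>
</div>
"""

def mentions_python(text):
    """Return True if text is a string containing 'python' (case-insensitive)."""
    # text may be None when an <h2> has several children and no single string.
    return text is not None and "python" in text.lower()

results = BeautifulSoup(html, "html.parser").find(id="ResultsContainer")

# The exact string does not match because of the trailing whitespace.
exact = results.find_all("h2", string="Python Developer")
flexible = results.find_all("h2", string=mentions_python)

print(len(exact), len(flexible))
```

The same function can be reused for other keywords by turning the substring into a parameter.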

If you still don't get a match, try adapting your search string. The job offers on this
page are constantly changing, and there might not be a job listed that includes the
substring 'python' in its title at the time that you're working through this example.
The process of finding specific elements depending on their text content is a powerful
way to filter your HTML response for the information that you're looking for. Beautiful
Soup allows you to use either exact strings or functions as arguments for filtering text
in Beautiful Soup objects.

EXTRACT ATTRIBUTES FROM HTML ELEMENTS


At this point, your Python script already scrapes the site and filters its HTML for
relevant job postings. Well done! However, one thing that's still missing is the link to
apply for a job.
While you were inspecting the page, you found that the link is part of the element that
has the title HTML class. The current code strips away the entire link when accessing
the .text attribute of its parent element. As you've seen before, .text only contains the
visible text content of an HTML element. Tags and attributes are not part of that. To get
the actual URL, you want to extract one of those attributes instead of discarding it.
Look at the list of filtered results python_jobs that you created above. The URL is
contained in the href attribute of the nested <a> tag. Start by fetching the <a> element.
Then, extract the value of its href attribute using square-bracket notation:
python_jobs = results.find_all('h2', string=lambda text: "python" in text.lower())
for p_job in python_jobs:
    link = p_job.find('a')['href']
    print(p_job.text.strip())
    print(f"Apply here: {link}\n")

The filtered results will only show links to job opportunities that include python in
their title. You can use the same square-bracket notation to extract other HTML
attributes as well. A common use case is to fetch the URL of a link, as you did above.
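Square-bracket access raises a KeyError if the attribute is missing, and links on real pages are sometimes relative. A slightly more defensive variant is sketched below; the HTML, the example.com URLs, and the BASE constant are placeholders made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://example.com/jobs/search/"  # placeholder for the page that was scraped

html = """
<h2 class="title"><a href="https://example.com/jobs/python-developer">Python Developer</a></h2>
<h2 class="title"><a href="/jobs/42">Python Tester</a></h2>
<h2 class="title">Python Intern</h2>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for heading in soup.find_all("h2", class_="title"):
    anchor = heading.find("a")
    # .get() returns None instead of raising KeyError when href is absent.
    href = anchor.get("href") if anchor is not None else None
    if href is None:
        continue
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    links.append((heading.get_text(strip=True), urljoin(BASE, href)))

print(links)
```

The heading without a link is skipped, and the relative link is resolved against the page URL.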

GENERATING A CSV FROM THE DATA
Finally, we would like to save all our data in a CSV file. Here, records is assumed to be
the list of (title, company, location, job URL) rows collected while looping over the job
postings:
import csv

with open('results.csv', 'w', newline='', encoding='utf-8') as f:
    print("Saving the extracted data into a file")
    writer = csv.writer(f)
    writer.writerow(['Title', 'Company', 'Location', 'JobUrl'])
    writer.writerows(records)
Here we create a CSV file called results.csv and save all the job postings in it for any
further use.
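To check the CSV output without touching the file system, the same writing logic can be pointed at an in-memory buffer. This is a small sketch: the helper write_records and the sample row are assumptions made for illustration, shaped like the scraped data above.

```python
import csv
import io

FIELDS = ["Title", "Company", "Location", "JobUrl"]

def write_records(records, fileobj):
    """Write a header row followed by one CSV row per record."""
    writer = csv.writer(fileobj)
    writer.writerow(FIELDS)
    writer.writerows(records)

# Sample row (assumed values, shaped like the scraped data above).
records = [
    ("Python Developer", "LanceSoft Inc", "Woodlands, WA", "https://example.com/job/1"),
]

buffer = io.StringIO(newline="")
write_records(records, buffer)
print(buffer.getvalue())
```

Note that the csv module quotes the location automatically because it contains a comma, so the row still has exactly four columns when it is read back.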

CHAPTER 5
DESIGN
5.1 SYSTEM REQUIREMENTS SPECIFICATION
5.1.1 HARDWARE REQUIREMENTS
The hardware requirements for the project are:
CPU : 2 x 64-bit, 2.8 GHz, 8.00 GT/s CPUs or better.
RAM : at least 2 GB
HARDDISK : at least 20 GB

5.1.2 SOFTWARE REQUIREMENTS


The software requirements for the demonstration of the project are:
• Jupyter Notebook Environment or Google Colab.
• Operating System: Any Windows Operating System or any macOS.
• Language: Python3.
• Libraries: Python (requests, beautiful soup)

5.1.3 DEPENDENCIES
Ensure that necessary Python packages like requests, pandas, and other supporting libraries
are installed.

5.2 UML DIAGRAMS

A UML diagram is a diagram based on the UML (Unified Modeling Language) whose purpose
is to visually represent a system along with its main actors, roles, actions, artifacts,
or classes, in order to better understand, alter, maintain, or document information
about the system.

5.2.1 CLASS DIAGRAM

Fig. 5.2.1 Class Diagram

5.2.2 USE CASE DIAGRAM

Fig. 5.2.2 Use Case Diagram

5.2.3 ACTIVITY DIAGRAM

Fig. 5.2.3 Activity Diagram

5.3 SYSTEM STUDY


1. Objective
The project aims to develop a robust data aggregation system utilizing Python for web
scraping, facilitating the compilation of diverse information from specified websites.
2. Scope
The project scope encompasses:
❖ Aggregating data from target websites in real-time.
❖ Extracting specific data points like prices, reviews, or any other relevant
information.
❖ Handling dynamic website structures and adapting to changes.
❖ Storing aggregated data in a structured database.
❖ Providing a foundation for future scalability and addition of new data sources.

3. Data Sources
❖ Identifying and categorising the target websites, considering their:
❖ Content structure.
❖ HTML markup and CSS classes.
❖ Frequency of updates.
❖ Any anti-scraping measures in place.
4. Data Aggregation Process
❖ Define the step-by-step data aggregation process:
❖ Initiate HTTP requests to target websites.
❖ Parse HTML content using Beautiful Soup and/or utilize Selenium for dynamic
content.
❖ Extract relevant data points based on defined criteria.
❖ Transform and clean data for consistency.
❖ Load data into the chosen database.
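The transform and load steps listed above can be sketched with Python's built-in sqlite3 module. Everything specific in this example is an assumption made for illustration: the table name jobs, its columns, the uniqueness rule, and the sample rows.

```python
import sqlite3

def clean(record):
    """Transform step: trim whitespace from every field."""
    return tuple(field.strip() for field in record)

def load(records, conn):
    """Load step: insert cleaned records, skipping exact duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "title TEXT, company TEXT, location TEXT, "
        "UNIQUE (title, company, location))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (title, company, location) VALUES (?, ?, ?)",
        [clean(r) for r in records],
    )
    conn.commit()

# Extracted rows as they might come out of the parsing step (sample values).
extracted = [
    ("  Python Developer ", "LanceSoft Inc", " Woodlands, WA"),
    ("Python Developer", "LanceSoft Inc", "Woodlands, WA"),  # duplicate after cleaning
]

conn = sqlite3.connect(":memory:")
load(extracted, conn)
rows = conn.execute("SELECT title, company, location FROM jobs").fetchall()
print(rows)
```

An in-memory database is used here so the sketch leaves no files behind; a file path in place of ":memory:" would make the data persistent.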
5. Data Storage
❖ Choose a suitable database (e.g., SQLite, MySQL, MongoDB) and design an
appropriate schema.
❖ Consider normalization, indexing, and the handling of large datasets when
designing the schema.
6. Error Handling
❖ Implementing mechanisms to handle potential issues:
❖ Monitor for changes in website structure.
❖ Address connectivity issues.
❖ Log errors and provide notifications for intervention.
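Connectivity problems are often temporary, so one possible mechanism is to retry a failed step a few times while logging each failure. The sketch below is one way to do this with the standard library; the function name with_retries, the logger name, and the default delays are assumptions made for illustration.

```python
import logging
import time

logger = logging.getLogger("scraper")

def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure, log the error and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # narrow this to network errors in real code
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x ... the base delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep function is passed in as a parameter so that the waiting behaviour can be replaced in tests; in normal use the default time.sleep applies.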
7. Security and Ethical Considerations
❖ Ensure compliance with legal and ethical standards:
❖ Respect website terms of service.
❖ Implement rate limiting to avoid IP blocking.
❖ Encrypt sensitive data if applicable.
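Two of these points lend themselves to a short sketch: checking a site's robots.txt rules and spacing out requests. The example below uses only the standard library; the helper make_robot_checker, the RateLimiter class, and the sample robots.txt content are assumptions made for illustration and are not taken from any real site.

```python
import time
from urllib.robotparser import RobotFileParser

def make_robot_checker(robots_txt, user_agent="*"):
    """Return a function that tells whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now

# Sample robots.txt content (an assumption, not taken from a real site).
allowed = make_robot_checker("User-agent: *\nDisallow: /private/\n")
print(allowed("https://example.com/jobs/"), allowed("https://example.com/private/x"))
```

A scraper would call limiter.wait() before every request and skip any URL for which the checker returns False.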
8. Monitoring and Logging
❖ Develop a logging system to track:
❖ Successful data extraction.
❖ Errors and exceptions.
❖ Performance metrics.

9. User Interface
❖ Consider adding a user interface for:
❖ Configuring scraping parameters.
❖ Monitoring the scraping process.
❖ Viewing aggregated data.
10. Testing
❖ Establish a comprehensive testing plan covering:
❖ Unit testing for individual functions.
❖ Integration testing for the entire scraping pipeline.
❖ Handling edge cases and exceptions.

CHAPTER 6
TECHNOLOGIES LEARNT
6.1 Python web scraping tools
In the Python ecosystem, there are several well-established tools for executing a web
scraping project:
• Scrapy
• Selenium
• BeautifulSoup
In the following, we will go over the advantages and disadvantages of each of these three
tools.
Web scraping with Scrapy
The Python web scraping tool Scrapy uses an HTML parser to extract information from
the HTML source code of a page. This results in the following schema illustrating web
scraping with Scrapy:
URL → HTTP request → HTML → Scrapy
The core concept in scraper development with Scrapy is the web spider, a small program
built on Scrapy. Each spider is programmed to scrape a specific website and crawls from
page to page, as a spider is wont to do. Spiders are written in an object-oriented
style: each spider is its own Python class.
In addition to the core Python package, the Scrapy installation comes with a command-line
tool. The spiders are controlled using this Scrapy shell. In addition, existing spiders can be
uploaded to the Scrapy cloud. There the spiders can be run on a schedule. As a result, even
large websites can be scraped without having to use your own computer and home internet
connection. Alternatively, you can set up your own web scraping server using the open-
source software Scrapyd.
Scrapy is a sophisticated platform for performing web scraping with Python. The
architecture of the tool is designed to meet the needs of professional projects. For example,
Scrapy contains an integrated pipeline for processing scraped data. Page retrieval in Scrapy
is asynchronous which means that multiple pages can be downloaded at the same time.
This makes Scrapy well suited for scraping projects in which a high volume of pages needs
to be processed.

Web scraping with Selenium
The free-to-use software Selenium is a framework for automated software testing for web
applications. While it was originally developed to test websites and web apps, the
Selenium WebDriver with Python can also be used to scrape websites. Despite the fact that
Selenium itself is not written in Python, the software’s functions can be accessed using
Python.
Unlike Scrapy or BeautifulSoup, Selenium does not use the page’s HTML source code.
Instead, the page is loaded in a browser without a user interface. The browser interprets the
page’s source code and generates a Document Object Model (DOM). This standardized
interface makes it possible to test user interactions. For example, clicks can be simulated
and forms can be filled out automatically. The resulting changes to the page are reflected
in the DOM. This results in the following schema illustrating web scraping with Selenium:
URL → HTTP request → HTML → Selenium → DOM
Since the DOM is generated dynamically, Selenium also makes it possible to scrape pages
with content created in JavaScript. Being able to access dynamic content is a key
advantage of Selenium. Selenium can also be used in combination with Scrapy or
BeautifulSoup. Selenium delivers the source code, while the second tool parses and
analyzes it. This results in the following schema:
URL → HTTP request → HTML → Selenium → DOM → HTML →
Scrapy/BeautifulSoup.

Web scraping with BeautifulSoup


BeautifulSoup is the oldest of the Python web scraping tools presented here. Like Scrapy,
it is also an HTML parser. This results in the following schema illustrating web scraping
with BeautifulSoup:
URL → HTTP request → HTML → BeautifulSoup
Unlike Scrapy, developing scrapers with BeautifulSoup does not require object-oriented
programming. Instead, scrapers are coded as a simple script. Using BeautifulSoup is thus
probably the easiest way to fish specific information out of the “tag soup”.

                                      Scrapy    Selenium    BeautifulSoup
Easy to learn                         ++        +           +++
Accesses dynamic content              ++        +++         +
Creates complex applications          +++       +           ++
Able to cope with HTML errors         ++        +           +++
Optimized for scraping performance    +++       +           +
Strong ecosystem                      +++       +           ++

Comparative study and knowing advantages of python libraries

6.2 Installation of Python Packages


Using Python packages for web scraping
Every web scraping project is different. Sometimes, you just want to check the website for
any changes. Other times, you are looking to perform complex analyses. With Python, you
have a wide selection of packages at your disposal.
Use the following command in the command-line terminal to install packages with pip3:
pip3 install <package>
Integrate modules into the Python script with import:
from <package> import <module>

The following packages are often used in web scraping projects:

Package     Use
venv        Manage a virtual environment for the project
requests    Request websites
lxml        Use alternative parsers for HTML and XML
csv         Read and write spreadsheet data in CSV format
pandas      Process and analyze data
scrapy      Use Scrapy
selenium    Use Selenium WebDriver

6.3 HTTP Headers


In this section, we will learn about HTTP headers and their importance in web scraping.
I will also try to explain the types of headers in detail. Let's get started!
Headers provide essential meta-information about the request and response, such as
content type, user agent, content length, and much more. Each header is a line of text
consisting of a name and a value separated by a colon.
Headers have a significant impact on web scraping. Sending appropriate headers makes it
more likely that the server returns the same content a regular browser would receive.
Website owners often deploy anti-bot technology to protect their websites from being
scraped by bots, and requests with missing or unusual headers are more likely to be
rejected. Passing appropriate headers with the HTTP request reduces the chance that the
request is flagged and the IP address blocked.
Headers can be classified into four types:

• Request Headers
• Response Headers
• Representation Headers
• Payload Headers
Let us learn each of them in detail.

1. Request Headers
The headers sent by the client when requesting data from the server are known as request
headers. They also help the server recognize the sender of the request, that is, the
client, from the information they carry.
Here are some examples of the request headers.
• authority: en.wikipedia.org
• method: GET
• accept-language: en-US, en;q=0.9
• accept-encoding: gzip, deflate, br
• upgrade-insecure-requests: 1
• user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_4) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/100.0.4869.91 Safari/537.36
The user agent indicates the type of software or application used to send the request to the
server.
The Accept-Language header tells the server about the desired language for the response.
The Accept-Encoding header is a request header sent by the client that indicates the
content encoding it can understand.
Note: Not every header that appears in a request is classified as a request header. For
example, the Content-Type header is a representation header, not a request header.
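
A minimal sketch of attaching request headers in Python follows. It uses the standard
library's urllib.request.Request as a stand-in for whichever HTTP client the scraper
uses, and it only builds the request object without sending anything. The header values
are copied from the examples above; the URL and the helper name build_request are
assumptions made for illustration.

```python
from urllib.request import Request

# Header values taken from the examples in this section.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_4) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/100.0.4869.91 Safari/537.36"
    ),
    "Accept-Language": "en-US, en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Upgrade-Insecure-Requests": "1",
}


def build_request(url, extra_headers):
    """Create a GET request object carrying the given headers (nothing is sent)."""
    return Request(url, headers=extra_headers, method="GET")


# The URL is an assumption chosen to match the example host above.
request = build_request("https://en.wikipedia.org/", headers)

# urllib stores header names in capitalized form, e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```

With the requests package, the same dictionary can be passed as
requests.get(url, headers=headers). One design point: Accept-Encoding should list only
encodings the client can actually decode, since the server may pick any of them for the
response body.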

2. Response Headers
The headers sent by the server to the client in reply to a request are known as Response
Headers. They are not related to the content of the message; the server sends them to
convey instructions and information to the client.
Here are some examples of the response headers.
• content-length: 35408
• content-type: text/html
• date: Thu, 13 Apr 2023 14:09:10 GMT
• server: ATS/9.1.4
• cache-control: private, s-maxage=0, max-age=0, must-revalidate

The Date header indicates the date and time at which the response was generated.
The Server header tells the client which server software produced the response, and
the Content-Length header indicates the length, in bytes, of the content returned by the
server.
Note: The Content-Type header is a representation header.
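
The sketch below shows one way to read such response headers in Python. It parses the
example values above from a raw text block with the standard email package, which
handles the same "name: value" line format. The helper name parse_headers is an
assumption; a real HTTP client would normally expose the parsed headers directly.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Raw header block built from the example values above; a blank line ends the headers.
raw_headers = (
    "content-length: 35408\n"
    "content-type: text/html\n"
    "date: Thu, 13 Apr 2023 14:09:10 GMT\n"
    "server: ATS/9.1.4\n"
    "cache-control: private, s-maxage=0, max-age=0, must-revalidate\n"
    "\n"
)


def parse_headers(text):
    """Parse 'name: value' lines into an object with case-insensitive lookup."""
    return message_from_string(text)


response_headers = parse_headers(raw_headers)

# Header values arrive as strings and are converted where a type is needed.
body_length = int(response_headers["Content-Length"])
sent_at = parsedate_to_datetime(response_headers["Date"])
cache_directives = [d.strip() for d in response_headers["Cache-Control"].split(",")]

print(body_length, sent_at.isoformat(), cache_directives)
```

Header names are case-insensitive, which is why the lookup with "Content-Length" finds
the value stored under "content-length".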

3. Representation Headers
The headers that describe the representation of the resource carried in the HTTP response
body sent to the client by the server are known as Representation Headers.
The data can be transferred in several formats, such as JSON, XML, and HTML.
Here are some examples of the representation headers.
• content-encoding: gzip
• content-length: 35408
• content-type: text/html
The Content-Encoding header informs the client about the encoding of the HTTP response
body.
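
To illustrate how Content-Encoding is used, the sketch below decodes a response body
according to that header. The body is simulated locally with gzip.compress rather than
fetched from a server, and the function name decode_body and the plain dictionary of
headers are assumptions made for this example.

```python
import gzip


def decode_body(body, headers):
    """Return the decoded bytes of a response body based on Content-Encoding."""
    encoding = headers.get("content-encoding", "identity").lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "identity":
        return body
    raise ValueError(f"unsupported content encoding: {encoding}")


# Simulate a server response by compressing a small HTML page.
original = b"<html><body>Hello</body></html>"
compressed = gzip.compress(original)
headers = {"content-encoding": "gzip", "content-type": "text/html"}

decoded = decode_body(compressed, headers)
print(decoded.decode("utf-8"))
```

Many HTTP client libraries perform this step automatically, so the sketch is mainly
useful for seeing what the header means.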

4. Payload Headers
The headers that describe the payload, that is, the information needed to transport it
safely and to reconstruct the original resource representation, are known as Payload
Headers.
Here are some examples of payload headers.
• content-length: 35408
• content-range: bytes 200-1000/67589
• trailer: Expires
The Content-Range header indicates where a partial message belongs within the full body.
In the example above, the message carries bytes 200 through 1000 of a 67589-byte resource.
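
The arithmetic behind Content-Range can be sketched as follows. The parser handles only
the common "bytes start-end/total" form (with "*" for an unknown total) and is an
illustrative helper, not part of any library.

```python
import re

# Matches values such as "bytes 200-1000/67589" or "bytes 200-1000/*".
CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value):
    """Return (start, end, total) from a Content-Range value; total may be None."""
    match = CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized Content-Range value: {value!r}")
    start, end = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return start, end, total


start, end, total = parse_content_range("bytes 200-1000/67589")

# Both positions are inclusive, so the chunk holds end - start + 1 bytes.
chunk_length = end - start + 1
print(start, end, total, chunk_length)
```

Because both byte positions are inclusive, the example range covers 801 bytes.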
This completes the Headers section. There are more headers that could be discussed, but
covering them would make this report long and take it away from the main topic.

CHAPTER 7
TESTING
The scraper built with the code above returns exactly the data that appears on the
website when the same search is performed manually on that site. The CSV file is created
on every run, but it contains only the data the site shows for that search. We therefore
check the generated CSV file after each run to confirm that the code works correctly.
Whether the input is relevant or irrelevant, the extracted output depends entirely on the
website and the data it holds.
Case 1: If an empty string is given as input, no data is written to the CSV file.

Case 2: If an irrelevant string is given as input, no data is written to the CSV file.
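
The manual check of the generated CSV file can be partly automated. The sketch below
counts the data rows in CSV output so that Case 1 and Case 2 can be asserted to produce
none. It assumes the scraper writes a header row first; the column names and sample
contents are hypothetical, and CSV text is used in place of the real output file.

```python
import csv
import io


def count_data_rows(csv_text):
    """Count the non-empty rows that follow the header row in CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row if there is one
    return sum(1 for row in reader if row)


# Illustrative outputs: a file with only a header, and a file with one result.
empty_output = "title,price\r\n"
normal_output = "title,price\r\nExample item,100\r\n"

print(count_data_rows(empty_output))
print(count_data_rows(normal_output))
```

For the real file, the text can be read with open(path, newline="").read() before it is
passed to count_data_rows.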

CHAPTER 8
RESULTS
As noted above, the scraper returns exactly the data that appears when the website is
searched manually. The results are as follows.
Case 1

Case 2

CHAPTER 9
CONCLUSION AND FUTURE SCOPE
It is safe to say that web scraping has become an essential skill in today's
digital world, not only for technology companies and not only for technical
roles. On one hand, compiling large datasets is fundamental to Big Data
analytics, Machine Learning, and Artificial Intelligence; on the other, with
the explosion of digital information, Big Data is becoming easier to access than
ever before.
FUTURE SCOPE
With more and more data being added to the web, the importance of web
scraping is growing. Many companies now offer customized web scraping tools
to their clients, gathering data from across the web and arranging it into
useful, easily understandable information. This reduces the labor needed to
visit each website manually and collect the data. Web scrapers are designed and
coded for each individual website, while crawlers scrape broadly. If a website
has a complicated structure, more coding is needed to scrape its data than for
a simple one. The future of web scraping is bright, and it will become more and
more essential for every business as time passes.
Web scraping services are among the most widely practiced activities of IT
companies and e-commerce stores around the world. A common question is why
an organization, business, or e-commerce store needs to extract data from the
web. The simple answer is that the Internet is the largest source of data in the
world and contains data on every field of life, whether it is data on a
particular product, price list, job, or share prices. All of this data can be
gathered with the help of web scraping. Some companies still use old, manual
methods of gathering data from the web, which has left them behind in their
field.
One of the largest and most effective providers of web scraping services is
Information Transformation Services (ITS), which operates across the globe. ITS
has clients all over the world and deals in every field and specialty.
ITS uses specialized software and techniques that help companies get fast
results from the web, and it claims to deliver 99% accurate data. Its services
reduce the time and resources a company spends gathering data from the
Internet. They also eliminate error-prone manual processes, with the aim of
keeping scraping free from manual or automated error.

Fig. 9: Dependency of web scraping on different fields

The data extracted by web scraping is mostly stored in an e-business system.
The system lets customers easily search for any product and get information
about it. The data could be anything from product names and part numbers to
prices and stock levels. This is the most popular approach in e-commerce stores
and other web-based systems whose operations are closely tied to data that may
be connected to, or contained in, e-commerce sites and member directories. With
the scraping services offered by ITS, customers can gather results from
professional directories, member listings, or any other website that contains
relevant data.

ADVANTAGES OF WEB SCRAPING
➢ Marketing
In the near future, web scraping will be one of the major tools in the lead generation
process. A web scraping tool can carry out market research on a particular product or
service and offers huge benefits in the marketing field.

➢ Customer Behavior and Buying Trends

Web scraping can give insight into what customers think about a product or service and
helps in planning a marketing strategy for it. Scraping customer reviews and customer
feedback will become very important in the near future.

➢ E-commerce and Price Intelligence

Web scraping is an important tool for deciding the price of a product in an e-commerce
store. Most companies derive their strategies from data they have scraped through
automated monitoring of competitors' websites. This trend will grow in the coming years,
so the need for web scraping services will increase in e-commerce stores and in the hotel
and travel industry.

➢ Equity Research Market and Data Scraping

Market trends and forecasts play an important role in equity-based markets. Investors
also need to know the latest and upcoming market forecasts, and web scraping can play an
important role in tracking market trends. This market data can help investors invest more
wisely.

BIBLIOGRAPHY

1. D. M. Thomas and S. Mathur, "Data Analysis by Web Scraping using Python," 2019 3rd
International conference on Electronics, Communication and Aerospace Technology
(ICECA), 2019, pp. 450-454, doi: 10.1109/ICECA.2019.8822022.
2. Website Scraping with Python - Using BeautifulSoup and Scrapy | Gábor Hajba | Springer
3. Beautiful Soup: Build a Web Scraper With Python – Real Python
4. Implementing Web Scraping in Python with BeautifulSoup - GeeksforGeeks
5. The Future of Web Scraping Services - ITS (it-s.com)
6. What is Web Scraping and Why You Should Learn It? - KDnuggets
7. Python | Tools in the world of Web Scraping - GeeksforGeeks
8. Renita Crystal Pereira, Vanitha T. "Web Scraping of Social Networks." International
Journal of Innovative Research in Computer and Communication Engineering, vol. 3,
pp. 237-239, Oct. 7, 2018.
9. Ghazvinian, Holbert, Viswanathan. "SimpleWebScraping." Internet:
https://seanholbert.wordpress.com/2011/07/15/scrappy-simple-webscraping/, Jun. 2015.
10. Bellarosey. "Crowdsourcing-Definition." Internet:
http://crowdsourcing.typepad.com/cs/2006/06/crowdsourcing_a.html, Jun. 02, 2006.
11. BrightPlanet.com Deep Web White Paper.
http://www.completeplanet.com/Tutorials/DeepWeb/index.asp.
12. Kolari, P. and Joshi, A. "Web mining: research and practice, Computing in Science &
Engineering", IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 2,
Vol. 6, No. 4, 2004.
13. Kengtel, W.; Wagner, M. Proteins 1999, 37, 334-345.
14. Datahen. "3 Advantages of web scraping