Automated Web Scraping For Telecom Corpus Application
Abstract—The necessity for effective techniques of data collection and analysis has grown as the amount of data available on the internet has increased. Web scraping is one technique that enables us to take information from websites and save it for later examination. By creating a tool that can gather pertinent information from telecom websites and store it in a structured fashion for analysis, we hope to automate the process of web scraping for a telecom corpus as part of this project. Python is the programming language that we employ, along with several packages for web scraping, including Beautiful Soup and Selenium. This paper seeks to provide insights into the methodology of creating a web application that a user could use to scrape telecom websites.

Keywords—Web scraping, Flask, MongoDB, Web Application, Python, PyMongo, Selenium, BeautifulSoup
I. INTRODUCTION

Web scraping, sometimes referred to as screen scraping, web data extraction, web harvesting, etc., is a technique for automatically extracting web data rather than duplicating it by hand [1]. It is a method for extracting useful information from a website's HTML and putting it in a centralised, local database or spreadsheet; for this, it makes use of the website's URL. Web scraping is carried out using web scrapers, which are programmed to scrape data from specific websites or a certain type of website. Web scraping is used for data acquisition, the first phase in any data science study and development; it is the step in which data is obtained from private sources such as firm sales records and financial reports, from public sources such as journals, websites, and open data, or by purchasing data [2]. Data acquisition is followed by data analysis, a technique for finding solutions to issues by questioning and interpreting data [3]. There are hundreds of web scraping tools available today, most of them built using Java, Python and Ruby [4].
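As a brief illustration of this idea, the sketch below fetches a page by its URL, parses the HTML, and stores the extracted text in a local CSV spreadsheet. The URL and the h2 selector are placeholders rather than sources used in this project.

# Illustrative only: extract headline text from a page's HTML via its URL
# and store it in a local CSV "spreadsheet". URL and selector are placeholders.
import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/news", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> element, one row per headline.
rows = [[tag.get_text(strip=True)] for tag in soup.find_all("h2")]

with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])
    writer.writerows(rows)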
The telecommunications sector is essential to the development of our connected world in the modern digital environment. Given the wide variety of data sources, including websites, forums, social media, and more, it is crucial for researchers, policymakers, and industry professionals to be able to access, analyse, and comprehend the quantity of information available. Web scraping becomes a powerful tool in this situation, enabling the rapid gathering of important data from the internet. Moreover, it also finds application in the energy harvesting sector of research and development [5]-[9].

This paper explores how web scraping is used to create an application that mines data from websites publishing news related to the telecom industry and creates a database using the data mined.

The rest of the paper is organized as follows: Section II provides an insight into related research in web scraping. Section III discusses the product development lifecycle. Section IV focuses on the methodology that was used to create the web application. Section V presents the results of the project. Finally, Section VI concludes the paper.

II. STATE OF ART

A query with a number of restrictions coupled by logical operators might be thought of as an anomaly. Upon execution, such queries might return results that the user can categorise as anomalies or non-anomalies. Web scraping is used to detect anomalies in the work reported in [10], in which eShopmonitor is developed for the purpose of detecting mining and crawling abnormalities. The product consists of three main parts: a crawler that retrieves pages of interest to the user, a miner that enables the user to specify the fields of interest in various types of pages and then extracts these fields from the crawled pages, and a reporter that produces reports on the gathered data. The web scraper in this project was developed taking inspiration from this paper, using restrictions to scrape data from specific fields only.

People are greatly affected by weather predictions, particularly in the social and economic spheres. By gathering weather data, we can study yearly trends that influence temperature and precipitation patterns. Data collection must take place over time to improve the analysis of weather forecasts. Therefore, in the work reported in [11], an attempt is made to mine data from various weather websites, consolidate it, and provide up-to-date weather information online. Web scraping technologies were used in that study to gather information from websites providing real-time data. By combining data from several websites into a single database or spreadsheet using the web scraping approach, the data was more
easily analysed and visualised. The web scraper took inspiration
from this paper to scrape data from 4 websites and consolidate
it.

To gather data and create a platform for visualising and displaying COVID-19 data from every country in the world, the work reported in [12] presents a technology that performs data mining using web scraping and web crawling. This tool was developed in response to services such as the one run by Johns Hopkins, which compiles data from eight different non-governmental sources. However, the procedure for gathering, organising, and structuring the data for their "COVID-19 Dashboard by the Centre for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU)" is opaque, which creates a further issue: some users are unable to use it as a tool to acquire datasets from preferred data sources with customised data structures and a user-defined acquisition frequency. Users are unable to specify the level of granularity, filter the information, or choose the types of data for tailored scholarly study. In other words, consumers with specialised needs for tailored datasets cannot be sufficiently accommodated by present solutions, and this tool was created to meet those needs. The web scraper in this project used the ideas introduced in that paper to create the option of scraping only from articles that lie within a given range of dates.

Information gathering and processing are now simple thanks to developments in big data. Since people must move about to complete daily chores such as travelling to work, school, and shopping, transportation data is extremely important. As a result, the number of trips made by public transportation increases every day. These trips are frequently made by people with special needs, who are often unaware of how accessible the routes are. Even though there are many websites and apps that offer information on public transport services, most of them do not go into great depth on how accessible the routes are. In order to increase accessibility to urban public transit, the work in [13] suggests a technology framework for the processing, management, and use of open data. The insights provided by this paper also inspired some features of the web scraper, such as the main interface that the user interacts with.

The work reported in [14] proposes a system that bypasses the sophisticated anti-crawler encryption mechanisms of some dynamic web pages and obtains the required page information by simulating human web browsing with a Python automation testing framework. The system retrieves item information from keyword searches made with the Requests library. Both Beautiful Soup and regular expressions are used to tidy up the data initially, and the data is then kept by the system in a MongoDB database. The web scraper in this project improves on the ideas given by that paper by using its own methods to bypass the restrictions posed by each website it scrapes data from.

The work reported in [15] shows how data can be stored and how access can be managed so that data for different users is segmented, with each user getting their own segments. This paper inspired the idea of segmenting user data based on the account used.

III. BACKGROUND

This section details the steps involved in the product development lifecycle. Organisations use the Product Development Lifecycle (PDLC), a systematic process, to create, introduce, and improve new or existing products. The numerous crucial steps that make up this well-structured framework direct the development of a concept into a valuable product. Understanding the PDLC is essential for contextualising the study's findings, since it forms the basis for creative solutions and novel perspectives. The PDLC generally involves idea generation, idea screening, concept development, business analysis, development, and product launch.

A. Idea Generation
The development of fresh ideas is the first phase of the PDLC. These concepts can be influenced by several factors, including client input, market trends, internal brainstorming sessions, or upcoming technologies. It is crucial at this point to comprehend market dynamics and client preferences. Business owners often carry out market and competitor research to find opportunities and gaps. Cross-functional teams, where members from several departments share their knowledge and viewpoints, typically assist the production of ideas.

B. Idea Screening
Potential ideas are compared against established criteria during this step. These considerations often include viability, conformity to the organisation's mission and goals, market potential, and profitability. Idea screening helps save resources that might otherwise be spent on development by lowering the risk of pursuing unworkable or poorly suited concepts.

C. Concept Development and Testing
Viable ideas are developed further into detailed concepts. This involves specifying the features of the product, the user demographics that will utilise it, and its distinctive selling point. Prototypes are frequently made to see how the product will look and work; these can range from simple mockups to completely developed models. Concepts are then presented to a chosen set of potential users or customers to gather input. Before moving on to the development stage, the concept must be refined using this feedback.

D. Business Analysis
Extensive market research is undertaken to determine potential demand, market trends, and competition for the proposed product. An evaluation of the development, production, and marketing expenses is essential for determining whether a product is financially viable. Revenue predictions are developed to determine the profitability of the product based on market analysis and cost estimates.

E. Development
Detailed specifications for the product are created, taking into account the user experience, design components, and technological needs. The development of the product then starts
with engineering teams working on it. This is then followed by testing, wherein the product is tested rigorously.

F. Launch
The product is then launched. The launch is followed by post-launch evaluations and by further development and changes for every new version that is to be launched.

IV. METHODOLOGY

A. Identifying data sources
The possible data sources were first identified, since the objective of the project was to obtain data from these sources and use it to analyse and determine industry trends in real time. It was concluded that the best possible data sources would be news websites, and that, to make it easier to filter and focus on telecom industry related news, it would be best to scrape data from websites devoted to publishing news and articles on developments in the telecom industry.

B. Design of the web scraping program
The identified data sources were inspected using the inspect tool available in browsers. The HTML structure of the webpages was analysed to determine the best way to locate the data of interest, namely the paragraphs of the news articles. It was determined that the best way to scrape these paragraphs would be to target the HTML <p> tags, and the BeautifulSoup Python library is used to write a program that can scrape data from one page at a time.
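A minimal sketch of such a per-page scraper is shown below. The URL is a placeholder and the exact selectors differ per website, but it illustrates the <p>-tag approach described above.

# A sketch of a single-page scraper, assuming a static article page;
# the URL is a placeholder, not one of the sites scraped in this project.
import requests
from bs4 import BeautifulSoup

def scrape_article(url: str) -> str:
    """Return the body of one article as newline-separated paragraphs."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return "\n".join(paragraphs)

if __name__ == "__main__":
    print(scrape_article("https://example.com/telecom-news/some-article"))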
We use the Flask web framework to build the web application and SQLAlchemy to provide the feature of logging in. The web application has two variations: one in which the user does not log in and only receives the output of the scraping task, and a second in which the user logs in and gains access to the history of the data they have previously scraped.
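A minimal sketch of this login feature, assuming a simple user table (the exact schema, routes, and configuration used in the project are not reproduced here), is shown below.

# Sketch of session-based login with Flask and Flask-SQLAlchemy; the User
# model, routes, and secret key are illustrative assumptions.
from flask import Flask, request, session, redirect
from flask_sqlalchemy import SQLAlchemy
from werkzeug.security import check_password_hash

app = Flask(__name__)
app.config["SECRET_KEY"] = "change-me"                        # needed for sessions
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///users.db"
db = SQLAlchemy(app)

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True, nullable=False)
    password_hash = db.Column(db.String(200), nullable=False)

@app.route("/login", methods=["POST"])
def login():
    # Authenticate against the SQLAlchemy-backed user table.
    user = User.query.filter_by(username=request.form["username"]).first()
    if user and check_password_hash(user.password_hash, request.form["password"]):
        session["user_id"] = user.id   # logged-in variation: scraping history is kept
        return redirect("/history")
    return redirect("/")               # failed login: fall back to the anonymous variation

with app.app_context():
    db.create_all()                    # create the user table on first run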
C. Storing Data
In order to store data and keep track of user history, we use MongoDB. We create a database and a collection to store data and track user activity. Further collections, linked to this collection through a relational ID, are devoted to storing each user's history of scraped data. Every time a user logs in, SQLAlchemy is used to authenticate them, and the PyMongo library is then used; once authentication is done, the user's activity is recorded against their account.
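The sketch below illustrates this storage scheme with PyMongo, assuming a local MongoDB instance; the database, collection, and field names are illustrative rather than the project's exact ones.

# Sketch of the MongoDB storage scheme: one collection for scraped runs and
# one for per-user history, linked through the run's ID. Names are illustrative.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["telecom_corpus"]
runs = db["scraped_runs"]
history = db["user_history"]

def store_run(user_id: int, website: str, paragraphs: list[str]) -> None:
    """Save one scraping run and link it to the user's history."""
    run = {
        "website": website,
        "paragraphs": paragraphs,
        "scraped_at": datetime.now(timezone.utc),
    }
    run_id = runs.insert_one(run).inserted_id
    history.insert_one({"user_id": user_id, "run_id": run_id})

def get_user_runs(user_id: int) -> list[dict]:
    """Retrieve every run previously scraped by this user."""
    run_ids = [h["run_id"] for h in history.find({"user_id": user_id})]
    return list(runs.find({"_id": {"$in": run_ids}}))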
V. RESULTS AND DISCUSSIONS

The web crawler was created using Python and its libraries, namely Selenium and BeautifulSoup, and was integrated into a web application that can be used to scrape data from four different news websites. Fig. 1 shows the home page of the web application. SQLAlchemy was used to create the user authentication feature; a user can authenticate themselves with their username and password. Features of Flask such as flash messages and login handling were used in tandem with SQLAlchemy to make everything work together and to integrate authentication as a feature of the web application. Fig. 2 shows the login page of the web application.
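For illustration, the sketch below shows the general pattern of combining the two libraries: Selenium renders the page, including any content produced by JavaScript, and BeautifulSoup parses the rendered HTML. The URL and browser options are placeholders rather than the exact configuration used for the four news websites.

# Sketch of the Selenium + BeautifulSoup pattern; URL and options are placeholders.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")      # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/telecom-news")
    soup = BeautifulSoup(driver.page_source, "html.parser")   # parse the rendered HTML
    article_links = [a["href"] for a in soup.find_all("a", href=True)]
    print(article_links[:10])
finally:
    driver.quit()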
The web crawler was set up so that it can scrape data in two ways. One way is to scrape all the articles listed on the first page of the home section of the website; the other allows the user to specify a date range so that only articles published within that range are scraped, as shown in Fig. 3. If the user is not logged in, the web application does not keep track of user data; as a result, the user receives the results of the scraping task they have run but cannot retrieve this data later. When the user is logged in, their activity is tracked, so the results of all of their scraping runs can be retrieved later. Fig. 4 shows the history page.

The user can choose between these two scraping methods, and when the user clicks the submit button a routing program runs the correct scraping program, since each website has a different structure and requires a dedicated program. If the user is logged in, the results of the scraping run are stored in a local MongoDB database before they are displayed. Fig. 5 shows the database where the results of the web scraping are stored. The user can later retrieve the data directly from the local MongoDB database or access it via the web application's user history.
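This routing idea can be sketched as follows; the website keys, the per-site scraper functions, and the date handling are illustrative placeholders for the four dedicated programs.

# Sketch of dispatching a scraping request to the scraper dedicated to each site;
# site names and scraper bodies are placeholders.
from datetime import date
from typing import Callable, Optional

def scrape_site_a(start: Optional[date], end: Optional[date]) -> list[str]:
    return ["...site A articles..."]          # stand-in for a site-specific scraper

def scrape_site_b(start: Optional[date], end: Optional[date]) -> list[str]:
    return ["...site B articles..."]          # stand-in for a site-specific scraper

# One entry per supported news website, since each needs its own program.
SCRAPERS: dict[str, Callable[[Optional[date], Optional[date]], list[str]]] = {
    "site_a": scrape_site_a,
    "site_b": scrape_site_b,
}

def run_scraping_task(website: str, start: Optional[date] = None,
                      end: Optional[date] = None) -> list[str]:
    """Route the submit-button request to the scraper matching the chosen website."""
    if website not in SCRAPERS:
        raise ValueError(f"No scraper registered for {website!r}")
    # start/end of None corresponds to the 'first page of the home section' mode.
    return SCRAPERS[website](start, end)

print(run_scraping_task("site_a"))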
Fig. 2. Login page of the web application.
Fig. 5. Picture of the MongoDB Database where the results of the web scraping tasks are stored.