
Automated Web Scraping for Telecom Corpus Application

Adithyan Sasikumar T V1, Sandeep Singh Chauhan*1, and Sarda Sharma1

1Department of Electronics and Communication Engineering, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India.

*Corresponding Author: Sandeep Singh Chauhan, Email: ss_chauhan@blr.amrita.edu

2024 IEEE 3rd International Conference on Data, Decision and Systems (ICDDS) | 979-8-3503-6389-0/24/$31.00 ©2024 IEEE | DOI: 10.1109/ICDDS62937.2024.10910530
Abstract— The necessity for effective techniques of data collection and analysis has grown as the amount of data available on the internet has increased. One technique that enables us to take information from websites and save it for later examination is web scraping. By creating a tool that can gather pertinent information from telecom websites and store it in a structured fashion for analysis, we aim to automate the process of web scraping for a telecom corpus as part of this project. Python is the programming language that we employ, along with several packages for web scraping, including Beautiful Soup and Selenium. This paper seeks to provide insights into the methodology of creating a web application that a user could use to scrape telecom websites.

Keywords—Web scraping, Flask, MongoDB, Web Application, Python, PyMongo, Selenium, BeautifulSoup

I. INTRODUCTION

Web scraping, sometimes referred to as screen scraping, web data extraction, web harvesting, etc., is a technique for automatically extracting web data rather than duplicating it by hand [1]. It is a method for extracting useful information from a website's HTML and putting it in a centralised, local database or spreadsheet. For this, it makes use of the website's URL. Web scraping is carried out using web scrapers, which are programmed to scrape data from specific websites or a certain type of website. Web scraping is used for data acquisition, the first phase in any data science study and development; it is the step in which data is obtained from private sources such as firm sales records and financial reports, from public sources such as journals, websites, and open data, or by purchasing data [2]. Data acquisition is followed by data analysis, a technique for finding solutions to issues by questioning and interpreting data [3]. There are hundreds of web scraping software packages available today, most of them designed using Java, Python, and Ruby [4].

The telecommunications sector is essential to the development of our connected world in the modern digital environment. It is crucial for researchers, policymakers, and industry professionals to access, analyse, and comprehend the quantity of information available given the wide variety of data sources, including websites, forums, social media, and more. Web scraping becomes a powerful tool in this situation, enabling the rapid gathering of important data from the internet. Moreover, automated data collection is similarly valuable in other areas of research and development, such as the energy harvesting application sector [5]-[9].

This paper explores how web scraping is used to create an application that mines data from websites publishing news related to the telecom industry and creates a database using the data mined.

The rest of the paper is organized as follows: Section II provides an insight into related research in web scraping, Section III discusses the product development lifecycle, Section IV focuses on the methodology that was used to create the web application, Section V presents the results of the project, and Section VI concludes the paper.

II. STATE OF THE ART

A query with a number of restrictions combined by logical operators might be thought of as an anomaly. Upon execution, the queries might return some results that the user can categorise as anomalies or non-anomalies. Web scraping is used to detect anomalies in the work reported in [10], in which eShopmonitor is developed for the purpose of detecting the presence of mining and crawling abnormalities. The product consists of three main parts: a crawler that retrieves pages of interest to the user, a miner that enables the user to specify the fields of interest in various types of pages and then extracts these fields from the crawled pages, and a reporter that produces reports on the gathered data. The web scraper was developed taking inspiration from this paper, using restrictions to scrape data from specific fields only.

People are greatly affected by weather predictions, particularly in the social and economic spheres. We can study yearly weather data trends that influence temperature patterns and precipitation by gathering weather data. Data collection must take place over time to improve the analysis of weather forecasts. Therefore, in the work reported in [11], an attempt is made to mine data from various weather websites, consolidate it, and provide up-to-date weather information online. Web scraping technologies were used in that study to gather information from real-time data-providing websites.
By combining data from several websites into a single database or spreadsheet using the web scraping approach, the data was more easily analysed and visualised. The web scraper took inspiration from this paper to scrape data from 4 websites and consolidate it.

To gather data and create a platform for visualising and displaying COVID-19 data from every country in the world, the work reported in [12] presents a tool that performs data mining using web scraping and web crawling. This tool was developed in response to services like the one from Johns Hopkins, which compiles data from eight different non-governmental sources. However, the procedure for gathering, organising, and structuring the data for their "COVID-19 Dashboard by the Centre for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU)" is opaque, which creates a further issue in that some users are unable to use it as a tool to acquire datasets from preferred data sources with customised data structures and to set up a user-defined acquisition frequency. Users are unable to specify the level of granularity, filter the information, or choose the types of data for tailored scholarly study. In other words, consumers that have specialised needs for tailored datasets cannot be sufficiently accommodated by existing solutions, and this tool was created to meet those needs. The web scraper used the ideas introduced in this paper to create the option of scraping only from articles that lie within a range of dates.

Information gathering and processing are now simple thanks to developments in big data. Since people must move about to complete some daily chores, such as travelling to work, school, and shopping, transportation data is extremely important. As a result, the number of trips made by public transportation increases every day. These trips are frequently made by people with special needs, but they are frequently unaware of how accessible the routes are. Even though there are many websites and apps that offer information on public transport services, most of them do not go into great depth on how accessible the routes are. In order to increase accessibility to urban public transit, the work in [13] suggests a technology framework for the processing, management, and use of open data. The insights provided by this paper also inspired some features of the web scraper, such as the main interface that the user interacts with.

The work reported in [14] proposes a system to bypass some dynamic web pages' sophisticated anti-crawler encryption mechanisms and get the required page information by simulating human web browsing using a Python operation automation testing framework. The system retrieves item information from Requests library keyword searches. Both Beautiful Soup and regular expressions were used to tidy up the data initially. The data is then kept by the system in a MongoDB database. The web scraper improves on the ideas given by this paper by using novel methods to bypass restrictions posed by each website it scrapes data from.
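To illustrate this simulated-browsing approach (a minimal sketch, not code from either paper; the URL, variable names, and browser options are assumptions for illustration), a headless Selenium browser can render a JavaScript-driven page that a plain HTTP request cannot fetch, after which the rendered HTML is handed to BeautifulSoup:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")          # run Chrome without a window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/telecom-news")  # placeholder URL
html = driver.page_source                       # HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]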
The work reported in [15] shows how data can be stored and access managed in such a way that data for different users is segmented, with each user getting their own segments. This paper inspired the idea of segmenting user data based on the account used.

III. BACKGROUND

This section details the steps involved in the product development lifecycle. Organisations use the Product Development Lifecycle (PDLC), a systematic process, to create, introduce, and improve new goods or current ones. The numerous crucial steps that make up this well-structured framework direct the development of a concept into a valuable product. Understanding the PDLC is essential for contextualising the study's findings since it forms the basis for creative solutions and novel perspectives. The PDLC generally involves idea generation, idea screening, concept development, business analysis, development, and product launch.

A. Idea Generation
The development of fresh ideas is a component of the PDLC's initial phase. These concepts could be influenced by several factors, including client input, market trends, internal brainstorming sessions, or upcoming technologies. It is crucial at this point to comprehend market dynamics and client preferences. Owners of businesses often do market and competitor research to find opportunities and gaps. Cross-functional teams, where members from several departments share their knowledge and viewpoints, typically assist the production of ideas.

B. Idea Screening
Potential ideas are compared to established standards during this procedure. These considerations often include viability, conformity to the organization's mission and goals, market potential, and profitability. Idea screening helps save resources that might otherwise be spent on development by lowering the risks of pursuing unworkable or poorly suited concepts.

C. Concept Development and Testing
Detailed concepts are created from viable ideas through further development. To do this, specify the features of the product, the user demographics that will utilise it, and the distinctive selling point. To see how the product will look and work, prototypes are frequently made. These prototypes might range from simple mockups to completely developed models. To get input, concepts are given to a chosen set of possible users or customers. Before moving on to the development stage, the concept must be refined using this feedback.

D. Business Analysis
To determine potential demand, market trends, and competition for the proposed product, extensive market research is undertaken. For determining whether a product is financially viable, an evaluation of the development, production, and marketing expenses is essential. Revenue predictions are developed to determine the profitability of the product based on market analysis and cost estimates.

E. Development
Specifications for the product are created in great detail, taking into account the user experience, design components, and technological needs.
The development of the product then starts, with engineering teams working on it. This is then followed by testing, wherein the product is tested rigorously.

F. Launch
The product is then launched. The launch is followed by post-launch evaluations and by further development and changes for every new version that is to be released.

IV. METHODOLOGY

A. Identifying data sources
The possible data sources were first identified. Since the objective of the project was to obtain data from the identified sources and use it to analyze and determine industry trends in real time, it was concluded that the best possible data sources would be news websites; to make it easier to filter and focus on Telecom industry related news, it would be best to scrape data from websites that are devoted to publishing news and articles related to developments in the Telecom industry.
B. Design the web scraping program
The data sources, once found, were inspected using the inspect tool that is available in browsers. The HTML structure of the webpages was then analyzed, and the best way to locate the data of interest (the paragraphs of news articles) was determined.

It was determined that the best way to scrape paragraphs from the news articles would be to use the HTML <p> tags. The BeautifulSoup Python library is used to write a program that can scrape data from one page at a time. We use the Flask web framework to build the web application, and SQLAlchemy to introduce the feature of the user being able to log in. The web application has two variations: one where the user does not log in, in which case the user only receives the output; and a second wherein the user logs in and has access to the history of the data that they have previously scraped.
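A minimal sketch of such a per-page scraper is given below (illustrative only; the function name and URL are placeholders, not the authors' exact code):

import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url: str) -> list[str]:
    """Fetch one article page and return the text of every <p> tag."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()                 # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

paragraphs = scrape_paragraphs("https://example.com/telecom-article")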
C. Storing Data
In order to store data and keep track of user history, we use MongoDB. We create a database and a collection to store data and track user activity. There are then collections, linked to that collection through a relational ID, that are devoted to storing each user's history of scraped data. Every time a user logs in, SQLAlchemy is used to authenticate them, and this is followed by the use of the PyMongo library. Once the authentication is done, the user's activity is tracked.
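The following PyMongo sketch illustrates this storage scheme (the database, collection, and field names are assumptions made for illustration, not taken from the paper):

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # local MongoDB instance
db = client["telecom_corpus"]                       # hypothetical database name

def store_result(user_id: int, url: str, paragraphs: list[str]) -> None:
    """Insert one scraping result and link it to the user's history."""
    result_id = db.results.insert_one({
        "url": url,
        "paragraphs": paragraphs,
        "scraped_at": datetime.now(timezone.utc),
    }).inserted_id
    # The relational ID ties the history entry back to the stored result.
    db.history.insert_one({"user_id": user_id, "result_id": result_id})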

V. RESULTS AND DISCUSSIONS

The web crawler was created using Python and its libraries, namely Selenium and BeautifulSoup, and was integrated into a web application that can be used to scrape data from four different news websites. Fig. 1 shows the home page of the web application. SQLAlchemy was used to create the user authentication feature: the user can authenticate themselves using their username and password. Flask features such as flash messages and login management were used in tandem with SQLAlchemy to make it all work together and to integrate authentication into the web application. Fig. 2 shows the login page of the web application.

Fig. 1. Home page of the web application.

Fig. 2. Login page of the web application.
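A condensed sketch of how such a login feature can be wired together is shown below (the model fields, route name, and use of the Flask-Login extension are assumptions for illustration, not the authors' exact code):

from flask import Flask, flash, redirect, request
from flask_login import LoginManager, UserMixin, login_user
from flask_sqlalchemy import SQLAlchemy
from werkzeug.security import check_password_hash

app = Flask(__name__)
app.config["SECRET_KEY"] = "change-me"
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///users.db"
db = SQLAlchemy(app)
login_manager = LoginManager(app)

class User(UserMixin, db.Model):
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True, nullable=False)
    password_hash = db.Column(db.String(256), nullable=False)

@login_manager.user_loader
def load_user(user_id):
    return db.session.get(User, int(user_id))

@app.route("/login", methods=["POST"])
def login():
    user = User.query.filter_by(username=request.form["username"]).first()
    if user and check_password_hash(user.password_hash, request.form["password"]):
        login_user(user)                    # Flask-Login starts the session
        return redirect("/")
    flash("Invalid username or password.")  # reported back via flash messages
    return redirect("/login")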

The web crawler was set up so that it could scrape data in two ways. One way is to scrape all the articles mentioned on the first page of the home section of the website; the other allows the user to specify a date range within which the articles were published, as shown in Fig. 3. If the user is not logged in, the web application does not keep track of user data; as a result, the user will get the results of the scraping task they have run but will not be able to retrieve this data later. When the user logs in, the user's activity is tracked, so the results of all their scraping runs can be retrieved later. Fig. 4 shows the history page.

Fig. 3. Results page of the web application.

Fig. 4. History page of the web application.
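The date-range check can be sketched as follows (illustrative; the date format and tag layout differ per site, so this assumes an ISO date inside a <time> tag):

from datetime import date
from bs4 import BeautifulSoup

def article_in_range(article_html: str, start: date, end: date) -> bool:
    """Keep only articles whose publication date falls inside [start, end]."""
    soup = BeautifulSoup(article_html, "html.parser")
    time_tag = soup.find("time")
    if time_tag is None or not time_tag.get("datetime"):
        return False                        # no parsable date; skip the article
    published = date.fromisoformat(time_tag["datetime"][:10])
    return start <= published <= end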

The user can choose between these two scraping methods; when the user clicks the submit button, a routing program is used to route the request and run the correct program, as each website has a different structure and requires a different program devoted to it. If the user is logged in, the results of the scraping run are stored in a local MongoDB database before they are displayed. Fig. 5 shows the database where the results of the web scraping are stored. The user can later retrieve the data from the local MongoDB database directly, or they can access it via the web application's user history.

Fig. 5. The MongoDB database where the results of the web scraping tasks are stored.
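The routing idea can be sketched as a simple dispatch table (site names, form fields, and scraper functions below are placeholders, not taken from the paper):

from flask import Flask, request

app = Flask(__name__)

def scrape_site_a(mode, start=None, end=None):
    return []   # site A's own parsing logic would go here

def scrape_site_b(mode, start=None, end=None):
    return []   # each site has a different HTML structure, hence its own function

SCRAPERS = {"site_a": scrape_site_a, "site_b": scrape_site_b}

@app.route("/scrape", methods=["POST"])
def scrape():
    site = request.form["site"]     # which website to scrape
    mode = request.form["mode"]     # "front_page" or "date_range"
    paragraphs = SCRAPERS[site](mode,
                                start=request.form.get("start_date"),
                                end=request.form.get("end_date"))
    return {"paragraphs": paragraphs}   # Flask serialises the dict as JSON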

VI. CONCLUSION

The application was deployed using a hosting service. This paper explains the process of creating and deploying a web scraping application that users can use to scrape paragraphs from news articles on telecom news websites and store them in a MongoDB database. This application will help people keep track of the various developments that happen in the Telecom sector, saving them a lot of the time otherwise spent tracking these developments manually. The application can be used along with visualization tools such as Tableau to visualize industry trends using the keywords in the scraped paragraphs. It can also be run on a schedule with the help of a tool such as the Windows Task Scheduler. The application's storage of data in MongoDB makes it easier to perform several operations on this data due to the versatility of MongoDB.

VII. REFERENCES

[1] V. Singrodia, A. Mitra, and S. Paul, "A review on web scraping and its applications," 2019 International Conference on Computer Communication and Informatics (ICCCI), Jan. 23-25, 2019.
[2] M. A. Khder, "Web Scraping or Web Crawling: State of Art, Techniques, Approaches and Application," International Journal of Advances in Soft Computing & Its Applications, vol. 13, pp. 145-168, 2021.
[3] D. M. Thomas and S. Mathur, "Data Analysis by Web Scraping using Python," 2019 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, pp. 450-454, 2019.
[4] S. De S. Sirisuriya, "A comparative study on web scraping," Proceedings of the 8th International Research Conference, KDU, Nov. 2015.
[5] S. S. Chauhan, N. T. Beigh, D. Mukherjee, and D. Mallick, "Development and Optimization of Highly Piezoelectric BTO/PVDF-TrFE Nanocomposite Film for Energy Harvesting Application," 2022 IEEE International Conference on Emerging Electronics (ICEE), Bangalore, India, pp. 1-5, May 2022.
[6] A. Kumar, S. S. Chauhan, K. Krishnamoorthy, D. D. Bharathi, A. Ravikumar, and M. R. Rahman, "Flexible and Cost Effective CNT Coated Cotton Fabric for CO Gas Sensing Application," Sensors and Actuators A, vol. 23, p. 114640, Sep. 2023.
[7] S. S. Chauhan, N. T. Beigh, D. Mukherjee, and D. Mallick, "Flexible Vibrational Energy Harvester with Groove Design Using BTO/PVDF-TrFE Film for Higher Output," IEEE Applied Sensing Conference (APSCON), Bengaluru, India, pp. 1-3, Jun. 2023.
[8] S. Sharma, S. S. Chauhan, and K. N. Chappanda, "Enhanced Electrochemical Properties of Fe Doped TiO2 Nanotubes for Supercapacitor Application," 2024 IEEE 4th International Conference on VLSI Systems, Architecture, Technology and Applications (VLSI SATA), pp. 1-4, 2024.
[9] S. S. Chauhan, M. Kaur, N. Batra, M. Singh, and B. Mitra, "Fabrication of Flexible Metglas Based Magnetic Energy Harvester using PVDF-TrFE/KNN Composite Film," 2022 IEEE International Conference on Emerging Electronics (ICEE), Bangalore, India, pp. 1-4, May 2022.
[10] N. Agrawal, R. Ananthanarayanan, R. Gupta, S. Joshi, R. Krishnapuram, and S. Negi, "The eShopmonitor: A comprehensive data extraction tool for monitoring Web sites," IBM Journal of Research and Development, vol. 48, no. 5.6, pp. 679-692, Sep. 2004.
[11] Fatmasari, Y. N. Kunang, and S. D. Purnamasari, "Web Scraping Techniques to Collect Weather Data in South Sumatera," 2018 International Conference on Electrical Engineering and Computer Science (ICECOS), Pangkal, Indonesia, pp. 385-390, 2018.
[12] H. Lan et al., "COVID-Scraper: An Open-Source Toolset for Automatically Scraping and Processing Global Multi-Scale Spatiotemporal COVID-19 Records," IEEE Access, vol. 9, pp. 84783-84798, 2021.
[13] B. Vela, J. M. Cavero, P. Cáceres, and C. E. Cuesta, "A Semi-Automatic Data–Scraping Method for the Public Transport Domain," IEEE Access, vol. 7, pp. 105627-105637, 2019.
[14] Y. Wang, "Research on Python Crawler Search System Based on Computer Big Data," 2023 IEEE 3rd International Conference on Power, Electronics and Computer Applications (ICPECA), Shenyang, China, pp. 1179-1183, 2023.
[15] E. Gupta, S. Sural, J. Vaidya, and V. Atluri, "Enabling Attribute-Based Access Control in NoSQL Databases," IEEE Transactions on Emerging Topics in Computing, vol. 11, no. 1, pp. 208-223, Jan.-Mar. 2023.

