Volume 8, Issue 9, September 2023 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
Text Summarization using NLP
Kesanapalli Lakshmi Priyanka¹, Dr. Vinay V Hedge²
Computer Science and Engineering Department,
Rashtreeya Vidyalaya College of Engineering, Mysore Road, Bangalore
Abstract:- Text summarization is an area within natural language processing (NLP) that revolves
around producing brief and condensed summaries from extended passages of text. The exponential
growth of digital content has given rise to a vast quantity of textual information, creating a
challenge for individuals to stay abreast of this information overload. While previous
advancements in text summarization have marked significant achievements, there remains a void in
adequately addressing the specific requirements for summarizing general textual content. The
project's goal is to create a summarization system that generates concise summaries by using
creative methods in natural language processing and sophisticated machine learning algorithms.
This system will help fill the informational divide between lengthy texts and condensed
summaries. The primary objective is to create an efficient and effective summarization model
that enables text summarization and speech synthesis by integrating the gTTS library, enabling
the transformation of summaries into speech. We strived to empower users by developing
customization options that grant them the ability to define summary attributes such as length
and style, culminating in personalized and precisely tailored summarization outputs.

This project integrates web scraping, frequency-based text summarization, and a user-friendly
Flask interface, enhancing content consumption and accessibility. Users input URLs, initiating
the processes of extracting essential text, generating concise summaries, and estimating
reading time. Web scraping extracts data for text summarization, which uses frequency-based
scoring to produce succinct summaries. The Flask interface lets users input URLs, triggering
content extraction and summarization. The project finds applications in content understanding,
gTTS-enabled accessibility, and efficient information management. Beneficial for education, it
aids in quick comprehension of complex subjects, supported by estimated reading time. Merging
technology with user-centric design, it enriches learning, research, and content assimilation
across domains. It is a tool for academia, professionals, and personal exploration.

The project's integrated approach of web scraping, frequency-based text summarization, and
Flask interface yields efficient content extraction, concise summaries, and estimated reading
time. Quantitative analysis involves comparing the generated summaries' quality, coherence,
and accuracy with existing literature.

Keywords:- NLP, gTTS library, Flask, TextRank algorithm, URLs

I. INTRODUCTION

Text summarization is a subset of natural language processing (NLP) that concentrates on
producing brief summaries from lengthy texts. In simpler words, it's about condensing big
pieces of text into shorter, meaningful summaries. Summarization involves creating a shorter
rendition of a document/URL, maintaining its crucial details. Some approaches involve
extracting content directly from the original, while others craft entirely fresh text to
capture the essence. This stands out as a demanding task in the realm of natural language
processing (NLP), requiring a diverse set of skills. These include comprehending lengthy text
segments and producing logical and connected text that effectively encapsulates the key
subjects within a web link. There are different techniques to extract information from raw
text data and use it for a summarization model; overall, they can be categorized as extractive
and abstractive. Extractive methods select the most important sentences within a text, so the
resulting summary is just a subset of the full text.

Extractive Summarization: In this method, the system identifies and extracts the most relevant
sentences or phrases from the original text to form the summary. The extracted sentences are
usually presented as they appear in the original document.

Abstractive Summarization: Abstractive summarization involves generating new sentences that
may not exist in the source text to convey the key points in a more concise manner. This
approach often requires natural language generation techniques and can be more challenging but
potentially more informative.

II. SCOPE

The scope of this project revolves around exploring the application of Natural Language
Processing (NLP) techniques for text summarization. In today's era of information overload,
the ability to efficiently extract key insights from vast volumes of textual data is of
paramount importance. Text summarization using NLP offers a promising solution to this
challenge by automatically generating concise and coherent summaries from lengthy documents,
articles, and reports. The project aims to use word frequency and sentence scores to decide
which words/sentences should be included in the summarized text. It uses the TextRank
algorithm. The significance of this project lies in its potential to revolutionize content
processing, enabling users to quickly grasp the main points of a document and make informed
decisions in various domains, based on the given URL link content. As NLP research continues
to evolve and new technologies emerge, the future scope of text
summarization is likely to encompass a wide range of innovative solutions that will
revolutionize information processing and user experience across various domains.

III. RELATED WORK

In this section, we delve into a comprehensive evaluation of existing literature that pertains
to systems for summarizing text and the associated methods. While the literature pool is
extensive, our focus centers on the latest, pertinent research and review papers. We've
classified the approaches taken by researchers based on the foundational concepts they employ
in their methods. Our attention is directed towards the specific techniques adopted, the
platforms utilized for testing these methods, and the resulting system performances. Moreover,
we underline the assertions made by the researchers. To cap it off, we distill the insights
obtained from the research papers we've studied and analyzed. This section culminates by
shedding light on the driving force behind addressing the identified issue.

S. R. Rahimi [1] explores the connection between text mining and text summarization. They
delve into the key factors and stages necessary for effective summarization. Additionally,
they assess various summarization techniques to identify the most effective one. This study
provides insights into crucial aspects and methods applicable for generating summaries.

Rahul [2] undertakes an evaluation of diverse summarization techniques that adopt structural
and semantic approaches for condensing text content. Their analysis encompasses both
individual documents and collections of documents from various datasets. They delve into
widely employed methods for text summarization, such as machine learning, reinforcement
learning, neural networks, fuzzy logic, and sequence-to-sequence modeling. The study probes
into the accuracy scores attained by these methods and discusses optimization algorithms. A
noteworthy observation is their exploration of the effectiveness of employing multiple methods
in contrast to relying solely on a single method. In essence, this study sheds light on the
prevalent techniques used for text summarization, showcasing how their accuracy varies across
identical datasets. Additionally, it delves into optimization strategies and emphasizes the
enhanced outcomes achieved through the integration of multiple techniques.

A. Mishra [3] introduces a system designed to handle information storage, retrieval, and
management. The system evaluates the significance of words within a document collection or
corpus. It employs the TF-IDF (Term Frequency-Inverse Document Frequency) approach for
information retrieval, where TF and IDF weight values are calculated to determine word
importance. The TF-IDF weight is then used to retrieve and rank queries based on their
relevance in the retrieval and ranking process. This enhances the precision of results
displayed to users. However, in the word-count aspect, direct similarity computation might
slow down the process for extensive vocabularies. Leveraging the TF-IDF algorithm can
significantly improve information retrieval systems, leading to enhanced query success rates.

N. S. Shirwandkar [4] presents a system that operates on input in .txt format, essentially
text documents. The text undergoes preprocessing, involving steps like segmenting sentences,
breaking it down into individual words (tokenization), and removing stop words and
punctuation. To evaluate sentence importance, their attributes are computed. The input is then
processed by two techniques: Restricted Boltzmann Machine (RBM) and fuzzy logic. This results
in two distinct summaries, each subjected to a sequence of operations. However, it's worth
noting that fuzzy logic's reliance on human knowledge poses a limitation. Ultimately, the
combined utilization of these methods yields a more effective summary than using the RBM
method alone.

P. Janjanam [5] explores machine learning within the context of recent advancements in text
summarization. The study focuses on modern techniques that leverage evolutionary processes and
graph-based methods for representing features. These techniques extend from selecting relevant
sentences to generating summaries. The intent behind this research is to contribute to the
development of powerful applications in Natural Language Processing (NLP). The paper
encompasses a range of subjects, including text representation, feature identification,
graph-centric summarization, and optimization-based summarization. Moreover, it examines the
effectiveness of different summarization methods through a comparison of their ROUGE scores, a
metric used for evaluating summary quality.

Yanxia [6] presents an innovative system that builds upon the traditional TF-IDF approach.
This is accomplished by introducing the concept of a coefficient of weight for part-of-speech
tags and accounting for word position weight. This enhancement is achieved through the
utilization of the TF-IDF-NL algorithm, which boasts the capability to extract characteristic
words, thus enhancing retrieval performance. What sets this algorithm apart is its capacity to
improve clustering effectiveness, providing a more accurate reflection of the distinctive
attributes of the text. The approach operates under the assumption that the counts of
different words offer independent indications of similarity. Consequently, this system
significantly enhances clustering of characteristic words, leading to a more precise
representation of textual attributes. This advancement holds the potential to be highly
advantageous.

Fadi [7] introduces an innovative system aimed at enhancing the effectiveness of the TF-IDF
technique, a common method used for information retrieval. The system introduces three
distinct techniques for weighting within the TF-IDF framework: Dispersed Words Weight
Augmentation, Title Weight Augmentation, and First Ranked Words Weight Augmentation. These
techniques collectively contribute to more accurately fetching relevant documents within the
system. This leads to a notable improvement in the information retrieval
process. Notably, this system doesn't rely on semantic similarities between words. By
employing these novel weighting approaches, which exhibit superior performance and elevated
recall values, the system becomes more adept at retrieving pertinent documents. As the
document's word weights increase through these new techniques, the efficiency of retrieval is
significantly heightened.

G. V. Madhuri Chandu [8] has introduced a system that centers around a model capable of
providing concise and non-repetitive answers to a variety of queries related to educational
institutions. The model incorporates multiple techniques from the realm of natural language
processing (NLP) to condense text and furnish pertinent outcomes. Moreover, it integrates
hybrid similarity measures and clustering algorithms. The process encompasses data collection,
pre-processing, tokenization, and retrieving sentences that align with the user's query from
the original text. The system comprises two main phases: (i) retrieving sentences pertinent to
the query, and (ii) removing redundant sentences. This approach effectively summarizes content
based on user queries. The model demonstrates efficient performance across many scenarios.
However, there are instances where it struggles to retrieve critical sentences. A further
limitation is that it occasionally extracts less important or irrelevant sentences from
websites.

IV. EXISTING SYSTEM

Extractive summarization entails picking and merging crucial sentences or phrases directly
from the source text to create a summary. This technique relies on arranging sentences by
their significance and relevance to the overall content. Noteworthy approaches for extractive
summarization encompass BERT-based models and methods centered around graphs. Abstractive
summarization, on the other hand, involves crafting summaries by rephrasing and rewording the
original content. This requires a more profound comprehension of the input text and often
involves techniques for generating natural language. Notable methods for abstractive
summarization encompass pre-trained Transformers and Pointer-Generator Networks.

V. PROPOSED SYSTEM

The proposed system automates the process of summarizing text from a given URL link, employing
an extractive summarization strategy. The initial step involves computing a score called term
frequency-inverse document frequency (TF-IDF) for each word within sentences. Subsequently,
the TextRank algorithm is employed to rank sentences using their TF-IDF scores. The final
summary is then formed by selecting the sentences with the highest ranks.

The TF-IDF score gauges a word's significance within a document, derived from the product of
its term frequency (how often it appears in the document) and its inverse document frequency
(which grows as fewer documents in the corpus contain the word). The TextRank algorithm, on
the other hand, evaluates sentence importance. It constructs a sentence graph, with
connections between sentences weighted by their similarity. The sentences are ranked by their
PageRank scores, reflecting their importance within the graph. This proposed system is a
potent means to automatically summarize text from URL links. It generates concise and
meaningful summaries efficiently. The combination of TF-IDF, a well-established word
importance measure, and the TextRank algorithm, a robust method for ranking sentences,
contributes to its effectiveness.

VI. METHODOLOGY

A. TF-IDF APPROACH

Pre-processing Step: This step prepares the document or group of interconnected documents for
the summarization system. The input is transformed into a collection of individual words or
phrases extracted from the document. This pre-processing phase encompasses stages rooted in
Natural Language Processing (NLP), including breaking the text into sentences, breaking
sentences into individual tokens (tokenization), removing insignificant words (stop words),
and reducing words to their base form (stemming). Once pre-processing is completed, each
token's word frequency and inverse document frequency are computed.

Sentence segmentation: Sentence segmentation involves breaking down a sequence of written
language into individual sentences. This process is vital for understanding and analyzing the
text's structure. In languages like English, punctuation marks, especially the full stop
(period), serve as indicators of the boundaries between sentences. These punctuation symbols
play a significant role in determining where one sentence ends and the next one begins.

Tokenization: Tokenization involves breaking down sentences into individual discrete units,
known as tokens. These tokens can encompass various elements, such as distinct words, key
terms, phrases, and identifiers. Tokenization is a pivotal step that facilitates further
processing and understanding of the text's content. It involves the separation of tokens using
spaces, punctuation marks, or line breaks. Depending on specific requirements, the separation
might be straightforward or more complex due to the interplay of whitespace and punctuation
marks. This process effectively dissects text into manageable components for analysis and
manipulation.

Stop Word Removal: Stop words are common words that appear frequently in a language but often
carry limited semantic meaning. The process of deleting stop words involves eliminating words
like "the," "to," "are," "is," and so on. These words are considered less informative when it
comes to understanding the context or essence of the text. By removing stop words, the goal is
to enhance the effectiveness of specific tasks, such as supporting phrase-based searches. This
practice streamlines the text and focuses on the more substantive and significant terms.
Stemming: Stemming involves the simplification of words by reducing them to their core or root
form. This process aims to capture the fundamental essence of words, even if they appear in
different forms due to variations in tense, pluralization, or other linguistic changes. By
condensing words to their base form, stemming enhances the scope of Natural Language
Processing (NLP) tools. This allows these tools to effectively recognize words regardless of
their grammatical variations, thus improving their performance in language analysis tasks.

Lemmatization: Lemmatization involves grouping the various forms of a word together so that
they are treated as a unified entity during analysis. This process reduces the different
grammatical variations of a word to its dictionary form (lemma), allowing for more effective
language understanding and processing.

TF-IDF Score: TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a
statistical metric that gauges the significance of a word within a document. This score is
calculated by multiplying the word's term frequency (how often it appears in the document) by
its inverse document frequency (how rare it is across the entire corpus of documents). The
term frequency counts how many times a word occurs in a specific document, while the inverse
document frequency quantifies the rarity of the word across all documents in the collection.

In simpler terms, TF-IDF reveals how crucial a word is within a specific document compared to
its presence in other documents. Words with high TF-IDF scores are those that appear
frequently in the document but are rare across other documents in the corpus. This metric
helps identify words that carry substantial significance and uniqueness within the context of
a particular document.

B. TEXTRANK ALGORITHM

The TextRank algorithm is a graph-based method that is employed here for summarizing text from
web documents accessed via URL links. The system first extracts the text from the URL link.
Once the text has been extracted, it is passed to the TextRank algorithm, which creates a
graph of the sentences and ranks them by their importance. The top-ranked sentences are then
used to create a summary.

C. TEXT-TO-SPEECH CONVERSION

The condensed text can be transformed into spoken words using the gTTS Python library, which
interfaces with Google Translate's text-to-speech API to convert text into speech. gTTS
enhances the user experience in text summarization by providing the additional capability of
converting the summary text into speech. This integration of gTTS with text summarization
allows users to not only read the summarized content but also listen to it.
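As a minimal sketch of the text-to-speech step, assuming the gTTS package is installed and the machine has network access, a summary could be saved as an MP3 file as shown below. The helper names `normalize_for_speech` and `summary_to_mp3` and the default output filename are illustrative and are not taken from the project.

```python
import re


def normalize_for_speech(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not end up in the spoken text."""
    return re.sub(r"\s+", " ", text).strip()


def summary_to_mp3(summary: str, out_path: str = "summary.mp3", lang: str = "en") -> str:
    """Convert a summary to speech with gTTS and save it as an MP3 file.

    Assumes gTTS is installed and the machine is online, because gTTS
    sends the text to a remote service.
    """
    from gtts import gTTS  # imported lazily so the helper above works without gTTS

    cleaned = normalize_for_speech(summary)
    if not cleaned:
        raise ValueError("summary is empty")
    gTTS(text=cleaned, lang=lang).save(out_path)
    return out_path
```

In a web interface, the returned path could then be handed to an audio player on the results page.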
Fig. 1: Architecture Diagram
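The scoring and ranking stages described in Sections VI-A and VI-B can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: each sentence is treated as a "document" when computing IDF, a smoothed IDF is used so that shared words keep a non-zero weight, sentence similarity is taken to be cosine similarity of TF-IDF vectors, and the damping factor 0.85 is the conventional PageRank default. All function names and the stop-word list are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace (a rough heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOP_WORDS]


def tfidf_vectors(token_lists):
    """One TF-IDF vector (dict) per sentence; each sentence acts as a document."""
    n = len(token_lists)
    df = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption) keeps weights strictly positive.
            w: (count / len(tokens)) * (math.log((1 + n) / (1 + df[w])) + 1)
            for w, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def textrank(similarity, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over a sentence similarity matrix."""
    n = len(similarity)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in similarity]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                similarity[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, num_sentences=2):
    """Return the top-ranked sentences in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = tfidf_vectors([tokenize(s) for s in sentences])
    n = len(sentences)
    similarity = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = textrank(similarity)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest of the text receives no incoming weight in the graph, so it keeps only the baseline score and is the first to be dropped from the summary.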
Fig. 2: Pre-processing flowchart
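The pre-processing stages listed in Section VI-A (sentence segmentation, tokenization, stop-word removal, stemming) can be sketched as a small chain of functions. The stop-word list and the suffix-stripping stemmer below are deliberately simplified assumptions for illustration; a production system would more likely rely on an established stemmer such as the Porter stemmer shipped with NLTK.

```python
import re

# Small illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "to", "are", "is", "a", "an", "and", "of", "in"}

# Suffixes removed by the toy stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def segment_sentences(text):
    """Split text after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix, keeping at least three leading letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    return [
        [naive_stem(t) for t in remove_stop_words(tokenize(s))]
        for s in segment_sentences(text)
    ]


print(preprocess("The cats are chasing mice. Dogs barked loudly!"))
# → [['cat', 'chas', 'mice'], ['dog', 'bark', 'loudly']]
```

The output shows why stemming is only an approximation: "chasing" becomes "chas" rather than "chase", which is acceptable for frequency counting but not for display.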
VII. RESULTS AND ANALYSIS

The study's findings indicated that the text rank algorithm successfully condensed text
content from URL-linked documents into summaries. The summaries generated by the algorithm
were both accurate and informative. The algorithm demonstrated its capability to pinpoint the
most crucial sentences within the text documents, resulting in concise and focused summaries.
The analysis of the results showed that the algorithm was able to achieve good results for a
variety of text documents, including news articles, scientific papers, and legal documents.
The algorithm was also able to generate summaries that were of different lengths, depending on
the needs of the user.
Fig. 3: Summarization Challenges Visualization
Fig. 4: Summarized Text with Audio Playback
VIII. CONCLUSION

In the digital age, the relentless surge in information accessibility via the World Wide Web
has heightened the demand for advanced text summarization techniques. This project addresses
this need by efficiently distilling lengthy texts into concise, coherent summaries. It goes
further by integrating text-to-speech functionality, enhancing accessibility and user
experience. This synthesis of text summarization and speech technology represents a valuable
contribution to managing the ever-expanding pool of digital information, catering to the needs
of modern users who seek quick access to relevant content. It marks a step forward in the
evolution of text summarization, aiding individuals in extracting valuable insights from
overwhelming textual data.

IX. FUTURE ENHANCEMENTS

Building upon the foundations laid by the papers on automatic text summarization surveyed
above, a roadmap for future research emerges, with several compelling avenues to explore.
Enhancing user-centric models by integrating sentiment analysis and context-awareness could
lead to summaries finely tuned to individual preferences. Exploring semantic enrichment
techniques and ontological integration may unlock summaries with deeper contextual
understanding. Adaptable reinforcement learning strategies can mitigate exposure bias and
elevate the consistency of abstractive summarization. Venturing into multi-modal
summarization, where text and visual content converge, could redefine summarization paradigms.

REFERENCES

[1] S. R. Rahimi, A. T. Mozhdehi and M. Abdolahi, "An overview on extractive text
summarization", 2017 IEEE 4th International Conference on Knowledge-Based Engineering and
Innovation (KBEI), pp. 0054-0062, 2017.
[2] S. Adhikari Rahul and Monika, "NLP based Machine Learning Approaches for Text
Summarization", 2020 Fourth International Conference on Computing Methodologies and
Communication (ICCMC), pp. 535-538, 2020.
[3] A. Mishra and S. Vishwakarma, "Analysis of TF-IDF Model and its Variant for Document
Retrieval", 2015 International Conference on Computational Intelligence and Communication
Networks (CICN), pp. 772-776, 2015.
[4] N. S. Shirwandkar and S. Kulkarni, "Extractive Text Summarization Using Deep Learning",
2018 Fourth International Conference on Computing Communication Control and Automation
(ICCUBEA), pp. 1-5, 2018.
[5] P. Janjanam and C. P. Reddy, "Text Summarization: An Essential Study", 2019 International
Conference on Computational Intelligence in Data Science (ICCIDS), pp. 1-6, 2019.
[6] Yanxia Yang, "Research and Realization of Internet Public Opinion Analysis Based on
Improved TF-IDF Algorithm", 2017 16th International Symposium on Distributed Computing and
Applications to Business, Engineering and Science (DCABES), 2017.
[7] Fadi F. Yamout and R. Lakkis, "Improved TFIDF weighting techniques in document Retrieval",
2018 Thirteenth International Conference on Digital Information Management (ICDIM), pp. 69-73,
2018.
[8] G. V. Madhuri Chandu, A. Premkumar, S. S. K and N. Sampath, "Extractive Approach For Query
Based Text Summarization", 2019 International Conference on Issues and Challenges in
Intelligent Computing Techniques (ICICT), pp. 1-5, 2019.
IJISRT23SEP011 www.ijisrt.com 351