NLP Mini Project
NLP Mini Project
On
Submitted By
Satishkumar M 121A1095
Senthilmurugan S 121A1096
Sarvesh Shinde 121A1102
Manish Thevar 121A1115
Subject Incharge
Prof. Suvarna Chaure
Nerul – 400706
CERTIFICATE
This is to certify that the requirements for the project report entitled ‘Marathi News Summarizer’
have been successfully completed by the following students:
Name Roll No.
Satishkumar M 121A1095
Senthilmurugan S 121A1096
Sarvesh Shinde 121A1102
Manish Thevar 121A1115
in partial fulfillment of the course Natural Language Processing in Computer Engineering (VIII
SEM) of Mumbai University in the Department of Computer Engineering, SIES Graduate School
of Technology, Nerul – 400706, during the Academic Year 2024 – 25.
Subject Incharge
Nerul – 400706
PROJECT APPROVAL
Examiners:
1.
2.
Subject Incharge:
Suvarna Chaure
Date:
Place: Nerul
Nerul – 400706
DECLARATION
We declare that this written submission for Natural Language Processing mini project entitled
“Marathi News Summarizer” represents our ideas in our own words and where others' ideas or
words have been included, we have adequately cited and referenced the original sources. We also
declare that we have adhered to all principles of academic honesty and integrity and have not
misrepresented or fabricated or falsified any ideas / data / fact / source in our submission. We
understand that any violation of the above will cause disciplinary action by the institute and also
evoke penal action from the sources which have not been properly cited or from whom prior
permission has not been taken when needed.
Project Group Members:
Satishkumar M(121A1095)
Senthilmurugan S(121A1096)
Sarvesh Shinde(121A1102)
Manish Thevar(121A1115)
Date:
Place:
Table of Contents
Abstract................................................................................................................................ i
List of Figures...................................................................................................................... ii
1. Introduction................................................................................................................. 1
1.1 Fundamentals................................................................................................... 1
1.2 Objectives........................................................................................................ 1
2. Literature Survey......................................................................................................... 3
3. Project Implementation………………………………………………………............ 5
3.1 Overview…………………….......................................................................... 5
5.1 Summary………............................................................................................... 15
References............................................................................................................................. 16
Acknowledgement……………………………………………………………………......... 17
Abstract
Online news Articles have made it easier to read news on the go. However, summaries allow people to
easily understand the contents of news article quickly. Providing brief summary or an excerpt allows
readers to know what to expect from the news articles. With the headline, it would be convenient for the
reader to read a brief about the article too. A summary is a type of text document that is produced by a
much larger text or sometimes multiple texts, that conveys important information in the original text in a
much shorter form. The goal of automatic news summarization is to provide an excerpt from the news
article that covers the news article in fewer sentences and words. Languages such as Marathi don’t have
resources such as automated summarization. Marathi news articles don’t come up with a summary. To
generate a summary for news articles, we use a series of steps namely, Preprocessing and Processing
Phase. Preprocessing phase represents Marathi text in a structured way. In Processing phase, different
features determine the important sentences. Weights of features are determined to create a sentence score
which allow us to sort sentences.
i
List of Figures
ii
List of Tables
iii
Chapter 1
Introduction
1.1 Fundamentals
Marathi Language is a Devnagari language spoken predominantly by around 83 million people in
Maharashtra and some parts of Goa, India. People living in rural parts of these states communicatte
only via Marathi. People here get their news from Marathi news sources and articles. However,
summaries for these articles need to be created under human supervision. Brief summaries allow
users to read news in a brief format. News Summarization deals with reducing the source text into
a short version preserving its contents and overall meaning. Extraction of useful sentences from a
news article is a very time consuming task. This is where Natural Language Processing helps us to
create news summaries very quickly. Summarization occurs in two phases: Preprocessing phase,
Processing phase.
The Figure 1.1 shows us the flow of our project. The Marathi news is preprocessed and then processed
by out text rank algorithm to give us a summary as an output.
1.2 Objectives
The project intends to generate a summary from news document and present it to the user.
1
• Creation of Stop Word List
• Creating a text rank algorithm that gives us highest ranked sentences for our summary
The introduction is given in Chapter 1. It describes the fundamental terms used in this project. It
motivates to study and understand the different techniques used in this work. This chapter also
presents the outline of the objective of the report.
The Chapter 2 describes the review of the relevant various techniques in the literature systems. It
describes the pros and cons of each technique.
The Chapter 3 presents the Theory and proposed work. It describes the major approaches used in
this work.
2
Chapter 2
Literature Survey
2.1 Introduction
The Different Techniques and Algorithms are needed to be used for our Project. Conducting a
research allows us to examine various techniques that would be necessary for our project. We have
studied various papers to find out preprocessing techniques that are efficient. We also studied
algorithms that are neccesary for extraction of sentences.
To handle the linguisic part of our project, a research suggests to separate words out into Morphemes
and handle processing of language. The difficulty depends greatly on the complexity of the
morphology of the language considered. Applying inflectional rules allows us to categorise the
sentences to their root words.
Tokenization and Stemming of the document can be handled by using lexical analyser. These separate
tokens are called as lexicons. Stemming removes the suffixes of the word and gives us a root word.
Marathi is a morphologically rich language. Variants of words in Marathi are usually formed by
adding suffixes to root word.
After preprocessing, the research moves on to summarization of the news article. Graph based
Ranking algorithms are essentially a way of deciding the importance of a vertex within a graph based
on global information recursively drawn on the entire graph. The higher the importance of the node,
the higher the score.
3
2.3 Literature Summary
Author and
A Survey of Automatic
Text Summarization for Vivat Giri, Dr. M.M. Important Keyphrases, Numeric Values to be considered
1.
Different Regional Math , Dr. U.P. for sentence score
Language in India Kulkarni, 2016
Discovering Suffixes: A
Mudassar Majgaonker,
3. case study for Marathi Tanveer Siddiqui, Finding and removal of suffixes using various rules.
Language 2010
4
Chapter 3
Implementation Details
3.1 Overview
The project works mainly on two phases. Preprocessing phase and Processing phase. Preprocessing
phase consists of Tokenization, Stemming, Removal of Stop words. Processing phase consists of
implementation of a Textrank algorithm for marathi language. The inspiration behind this project is
to generate sentence scoring for Marathi language based on position of word in the sentence.
This method fails in practical use as words like stopwords, non stemmed words and special
characters etc. will almost always be the result of this method, as they occur more frequently. But
of course, they are not the right way to summarize our document.
The news article is first given to our tokenization algorithm. Tokenization removes any unnecessary
special characters such as brackets, commas, fullstops etc. Some news articles may have English or
5
any other Language words in it. We don’t process those words as those words are very few in
quantity and also have significant meaning to them. After Tokenization of sentences, we remove
any stopwords from the tokens. Stopwords are type of words which don’t add meaning to the
sentence. They can be ignored safely without sacrificing the meaning of the sentence.
After removing these stopwords, we send the sentence to our stemmer. Stemming is the process of
reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words known
as a lemma. Stemming is important in natural language understanding (NLU) and natural language
processing (NLP). Stemming is a part of linguistic studies in morphology and artificial intelligence
(AI) information retrieval and extraction.
After Stemming we get a list of root words that are important to form a sentence. Now we move
onto our TextRank Algorithm. For finding the top keyphrases, we plan to use Graph-based
Algorithms, where each node is considered as a word from our stemmed dictionary. Graph-based
ranking algorithm is a way of deciding on importance of a vertex within a graph, by taking into
account global information recursively computed from an entire graph, rather than relying on local
vertex-specific information. To give each node a weight we find the maximum word score of the
node. This is done by
score= 1∕(π*√(pos*(1-pos))
Where,
pos=word_position/len(document)
The maximum score for that word is used as the weight of the node. After node creation, we add
edges to our graph. The addition of a window helps us to find the nearest words to the given words.
Consider a window size of ‘n’. A word will form an edge with the next ‘n’ edges and the weight of
an edge would be average of the weight of the nodes. After forming a graph, we apply pagerank. It
implements a random surfer model where a node with probability ‘d’ is selected and jumps to
completely new probability ‘1-d’. We get the top keyphrases after applying page rank. These are
then added to each sentence score where they are present. The sentences with the top score are
returned as they create a summary.
6
3.2 Implementation Details
For our implementation of our system, we needed a corpus. We used webscrapping for our corpus.
We scrapped Marathi News Websites and cleaned them for their tags. Usually on these websites
there is an excerpt given which is the first few lines of the articles. We plan to replace that we an
extractive summary that would be generated from our corpus.
3.2.1 Methodology
The article is then given for Tokenization. Each sentence is tokenized into words and we keep a track
of these sentence. Now for each sentence we remove stopwords.
Stopwords are removed from a stop word list. We create a stop word list from our corpus. We
tokenize our corpus into words. Now we find the most frequent words from the tokenize wordlist.
These words are appended to our stopword list. Stopwords are type of words which don’t add
meaning to the sentence. They can be ignored safely without sacrificing the meaning of the sentence.
After removing stopwords, we pass these words to a stemmer. Stemming is done by removing
suffixes from the words and then checking whether the words are in a wordlist. This wordlist contains
words and relates them to their stems and is provided by IIT Bombay. For removal of suffixes, we
check them with the suffixes normally used in Marathi language.
For Example, Words such as आनंदाददतीशयाने is first removed with suffix ने and then is recursively
stripped of suffixes, शया, ती, दद and finally is removed of a Maatra ाााा to get a root word as
आनंद.
7
Table 3.1. Stemming Suffixes
8
Table 3.1 shows different suffixes which we remove for finding the stemmed word.
After stemming the word, the remains of every sentence are stored in a dictionary and are used for
further processing. Meanwhile, we store sentences as a separate entity. We also find the total
number of stemmed words in our dictionary. To find the weight of each node, we find the word
score using the formula. For every word, we find the maximum of these node weight and assign it
to the node hash.
Graph-based ranking algorithms are essentially a way of deciding the importance of a vertex within
a graph, based on global information recursively drawn from the entire graph. The basic idea im
plemented by a graph-based ranking model is that of "voting" or "recommendation". When one
vertex links to another one, it is basically casting a vote for that other vertex. The higher the number
of votes that are cast for a vertex, the higher the importance of the vertex. Moreover, the importance
of the vertex casting the vote determines how important the vote itself is, and this information is also
taken into ac count by the ranking model. Hence, the score associated with a vertex is determined
based on the votes that are cast for it, and the score of the vertices cast ing these votes.
Formally, let G (V, E) be a directed graph with the set of vertices V and set of edges E, where E is a
subset of V x V. For a given vertex Vi, let In(V) be the set of vertices that point to it (predecessors),
and let Out (V) be the set of vertices that vertex V; points to (successors).
The task of a keyword extraction application is to au tomatically identify in a text a set of terms that
bes describe the document. Such keywords may consti tute useful entries for building an automatic
index fo a document collection, can be used to classify a text or may serve as a concise summary for
a given doc ument. Moreover, a system for automatic identifica tion of important terms in a text can
be used for the problem of terminology extraction, and construction of domain-specific dictionaries.
The simplest possible approach is perhaps to use a frequency criterion to select the "important" key
words in a document. However, this method was generally found to lead to poor results, and conse
quently other methods were explored. The state-of the-art in this area is currently represented by
super vised learning methods, where a system is trained to recognize keywords in a text, based on
lexical and syntactic features.
For adding edges to the graph, we use windows. Each window is of certain size and are the total
number of edges added at a time. For example, Suppose the stemmed dictionary contains words:
9
ददद्यार्थ्ाााा ं , ऑनलााईन, दशक्षणाता न, अखाे र
Now for word ददद्यार्थ्ाााा ं , we add an edge from ददद्यार्थ्ाााां to ऑनलााईन, दशक्षणाता न, अखाे र. These
edges would have weight as an average of weights of both nodes.
After applying Textrank, we get keyphrases for the document. For every article with ‘n’ stemmed
words, we take the minimum of 0.1*n or 7*log(n). This allows normalization of total stemmed words.
If these keyphrases are present in the stemmed dictionary, we add the weight to sentence score. The
sentence with the highest scores are returned as an output.
1. collections: Collections in Python are containers that are used to store collections of data, for
example, list, dict, set, tuple etc. These are built-in collections.
2. io: Python io module allows us to manage the file-related input and output operations. The
advantage of using the IO module is that the classes and functions available allows us to extend
the functionality to enable writing to the Unicode data.
3. re: This module provides regular expression matching operations similar to those found in
Perl.
4. nltk: NLTK is a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet,
along with a suite of text processing libraries for classification, tokenization, stemming, tagging,
parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries. We would be
using sent_tokenize.
5. pandas: Pandas is a software library written for the Python programming language for data
manipulation and analysis.
10
6. numpy: NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices
7. Matplotlib.pyplot: Matplotlib is a plotting library for creating static, animated, and interactive
visualizations in Python.
8. operator: The operator module exports a set of efficient functions corresponding to the
intrinsic operators of Python. For example, operator.add(x, y) is equivalent to the expression
x+y. Many function names are those used for special methods, without the double underscores.
9. math: Python has a built-in module that you can use for mathematical tasks.
10. NetworkX is a Python package for the creation, manipulation, and study of the structure,
dynamics, and functions of complex networks. By definition, a Graph is a collection of nodes
(vertices) along with identified pairs of nodes (called edges, links, etc). In NetworkX, nodes can
be any hashable object e.g., a text string, an image, an XML object, another Graph, a customized
node object, etc.
11
Chapter 4
Project Inputs and Outputs
शाळााां ना 1 माे पासा न सा ट्ट्याा जाहीर, 14 जानपासा न नदााे शाक्षदणक दाराथष सा रू होणार.
माां बई : दशक्षकााां ना, ददद्यार्थ्ाााा ं ना ऑनलााईन दशक्षणातान अखाे र ददशथााां ती दाे ण्याात आलाी आहाे .
राजथयाातील प्रार्दमक, माध्यदमक, उच्च माध्यदमक शाळांना 1 मे पासन सट्ट्या जाहीर करण्यात आल्या आहे. 1 मे ते
सा रू होणार आहाे . राजथयाातील दशकथ षक संघटनाकड न दााारं दााार के ल्या जात असलेल्या
मागणीनंतर शाळांना सट्टी जाहीर करण्यात आल्या आहेत.दददरथ ााष तील तापमान ददचाारात घाे ता 28
काोरोनााच्या पाश्ाथषराथा मीदार मागाील दाराथषर्ाापासा न शाळाा जरी अनाे क ददकाणाी बां दच
रादहल्याा तरी ऑनलााइन दागषसारू होताे . ऑनलााइन दशक्षण दाे ऊन अभ्यासक्रम यादाराथाी
दशक्षकााां नी पाणष काे लाा. मात्र, काोरोनाची रााज्यााताील सदथ यसथर्र्ती नाकताच इयत्ताा पदहलाी ताे
दआदााी ददद्यार्थ्ाााा ं ना परीक्षाा रद्द करून दा तथ ााां ना पास करून पा ढील दागााष त प्रमाोट काे लाे
तर इयत्ताा 9 ताे 11 दागााषच्या परीक्षाा साद्धाा रद्द काे ल्याा आहाे . तरीसाद्धाा, अनाे क शाळााां मधथयाे
तथ ाामा ळाे शा क्षदणक दाराथष समाापथ ताी करून सा ट्टाी जाहीर करादााी अशाा आशयााचाे पत्र
दशक्षण मां तथराी दाराथााष गाायकदाााड, मा ख्यमां तथराी उद्धदा दाााकराे यााां ना पादादाण्यात
12
याे त होताे . यााची दखल घाे त, आज शाळााां ना साट्टाी दाे ण्याबााबत दनणषय जाहीर करण्यात आला.
नदााे शाक्षदणक दाराथष जरी 14 जानपासान सारू होणार असलाे तरी प्रत्कथष शाळाा सारू
होण्याबााबतचा दनणषय काालााां तरानाे काोरोना पररसथर्र्तीचा आढादााा घाे ऊन तथ ााबााबत कळददलाे
13
सध्याच्याा काोरोना सथर्र्तीचथ याा पाश्ाथषराथा मीदार दागष ऑनलााइनच होणार आहाे त. र्र्ाादनक प्रशाासनानाे
After Tokenization, Stopword removal and stemming the output is stored in stemmed dictionary
14
We create a graph for the nodes using networkx package.
These are the top sentence scores. Hence, the summary for given article is:
शाळााां नाा 1 माे पाासा न सा ट्ट्याा जााहाीर, 14 जा नपासान नदााे शा क्षदणक दाराथष सा रू हाोणार.
राज्यातील प्रार्दमक, माध्यदमक, उच्च माध्यदमक शाळां ना 1 मे पासन सट्ट्या जाहीर करण्यात आल्या आहे .1 मे ते
13 जन दरम्यान शाळां ना सट्टी राहणार आहे .
राजथयााताील दशक्षक सां घटनाकडा न दााारां दााार काे ल्या जात असलाे ल्या मागणाीनां तर शााळााां नाा साट्टाी
जाहीर करण्यात आल्या आहाे त. याचाी दखल घाे त, आज शााळााां नाा सा ट्टाी दाे ण्याबााबत दनणषय जाहाीर करण्यात
आला.
नदााे शा क्षदणक दाराथष जराी 14 जानपाासा न सा रू हाोणार असलाे तरी प्रतथकथष शाळा सा रू हाोण्याबााबतचाा
दनणषय काालााां तरानाे काोराोनाा पररस्थर्तीचा आढादााा घेऊन तथााबाबत कळददले जाणार आहे .
16
Chapter 5
Summary and Future Scope
5.1 Summary
Our main aim for designing the project was to summarize news articles so that readers can easily read
summaries of articles without need of a human to write those summaries. For this task, we opted for
TextRank Algorithm. The product of the textrank algorithm will give us the keyphrases from the
news articles. This gives us the most important stemmed words. The textrank algorithm is initialised
on position of word after stemming. We perform sentence scoring to find the highest scoring
sentences. For Preprocessing, we create our stopwords list, tokenized the article and cleaned the text
to give us stemmed dictionary. This stemmed dictionary is then passed onto Textrank algorithm that
gives us a short summary of the article.
2. To make our news summarization abstractive, we could add NER for finding Noun phrase and
Verb Phrases.
3. Further, we can use Sentence Feature Extraction to generate sentences that are not in the news
articles.
17
References
[1] V. Giri, Dr. M. Math, Dr. U. Kulkarni, “ A Survey of Automatic Text Summarization System
for Different Regional Language in India”, Bonfring International Journal of Software
Engineering and Soft Computing(BIJSESC), October 2016.
[2] S. Shimpikar, S. Govilkar, “ Abstractive Text Sumarization using Rich Semantic Graph for
Marathi Sentences”, Journal of Applied Science and Computations(JASC), December 2018.
[3] M. Majgaonkar, T. Siddiqui, “Discovering Suffixes, A case study for Marathi Language”,
International Journal on Computer Science and Engineering(IJCSE), 2010.
[4] P. Gawade, D. Madhavi, J. Gaikwad, S. Jadhav, “Natural Language Processing Tasks for
Marathi Language”, International Journal of Engineering Research and Development(IJERD),
April 2013
[5] R. Mihalcea, P. Tarau, “TextRank: Bringing Order to Texts”, 2004 Conference on Empirical
Methods in Natural Language Processing, July 2004.
18
Acknowledgement
We wish to express our deep sense of gratitude and thanks to our Professor and Guide for this Mini Project, Prof.
Suvarna Chaure for her guidance, help and useful suggestions, which helped in completing our project work in
time.
We also thank our H.O.D., Dr. Aparna Bannore for her support in completing the project. We are also extremely
grateful for her guidance provided whenever required.
We also thank to our Principal Dr. Atul Kemkar, for extending his support to carry out this project. Also, we would
like to thank the entire faculty of the Computer Department for their valuable ideas and timely assistance in this
project, last but not the least, we would like to thank our teaching and non-teaching staff members of our college
for their support, in facilitating timely completion of this project.
Project Team
Satishkumar M(121A1095)
Senthilmurugan S(121A1096)
Sarvesh Shinde(121A1102)
Manish Thevar(121A1115)
19