NLP Lab Manual

The document provides information about the Department of Computer Engineering including its vision, mission, quality policy, program educational objectives, program outcomes, and program specific outcomes. It also includes an index listing experiments and their corresponding page numbers for the Natural Language Processing lab manual. The objectives are to impart quality education in computer science, develop students' skills to solve problems, and prepare them for careers or further education in a competitive environment. The outcomes cover applying engineering knowledge, designing solutions, investigating problems, using tools, considering ethics and society, and engaging in lifelong learning.

Uploaded by

It's Just Yash
Department of Computer Engineering

Lab Manual
Final Year Semester-VIII
Subject: Natural Language Processing

Even Semester


Institutional Vision, Mission and Quality Policy

Our Vision
To foster and permeate higher and quality education with value added engineering, technology programs,
providing all facilities in terms of technology and platforms for all round development with societal
awareness and nurture the youth with international competencies and exemplary level of employability
even under a highly competitive environment so that they are innovative, adaptable and capable of handling
problems faced by our country and the world at large.

Our Mission
The Institution is committed to mobilize the resources and equip itself with men and materials of
excellence thereby ensuring that the Institution becomes a pivotal center of service to industry, academia,
and society with the latest technology. RAIT engages different platforms such as technology enhancing
Student Technical Societies, Cultural platforms, Sports excellence centers, Entrepreneurial Development
Center and Societal Interaction Cell. To develop the college to become an autonomous Institution &
deemed university at the earliest with facilities for advanced research and development programs on par
with international standards. To invite international and reputed national Institutions and Universities to
collaborate with our institution on the issues of common interest of teaching and learning sophistication.

Our Quality Policy

It is our earnest endeavour to produce high quality engineering professionals who are
innovative and inspiring, thought and action leaders, competent to solve problems faced
by society, nation and world at large by striving towards very high standards in learning,
teaching and training methodology.

Our Motto: If it is not of quality, it is NOT RAIT!

Dr. Vijay D. Patil
President, RAES

Departmental Vision, Mission

Vision
To impart higher and quality education in computer science with value added engineering and technology
programs to prepare technically sound, ethically strong engineers with social awareness. To extend the
facilities, to meet the fast changing requirements and nurture the youths with international competencies
and exemplary level of employability and research under highly competitive environments.

Mission
• To mobilize the resources and equip the institution with men and materials of excellence to
provide knowledge and develop technologies in the thrust areas of computer science and
Engineering.

• To provide the diverse platforms of sports, technical, co-curricular and extracurricular activities
for the overall development of student with ethical attitude.

• To prepare the students to sustain the impact of computer education for social needs
encompassing industry, educational institutions and public service.

• To collaborate with IITs, reputed universities and industries for the technical and overall
upliftment of students for continuing learning and entrepreneurship.


Departmental Program Educational Objectives (PEOs)

1. Learn and Integrate


To provide Computer Engineering students with a strong foundation in the mathematical,
scientific and engineering fundamentals necessary to formulate, solve and analyze engineering
problems and to prepare them for graduate studies.

2. Think and Create


To develop an ability to analyze the requirements of the software and hardware, understand the
technical specifications, create a model, design, implement and verify a computing system to
meet specified requirements while considering real-world constraints to solve real world
problems.

3. Broad Base
To provide broad education necessary to understand the science of computer engineering and the
impact of it in a global and social context.

4. Techno-leader
To provide exposure to emerging cutting edge technologies, adequate training & opportunities to
work as teams on multidisciplinary projects with effective communication skills and leadership
qualities.

5. Practice citizenship
To provide knowledge of professional and ethical responsibility and to contribute to society
through active engagement with professional societies, schools, civic organizations or other
community activities.

6. Clarify Purpose and Perspective


To provide strong in-depth education through electives and to promote student awareness on the
life-long learning to adapt to innovation and change, and to be successful in their professional
work or graduate studies.


Departmental Program Outcomes (POs)


PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.

PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.

PO3: Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.

PO4: Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.

PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.

PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.

PO7: Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.

PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.

PO9: Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.

PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.

PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.

PO12: Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.


Program Specific Outcomes (PSOs)

PSO1: To build competencies towards problem solving with an ability to understand, identify,
analyze and design the problem, implement and validate the solution including both hardware
and software.

PSO2: To build appreciation and knowledge acquiring of current computer techniques with an
ability to use skills and tools necessary for computing practice.

PSO3: To be able to match the industry requirements in the area of computer science and
engineering. To equip skills to adopt and imbibe new technologies.


Index
Sr. No. Contents
1. List of Experiments
2. Experiment Plan and Course Outcomes
3. Mapping of Course Outcomes – Program Outcomes and Program Specific Outcomes
4. Study and Evaluation Scheme
5. Experiment No. 1
6. Experiment No. 2
7. Experiment No. 3
8. Experiment No. 4
9. Experiment No. 5
10. Experiment No. 6
11. Experiment No. 7
12. Experiment No. 8
13. Experiment No. 9
14. Mini Project


List of Experiments
Sr. No. Experiment Name

1. Study of R and basic commands to access text data.

2. Perform Preprocessing (Tokenization, Script Validation, Stop word removal and stemming) of Text.

3. Perform Morphological Analysis.

4. Implement N-Gram model (bigram extraction).

5. Implement Part-of-Speech (POS) Tagging.

6. Implement chunking to extract Noun Phrases.

7. Identify semantic relationships between the words from given text (Use WordNet Dictionary).

8. Study on reference resolution algorithm.

9. Perform Named Entity Recognition (NER) on given text.

10. Mini Project: One real life Natural Language application to be implemented (Use
standard Datasets available on the web).


Course Objective, Course Outcome & Experiment Plan
Course Objective:

1 To understand natural language processing and to learn how to apply basic
algorithms in this field.
2 To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
3 To design and implement applications based on natural language processing.
4 To implement various language models.
5 To design systems that use NLP techniques.

Course Outcomes:

CO1 Understand fundamental concept of natural language text processing and implement
basic commands of text processing using R tool.

CO2 Apply morphological analysis on natural language text.

CO3 Analyze syntactic structure of a language using syntax analysis techniques.

CO4 Identify semantic relationships between words using semantic analysis.

CO5 Apply the discourse analysis techniques to resolve the references.

CO6 Identify Named Entities which are important in information extraction applications.


Experiment Plan:
Module No. | Week No. | Experiment Name | Course Outcome | Weightage
1. | W1 | Study of R and basic commands to access text data. | CO1 | 10
2. | W2 | Perform Preprocessing (Tokenization, Script Validation, Stop word removal and stemming) of Text. | CO2 | 03
3. | W3 | Perform Morphological Analysis. | CO2 | 03
4. | W4 | Implement N-Gram model (bigram extraction). | CO2 | 04
5. | W5 | Implement Part-of-Speech (POS) Tagging. | CO3 | 10
6. | W6 | Implement chunking to extract Noun Phrases. | CO4 | 05
7. | W7 | Identify semantic relationships between the words from given text (Use WordNet Dictionary). | CO4 | 05
8. | W8 | Study on reference resolution algorithm. | CO5 | 10
9. | W9 | Perform Named Entity Recognition (NER) on given text. | CO6 | 10
10. | W10 | Mini Project: One real life Natural Language application to be implemented (Use standard Datasets available on the web). | |


CO-PO & PSO Mapping

Mapping of Course outcomes with Program Outcomes:

Subject Weight: PRACTICAL 80%

Course Outcomes | Contribution to Program outcomes (1 2 3 4 5 6 7 8 9 10 11 12)
CO1 Understand fundamental concept of natural language text processing and implement basic commands of text processing using R tool. | 1 1 1 3 1 2 1
CO2 Apply morphological analysis on natural language text. | 1 2 2 2 1 1 1
CO3 Analyze syntactic structure of a language using syntax analysis techniques. | 1 2 2 2 1 1 1
CO4 Identify semantic relationships between words using semantic analysis. | 1 2 2 2 1 1 1
CO5 Apply the discourse analysis techniques to resolve the references. | 1 2 2 2 1 1 1
CO6 Identify Named Entities which are important in information extraction applications. | 1 2 2 2 1 1 1


Mapping of Course outcomes with Program Specific Outcomes:

Contribution to Program Specific outcomes:

Course Outcomes | PSO1 | PSO2 | PSO3
CO1 Understand fundamental concept of natural language text processing and implement basic commands of text processing using R tool. | 3 | 3 | 2
CO2 Apply morphological analysis on natural language text. | 3 | 3 | 3
CO3 Analyze syntactic structure of a language using syntax analysis techniques. | 3 | 3 | 3
CO4 Identify semantic relationships between words using semantic analysis. | 3 | 3 | 3
CO5 Apply the discourse analysis techniques to resolve the references. | 3 | 3 | 3
CO6 Identify Named Entities which are important in information extraction applications. | 2 | 2 | 3


Study and Evaluation Scheme

Course Code | Course Name | Teaching Scheme (Theory, Practical, Tutorial) | Credits Assigned (Theory, Practical, Tutorial, Total)
CSL804 | Computational Lab-II (NLP) | 02 02 02 -- 02

Course Code | Course Name | Examination Scheme: Term Work | Practical & Oral | Total
CSL804 | Computational Lab-II (NLP) | 50 | 25 | 75

• Term Work: 50 Marks

The distribution of marks for term work shall be as follows:

• Lab Experimental Work & Mini project: 50 Marks
◦ Lab experiments: 10 Marks
◦ Assignments: 10 Marks
◦ Attendance (Theory & Practical): 05 Marks
◦ Mini project (report preparation and implementation, along with a survey of research
papers related to the selected topic): 25 Marks

• Practical & Oral: 25 Marks

Practical Examination is to be conducted based on departmental level optional courses

Note: Although it is not mandatory, the experiments can be conducted with reference to
any Indian regional language.


Computational Lab-II
Experiment No. : 1

Study of R tool to access text data.


Experiment No.1
1. Aim: Study of R tool and basic commands to access text data.
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.

Outcomes: Understand fundamental concept of natural language text processing and
implement basic commands of text processing using R tool.

3. Hardware / Software Required : R Studio

4. Theory:
R is one of the most popular open-source software environments for data science. It is
used for analyzing data and constructing graphics, and it is also a popular tool for
processing natural language text.
R has a wide variety of useful packages. The most commonly used packages for text
analysis and natural language processing are:
• OpenNLP
Apache OpenNLP is widely used for most common tasks in NLP, such as
tokenization, POS tagging, named entity recognition (NER), chunking, parsing, and
so on. It provides functions for sentence annotation, word annotation, POS tag
annotation, and annotation parsing using the Apache OpenNLP chunking parser.
• tm Package
It is a text-mining framework built around the corpus, the main structure of the tm package
for storing and manipulating text documents.
• koRpus Package
The koRpus package is a set of tools for analyzing texts. It includes functions for
automatic language detection, hyphenation and several indices of lexical diversity. Basic
import functions for language corpora are also provided, to enable frequency
analysis and measures like tf-idf.
• SnowballC
An R interface to the C libstemmer library that implements Porter’s word stemming
algorithm for collapsing words to a common root to aid comparison of vocabulary.

• wordcloud Package
This package provides functionality to create word clouds, visualize
differences and similarities between documents, and avoid over-plotting in scatter
plots with text.

Installing RStudio in Windows

Download link
https://rstudio.com/products/rstudio/download/

Basic commands in R programming

1. Creating a row vector
> x=c(1,2,3,4,5,6)
> x
[1] 1 2 3 4 5 6

2. Summation
> sum(x)
[1] 21

3. Mean
> mean(x)
[1] 3.5

4. Median
> median(x)
[1] 3.5

5. Square root
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490

6. Squaring
> x^2
[1] 1 4 9 16 25 36

7. Creating sequence
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10


8. Creating histogram of sequence


> x= c(2,4,4,6,6,5,5,7,3,7,3,8,9,7,9,6,4,3,4,4,6,2,2,1,2,4,6,6,8)
> hist(x)

9. Creating scatter plot


> x=c(1,3,5,7,9)
> y=c(2,4,6,8,10)
> plot(x,y)

10. Making a time plot


> plot(x,type="b")

11. Regular expressions
> grep("[a-zA-Z]",c(123,"abc"),value=TRUE)
[1] "abc"
> grep("(ab){2}",c("aabaa","abaaabab","abab"),value=TRUE)
[1] "abaaabab" "abab"
> grep("^(ab)",c("aabaa","abaaabab","bab"),value=TRUE)
[1] "abaaabab"
> grep("(ab)$",c("aabaa","abaaabab","bab"),value=TRUE)
[1] "abaaabab" "bab"

12. Getting information about functions
> ?par

13. Reading files


text <- readLines(file.choose())

14. History of all commands


> history()


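15. Creating Word Cloud

library(tm)            # Corpus, tm_map, stopwords, TermDocumentMatrix
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

text <- readLines(file.choose())                     # choose a plain-text file
docs <- Corpus(VectorSource(text))                   # one document per line
docs <- tm_map(docs, content_transformer(tolower))   # so capitalised stop words also match
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)                # term frequencies, highest first
d <- data.frame(word = names(v),freq=v)
head(d, 10)
set.seed(1234)                                       # reproducible layout
wordcloud(words = d$word, freq = d$freq, min.freq = 2,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))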

5. Conclusion:
R is a language used for statistical computations, data analysis and graphical
representation of data. After performing this experiment we are able to work with
packages of R and represent output in visual forms.


6. Viva Questions:
• What is Natural Language Processing?
• What is Text Analysis?
• What are features of R?

References:
1. Brian Neil Levine, An Introduction to R Programming
2. Niel J le Roux, Sugnet Lubbe, A step by step tutorial : An introduction into R application
and programming


Computational Lab-II
Experiment No. : 2

Perform Preprocessing of Text

Experiment No.2
1. Aim: Perform Pre-processing (Tokenization, Script Validation, Stop word removal and
stemming) of Text.

2. Objectives:
• To understand natural language processing and to learn how to apply basic algorithms in
this field.
• To implement various language Models.

Outcomes: Students will be able to apply morphological analysis on natural language text.

3. Hardware / Software Required : R Studio /Python

4. Theory:
Text pre-processing is traditionally an important step for natural
language processing (NLP) tasks. It transforms text into a more digestible form so that
algorithms can perform better. It simply means to bring your text into a form that
is predictable and analyzable for your task.

● Commonly the steps taken are:


a. Filtration and Script Validation
Special characters in documents degrade performance, so they need to be
removed.
Special characters such as “ ” ‘ ’ , . / ? [ ] { } : ; \ | ~ ! @ # $ % ^ & * ( ) _ - = + < >
are frequently used in many language scripts.
- These characters do not contribute to the final result.
- Each character of each token is compared against the UTF-8 character list of the source
script. If a match is found, the character is valid and kept; otherwise it is removed from the document.
E.g.
Input: शिवाजीची आई कोण होती? , who
Remove ‘,’ and ‘who’, which do not belong to the source language.
Output: शिवाजीची आई कोण होती?
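The validation step above can be sketched in Python as follows. This is a minimal illustration, not the manual's reference implementation: it assumes the source language is Marathi written in Devanagari (Unicode block U+0900 to U+097F), and the function names and the allow-list that keeps "?" are hypothetical choices made so that the output matches the example.

```python
# Assumption: the source script is Devanagari (Unicode block U+0900 to U+097F).
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
# Hypothetical allow-list: punctuation kept even though it is not Devanagari.
ALLOWED_PUNCTUATION = {"?"}


def is_valid_char(ch):
    """Return True if the character may stay in the document."""
    if ch.isspace() or ch in ALLOWED_PUNCTUATION:
        return True
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


def validate_script(text):
    """Drop every character outside the source script, then tidy whitespace."""
    kept = "".join(ch for ch in text if is_valid_char(ch))
    return " ".join(kept.split())


print(validate_script("शिवाजीची आई कोण होती? , who"))
```

Filtering character by character follows the description above; a token-level filter (dropping any token with no source-script character) would be an equally reasonable design.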

b. Tokenization
Tokenization is the process of converting a sentence into a sequence of words so that it can be
processed word by word. Here the white space character is used as the delimiter.

Tokens for the Marathi example above:
Token 0: शिवाजीची
Token 1: आई
Token 2: कोण
Token 3: होती

English Input: The car is red.
Output:
Token 0: The
Token 1: car
Token 2: is
Token 3: red
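A minimal Python sketch of whitespace tokenization is given below. It assumes that punctuation has already been handled by the filtration step; here ASCII punctuation is simply stripped with the standard library, which is an illustrative simplification (it would also turn "car's" into "cars"). The function name `tokenize` is hypothetical.

```python
import string


def tokenize(sentence):
    """Strip ASCII punctuation, then split the sentence on white space."""
    without_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    return without_punct.split()


for index, token in enumerate(tokenize("The car is red.")):
    print(f"Token {index}: {token}")
```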

c. Stop Word Removal


Stop words are the most frequently occurring words. They carry little content and slow down the
processing of documents. Such words include articles, prepositions and
other function words. Hence we remove the stop words to speed up
searching.
- A corpus of stop words is used to filter out the stop words from the documents.
Stop Words in
English: “the”, “a”, “an”, “in” …
Marathi: असं, किंवा, याने, ये, मध्ये, व, आणि, हे, तर…
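Stop word filtering can be sketched in a few lines of Python. The stop word set below is a small hypothetical list for illustration (the four English words from the text plus "is"); a real experiment would load a full stop word corpus for the language being processed.

```python
# Hypothetical mini stop word list; "is" is added here only for the demo.
STOP_WORDS = {"the", "a", "an", "in", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["The", "car", "is", "red"]))
```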

d. Stemming
Suffix stripping is done in this step. The most widely used method is a
stemmer, which uses a suffix list to remove suffixes from words. The stem is not
necessarily the linguistic root of the word.

Stemming in English :
car, cars, car's, cars' => car
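The suffix-list idea can be sketched as below. This is a toy suffix stripper that covers only the plural and possessive endings of the example; it is not the Porter algorithm, and the suffix list and the minimum stem length are assumptions made for illustration.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ["s'", "'s", "s"]


def stem(word, min_stem_length=3):
    """Remove the first matching suffix, unless the remaining stem would be too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for word in ["car", "cars", "car's", "cars'"]:
    print(word, "=>", stem(word))
```

The minimum stem length keeps short words such as "is" from being reduced to a single letter.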

5. Conclusion: We learned text pre-processing steps such as tokenization, script validation, stop word
removal and stemming using built-in libraries in Python/R.

6. Viva Questions:
• What is the importance of preprocessing in NLP tasks?
• How does the Porter stemmer work?
• What is the difference between stemming and lemmatization?

References:
1. Daniel Jurafsky, James H. Martin “Speech and Language Processing” Second
Edition, Prentice Hall, 2008.


Computational Lab-II
Experiment No. : 3

Perform Morphological Analysis


Experiment No.3
1. Aim: Perform Morphological Analysis.

2. Objectives:
• To understand natural language processing and to learn how to apply basic algorithms
in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.

Outcomes: Students will be able to apply morphological analysis on natural language
text.

3. Hardware / Software Required : R Studio / Python

4. Theory:
A word can be simple or complex. For example, the word 'cat' is simple because it
cannot be decomposed further into smaller parts. On the other hand, the word 'cats'
is complex, because it is made up of two parts: the root 'cat' and the plural suffix '-s'.

The analysis of a word into its root and affix(es) is called morphological analysis.
Identifying the root of a word is a necessary step in many natural language processing tasks. A
root word can have various forms. For example, the word 'play' in English has the
following forms: 'play', 'plays', 'played' and 'playing'. Hindi has a larger number of forms
for the word 'खेल' (khela), which is equivalent to 'play'.

The forms of 'खेल' (khela) are the following:

खेल(khela), खेला(khelaa), खेली(khelii), खेलूंगा(kheluungaa), खेलूंगी(kheluungii),
खेलेगा(khelegaa), खेलेगी(khelegii), खेलते(khelate), खेलती(khelatii), खेलने(khelane),
खेलकर(khelakar)

Thus morphological richness varies from
one language to another. Indian languages are generally morphologically rich,
and therefore morphological analysis of words becomes a very significant task for Indian
languages.

Morphology is of two types:

1. Inflectional morphology

Deals with word forms of a root, where there is no change in lexical category. For
example, 'played' is an inflection of the root word 'play'. Here, both 'played' and 'play' are
verbs.

2. Derivational morphology

Deals with word forms of a root, where there is a change in the lexical category. For
example, the word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness'
is a derived noun form of the adjective 'happy'.

Morphological Features:

All words will have their lexical category attested during morphological analysis. A noun
and pronoun can take suffixes of the following features: gender, number, person, case

For example, morphological analysis of a few words is given below:

Hindi

लड़के (ladake)

rt=लड़का(ladakaa), cat=n, gen=m, num=pl, case=dir

English
boy
rt=boy, cat=n, gen=m, num=sg
toys
rt=toy, cat=n, num=pl, per=3

'rt' stands for root.

'cat' stands for lexical category. The value of lexical category can be noun, verb, adjective,
pronoun, adverb, preposition.

'gen' stands for gender. The value of gender can be masculine or feminine.

'num' stands for number. The value of number can be singular (sg) or plural (pl).

'per' stands for person. The value of person can be 1, 2 or 3

The value of tense can be present, past or future. This feature is applicable to verbs.
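The feature notation above can be illustrated with a toy analyzer in Python. It is a sketch under strong assumptions: the three-word lexicon is hypothetical, only the regular '-s' and '-ed' suffixes are handled, gender is not modelled, and ambiguity (for example 'plays' as a plural noun) is ignored.

```python
# Hypothetical mini-lexicon: root -> lexical category ('n' noun, 'v' verb).
LEXICON = {"boy": "n", "toy": "n", "play": "v"}


def analyze(word):
    """Return a feature dictionary (rt, cat, num, per, tense) or None if unknown."""
    if word in LEXICON:
        features = {"rt": word, "cat": LEXICON[word]}
        if LEXICON[word] == "n":
            features["num"] = "sg"
        return features
    if word.endswith("s") and word[:-1] in LEXICON:
        root = word[:-1]
        features = {"rt": root, "cat": LEXICON[root]}
        if LEXICON[root] == "n":
            features["num"] = "pl"  # plural noun
        else:
            # Verb + '-s': third person singular, present tense.
            features.update({"num": "sg", "per": 3, "tense": "present"})
        return features
    if word.endswith("ed") and LEXICON.get(word[:-2]) == "v":
        return {"rt": word[:-2], "cat": "v", "tense": "past"}
    return None


for word in ["boy", "toys", "played"]:
    print(word, analyze(word))
```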

• Word Generation:

It is the inverse process, in which we generate the different forms of a word from a given root word.

Example in English :

‘play’→ 'plays', 'played' and 'playing'.
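Word generation for regular English verbs can be sketched as follows. The function name and the single spelling rule (dropping a final silent 'e') are illustrative assumptions; irregular verbs and consonant doubling are deliberately left out.

```python
def generate_forms(root):
    """Generate the -s, -ed and -ing forms of a regular English verb root."""
    if root.endswith("e"):
        # Assumed spelling rule: 'bake' -> 'baked', 'baking'.
        return [root + "s", root + "d", root[:-1] + "ing"]
    return [root + "s", root + "ed", root + "ing"]


print(generate_forms("play"))
```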

7. Conclusion:
We have learnt about the different morphological features of a word and understood that an
obvious use of morphology in NLP systems is to reduce the number of word forms that need to
be stored.

8. Viva Questions:
• What is the difference between word analysis and word generation?
• What are different morphological features?

References:
1. Daniel Jurafsky, James H. Martin “Speech and Language Processing” Second
Edition, Prentice Hall, 2008.
2. Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval,
Oxford University Press (2008).


Computational Lab-II
Experiment No. : 4

To implement N-Gram model


Experiment No.4
1. Aim: To implement N-Gram model (bi-gram extraction).
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To implement various language Models.

Outcomes: Students will understand probabilistic language model bigram used in natural
language processing tasks.

3. Hardware / Software Required : Python / R Studio / NLTK

4. Theory:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of
n items from a given sample of text or speech. The items can be phonemes, syllables, letters,
words or base pairs according to the application. The n-grams typically are collected from a text
or speech corpus.

Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a
"bigram" (or, less commonly, a "digram"); size 3 is a "trigram". English cardinal numbers are
sometimes used, e.g., "four-gram", "five-gram", and so on.

An n-gram model is a type of probabilistic language model for predicting the next item in such a
sequence in the form of a (n − 1)–order Markov model. N-gram models are now widely used in
probability, communication theory, computational linguistics (for instance, statistical natural
language processing), computational biology (for instance, biological sequence analysis), and
data compression. Two benefits of n-gram models (and algorithms that use them) are simplicity
and scalability – with larger n, a model can store more context with a well-understood space–
time tradeoff, enabling small experiments to scale up efficiently.

N-grams of texts are extensively used in text mining and natural language processing tasks. They
are basically a set of co-occurring words within a given window. When computing the n-grams,
you typically move one word forward (although you can move X words forward in more
advanced scenarios). For example, for the sentence "The cow jumps over the moon", if N=2
(known as bigrams), then the n-grams would be:

the cow

cow jumps

jumps over

over the

the moon

So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to
jumps->over, etc., essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be:

the cow jumps

cow jumps over

jumps over the

over the moon

So you have 4 n-grams in this case. When N=1, this is referred to as unigrams and this is
essentially the individual words in a sentence. When N=2, this is called bigrams and when N=3
this is called trigrams. When N>3 this is usually referred to as four grams or five grams and so
on.
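The sliding-window extraction described above can be sketched in Python with the standard library only. The function name `extract_ngrams` is hypothetical, and the sketch assumes simple whitespace tokenization with lower-casing.

```python
def extract_ngrams(sentence, n):
    """Slide a window of n words over the sentence, moving one word at a time."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The cow jumps over the moon"
print(extract_ngrams(sentence, 2))  # bigrams
print(extract_ngrams(sentence, 3))  # trigrams
```

With n=1 the same function returns the individual words, and for a sentence shorter than n it returns an empty list.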

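How many N-grams in a sentence?

If X = number of words in a given sentence K, the number of n-grams for sentence K would be:

N-grams(K) = X - (N - 1)

For the six-word sentence above, this gives 6 - (2 - 1) = 5 bigrams and 6 - (3 - 1) = 4 trigrams.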

Bigrams
By the chain rule, the probability of a sentence is
P(w(1), w(2), ..., w(n)) = P(w(1)) P(w(2)|w(1)) P(w(3)|w(1), w(2)) ... P(w(n)|w(1), ..., w(n-1))
We can avoid this very long calculation by approximating that the probability of a given word
depends only on the previous word. This assumption is called the Markov
assumption, and such a model is called a Markov model, here a bigram model. Bigrams can be generalized to
the n-gram, which looks at (n-1) words in the past. A bigram model is a first-order Markov model.
Therefore,
P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(1)) P(w(2)|w(1)) P(w(3)|w(2)) ... P(w(n)|w(n-1))
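A bigram model of this kind can be estimated from counts. The sketch below is a minimal, unsmoothed maximum-likelihood version, assuming whitespace tokenization; the `<s>` and `</s>` boundary markers, the function names and the three-sentence training corpus are illustrative assumptions, with P(w(1)) treated as P(w(1)|<s>).

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts


def bigram_probability(prev, word, unigram_counts, bigram_counts):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def sentence_probability(sentence, unigram_counts, bigram_counts):
    """Multiply the bigram probabilities along the sentence (Markov assumption)."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        probability *= bigram_probability(prev, word, unigram_counts, bigram_counts)
    return probability


# Hypothetical toy corpus used only to exercise the functions.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
unigrams, bigrams = train_bigram_model(corpus)
print(bigram_probability("i", "am", unigrams, bigrams))
print(sentence_probability("I am Sam", unigrams, bigrams))
```

Because no smoothing is applied, any sentence containing an unseen bigram gets probability 0; practical models add smoothing to avoid this.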

What are N-grams used for?

N-grams are used for a variety of different tasks. For example, when developing a
language model, n-grams are used to develop not just unigram models but also bigram
and trigram models. Google and Microsoft have developed web scale n-gram models that
can be used in a variety of tasks such as spelling correction, word breaking and text
summarization.

7. Conclusion: We learned language modelling and the n-gram, one of the most widely
used tools in language processing. Language models offer a way to assign a probability
to a sentence or other sequence of words, and to predict a word from preceding words.

8. Viva Questions:
• What are the advantages of using the N-gram model in text classification?
• Explain how bigrams work for text classification, with a suitable example.

References:
1. Daniel Jurafsky, James H. Martin “Speech and Language Processing” Second
Edition, Prentice Hall, 2008.
2. Jalaj Thanaki, “Python Natural Language Processing”, First Edition, Kindle edition.


Computational Lab-II
Experiment No. : 5
To implement Part-of-Speech (POS) Tagging


Experiment No.5
1. Aim: Implement Part-of-Speech (POS) Tagging.

2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.

• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.

Outcomes: Students will be able to analyze the syntactic structure of a language using syntax
analysis techniques.

3. Hardware / Software Required : Python/ R Studio / NLTK

4. Theory:

POS tagging or part-of-speech tagging is the procedure of assigning a grammatical category like
noun, verb, adjective etc. to a word. In this process both the lexical information and context play
an important role as the same lexical form can behave differently in a different context.

Part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging
or word-category disambiguation, is the process of marking up a word in a text (corpus) as
corresponding to a particular part of speech, based on both its definition and its context—i.e., its
relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified
form of this is commonly taught to school-age children, in the identification of words as nouns,
verbs, adjectives, adverbs, etc.

Part-of-speech tagging is harder than just having a list of words and their parts of speech,
because some words can represent more than one part of speech at different times, and because
some parts of speech are complex or unspoken. This is not rare—in natural languages (as
opposed to many artificial languages), a large percentage of word-forms are ambiguous. For
example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:

The sailor dogs the hatch.

Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more
common plural noun. Grammatical context is one way to determine this; semantic analysis can
also be used to infer that "sailor" and "hatch" implicate "dogs" as 1) in the nautical context and
2) an action applied to the object "hatch" (in this context, "dogs" is a nautical term meaning
"fastens (a watertight door) securely").


Types of POS taggers

POS-tagging algorithms fall into two distinctive groups:

1. Rule-Based POS Taggers

2. Stochastic POS Taggers

E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based
algorithms. Let us first look at a very brief overview of what rule-based tagging is all about.

Rule-Based Tagging

Automatic part of speech tagging is an area of natural language processing where statistical
techniques have been more successful than rule-based methods.

Typical rule-based approaches use contextual information to assign tags to unknown or
ambiguous words. Disambiguation is done by analyzing the linguistic features of the word, its
preceding word, its following word, and other aspects.

For example, if the preceding word is an article, then the word in question is likely to be a
noun. This information is coded in the form of rules.

Example of a rule:

If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as
an adjective.
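
The rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the use of `None` to mark an unknown word, and the Penn-style tag names (DT, NN, NNS, JJ) are assumptions, not part of any particular tagger.

```python
DETERMINERS = {"DT"}
NOUNS = {"NN", "NNS"}

def apply_adjective_rule(tagged):
    """Tag an unknown word (tag None) as JJ when it sits between a determiner and a noun."""
    result = list(tagged)
    for i in range(1, len(result) - 1):
        word, tag = result[i]
        prev_tag = result[i - 1][1]
        next_tag = result[i + 1][1]
        if tag is None and prev_tag in DETERMINERS and next_tag in NOUNS:
            result[i] = (word, "JJ")
    return result

tagged = [("the", "DT"), ("shiny", None), ("car", "NN"), ("stopped", "VBD")]
print(apply_adjective_rule(tagged))
```

A real rule-based tagger would hold many such rules and apply them in a fixed order; a single rule is shown here to keep the mechanism visible.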

Defining a set of rules manually is an extremely cumbersome process and is not scalable at all.
So we need some automatic way of doing this.

Brill’s tagger is a rule-based tagger that goes through the training data and finds the set
of tagging rules that best describe the data and minimize POS tagging errors. The most important
point to note about Brill’s tagger is that the rules are not hand-crafted, but are instead
learned from the corpus provided. The only feature engineering required is a set of rule
templates that the model can use to come up with new rules.

Stochastic Part-of-Speech Tagging

The term ‘stochastic tagger’ can refer to any number of different approaches to the problem of
POS tagging. Any model which somehow incorporates frequency or probability may be properly
labelled stochastic.

The simplest stochastic taggers disambiguate words based solely on the probability that a word
occurs with a particular tag. In other words, the tag encountered most frequently in the training
set with the word is the one assigned to an ambiguous instance of that word. The problem with
this approach is that while it may yield a valid tag for a given word, it can also yield
inadmissible sequences of tags.
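
A minimal sketch of this simplest stochastic tagger is given below. The tiny training list is hand-made for illustration, and the fallback tag NN for unseen words is an assumption.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set of (word, tag) pairs; purely illustrative.
training = [
    ("the", "DT"), ("park", "NN"), ("is", "VBZ"), ("big", "JJ"),
    ("park", "VB"), ("the", "DT"), ("car", "NN"),
    ("the", "DT"), ("park", "NN"), ("opens", "VBZ"),
]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Return the tag seen most often with `word`; fall back to `default` for unseen words."""
    if word not in tag_counts:
        return default
    return tag_counts[word].most_common(1)[0][0]

sentence = ["park", "the", "car"]
print([(w, most_frequent_tag(w)) for w in sentence])
```

Because "park" was seen more often as a noun, the imperative "park the car" is tagged with a noun in first position, which illustrates the weakness of ignoring context.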


An alternative to the word frequency approach is to calculate the probability of a given sequence
of tags occurring. This is sometimes referred to as the n-gram approach, referring to the fact that
the best tag for a given word is determined by the probability that it occurs with the n previous
tags. This approach makes much more sense than the one defined before, because it considers
the tags for individual words based on context.

The next level of complexity that can be introduced into a stochastic tagger combines the
previous two approaches, using both tag sequence probabilities and word frequency
measurements. This is known as the Hidden Markov Model (HMM).
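
Combining tag-sequence probabilities with word-given-tag probabilities can be sketched with the Viterbi algorithm, the standard decoding procedure for an HMM. All probabilities below are invented for illustration, and the three-tag set is an assumption chosen to keep the example small.

```python
# Toy HMM parameters; every number below is made up for illustration.
states = ["NN", "VB", "DT"]
start_p = {"NN": 0.3, "VB": 0.4, "DT": 0.3}
trans_p = {
    "NN": {"NN": 0.3, "VB": 0.5, "DT": 0.2},
    "VB": {"NN": 0.2, "VB": 0.1, "DT": 0.7},
    "DT": {"NN": 0.9, "VB": 0.05, "DT": 0.05},
}
emit_p = {
    "NN": {"park": 0.5, "car": 0.5},
    "VB": {"park": 1.0},
    "DT": {"the": 1.0},
}

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[i][s] = (probability of the best path ending in state s at position i, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(word, 0.0), p)
                for p in states
            )
            column[s] = (prob, prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["park", "the", "car"]))
print(viterbi(["the", "park"]))
```

With these numbers the same word "park" receives a verb tag at the start of "park the car" and a noun tag after "the", because the transition probabilities bring context into the decision.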

For example the word "Park" can have two different lexical categories based on the context.

1. The boy is playing in the park. ('Park' is Noun)

2. Park the car. ('Park' is Verb)

Assigning part of speech to words by hand is a common exercise one can find in an elementary
grammar class. But here we wish to build an automated tool which can assign the appropriate
part-of-speech tag to the words of a given sentence. One can think of creating handcrafted rules
by observing patterns in the language, but this would limit the system's performance to the
quality and number of patterns identified by the rule crafter. Thus, this approach is not
practically adopted for building POS Tagger. Instead, a large corpus annotated with correct POS
tags for each word is given to the computer and algorithms then learn the patterns automatically
from the data and store them in the form of a trained model. Later this model can be used to POS
tag new sentences.


POS tag list

5. Conclusion: We learned about part-of-speech (POS) tags, which are useful for building
parse trees, for building named entity recognizers (most named entities are nouns), and for
extracting relations between words.

6. Viva Questions:
• What are the different types of POS tagger?
• Explain with an example how POS tags are useful in building a
lemmatizer.

7. References:
1. Daniel Jurafsky, James H. Martin, “Speech and Language Processing”, Second
Edition, Prentice Hall, 2008.
2. Jalaj Thanaki, “Python Natural Language Processing”, First Edition, Kindle edition.


Computational Lab-II
Experiment No. : 6
To implement chunking to extract Noun
Phrases


Experiment No.6
1. Aim: To implement chunking to extract Noun Phrases.

2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.

Outcomes: Students will be able to analyze a sentence to identify its constituents (noun groups,
verbs, verb groups, etc.) and how they are correlated.

3. Hardware / Software Required : Python/R Studio /NLTK

Theory: Chunking is an analysis of a sentence which identifies the constituents (noun groups,
verbs, verb groups, etc.) which are correlated. These are non-overlapping regions of text.
Usually, each chunk contains a head, with the possible addition of some function words and
modifiers either before or after depending on languages. These are non-recursive in nature i.e. a
chunk cannot contain another chunk of the same category.

Some of the groups possible are:

1. Noun Group

2. Verb Group

For example, the sentence 'He reckons the current account deficit will narrow to only 1.8 billion
in September.' can be divided as follows:

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only 1.8
billion ] [PP in ] [NP September ]

Each chunk has an open boundary and close boundary that delimit the word groups as a minimal
non-recursive unit.


Chunking of text involves dividing a text into syntactically correlated words.

Eg: He ate an apple to satiate his hunger.

[NP He ] [VP ate ] [NP an apple] [VP to satiate] [NP his hunger]

Eg: दरवाज़ा खुल गया

[NP दरवाज़ा] [VP खुल गया]
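
Noun-phrase extraction from an already POS-tagged sentence can be sketched with a small pattern over the tag sequence. The pattern used here (an optional determiner or possessive, any adjectives, then one or more nouns, or a lone pronoun) is a simplifying assumption, and `np_chunks` is a hypothetical helper rather than a library function.

```python
import re

def np_chunks(tagged):
    """Return noun phrases matching (DT|PRP$)? JJ* NN+ or a lone pronoun."""
    def code(tag):
        if tag in ("DT", "PRP$"):
            return "D"
        if tag.startswith("JJ"):
            return "J"
        if tag.startswith("NN"):
            return "N"
        if tag == "PRP":
            return "P"
        return "x"

    codes = "".join(code(tag) for _word, tag in tagged)
    # One character per token, so regex match offsets are token indices.
    return [
        " ".join(word for word, _tag in tagged[m.start():m.end()])
        for m in re.finditer(r"D?J*N+|P", codes)
    ]

tagged = [
    ("He", "PRP"), ("ate", "VBD"), ("an", "DT"), ("apple", "NN"),
    ("to", "TO"), ("satiate", "VB"), ("his", "PRP$"), ("hunger", "NN"),
]
print(np_chunks(tagged))
```

Encoding each tag as one character lets an ordinary regular expression serve as the chunk grammar; chunkers in NLP toolkits apply the same idea with richer tag patterns.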

Chunk Types

The chunk types are based on the syntactic category part. Besides the head a chunk also contains
modifiers (like determiners, adjectives, postpositions in NPs).

The basic types of chunks in English are:

Chunk Type        Tag Name
1. Noun           NP
2. Verb           VP
3. Adverb         ADVP
4. Adjectival     ADJP
5. Prepositional  PP

The basic Chunk Tag Set for Indian Languages

Sl. No  Chunk Type             Tag Name
1       Noun Chunk             NP
2.1     Finite Verb Chunk      VGF
2.2     Non-finite Verb Chunk  VGNF
2.3     Verb Chunk (Gerund)    VGNN
3       Adjectival Chunk       JJP
4       Adverb Chunk           RBP


A. NP Noun Chunks

Noun Chunks will be given the tag NP and include non-recursive noun phrases and postposition
for Indian languages and preposition for English. Determiners, adjectives and other modifiers
will be part of the noun chunk.

Eg:

(इस/DEM किताब/NN में/PSP)NP

'this' 'book' 'in'

(in/IN the/DT big/JJ room/NN)NP

B. Verb Chunks

The verb chunks are marked as VP for English, however they would be of several types for
Indian languages. A verb group will include the main verb and its auxiliaries, if any.

For English:

I (will/MD be/VB loved/VBN)VP

The types of verb chunks and their tags are described below.

1. VGF Finite Verb Chunk

The auxiliaries in the verb group mark the finiteness of the verb at the chunk level. Thus, any
verb group which is finite will be tagged as VGF. For example,

Eg: मैंने घर पर खाना (खाया/VM)VGF

'I-erg' 'home' 'at' 'meal' 'ate'

2. VGNF Non-finite Verb Chunk

A non-finite verb chunk will be tagged as VGNF.

Eg: सेब (खाता/VM हुआ/VAUX)VGNF लड़का जा रहा है

'apple' 'eating' 'PROG' 'boy' 'go' 'PROG' 'is'

3. VGNN Gerunds

A verb chunk having a gerund will be annotated as VGNN.

Eg: शराब (पीना/VM)VGNN सेहत के लिए हानिकारक है

'liquor' 'drinking' 'health' 'for' 'harmful' 'is'

C. JJP/ADJP Adjectival Chunk


An adjectival chunk will be tagged as ADJP for English and JJP for Indian languages. This
chunk will consist of all adjectival chunks including the predicative adjectives.

Eg:

वह लड़की है (सुन्दर/JJ)JJP

The fruit is (ripe/JJ)ADJP

Note: Adjectives appearing before a noun will be grouped together within the noun chunk.

D. RBP/ADVP Adverb Chunk

This chunk will include all pure adverbial phrases.

Eg:

वह (धीरे-धीरे/RB)RBP चल रहा था

'he' 'slowly' 'walk' 'PROG' 'was'

He walks (slowly/RB)ADVP

E. PP Prepositional Chunk

This chunk type is present for only English and not for Indian languages. It consists of only the
preposition and not the NP argument.

Eg:

(with/IN)PP a pen

IOB prefixes

Each chunk has an open boundary and a close boundary that delimit the word group as a minimal
non-recursive unit. This can be formally expressed by using IOB prefixes: B-CHUNK for the
first word of the chunk, I-CHUNK for each other word in the chunk, and O for words outside
any chunk. Here is an example of the file format:

Tokens   POS   Chunk-Tags
He       PRP   B-NP
ate      VBD   B-VP
an       DT    B-NP
apple    NN    I-NP
to       TO    B-VP
satiate  VB    I-VP
his      PRP$  B-NP
hunger   NN    I-NP
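
Reading IOB tags back into bracketed chunks can be sketched as follows. The function name `iob_to_chunks` and the row layout (token, POS tag, chunk tag) are assumptions that mirror the table above.

```python
def iob_to_chunks(rows):
    """Group (token, pos, chunk_tag) rows into (chunk_type, [tokens]) pairs."""
    chunks = []
    open_chunk = False
    for token, _pos, tag in rows:
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk.
            chunks.append((tag[2:], [token]))
            open_chunk = True
        elif tag.startswith("I-") and open_chunk and chunks[-1][0] == tag[2:]:
            # An I- tag continues the chunk that is currently open.
            chunks[-1][1].append(token)
        else:
            # "O" (or a stray I- tag) lies outside any chunk.
            open_chunk = False
    return chunks

rows = [
    ("He", "PRP", "B-NP"), ("ate", "VBD", "B-VP"),
    ("an", "DT", "B-NP"), ("apple", "NN", "I-NP"),
    ("to", "TO", "B-VP"), ("satiate", "VB", "I-VP"),
    ("his", "PRP$", "B-NP"), ("hunger", "NN", "I-NP"),
]
for chunk_type, tokens in iob_to_chunks(rows):
    print(f"[{chunk_type} {' '.join(tokens)}]")
```

Filtering the result for the type NP gives the noun phrases that this experiment aims to extract.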

4. Conclusion: We learned chunking, which groups POS-tagged words into short phrases such as
noun phrases.

5. Viva Questions:
• Define chunking with an example.
• What is the difference between POS tagging and chunking in NLP?

References:
1. Daniel Jurafsky, James H. Martin, “Speech and Language Processing”, Second
Edition, Prentice Hall, 2008.
2. Jalaj Thanaki, “Python Natural Language Processing”, First Edition, Kindle edition.


Computational Lab-II
Experiment No. : 7
Identify semantic relationships between the
words from given text (Use WordNet Dictionary)


Experiment No.7
1. Aim: Using the WordNet dictionary, identify synonyms from the given text.

2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
• To design and implement applications based on natural language processing
• To implement various language Models.
• To design systems that use NLP techniques

Outcomes: Analyse and Implement linguistic foundations for semantic analysis.

3. Hardware / Software Required : Python/R

4. Theory:

Word Senses

Consider the two uses of the lemma bank mentioned above, meaning something like “financial
institution” and “sloping mound”, respectively:

-- Instead, a bank can hold the investments in a custodial account in the client’s name.
--But as agriculture burgeons on the east bank, the river will shrink even more.

A sense (or word sense) is a discrete representation of one aspect of the meaning of a word.
Loosely following lexicographic tradition, we represent each sense by placing a superscript on
the lemma as in bank¹ and bank².

Relationships between Senses

• synonyms - Words having the same meaning. When two senses of two different words (lemmas)
are identical, or nearly identical, we say the two senses are synonyms. E.g., couch/sofa,
vomit/throw up, filbert/hazelnut, car/automobile.

• hyponyms - One sense is a hyponym of another sense if the first sense is more specific, a
subclass. For example, car is a hyponym of vehicle; dog is a hyponym of animal, and mango
is a hyponym of fruit.


• hypernyms - The generic term used to designate a class of specifics; the converse of a
hyponym. For example, meal is a hypernym of breakfast, vehicle is a hypernym of car, and
animal is a hypernym of dog. It is unfortunate that the two words hypernym and hyponym are
very similar and hence easily confused; for this reason, the word superordinate is often used
instead of hypernym.

• Meronyms & holonyms - Another common relation is meronymy, the part-whole relation.
A leg is part of a chair; a wheel is part of a car. We say that wheel is a meronym of car, and
car is a holonym of wheel.

• Homophones- Two words can be homonyms in a different way if they are spelled differently
but pronounced the same, like write and right, or piece and peace.
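
The hyponym/hypernym and meronym/holonym relations above can be sketched with a small hand-built taxonomy. The dictionaries below are illustrative assumptions, not WordNet data, and the helper names are hypothetical.

```python
# Hand-built toy taxonomy (not WordNet data): each word maps to its direct hypernym.
HYPERNYM_OF = {
    "car": "vehicle",
    "truck": "vehicle",
    "dog": "animal",
    "mango": "fruit",
    "vehicle": "entity",
    "animal": "entity",
    "fruit": "entity",
}
# Part-whole pairs: each part maps to the whole it belongs to.
HOLONYM_OF = {"wheel": "car", "leg": "chair"}

def hypernym_chain(word):
    """Follow hypernym links upward until the root of the toy taxonomy."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain

def is_hyponym(specific, general):
    """True when `specific` is a (possibly indirect) subclass of `general`."""
    return general in hypernym_chain(specific)

def meronyms(whole):
    """Return the parts recorded for `whole`."""
    return sorted(part for part, w in HOLONYM_OF.items() if w == whole)

print(hypernym_chain("car"))
print(is_hyponym("dog", "animal"))
print(meronyms("car"))
```

Hyponymy is transitive in this sketch: "car" counts as a hyponym of "entity" because the chain passes through "vehicle".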

What is Wordnet?

WordNet provides information on coordinate terms, derivations, senses, and more. It can be used
to measure the similarity between two words and to look up the words related to a given word.
In short, one can treat it as a dictionary or thesaurus. WordNet is organized into four
sub-nets, one per part of speech:

1. Noun
2. Verb
3. Adjective
4. Adverb

WordNet is a lexical database for English that NLTK exposes through a corpus reader. It can be
used to find the meanings, synonyms, or antonyms of words. One can define it as a semantically
oriented dictionary of English. It is imported with the following command:

from nltk.corpus import wordnet

Let us understand some of the features available with the wordnet:

Synset: Also called a synonym set, a synset is a collection of synonymous words. Let us look
at an example:

from nltk.corpus import wordnet

syns = wordnet.synsets("dog")
print(syns)

Output:

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'),
Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

Lexical Relations: These are semantic relations that are reciprocal. If a relation holds
between {x1, x2, ..., xn} and {y1, y2, ..., yn}, then a corresponding relation also holds
between {y1, y2, ..., yn} and {x1, x2, ..., xn}. For example, if one word is an antonym of
another, the reverse is also true, and hypernymy and hyponymy are inverses of each other.

Let us write a program using python to find synonym and antonym of word "active" using
Wordnet.

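from nltk.corpus import wordnet

synonyms = []
antonyms = []

# Collect lemma names from every synset of "active"
for syn in wordnet.synsets("active"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        # A lemma may have no antonyms; take the first one when present
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))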

The output of the code:

{'dynamic', 'fighting', 'combat-ready', 'active_voice', 'active_agent', 'participating', 'alive',
'active'} -- Synonym

{'stative', 'passive', 'quiet', 'passive_voice', 'extinct', 'dormant', 'inactive'} -- Antonym

5. Conclusion:

WordNet is a lexical database that has been used by a major search engine. From WordNet,
information about a given word or phrase can be retrieved. It can be used in the area of
artificial intelligence for text analysis. With the help of WordNet, you can build your own
corpus for spell checking, language translation, spam detection, and more.

6. Viva Questions:
• What is Word Sense?
• What are different relations between word sense?
• What is the command to import the WordNet dictionary?

References:

TB1: Daniel Jurafsky, James H. Martin, “Speech and Language Processing”, Second Edition,
Prentice Hall, 2008.

RB1: Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval, Oxford
University Press (2008).


Experiment No. : 8
Study on Reference Resolution Algorithm


Experiment No.8
1. Aim: Study on Reference Resolution Algorithm

2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
• To design and implement applications based on natural language processing
• To implement various language Models.
• To design systems that use NLP techniques

Outcomes: Students will be able to identify and resolve references between sentences in a
discourse.
3. Hardware / Software Required : Study experiment.

4. Theory:
A discourse is a collocated group of sentences that convey a clear understanding only
when read together. The etymology of anaphora is ana (Greek for back) and pherein (Greek
for to bear), which in simple terms means repetition. The most prevalent type of
anaphora in natural language is pronominal anaphora. Coreference, as the term
suggests, refers to words or phrases referring to a single unique entity in the world.
Anaphora and coreference resolution themselves form a subset of the broader task of
"discourse parsing", which is crucial for full text understanding.

Reference Resolution Algorithms

1. Rule-based entity resolution.

Reference resolution task in NLP has been widely considered as a task which inevitably
depends on some hand-crafted rules. These rules are based on syntactic and semantic
features of the text under consideration. Which features aid entity resolution and which
do not has been a constant topic of debate. There have also been studies conducted
specifically targeting this. Thus, most of the earlier anaphora resolution (AR) and
coreference resolution (CR) algorithms were dependent on a set of hand-crafted rules.
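
One hand-crafted rule of the kind described above can be sketched as follows: resolve a pronoun to the most recent earlier mention that agrees with it in gender and number. This is a deliberately simplified illustration, not a published algorithm; the mention records and their agreement features are hand-assigned assumptions.

```python
# Agreement features for a few pronouns: (gender, number). None means "any gender".
PRONOUN_FEATURES = {
    "he": ("male", "singular"),
    "she": ("female", "singular"),
    "it": ("neuter", "singular"),
    "they": (None, "plural"),
}

def resolve_pronoun(pronoun, previous_mentions):
    """Pick the most recent earlier mention that agrees in gender and number."""
    gender, number = PRONOUN_FEATURES[pronoun.lower()]
    for text, m_gender, m_number in reversed(previous_mentions):
        if m_number == number and (gender is None or m_gender == gender):
            return text
    return None

# Mentions in order of appearance in: "John met Mary at the station. She gave him a book."
mentions = [
    ("John", "male", "singular"),
    ("Mary", "female", "singular"),
    ("the station", "neuter", "singular"),
]
print(resolve_pronoun("She", mentions))
print(resolve_pronoun("he", mentions))
print(resolve_pronoun("they", mentions))
```

Real rule-based systems add syntactic constraints and salience weights on top of agreement; recency plus agreement alone fails on many sentences.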


2. Statistical and machine learning based entity resolution.

The field of entity resolution underwent a shift during the late nineties from heuristic-
and rule-based approaches to learning-based approaches. Some of the early learning-
based and probabilistic approaches for AR used decision trees, genetic algorithms and
Bayesian rule. These approaches set the foundation for the learning-based approaches
for entity resolution which improved successively over time and, finally, outperformed
the rule-based algorithms.

3. Deep learning models for CR

Since its inception, the aim of entity resolution research has been to reduce the
dependency on hand-crafted features. With the introduction of deep learning in
NLP, words could be represented as vectors conveying semantic dependencies.
This gave an impetus to approaches which deployed deep learning for entity
resolution.

The first non-linear mention ranking model for CR aimed at learning different
feature representations for anaphoricity detection and antecedent ranking by pre-
training on these two individual subtasks. This approach addressed two major issues
in entity resolution: the first is the identification of non-anaphoric references,
which abound in text, and the second is the complicated feature conjunction in
linear models which was necessary because of the inability of simpler features to
make a clear distinction between truly co-referent and non-coreferent mentions.
This model handled the above issues by introducing a new neural network model
which took only raw un-conjoined features as inputs and attempted to learn
intermediate representations.

5. Conclusion:

Entity resolution aims at resolving repeated references to an entity in a document and forms a
core component of natural language processing (NLP) research. This field possesses immense
potential to improve the performance of other NLP fields like machine translation, sentiment
analysis, paraphrase detection, summarization, etc. The area of entity resolution in NLP has seen
proliferation of research in two separate sub-areas namely: anaphora resolution and coreference
resolution.

6. Viva Questions:

• What are anaphora resolution (AR) and coreference resolution (CR) problems in
NLP?
• What are different types of References in Natural Language?
Eg. Zero Anaphora, One Anaphora, Demonstratives etc.


References:

TB1: Daniel Jurafsky, James H. Martin, “Speech and Language Processing”, Second
Edition, Prentice Hall, 2008.

RB1: Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval, Oxford
University Press (2008).


Computational Lab-II
Experiment No. : 9
Perform Named Entity Recognition (NER) on
given text.


Experiment No.9
1. Aim: Perform Named Entity Recognition (NER) on given text.

2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
• To design and implement applications based on natural language processing
• To implement various language Models.
• To design systems that use NLP techniques

Outcomes: Students will understand how to use named entity recognition (NER), the identification
of words in a sentence as entities.

3. Hardware / Software Required : Python using spaCy

4. Theory:

Named entity recognition (NER) is often the first step towards information extraction. It
seeks to locate and classify named entities in text into pre-defined categories such as the
names of persons, organizations, locations, expressions of times, quantities, monetary values,
percentages, etc.
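
To illustrate what "locate and classify" means, here is a toy rule-based sketch that finds monetary values and percentages with regular expressions. It is not how statistical NER libraries work; the two patterns and the function name `find_entities` are assumptions made for illustration.

```python
import re

# Hand-written patterns for two of the categories above; a toy, not a trained model.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
]

def find_entities(text):
    """Return (entity_text, start, end, label) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.group(), m.start(), m.end(), label))
    return sorted(found, key=lambda item: item[1])

text = "Apple is looking at buying U.K. startup for $1 billion, up 5% from last year"
for entity in find_entities(text):
    print(entity)
```

Patterns like these work for rigidly formatted entities, while names of persons, organizations, and places need the contextual cues that a trained model learns.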

NER with spaCy


spaCy is widely regarded as one of the fastest NLP frameworks in Python, with a single optimized
implementation for each of the NLP tasks it supports.

spaCy supports the following entity types:

PERSON, NORP (nationalities, religious and political groups), FAC (buildings, airports etc.),
ORG (organizations), GPE (countries, cities etc.), LOC (mountain ranges, water bodies etc.),
PRODUCT (products), EVENT (event names), WORK_OF_ART (books, song titles), LAW
(legal document titles), LANGUAGE (named languages), DATE, TIME, PERCENT, MONEY,
QUANTITY, ORDINAL and CARDINAL.

Being easy to learn and use, one can easily perform simple tasks using a few lines of code.

Installation :

pip install spacy

python -m spacy download en_core_web_sm

Code for NER using spaCy.

import spacy

nlp = spacy.load('en_core_web_sm')

sentence = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)

# Print each entity with its character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Output

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY

Example 2: It is interesting to note that spaCy’s NER model uses capitalization as one
of the cues to identify named entities. The same example, when tested with a slight modification
(a lowercase "apple"), produces a different result.

import spacy

nlp = spacy.load('en_core_web_sm')

sentence = "apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)

# Print each entity with its character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Output

U.K. 27 31 GPE
$1 billion 44 54 MONEY


5. Conclusion:

NER is used in many fields in Natural Language Processing (NLP), and it can help answer
many real-world questions, such as:

• Which companies were mentioned in the news article?
• Were specified products mentioned in complaints or reviews?
• Does the tweet contain the name of a person? Does the tweet contain this person’s
location?

6. Viva Questions:
• What is Named Entity Recognition?
• What are the libraries used in NER?
• What is the difference between chunking and NER?

References:

RB1: Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval,
Oxford University Press (2008).
https://spacy.io/


Computational Lab-II

Mini Project


1. Aim: Case study/Mini Project based on Application in Module-6

2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To design and implement applications based on natural language processing

Outcomes: Students will be able to apply NLP techniques to design real-world NLP applications
such as machine translation, text categorization, text summarization, information extraction, etc.

3. Hardware / Software Required: Language: R tool/Python etc.

4. Theory:
1. Abstract
2. Introduction
3. Literature Survey – a minimum of 4 research papers on the selected application is required.
4. Implementation – implement the specified problem using any standard algorithm.
5. Result and Analysis-
6. Conclusion-
References
Note: - The mini project is a group activity, with a maximum of 3 students in a group.
- The report must be submitted as a hard copy in the same Computational Lab-II course file.
- Include at least the points mentioned above.
- Also attach printouts of the presentation.
