Natural Language Processing
Dr. Ankur Priyadarshi
Assistant Professor
Computer Science and Information Technology
Syllabus
Prerequisites:
1. Basic knowledge of English grammar and the
Theory of Computation.
2. Basic knowledge of Machine Learning tools.
Course objectives
1. To understand the algorithms available for the processing of
linguistic information and computational properties of natural languages.
2. To conceive basic knowledge on various morphological,
syntactic and semantic NLP tasks.
3. To become familiar with publicly available NLP software
libraries and datasets.
4. To develop systems for various NLP problems with moderate
complexity.
5. To learn various strategies for NLP system evaluation and error
analysis.
Unit I:
INTRODUCTION TO NLP
Natural Language Processing
⊹ Natural language processing (NLP) refers to the branch of computer
science—and more specifically, the branch of artificial intelligence or
AI—concerned with giving computers the ability to understand text and
spoken words in much the same way human beings can.
⊹ NLP combines computational linguistics—rule-based modeling of human
language—with statistical, machine learning, and deep learning models.
⊹ Together, these technologies enable computers to process human
language in the form of text or voice data and to ‘understand’ its full
meaning, complete with the speaker or writer’s intent and sentiment.
NLP APPLICATIONS
1. Information Extraction
2. Question Answering
3. Sentiment Analysis
4. Machine Translation, and many more:
Speech recognition, intent classification, urgency detection, auto-correct, market intelligence, email
filtering, voice assistants and chatbots, targeted advertising, recruitment
Information Extraction (IE)
1. Working with an enormous amount of text data by hand is tedious
and time-consuming.
2. Hence, many companies and organisations rely on Information
Extraction techniques to automate manual work with intelligent
algorithms.
3. Information extraction can reduce human effort, reduce expenses,
and make the process less error-prone and more efficient.
Example: IE
For example, from a short report on a cricket match we can extract the following structured fields (a code sketch follows the list):
● Country – India, Captain – Virat Kohli
● Batsman – Virat Kohli, Runs – 2
● Bowler – Kyle Jamieson
● Match venue – Wellington
● Match series – New Zealand
● Series highlight – single fifty, 8 innings, 3 formats
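A common way to automate this kind of extraction is named-entity recognition (NER). Below is a minimal sketch using spaCy, assuming the library and its small English model en_core_web_sm are installed (python -m spacy download en_core_web_sm); the example sentence is a hypothetical paraphrase of the match report.

```python
import spacy

# Load spaCy's small English pipeline (an assumption; any English model works).
nlp = spacy.load("en_core_web_sm")

text = ("India captain Virat Kohli was dismissed for 2 runs by "
        "Kyle Jamieson in the Test match at Wellington.")

# The model tags spans with generic entity labels such as
# PERSON, GPE (geo-political entity), and CARDINAL.
for ent in nlp(text).ents:
    print(f"{ent.text:15s} {ent.label_}")
```

Mapping such generic labels onto domain-specific fields like Captain or Bowler would still require task-specific rules or a fine-tuned model.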
Question Answering
⊹ Question answering is a critical NLP problem and a
long-standing artificial intelligence milestone.
⊹ QA systems allow a user to express a question in natural
language and get an immediate and brief response.
⊹ QA systems are now found in search engines and phone
conversational interfaces, and they are fairly good at
answering simple factual questions.
⊹ On harder questions, however, they normally only go as
far as returning a list of snippets that we, the users, must
then browse through to find the answer to our question.
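As a concrete illustration, the Hugging Face transformers library exposes an extractive QA pipeline. The sketch below is a minimal example, assuming transformers (with a backend such as PyTorch) is installed; a default pretrained QA model is downloaded on first use.

```python
from transformers import pipeline

# Build an extractive question-answering pipeline.
qa = pipeline("question-answering")

context = ("Virat Kohli captained India in the Test match at Wellington, "
           "where he was dismissed by Kyle Jamieson for 2 runs.")

result = qa(question="Who dismissed Virat Kohli?", context=context)

# The pipeline returns the answer span it found in the context,
# together with a confidence score.
print(result["answer"], result["score"])
```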
Sentiment Analysis
⊹ Sentiment analysis (or opinion mining) is a natural
language processing (NLP) technique used to determine
whether data is positive, negative or neutral.
⊹ Sentiment analysis, as the name suggests, identifies the
view or emotion behind a situation. It means analyzing a
piece of text, speech, or any other mode of communication
to find the emotion or intent behind it.
Suppose, there is a fast-food chain company and they sell a variety of
different food items like burgers, pizza, sandwiches, milkshakes, etc. They
have created a website to sell their food and now the customers can order
any food item from their website and they can provide reviews as well, like
whether they liked the food or hated it.
● User Review 1: I love this cheese sandwich, it’s so delicious.
● User Review 2: This chicken burger has a very bad taste.
● User Review 3: I ordered this pizza today.
Of these three reviews:
The first review is definitely positive and signifies that the customer was
really happy with the sandwich. The second review is negative, so the
company needs to look into its burger department. And the third one does not
signify whether the customer is happy or not, so we can consider it a
neutral statement.
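One lightweight way to reproduce this three-way classification is NLTK's rule-based VADER analyzer. The sketch below assumes nltk is installed and downloads the VADER lexicon on first run; the ±0.05 cutoffs on the compound score are VADER's conventional defaults.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

reviews = [
    "I love this cheese sandwich, it's so delicious.",
    "This chicken burger has a very bad taste.",
    "I ordered this pizza today.",
]
for review in reviews:
    # 'compound' is a normalized polarity score in [-1, 1].
    score = sia.polarity_scores(review)["compound"]
    label = ("positive" if score > 0.05
             else "negative" if score < -0.05
             else "neutral")
    print(f"{label:8s} {score:+.2f}  {review}")
```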
Machine Translation
Machine Translation (MT) is the task of automatically
converting one natural language into another, preserving
the meaning of the input text, and producing fluent text in the
output language.
While machine translation is one of the oldest subfields of artificial intelligence
research, the recent shift towards large-scale empirical techniques has led to
very significant improvements in translation quality.
The Stanford Machine Translation group's research interests lie in techniques
that utilize both statistical methods and deep linguistic analyses.
Machine translation: approaches
● Rule-based Machine Translation (RBMT): 1970s-1990s
● Statistical Machine Translation (SMT): 1990s-2010s
● Neural Machine Translation (NMT): 2014-...
Rule based MT (RBMT)
A rule-based system requires experts’ knowledge about the source and
the target language to develop syntactic, semantic and morphological
rules to achieve the translation.
The Wikipedia article on RBMT includes a basic example of rule-based
translation from English to German. The translation needs an
English-German dictionary, a rule set for English grammar, and a rule
set for German grammar.
An RBMT system contains a pipeline of Natural Language Processing
(NLP) tasks including Tokenization, Part-of-Speech tagging and so on.
Most of these jobs have to be done in both source and target language.
SYSTRAN is one of the oldest machine translation companies.
It translates to and from around 20 languages.
SYSTRAN was used for the Apollo-Soyuz project (1973) and by the
European Commission (1975)
Advantages
● No bilingual text required
● Domain-independent
● Total control (a possible new rule for every situation)
● Reusability (existing rules of languages can be transferred
when paired with new languages)
Disadvantages
● Requires good dictionaries
● Manually set rules (requires expertise)
Statistical MT
This approach uses statistical models based on the analysis of bilingual
text corpora.
It was first introduced in 1955, but it gained interest only after 1988
when the IBM Watson Research Center started using it.
SMT Examples
● Google Translate (between 2006 and 2016, when it
announced the switch to NMT)
● Microsoft Translator (switched to NMT in 2016)
● Moses: Open source toolkit for statistical machine translation
Advantages
● Less manual work from linguistic experts
● One SMT suitable for more language pairs
● Less out-of-dictionary translation: with the right language
model, the translation is more fluent
Disadvantages
● Requires bilingual corpus
● Specific errors are hard to fix
● Less suitable for language pairs with big differences in word order
Neural MT
❖ The neural approach uses neural networks to achieve machine
translation.
❖ Compared to the previous models, NMTs can be built with one
network instead of a pipeline of separate tasks.
NMT examples
● Google Translate (from 2016)
● Microsoft Translator (from 2016)
● Translation on Facebook
● OpenNMT: An open-source neural machine translation
system.
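As a minimal sketch of this end-to-end property, the snippet below runs one publicly available pretrained model (Helsinki-NLP/opus-mt-en-de, chosen here as an assumption; any translation model would work) through the Hugging Face transformers pipeline. It assumes transformers, a backend such as PyTorch, and sentencepiece are installed.

```python
from transformers import pipeline

# A single pretrained network maps source text to target text end to end,
# with no separate tokenization/tagging/transfer stages to maintain.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

result = translator("Machine translation preserves the meaning of the input text.")
print(result[0]["translation_text"])
```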
Advantages
● End-to-end models (no pipeline of specific tasks)
Disadvantages
● Requires bilingual corpus
● Rare word problem
NLP PHASES
Lexical Analysis
● It involves identifying and analyzing the structure of words. Lexicon of a
language means the collection of words and phrases in that particular
language.
● Lexical analysis divides the text into paragraphs, sentences, and words,
and the resulting words then need lexicon normalization.
The two most common lexicon normalization techniques are stemming and lemmatization:
● Stemming: Stemming is the process of reducing derived words to their word
stem, base, or root form, generally by stripping written suffixes such as “-ing”,
“-ly”, “-es”, and “-s”.
● Lemmatization: Lemmatization is the process of reducing a group of words
to their lemma, or dictionary form. It takes into account things like the POS
(part of speech) and the meaning of the word in the sentence and in nearby
sentences before reducing the word to its lemma.
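The sketch below contrasts the two techniques using NLTK (one possible choice; spaCy or other libraries work equally well). It assumes nltk is installed and downloads the WordNet data on first run.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "leaves", "happily"]:
    # Stemming strips suffixes heuristically; the result may not be a real word.
    # Lemmatization consults a dictionary and a POS hint ("v" = treat as a verb).
    print(f"{word:10s} stem={stemmer.stem(word):8s} "
          f"lemma={lemmatizer.lemmatize(word, pos='v')}")
```

For example, "studies" stems to the non-word "studi" but lemmatizes to the dictionary form "study".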
Syntactic Analysis
Syntactic Analysis is used to check grammar, arrangements of words,
and the interrelationship between the words.
Example: "Mumbai goes to the Sara"
Here "Mumbai goes to the Sara" does not make any sense, so this
sentence is rejected by the syntactic analyzer.
Syntactic parsing involves analyzing the words in a sentence for
grammar.
Dependency grammar and part-of-speech (POS) tags are the key
attributes of syntactic analysis, as the sketch below illustrates.
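A minimal sketch of both attributes with spaCy, assuming its small English model en_core_web_sm is installed (the sentence is illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sara goes to Mumbai")

for token in doc:
    # token.pos_ is the coarse part-of-speech tag; token.dep_ is the
    # dependency relation linking the token to its syntactic head
    # (e.g. 'nsubj' marks 'Sara' as the subject of 'goes').
    print(f"{token.text:8s} {token.pos_:6s} {token.dep_:8s} head={token.head.text}")
```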
Semantic analysis
The way we understand what someone has said is an unconscious
process relying on our intuition and knowledge about language itself.
In other words, the way we understand language is heavily based on
meaning and context. Computers need a different approach, however.
The word “semantic” is a linguistic term and means "related to
meaning or logic."
Semantic analysis is the process of understanding the meaning and
interpretation of words, signs and sentence structure.
Discourse Integration
Discourse integration is closely related to pragmatics (the context of the sentence).
Discourse integration is considered the larger context for any smaller part of NL
structure. NL is so complex that, most of the time, sequences of text depend
on prior discourse.
This concept often arises as pragmatic ambiguity. The analysis deals with how the
immediately preceding sentence can affect the meaning and interpretation of the
next sentence. Context can also be analyzed at larger scales, such as the paragraph
level, the document level, and so on.
Pragmatic Analysis
Pragmatic Analysis is part of the process of extracting information from text.
Specifically, it is the portion that focuses on taking a structured set of text and
figuring out its actual meaning.
It comes from the field of linguistics (as a lot of NLP does), where text is
considered together with its context.
Why is this important? Because a lot of text’s meaning does have to do with
the context in which it was said/written.
Ambiguity, and limiting it, lies at the core of natural language
processing, so pragmatic analysis is crucial to extracting meaning
or information.
Difficulty In NLP
● Contextual words and phrases and homonyms
● Synonyms
● Irony and sarcasm
● Ambiguity
● Errors in text or speech
● Colloquialisms and slang
● Domain-specific language
● Low-resource languages
● Lack of research and development