Sample Paper Questions - Natural Language Processing (Part 2)
Q1 What is NLP (Natural Language Processing)?
It is the subfield of Artificial Intelligence that deals with how computer programs perform tasks such as
speech recognition, translation, and the analysis and extraction of information from large amounts of natural
language data, so that machines and humans can interact successfully and produce the desired output.
Q2 What are the types of data used for Natural Language Processing applications?
Natural Language Processing takes in the data of Natural Languages in the form of written words and spoken
words which humans use in their daily lives and operates on this.
Q3 While working with NLP what is the meaning of Syntax and Semantics?
Syntax: Syntax refers to the grammatical structure of a sentence.
Semantics: It refers to the meaning of the sentence.
Q4 Mention the Difficulties faced by machine to understand human language.
Arrangement of words and meaning: There are rules in human language which provide structure to a
language. There are nouns, verbs, adverbs, adjectives. A word can be a noun at one time and an adjective
some other time.
Analogy with programming languages - Different syntax, same semantics: 2+3 = 3+2
Here the two expressions are written differently, but their meaning is the same: both evaluate to 5.
Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3). Here the two statements have the
same syntax but different meanings. In Python 2.7, this statement results in 1 (integer division), while in
Python 3, it gives an output of 1.5.
Multiple meanings of a word – In natural language, a word can have multiple meanings and the meanings
fit into the statement according to the context of it.
Example - His face turns red after consuming the medicine.
Meaning - Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?
Perfect syntax, no meaning – Sometimes, a statement can have perfectly correct syntax but it does not
mean anything.
Example - Chickens feed extravagantly while the moon drinks tea.
This statement is correct grammatically but it does not make any sense. In Human language, a perfect
balance of syntax and semantics is important for better understanding.
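The "same syntax, different semantics" case can be checked directly in Python 3. This is a minimal sketch: in Python 3, `/` is true division, while `//` (floor division) gives the result that Python 2.7 produced for `/` on two integers.

```python
# Python 3: "/" is true division, "//" is floor division.
# In Python 2.7, "/" applied to two integers behaved like "//" does here.
true_division = 3 / 2     # Python 3 meaning of "/"
floor_division = 3 // 2   # Python 2.7 meaning of "/" for two integers

# Different syntax, same semantics: both expressions evaluate to 5.
same_meaning = (2 + 3) == (3 + 2)

print(true_division, floor_division, same_meaning)
```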
Q5 What is a Chatbot?
A chatbot is a computer program that's designed to simulate human conversation through voice commands
or text chats or both. Eg: Mitsuku Bot, Jabberwacky etc.
A chatbot is a computer program that can learn over time how to best interact with humans. It can answer
questions and troubleshoot customer problems, evaluate and qualify prospects, generate sales leads and
increase sales on an ecommerce site.
A chatbot is also known as an artificial conversational entity (ACE), chat robot, talk bot, chatterbot or
chatterbox.
Q6 Mention Examples of Chatbots.
Mitsuku Bot, CleverBot, Jabberwacky, Haptik, Rose, Ochatbot
Q7 List the limitations of Chatbots.
Chatbots cannot make out grammatical errors in the text
They have no emotions
They can’t ask qualifying questions if clarification is required
Complex Chatbots are expensive
Q8 List two applications of Chatbots in Education.
1. Efficient teacher Assistants: Chatbots are being used as virtual teaching assistants, to answer students’
queries about the course module, lesson plans, assignments and deadlines.
2. Answer queries at the time of admission: Since most of the questions are repetitive, Chatbots are used
to convert this time consuming task of replying to each query personally into an automatic one.
Q9 Difference between Script-bot and Smart-bot.
Q10 Mention some applications of Natural Language Processing.
Natural Language Processing Applications-
Sentiment Analysis
Chatbots & Virtual Assistants
Text Classification
Text Extraction
Machine Translation
Text Summarization
Market Intelligence
Auto-Correct
Q11 What is Text Normalisation?
The first step in Data processing is Text Normalisation. Text Normalisation helps in cleaning up the textual data
in such a way that it comes down to a level where its complexity is lower than the actual data. In this we undergo
several steps to normalize the text to a lower level. We work on text from multiple documents and the term used
for the whole textual data from all the documents altogether is known as corpus.
Q12 Does the vocabulary of a corpus remain the same before and after text normalization? Why?
No, the vocabulary of a corpus does not remain the same before and after text normalization.
Reasons are –
In normalization the text is normalized through various steps and is lowered to minimum vocabulary
since the machine does not require grammatically correct statements but the essence of it.
In normalization Stop words, Special Characters and Numbers are removed.
In stemming the affixes of words are removed and the words are converted to their base form.
So, after normalization, we get a reduced vocabulary.
Q13 What is the need of text normalization in NLP?
Since computers work with numbers, the very first step that comes to mind is to
convert our language to numbers.
This conversion takes a few steps to happen. The first step to it is Text Normalization. Since human languages
are complex, we need to first of all simplify them in order to make sure that the understanding becomes possible.
Text Normalization helps in cleaning up the textual data in such a way that it comes down to a level where its
complexity is lower than the actual data.
Q14 What are the steps of text Normalization? Explain them in brief.
In Text Normalization, we undergo several steps to normalize the text to a lower level.
1. Sentence Segmentation - Under sentence segmentation, the whole corpus is divided into sentences. Each
sentence is taken as a different data so now the whole corpus gets reduced to sentences.
2. Tokenisation - After segmenting the sentences, each sentence is then further divided into tokens. A token
is any word, number or special character occurring in a sentence. Under tokenisation,
every word, number and special character is considered separately and each of them becomes a separate
token.
3. Removing Stop words, Special Characters and Numbers –In this step, the tokens which are not
necessary are removed from the token list.
4. Converting text to a common case -After the stop words removal, we convert the whole text into a
similar case, preferably lower case. This ensures that the case-sensitivity of the machine does not consider
same words as different just because of different cases.
5. Stemming: In this step, the remaining words are reduced to their root words. In other words, stemming is
the process in which the affixes of words are removed and the words are converted to their base form. In
stemming, the stemmed words (the words we get after removing the affixes) may or may not be
meaningful.
6. Lemmatization: In lemmatization, the word we get after affix removal (also known as lemma) is a
meaningful one. Lemmatization makes sure that lemma is a word with meaning and hence it takes a longer
time to execute than stemming.
Q15 What is the importance of converting the text into a common case?
In Text Normalization, we undergo several steps to normalize the text to a lower level. After the removal
of stop words, we convert the whole text into a similar case, preferably lower case. This ensures that the
case-sensitivity of the machine does not consider same words as different just because of different cases.
Q16 What is Segmentation?
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is taken as a different
data so now the whole corpus gets reduced to sentences.
Example: Raj and Vijay are best friends. They play together with other friends.
Sentence Segmentation:
1. Raj and Vijay are best friends.
2. They play together with other friends.
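The segmentation above can be sketched with a regular expression. This is a simplified rule (split after `.`, `!` or `?` when followed by whitespace) and not a full sentence tokenizer; it would wrongly split after abbreviations such as "Dr.". The function name `segment_sentences` is chosen here for illustration.

```python
import re

def segment_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

corpus = "Raj and Vijay are best friends. They play together with other friends."
sentences = segment_sentences(corpus)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```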
Q17 What is Tokenisation?
After segmenting the sentences, each sentence is then further divided into tokens. A token is any
word, number or special character occurring in a sentence. Under tokenisation, every word, number and
special character is considered separately and each of them becomes a separate token.
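A minimal tokenisation sketch, assuming that a token is either a run of letters/digits or a single special character. The helper name `tokenize` is illustrative.

```python
import re

def tokenize(sentence):
    """Return words/numbers and individual special characters as tokens."""
    # \w+ matches a word or number; [^\w\s] matches one special character.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("Raj and Vijay are best friends.")
print(tokens)
```

Note that the full stop becomes a token of its own, exactly as described above.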
Q18 What is meant by Removing Stop words, Special Characters and Numbers?
Stop words are the words which occur very frequently in the corpus but do not add any value to it. Humans use
grammar to make their sentences meaningful for the other person to understand. But grammatical words do not
add any essence to the information which is to be transmitted through the statement hence they come under stop
words.
In this step, all the stop words, special characters like #$%@! and numbers which are not necessary are
removed from the list of tokens to make it easier for the NLP system to focus on the words that are important for
data processing.
It is possible to remove stop words using Natural Language Toolkit (NLTK), a suite of libraries and programs for
symbolic and statistical natural language processing.
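The removal step can be sketched as a simple filter. NLTK provides a ready-made English list through `nltk.corpus.stopwords.words("english")` (after downloading its `stopwords` data); the sketch below instead uses a small hand-written set, which is an assumption made only so that the example runs without any download.

```python
# Illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "to", "of", "in", "with"}

def remove_noise(tokens):
    """Drop stop words and any token that is not purely alphabetic."""
    kept = []
    for token in tokens:
        if token.lower() in STOP_WORDS:
            continue  # stop word
        if not token.isalpha():
            continue  # number or special character
        kept.append(token)
    return kept

tokens = ["Raj", "and", "Vijay", "are", "best", "friends", ".", "#", "2"]
filtered = remove_noise(tokens)
print(filtered)
```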
Q19 What is Converting text to a common case in Text Normalisation?
After the stop words removal, we convert the whole text into a similar case, preferably lower case. This ensures
that the case-sensitivity of the machine does not consider same words as different just because of different cases.
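A small sketch of why a common case matters: the token list below is made up for illustration, and the vocabulary shrinks once every token is lower-cased.

```python
tokens = ["Hello", "hello", "HELLO", "World", "world"]

# Without case conversion, each spelling counts as a different word.
vocabulary_before = set(tokens)

# After converting to lower case, the variants collapse into one word each.
vocabulary_after = {token.lower() for token in tokens}

print(len(vocabulary_before), len(vocabulary_after))
```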
Q20 Difference between Stemming and Lemmatization
Stemming:
In this step, the remaining words are reduced to their root words. In other words, stemming is the process
in which the affixes of words are removed and the words are converted to their base form.
In stemming, the stemmed words (the words we get after removing the affixes) may or may not be
meaningful.
Stemming just removes the affixes hence it is faster.
Lemmatization
In lemmatization, the word we get after affix removal (also known as lemma) is a meaningful one.
Lemmatization makes sure that lemma is a word with meaning and hence it takes a longer time to
execute than stemming.
Q21 Examples of Stemming and Lemmatization
Word      Affix   Stem    Lemma
Healed    -ed     heal    heal
Healing   -ing    heal    heal
Healer    -er     heal    heal
Changed   -ed     chang   change
Changing  -ing    chang   change
Studying  -ing    study   study
Studies   -es     studi   study
Likes     -s      like    like
Prefers   -s      prefer  prefer
Wants     -s      want    want
Tries     -es     tri     try
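The contrast in the table can be sketched with a toy stemmer and a toy lemmatizer. Both are assumptions made for illustration: the suffix rules are hand-picked to reproduce the rows above (they are not the Porter algorithm), and the lemma dictionary simply copies the table, whereas a real lemmatizer consults a full dictionary of the language.

```python
# Toy suffix rules, checked in order; the first match wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("er", ""), ("s", "")]

# Tiny hand-written lemma dictionary copied from the table (illustrative only).
LEMMAS = {
    "healed": "heal", "healing": "heal", "healer": "heal",
    "changed": "change", "changing": "change",
    "studying": "study", "studies": "study",
    "likes": "like", "prefers": "prefer", "wants": "want", "tries": "try",
}

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

for word in ["Changed", "Studies", "Tries", "Healing"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer only cuts characters, which is why it is fast but can produce non-words such as "chang"; the lemmatizer needs a lookup, which is slower but always returns a meaningful word.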
Q22 Create the stemming and lemmatization of word “running”, “runner” and “runs”
After stemming, they would all be reduced to the common stem “run”.
Stemming Words: run, run, run
Lemmatized Words: running, runner, run
In this example, “running” remains unchanged because a lemmatizer that is not told the part of speech treats
it as a noun (as in “running is fun”); if it is tagged as a verb, its lemma is “run”. “Runner” remains “runner”
since it is a noun with its own meaning, and “runs” becomes “run”, its base form.
Q23 Following is the corpus:
Document 1: This is my first experience of Text Mining.
Document 2: I have learnt new techniques in this.
Perform sentence segmentation, tokenization and stop word removal
Sentence Segmentation
This is my first experience of Text Mining.
I have learnt new techniques in this.
Tokenization
Sentence 1: This is my first experience of Text Mining.
Tokens: This | is | my | first | experience | of | Text | Mining | .
Sentence 2: I have learnt new techniques in this.
Tokens: I | have | learnt | new | techniques | in | this | .
Removing Stop Words
my first experience Text Mining
learnt new techniques
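The three steps of this answer can be chained in a short sketch. The stop word set below is an assumption chosen to match the worked answer; standard lists such as NLTK's English list would also remove "my".

```python
import re

# Stop words chosen to match the worked answer above (an assumption).
STOP_WORDS = {"this", "is", "of", "i", "have", "in"}

def normalize(document):
    """Segment into sentences, tokenise, then drop stop words and punctuation."""
    result = []
    for sentence in re.split(r"(?<=[.!?])\s+", document.strip()):
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        kept = [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]
        result.append(kept)
    return result

corpus = [
    "This is my first experience of Text Mining.",
    "I have learnt new techniques in this.",
]
normalized = [normalize(document) for document in corpus]
for sentences in normalized:
    print(sentences)
```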
Q24 Identify any two stop words which should not be removed from the given sentence and why?
Get help and support whether you’re shopping now or need help with a past purchase.
“Contact us at abc@pwershel.com or on our website www.pwershel.com”
Tokens in the given sentence which should not be removed are: @ and . (full stop). In other addresses the
same applies to _ (underscore) and numbers such as 123.
These tokens are generally removed along with the stop words, but in the above sentence they are part of an
email ID and a website address. Removing them would make the email ID and the website address invalid, so
they should not be removed from the above sentence.
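One way to protect such tokens is to recognise addresses before splitting on special characters. The sketch below is an illustration only: the two address patterns are deliberately simplified and will not match every valid email ID or website address.

```python
import re

# Order matters: the address patterns are tried before the generic ones.
TOKEN_PATTERN = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # email ID (simplified)
    r"|www\.[\w-]+(?:\.[\w-]+)+"      # website address starting with www. (simplified)
    r"|\w+"                           # ordinary word or number
    r"|[^\w\s]"                       # any other single special character
)

def tokenize_keeping_addresses(text):
    """Tokenise text while keeping email IDs and website addresses whole."""
    return TOKEN_PATTERN.findall(text)

sentence = "Contact us at abc@pwershel.com or on our website www.pwershel.com"
tokens = tokenize_keeping_addresses(sentence)
print(tokens)
```

Because each address stays a single token, a later filter for special characters no longer sees the @ or the full stops inside it.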
Q25 What are the applications of TFIDF?
TFIDF is commonly used in the Natural Language Processing domain. Some of its applications are:
Document Classification - Helps in classifying the type and genre of a document.
Topic Modelling - It helps in predicting the topic for a corpus.
Information Retrieval System - To extract the important information out of a corpus.
Stop word filtering - Helps in removing the unnecessary words out of a text body.
Q26 What is the full form of TFIDF?
Term Frequency and Inverse Document Frequency
Q27 What is TFIDF? Write its formula.
Term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a
word is to a document in a collection or corpus.
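Formula: TFIDF (W) = TF (W) * log (IDF (W)), where TF (W) is the term frequency of the word W in a document
and IDF (W) is the total number of documents divided by the number of documents in which W occurs.
In its normalized form, term frequency is the number of times a word appears in a document divided by the total
number of words in the document. Every document has its own term frequency.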
Q28 What is term frequency?
Term frequency is the frequency of a word in one document. Term frequency can easily be found from the
document vector table as in that table we mention the frequency of each word of the vocabulary in each document.
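Term frequency for a single document can be sketched with `collections.Counter`. The `normalized` flag is an illustrative parameter: it switches between the raw count and the count divided by the number of words in the document.

```python
from collections import Counter

def term_frequency(document, normalized=False):
    """Count each word; optionally divide by the document's word count."""
    words = document.lower().split()
    counts = Counter(words)
    if normalized:
        total = len(words)
        return {word: count / total for word, count in counts.items()}
    return dict(counts)

document = "we are going to mumbai"
raw_tf = term_frequency(document)
relative_tf = term_frequency(document, normalized=True)
print(raw_tf)
print(relative_tf)
```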
Q29 Which package is used for Natural Language Processing in Python programming?
Natural Language Toolkit (NLTK). NLTK is one of the leading platforms for building Python programs that can
work with human language data.
Q30 What is inverse document frequency?
To understand inverse document frequency, first we need to understand document frequency.
Document Frequency is the number of documents in which the word occurs irrespective of how many times it
has occurred in those documents.
In case of inverse document frequency, we need to put the document frequency in the denominator while the total
number of documents is the numerator.
For example, if the document frequency of a word “AMAN” is 2 in a corpus of 3 documents, then its inverse
document frequency will be 3/2.
Q31 Through a step-by-step process, calculate TFIDF for the given corpus and mention the word(s) having
highest value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be found from the
document vector table as in that table we mention the frequency of each word of the vocabulary in each document.
Document  We      are     going   to      Mumbai  is      a       famous  place   I       am      in
1         1       1       1       1       1       0       0       0       0       0       0       0
2         0       0       0       0       1       1       1       1       1       0       0       0
3         1       1       1       1       0       0       1       1       1       0       0       0
4         0       0       0       0       1       0       0       1       0       1       1       1
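The document vector table can be rebuilt with a few lines of Python. This is a sketch under simple assumptions: words are split on whitespace, the trailing full stop is stripped, and case is left unchanged, as in the table.

```python
documents = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

def words_of(document):
    """Split on whitespace and strip the trailing full stop."""
    return [word.strip(".") for word in document.split()]

# Vocabulary in order of first appearance, as in the table header.
vocabulary = []
for document in documents:
    for word in words_of(document):
        if word not in vocabulary:
            vocabulary.append(word)

# One row per document: how often each vocabulary word occurs in it.
table = [[words_of(d).count(word) for word in vocabulary] for d in documents]

print(vocabulary)
for row in table:
    print(row)
```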
Inverse Document Frequency
Document Frequency is the number of documents in which the word occurs irrespective of how many times it
has occurred in those documents. The document frequency for the exemplar vocabulary would be:
We      are     going   to      Mumbai  is      a       famous  place   I       am      in
2       2       2       2       3       1       2       3       2       1       1       1
For inverse document frequency, we need to put the document frequency in the denominator while the total
number of documents is the numerator. Here, the total number of documents is 4, hence inverse document
frequency becomes:
We      are     going   to      Mumbai  is      a       famous  place   I       am      in
4/2     4/2     4/2     4/2     4/3     4/1     4/2     4/3     4/2     4/1     4/1     4/1
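Document frequency and inverse document frequency for this corpus can be checked with a short sketch. The vocabulary is typed in by hand in the order used above, and words are split on whitespace with the full stop stripped (both are simplifying assumptions).

```python
documents = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]
vocabulary = ["We", "are", "going", "to", "Mumbai", "is", "a",
              "famous", "place", "I", "am", "in"]

def words_of(document):
    """Split on whitespace and strip the trailing full stop."""
    return [word.strip(".") for word in document.split()]

total_documents = len(documents)
document_sets = [set(words_of(d)) for d in documents]

# Document frequency: number of documents that contain the word at least once.
document_frequency = {
    word: sum(word in words for words in document_sets) for word in vocabulary
}

# Inverse document frequency: total documents divided by document frequency.
inverse_document_frequency = {
    word: total_documents / df for word, df in document_frequency.items()
}

for word in vocabulary:
    df = document_frequency[word]
    print(f"{word}: DF = {df}, IDF = {total_documents}/{df}")
```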
The formula of TFIDF for any word W becomes:
TFIDF (W) = TF (W) * log (IDF (W))
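Every word in this corpus has a term frequency of 0 or 1, so the TFIDF of a word in a document where it occurs
is simply log (IDF (W)). The words having the highest value are therefore is, I, am and in, each with
log (4/1) ≈ 0.602 (log to base 10), because each occurs in only one document. Mumbai and famous occur in
three of the four documents, so they have the lowest value, log (4/3) ≈ 0.125.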
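The last step can be verified by applying the formula TFIDF (W) = TF (W) * log (IDF (W)) to every word of every document. The sketch assumes log to base 10, raw counts as term frequency, and whitespace splitting with the full stop stripped; the ranking of the words does not depend on the base of the logarithm.

```python
import math

documents = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

def words_of(document):
    """Split on whitespace and strip the trailing full stop."""
    return [word.strip(".") for word in document.split()]

tokenized = [words_of(d) for d in documents]
total_documents = len(tokenized)

def tfidf(word, doc_words):
    """TF(W) * log10(IDF(W)) for one word in one document."""
    tf = doc_words.count(word)
    df = sum(word in other for other in tokenized)
    return tf * math.log10(total_documents / df)

# Score every (word, document number) pair.
scores = {}
for index, doc_words in enumerate(tokenized, start=1):
    for word in set(doc_words):
        scores[(word, index)] = tfidf(word, doc_words)

highest = max(scores.values())
top_words = sorted(
    {word for (word, _), score in scores.items() if math.isclose(score, highest)}
)
print(round(highest, 3), top_words)
```

Words that occur in many documents get a small IDF and therefore a small TFIDF, even if they are frequent overall.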