Unit 6: Natural Language Processing
Introduction
A natural language is a human language, such as French, Spanish, English, Japanese, etc.
Features of Natural Languages
● They are governed by set rules that include syntax, lexicon, and semantics.
● All natural languages are redundant, i.e., the information can be conveyed in multiple ways.
● All natural languages change over time.
Computer Language
Computer languages are languages used to interact with a computer, such as Python, C++, Java, HTML,
etc.
Can computers understand our language?
Computers require a specific set of instructions, called programs, to understand
human input.
To talk to a computer, we convert natural language into a
language that a computer understands. We need Natural
Language Processing to help computers understand natural
language.
Why is NLP important?
Computers can only process electronic signals in the form of
binary language. Natural Language Processing facilitates the
conversion of natural language into this digital form.
Thus, the whole purpose of NLP is to make communication
between computer systems and humans possible. This includes
creating different tools and techniques that facilitate better
communication of intent and context.
Demystify Natural Language Processing (NLP)
Natural Language Processing, or NLP, is the sub-field of AI focused on enabling computers to
analyse, understand, and process human languages in order to derive meaningful information
from them.
Applications of Natural Language Processing
Since Artificial Intelligence is becoming an integral part of our lives, its applications are
commonly used by most people every day.
Here are some applications of Natural Language Processing that are used in real-life
scenarios:
Voice assistants: Voice assistants take our natural speech, process
it, and give us an output. These assistants leverage NLP to
understand natural language and execute tasks efficiently.
For example:
Hey Google, set an alarm at 3.30 pm
Hey Alexa, play some music
Hey Siri, what's the weather today
Auto-generated captions: Captions are generated by turning natural
speech into text in real-time. It is a valuable feature for enhancing
the accessibility of video content.
For example:
Auto-generated captions on YouTube and Google Meet.
Language Translation: Language translation converts text or
speech from one language to another, facilitating cross-linguistic
communication and fostering global connectivity.
For example:
Google Translate
Sentiment Analysis: Sentiment Analysis is a tool that determines
whether the underlying sentiment of a piece of text is positive,
negative, or neutral. Customer sentiment analysis helps in the
automatic detection of emotions when customers interact with
products, services, or a brand.
Text Classification: Text classification is a tool which classifies a
sentence or document category-wise.
In the example, we can observe news articles containing
information on various sectors, including Food, Sports, and
Politics, being categorized through the text classification process.
This process classifies the raw texts into predefined groups or
categories.
Keyword Extraction: Keyword extraction is a tool that automatically
extracts the most frequently used and most important words and expressions from a text.
It can give valuable insights into people’s opinions about any business
on social media.
Customer Service can be improved by using a Keyword extraction
tool.
Stages of Natural Language Processing (NLP)
The different stages of Natural Language Processing (NLP) serve various purposes in the overall task of
understanding and processing human language. The stages of Natural Language Processing (NLP)
typically involve the following:
Lexical Analysis:
NLP starts with identifying the structure of the input words. It is the process of dividing a large chunk of
text into structural units such as paragraphs, sentences, and words.
Lexicon stands for a collection of the various words and phrases used in a language.
Lengthy text is broken down into chunks.
Syntactic Analysis / Parsing
It is the process of checking the grammar of sentences and phrases. It forms a relationship among words
and eliminates logically incorrect sentences.
The grammar is correct!
Semantic Analysis
In this stage, the input text is now checked for meaning, and every word and phrase is checked for
meaningfulness.
For example:
It will reject a phrase like ‘hot ice cream’ or a sentence like ‘The fox jumped into the dog’, as they do not carry valid meaning.
Sentences make actual sense!
Discourse Integration
It is the process of forming the story of the sentence. Every sentence should have a relationship with its
preceding and succeeding sentences.
The flow of words makes sense!
Pragmatic Analysis
In this stage, sentences are checked for their relevance in the real world. Pragmatic means practical or
logical; this step requires knowledge of the intent behind a sentence. Where needed, the literal meaning
obtained after semantic analysis is set aside in favour of the intended meaning.
The intended meaning has been achieved!
In summary, the lexical, syntactic, semantic, discourse, and pragmatic stages together take raw text from individual words and grammar all the way to its intended meaning.
Chatbots
One of the most common applications of Natural Language Processing is a chatbot.
A chatbot is a computer program that's designed to simulate human conversation through voice
commands or text chats or both.
It can learn over time how best to interact with humans. It can answer questions and troubleshoot
customer problems, evaluate and qualify prospects, generate sales leads, and increase sales on an
e-commerce site. There are a lot of chatbots available.
Let us try some of the chatbots and see how they work.
Elizabot - https://www.masswerk.at/elizabot/
Kuki (formerly Mitsuku) - https://www.kuki.ai/
Cleverbot - https://www.cleverbot.com/
Singtel - https://www.singtel.com/personal/support
As you interact with more and more chatbots, you will realise that some of them are scripted (in other
words, traditional chatbots), while others are AI-powered and have more knowledge.
From this experience, we can understand that there are two types of chatbots around us:
Script-bot and Smart-bot.
Text Processing
Humans interact with each other very easily. For us, the natural languages that we use are so convenient
that we speak them easily and understand them well too. But for computers, our languages are very
complex.
As you have already gone through some of the complications in human languages above, now it is time
to see how Natural Language Processing makes it possible for machines to understand and speak in
Natural Languages just like humans.
Since the language of computers is numerical, the very first step that comes to mind
is to convert our language into numbers. This conversion takes a few steps.
The first step to it is Text Normalisation.
Since human languages are complex, we first need to simplify them so that machines can understand
them. Text Normalisation helps in cleaning up the textual data and bringing it down to a level where its
complexity is lower than that of the original data.
Text Normalisation
In Text Normalisation, we go through several steps to normalise the text to a lower level. Before we begin,
we need to understand that in this section, we will be working on a collection of written text.
That is, we will be working on text from multiple documents, and the whole textual data from all the
documents taken together is known as the corpus.
We will not only go through all the steps of Text Normalisation but also work them out on a corpus.
Let us take a look at the steps:
Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is treated as a
separate piece of data, so the whole corpus is reduced to a set of sentences.
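To see what this looks like in practice, here is a minimal Python sketch of sentence segmentation (the sample corpus is illustrative, and the simple punctuation-based split is only an approximation of what NLP libraries such as NLTK or spaCy do):

import re

# Illustrative corpus (not from any real dataset)
corpus = ("Aman and Avni are stressed. Aman went to a therapist. "
          "Avni went to download a health chatbot.")

# Split wherever a sentence-ending mark (. ! ?) is followed by whitespace.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]

for sentence in sentences:
    print(sentence)
# Aman and Avni are stressed.
# Aman went to a therapist.
# Avni went to download a health chatbot.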
Tokenisation
After segmenting the sentences, each sentence is further divided into tokens. A token is any word,
number, or special character occurring in a sentence. Under tokenisation, every word, number, and
special character is considered separately, and each of them becomes a separate token.
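A small Python sketch of tokenisation using a regular expression (the sentence is illustrative; library tokenisers such as NLTK's word_tokenize apply the same idea with many more rules):

import re

sentence = "Hey Alexa, play some music!"

# Every word and every standalone special character becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['Hey', 'Alexa', ',', 'play', 'some', 'music', '!']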
Removing Stop words, Special Characters and Numbers
In this step, the tokens which are not necessary are removed from the token list. What are the possible
words which we might not require?
Stop words are the words which occur very frequently in the corpus but do not add any value to it.
Humans use grammar to make their sentences meaningful for the other person to understand. But
grammatical words do not add any essence to the information that is to be transmitted through the
statement; hence they come under stop words. Some examples of stop words are: a, an, and, the, is, are, to, of, in, etc.
These words occur the most in any given corpus but talk very little or nothing about the context or the
meaning of it. Hence, to make it easier for the computer to focus on meaningful terms, these words are
removed.
Along with these words, a lot of times our corpus might have special characters and/or numbers. Now it
depends on the type of corpus that we are working on whether we should keep them in it or not.
For example, if you are working on a document containing email IDs, then you might not want to remove
the special characters and numbers whereas in some other textual data if these characters do not make
sense, then you can remove them along with the stop words.
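A minimal sketch of this step in Python; the stop word list here is a short illustrative subset, and whether numbers and special characters are dropped is a choice you make for your corpus:

# Illustrative (not exhaustive) stop word list
stop_words = {"a", "an", "and", "are", "the", "is", "to", "of"}

tokens = ["aman", "and", "avni", "are", "stressed", "!", "2"]

# Keep only alphabetic tokens that are not stop words; numbers and
# special characters are removed along with the stop words.
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
print(filtered)
# ['aman', 'avni', 'stressed']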
Converting Text to a Common Case
After stop word removal, we convert the whole text into the same case, preferably lowercase. This
ensures that the machine does not treat the same word as different words just because they appear in
different cases.
Here in this example, all the 6 forms of hello would be converted to lowercase and hence would be
treated as the same word by the machine.
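In Python this is a single operation; the six spellings below are illustrative stand-ins for the forms shown in the example:

tokens = ["Hello", "hello", "HELLO", "HeLLo", "hellO", "heLLo"]

# All six spellings collapse to the same lowercase form.
print([t.lower() for t in tokens])
# ['hello', 'hello', 'hello', 'hello', 'hello', 'hello']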
Stemming
In this step, the remaining words are reduced to their root words. In other words, stemming is the process
in which the affixes of words are removed and the words are converted to their base form.
Note that in stemming, the stemmed words (words that we get after removing the affixes) might not be
meaningful. In this example, healed, healing, and healer are all reduced to heal, but studies is reduced
to studi after affix removal, which is not a meaningful word. Stemming does not check whether the
stemmed word is meaningful; it simply removes the affixes, and hence it is faster.
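A quick sketch of stemming using NLTK's Porter stemmer (assuming the nltk package is installed); outputs may differ slightly with other stemmers:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["healed", "healing", "studies"]

# The stemmer strips affixes without checking whether the result is a real word.
print([stemmer.stem(w) for w in words])
# ['heal', 'heal', 'studi']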
Lemmatization
Stemming and lemmatization are alternative processes to each other, as both have the same role:
removal of affixes. The difference between them is that in lemmatization, the word we get after affix
removal (known as the lemma) is always a meaningful one. Lemmatization makes sure that the lemma
is a word with meaning, and hence it takes longer to execute than stemming.
As you can see in the same example, the output for studies after affix removal has become study instead
of studi.
The difference between stemming and lemmatization can be summarized by this example:
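A minimal sketch contrasting the two on the word ‘studies’, using NLTK's Porter stemmer and WordNet lemmatizer (assuming nltk and its WordNet data are installed):

from nltk.stem import PorterStemmer, WordNetLemmatizer

# The WordNet lemmatizer needs the WordNet data: nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "studies"

# Stemming chops the affix off blindly; lemmatization returns a real
# dictionary form (the lemma).
print(stemmer.stem(word))          # studi
print(lemmatizer.lemmatize(word))  # study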
With this, we have normalised our text to tokens, which are the simplest form of words present in the
corpus. Now it is time to convert the tokens into numbers. For this, we use the Bag of Words
algorithm.
Bag of Words
Bag of Words is a Natural Language Processing model which helps in extracting features out of the text
which can be helpful in machine learning algorithms. In the bag of words, we get the occurrences of each
word and construct the vocabulary for the corpus.
This image gives us a brief overview of how the bag of words works. Let us assume that the text on the
left in this image is the normalised corpus which we have got after going through all the steps of text
processing. Now, as we put this text into the bag of words algorithm, the algorithm returns to us the
unique words out of the corpus and their occurrences in it. As you can see on the right, it shows us a list
of words appearing in the corpus and the numbers corresponding to it show how many times the word
has occurred in the text body. Thus, we can say that the bag of words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (number of times it has occurred in the whole corpus).
Here calling this algorithm a “bag” of words symbolises that the sequence of sentences or tokens does not
matter. In this case, all we need are the unique words and their frequency.
Here is the step-by-step approach to implementing the bag of words algorithm:
1. Text Processing: Collect data and pre-process it
2. Create a Dictionary: Make a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the word
from the unique list of words has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:
Step 1: Collecting data and pre-processing it.
Document 1: Aman and Avni are stressed
Document 2: Aman went to a therapist
Document 3: Avni went to download a health chatbot
Here are three documents having one sentence each. After text normalisation, the text becomes:
Document 1: [aman, and, avni, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [avni, went, to, download, a, health, chatbot]
Step 2: Create a Dictionary
Go through all the documents and create a dictionary, i.e., list all the words which occur across the three
documents:
Dictionary: aman, and, avni, are, stressed, went, to, a, therapist, download, health, chatbot
Note that even though some words are repeated in different documents, they are written just once:
while creating the dictionary, we list only the unique words.
Step 3: Create a document vector
In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches
the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And
if the word does not occur in that document, put a 0 under it.
Since the first document contains the words aman, and, avni, are, and stressed, all these words get a
value of 1 and the rest of the words get a value of 0.
Step 4: Create document vectors for all the documents.
The same exercise has to be done for all the documents. Hence, the table becomes:
In this table, the header row contains the vocabulary of the corpus and three rows correspond to three
different documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
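The same vocabulary and document vector table can be reproduced with a short pure-Python sketch (no external libraries; variable names are for illustration only):

# Normalised documents from the worked example
docs = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: build the dictionary (vocabulary) of unique words
vocabulary = []
for doc in docs:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: one document vector per document; each entry counts how
# many times that vocabulary word occurs in the document
vectors = [[doc.count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
# ['aman', 'and', 'avni', 'are', 'stressed', 'went', 'to', 'a', 'therapist', 'download', 'health', 'chatbot']
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]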
Finally, this gives us the document vector table for our corpus. However, these counts do not yet tell us
how valuable each word is to the corpus. This leads us to the final step of our algorithm: TFIDF.
TFIDF: Term Frequency & Inverse Document Frequency
Suppose you have a book. Which characters or words do you think would occur the most in it?
The bag of words algorithm gives us the frequency of words in each document we have in our corpus. It
gives us an idea that if the word is occurring more in a document, its value is more for that document.
For example, if I have a document on air pollution, air and pollution would be the words which occur
many times in it. And these words are valuable too as they give us some context around the document.
But let us suppose we have 10 documents and all of them talk about different issues.
One is on women's empowerment; the other is on unemployment and so on.
Do you think air and pollution would still be one of the most occurring words in the whole corpus?
If not, then which words do you think would have the highest frequency in all of them?
Words such as ‘and’, ‘this’, ‘is’ and ‘the’ occur the most in almost all the documents. But these words
do not talk about the corpus at all.
Though they are important for humans as they make the statements understandable to us, for the machine
they are a complete waste as they do not provide us with any information regarding the corpus.
Hence, these are termed as stop words and are mostly removed at the pre-processing stage only.
Take a look at this graph. It is a plot of the occurrence of words versus their value.
As you can see, if the words have the highest occurrence in all the documents of the corpus, they are said
to have negligible value hence they are termed as stop words. These words are mostly removed at the
pre-processing stage only.
Now as we move ahead from the stop words, the occurrence level drops drastically and the words which
have adequate occurrence in the corpus are said to have some amount of value and are termed as
frequent words. These words mostly talk about the document’s subject and their occurrence is adequate
in the corpus.
Then as the occurrence of words drops further, the value of such words rises. These words are termed as
rare or valuable words. These words occur the least but add the most value to the corpus.
Hence, when we look at the text, we consider frequent and rare words.
TFIDF stands for Term Frequency and Inverse Document Frequency.
TFIDF helps us identify the value of each word.
Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be found in the
document vector table as in that table we mention the frequency of each word of the vocabulary in each
document.
Here, you can see that the frequency of each word for each document has been recorded in the table.
These numbers are nothing but the Term Frequencies!
Inverse Document Frequency
Now, let us look at the other half of TFIDF which is Inverse Document Frequency. For this, let us first
understand what document frequency means. Document Frequency is the number of documents in which
the word occurs irrespective of how many times it has occurred in those documents. The document
frequency for the exemplar vocabulary would be:
Here, you can see that the document frequency of ‘aman’, ‘avni’, ‘went’, ‘to’ and ‘a’ is 2, as they have
occurred in two documents. The rest of the words occurred in just one document; hence the document
frequency for them is one.
For inverse document frequency, we put the document frequency in the denominator and the total
number of documents in the numerator. Here, the total number of documents is 3; hence the inverse
document frequency becomes 3/2 for ‘aman’, ‘avni’, ‘went’, ‘to’ and ‘a’, and 3/1 for all the other words.
Finally, the formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log( IDF(W) )
Here, log is to the base of 10. Don’t worry! You don’t need to calculate the log values by yourself.
Simply use the log function in the calculator and find out!
Now, let’s multiply the IDF values by the TF values.
Note that the TF values are for each document while the IDF values are for the whole corpus. Hence, we
need to multiply the IDF values to each row of the document vector table.
Here, you can see that the IDF value for ‘aman’ is the same in each row, and a similar pattern is
followed for all the words of the vocabulary. After calculating all the values, we get the complete
TFIDF table for the corpus.
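The table can be computed with a short continuation of the earlier bag-of-words sketch, using the formula above, TFIDF(W) = TF(W) * log10(total documents / document frequency of W):

import math

docs = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]

vocabulary = []
for doc in docs:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

N = len(docs)  # total number of documents

# Document frequency: in how many documents does each word occur?
df = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}

# TFIDF(W) = TF(W) * log10(N / DF(W)), computed row by row
for doc in docs:
    print([round(doc.count(w) * math.log10(N / df[w]), 3) for w in vocabulary])
# Words with document frequency 2 ('aman', 'avni', 'went', 'to', 'a') get
# log10(3/2) ≈ 0.176 per occurrence; words unique to one document get
# log10(3/1) ≈ 0.477; a word occurring in every document would get 0.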
Finally, the words have been converted to numbers. These numbers are the value of each word for each
document.
Here, you can see that since we have a small amount of data, words like ‘are’ and ‘and’ also have a high
value. But as the document frequency of a word increases, its value decreases.
That is, for example:
Total number of documents: 10
Number of documents in which ‘and’ occurs: 10. Therefore, IDF(and) = 10/10 = 1, and log(1) = 0. Hence, the value of ‘and’ becomes 0.
On the other hand, the number of documents in which ‘pollution’ occurs: 3. Therefore, IDF(pollution) = 10/3 = 3.3333… and log(3.3333) = 0.522, which shows that the word ‘pollution’ has considerable value in the
corpus.
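These log values can be checked quickly with Python's base-10 log function:

import math

print(math.log10(10 / 10))  # 0.0      -> 'and' contributes nothing
print(math.log10(10 / 3))   # 0.522... -> 'pollution' carries value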
Summarising the concept, we can say that:
1. Words that occur in all the documents with high term frequencies have the lowest values and are
considered to be the stop words.
2. For a word to have a high TFIDF value, the word needs to have a high term frequency but a low
document frequency, which shows that the word is important for one document but is not a common
word across all documents.
3. These values help the computer understand which words are to be considered while processing the
natural language. The higher the value, the more important the word is for a given corpus.
Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain. Some of its applications are document classification, topic modelling, information retrieval, and stop word filtering.
Examples of Code and No-code NLP Tools
Applications of NLP
Introduction to Sentiment Analysis
Applications of Sentiment Analysis - Customer Service
Applications of Sentiment Analysis - Voice of the Customer
● Voice of the customer analysis helps to analyse customer feedback and gain actionable insights from it.
● It measures the gap between what customers expect and what they actually experience when they use the products or services.