Natural Language Processing
Introduction to Natural Language Processing
Learning Objectives
By the end of this lesson, you will be able to:
Describe natural language processing and its components
Explain the different applications of NLP
Define and demonstrate text processing
Introduction to NLP
What Is NLP?
Natural Language Processing (NLP) is a branch of AI.
It helps machines deal with human languages.
It helps machines understand, interpret, and manipulate human languages.
Most Natural Language Processing techniques depend on machine learning to derive meaning from human languages.
[Figure: NLP at the intersection of Human Language, AI, and Computer Science]
Interaction between Humans and Machines Using NLP
1 Human talks to the machine
2 Machine captures the audio
3 Audio to text conversion
4 Processing of the text data
5 Data to audio conversion
6 Machine responds to the human through the audio file
History of NLP
1950: NLP started when Alan Turing published the article "Computing Machinery and Intelligence". Early work tried to automate translation between Russian and English.
1960: The work of the linguist Chomsky and others on formal language theory and generative syntax began.
1990: Probabilistic and data-driven models had become quite normal.
2000: A large quantity of spoken and textual data became accessible.
2010: Representation learning and deep neural networks in natural language processing started.
Human Language
To understand NLP, let us first understand the human language.
Alphabets -> Word: Combination of Alphabets -> Sentence: Combination of Words -> Apply Grammar Rules -> Meaningful Sentence
Why NLP
01 Language is a hallmark of human intelligence
02 Text is the largest repository of human knowledge and is growing quickly
03 NLP represents computer programs that understand text or speech
Need for NLP
[Figure: big data sources and unstructured text data. Labels: Collections, Words, Organization, Packaged Data, Text, Big Data Sources, Data Patterns]
How to analyze this unstructured data?
Use Text Mining
Understanding Text Mining
It is also called text analysis.
Process of deriving insights from natural language text
This is where NLP helps:
Structure the input text -> Derive pattern -> Evaluate output
How NLP Works
01 Sequence Preprocessing
02 Text to Features
03 Usage of NLP Libraries
04 Usage of NLP Models
Different Aspects of NLP
01 NLP is the ability of a computer to analyze, understand,
and generate human languages.
02 A language is a system, a set of rules or set of
symbols.
03 Symbols are combined and used for conveying
information or broadcasting the information.
04 In NLP, rules of grammar are used for handling
symbols.
05 NLP is a component of artificial intelligence.
Categories of NLP
Rule-Based NLP
• Designed by creating a set of rules
• Developed by heuristic rules
Statistical NLP (after the Statistical Revolution)
• Relies heavily on machine learning
• Applies automatic learning procedure
Techniques Used in NLP
Syntactic Analysis
• Focuses on arrangement of words
• Aligns with grammatical rules
Semantic Analysis
• Meaning of a text
• Sentence structure understanding
• Interpretation of words
Components of Natural Language Processing
Natural Language Processing has two components:
1 Natural Language Understanding (NLU)
2 Natural Language Generation (NLG)
Components: Natural Language Understanding (NLU)
Taking some sentences and
finding out what they mean
Components: Natural Language Generation (NLG)
1 Taking some formal representation of what you want to say and working out a way to express it in a natural language
2 Mapping the given input in the natural language with a useful representation
3 Producing output in the natural language from some internal representation
4 Different levels of analysis: morphological analysis, syntactic analysis, semantic analysis, and discourse analysis
Uses of NLP
Use of NLP in a conversational bot at each step:
User's Request -> Intent Identification -> Entity Extraction -> Content Management, History Processing, Session Management -> Response Generation
Natural Language Understanding: Intent Identification, Entity Extraction
Natural Language Generation: Response Generation
Applications of Natural Language Processing
NLP in Real-Life
• Speech Recognition
• Machine Translation
• Chatbot
• Information Retrieval
• Information Extraction
• Question Answering
• Spell Check
• Sentiment Analysis
NLP in Real-Life: Business Usage
Improve user experience
• Spellcheck
• Autocomplete
• Autocorrect
Automate support
• Chatbot
• Product ordering
Monitor and analyze feedback
• Generate actionable insight from huge
amount of review or feedback
NLP in Real-Life: Speech Recognition
• Google Assistant
• Siri
• Alexa
• Cortana
Example voice commands:
"Alexa, turn on Welcome Home"
"Alexa, turn off my Bedroom Sonos"
"Alexa, turn on my Chill Time"
"Alexa, turn on the TV"
NLP in Real-Life: Machine Translation
Google Translate
NLP in Real-Life: Chatbots
Uber, Facebook Messenger, and Zendesk are some of the companies that have implemented chatbots using NLP.
NLP in Real-Life: Information Retrieval
Find information according to the given query
NLP techniques used in IR are:
• Stemming
• Part-of-Speech Tagging
• Compound Recognition
• Decompounding
• Chunking
• Word-Sense Disambiguation
[Figure: big data and unstructured data. Labels: Text Data, Audio, Video, Words, Sentences, Collections, Patterns]
Google finds relevant and similar results using Information Retrieval.
NLP in Real-Life: Information Extraction
Automatic extraction of structured information from unstructured or semi-
structured machine-readable documents
Gmail structures events from e-mails
Raw Text (String)
-> Sentence Segmentation -> Sentences (List of strings)
-> Tokenization -> Tokenized Sentences (List of lists of strings)
-> Parts-of-Speech Tagging -> PoS-Tagged Sentences (List of lists of tuples)
-> Entity Recognition -> Chunked Sentences (List of trees)
-> Relation Recognition -> Relations (List of tuples)
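The first two stages of this pipeline can be sketched with Python's built-in re module to show how the data shape changes from a string to a list of strings to a list of lists of strings. This is only an illustration: the sample text and both regular expressions are assumptions, and the later stages (tagging, entity and relation recognition) need trained models that are not shown here.

```python
import re

# Hypothetical raw text (a single string)
raw_text = "Gmail reads e-mails. It finds dates and places."

# Stage 1: sentence segmentation -> list of strings.
# Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())

# Stage 2: tokenization -> list of lists of strings.
# Words (optionally hyphenated) and single punctuation marks become tokens.
tokenized = [re.findall(r"\w+(?:-\w+)*|[^\w\s]", s) for s in sentences]

print(sentences)
print(tokenized)
```

Each later stage keeps the outer list (one entry per sentence) and enriches the inner items, for example by turning each token into a (word, tag) tuple.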
NLP in Real-Life: Question Answering
System that automatically answers questions
NLP in Real-Life: Spell Check
Salesforce implemented spell check in the contact forms using NLP.
Grammarly: A grammar-checking software built on NLP
NLP in Real-Life: Sentiment Analysis
To extract subjective information from a piece of text
Example: Whether an author is being subjective or objective, or even positive or negative
NLP is used here.
Challenges and Scope
Why NLP Is Difficult
Nature of the human language: Rules that dictate the passing of information using natural languages are not easy for computers to understand.
Human language is unstructured data: Only 21% of the data is structured data, and a lot of information in the world is unstructured.
Tough to extract meaning from text: The process of reading and understanding English is very complex.
Challenges
• Semantic Meaning: Understanding a word with respect to its context
• Word-Sense Disambiguation: Actual context of the text
• Entity Extraction: Extracting the unknown entities
• Multiple Intents: User speaking many things in a single text
• Anaphora Resolution: Absence of an important entity in a text conversation
Challenges and Scope: Semantic Meaning
There are many good properties available on HDFC Red portal.
I want to purchase a red carpet from a store.
The word "red" has different meanings in these contexts.
Challenges and Scope: Understanding Entities
A 2M solution of CaCl2 consists of 221.82g of CaCl2 dissolved in enough water to
make one liter of solution.
Understanding and extraction of CaCl2
as entity in this context is complex.
Challenges and Scope: Anaphora Resolution
Peter and Greg are NLP developers.
He is living in Pune.
The word “He” used in the 2nd sentence does not specify which person it refers to.
Challenges and Scope: Multiple Intents
My bank account is functional. Please provide me resolution
process and I want to buy Laptop from Flipkart.
The user expresses more than one intent in a single text: a request for a resolution process and a wish to buy a laptop.
NLU Challenges: Ambiguity
Lexical Ambiguity: More than one meaning of a word in a sentence
Example: The fisherman went to the bank.
Syntactic or Grammatical Ambiguity: More than one meaning of a sentence
Example: Visiting relatives can be boring.
Referential Ambiguity: Unclear reference of a pronoun
Example: The boy told the father about the theft. He was very upset.
Data Formats
To apply NLP, we need data, which is available from different kinds of sources in different formats.
Below are the types of data formats:
• Structured
• Semi-Structured
• Unstructured
Data Formats: Structured
Excel, CSV
SQL Data
Data Formats: Unstructured and Semi-Structured
JSON
Text Image
NLP Pipeline
Raw Text -> Text Processing -> Feature Extraction -> Modeling
• May need to come back to text processing if feature extraction is not proper
• May need to come back to feature extraction if modeling is not as desired
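The first two pipeline stages can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a simple bag-of-words count as the feature extraction step; the sample documents and the process helper are hypothetical, and the modeling stage (any classifier that consumes the feature rows) is left out.

```python
import re
from collections import Counter

# Hypothetical raw documents
docs = ["NLP is fun", "NLP is hard, but NLP is useful"]

def process(text):
    """Text processing: lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Stage 1: text processing
tokens = [process(d) for d in docs]

# Stage 2: feature extraction (bag of words)
# The vocabulary fixes the column order; each row counts word occurrences.
vocab = sorted({t for doc in tokens for t in doc})
features = [[Counter(doc)[term] for term in vocab] for doc in tokens]

print(vocab)
print(features)
```

If the feature rows look wrong (for example, punctuation shows up in the vocabulary), that is the signal to go back and adjust the text processing step.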
Text Processing
Information Source (Input) -> Text Processing -> Output
Sequence or Text Processing
Sequence or Text Processing has the following steps:
Raw Text
1 Noise Entity Removal
✔ Stop words
✔ URLs
✔ Punctuations
✔ Numbers
2 Word Normalization
✔ Tokenization
✔ Stemming
✔ Lemmatization
3 Standardization of Text
✔ Regular Expressions
4 Modified Text
Sequence or Text Processing: Noise Entity Removal
Convert all the words of the document into lower case
For efficient processing, we must keep all words in the same casing to avoid case sensitivity of text.
Eliminate unnecessary punctuation, stop words, numbers, and URLs
These do not contribute to a better result. They only increase the size of texts and decrease the efficiency of algorithms.
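The noise removal steps described above can be sketched with the Python standard library. The stop-word set here is a tiny illustrative assumption (a real list, such as the one shipped with NLTK, is much longer), and remove_noise is a hypothetical helper name.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption, not a complete list)
STOP_WORDS = {"a", "an", "the", "is", "at", "of", "and", "for"}

def remove_noise(text):
    """Lowercase the text, then drop URLs, numbers, punctuation and stop words."""
    text = text.lower()                             # same casing for all words
    text = re.sub(r"https?://\S+", " ", text)       # URLs
    text = re.sub(r"\d+", " ", text)                # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

cleaned = remove_noise("Visit https://example.com for 3 FREE samples of the product!")
print(cleaned)
```

URLs are removed before punctuation so that the URL pattern still sees the "://" it relies on.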
Sequence or Text Processing: Tokenization
Tokenization
Break the sentence into separate words.
These words are called tokens.
Split words whenever there is a space between them.
Treat punctuation marks as separate tokens since punctuation
also has meaning.
Example:
Sentence: London is the capital and the most populous city of England and the United Kingdom
Words: “London”, “is”, “the”, “capital”, “and”, “the”, “most”, “populous”, “city”, “of”, “England”, “and”, “the”, “United”, “Kingdom”
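The tokenization rules above (split on spaces, keep punctuation marks as separate tokens) can be sketched with one regular expression. The tokenize helper and the sample sentence are illustrative assumptions; library tokenizers handle many more cases, such as contractions.

```python
import re

def tokenize(sentence):
    """Return word tokens and punctuation tokens in order of appearance."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("London is the capital of England, and of the United Kingdom.")
print(tokens)
```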
Sequence or Text Processing: Stemming
Stemming:
It reduces a word to its root form (stem).
It removes the last few characters or the suffix of a word, which can result in misspelt or incorrect words.
Example: Plays, Player, Playing, Played -> Play
Word      Suffix   Stem
studies   -es      studi
ninez     -ez      nin
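A toy suffix-stripping stemmer shows why stems are not always dictionary words. This is a deliberately naive sketch, not the Porter algorithm: the suffix list, the minimum stem length of three characters, and the naive_stem name are all assumptions made for illustration.

```python
# Suffixes are tried in this order; the first match is stripped
SUFFIXES = ("ing", "ed", "es", "er", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["Plays", "Player", "Playing", "Played", "studies"]]
print(stems)
```

The last word reproduces the table above: removing "-es" from "studies" leaves "studi", which is not a valid English word.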
Sequence or Text Processing: Lemmatization
Lemmatization:
It converts the text to a meaningful base form by considering its context.
Example:
Word       Morphological Information    Lemma
Studying   Gerund of the word study     Study
Ninez      Singular number of nine      Ninez
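The idea of "considering context" can be sketched as a dictionary lookup keyed on both the word and its part of speech. The mini-lexicon below is entirely hypothetical and only illustrates the principle; real lemmatizers rely on a full lexical database such as WordNet.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma
LEMMAS = {
    ("studying", "v"): "study",
    ("studies", "n"): "study",
    ("better", "a"): "good",
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
}

def lemmatize(word, pos):
    """Look up the lemma for a word in the given part of speech."""
    key = (word.lower(), pos)
    # Unknown words are returned unchanged (lowercased)
    return LEMMAS.get(key, word.lower())

print(lemmatize("Studying", "v"))
print(lemmatize("meeting", "n"), lemmatize("meeting", "v"))
```

Unlike a stemmer, the lookup always returns a real word, and the same surface form ("meeting") can map to different lemmas depending on its role in the sentence.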
Sequence or Text Processing: Chunking and Chinking
• Chunking is the process of extracting meaningful short phrases from sentences by analyzing the parts of speech
• It is the simplest technique used for entity detection
• Chunk patterns are made of normal regular expressions, which are designed and modified to match the part-of-speech tags
• Words or patterns can also be defined that should not be a part of a chunk; such words are known as chinks
• Chinking is a way to remove a chink from a chunk
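Chunking with chinks can be sketched in plain Python by treating the whole sentence as one chunk and then cutting it wherever a chink tag appears. The hand-tagged sentence, the choice of VBD and IN as chink tags, and the chunk helper are assumptions for illustration; NLTK offers a grammar-based parser for the same task.

```python
from itertools import groupby

# Hand-tagged sentence: (word, part-of-speech tag) pairs (assumed tags)
tagged = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"),
    ("the", "DT"), ("cat", "NN"),
]

# Tags that must not be part of any chunk (the chinks)
CHINK_TAGS = {"VBD", "IN"}

def chunk(tagged_sentence):
    """Group consecutive non-chink words into chunks."""
    chunks = []
    for is_chink, group in groupby(tagged_sentence, key=lambda pair: pair[1] in CHINK_TAGS):
        if not is_chink:
            chunks.append([word for word, _ in group])
    return chunks

phrases = chunk(tagged)
print(phrases)
```

The verb and the preposition are chinked out, which leaves the two noun phrases as separate chunks.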
Sequence or Text Processing: Regular Expression
Object Standardization:
Some words or symbols are not present in a standard dictionary, so search processes do not recognize them.
Examples: hashtags, acronyms, and colloquial slang
Note: With the help of regular expressions, we can remove these things.
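A small sketch of object standardization: a regular expression drops hashtags, and a lookup table replaces acronyms and slang with standard words. The lookup entries and the standardize helper are hypothetical; a real project would maintain a much larger table.

```python
import re

# Hypothetical lookup table for acronyms and slang
LOOKUP = {"rt": "retweet", "dm": "direct message", "luv": "love"}

def standardize(text):
    """Remove hashtags and replace known acronyms or slang words."""
    text = re.sub(r"#\w+", "", text)                       # drop hashtags
    words = [LOOKUP.get(w.lower(), w) for w in text.split()]
    return " ".join(words)

result = standardize("RT this if you luv #NLP")
print(result)
```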
Sequence or Text Processing: Regular Expression
Regular Expression (Regex):
It is a sequence of characters that defines a pattern for pattern-matching, search-and-replace, and elimination functions. Many types of noise can be removed with the help of regular expressions.
Regex Examples:
Expression   Description
[abc]        Find any character between the brackets
[^abc]       Find any character that is not between the brackets
[0-9]        Find any digit from 0 to 9
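The three expressions in the table can be tried with re.findall, which returns every match in order. The sample string is an arbitrary assumption chosen so that each pattern matches something.

```python
import re

text = "cab 42 dab"

# [abc]: any one of the characters a, b or c
in_set = re.findall(r"[abc]", text)

# [^abc]: any character that is not a, b or c (spaces and digits included)
not_in_set = re.findall(r"[^abc]", text)

# [0-9]: any single digit
digits = re.findall(r"[0-9]", text)

print(in_set, not_in_set, digits)
```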
NLTK
NLTK: Introduction
1 This tool is used for manipulating or understanding text or speech by any software or machine.
2 It is one of the most widely used NLP libraries and is often called the mother of all NLP libraries.
3 It is a platform used for building Python programs that work with human language data for application in statistical Natural Language Processing (NLP).
NLTK: Introduction
NLTK provides text processing libraries for:
• Tokenization
• Lemmatization
• Parsing
• Classification
• Stemming
• Tagging
• Semantic Reasoning
NLTK: Syntax and library
System Requirement:
Operating System:
macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
Python Version:
Python 2.7, 3.5+ (only 64 bit)
>> import nltk
NLTK: Lemmatization
For grammatical reasons, documents use different forms of a word. Lemmatization maps them to a common base form, for example:
# Load the lemmatizer from the NLTK package
# (WordNetLemmatizer needs the WordNet corpus: nltk.download("wordnet"))
from nltk.stem import WordNetLemmatizer
# Create the lemmatizer object
lemmatizer = WordNetLemmatizer()
# Print the lemma of the adjective "happier"
print("happier:", lemmatizer.lemmatize("happier", pos="a"))
Output: happier: happy
NLTK: Stemming
# Load PorterStemmer from the NLTK package
from nltk.stem import PorterStemmer
# Create the PorterStemmer object
ps = PorterStemmer()
# Print the stem of the word "helping"
print("helping :", ps.stem("helping"))
Output: helping : help
NLTK: Processing Raw Text
Text processing includes:
Converting all letters to lower or upper case
# Define the user input
user_input = "Python is a Programming Language that is Interpreted and High-Level language"
# Print the sentence after converting the text to lowercase
print("Text in lowercase :", user_input.lower())
Output: Text in lowercase : python is a programming language that is interpreted and high-level language
NLTK: Processing Raw Text
Converting numbers into words or removing numbers
# Load the regex package to find numbers
import re
# Define the user input
input_str = "Team A has 6 batsman and 5 bowlers, while team b has 5 batsman and 6 bowlers"
# Remove numbers using regex
output = re.sub(r"\d+", "", input_str)
# Print the sentence after removal of numbers
print("remove numbers :", output)
Output: remove numbers : Team A has  batsman and  bowlers, while team b has  batsman and  bowlers
NLTK: Processing Raw Text
Removing accents, punctuation marks, and other diacritics
#Load the regex package and string package
import re, string
#define user input
input_str = "Sentence. having. string with. Punctuation?"
#remove punctuation
result = re.sub('[%s]' % re.escape(string.punctuation), '', input_str)
#print the sentence after removal of punctuation
print("result after removing punctuation :", result)
Output: result after removing punctuation : Sentence having string with Punctuation
NLTK: Processing Raw Text
Removing white spaces:
# Load the regex package
import re
# Define the input from the user
input_str = 'pythonis programming language \t\n\r\tHello \t'
# Print the sentence after removing all whitespace
print('Remove spaces using regex :', re.sub(r"\s+", "", input_str), "\n", sep='')
# Print the sentence after removing the leading spaces
print('Remove leading spaces using regex :', re.sub(r"^\s+", "", input_str), "\n", sep='')
# Print the sentence after removing the trailing spaces
print('Remove trailing spaces using regex :', re.sub(r"\s+$", "", input_str), "\n", sep='')
# Print the sentence after removing the leading and trailing spaces
print('Remove leading and trailing spaces using regex :', re.sub(r"^\s+|\s+$", "", input_str), "\n", sep='')
Output:
Remove spaces using regex :pythonisprogramminglanguageHello
Remove leading spaces using regex :pythonis programming language
Hello
Remove trailing spaces using regex :pythonis programming language
Hello
Remove leading and trailing spaces using regex :pythonis programming language
Hello
NLTK: Stopwords
# Load the stop-word corpus reader
# (the stop-word list and the tokenizer need NLTK data packages, installed with nltk.download)
from nltk.corpus import stopwords
# Load the word tokenizer
from nltk.tokenize import word_tokenize
# Define the user input
input_str = "Stop words are the words that are filtered before and after processing of text."
# Create the set of English stop words
stop_word = set(stopwords.words("english"))
# Convert the sentence into tokens
token = word_tokenize(input_str)
# Remove stop words from the list of tokens
output = [i for i in token if i not in stop_word]
# Print the remaining tokens
print("remove stopwords :", output)
Output: remove stopwords : ['Stop', 'words', 'words', 'filtered', 'processing', 'text', '.']
NLTK: Tokenizers
# Load the sentence tokenizer
from nltk.tokenize import sent_tokenize
# Define the user input
text = "Tokenization is the way of tokenizing or dividing a string, text into a list of tokens"
# Print the sentences found in the text
print("sentence tokens :", sent_tokenize(text))
Output: sentence tokens : ['Tokenization is the way of tokenizing or dividing a string, text into a list of tokens']
NLTK: Ngram
# Load the package for n-grams
from nltk import ngrams
# Define the user input
usr_input = 'i want to ngramize the foo bar sentences'
# Define the size of each n-gram
n = 3
# Split the sentence into words and build the n-grams
trigrams = ngrams(usr_input.split(), n)
for grams in trigrams:
    # Print each 3-gram of the user input
    print(grams)
Output:
('i', 'want', 'to')
('want', 'to', 'ngramize')
('to', 'ngramize', 'the')
('ngramize', 'the', 'foo')
('the', 'foo', 'bar')
('foo', 'bar', 'sentences')
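The sliding-window idea behind n-grams can also be sketched without NLTK. The make_ngrams helper below is a hypothetical name; it zips the token list with shifted copies of itself so that each tuple holds n consecutive tokens.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # tokens[i:] is the list shifted left by i; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

bigrams = make_ngrams("i want to ngramize".split(), 2)
print(bigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, which matches the six 3-grams produced from the eight-word sentence above.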
NLTK: Limitations
1 Does not support word vectors
2 Is slow
3 Not for production purpose
4 Good only for English and difficult for other languages
Re
Re: Introduction
• re is a built-in library that comes with Python.
• It uses a set of symbols to identify patterns in text.
Example: email address ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
• It is used in information retrieval.
import nltk
import re
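The email pattern from the slide can be compiled and tested with re. The addresses below are made-up examples; the three capture groups return the user name, the domain name, and the top-level domain.

```python
import re

# Email pattern taken from the slide above
EMAIL = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")

# Hypothetical sample inputs
match = EMAIL.match("jane.doe@example.com")
no_match = EMAIL.match("not an email")

if match:
    user, domain, tld = match.groups()
    print(user, domain, tld)
print(no_match)
```

This pattern is a simple illustration and rejects some valid addresses, for example those with a top-level domain longer than five letters.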
Text Processing Using Stemming and Regular Expression
Problem Statement: Demonstrate text processing using stemming and regular expression.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Tweets Cleanup and Analysis Using Regular Expressions
Objective: Use regular expressions to work with messy tweets data:
clean up the data, extract hashtags, analyze the most popular hashtags
that occur along with a target hashtag (#economy).
Problem Statement: Social media is a gold mine of information. Brands, governments, or anyone else can benefit from the information it contains. It can be information on the sentiments for a brand, the themes being spoken about, or the associated trends for a particular hashtag. In this project, we will work on tweets from Twitter. We will find other hashtags that occur frequently with our target hashtag. This will give us an understanding of which other topics people are associating this hashtag with.
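One possible starting point for the project is sketched below: extract hashtags with a regular expression and count those that appear together with the target hashtag. The three tweets are invented sample data, and the simple #\w+ pattern is an assumption; the real dataset will need extra cleanup first.

```python
import re
from collections import Counter

# Invented sample tweets (the real project uses the provided dataset)
tweets = [
    "Jobs report is out #economy #jobs",
    "Markets rally today! #Economy #stocks #jobs",
    "Great game last night #sports",
]

TARGET = "#economy"
co_occurring = Counter()

for tweet in tweets:
    # Lowercase the hashtags so that #Economy and #economy are the same tag
    tags = {t.lower() for t in re.findall(r"#\w+", tweet)}
    if TARGET in tags:
        # Count every other hashtag once per tweet
        co_occurring.update(tags - {TARGET})

top = co_occurring.most_common(1)
print(top)
```

Using a set per tweet means a hashtag repeated inside one tweet is counted only once for that tweet.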
Knowledge Check
Knowledge Check 1
One of the main challenges of NLP is _____________.
a. Handling ambiguity of sentences
b. Handling tokenization
c. Both a and b
d. None of the above
The correct answer is a.
One of the main challenges of NLP is handling ambiguity of sentences.
Knowledge Check 2
Regular expression is used for _______.
a. Information retrieval
b. Finding the pattern
c. Database management
d. Both a and b
The correct answer is d.
Regular expression is used for information retrieval and finding the pattern.
Knowledge Check 3
NLP is the technique of interpretation of all types of languages, which include ___________.
a. Human Language
b. Assembly Language
c. Machine Language
d. Binary Data
The correct answer is a.
NLP has its focus on understanding the human spoken or written language and converting that
interpretation into machine understandable language.
Knowledge Check 4
Natural Language Processing (NLP) is a field of _______________.
a. Computer Science
b. Artificial Intelligence
c. Linguistics
d. All of the above
The correct answer is d.
Natural Language Processing is a field of computer science, artificial intelligence, and linguistics.
Knowledge Check 5
Which of the following techniques can be used for the purpose of keyword normalization?
1- Lemmatization 2- Levenshtein 3- Stemming 4- POS
a. 1 and 2
b. 2 and 4
c. 1 and 3
d. 1, 2, and 3
The correct answer is c.
Lemmatization and stemming are the techniques of keyword normalization.
Key Takeaways
You are now able to:
Describe natural language processing and its components
Explain the different applications of NLP
Define and demonstrate text processing