Language Engineering
Prepared by: Abdelrahman M. Safwat
Section (3) – NLTK Basics
Installing NLTK
!pip install nltk
import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")
Tokenizing
NLTK has a module that can tokenize text. You can
tokenize text into sentences or into words.
from nltk.tokenize import sent_tokenize
text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer.
The slings and arrows of outrageous fortune, or to take arms against a sea of troubles.
And by opposing end them. To die—to sleep, no more; and by a sleep to say we end. The
heart-ache and the thousand natural shocks"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
Tokenizing (cont.)
from nltk.tokenize import word_tokenize
text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer.
The slings and arrows of outrageous fortune, or to take arms against a sea of troubles.
And by opposing end them. To die—to sleep, no more; and by a sleep to say we end. The
heart-ache and the thousand natural shocks"""
tokenized_text = word_tokenize(text)
print(tokenized_text)
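To make the token boundaries concrete, here is a minimal example; the output shown in the comment assumes the default Punkt/Treebank tokenizer that word_tokenize uses.
from nltk.tokenize import word_tokenize
sample = "To be, or not to be, that is the question."
print(word_tokenize(sample))
# Punctuation becomes separate tokens, e.g.:
# ['To', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question', '.']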
Stemming
If we want to get the root form of a word, we use a
stemmer.
For example, stemming the words “connection,”
“connecting,” or “connected” all results in the
word “connect.”
Stemming (cont.)
from nltk.stem import PorterStemmer
words = ["connection", "connected", "connecting"]
for word in words:
    print(PorterStemmer().stem(word))  # "connection", "connected", "connecting" all stem to "connect"
Lemmatization
Stemming can sometimes produce a wrong root word,
or a word that doesn’t exist (see the sketch below).
In that case, we can use lemmatization, which is
similar to looking up the base form of a word in a dictionary.
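To see why, try stemming a few more words; a small sketch (the exact strings below assume NLTK's PorterStemmer and may differ with other stemmers):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in ["studies", "ponies", "was"]:
    # Porter stemming can return strings that are not real English words,
    # e.g. "studi", "poni", "wa".
    print(word, "->", stemmer.stem(word))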
Lemmatization (cont.)
from nltk.stem import WordNetLemmatizer
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
for word in tokenized_text:
    print(WordNetLemmatizer().lemmatize(word))
Lemmatization (cont.)
In the above example, you’ll see that there is no
meaningful change after lemmatization.
That’s because you need to provide the lemmatization
function with Parts of Speech tags.
Lemmatization (cont.)
from nltk.stem import WordNetLemmatizer
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
for word in tokenized_text:
    print(WordNetLemmatizer().lemmatize(word, pos="v"))
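In practice the POS tag is usually not hard-coded. A common approach, sketched below (this helper is not part of the slides' code), is to map the Penn Treebank tags returned by nltk.pos_tag to WordNet's POS constants before lemmatizing:
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag (e.g. "VBG") to a WordNet POS constant.
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # reasonable default for everything else

lemmatizer = WordNetLemmatizer()
text = "The rabbit was running quickly towards the carrot"
for word, tag in pos_tag(word_tokenize(text)):
    print(lemmatizer.lemmatize(word, pos=to_wordnet_pos(tag)))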
Parts of Speech Tagging
Parts of Speech tagging is the process of tagging a
word in a text based on its definition and context.
Example: Tagging “likes” as a verb.
Note: To tag words in a text, you need to tokenize it
first.
Parts of Speech Tagging (cont.)
import nltk
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
print(nltk.pos_tag(tokenized_text))
Parts of Speech Tagging (cont.)
Tag    Meaning        Examples
ADJ    Adjective      new, good, high, special, big
ADP    Adposition     on, of, at, with, by, into, under
ADV    Adverb         really, already, still, early, now
CONJ   Conjunction    and, or, but, if, while
DET    Determiner     the, a, some, most, every
NOUN   Noun           year, home, costs, time
NUM    Numeral        twenty-four, fourth, 1991
PRT    Particle       at, on, out, over, per, that, up
PRON   Pronoun        he, their, her, its, my, I, us
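The tags in this table come from the universal tagset. By default nltk.pos_tag returns Penn Treebank tags, but it can be asked for universal tags instead; the sketch below assumes the universal_tagset resource has been downloaded.
import nltk
# nltk.download("universal_tagset")  # needed once for the tag mapping
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
print(nltk.pos_tag(tokenized_text, tagset="universal"))
# e.g. [('The', 'DET'), ('rabbit', 'NOUN'), ('was', 'VERB'), ...]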
WordNet
In lemmatization, we mentioned a process similar to
looking up a word in a dictionary. WordNet is what we
use for that lookup.
WordNet is similar to a database or a dictionary of links
and relationships between words.
WordNet (cont.)
WordNet is a large lexical database of English nouns,
adjectives, adverbs, and verbs. It is accessible in Python
through the Natural Language Toolkit (nltk.corpus.wordnet).
WordNet has been used for a number of purposes in
information systems, including
• Word-sense disambiguation
• Information retrieval
• Automatic text classification
• Automatic text summarization
• Machine translation
Example (Synsets and Lemmas)
In WordNet, similar words are grouped into a set known
as a Synset.
Every Synset has a name, a part-of-speech, and a
number. The words in a Synset are known as Lemmas.
Code
The function wordnet.synsets('word')
returns a list containing all the Synsets related to the
word passed to it as the argument.
from nltk.corpus import wordnet
synsets = wordnet.synsets("room")
print(synsets)
Output
[Synset('room.n.01'), Synset('room.n.02'), Synset('room.n.03'), Synset('room.n.04'),
Synset('board.v.02')]
Four of the Synsets have the name 'room' and are nouns, while the last
one's name is 'board' and it is a verb.
This also suggests that the word 'room' has a total of five
meanings or contexts.
WordNet
from nltk.corpus import wordnet
word = "hungry"
synset = wordnet.synsets(word)[0]
print("Name: " + synset.name())
print("Description: " + synset.definition())
print("Antonym: " + synset.lemmas()[0].antonyms()[0].name())
print("Examples: " + synset.examples()[0])
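The links and relationships mentioned earlier can also be explored directly. A small sketch showing a Synset's synonym lemmas, its hypernyms (more general concepts), and its hyponyms (more specific ones):
from nltk.corpus import wordnet
synset = wordnet.synsets("room")[0]   # first noun sense of "room"
print(synset.lemma_names())           # the synonymous words (Lemmas) in this Synset
print(synset.hypernyms())             # more general Synsets ("room is a kind of ...")
print(synset.hyponyms()[:3])          # a few more specific Synsets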
Try it out yourself
Code:
https://colab.research.google.com/drive/1wLjqqi4aLEY2
PWDcpax-_4tCyh946yVQ
Parts of Speech tagger:
https://parts-of-speech.info/
WordNet search:
http://wordnetweb.princeton.edu/perl/webwn
Task #1
Read a PDF file using the PyPDF2 library, extract the
text from the first page, tokenize it into sentences, and
then tag each sentence with the Parts of Speech tagger.
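One possible approach, sketched below; it assumes a recent PyPDF2 version (the PdfReader API) and a hypothetical file name sample.pdf.
import nltk
from PyPDF2 import PdfReader  # older PyPDF2 versions use PdfFileReader instead

reader = PdfReader("sample.pdf")                  # hypothetical input file
first_page_text = reader.pages[0].extract_text()  # text of the first page

for sentence in nltk.sent_tokenize(first_page_text):
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))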
Task #2
Use stemming to transform a word to its root form.
Task #3
Write code to determine the stems of the words in an
input sentence.
Thank you for your attention!
References
https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b
https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258
https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
https://www.nltk.org/book/ch05.html#tab-universal-tagset