Language Engineering
Prepared by: Abdelrahman M. Safwat
Section (3) – NLTK Basics
Installing NLTK
!pip install nltk
import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")
Tokenizing
NLTK has a module that can tokenize text. You can
tokenize text into sentences or into words.
from nltk.tokenize import sent_tokenize
text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer.
The slings and arrows of outrageous fortune, or to take arms against a sea of troubles.
And by opposing end them. To die—to sleep, no more; and by a sleep to say we end. The
heart-ache and the thousand natural shocks"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
Tokenizing (cont.)
from nltk.tokenize import word_tokenize
text = """To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer.
The slings and arrows of outrageous fortune, or to take arms against a sea of troubles.
And by opposing end them. To die—to sleep, no more; and by a sleep to say we end. The
heart-ache and the thousand natural shocks"""
tokenized_text = word_tokenize(text)
print(tokenized_text)
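To make the token boundaries concrete, here is a minimal example; the output shown in the comment assumes the default Punkt/Treebank tokenizer that word_tokenize uses.
from nltk.tokenize import word_tokenize
sample = "To be, or not to be, that is the question."
print(word_tokenize(sample))
# Punctuation becomes separate tokens, e.g.:
# ['To', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question', '.']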
Stemming
If we want to get the root form of a word, we use a
stemmer.
For example, stemming the words “connection,”
“connecting,” or “connected” all results in the
word “connect.”
Stemming (cont.)
from nltk.stem import PorterStemmer
words = ["connection", "connected", "connecting"]
for word in words:
    print(PorterStemmer().stem(word))  # "connection", "connected", "connecting" all stem to "connect"
Lemmatization
Stemming can sometimes produce a wrong root word,
or a word that doesn’t exist (see the sketch below).
In that case, we can use lemmatization, which is
similar to looking up the base form of a word in a dictionary.
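To see why, try stemming a few more words; a small sketch (the exact strings below assume NLTK's PorterStemmer and may differ with other stemmers):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in ["studies", "ponies", "was"]:
    # Porter stemming can return strings that are not real English words,
    # e.g. "studi", "poni", "wa".
    print(word, "->", stemmer.stem(word))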
Lemmatization (cont.)
from nltk.stem import WordNetLemmatizer
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
for word in tokenized_text:
    print(WordNetLemmatizer().lemmatize(word))
Lemmatization (cont.)
In the above example, you’ll see that there is no
meaningful change after lemmatization.
That’s because you need to provide the lemmatization
function with Parts of Speech tags.
Lemmatization (cont.)
from nltk.stem import WordNetLemmatizer
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
for word in tokenized_text:
    print(WordNetLemmatizer().lemmatize(word, pos="v"))
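In practice the POS tag is usually not hard-coded. A common approach, sketched below (this helper is not part of the slides' code), is to map the Penn Treebank tags returned by nltk.pos_tag to WordNet's POS constants before lemmatizing:
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag (e.g. "VBG") to a WordNet POS constant.
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # reasonable default for everything else

lemmatizer = WordNetLemmatizer()
text = "The rabbit was running quickly towards the carrot"
for word, tag in pos_tag(word_tokenize(text)):
    print(lemmatizer.lemmatize(word, pos=to_wordnet_pos(tag)))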
Parts of Speech Tagging
Parts of Speech tagging is the process of tagging a
word in a text based on its definition and context.
Example: Tagging “likes” as a verb.
Note: To tag words in a text, you need to tokenize it
first.
Parts of Speech Tagging (cont.)
import nltk
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
print(nltk.pos_tag(tokenized_text))
Parts of Speech Tagging (cont.)
Tag    Meaning        Examples
ADJ    Adjective      new, good, high, special, big
ADP    Adposition     on, of, at, with, by, into, under
ADV    Adverb         really, already, still, early, now
CONJ   Conjunction    and, or, but, if, while
DET    Determiner     the, a, some, most, every
NOUN   Noun           year, home, costs, time
NUM    Numeral        twenty-four, fourth, 1991
PRT    Particle       at, on, out, over, per, that, up
PRON   Pronoun        he, their, her, its, my, I, us
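The tags in this table come from the universal tagset. By default nltk.pos_tag returns Penn Treebank tags, but it can be asked for universal tags instead; the sketch below assumes the universal_tagset resource has been downloaded.
import nltk
# nltk.download("universal_tagset")  # needed once for the tag mapping
text = "The rabbit was running quickly towards the carrot"
tokenized_text = nltk.word_tokenize(text)
print(nltk.pos_tag(tokenized_text, tagset="universal"))
# e.g. [('The', 'DET'), ('rabbit', 'NOUN'), ('was', 'VERB'), ...]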
WordNet
In lemmatization, we mentioned a process similar to
looking up a word in a dictionary. WordNet is what we
use for that lookup.
WordNet is similar to a database or a dictionary of links
and relationships between words.
WordNet (cont.)
WordNet is a large lexical database of English nouns,
adjectives, adverbs, and verbs. It is accessible in Python
through the Natural Language Toolkit (nltk.corpus.wordnet).
WordNet has been used for a number of purposes in
information systems, including
• Word-sense disambiguation
• Information retrieval
• Automatic text classification
• Automatic text summarization
• Machine translation
Example (Synsets and Lemmas)
In WordNet, similar words are grouped into a set known
as a Synset.
Every Synset has a name, a part-of-speech, and a
number. The words in a Synset are known as Lemmas.
Code
The function wordnet.synsets('word')
returns a list containing all the Synsets related to the
word passed to it as the argument.
from nltk.corpus import wordnet
synsets = wordnet.synsets("room")
print(synsets)
Output
[Synset('room.n.01'), Synset('room.n.02'), Synset('room.n.03'), Synset('room.n.04'),
Synset('board.v.02')]
Four of the Synsets have the name 'room' and are nouns, while the last
one's name is 'board' and it is a verb.
This also suggests that the word 'room' has a total of five
meanings or contexts.
WordNet
from nltk.corpus import wordnet
word = "hungry"
synset = wordnet.synsets(word)[0]
print("Name: " + synset.name())
print("Description: " + synset.definition())
print("Antonym: " + synset.lemmas()[0].antonyms()[0].name())
print("Examples: " + synset.examples()[0])
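The links and relationships mentioned earlier can also be explored directly. A small sketch showing a Synset's synonym lemmas, its hypernyms (more general concepts), and its hyponyms (more specific ones):
from nltk.corpus import wordnet
synset = wordnet.synsets("room")[0]   # first noun sense of "room"
print(synset.lemma_names())           # the synonymous words (Lemmas) in this Synset
print(synset.hypernyms())             # more general Synsets ("room is a kind of ...")
print(synset.hyponyms()[:3])          # a few more specific Synsets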
Try it out yourself
Code:
https://colab.research.google.com/drive/1wLjqqi4aLEY2
PWDcpax-_4tCyh946yVQ
Parts of Speech tagger:
https://parts-of-speech.info/
WordNet search:
http://wordnetweb.princeton.edu/perl/webwn
Task #1
Read a PDF file using the PyPDF2 library, extract the
text from the first page, tokenize it into sentences, and
then tag each sentence with the Parts of Speech tagger.
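One possible approach, sketched below; it assumes a recent PyPDF2 version (the PdfReader API) and a hypothetical file name sample.pdf.
import nltk
from PyPDF2 import PdfReader  # older PyPDF2 versions use PdfFileReader instead

reader = PdfReader("sample.pdf")                  # hypothetical input file
first_page_text = reader.pages[0].extract_text()  # text of the first page

for sentence in nltk.sent_tokenize(first_page_text):
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))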
Task #2
Use stemming to transform a word to its root form.
Task #3
Write code to determine the stems of the words in an
input sentence.
Thank you for your attention!
References
https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b
https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258
https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
https://www.nltk.org/book/ch05.html#tab-universal-tagset