Artificial Intelligence for Natural Language Processing (NLP)
Part II – From Word to Numerical Analysis
Dr. Eng. Wael Ouarda
Assistant Professor, CRNS, Higher Education Ministry, Tunisia
Centre de Recherche en Numérique de Sfax, Route de Tunis km 10, Sakiet Ezzit, 3021 Sfax, Tunisia
Wael Ouarda - CRNS 1
1. Machine Learning algorithm for NLP
Example: an emotion-recognition dataset of 100 persons labelled with 7 emotions.
1. Data scraping: collect the raw data (100 persons, 7 emotions).
2. Data splitting: 85 persons for train & validation, 15 persons for test.
3. Data cleaning.
4. Data representation model: word embedding.
5. Data partitioning: the 85 persons are split into 85 × 0.8 for training and 85 × 0.2 for validation, giving (X_train, Y_train), (X_val, Y_val) and (X_test, Y_test).
6. Machine learning (algorithm, options) trains the model.
7. Prediction: Y_val’ = model.predict(X_val) and Y_test’ = model.predict(X_test).
8. Performance evaluation on the validation and test predictions.
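The pipeline above (85/15 person-level split, then an 80/20 train/validation split) can be sketched with scikit-learn; the synthetic dataset and the logistic-regression classifier are illustrative assumptions standing in for the scraped emotion corpus:

```python
# Sketch of the train/validation/test pipeline, assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 100 samples ("persons"), 7 classes ("emotions") -- synthetic stand-in data
X, Y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           n_classes=7, random_state=0)

# 85 persons for train & validation, 15 for test
X_trainval, X_test, Y_trainval, Y_test = train_test_split(
    X, Y, test_size=15, random_state=0)

# 85 * 0.8 train, 85 * 0.2 validation
X_train, X_val, Y_train, Y_val = train_test_split(
    X_trainval, Y_trainval, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
Y_val_pred = model.predict(X_val)    # Y_val'
Y_test_pred = model.predict(X_test)  # Y_test'

val_acc = accuracy_score(Y_val, Y_val_pred)
test_acc = accuracy_score(Y_test, Y_test_pred)
```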
2. Web Scraping Tools
• Open-source Python libraries and frameworks for web scraping:
• Textual content:
• Newspaper3k: sends an HTTP request to the website’s server to retrieve the data displayed on the target web page;
• BeautifulSoup: a Python library designed to parse data, i.e., to extract data from HTML or XML documents;
• Selenium: a web driver that renders web pages like your web browser would, designed for automated testing of web applications;
• Scrapy: a complete web scraping framework designed explicitly for the job of scraping the web.
• Visual content:
• MechanicalSoup: a Python library designed to parse data, i.e., to extract URLs and hypertext from web pages.
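As a minimal illustration of the parsing role described above, BeautifulSoup can extract text and links from an HTML document already in memory (a hard-coded snippet here, rather than a page fetched over HTTP):

```python
# Parsing a small in-memory HTML document with BeautifulSoup (bs4).
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Breaking news</h1>
  <p>Some article text.</p>
  <a href="https://example.com/next">next page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()               # text extraction
links = [a["href"] for a in soup.find_all("a")]  # hyperlink extraction
```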
3. Libraries & Frameworks
• Newspaper3k: data scraping;
• Facebook Scraper;
• Pandas: file I/O;
• Seaborn: statistics and visualisation;
• NumPy: array manipulation;
• NLTK: Natural Language Toolkit (dictionary (graph = WordNet), stopwords, punctuation, etc.);
• re: regular expressions.
4. Cleaning process
1. Tokenization: split the document into a list of words.
2. Lower casing: transform upper case to lower case.
3. Stop words removal: stop words is a list of words = [‘When’, ‘I’, ‘How’, …] (it can be modified by removing some words or adding other ones).
4. Special character removal: @ # ’ ” etc.
5. Punctuation removal: : , ; - ? ! etc.
6. Stemming: reduce the word to its base form: player, players, played, plays -> play.
7. Lemmatization: have and had will be considered as have; plays and played will be considered as play.
8. Spell check.
9. Translation.
4. Cleaning process: Regular Expression (re)
Examples: @ali, @ahmed, #, ‘e’, ‘A12’, ‘A13’, … cannot be removed using NLTK functions. Regular expressions process the text shared on the web or on social media as a string.
• \d : matches any decimal digit; equivalent to the class [0-9].
• \D : matches any non-digit character; equivalent to the class [^0-9].
• \s : matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
• \S : matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
• \w : matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_].
• \W : matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_].
• Example: re.sub(r'[^@]', ' ', text) replaces every non-@ character with a space => @ @ @ @
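These escape classes can be exercised directly with Python's re module; the sample strings below are illustrative:

```python
import re

text = "@ali, @ahmed said: A12 and A13 at 10:30"

# \d -- strip every decimal digit
no_digits = re.sub(r"\d", "", text)

# \W -- replace every non-word character (not [a-zA-Z0-9_]) with a space
words_only = re.sub(r"\W", " ", text)

# The example above: replace every non-@ character with a space,
# leaving only the '@' signs
only_ats = re.sub(r"[^@]", " ", "@ali, @ahmed")
```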
4. Cleaning process: Regular Expression (re)
Pattern | Description
^ | Matches beginning of line (^ab means the string starts with ab).
$ | Matches end of line (a$ means the string ends with a).
. | Matches any single character except newline; with the DOTALL (re.S) option it matches newline as well.
[...] | Matches any single character in brackets.
[^...] | Matches any single character not in brackets.
4. Cleaning process: worked example
Input: Hi? How are you, I am very content to see you today :)!
1. Tokenization: [Hi, ?, How, are, you, ,, I, am, very, content, to, see, you, today, :, ), !]
2. Punctuation removal: [Hi, How, are, you, I, am, very, content, to, see, you, today, )]
3. Special character removal: [Hi, How, are, you, I, am, very, content, to, see, you, today]
4. Lower casing: [hi, how, are, you, i, am, very, content, to, see, you, today]
5. Translation & spell check: [hi, how, are, you, i, am, very, happy, to, see, you, today]
6. Stop words removal: [very, happy, see, today]
Without the spell-check correction, stop words removal instead yields [very, happiness, see, today].
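The cleaning steps can be sketched in plain Python; the stopword list and the content→happy substitution are toy assumptions standing in for NLTK's stopword corpus and a real spell checker/translator:

```python
import re

# Toy stand-ins; NLTK's stopword corpus and a real spell checker would be used in practice.
STOPWORDS = {"hi", "how", "are", "you", "i", "am", "to"}
CORRECTIONS = {"content": "happy"}

def clean(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)           # 1. tokenization
    tokens = [t for t in tokens if t.isalnum()]         # 2-3. punctuation & special chars
    tokens = [t.lower() for t in tokens]                # 4. lower casing
    tokens = [CORRECTIONS.get(t, t) for t in tokens]    # 5. spell check / translation (toy)
    tokens = [t for t in tokens if t not in STOPWORDS]  # 6. stop words removal
    return tokens

result = clean("Hi? How are you, I am very content to see you today :)!")
# -> ['very', 'happy', 'see', 'today']
```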
5. Sample of NLP Libraries for sentiment analysis
Sentiment = a tuple (polarity, subjectivity):
• Polarity in [-1 (negative), 1 (positive)]: the orientation of the opinion behind the text;
• Subjectivity in [0, 1]: the weight of subjectivity of the text.
Pipeline: Data collection -> Data cleaning -> Data representation -> Data classification.
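Libraries such as TextBlob return exactly such a (polarity, subjectivity) tuple. A toy lexicon-based sketch conveys the idea; the word scores below are invented for illustration:

```python
# Minimal lexicon-based sentiment sketch; the lexicon values are illustrative.
LEXICON = {  # word -> (polarity in [-1, 1], subjectivity in [0, 1])
    "happy": (0.8, 0.9),
    "sad": (-0.7, 0.9),
    "terrible": (-1.0, 1.0),
}

def sentiment(tokens):
    scored = [LEXICON[t] for t in tokens if t in LEXICON]
    if not scored:
        return (0.0, 0.0)  # neutral and objective when no opinion word is found
    polarity = sum(p for p, _ in scored) / len(scored)
    subjectivity = sum(s for _, s in scored) / len(scored)
    return (polarity, subjectivity)

pol, subj = sentiment(["very", "happy", "see", "today"])
```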
6. Word Embedding Techniques (TF-IDF)
TF-IDF: Term Frequency – Inverse Document Frequency
Terminology
• t — term (word)
• d — document (set of words)
• N — size of the corpus
• Corpus — the total document set

TF(t, d) = count of t in d / number of words in d
DF(t) = number of documents containing t (IDF = N/DF)
TF-IDF(t, d) = TF(t, d) * log(N / (DF + 1))

user | Tweets | Label
Id1 | Tweet 11 = [« word 111 », « word 112 »] -> TF = [0.5, 0.5] | +
Id1 | Tweet 12 | +
Id2 | Tweet 21 | -
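The formulas above translate directly into Python; here N is taken as the number of documents in the corpus (a common convention; the worked activity on the next slide uses a different N):

```python
import math

def tf(term, doc):
    # term frequency: count of t in d / number of words in d
    return doc.count(term) / len(doc)

def df(term, corpus):
    # document frequency: number of documents containing the term
    return sum(1 for doc in corpus if term in doc)

def tf_idf(term, doc, corpus):
    N = len(corpus)
    return tf(term, doc) * math.log(N / (df(term, corpus) + 1))

corpus = [
    ["bonjour", "ali", "bienvenue", "leaders"],
    ["bonsoir", "ahmed", "leaders", "souhaite", "bienvenue", "ahmed"],
    ["bonsoir", "ali", "ahmed"],
]
score = tf_idf("bonjour", corpus[0], corpus)  # 0.25 * log(3/2)
```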
6. Word Embedding Techniques (TF-IDF)
Activity: given
user | Tweets | Label
Id1 | [bonjour, ali, bienvenue, leaders] | +
Id1 | [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed] | +
Id2 | [bonsoir, ali, ahmed] | -

Using TF(t, d) = count of t in d / number of words in d, DF(t) = occurrence of t in documents (IDF = N/DF), TF-IDF(t, d) = tf(t, d) * log(N/(df + 1)), with N = 7:
TF-IDF(‘bonjour’, id1) = tf(‘bonjour’, id1) * log(N/(df + 1)) = 1 * log(7/2)
TF-IDF(‘ali’, id1) = tf(‘ali’, id1) * log(7/(df(‘ali’) + 1)) = 1 * log(7/3)
TF-IDF(‘ali’, id2) = tf(‘ali’, id2) * log(7/(df(‘ali’) + 1)) = 1 * log(7/3)
TF-IDF(‘ahmed’, id1) = 2 * log(7/4)
TF-IDF(‘ahmed’, id2) = ?
TF-IDF(‘bonsoir’) = ?
TF-IDF(‘leaders’) = 1 * log(7/3)
TF-IDF(‘souhaite’) = ?
TF-IDF(‘bienvenue’) = 1 * log(7/3)

N-gram to include context (N = 3):
[bonjour, ali, bienvenue], [ali, bienvenue, leaders]
[bonsoir, ahmed, leaders], [ahmed, leaders, souhaite], [leaders, souhaite, bienvenue]
[bonsoir, ali, ahmed]

Resulting trigram vectors: [log(7/2), log(7/3), log(7/3)] and [log(7/3), log(7/3), log(7/3)]
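Generating the context windows (trigrams here) is a one-liner:

```python
def ngrams(tokens, n=3):
    # Sliding window of n consecutive tokens
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

trigrams = ngrams(["bonjour", "ali", "bienvenue", "leaders"])
# -> [['bonjour', 'ali', 'bienvenue'], ['ali', 'bienvenue', 'leaders']]
```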
6. Word Embedding Techniques (Word2Vec)
Pipeline: word identification in the vocabulary (WordNet is the default dictionary, of size N) -> bag-of-words one-hot vector -> neural network training (weight matrices W and V) -> features-vector prediction.
Example: for the term “machine”, if the word is found in the vocabulary it is encoded as a one-hot vector (1 at its index, 0 elsewhere) and fed to the neural network; otherwise the lookup fails with an out-of-vocabulary error.
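The vocabulary lookup and one-hot encoding step, including the out-of-vocabulary failure, can be sketched as follows (the vocabulary here is a toy assumption):

```python
def one_hot(word, vocab):
    # vocab: ordered list of known words (the dictionary of size N)
    if word not in vocab:
        raise KeyError(f"out of vocabulary: {word}")
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["learning", "machine", "network", "word"]
encoded = one_hot("machine", vocab)  # -> [0, 1, 0, 0]
```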
6. Word Embedding Techniques (Word2Vec)
Some facts about the autoencoder:
● It represents the input in a low-dimensional space;
● It is an unsupervised learning algorithm (like PCA);
● It minimizes the same objective function as PCA (reconstruction error);
● It is a neural network;
● The neural network’s target output is its input: z = f(Wx), y = g(Vz), with x the input vector and x’ = y the output vector; training enforces x’ ≈ x.
Possible derivatives of the autoencoder: stacked autoencoder, sparse autoencoder.
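A linear autoencoder (f and g taken as identity activations) can be trained with a few lines of NumPy gradient descent; the data, dimensions, and learning rate below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))                   # 6 samples, 4 features
W = 0.1 * rng.standard_normal((2, 4))    # encoder: z = W x  (2-dim code)
V = 0.1 * rng.standard_normal((4, 2))    # decoder: x' = V z

def loss(X, W, V):
    # mean squared reconstruction error ||x' - x||^2
    return float(np.mean((X - X @ W.T @ V.T) ** 2))

initial = loss(X, W, V)
lr = 0.01
for _ in range(500):
    Z = X @ W.T                 # codes z
    E = Z @ V.T - X             # reconstruction error x' - x
    gV = 2 * E.T @ Z / len(X)   # gradient of the loss w.r.t. V
    gW = 2 * (E @ V).T @ X / len(X)  # gradient w.r.t. W
    V -= lr * gV
    W -= lr * gW
final = loss(X, W, V)           # reconstruction error has decreased
```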
6. Word Embedding Techniques (Word2Vec)
Activity: given
user | Tweets | Label
Id1 | [bonjour, ali, bienvenue, leaders] | +
Id1 | [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed] | +
Id2 | [bonsoir, ali, ahmed] | -

N = 4 is the size of the vocabulary and W is the size of the features vector. With N-grams to include context (N = 3), each word of the trigram [bonjour, ali, bienvenue] is one-hot encoded and multiplied by the input weight matrix of shape (4, W), giving three vectors (v11, …, v1W), (v21, …, v2W) and (v31, …, v3W). The final features vector is their average: ((v11 + v21 + v31)/3, …, (v1W + v2W + v3W)/3).
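Averaging the three projected one-hot vectors amounts to a mean over the selected rows of the weight matrix; a NumPy sketch with an invented 4 × W matrix:

```python
import numpy as np

vocab = ["bonjour", "ali", "bienvenue", "leaders"]  # N = 4
W_dim = 2                                           # features-vector size W (illustrative)
rng = np.random.default_rng(0)
E = rng.random((len(vocab), W_dim))                 # input weight matrix (4, W)

def trigram_vector(trigram):
    # one_hot(word) @ E selects E's row for that word;
    # the final features vector is the average of the three rows.
    rows = [E[vocab.index(w)] for w in trigram]
    return np.mean(rows, axis=0)

v = trigram_vector(["bonjour", "ali", "bienvenue"])
```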
7. Features Selection, Analysis and Transformation
• Transformation
• Linear transformation: Principal Component Analysis (PCA)
• Non-linear transformation: autoencoder
• Selection
• Heuristic methods: genetic algorithm, particle swarm optimization, ant colony optimization, etc.
• Statistical methods: correlation matrix
7. Features Selection, Analysis and Transformation
Given a dataset of N features and M samples.
The correlation matrix is based on the Pearson product-moment correlation:
M(feature I, feature J) = covariance(I, J) / (σ(I) * σ(J)), where σ is the standard deviation.

M is in [-1, 1]:
• [-1, -0.5]: I and J are highly inversely correlated;
• ]-0.5, 0]: I and J are not highly inversely correlated;
• ]0, 0.5]: I and J are not highly correlated;
• ]0.5, 1]: I and J are highly correlated.

Example with N = 3:
 | Feature I | Feature II | Feature III
Feature I | M(I,I) = 1 | 0.6 | -0.2
Feature II | 0.6 | M(II,II) = 1 | 0.001
Feature III | -0.2 | 0.001 | M(III,III) = 1

Features I & II are highly correlated (0.6), so we can drop one of them: N = 3 becomes N = 2, keeping (Features I & III) or (Features II & III).
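With NumPy, the Pearson correlation matrix of a feature matrix is a single call; the synthetic features below are constructed so that features I and II are highly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.random(100)                        # feature I
f2 = f1 + 0.05 * rng.standard_normal(100)   # feature II: nearly a copy of I
f3 = rng.random(100)                        # feature III: independent

M = np.corrcoef(np.vstack([f1, f2, f3]))    # 3x3 correlation matrix, diagonal = 1
# M[0, 1] is close to 1 -> drop feature I or feature II
```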
7. Features Selection, Analysis and Transformation
Principal Component Analysis
1. Compute the average of the dataset {Vi}: A = (1/N) * Sum(Vi).
2. Adjust the dataset: for i = 1..N, Va = Vi - A, giving the adjusted dataset {Vai}.
3. Transform the adjusted dataset into an N×N matrix (N features) and compute the N proper vectors (vi) by Singular Value Decomposition.
4. Each vector from the old dataset can be described as a weighted sum of the proper vectors: Vector1 = a1*v1 + a2*v2 + … + an*vn.
5. Sort the proper vectors by the variance they explain and keep those covering e.g. 85% of it (example scores: V1: 3/8, V2: 8/8, V3: 2/8, V4: 7/8, …, Vn).
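The steps above (centering, SVD, keeping the components that explain most of the variance) can be sketched in NumPy; the data and the 85% threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 4))                 # 20 samples, N = 4 features

A = X.mean(axis=0)                      # 1. compute the average
Xa = X - A                              # 2. adjust the dataset: Vai = Vi - A

# 3. SVD: the rows of Vt are the proper (eigen)vectors, sorted by singular value
U, S, Vt = np.linalg.svd(Xa, full_matrices=False)

explained = S**2 / np.sum(S**2)         # variance explained by each proper vector
k = int(np.searchsorted(np.cumsum(explained), 0.85)) + 1  # keep ~85% of variance

X_reduced = Xa @ Vt[:k].T               # project onto the k kept proper vectors
```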
8. NLP Applications
• NLP classification
• Spam & ham detector
• Fake news detector
• Sentiment analysis
• NLP topic modeling
• Word cloud visualisation
• Clustering data/users -> communities
• Chatbot
• Natural Language Processing (NLP): to process the natural-language input produced by the human
• Natural Language Generation (NLG): to generate the response to the human