Artificial Intelligence for Natural Language Processing (NLP)
Part II – From Word to Numerical Analysis
Dr. Eng. Wael Ouarda
Assistant Professor, CRNS, Higher Education Ministry, Tunisia
Centre de Recherche en Numérique de Sfax, Route de Tunis km 10, Sakiet Ezzit, 3021 Sfax, Tunisia
Wael Ouarda - CRNS 1
1. Machine Learning algorithm for NLP
Example: an emotion-recognition dataset of 100 persons labelled with 7 emotions.
1. Data scraping: collect the raw data (100 persons, 7 emotions).
2. Data splitting: 85 persons for train & validation, 15 persons for test.
3. Data cleaning.
4. Data representation model: word embedding.
5. Data partitioning: the 85 persons are split into 85 × 0.8 for training and 85 × 0.2 for validation, giving (X_train, Y_train), (X_val, Y_val) and (X_test, Y_test).
6. Machine learning (algorithm, options) trains the model.
7. Prediction: Y_val’ = model.predict(X_val) and Y_test’ = model.predict(X_test).
8. Performance evaluation on the validation and test predictions.
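The pipeline above (85/15 person-level split, then an 80/20 train/validation split) can be sketched with scikit-learn; the synthetic dataset and the logistic-regression classifier are illustrative assumptions standing in for the scraped emotion corpus:

```python
# Sketch of the train/validation/test pipeline, assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 100 samples ("persons"), 7 classes ("emotions") -- synthetic stand-in data
X, Y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           n_classes=7, random_state=0)

# 85 persons for train & validation, 15 for test
X_trainval, X_test, Y_trainval, Y_test = train_test_split(
    X, Y, test_size=15, random_state=0)

# 85 * 0.8 train, 85 * 0.2 validation
X_train, X_val, Y_train, Y_val = train_test_split(
    X_trainval, Y_trainval, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
Y_val_pred = model.predict(X_val)    # Y_val'
Y_test_pred = model.predict(X_test)  # Y_test'

val_acc = accuracy_score(Y_val, Y_val_pred)
test_acc = accuracy_score(Y_test, Y_test_pred)
```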
2. Web Scraping Tools
• Open-source Python libraries and frameworks for web scraping:
• Textual content:
• Newspaper3k: sends an HTTP request to the website’s server to retrieve the data displayed on the target web page;
• BeautifulSoup: a Python library designed to parse data, i.e., to extract data from HTML or XML documents;
• Selenium: a web driver that renders web pages like your web browser would, designed for automated testing of web applications;
• Scrapy: a complete web scraping framework designed explicitly for the job of scraping the web.
• Visual content:
• MechanicalSoup: a Python library designed to parse data, i.e., to extract URLs and hypertext from web pages.
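As a minimal illustration of the parsing role described above, BeautifulSoup can extract text and links from an HTML document already in memory (a hard-coded snippet here, rather than a page fetched over HTTP):

```python
# Parsing a small in-memory HTML document with BeautifulSoup (bs4).
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Breaking news</h1>
  <p>Some article text.</p>
  <a href="https://example.com/next">next page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()               # text extraction
links = [a["href"] for a in soup.find_all("a")]  # hyperlink extraction
```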
3. Libraries & Frameworks
• Newspaper3k: data scraping;
• Facebook Scraper;
• Pandas: file I/O;
• Seaborn: statistics and visualisation;
• NumPy: array manipulation;
• NLTK: Natural Language Toolkit (dictionary (graph = WordNet), stopwords, punctuation, etc.);
• re: regular expressions.
4. Cleaning process
1. Tokenization: split the document into a list of words.
2. Lower casing: transform upper case to lower case.
3. Stop words removal: stop words is a list of words = [‘When’, ‘I’, ‘How’, …] (it can be modified by removing some words or adding other ones).
4. Special character removal: @ # ’ ” etc.
5. Punctuation removal: : , ; - ? ! etc.
6. Stemming: reduce the word to its base form: player, players, played, plays -> play.
7. Lemmatization: have and had will be considered as have; plays and played will be considered as play.
8. Spell check.
9. Translation.
4. Cleaning process: Regular Expression (re)
Examples: @ali, @ahmed, #, ‘e’, ‘A12’, ‘A13’, … cannot be removed using NLTK functions. Regular expressions process the text shared on the web or on social media as a string.
• \d : matches any decimal digit; equivalent to the class [0-9].
• \D : matches any non-digit character; equivalent to the class [^0-9].
• \s : matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
• \S : matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
• \w : matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_].
• \W : matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_].
• Example: re.sub(r'[^@]', ' ', text) replaces every non-@ character with a space => @ @ @ @
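These escape classes can be exercised directly with Python's re module; the sample strings below are illustrative:

```python
import re

text = "@ali, @ahmed said: A12 and A13 at 10:30"

# \d -- strip every decimal digit
no_digits = re.sub(r"\d", "", text)

# \W -- replace every non-word character (not [a-zA-Z0-9_]) with a space
words_only = re.sub(r"\W", " ", text)

# The example above: replace every non-@ character with a space,
# leaving only the '@' signs
only_ats = re.sub(r"[^@]", " ", "@ali, @ahmed")
```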
4. Cleaning process: Regular Expression (re)
Pattern | Description
^ | Matches beginning of line (^ab means the string starts with ab).
$ | Matches end of line (a$ means the string ends with a).
. | Matches any single character except newline; with the DOTALL (re.S) option it matches newline as well.
[...] | Matches any single character in brackets.
[^...] | Matches any single character not in brackets.
4. Cleaning process: worked example
Input: Hi? How are you, I am very content to see you today :)!
1. Tokenization: [Hi, ?, How, are, you, ,, I, am, very, content, to, see, you, today, :, ), !]
2. Punctuation removal: [Hi, How, are, you, I, am, very, content, to, see, you, today, )]
3. Special character removal: [Hi, How, are, you, I, am, very, content, to, see, you, today]
4. Lower casing: [hi, how, are, you, i, am, very, content, to, see, you, today]
5. Translation & spell check: [hi, how, are, you, i, am, very, happy, to, see, you, today]
6. Stop words removal: [very, happy, see, today]
Without the spell-check correction, stop words removal instead yields [very, happiness, see, today].
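The cleaning steps can be sketched in plain Python; the stopword list and the content→happy substitution are toy assumptions standing in for NLTK's stopword corpus and a real spell checker/translator:

```python
import re

# Toy stand-ins; NLTK's stopword corpus and a real spell checker would be used in practice.
STOPWORDS = {"hi", "how", "are", "you", "i", "am", "to"}
CORRECTIONS = {"content": "happy"}

def clean(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)           # 1. tokenization
    tokens = [t for t in tokens if t.isalnum()]         # 2-3. punctuation & special chars
    tokens = [t.lower() for t in tokens]                # 4. lower casing
    tokens = [CORRECTIONS.get(t, t) for t in tokens]    # 5. spell check / translation (toy)
    tokens = [t for t in tokens if t not in STOPWORDS]  # 6. stop words removal
    return tokens

result = clean("Hi? How are you, I am very content to see you today :)!")
# -> ['very', 'happy', 'see', 'today']
```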
5. Sample of NLP Libraries for sentiment analysis
Sentiment = a tuple (polarity, subjectivity):
• Polarity in [-1 (negative), 1 (positive)]: the orientation of the opinion behind the text;
• Subjectivity in [0, 1]: the weight of subjectivity of the text.
Pipeline: Data collection -> Data cleaning -> Data representation -> Data classification.
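Libraries such as TextBlob return exactly such a (polarity, subjectivity) tuple. A toy lexicon-based sketch conveys the idea; the word scores below are invented for illustration:

```python
# Minimal lexicon-based sentiment sketch; the lexicon values are illustrative.
LEXICON = {  # word -> (polarity in [-1, 1], subjectivity in [0, 1])
    "happy": (0.8, 0.9),
    "sad": (-0.7, 0.9),
    "terrible": (-1.0, 1.0),
}

def sentiment(tokens):
    scored = [LEXICON[t] for t in tokens if t in LEXICON]
    if not scored:
        return (0.0, 0.0)  # neutral and objective when no opinion word is found
    polarity = sum(p for p, _ in scored) / len(scored)
    subjectivity = sum(s for _, s in scored) / len(scored)
    return (polarity, subjectivity)

pol, subj = sentiment(["very", "happy", "see", "today"])
```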
6. Word Embedding Techniques (TF-IDF)
TF-IDF: Term Frequency – Inverse Document Frequency
Terminology
• t — term (word)
• d — document (set of words)
• N — size of the corpus
• Corpus — the total document set

TF(t, d) = count of t in d / number of words in d
DF(t) = number of documents containing t (IDF = N/DF)
TF-IDF(t, d) = TF(t, d) * log(N / (DF + 1))

user | Tweets | Label
Id1 | Tweet 11 = [« word 111 », « word 112 »] -> TF = [0.5, 0.5] | +
Id1 | Tweet 12 | +
Id2 | Tweet 21 | -
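The formulas above translate directly into Python; here N is taken as the number of documents in the corpus (a common convention; the worked activity on the next slide uses a different N):

```python
import math

def tf(term, doc):
    # term frequency: count of t in d / number of words in d
    return doc.count(term) / len(doc)

def df(term, corpus):
    # document frequency: number of documents containing the term
    return sum(1 for doc in corpus if term in doc)

def tf_idf(term, doc, corpus):
    N = len(corpus)
    return tf(term, doc) * math.log(N / (df(term, corpus) + 1))

corpus = [
    ["bonjour", "ali", "bienvenue", "leaders"],
    ["bonsoir", "ahmed", "leaders", "souhaite", "bienvenue", "ahmed"],
    ["bonsoir", "ali", "ahmed"],
]
score = tf_idf("bonjour", corpus[0], corpus)  # 0.25 * log(3/2)
```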
6. Word Embedding Techniques (TF-IDF)
Activity: given
user | Tweets | Label
Id1 | [bonjour, ali, bienvenue, leaders] | +
Id1 | [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed] | +
Id2 | [bonsoir, ali, ahmed] | -

Using TF(t, d) = count of t in d / number of words in d, DF(t) = occurrence of t in documents (IDF = N/DF), TF-IDF(t, d) = tf(t, d) * log(N/(df + 1)), with N = 7:
TF-IDF(‘bonjour’, id1) = tf(‘bonjour’, id1) * log(N/(df + 1)) = 1 * log(7/2)
TF-IDF(‘ali’, id1) = tf(‘ali’, id1) * log(7/(df(‘ali’) + 1)) = 1 * log(7/3)
TF-IDF(‘ali’, id2) = tf(‘ali’, id2) * log(7/(df(‘ali’) + 1)) = 1 * log(7/3)
TF-IDF(‘ahmed’, id1) = 2 * log(7/4)
TF-IDF(‘ahmed’, id2) = ?
TF-IDF(‘bonsoir’) = ?
TF-IDF(‘leaders’) = 1 * log(7/3)
TF-IDF(‘souhaite’) = ?
TF-IDF(‘bienvenue’) = 1 * log(7/3)

N-gram to include context (N = 3):
[bonjour, ali, bienvenue], [ali, bienvenue, leaders]
[bonsoir, ahmed, leaders], [ahmed, leaders, souhaite], [leaders, souhaite, bienvenue]
[bonsoir, ali, ahmed]

Resulting trigram vectors: [log(7/2), log(7/3), log(7/3)] and [log(7/3), log(7/3), log(7/3)]
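Generating the context windows (trigrams here) is a one-liner:

```python
def ngrams(tokens, n=3):
    # Sliding window of n consecutive tokens
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

trigrams = ngrams(["bonjour", "ali", "bienvenue", "leaders"])
# -> [['bonjour', 'ali', 'bienvenue'], ['ali', 'bienvenue', 'leaders']]
```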
6. Word Embedding Techniques (Word2Vec)
Pipeline: word identification in the vocabulary (WordNet is the default dictionary, of size N) -> bag-of-words one-hot vector -> neural network training (weight matrices W and V) -> features-vector prediction.
Example: for the term “machine”, if the word is found in the vocabulary it is encoded as a one-hot vector (1 at its index, 0 elsewhere) and fed to the neural network; otherwise the lookup fails with an out-of-vocabulary error.
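The vocabulary lookup and one-hot encoding step, including the out-of-vocabulary failure, can be sketched as follows (the vocabulary here is a toy assumption):

```python
def one_hot(word, vocab):
    # vocab: ordered list of known words (the dictionary of size N)
    if word not in vocab:
        raise KeyError(f"out of vocabulary: {word}")
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["learning", "machine", "network", "word"]
encoded = one_hot("machine", vocab)  # -> [0, 1, 0, 0]
```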
6. Word Embedding Techniques (Word2Vec)
Some facts about the autoencoder:
● It represents the input in a low-dimensional space;
● It is an unsupervised learning algorithm (like PCA);
● It minimizes the same objective function as PCA (reconstruction error);
● It is a neural network;
● The neural network’s target output is its input: z = f(Wx), y = g(Vz), with x the input vector and x’ = y the output vector; training enforces x’ ≈ x.
Possible derivatives of the autoencoder: stacked autoencoder, sparse autoencoder.
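A linear autoencoder (f and g taken as identity activations) can be trained with a few lines of NumPy gradient descent; the data, dimensions, and learning rate below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))                   # 6 samples, 4 features
W = 0.1 * rng.standard_normal((2, 4))    # encoder: z = W x  (2-dim code)
V = 0.1 * rng.standard_normal((4, 2))    # decoder: x' = V z

def loss(X, W, V):
    # mean squared reconstruction error ||x' - x||^2
    return float(np.mean((X - X @ W.T @ V.T) ** 2))

initial = loss(X, W, V)
lr = 0.01
for _ in range(500):
    Z = X @ W.T                 # codes z
    E = Z @ V.T - X             # reconstruction error x' - x
    gV = 2 * E.T @ Z / len(X)   # gradient of the loss w.r.t. V
    gW = 2 * (E @ V).T @ X / len(X)  # gradient w.r.t. W
    V -= lr * gV
    W -= lr * gW
final = loss(X, W, V)           # reconstruction error has decreased
```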
6. Word Embedding Techniques (Word2Vec)
Activity: given
user | Tweets | Label
Id1 | [bonjour, ali, bienvenue, leaders] | +
Id1 | [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed] | +
Id2 | [bonsoir, ali, ahmed] | -

N = 4 is the size of the vocabulary and W is the size of the features vector. With N-grams to include context (N = 3), each word of the trigram [bonjour, ali, bienvenue] is one-hot encoded and multiplied by the input weight matrix of shape (4, W), giving three vectors (v11, …, v1W), (v21, …, v2W) and (v31, …, v3W). The final features vector is their average: ((v11 + v21 + v31)/3, …, (v1W + v2W + v3W)/3).
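Averaging the three projected one-hot vectors amounts to a mean over the selected rows of the weight matrix; a NumPy sketch with an invented 4 × W matrix:

```python
import numpy as np

vocab = ["bonjour", "ali", "bienvenue", "leaders"]  # N = 4
W_dim = 2                                           # features-vector size W (illustrative)
rng = np.random.default_rng(0)
E = rng.random((len(vocab), W_dim))                 # input weight matrix (4, W)

def trigram_vector(trigram):
    # one_hot(word) @ E selects E's row for that word;
    # the final features vector is the average of the three rows.
    rows = [E[vocab.index(w)] for w in trigram]
    return np.mean(rows, axis=0)

v = trigram_vector(["bonjour", "ali", "bienvenue"])
```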
7. Features Selection, Analysis and Transformation
• Transformation
• Linear transformation: Principal Component Analysis (PCA)
• Non-linear transformation: autoencoder
• Selection
• Heuristic methods: genetic algorithm, particle swarm optimization, ant colony optimization, etc.
• Statistical methods: correlation matrix
7. Features Selection, Analysis and Transformation
Given a dataset of N features and M samples.
The correlation matrix is based on the Pearson product-moment correlation:
M(feature I, feature J) = covariance(I, J) / (σ(I) * σ(J)), where σ is the standard deviation.

M is in [-1, 1]:
• [-1, -0.5]: I and J are highly inversely correlated;
• ]-0.5, 0]: I and J are not highly inversely correlated;
• ]0, 0.5]: I and J are not highly correlated;
• ]0.5, 1]: I and J are highly correlated.

Example with N = 3:
 | Feature I | Feature II | Feature III
Feature I | M(I,I) = 1 | 0.6 | -0.2
Feature II | 0.6 | M(II,II) = 1 | 0.001
Feature III | -0.2 | 0.001 | M(III,III) = 1

Features I & II are highly correlated (0.6), so we can drop one of them: N = 3 becomes N = 2, keeping (Features I & III) or (Features II & III).
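With NumPy, the Pearson correlation matrix of a feature matrix is a single call; the synthetic features below are constructed so that features I and II are highly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.random(100)                        # feature I
f2 = f1 + 0.05 * rng.standard_normal(100)   # feature II: nearly a copy of I
f3 = rng.random(100)                        # feature III: independent

M = np.corrcoef(np.vstack([f1, f2, f3]))    # 3x3 correlation matrix, diagonal = 1
# M[0, 1] is close to 1 -> drop feature I or feature II
```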
7. Features Selection, Analysis and Transformation
Principal Component Analysis
1. Compute the average of the dataset {Vi}: A = (1/N) * Sum(Vi).
2. Adjust the dataset: for i = 1..N, Va = Vi - A, giving the adjusted dataset {Vai}.
3. Transform the adjusted dataset into an N×N matrix (N features) and compute the N proper vectors (vi) by Singular Value Decomposition.
4. Each vector from the old dataset can be described as a weighted sum of the proper vectors: Vector1 = a1*v1 + a2*v2 + … + an*vn.
5. Sort the proper vectors by the variance they explain and keep those covering e.g. 85% of it (example scores: V1: 3/8, V2: 8/8, V3: 2/8, V4: 7/8, …, Vn).
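The steps above (centering, SVD, keeping the components that explain most of the variance) can be sketched in NumPy; the data and the 85% threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 4))                 # 20 samples, N = 4 features

A = X.mean(axis=0)                      # 1. compute the average
Xa = X - A                              # 2. adjust the dataset: Vai = Vi - A

# 3. SVD: the rows of Vt are the proper (eigen)vectors, sorted by singular value
U, S, Vt = np.linalg.svd(Xa, full_matrices=False)

explained = S**2 / np.sum(S**2)         # variance explained by each proper vector
k = int(np.searchsorted(np.cumsum(explained), 0.85)) + 1  # keep ~85% of variance

X_reduced = Xa @ Vt[:k].T               # project onto the k kept proper vectors
```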
8. NLP Applications
• NLP classification
• Spam & ham detector
• Fake news detector
• Sentiment analysis
• NLP topic modeling
• Word cloud visualisation
• Clustering data/users -> communities
• Chatbot
• Natural Language Processing (NLP): to process the natural-language input produced by the human
• Natural Language Generation (NLG): to generate the response to the human