Natural Language Processing
Prepared by: Abdelrahman M. Safwat
Section (5) – Machine Learning Basics
What is machine learning?
“Machine learning is the scientific study of
algorithms and statistical models that computer
systems use to perform a specific task without
using explicit instructions, relying on patterns and
inference instead.”
Types of machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
Supervised learning
Supervised learning is a type of machine learning
where you have data together with the known outputs
(labels) for that data, and you want to build a program
that can predict the output for new, unseen data.
Uses of supervised learning include:
Classification
Regression
Classification & Regression
Classification: the output variable is a category, such as “red” or
“blue”, or “disease” and “no disease”. A classification model attempts
to draw some conclusion from observed values: given one or more
inputs, it tries to predict the value of one or more outcomes.
Regression: the output variable is a real or continuous value, such as
“salary” or “weight”. Many different models can be used; the simplest
is linear regression, which tries to fit the data with the best
hyper-plane through the points.
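To make the distinction concrete, here is a minimal sketch with scikit-learn on tiny made-up numbers (the salaries and temperatures below are invented for illustration, not taken from any dataset):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g. salary from years of experience).
X_years = [[1], [2], [3], [4]]
salaries = [30, 35, 40, 45]               # hypothetical salaries, in thousands
reg = LinearRegression().fit(X_years, salaries)
print(reg.predict([[5]]))                 # a continuous estimate (~50)

# Classification: predict a category (e.g. "disease" / "no disease").
X_temp = [[36.5], [36.8], [39.0], [39.5]]
labels = ["no disease", "no disease", "disease", "disease"]
clf = LogisticRegression().fit(X_temp, labels)
print(clf.predict([[39.2]]))              # a category, not a number
```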
Unsupervised learning
Unsupervised learning is a type of machine learning
where you have data without known outputs, and you
want to build a program that can find patterns in
that data.
Uses of unsupervised learning include:
Clustering
Clustering
Clustering is the act of organizing similar objects into groups within a machine
learning algorithm.
This is done by scanning the unlabeled dataset in a machine learning model and
measuring specific features of the data points. The cluster analysis
will then classify and place the data points into groups with matching features.
Once data has been grouped together, each group is assigned a cluster ID number
to help identify the cluster’s characteristics.
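The idea can be sketched with scikit-learn’s KMeans on made-up 2-D points; note there are no labels here — the algorithm groups the points by similarity and assigns each one a cluster ID:

```python
from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [2, 1],    # one tight group of points
          [8, 8], [8, 9], [9, 8]]    # another tight group, far away
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)                # the cluster ID assigned to each point
```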
Idea
We want to create a machine learning model
that can take a Tweet from Twitter and decide
whether it’s a positive Tweet or a negative
one.
Machine Learning Steps We’ll Study
Preparing data
Splitting our data for training and testing
Choosing an algorithm
Training our model
Testing our model
Acquiring our data
First, we’ll need to get the data we want to train
our model on. We can either gather Tweets
ourselves or try to find someone who
already did that. Luckily, there’s already a
dataset for that:
cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Loading our data
Next, we need to load our data. As we can see,
the format of our dataset is CSV, so we’ll
use pandas to load our data.
import pandas as pd
df = pd.read_csv('training.1600000.processed.noemoticon.csv')
Loading our data
Running the code in the previous slide will
result in an error, because we didn’t
consider the encoding of the text.
import pandas as pd
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1')
Loading our data
If we run df.head() to get a sample of our
data, we’ll find that it doesn’t say what each
column represents. If the column names aren’t
in the file, we can specify them ourselves
with pandas.
import pandas as pd
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', names=["target", "id", "date", "flag", "user", "text"])
Loading our data
Sometimes the file itself won’t contain the
column names, but in those cases you’ll
probably find them on the page you
downloaded the dataset from.
Loading our data
We also need to separate the data into input
and output.
X = df["text"]
y = df["target"]
Cleaning and preprocessing our data
Next, we need to clean and preprocess our data.
The dataset we chose already has most of the
cleaning done; we only need to clean it a bit
further.
We need to remove URLs, hashtags and other
information we don’t need from the Tweets.
To do so, we’ll use the Tweet Preprocessor library.
!pip install -i https://pypi.anaconda.org/berber/simple tweet-preprocessor
Cleaning and preprocessing our data
We then need to apply the preprocessing
function on each row in our input data.
import preprocessor as p
X_preprocessed = X.apply(lambda tweet: p.clean(tweet))
Cleaning and preprocessing our data
After that, we need to prepare it to be ready for the
machine learning model.
The input to the model needs to be numeric, so we
need to find a numeric representation of our text.
There are several representations that we can use,
like Bag of Words and TF-IDF.
What is Bag of Words?
“A Bag of Words is a representation of text that
describes the occurrence of words within a
document.”
Bag of Words
A Bag of Words is basically a matrix of how
many times each word occurs in a document.
However, it takes only the frequency into
consideration; it doesn’t tell us how relevant
a word is.
What is TF-IDF?
“TF-IDF, short for term frequency–inverse
document frequency, is a numerical statistic that is
intended to reflect how important a word is to a
document in a collection or corpus.”
TF-IDF
TF-IDF is based on two things, TF (Term
Frequency) and IDF (Inverse Document
Frequency).
Term Frequency
Term Frequency determines how important a word is in a specific
document, calculated as the number of times the word occurs in a
document divided by the total number of words in that document.
Notice that we use the Bag of Words to compute the
Term Frequency.
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of times word $i$ occurs in document $j$.
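The formula can be checked by hand on a tiny made-up document:

```python
# Term Frequency of "good" in a four-word document:
# occurrences of the word divided by the total number of words.
doc = "good movie good acting".split()
tf_good = doc.count("good") / len(doc)
print(tf_good)   # 2 / 4 = 0.5
```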
Inverse Document Frequency
Inverse Document Frequency tells us how
unique a word is, by calculating the log of the total
number of documents divided by the number of
documents containing that word.
$$\mathrm{idf}(w) = \log\left(\frac{N}{\mathrm{df}_w}\right)$$

where $N$ is the total number of documents and $\mathrm{df}_w$ is the number of documents containing $w$.
TF-IDF
TF-IDF is then calculated by multiplying the
Term Frequency by the Inverse Document
Frequency.
This basically gives us how relevant a word is.
$$w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}(w)$$
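The whole computation can be done by hand on a tiny made-up corpus (this sketch uses the plain, unsmoothed formulas above with a natural log; scikit-learn’s TfidfVectorizer uses a slightly different smoothed variant, so its numbers won’t match exactly):

```python
import math

docs = [["good", "movie", "good"], ["bad", "movie"], ["good", "film"]]
doc = docs[0]

tf = doc.count("good") / len(doc)                  # term frequency: 2/3
docs_with_word = sum("good" in d for d in docs)    # "good" appears in 2 of 3 documents
idf = math.log(len(docs) / docs_with_word)         # log(3/2)
print(tf * idf)                                    # the TF-IDF weight of "good" in doc 0
```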
Cleaning and preprocessing our data
We’ll use Scikit-Learn’s TF-IDF implementation.
We need to fit it using our data to later use it to transform
any data we need to preprocess.
We also need to consider the ngrams (how many
consecutive words should we put in consideration) and
remove the stop words.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(encoding='latin-1', ngram_range=(1, 2), stop_words='english')
tfidf = tfidf.fit(X_preprocessed)
X_tfidf = tfidf.transform(X_preprocessed)
Splitting our data for training and testing
Once we’re done with cleaning and
preprocessing, we need to split our data for
training and testing. We’ll use Scikit-Learn for
that.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3)
# note that the training set is 70% of the dataset and the test set is 30%
Choosing our algorithm
After that, we need to choose our algorithm. For
simplicity, we’ll use Logistic Regression in
this project (despite its name, it is a classification
algorithm). Scikit-Learn already has an
implementation of it that we can use.
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression()
Training our model
For the model to start learning, we simply need
to give the Logistic Regression algorithm our
input and expected output to begin training.
model = regressor.fit(X_train,y_train)
Testing our model
Once we’re done training the model, we need to
test it using our test set.
y_predict = model.predict(X_test)
Testing our model
Now we need to measure our accuracy by
comparing the predicted output with the actual
output.
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_predict)
print(score)
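Accuracy is simply the share of predictions that match the true labels, which a toy example makes clear (the labels below are invented for illustration):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 4, 4, 0]        # made-up "actual" sentiment labels
y_guess = [0, 4, 0, 0]       # made-up "predicted" labels: 3 of 4 correct
print(accuracy_score(y_true, y_guess))   # 0.75
```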
Testing our model
We can input our own Tweets to the model now.
We just need to preprocess the Tweet the same way we
preprocessed our dataset and use it as input to our
model.
Notice that the TF-IDF vectorizer expects a list of documents
as input; that’s why we turn our text into a list.
text = "This sandwich is really good"
text = p.clean(text)
text = [text]
text_tfidf = tfidf.transform(text)
text_predict = model.predict(text_tfidf)
print(text_predict)
About the project
You must use a dataset and more than one machine learning
algorithm in the project for training and testing.
Use different machine learning algorithms to compare results to
find the best accuracy.
The number of machine learning algorithms will be equal to the
number of students in the project group.
Write the results of different algorithms in the project
documentation.
Try it out yourself
Code:
https://colab.research.google.com/drive/1Bp3y63e031O
xOd5EOF9RQYPVmfEv-dCg
Task #1
Get text input from the user, try using the model on
that input and output the result to the user.
The output should say “Good”, “Bad” or “Neutral”, not
the numbers, as the model outputs only numbers.
(0 for “Bad”, 2 for “Neutral” and 4 for “Good”)
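One possible shape for the number-to-label part of this task (a sketch; the commented lines assume the trained model, tfidf and preprocessor p from the earlier slides are in scope):

```python
# Map the model's numeric output to the labels the user should see.
labels = {0: "Bad", 2: "Neutral", 4: "Good"}

def describe(prediction):
    return labels[int(prediction)]

# With the pipeline from the slides, the full loop would be roughly:
#   text = p.clean(input("Enter a tweet: "))
#   print(describe(model.predict(tfidf.transform([text]))[0]))
print(describe(0), describe(2), describe(4))   # Bad Neutral Good
```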
Task #2
Try improving the accuracy of the model by playing
around with the parameters of the training or of
TF-IDF, by using different algorithms, or a mix of both.
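One systematic way to “play around with the parameters” is scikit-learn’s GridSearchCV, sketched here on stand-in data from make_classification rather than the Tweet dataset (the parameter grid is just an example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)   # stand-in data
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.1, 1.0, 10.0]},   # example grid
                      scoring="accuracy", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # winning parameters and their accuracy
```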
Thank you for your attention!