Natural Language Processing
Prepared by: Abdelrahman M. Safwat
Section (5) – Machine Learning Basics
What is machine learning?
“Machine learning is the scientific study of
algorithms and statistical models that computer
systems use to perform a specific task without
using explicit instructions, relying on patterns and
inference instead.”
Types of machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
Supervised learning
Supervised learning is a type of machine learning
where you have data together with the known outputs
(labels) for that data, and you want to build a program
that can predict the output for new, unseen data.
Uses of supervised learning include:
Classification
Regression
Classification & Regression
Classification: the output variable is a category, such as “red” or
“blue”, or “disease” and “no disease”. A classification model attempts
to draw some conclusion from observed values: given one or more
inputs, it tries to predict the value of one or more outcomes.
Regression: the output variable is a real or continuous value, such as
“salary” or “weight”. Many different models can be used; the simplest
is linear regression, which tries to fit the data with the best
hyper-plane through the points.
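To make the distinction concrete, here is a minimal sketch with scikit-learn on tiny made-up numbers (the salaries and temperatures below are invented for illustration, not taken from any dataset):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g. salary from years of experience).
X_years = [[1], [2], [3], [4]]
salaries = [30, 35, 40, 45]               # hypothetical salaries, in thousands
reg = LinearRegression().fit(X_years, salaries)
print(reg.predict([[5]]))                 # a continuous estimate (~50)

# Classification: predict a category (e.g. "disease" / "no disease").
X_temp = [[36.5], [36.8], [39.0], [39.5]]
labels = ["no disease", "no disease", "disease", "disease"]
clf = LogisticRegression().fit(X_temp, labels)
print(clf.predict([[39.2]]))              # a category, not a number
```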
Unsupervised learning
Unsupervised learning is a type of machine learning
where you have data without known outputs, and you
want to build a program that can find patterns in
that data.
Uses of unsupervised learning include:
Clustering
Clustering
Clustering is the act of organizing similar objects into groups within a machine
learning algorithm.
This is done by scanning the unlabeled dataset in a machine learning model and
measuring specific features of the data points. The cluster analysis
will then classify and place the data points into groups with matching features.
Once data has been grouped together, each group is assigned a cluster ID number
to help identify the cluster’s characteristics.
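The idea can be sketched with scikit-learn’s KMeans on made-up 2-D points; note there are no labels here — the algorithm groups the points by similarity and assigns each one a cluster ID:

```python
from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [2, 1],    # one tight group of points
          [8, 8], [8, 9], [9, 8]]    # another tight group, far away
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)                # the cluster ID assigned to each point
```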
Idea
We want to create a machine learning model
that can take a Tweet from Twitter and decide
whether it’s a positive Tweet or a negative
one.
Machine Learning Steps We’ll Study
Preparing data
Splitting our data for training and testing
Choosing an algorithm
Training our model
Testing our model
Acquiring our data
First, we’ll need to get the data we want to train
our model on. We can either gather Tweets
ourselves or try to find someone who
already did that. Luckily, there’s already a
dataset for that:
cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Loading our data
Next, we need to load our data. As we can see,
the format of our dataset is CSV, so we’ll
use pandas to load our data.
import pandas as pd
df = pd.read_csv('training.1600000.processed.noemoticon.csv')
Loading our data
Running the code in the previous slide will
result in an error, because we didn’t
consider the encoding of the text.
import pandas as pd
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1')
Loading our data
If we run df.head() to get a sample of our
data, we’ll find that it doesn’t say what each
column represents. If the column names aren’t
in the file, we can specify them ourselves
with pandas.
import pandas as pd
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', names=["target", "id", "date", "flag", "user", "text"])
Loading our data
Sometimes the file itself won’t contain the
column names, but in those cases you’ll
probably find them on the page you
downloaded the dataset from.
Loading our data
We also need to separate the data into input
and output.
X = df["text"]
y = df["target"]
Cleaning and preprocessing our data
Next, we need to clean and preprocess our data.
The dataset we chose already has most of the
cleaning done; we only need to clean it a bit
further.
We need to remove URLs, hashtags and other
information we don’t need from the Tweets.
To do so, we’ll use the Tweet Preprocessor library.
!pip install -i https://pypi.anaconda.org/berber/simple tweet-preprocessor
Cleaning and preprocessing our data
We then need to apply the preprocessing
function on each row in our input data.
import preprocessor as p
X_preprocessed = X.apply(lambda tweet: p.clean(tweet))
Cleaning and preprocessing our data
After that, we need to prepare it to be ready for the
machine learning model.
The input to the model needs to be numeric, so we
need to find a numeric representation of our text.
There are several representations that we can use,
like Bag of Words and TF-IDF.
What is Bag of Words?
“A Bag of Words is a representation of text that
describes the occurrence of words within a
document.”
Bag of Words
A Bag of Words is basically a matrix of how
many times each word occurs in a document.
However, it takes only the frequency into
consideration; it doesn’t tell us how relevant
a word is.
What is TF-IDF?
“TF-IDF, short for term frequency–inverse
document frequency, is a numerical statistic that is
intended to reflect how important a word is to a
document in a collection or corpus.”
TF-IDF
TF-IDF is based on two things, TF (Term
Frequency) and IDF (Inverse Document
Frequency).
Term Frequency
Term Frequency determines how important a word is in a specific
document, calculated as the number of times the word occurs in a
document divided by the total number of words in that document.
Notice that we use the Bag of Words to compute the
Term Frequency.
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of times word $i$ occurs in document $j$.
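The formula can be checked by hand on a tiny made-up document:

```python
# Term Frequency of "good" in a four-word document:
# occurrences of the word divided by the total number of words.
doc = "good movie good acting".split()
tf_good = doc.count("good") / len(doc)
print(tf_good)   # 2 / 4 = 0.5
```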
Inverse Document Frequency
Inverse Document Frequency tells us how
unique a word is, by calculating the log of the total
number of documents divided by the number of
documents containing that word.
$$\mathrm{idf}(w) = \log\left(\frac{N}{\mathrm{df}_w}\right)$$

where $N$ is the total number of documents and $\mathrm{df}_w$ is the number of documents containing $w$.
TF-IDF
TF-IDF is then calculated by multiplying the
Term Frequency by the Inverse Document
Frequency.
This basically gives us how relevant a word is.
$$w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}(w)$$
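The whole computation can be done by hand on a tiny made-up corpus (this sketch uses the plain, unsmoothed formulas above with a natural log; scikit-learn’s TfidfVectorizer uses a slightly different smoothed variant, so its numbers won’t match exactly):

```python
import math

docs = [["good", "movie", "good"], ["bad", "movie"], ["good", "film"]]
doc = docs[0]

tf = doc.count("good") / len(doc)                  # term frequency: 2/3
docs_with_word = sum("good" in d for d in docs)    # "good" appears in 2 of 3 documents
idf = math.log(len(docs) / docs_with_word)         # log(3/2)
print(tf * idf)                                    # the TF-IDF weight of "good" in doc 0
```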
Cleaning and preprocessing our data
We’ll use Scikit-Learn’s TF-IDF implementation.
We need to fit it using our data to later use it to transform
any data we need to preprocess.
We also need to consider the ngrams (how many
consecutive words should we put in consideration) and
remove the stop words.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(encoding='latin-1', ngram_range=(1, 2), stop_words='english')
tfidf = tfidf.fit(X_preprocessed)
X_tfidf = tfidf.transform(X_preprocessed)
Splitting our data for training and testing
Once we’re done with cleaning and
preprocessing, we need to split our data for
training and testing. We’ll use Scikit-Learn for
that.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3)
# note that the training set is 70% of the dataset and the test set is 30%
Choosing our algorithm
After that, we need to choose our algorithm. For
simplicity, we’ll use Logistic Regression in
this project (despite its name, it is a classification
algorithm). Scikit-Learn already has an
implementation of it that we can use.
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression()
Training our model
For the model to start learning, we simply need
to give the Logistic Regression algorithm our
input and expected output to begin training.
model = regressor.fit(X_train,y_train)
Testing our model
Once we’re done training the model, we need to
test it using our test set.
y_predict = model.predict(X_test)
Testing our model
Now we need to measure our accuracy by
comparing the predicted output with the actual
output.
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_predict)
print(score)
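Accuracy is simply the share of predictions that match the true labels, which a toy example makes clear (the labels below are invented for illustration):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 4, 4, 0]        # made-up "actual" sentiment labels
y_guess = [0, 4, 0, 0]       # made-up "predicted" labels: 3 of 4 correct
print(accuracy_score(y_true, y_guess))   # 0.75
```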
Testing our model
We can input our own Tweets to the model now.
We just need to preprocess the Tweet the same way we
preprocessed our dataset and use it as input to our
model.
Notice that the TF-IDF vectorizer expects a list of documents
as input; that’s why we turn our text into a list.
text = "This sandwich is really good"
text = p.clean(text)
text = [text]
text_tfidf = tfidf.transform(text)
text_predict = model.predict(text_tfidf)
print(text_predict)
About the project
You must use a dataset and more than one machine learning
algorithm in the project for training and testing.
Use different machine learning algorithms to compare results to
find the best accuracy.
The number of machine learning algorithms will be equal to the
number of students in the project group.
Write the results of different algorithms in the project
documentation.
Try it out yourself
Code:
https://colab.research.google.com/drive/1Bp3y63e031O
xOd5EOF9RQYPVmfEv-dCg
Task #1
Get text input from the user, try using the model on
that input and output the result to the user.
The output should say “Good”, “Bad” or “Neutral”, not
the numbers, as the model outputs only numbers.
(0 for “Bad”, 2 for “Neutral” and 4 for “Good”)
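One possible shape for the number-to-label part of this task (a sketch; the commented lines assume the trained model, tfidf and preprocessor p from the earlier slides are in scope):

```python
# Map the model's numeric output to the labels the user should see.
labels = {0: "Bad", 2: "Neutral", 4: "Good"}

def describe(prediction):
    return labels[int(prediction)]

# With the pipeline from the slides, the full loop would be roughly:
#   text = p.clean(input("Enter a tweet: "))
#   print(describe(model.predict(tfidf.transform([text]))[0]))
print(describe(0), describe(2), describe(4))   # Bad Neutral Good
```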
Task #2
Try improving the accuracy of the model by playing
around with the parameters of the training or of
TF-IDF, by using different algorithms, or a mix of both.
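One systematic way to “play around with the parameters” is scikit-learn’s GridSearchCV, sketched here on stand-in data from make_classification rather than the Tweet dataset (the parameter grid is just an example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)   # stand-in data
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.1, 1.0, 10.0]},   # example grid
                      scoring="accuracy", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # winning parameters and their accuracy
```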
Thank you for your attention!