Machine Learning and NLP
• Machine Learning Review
▪ Supervised vs. Unsupervised Learning, Classification vs. Regression
▪ Simple Classification Example
• Text Classification Examples
▪ Logistic Regression
▪ Naive Bayes
▪ Comparing Methods: Classification Metrics
What is Machine Learning?
Machine learning allows computers to learn and infer from data: algorithms are applied to observed data and used to make predictions about new data.
Machine Learning Vocabulary
• Target/Outcome: predicted category or value
of the data (column to predict)
• Features: properties of the data used for
prediction (non-target columns)
• Example: a single data point within the data
(one row)
• Label: the target value for a single data point
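As a minimal sketch of this vocabulary in code (the housing columns and numbers here are hypothetical), a pandas DataFrame maps directly onto these terms:

import pandas as pd

# hypothetical housing data: each row is an example
df = pd.DataFrame({
    'sqft':     [1500, 2400, 900],          # feature
    'bedrooms': [3, 4, 1],                  # feature
    'price':    [300000, 550000, 150000]})  # target/outcome

X = df[['sqft', 'bedrooms']]  # features: the non-target columns
y = df['price']               # each value of y is the label for one example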
Types of Machine Learning
Supervised: the data points have a known outcome. We train the model with data, feeding it the correct answers; the model learns and can then predict the outcome for new data.
Unsupervised: the data points have an unknown outcome. Data is given to the model without the right answers; the model makes sense of the data on its own, and can teach you something you were probably not aware of in the given dataset.
Types of Supervised and Unsupervised Learning
• Supervised: Classification, Regression
• Unsupervised: Clustering, Recommendation
Supervised Learning
In supervised learning, the data points have a known outcome.
Supervised Learning Example: Housing Prices
[Table of housing data: each row is an example, each non-price column is a feature, the price column is the target/outcome, and each price value is a label]
Machine Learning and NLP
• Machine Learning Review
▪ Supervised vs. Unsupervised Learning
▪ Classification vs. Regression
• Text Classification Examples
▪ Logistic Regression
▪ Naive Bayes
▪ Comparing Methods: Classification Metrics
Types of Supervised Learning
• Regression: outcome is continuous (numerical)
• Classification: outcome is a category
Regression vs Classification Problems
Some supervised learning questions we can ask regarding movie data:
• Regression Questions
▪ Can you predict the gross earnings for a movie? $1.1 billion
• Classification Questions
▪ Can you predict whether a movie will win an Oscar or not? Yes | No
▪ Can you predict what the MPAA rating is for a movie? G | PG | PG-13 | R
Regression
Predict a real numeric value for an entity with a given set of features.
[Diagram: property attributes (address, type, age, parking, school, transit, total sqft, lot size, bathrooms, bedrooms, yard, pool, fireplace) feed a linear regression model that predicts price ($) from sqft]
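As a minimal sketch of this idea (assuming scikit-learn; the sqft and price numbers are made up), a linear regression can be fit on one property attribute to predict price:

from sklearn.linear_model import LinearRegression

# hypothetical training data: total sqft vs. sale price in $
X = [[900], [1500], [2000], [2400]]
y = [150000, 300000, 420000, 550000]

model = LinearRegression().fit(X, y)  # learn the line of best fit
print(model.predict([[1800]]))        # predicted price for an 1800 sqft home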
Supervised Learning Overview
[Diagram: fit a model on data with answers; then use the fitted model to predict answers for data without answers]
Classification: Categorical Answers
[Diagram: fit a model on emails labeled as spam/not spam; then use the fitted model to predict spam or not spam for unlabeled emails]
Machine Learning and NLP
• Machine Learning Review
▪ Supervised vs. Unsupervised Learning, Classification vs. Regression
▪ Classification Example
• Text Classification Examples
▪ Logistic Regression
▪ Naive Bayes
▪ Comparing Methods: Classification Metrics
Logistic Regression
One of the most popular machine learning techniques for binary classification.
Binary classification = how do you best split this data into two groups?
[Scatter plot: Number of Animals on the x-axis, movies labeled Kid-Friendly or Not Kid-Friendly]
Logistic Regression
The most basic regression technique is linear regression: y = β₁x + β₀
[Same scatter plot with a straight line fit through the points]
Logistic Regression
Problem: the y values of the line y = β₁x + β₀ go from -∞ to +∞.
Let's try applying a transformation to the line to limit the y values to between 0 and 1.
[Same scatter plot showing the unbounded line]
Logistic Regression
The sigmoid function (aka the "S" curve) solves this problem:
p = 1 / (1 + e^-(β₁x + β₀))
If you input the Number of Animals (x), the equation gives you the probability that the movie is Kid-Friendly (p), which is a number between 0 and 1.
[Same scatter plot with the S-shaped sigmoid curve]
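As a small sketch of the math (the coefficients below are made up, not fitted values), here is the sigmoid turning the linear output into a probability:

import numpy as np

def sigmoid(z):
    # maps any real value into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

b0, b1 = -4.0, 1.5        # hypothetical coefficients β₀ and β₁
x = 5                     # number of animals in the movie
p = sigmoid(b1 * x + b0)  # probability the movie is kid-friendly
print(round(p, 3))        # 0.971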
Logistic Regression Examples
Predict a label for an entity with a given set of features, e.g. spam prediction and sentiment analysis.
Steps for classification with NLP
1. Prepare the data: Read in labeled data and preprocess it
2. Split the data: Separate inputs and outputs into a training set and a test set, respectively
3. Numerically encode inputs: Using Count Vectorizer or TF-IDF Vectorizer
4. Fit a model: Fit a model on the training data and apply the fitted model to the test set
5. Evaluate the model: Decide how good the model is by calculating various error metrics
Building a Logistic Regression model
A classic use of text analytics is to flag messages as spam
Below is data from the SMS Spam Collection dataset, a set of over 5K English text
messages that have been labeled as spam or ham (legitimate)
Step 1: Prepare the data

Text Message | Label
Nah I don't think he goes to usf, he lives around here though | ham
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's | spam
WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only. | spam
I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today. | ham
I HAVE A DATE ON SUNDAY WITH WILL!! | ham
… | …
Step 1: Prepare the data [Code]

Input:
# make sure the data is labeled
import pandas as pd
data = pd.read_table('SMSSpamCollection.txt', header=None)
data.columns = ['label', 'text']
print(data.head()) # print function requires Python 3

Output:
[screenshot: first five rows of the labeled data]
Step 1: Prepare the data [Code]

Input:
# remove words with numbers, punctuation and capital letters
import re
import string
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)  # drop words containing digits
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
data['text'] = data.text.map(alphanumeric).map(punc_lower)
print(data.head())

Output:
[screenshot: first five rows after cleaning]
Step 2: Split the data (into inputs and outputs)
To fit a model, the data needs to be split into inputs and outputs.
The inputs and output of these models have various names:
• Inputs: Features, Predictors, Independent Variables, X's
• Outputs: Outcome, Response, Dependent Variable, Y

#   label  congrats  eat  tonight  winner  chicken  dinner  wings
0   ham    0         1    0        0       0        0       0
1   ham    0         1    1        0       0        0       0
2   spam   0         0    0        1       0        0       0
.   …      …         …    …        …       …        …       …
Step 2: Split the data [Code]

Input:
# split the data into inputs and outputs
X = data.text # inputs into model
y = data.label # output of model
Step 2: Split the data (into a training and test set)
Why do we need to split data into training and test sets?
• Let’s say we had a data set with 100 observations and we found a model that fit
the data perfectly
• What if you were to use that model on a brand new data set?
[Figure from https://en.wikipedia.org/wiki/Overfitting: the blue curve overfits the points; the black curve is the correct fit]
Step 2: Split the data (into a training and test set)
To prevent the issue of overfitting, we divide observations into two sets:
• A model is fit on the training data and evaluated on the test data
• This way, you can see if the model generalizes well

#    label  congrats  eat  tonight  winner  chicken  dinner  wings
0    ham    0         1    0        0       0        0       0
1    ham    0         1    1        0       0        0       0
2    spam   0         0    0        1       0        0       0
3    spam   1         0    0        0       0        0       0
4    ham    0         0    0        0       0        1       0
5    ham    0         0    1        0       0        0       0
6    ham    0         0    0        0       0        0       0
7    spam   0         0    0        0       0        0       0
8    ham    0         0    0        0       0        1       0
9    ham    0         0    0        0       1        1       0
10   spam   0         0    0        0       0        0       0
11   ham    0         0    0        0       0        0       1

Training Set (70-80%)
Test Set (20-30%)
Step 2: Split the data [Code]

Input:
# split the data into a training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# test size = 30% of observations, which means training size = 70% of observations
# random state = 42, so we all get the same random train / test split
Step 3: Numerically encode the input data [Code]

Input:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
X_train_cv = cv.fit_transform(X_train) # fit_transform learns the vocab and one-hot encodes
X_test_cv = cv.transform(X_test) # transform uses the same vocab and one-hot encodes
# print the dimensions of the training set (text messages, terms)
print(X_train_cv.toarray().shape)

Output:
(3900, 6103)
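To see what the vectorizer actually produces, here is a toy example on two made-up messages (get_feature_names_out requires scikit-learn 1.0+; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['winner winner chicken dinner', 'eat dinner tonight']
cv_toy = CountVectorizer()
dtm = cv_toy.fit_transform(docs)       # document-term matrix of word counts
print(cv_toy.get_feature_names_out())  # ['chicken' 'dinner' 'eat' 'tonight' 'winner']
print(dtm.toarray())                   # [[1 1 0 0 2]
                                       #  [0 1 1 1 0]]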
Step 4: Fit model and predict outcomes [Code]

Input:
# Use a logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# Train the model
lr.fit(X_train_cv, y_train)
# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv data
y_pred_cv = lr.predict(X_test_cv)
y_pred_cv # The output is all of the predictions

Output:
array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], dtype=object)
Step 5: Evaluate the model
After fitting a model on the training data and predicting outcomes for the test data,
how do you know if the model is a good fit?

#    Actual  Predicted  Result
1    ham     ham        true negative
2    ham     ham        true negative
3    spam    spam       true positive
4    spam    spam       true positive
5    ham     ham        true negative
6    ham     spam       false positive
7    ham     ham        true negative
8    ham     ham        true negative
9    ham     ham        true negative
10   spam    spam       true positive

Confusion Matrix (rows = Actual, columns = Predicted):

              Predicted ham       Predicted spam
Actual ham    True Negative (6)   False Positive (1)
Actual spam   False Negative (0)  True Positive (3)
Step 5: Evaluate the model
From the confusion matrix above (TN = 6, FP = 1, FN = 0, TP = 3), we can compute
error metrics (P = Precision, R = Recall):
• Accuracy = (TP + TN) / All
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 Score = 2*(P*R) / (P+R)
Step 5: Evaluate the model
Applying these metrics to the confusion matrix above (TN = 6, FP = 1, FN = 0, TP = 3):
• Accuracy = (TP + TN) / All = 0.9
• Precision = TP / (TP + FP) = 0.75
• Recall = TP / (TP + FN) = 1
• F1 Score = 2*(P*R) / (P+R) = 0.86
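These same metrics can also be computed directly with scikit-learn; a quick sketch on the 10 toy predictions above (pos_label='spam' treats spam as the positive class):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

actual    = ['ham','ham','spam','spam','ham','ham','ham','ham','ham','spam']
predicted = ['ham','ham','spam','spam','ham','spam','ham','ham','ham','spam']

print(accuracy_score(actual, predicted))                     # 0.9
print(precision_score(actual, predicted, pos_label='spam'))  # 0.75
print(recall_score(actual, predicted, pos_label='spam'))     # 1.0
print(f1_score(actual, predicted, pos_label='spam'))         # ~0.857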
Step 5: Evaluate the model [Code]

Input:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

cm = confusion_matrix(y_test, y_pred_cv)
sns.heatmap(cm, xticklabels=['predicted_ham', 'predicted_spam'], yticklabels=['actual_ham', 'actual_spam'],
            annot=True, fmt='d', annot_kws={'fontsize':20}, cmap="YlGnBu");

true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]
accuracy = round((true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg), 3)
precision = round((true_pos) / (true_pos + false_pos), 3)
recall = round((true_pos) / (true_pos + false_neg), 3)
f1 = round(2 * (precision * recall) / (precision + recall), 3)

print('Accuracy: {}'.format(accuracy))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F1 Score: {}'.format(f1))
Step 5: Evaluate the model [Code]
Output:
Accuracy: 0.986
Precision: 1.0
Recall: 0.893
F1 Score: 0.943
Logistic Regression Checkpoint
What was our original goal? To classify text messages as spam or ham.
How did we do that? By collecting labeled data, cleaning the data, splitting it into a training
and test set, numerically encoding it using Count Vectorizer, fitting a logistic regression
model on the training data, and evaluating the results of the model applied to the test set.
Was it a good model? It seems good, but let's see if we can get better metrics with another
classification technique: Naive Bayes.
Machine Learning and NLP
• Machine Learning Review
▪ Supervised vs. Unsupervised Learning
▪ Classification vs. Regression
• Text Classification Examples
▪ Logistic Regression
▪ Naive Bayes
▪ Comparing Methods: Classification Metrics
Naive Bayes
One of the simpler and less computationally intensive techniques, Naive Bayes tends to
perform well on text classification.
1. Conditional Probability and Bayes' Theorem
2. Independence and Naive Bayes
3. Apply Naive Bayes to the Spam Example and Compare with Logistic Regression
Naive Bayes: Bayes' Theorem
Conditional probability = what's the probability that something will happen, given that
something else has happened?
Spam example = what's the probability that this text message is spam, given that it
contains the word "cash"?

P(A|B) = P(B|A) × P(A) / P(B)

P(spam | cash) = P(cash | spam) × P(spam) / P(cash)
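A quick worked example with made-up numbers: suppose 20% of all messages are spam, "cash" appears in 40% of spam messages, and "cash" appears in only 1% of ham messages.

# hypothetical probabilities, chosen only for illustration
p_spam = 0.20             # P(spam)
p_cash_given_spam = 0.40  # P(cash | spam)
p_cash_given_ham = 0.01   # P(cash | ham)

# P(cash) via the law of total probability
p_cash = p_cash_given_spam * p_spam + p_cash_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | cash)
print(round(p_cash_given_spam * p_spam / p_cash, 3))  # 0.909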
Naive Bayes: Independent Events
Naive Bayes assumes that each event is independent, i.e. that it has no effect on other
events. This is a naive assumption, but it provides a simplified approach.
Spam example: Naive Bayes assumes that each word in a spam message (like "cash" and
"winner") is unrelated to the others. This is unrealistic, but it's the naive assumption.

P(spam | winner of some cash) ∝ P(winner | spam) × P(of | spam) × P(some | spam) × P(cash | spam) × P(spam)
P(ham | winner of some cash) ∝ P(winner | ham) × P(of | ham) × P(some | ham) × P(cash | ham) × P(ham)

The higher probability wins. (The shared denominator P(winner of some cash) is dropped,
since it is the same for both classes and doesn't change the comparison.)
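A minimal from-scratch sketch of this comparison (the per-word probabilities are invented; a real classifier estimates them from training counts, and works in log space to avoid numeric underflow):

import math

# hypothetical word probabilities per class
p_word_given_spam = {'winner': 0.05,  'of': 0.02, 'some': 0.01, 'cash': 0.04}
p_word_given_ham  = {'winner': 0.001, 'of': 0.03, 'some': 0.02, 'cash': 0.002}
p_spam, p_ham = 0.2, 0.8

message = ['winner', 'of', 'some', 'cash']

# sum of log probabilities = log of the product of the probabilities
score_spam = math.log(p_spam) + sum(math.log(p_word_given_spam[w]) for w in message)
score_ham  = math.log(p_ham)  + sum(math.log(p_word_given_ham[w])  for w in message)

print('spam' if score_spam > score_ham else 'ham')  # spam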
Naive Bayes: Fit model [Code]

Input:
# Use a Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# Train the model
nb.fit(X_train_cv, y_train)
# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv data
y_pred_cv_nb = nb.predict(X_test_cv)
y_pred_cv_nb # The output is all of the predictions

Output:
array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], dtype='<U4')
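The Naive Bayes results below can be reproduced with the same confusion-matrix code used for logistic regression; as a shortcut sketch, scikit-learn's classification_report summarizes the per-class metrics in one call (this assumes y_test and y_pred_cv_nb from the steps above):

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_cv_nb))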
Naive Bayes: Results
Accuracy: 0.986
Precision: 0.939
Recall: 0.952
F1 Score: 0.945
Compare Logistic Regression and Naive Bayes
Logistic Regression
Accuracy: 0.986
Precision: 1.0
Recall: 0.893
F1 Score: 0.943
Naive Bayes
Accuracy: 0.986
Precision: 0.939
Recall: 0.952
F1 Score: 0.945
Machine Learning and NLP Review
• Machine Learning Review
▪ Supervised Learning > Classification Techniques > Logistic Regression & Naive Bayes
▪ Error Metrics for Classification > Accuracy | Precision | Recall | F1 Score
• Classification with NLP
▪ Make sure the data is labeled and split into a training and test set
▪ Clean the data and use one-hot encoding to put in a numeric format for modeling
▪ Fit the model on the training data, apply the model to the test data and evaluate
    Machine Learning andNLP Review • Machine Learning Review ▪ Supervised Learning > Classification Techniques > Logistic Regression & Naive Bayes ▪ Error Metrics for Classification > Accuracy | Precision | Recall | F1 Score • Classification with NLP ▪ Make sure the data is labeled and split into a training and test set ▪ Clean the data and use one-hot encoding to put in a numeric format for modeling ▪ Fit the model on the training data, apply the model to the test data and evaluate