In [1]: import pandas as pd
import numpy as np
import csv
Fill in the command to load your CSV dataset "imdb.csv" with pandas.
In [2]: #Data Loading
imdb=pd.read_csv('imdb.csv')
imdb.columns = ["index","text","label"]
print(imdb.head(5))
index text label
0 0 A very, very, very slow-moving, aimless movie ... 0
1 1 Not sure who was more lost - the flat characte... 0
2 2 Attempting artiness with black & white and cle... 0
3 3 Very little music or anything to speak of. 0
4 4 The best scene in the movie was when Gerardo i... 1
Data Analysis
Get the shape of the dataset and print it.
Get the column names as a list and print them.
Group the dataset by label and describe it to understand its basic statistics (a grouped variant is sketched after the output below).
Print the first three rows of the dataset.
In [3]: data_size = imdb.shape
print(data_size)
imdb_col_names = list(imdb.columns)
print(imdb_col_names)
print(imdb.describe(include='all'))
print(imdb.head(3))
(1000, 3)
['index', 'text', 'label']
index text label
count 1000.000000 1000 1000.00000
unique NaN 997 NaN
top NaN 10/10 NaN
freq NaN 2 NaN
mean 499.500000 NaN 0.50000
std 288.819436 NaN 0.50025
min 0.000000 NaN 0.00000
25% 249.750000 NaN 0.00000
50% 499.500000 NaN 0.50000
75% 749.250000 NaN 1.00000
max 999.000000 NaN 1.00000
index text label
0 0 A very, very, very slow-moving, aimless movie ... 0
1 1 Not sure who was more lost - the flat characte... 0
2 2 Attempting artiness with black & white and cle... 0
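The cell above describes the whole frame; the instructions also ask for a per-label breakdown. A minimal sketch of that grouped describe, not executed in the original run:
print(imdb.groupby('label').describe())   # per-label statistics for the numeric columns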
Target Identification
Execute the cell below to identify the target variable. If the label is 0 it is a bad review; if it is 1 it is a good review.
In [4]: imdb_target=imdb['label']
print(imdb_target)
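A quick optional check, not part of the original cell, confirms that the two classes are balanced (the describe output above already suggests a roughly 500/500 split):
print(imdb_target.value_counts())   # count of good (1) and bad (0) reviews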
Tokenization
Convert the text to lowercase.
Tokenize the text using word_tokenize.
Apply the function split_tokens to the column text of the imdb dataset.
In [6]: from nltk.tokenize import word_tokenize
import nltk
nltk.download('all')
def split_tokens(text):
    # Lowercase the review and split it into word tokens.
    message = text.lower()
    word_tokens = word_tokenize(message)
    return word_tokens
imdb['tokenized_message'] = imdb.text.apply(split_tokens)
[nltk_data] Downloading collection 'all'
[nltk_data] |
[nltk_data] | Downloading package abc to /home/user/nltk_data...
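Downloading the full NLTK collection is slow; only a few resources are actually used in this notebook. A lighter alternative (an assumption about which resources suffice; on recent NLTK releases 'punkt_tab' may also be required):
import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('wordnet')    # WordNet data used by WordNetLemmatizer
nltk.download('omw-1.4')    # multilingual WordNet data needed by newer NLTK releases
nltk.download('stopwords')  # English stop-word list used later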
Lemmatization
Apply the function split_into_lemmas to the column tokenized_message.
Print the 55th row from the column tokenized_message.
Print the 55th row from the column lemmatized_message.
In [7]: from nltk.stem.wordnet import WordNetLemmatizer
def split_into_lemmas(text):
    # Reduce each token to its WordNet lemma (nouns by default).
    lemma = []
    lemmatizer = WordNetLemmatizer()
    for word in text:
        a = lemmatizer.lemmatize(word)
        lemma.append(a)
    return lemma
imdb['lemmatized_message'] = imdb.tokenized_message.apply(split_into_lemmas)
print('Tokenized message:', imdb.tokenized_message[54])
print('Lemmatized message:', imdb.lemmatized_message[54])
Tokenized message: ['long', ',', 'whiny', 'and', 'pointless', '.']
Lemmatized message: ['long', ',', 'whiny', 'and', 'pointless', '.']
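The tokenized and lemmatized messages above are identical because WordNetLemmatizer treats every token as a noun unless told otherwise. A small sketch, not part of the original run, showing where lemmatization does change a token:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('movies'))        # 'movie' (noun is the default POS)
print(lemmatizer.lemmatize('was', pos='v'))  # 'be'    (verbs need an explicit pos='v')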
Stop Word Removal
Set the stop-word language to English in the variable stop_words.
Apply the function stopword_removal to the column lemmatized_message.
Print the 55th row from the column preprocessed_message.
In [8]: from nltk.corpus import stopwords
def stopword_removal(text):
    # Drop English stop words and join the remaining tokens back into one string.
    stop_words = stopwords.words('english')
    filtered_sentence = ' '.join([word for word in text if word not in stop_words])
    return filtered_sentence
imdb['preprocessed_message'] = imdb.lemmatized_message.apply(stopword_removal)
print('Preprocessed message:', imdb.preprocessed_message[54])
Training_data = pd.Series(list(imdb['preprocessed_message']))
Training_label = pd.Series(list(imdb['label']))
Preprocessed message: long , whiny pointless .
Term Document Matrix
Apply CountVectorizer with the following parameters:
ngram_range = (1,2)
min_df = (1/len(Training_label))
max_df = 0.7
Fit tf_vectorizer on Training_data.
Transform Training_data with the fitted vectorizer to obtain the term-document matrix message_data_TDM.
In [9]: from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
tf_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=(1 / len(Training_label)), max_df=0.7)
Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)    # learn the unigram/bigram vocabulary
message_data_TDM = tf_vectorizer.transform(Training_data)  # term counts for every review
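A quick way to sanity-check the term-document matrix is to look at its shape and a few vocabulary entries. A sketch, not part of the original run; get_feature_names_out needs scikit-learn 1.0 or newer, older releases expose get_feature_names instead:
print(message_data_TDM.shape)                        # (documents, vocabulary size); 1000 x 9051 in this run
print(tf_vectorizer.get_feature_names_out()[:10])    # first few unigram/bigram vocabulary entries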
Term Frequency Inverse Document Frequency (TFIDF)
Apply TfidfVectorizer with the following parameters:
ngram_range = (1,2)
min_df = (1/len(Training_label))
max_df = 0.7
Fit tfidf_vectorizer on Training_data.
Transform Training_data with the fitted vectorizer to obtain the TF-IDF matrix message_data_TFIDF.
In [10]: from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=(1 / len(Training_label)), max_df=0.7)
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)      # learn vocabulary and IDF weights
message_data_TFIDF = tfidf_vectorizer.transform(Training_data)    # TF-IDF weighted features
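The TF-IDF matrix covers the same vocabulary as the count matrix but re-weights terms by inverse document frequency. A short sketch, not part of the original run, for inspecting the result:
print(message_data_TFIDF.shape)      # same (documents, terms) shape as the count matrix above
print(tfidf_vectorizer.idf_[:5])     # learned inverse-document-frequency weights for the first few terms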
Train and Test Data
Splitting the data for training and testing (90% train, 10% test).
Perform a train-test split on message_data_TDM and Training_label with 90% as train data and 10% as test data.
In [11]: from sklearn.model_selection import train_test_split
train_data,test_data, train_label, test_label = train_test_split(message_data_TDM,Training_label,test_size=0.1)
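Because no random_state is passed, the split, and therefore the scores below, will vary from run to run. A reproducible, class-balanced variant would look like the sketch below; random_state and stratify are additions, not part of the original cell:
train_data, test_data, train_label, test_label = train_test_split(
    message_data_TDM, Training_label,
    test_size=0.1,
    random_state=9,               # fixed seed so the split (and the scores) are repeatable
    stratify=Training_label)      # keep the 50/50 label balance in both splits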
Support Vector Machine
Get the shape of the train-data and print the same.
Get the shape of the test-data and print the same.
Initialize the SVM classifier with the following parameters:
kernel = linear
C = 0.025
random_state = seed
Train the model with train_data and train_label
Now predict the output with test_data
Evaluate the classifier with score from test_data and test_label
Print the predicted score
In [12]: seed = 9
from sklearn.svm import SVC
train_data_shape = train_data.shape
test_data_shape = test_data.shape
print("The shape of train data:", train_data_shape)
print("The shape of test data:", test_data_shape)
classifier = SVC(kernel='linear', C=0.025, random_state=seed)
classifier = classifier.fit(train_data, train_label)
target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)    # accuracy on the held-out 10%
print('SVM Classifier : ', score)
with open('output.txt', 'w') as file:
    file.write(str((imdb['tokenized_message'][55], imdb['lemmatized_message'][55])))
The shape of train data: (900, 9051)
The shape of test data: (100, 9051)
SVM Classifier : 0.73
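Accuracy alone can hide class-specific errors; a confusion matrix and per-class report give a fuller picture. A sketch using the predictions already computed above, not part of the original run:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(test_label, target))         # rows = true labels, columns = predicted labels
print(classification_report(test_label, target))    # per-class precision, recall and F1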
Stochastic Gradient Descent Classifier
Perform a train-test split on message_data_TDM and Training_label, this time with 80% as train data and 20% as test data.
Get the shape of the train-data and print the same.
Get the shape of the test-data and print the same.
Initialize the SGD classifier with the following parameters:
loss = modified_huber
shuffle= True
random_state=seed
Train the model with train_data and train_label
Now predict the output with test_data
Evaluate the classifier with score from test_data and test_label
Print the predicted score
In [13]: from sklearn.linear_model import SGDClassifier
train_data, test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=0.2)
train_data_shape = train_data.shape
test_data_shape = test_data.shape
print("The shape of train data:", train_data_shape)
print("The shape of test data:", test_data_shape)
classifier = SGDClassifier(loss='modified_huber', shuffle=True, random_state=seed)
classifier = classifier.fit(train_data, train_label)
target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)    # accuracy on the held-out 20%
print('SGD classifier : ', score)
with open('output1.txt', 'w') as file:
    file.write(str(imdb['preprocessed_message'][55]))
The shape of train data: (800, 9051)
The shape of test data: (200, 9051)
SGD classifier : 0.7
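One reason to choose loss='modified_huber' is that, unlike the default hinge loss, it lets SGDClassifier produce probability estimates. A short sketch, not part of the original run:
proba = classifier.predict_proba(test_data)   # available because loss='modified_huber'
print(proba[:3])                              # class probabilities for the first three test reviews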