
# 🔹 Lab 1: Introduction to NLP

# 🎯 Objective: Learn basic text processing

# Step 1: Import libraries


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required resources (only first time)


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
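# Note: on newer NLTK versions, word_tokenize may also need the
# 'punkt_tab' resource: nltk.download('punkt_tab')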

# Step 2: Example text


text = "I am learning Natural Language Processing in class."

print("Original Text:")
print(text)

[nltk_data] Downloading package punkt to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...

Original Text:
I am learning Natural Language Processing in class.


Tokenization (Breaking text into words)


In simple words:

word_tokenize(text) → cuts the sentence into small pieces (words and punctuation).

tokens = ... → saves those pieces into a variable called tokens.

print("After Tokenization:") → shows a label so students know what's coming.

print(tokens) → shows the actual list of words.

👉 Example: Input → "I am learning NLP in class." Output → ['I', 'am', 'learning', 'NLP', 'in', 'class', '.']

# Step 3: Tokenization (Breaking text into words)


tokens = word_tokenize(text)    # splits the sentence in 'text' into individual words (tokens)
print("After Tokenization:")    # prints a heading so the output looks clear
print(tokens)                   # prints the list of tokens (words) after splitting

After Tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'in', 'class', '.']

Removing Stopwords (like "am", "in", "the")


In simple words: We clean the sentence by throwing away very common words (like "is", "the", "in") that don't add much meaning.

👉 Example: Input Tokens → ['I', 'am', 'learning', 'NLP', 'in', 'class'] Output → ['learning', 'NLP', 'class'] (note that 'I' is removed too, because the check lowercases every word and 'i' is a stopword).

# Step 4: Removing Stopwords (like "am", "in", "the")

stop_words = set(stopwords.words('english'))
# Loads the list of common English words (like "am", "is", "the") from NLTK
# and stores it inside a set called stop_words.
# A "set" is used because it's faster to check membership.

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Creates a new list: it goes through each word in tokens one by one,
# converts the word to lowercase (word.lower()),
# and keeps it ONLY if it is NOT in stop_words.

print("After Removing Stopwords:")   # prints a heading so the output looks clear
print(filtered_tokens)               # prints the list of words after removing stopwords

After Removing Stopwords:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
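To get a feel for what is being thrown away, you can peek at the stopword list itself. A quick check (the exact size and contents depend on your NLTK version):

print(len(stop_words))           # number of stopwords (around 180 in recent NLTK versions)
print(sorted(stop_words)[:10])   # the first few stopwords, alphabetically
print('am' in stop_words)        # True: 'am' is in the list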

Stemming (cutting words to their root form)


In simple words: Stemming is like chopping words down to their base. The result may not always be a real word, but it gives the core root.

👉 Example: Input → ['learning', 'NLP', 'classes'] Output → ['learn', 'nlp', 'class'] (the Porter stemmer also lowercases its input by default).

# Step 5: Stemming (cutting words to their root form)


from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Creates a "stemmer" object using the Porter algorithm.

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# For each word in filtered_tokens, apply stemmer.stem(word),
# which cuts the word down to its root form.
# Example: "learning" → "learn", "classes" → "class"

print("After Stemming:")
print(stemmed_tokens)   # prints the list of words after stemming

After Stemming:
['learn', 'natur', 'languag', 'process', 'class', '.']

Lemmatization (finding correct base word using grammar rules)


In simple words: Lemmatization is smarter than stemming because it uses grammar plus a dictionary (WordNet) to find the correct base word.

👉 Example: Input → ['better', 'running', 'classes'] Output → ['good', 'run', 'class'] (to get 'good' and 'run' you must also tell the lemmatizer the part of speech; by default it treats every word as a noun).

# Step 6: Lemmatization (finding correct base word using grammar rules)


from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Creates a "lemmatizer" object that uses the WordNet dictionary.

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# For each word in filtered_tokens, apply lemmatizer.lemmatize(word),
# which returns the meaningful base form.
# Example: "classes" → "class" (by default every word is treated as a noun)

print("After Lemmatization:")
print(lemmatized_tokens)   # prints the list of words after lemmatization

After Lemmatization:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
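Notice that 'learning' was not changed to 'learn': lemmatize() treats every word as a noun unless you pass a part-of-speech tag. A small sketch of how the pos argument changes the result:

print(lemmatizer.lemmatize('learning'))            # 'learning' (treated as a noun)
print(lemmatizer.lemmatize('learning', pos='v'))   # 'learn' (treated as a verb)
print(lemmatizer.lemmatize('better', pos='a'))     # 'good' (treated as an adjective)
print(lemmatizer.lemmatize('classes'))             # 'class' (noun plurals are handled by default)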


Installing Other Necessary Packages


📊 Core Data Handling & Analysis

NumPy – For numerical computing, arrays, and mathematical operations.

Pandas – For data manipulation, cleaning, and analysis (DataFrames).

📈 Data Visualization

Matplotlib – For basic 2D plotting and charts.

Seaborn – Built on Matplotlib, makes statistical graphics more attractive.

Plotly – For interactive, dynamic, and web-based visualizations.

🤖 Machine Learning

Scikit-learn – For machine learning models (classification, regression, clustering, etc.).

XGBoost – For gradient boosting and high-performance ML models.

🧠 Deep Learning

TensorFlow – Popular deep learning framework from Google.

PyTorch – Deep learning framework widely used in research and industry.

🌐 NLP & Data Preprocessing

NLTK / SpaCy – For natural language processing (tokenization, text cleaning, etc.).
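All of these are installed with pip. A typical one-time setup, assuming the standard PyPI package names (run in a terminal, or prefix with % in a notebook cell):

pip install numpy pandas matplotlib seaborn plotly scikit-learn xgboost tensorflow torch nltk spacy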

# 📊 Core Data Handling


import numpy as np
import pandas as pd

# 📈 Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px # For interactive plots

# 🤖 Machine Learning
import sklearn
import xgboost as xgb

# 🧠 Deep Learning
import tensorflow as tf

# 🌐 NLP & Text Processing


import nltk
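
A quick way to confirm that everything imported correctly is to print each library's version (a minimal sanity check; the numbers will differ on your machine):

print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
print("XGBoost:", xgb.__version__)
print("TensorFlow:", tf.__version__)
print("NLTK:", nltk.__version__)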

Chunking: Dividing Data into Chunks


Example 1: Chunking Text Data

Sometimes in NLP (Natural Language Processing), you break text into chunks of words for
processing.

text = "Data Science is an exciting field to learn"

# Split into words


words = text.split()
words

['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']

text = "Data Science is an exciting field to learn"

# Split into words


words = text.split()

# Chunk into size 2 (pairs of words)


def chunk_words(words, size):
    for i in range(0, len(words), size):
        yield words[i:i+size]

chunks = list(chunk_words(words, 2))

print("Words:", words)
print("Chunks:", chunks)

Words: ['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']
Chunks: [['Data', 'Science'], ['is', 'an'], ['exciting', 'field'], ['to', 'learn']]
Meaning of each part used in the above example

text = "Data Science is an exciting field to learn"
# A plain Python string holding the sentence we want to chunk.

words = text.split()
# split() breaks the string on whitespace and returns a list of words.

def chunk_words(words, size):
    # A generator function: it produces one chunk at a time
    # instead of building the whole result in memory.
    for i in range(0, len(words), size):
        # range(0, len(words), size) steps through indices 0, 2, 4, ...
        yield words[i:i+size]
        # words[i:i+size] slices out the next 'size' words;
        # yield hands that slice to the caller and pauses here.

chunks = list(chunk_words(words, 2))
# list(...) drains the generator, collecting every chunk into one list.

print("Words:", words)
print("Chunks:", chunks)
