
# 🔹 Lab 1: Introduction to NLP

# 🎯 Objective: Learn basic text processing

# Step 1: Import libraries


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required resources (only first time)


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
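# Note: on newer NLTK versions, word_tokenize may also need the
# 'punkt_tab' resource: nltk.download('punkt_tab')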

# Step 2: Example text


text = "I am learning Natural Language Processing in class."

print("Original Text:")
print(text)

[nltk_data] Downloading package punkt to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...

Original Text:
I am learning Natural Language Processing in class.


Tokenization (Breaking text into words)


In simple words:

word_tokenize(text) → cuts the sentence into small pieces (words and punctuation).

tokens = ... → saves those pieces into a variable called tokens.

print("After Tokenization:") → shows a label so students know what's coming.

print(tokens) → shows the actual list of words.

👉 Example: Input → "I am learning NLP in class." Output → ['I', 'am', 'learning', 'NLP', 'in', 'class', '.']

# Step 3: Tokenization (Breaking text into words)


tokens = word_tokenize(text)    # splits the sentence in 'text' into individual words (tokens)
print("After Tokenization:")    # prints a heading so the output looks clear
print(tokens)                   # prints the list of tokens (words) after splitting

After Tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'in', 'class', '.']

Removing Stopwords (like "am", "in", "the")


In simple words: We clean the sentence by throwing away very common words (like "is", "the", "in") that don't add much meaning.

👉 Example: Input Tokens → ['I', 'am', 'learning', 'NLP', 'in', 'class'] Output → ['learning', 'NLP', 'class'] (note that 'I' is removed too, because the check lowercases every word and 'i' is a stopword).

# Step 4: Removing Stopwords (like "am", "in", "the")

stop_words = set(stopwords.words('english'))
# Loads the list of common English words (like "am", "is", "the") from NLTK
# and stores it inside a set called stop_words.
# A "set" is used because it's faster to check membership.

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Creates a new list: it goes through each word in tokens one by one,
# converts the word to lowercase (word.lower()),
# and keeps it ONLY if it is NOT in stop_words.

print("After Removing Stopwords:")   # prints a heading so the output looks clear
print(filtered_tokens)               # prints the list of words after removing stopwords

After Removing Stopwords:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
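To get a feel for what is being thrown away, you can peek at the stopword list itself. A quick check (the exact size and contents depend on your NLTK version):

print(len(stop_words))           # number of stopwords (around 180 in recent NLTK versions)
print(sorted(stop_words)[:10])   # the first few stopwords, alphabetically
print('am' in stop_words)        # True: 'am' is in the list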

Stemming (cutting words to their root form)


In simple words: Stemming is like chopping words down to their base. The result may not always be a real word, but it gives the core root.

👉 Example: Input → ['learning', 'NLP', 'classes'] Output → ['learn', 'nlp', 'class'] (the Porter stemmer also lowercases its input by default).

# Step 5: Stemming (cutting words to their root form)


from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Creates a "stemmer" object using the Porter algorithm.

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# For each word in filtered_tokens, apply stemmer.stem(word),
# which cuts the word down to its root form.
# Example: "learning" → "learn", "classes" → "class"

print("After Stemming:")
print(stemmed_tokens)   # prints the list of words after stemming

After Stemming:
['learn', 'natur', 'languag', 'process', 'class', '.']

Lemmatization (finding correct base word using grammar rules)


In simple words: Lemmatization is smarter than stemming because it uses grammar plus a dictionary (WordNet) to find the correct base word.

👉 Example: Input → ['better', 'running', 'classes'] Output → ['good', 'run', 'class'] (to get 'good' and 'run' you must also tell the lemmatizer the part of speech; by default it treats every word as a noun).

# Step 6: Lemmatization (finding correct base word using grammar rules)


from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Creates a "lemmatizer" object that uses the WordNet dictionary.

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# For each word in filtered_tokens, apply lemmatizer.lemmatize(word),
# which returns the meaningful base form.
# Example: "classes" → "class" (by default every word is treated as a noun)

print("After Lemmatization:")
print(lemmatized_tokens)   # prints the list of words after lemmatization

After Lemmatization:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
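Notice that 'learning' was not changed to 'learn': lemmatize() treats every word as a noun unless you pass a part-of-speech tag. A small sketch of how the pos argument changes the result:

print(lemmatizer.lemmatize('learning'))            # 'learning' (treated as a noun)
print(lemmatizer.lemmatize('learning', pos='v'))   # 'learn' (treated as a verb)
print(lemmatizer.lemmatize('better', pos='a'))     # 'good' (treated as an adjective)
print(lemmatizer.lemmatize('classes'))             # 'class' (noun plurals are handled by default)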


Installing Other Necessary Packages


📊 Core Data Handling & Analysis

NumPy – For numerical computing, arrays, and mathematical operations.

Pandas – For data manipulation, cleaning, and analysis (DataFrames).

📈 Data Visualization

Matplotlib – For basic 2D plotting and charts.

Seaborn – Built on Matplotlib, makes statistical graphics more attractive.

Plotly – For interactive, dynamic, and web-based visualizations.

🤖 Machine Learning

Scikit-learn – For machine learning models (classification, regression, clustering, etc.).

XGBoost – For gradient boosting and high-performance ML models.

🧠 Deep Learning

TensorFlow – Popular deep learning framework from Google.

PyTorch – Deep learning framework widely used in research and industry.

🌐 NLP & Data Preprocessing

NLTK / SpaCy – For natural language processing (tokenization, text cleaning, etc.).
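All of these are installed with pip. A typical one-time setup, assuming the standard PyPI package names (run in a terminal, or prefix with % in a notebook cell):

pip install numpy pandas matplotlib seaborn plotly scikit-learn xgboost tensorflow torch nltk spacy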

# 📊 Core Data Handling


import numpy as np
import pandas as pd

# 📈 Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px # For interactive plots

# 🤖 Machine Learning
import sklearn
import xgboost as xgb

# 🧠 Deep Learning
import tensorflow as tf

# 🌐 NLP & Text Processing


import nltk
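
A quick way to confirm that everything imported correctly is to print each library's version (a minimal sanity check; the numbers will differ on your machine):

print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
print("XGBoost:", xgb.__version__)
print("TensorFlow:", tf.__version__)
print("NLTK:", nltk.__version__)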

Chunking: Dividing Data into Chunks


Example 1: Chunking Text Data

Sometimes in NLP (Natural Language Processing), you break text into chunks of words for
processing.

text = "Data Science is an exciting field to learn"

# Split into words


words = text.split()
words

['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']

text = "Data Science is an exciting field to learn"

# Split into words


words = text.split()

# Chunk into size 2 (pairs of words)


def chunk_words(words, size):
    for i in range(0, len(words), size):
        yield words[i:i+size]

chunks = list(chunk_words(words, 2))

print("Words:", words)
print("Chunks:", chunks)

Words: ['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']
Chunks: [['Data', 'Science'], ['is', 'an'], ['exciting', 'field'], ['to', 'learn']]
Meaning of each part used in the above example

text = "Data Science is an exciting field to learn"
# A plain Python string holding the sentence we want to chunk.

words = text.split()
# split() breaks the string on whitespace and returns a list of words.

def chunk_words(words, size):
    # A generator function: it produces one chunk at a time
    # instead of building the whole result in memory.
    for i in range(0, len(words), size):
        # range(0, len(words), size) steps through indices 0, 2, 4, ...
        yield words[i:i+size]
        # words[i:i+size] slices out the next 'size' words;
        # yield hands that slice to the caller and pauses here.

chunks = list(chunk_words(words, 2))
# list(...) drains the generator, collecting every chunk into one list.

print("Words:", words)
print("Chunks:", chunks)
