# 🔹 Lab 1: Introduction to NLP
# 🎯 Objective: Learn basic text processing
# Step 1: Import libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download required resources (only first time)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Step 2: Example text
text = "I am learning Natural Language Processing in class."
print("Original Text:")
print(text)
[nltk_data] Downloading package punkt to C:\Users\Sujay
[nltk_data] Kumar\AppData\Roaming\nltk_data...
[nltk_data] Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Sujay
[nltk_data] Kumar\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\Users\Sujay
[nltk_data] Kumar\AppData\Roaming\nltk_data...
Original Text:
I am learning Natural Language Processing in class.
Tokenization (Breaking text into words)
In simple words:
word_tokenize(text) → cuts the sentence into small pieces (words/punctuation).
tokens = ... → saves those words into a variable called tokens.
print("After Tokenization:") → shows a label so students know what’s coming.
print(tokens) → shows the actual list of words.
👉 Example: Input → "I am learning NLP in class." Output → ['I', 'am', 'learning', 'NLP', 'in', 'class', '.']
# Step 3: Tokenization (Breaking text into words)
tokens = word_tokenize(text)   # This line takes the sentence in 'text' and splits it into individual words (tokens)
print("After Tokenization:")   # This line just prints a heading so the output looks clear
print(tokens)                  # This line prints the list of tokens (words) after splitting
After Tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'in',
'class', '.']
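Besides splitting on spaces, word_tokenize also separates punctuation and breaks contractions apart. A small optional check (not part of the lab steps):
# Optional: punctuation and contractions become separate tokens
print(word_tokenize("Don't worry, it's easy!"))
# Typically: ['Do', "n't", 'worry', ',', 'it', "'s", 'easy', '!']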
Removing Stopwords (like "am", "in", "the")
In simple words: We are cleaning the sentence by throwing away unnecessary words (like "is", "the", "in") that don't add much meaning.
👉 Example: Input Tokens → ['I', 'am', 'learning', 'NLP', 'in', 'class'] Output → ['learning', 'NLP', 'class'] (note that 'I' is removed too, because its lowercase form 'i' is a stopword).
# Step 4: Removing Stopwords (like "am", "in", "the")
stop_words = set(stopwords.words('english'))
# This line loads a list of common English words (like "am", "is", "the") from NLTK
# and stores them inside a set called stop_words.
# A "set" is used because it's faster to check membership.
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# This line creates a new list.
# It goes through each word in tokens one by one,
# converts the word into lowercase (word.lower()),
# and keeps it ONLY if it is NOT in stop_words.
print("After Removing Stopwords:")
# Prints a heading so the output looks clear.
print(filtered_tokens)
# Prints the list of words after removing stopwords.
After Removing Stopwords:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
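Notice that 'I' is gone too. The stopword list is stored in lowercase, which is why the code checks word.lower(). A quick optional check:
# Optional: 'i', 'am' and 'in' are all in the NLTK English stopword list
print('i' in stop_words, 'am' in stop_words, 'in' in stop_words)   # True True True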
Stemming (cutting words to their root form)
In simple words: Stemming is like chopping words down to their base. It may not always be a
real word but gives the core root.
👉 Example: Input → ['learning', 'NLP', 'classes'] Output → ['learn', 'nlp', 'class']
# Step 5: Stemming (cutting words to their root form)
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Creates a "stemmer" object using the Porter algorithm.
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply stemmer.stem(word) which cuts the word down to its root form.
# Example: "learning" → "learn", "classes" → "class"
print("After Stemming:")
print(stemmed_tokens)
# Prints the list of words after stemming.
After Stemming:
['learn', 'natur', 'languag', 'process', 'class', '.']
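Stems like 'natur' and 'languag' are not real words, and that is fine: the goal is that different forms of the same word collapse to one root. A quick optional check:
# Optional: different forms of "study" typically reduce to the same stem
for w in ["studies", "studying", "studied"]:
    print(w, "->", stemmer.stem(w))   # all three usually become 'studi'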
Lemmatization (finding correct base word using grammar rules)
In simple words: Lemmatization is smarter than stemming because it uses grammar + a dictionary (WordNet) to find the correct base word.
👉 Example: Input → ['better', 'running', 'classes'] Output → ['good', 'run', 'class'] (note: 'better' → 'good' and 'running' → 'run' only happen when the lemmatizer is told the part of speech; by default every word is treated as a noun).
# Step 6: Lemmatization (finding correct base word using grammar rules)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Creates a "lemmatizer" object that uses the WordNet dictionary.
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply lemmatizer.lemmatize(word) which returns the meaningful base form.
# Without a part-of-speech hint it treats every word as a noun,
# so "classes" → "class" but "learning" stays as "learning".
print("After Lemmatization:")
print(lemmatized_tokens)
# Prints the list of words after lemmatization.
After Lemmatization:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
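The output above is almost the same as the input because lemmatize() assumes every word is a noun unless you pass a part-of-speech tag. A small optional sketch (not part of the lab steps) showing the effect of the pos argument:
# Optional: the pos argument tells the lemmatizer the part of speech
# pos='a' = adjective, pos='v' = verb; the default is noun
print(lemmatizer.lemmatize('better', pos='a'))    # 'good'
print(lemmatizer.lemmatize('running', pos='v'))   # 'run'
print(lemmatizer.lemmatize('classes'))            # 'class' (noun is the default)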
Installing Other Necessary Packages
📊 Core Data Handling & Analysis
NumPy – For numerical computing, arrays, and mathematical operations.
Pandas – For data manipulation, cleaning, and analysis (DataFrames).
📈 Data Visualization
Matplotlib – For basic 2D plotting and charts.
Seaborn – Built on Matplotlib, makes statistical graphics more attractive.
Plotly – For interactive, dynamic, and web-based visualizations.
🤖 Machine Learning
Scikit-learn – For machine learning models (classification, regression, clustering, etc.).
XGBoost – For gradient boosting and high-performance ML models.
🧠 Deep Learning
TensorFlow – Popular deep learning framework from Google.
PyTorch – Deep learning framework widely used in research and industry.
🌐 NLP & Data Preprocessing
NLTK / SpaCy – For natural language processing (tokenization, text cleaning, etc.).
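All of the packages above can usually be installed in one go with pip from a terminal (a typical command, not part of the lab code; versions on your machine may differ):
pip install numpy pandas matplotlib seaborn plotly scikit-learn xgboost tensorflow torch nltk spacy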
# 📊 Core Data Handling
import numpy as np
import pandas as pd
# 📈 Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px # For interactive plots
# 🤖 Machine Learning
import sklearn
import xgboost as xgb
# 🧠 Deep Learning
import tensorflow as tf
import torch            # PyTorch (listed above; assumed installed)
# 🌐 NLP & Text Processing
import nltk
import spacy            # SpaCy (listed above; assumed installed)
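A quick optional sanity check that the imports worked is to print a few version numbers:
# Optional: confirm the core packages loaded correctly
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
print("TensorFlow:", tf.__version__)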
Chunking: Dividing Data into Chunks
Example 1: Chunking Text Data
Sometimes in NLP (Natural Language Processing), you break text into chunks of words for
processing.
text = "Data Science is an exciting field to learn"
# Split into words
words = text.split()
words
['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']
text = "Data Science is an exciting field to learn"
# Split into words
words = text.split()
# Chunk into size 2 (pairs of words)
def chunk_words(words, size):
    for i in range(0, len(words), size):
        yield words[i:i+size]
chunks = list(chunk_words(words, 2))
print("Words:", words)
print("Chunks:", chunks)
Words: ['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to',
'learn']
Chunks: [['Data', 'Science'], ['is', 'an'], ['exciting', 'field'],
['to', 'learn']]
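The last chunk does not have to be full. With a chunk size of 3, the same eight words give two chunks of three and a final chunk of two (optional check):
# Optional: with size 3, the final chunk is shorter
print(list(chunk_words(words, 3)))
# [['Data', 'Science', 'is'], ['an', 'exciting', 'field'], ['to', 'learn']]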
Meaning of each line used in the above example
text = "Data Science is an exciting field to learn"   # The sentence we want to split
words = text.split()                          # split() cuts the sentence into a list of words
def chunk_words(words, size):                 # A generator function that groups the words
    for i in range(0, len(words), size):      # Walk through the list in steps of 'size'
        yield words[i:i+size]                 # Give back one chunk (a slice of 'size' words)
chunks = list(chunk_words(words, 2))          # Collect all chunks into a list (pairs, since size = 2)
print("Words:", words)                        # Show the full list of words
print("Chunks:", chunks)                      # Show the list of chunks