
NLP MODULE 2

SAMI THAKUR

What is a Corpus

In Natural Language Processing (NLP), a corpus (plural: corpora) is a large and structured
collection of texts that are stored in electronic form.​
It is used as a dataset for training, testing, and evaluating NLP models.

Think of it as the "text database" that computers use to learn language.​


For example, if you want a computer to understand English grammar, you give it a huge
collection of English sentences (corpus).

Features of a Corpus:

1.​ Large Size – Usually contains millions of words/sentences.


2.​ Structured – Text is often annotated (tagged with parts of speech, meaning, etc.).
3.​ Domain-specific or General – Some corpora are built for a specific domain (like medical
text), while others are general.

Examples of Famous Corpora in NLP:

1.​ Brown Corpus – One of the first English corpora, with over 1 million words.
2.​ British National Corpus (BNC) – A 100-million-word collection of samples of written
and spoken English.
3.​ COCA (Corpus of Contemporary American English) – A large, balanced corpus with 1
billion words from spoken, fiction, magazines, newspapers, and academic texts.
4.​ Penn Treebank – Includes syntactically parsed sentences (used for training parsers).
5.​ Google N-gram Corpus – Contains phrases and word sequences from billions of words
collected from the web.
6.​ Wikipedia Dumps – Used as an open-source corpus for general NLP tasks.
7.​ Domain-Specific Corpora –
○​ Medical: MIMIC-III (clinical text dataset).
○​ Legal: EUR-Lex corpus (legal documents).
○​ Social Media: Twitter Corpus.​
Regular Expressions (Regex)

A Regular Expression (Regex) is a sequence of characters that defines a search pattern. It is


mainly used for string matching and text processing, such as searching, extracting, or
replacing text.

Types of Regular Expressions

Regular expressions can be classified into different types based on what they match.

1. Literal Characters

●​ Match exact characters as they are.


●​ Example:
○​ Regex: cat
○​ Matches: "cat" in "The cat is sleeping."
○​ Does not match "Cat" (case-sensitive).

2. Metacharacters​
Special characters that have specific meanings in regex, for example:
● . (any single character), ^ (start of string), $ (end of string)
● * + ? (repetition), | (alternation), \ (escape)
● ( ) (grouping), [ ] (character class)

3. Character Classes​
Used to match a set or range of characters.​
Example: [abc] matches a, b, or c; [a-z] matches any lowercase letter; [^0-9] matches any non-digit.

4. Predefined Character Classes​
Shortcut notations for common sets:
● \d → digit [0-9], \D → non-digit
● \w → word character [a-zA-Z0-9_], \W → non-word character
● \s → whitespace, \S → non-whitespace

5. Quantifiers​
Used to specify how many times a pattern should repeat:
● * → zero or more, + → one or more, ? → zero or one
● {n} → exactly n times, {n,m} → between n and m times
6. Grouping & Alternation

●​ Grouping ( ): To group patterns together.​


Example: (ab)+ → matches ab, abab, ababab.
●​ Alternation |: OR condition.​
Example: cat|dog → matches cat or dog.

7. Anchors & Boundaries

●​ \b → Matches word boundary (beginning or end of a word).


○​ Example: \bcat\b matches "cat" but not "category".
●​ \B → Matches non-word boundary.
○​ Example: \Bcat matches "educate".
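These boundary patterns are easy to try with Python's `re` module (a quick sketch):

```python
import re

text = "The cat sat near the category of cats."

# \bcat\b matches "cat" only as a whole word, so "category"
# and "cats" are skipped
whole = re.findall(r"\bcat\b", text)

# \Bcat requires a word character immediately before "cat",
# so it matches inside "educate" but not at a word start
embedded = re.findall(r"\Bcat", "educate")

print(whole)     # ['cat']
print(embedded)  # ['cat']
```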

Practical Example: Email Validation (Simplified)

Let's build a (basic) regex to match simple email addresses like

name@domain.com.

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

1.​ Anchor: ^ - Start of the string.


2.​ Character Class: [a-zA-Z0-9._%+-] - Local part can contain letters, numbers, and these
symbols.
3.​ Quantifier: + - The local part must have one or more of the allowed characters.
4.​ Literal: @ - The literal "at" symbol.
5.​ Character Class: [a-zA-Z0-9.-] - Domain name (letters, numbers, dots, hyphens).
6.​ Quantifier: + - Domain name must have one or more allowed characters.
7.​ Literal: \. - A literal dot (escaped because . is a metacharacter). This is the "dot" in
".com".
8.​ Character Class: [a-zA-Z] - Top-Level Domain (TLD) must be letters.
9.​ Quantifier: {2,} - TLD must be at least 2 characters long (e.g., .com, .io, .org).
10.​Anchor: $ - End of the string.
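The pattern above can be exercised directly with Python's `re` module. This is a sketch only (the `is_valid_email` helper name is illustrative, and real-world email validation needs a far fuller grammar than this simplified regex):

```python
import re

# The simplified pattern from the walkthrough above
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def is_valid_email(s: str) -> bool:
    """True if s matches the simplified email pattern."""
    return EMAIL_RE.match(s) is not None

print(is_valid_email("name@domain.com"))  # True
print(is_valid_email("no-tld@domain"))    # False (no ".xx" TLD part)
print(is_valid_email("bad@@domain.com"))  # False (@ not allowed in domain)
```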

What is Finite Automata?

Finite Automata (FA) is a mathematical model of computation used in computer science,


linguistics, and NLP.​
It is mainly used to represent and recognize patterns in text (like regular expressions).​
Think of FA as a machine with limited memory that reads input one symbol at a time and
decides whether to accept or reject it.

Components of Finite Automata​


A finite automaton consists of:

1.​ States (Q) → A finite set of conditions the machine can be in.
2.​ Alphabet (Σ) → A finite set of input symbols (like {a, b}).
3.​ Transition function (δ) → Rules for moving from one state to another based on input.
4.​ Start state (q₀) → The state where the machine begins.
5.​ Final/accept states (F) → The states that represent successful recognition.

Types of Finite Automata

1.​ Deterministic Finite Automaton (DFA)


○​ For each state and input symbol, there is exactly one transition.
○​ Example:
■​ Pattern: strings ending with ab
■​ DFA ensures only one possible path for each input.
2.​ Non-Deterministic Finite Automaton (NFA)
○​ For each state and input symbol, there can be multiple transitions (including
ε-moves = empty moves).
○​ Easier to design, but equivalent to DFA in power (anything an NFA can do, a
DFA can also do).
3.​ ε-NFA (Epsilon NFA)
○​ NFA with ε-transitions, meaning it can move from one state to another without
consuming any input.
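The "strings ending with ab" DFA mentioned above can be sketched as a transition table in Python (state names q0–q2 are illustrative; the table only covers the alphabet {a, b}):

```python
# DFA for strings over {a, b} that end in "ab".
# q0 = start, q1 = just saw 'a', q2 = just saw "ab" (accepting).
TRANSITIONS = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}

def accepts(s: str) -> bool:
    state = "q0"
    for ch in s:
        # Deterministic: exactly one move per (state, symbol) pair
        state = TRANSITIONS[(state, ch)]
    return state == "q2"

print(accepts("aab"))  # True
print(accepts("aba"))  # False
```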


What is a Finite State Transducer?

A Finite State Transducer (FST) is an extension of a Finite State Automaton (FSA).

●​ An FSA only accepts or rejects strings (it tells if a string is valid or not).
●​ An FST not only accepts strings but also produces an output for each input.

How It Works

●​ Like FSA, an FST has:


○​ States (circles)
○​ Transitions (arrows between states)
○​ Start state and Final states
●​ Difference:​
In FST, each transition has input/output labels.​
Example: a:x means if the machine sees input a, it outputs x.

Example 1: Simple Translation

Imagine we want to convert lowercase letters to uppercase.

●​ Input: cat
●​ Output: CAT

Transitions might look like:

●​ c:C, a:A, t:T

So when the FST reads cat, it produces CAT.
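This letter-by-letter mapping can be sketched as a one-state transducer in Python (the `transduce` helper is illustrative and only covers the input letters c, a, t):

```python
# Each transition reads one input symbol and emits one output symbol,
# mirroring the input:output labels c:C, a:A, t:T above.
TRANSDUCTIONS = {"c": "C", "a": "A", "t": "T"}

def transduce(s: str) -> str:
    out = []
    for ch in s:
        out.append(TRANSDUCTIONS[ch])  # follow the transition, emit its output
    return "".join(out)

print(transduce("cat"))  # CAT
```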

Why Are FSTs Important in NLP?

1.​ Morphology:
○​ Converting word roots to inflected forms.
○​ Example: run → running, played → play + ed.
2.​ Phonology:
○​ Mapping between spelling and pronunciation.
○​ Example: cat → k æ t.
3.​ Machine Translation:
○​ Mapping words/phrases from one language to another.
4.​ Text-to-Speech (TTS):
○​ Input: read
○​ Output: pronunciation rules applied (reed or red depending on tense).

Word Tokenization

Word tokenization is the process of splitting a text into smaller pieces called tokens, where
tokens are usually words.​
It’s one of the first steps in Natural Language Processing (NLP) because most NLP tasks work
with words rather than raw text.

Why is it Important?

●​ Computers cannot understand raw text directly. Tokenization converts text into a
structured form.
●​ Helps in text analysis like:
○​ Counting words (word frequency)
○​ Finding keywords
○​ Performing sentiment analysis
○​ Feeding words into machine learning models
●​ It is the foundation for more advanced NLP tasks like POS tagging, parsing, and
named entity recognition.

Issues in tokenization

Punctuation Problems​
Punctuation can stick to words or be considered separate tokens depending on the context.
"Hello, world!"
Should we tokenize as:​
['Hello', ',', 'world', '!']​
or​
['Hello', 'world']?
Mismanaging punctuation can affect NLP tasks like sentiment analysis or named entity
recognition.
Contractions​
English contractions like "don't", "isn't", "we'll" can be tokenized in multiple ways:
Keep as single token: ["don't"]
Split into multiple: ["do", "n't"]
This affects language models, search, and text normalization.

Hyphenated Words & Compound Words​


Words with hyphens or compounds may be split incorrectly:
"state-of-the-art technology"
Options: ['state', 'of', 'the', 'art', 'technology'] or
['state-of-the-art', 'technology']
Tokenization must respect meaning.

Numbers, Dates, & Special Formats

Tokens like "12:30", "2025-08-20", "3.14" may be split incorrectly by simple tokenizers.​
Mis-tokenizing can break downstream tasks like information extraction.

Non-English & Space-less Languages​


Languages like Chinese, Japanese, or Thai don’t use spaces between words.
Requires specialized tokenizers (word dictionaries or statistical models).

Emojis, Hashtags, Mentions


Social media text introduces new challenges.​
Tokenizer must decide whether to split emojis, hashtags, or mentions as separate tokens.

Ambiguity​
Some strings can be interpreted in multiple ways:
"U.S.A. is amazing"
Tokenizer might treat "U.S.A." as one token or split into "U", ".", "S", ".", "A", "."
Compound Words
●​ Certain expressions carry meaning together: "New York", "machine learning"
●​ Tokenizing them as separate words may lose context in NLP tasks.

Tokenization is not just splitting text by spaces.​


Challenges include punctuation, contractions, numbers, languages, emojis, abbreviations,
multi-word phrases, and noisy text. Choosing the right tokenizer is crucial for accurate NLP
results.
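A single regex can address several of these issues at once. The pattern below is a sketch, not a production tokenizer: it keeps times, decimals, contractions, and hyphenated compounds whole, and splits other punctuation into separate tokens:

```python
import re

# Alternatives are tried left to right, so numbers/times are
# checked before plain words.
TOKEN_RE = re.compile(r"""
    \d+(?:[.:]\d+)?      # numbers, decimals, times: 3.14, 12:30
  | \w+(?:['-]\w+)*      # words, contractions (don't), hyphens (state-of-the-art)
  | [^\w\s]              # any other single punctuation mark as its own token
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
print(tokenize("state-of-the-art tech, don't split"))
# ['state-of-the-art', 'tech', ',', "don't", 'split']
```

Note what it still gets wrong: "U.S.A." is split into letters and dots, and emojis or curly apostrophes would need extra alternatives; choosing these trade-offs is exactly the tokenizer-design problem described above.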

Stemming (Simple Explanation)


Stemming is the process of reducing a word to its root/base form by chopping off prefixes or
suffixes.​
Example:
●​ playing, played, plays → play
●​ fishing, fished, fisher → fish
It’s like saying: "No matter how the word changes with tense, plural, or form, just keep the
main part." ​
Stemming is a text normalization technique used in Natural Language Processing (NLP) and
Information Retrieval (IR).​
It helps computers understand that different word forms have the same meaning.
Types of Stemming Algorithms
1.​ Rule-Based Stemming
○​ Applies predefined rules for suffix/prefix removal.
○​ Example: If a word ends in "ing" → remove "ing".
2.​ Porter Stemmer (Most popular)
○​ Uses step-by-step rules (suffix stripping).
○​ Example: caresses → caress, ponies → poni.
Advantages
●​ Reduces vocabulary size.
●​ Improves efficiency in search engines (finds related words).
●​ Helps in information retrieval and sentiment analysis.
Limitations
●​ Can produce non-words (e.g., studies → studi).
●​ Sometimes too aggressive (removes too much) or too light (removes too little).
●​ Doesn’t always respect grammar or meaning.​
Porter Stemming Algorithm (Detailed Steps)

It works by removing common morphological and inflectional endings (like -ing, -ed, -ly, -es)
from words to get the root form.

Step 1: Define Measure (m)


The algorithm is based on a measure m, which counts the number of VC (vowel-consonant)
sequences.
● Example:
○ ca → m = 0 (C V)
○ trouble → m = 1 (C VC V)
○ hopping → m = 2 (C VC VC)
This measure is used to decide when to remove suffixes.
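The measure can be computed by collapsing a word into its consonant/vowel pattern. The sketch below treats y as a vowel when it follows a consonant, a simplification of Porter's actual rule:

```python
def measure(word: str) -> int:
    """Porter's measure m: the number of VC sequences in the
    [C](VC)^m[V] form of a word."""
    w = word.lower()
    vowels = "aeiou"
    pattern = []
    for i, ch in enumerate(w):
        # 'y' acts as a vowel after a consonant (happy -> ...v)
        is_vowel = ch in vowels or (ch == "y" and i > 0 and w[i - 1] not in vowels)
        pattern.append("v" if is_vowel else "c")
    # Collapse runs ("ccvvccv" -> "cvcv"); each "vc" pair is one VC sequence
    collapsed = "".join(p for i, p in enumerate(pattern)
                        if i == 0 or p != pattern[i - 1])
    return collapsed.count("vc")

print(measure("ca"))       # 0
print(measure("trouble"))  # 1
print(measure("hopping"))  # 2
```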

Step 2: Apply Rules in Phases


The algorithm goes through five main phases (steps), each with its own set of rules.

Step 2a: Remove plurals and past tense suffixes


Rules:
●​ sses → ss (e.g., caresses → caress)
●​ ies → i (e.g., ponies → poni)
●​ ss → ss (unchanged)
●​ s → (remove) (cats → cat)
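These four rules can be written out directly, tried in order with the longest suffix first (the `step_1a` name is illustrative):

```python
def step_1a(word: str) -> str:
    """Porter step 1a: strip plural endings. Rules are tried in
    order, so the longer suffixes win over the bare -s rule."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss  (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # ies -> i    (ponies -> poni)
    if word.endswith("ss"):
        return word        # ss -> ss    (unchanged)
    if word.endswith("s"):
        return word[:-1]   # s -> ""     (cats -> cat)
    return word

print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("cats"))      # cat
```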

Step 2b: Remove -ed and -ing


●​ If the word ends with -ed or -ing, remove it if the remaining part has a vowel.
●​ After removal:
○ If it ends with at, bl, or iz → add e (conflat(ed) → conflate)
○​ If it ends with a double consonant (e.g., -tt, -ss) → remove one letter (hopping →
hop)
○ If it has the form CVC (consonant-vowel-consonant) and the last consonant is
not w, x, y → add e (fil(ing) → fil → file)

Step 2c: Replace -y with -i if there’s a vowel in the stem


● Example: happy → happi, sky → sky (unchanged, because the stem "sk" contains no vowel)

Step 2d: Handle longer suffixes


If word has m > 0, replace:
●​ ational → ate (relational → relate)
●​ tional → tion (conditional → condition)
●​ enci → ence (valenci → valence)
●​ anci → ance (hesitanci → hesitance)
●​ izer → ize (analyzer → analyze)
●​ bli → ble (abli → able)
... and so on for a large set of suffixes.

Step 2e: Remove further common endings


If word has m > 1:
●​ Remove al, ance, ence, er, ic, able, ible, ant, ement, ment, ent, ism, ate, iti, ous, ive, ize
Examples:
●​ rational → ration
●​ comfortable → comfort

Step 3: Final clean-up


Step 3a: Remove a final -e if m > 1 (or if m = 1 and the stem does not end in CVC)
● probate → probat
● rate → rate (kept, since removing -e would leave a short CVC stem)​

Step 3b: If word ends in double -ll and m > 1, remove one l
● controll → control
● roll → roll (unchanged, m = 1)​

Lemmatization in NLP
Lemmatization is the process of reducing a word to its base or dictionary form (called a
lemma).​
Unlike stemming (which just chops off suffixes blindly), lemmatization uses linguistic
knowledge, grammar, and dictionaries to return a valid word.
Example:
●​ "better" → good (lemma, based on grammar/dictionary)
●​ "running" → run
●​ "studies" → study
●​ "mice" → mouse
Why Lemmatization is Important?
●​ In natural language, many words appear in different forms (plural, tense, comparison,
etc.).
●​ To analyze text properly (search engines, chatbots, sentiment analysis), we need to map
all these forms to one base form.

How Lemmatization Works (Detailed but Simple)


1.​ Tokenization​
Break text into words (tokens).​
Example: "The cats are running fast" → ["The", "cats", "are", "running", "fast"]
2.​ Part-of-Speech (POS) Tagging​
Identify the role of each word (noun, verb, adjective, etc.).​
Example:
○​ cats → noun (NNS: plural noun)
○​ running → verb (VBG: present participle)
3.​ Morphological Analysis
○​ Checks word endings (suffixes, inflections).
○​ Example: cats → ends with "s" → possible plural form.
○​ running → ends with "ing" → verb in continuous tense.
4.​ Dictionary Lookup (Lexicon)​
Uses a dictionary (WordNet or custom lexicon) to find the lemma.
○​ cats → cat
○​ running → run
○​ better → good (special case: irregular forms)
5.​ Return Lemma​
Replace the word with its lemma for processing.
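The dictionary-lookup steps above can be sketched with a tiny hand-made lexicon standing in for WordNet (all entries and the POS tag names are illustrative; a real lemmatizer would also need the morphological-analysis fallback for unseen words):

```python
# A toy lexicon: the POS tag selects which table to consult,
# which is how "studies" can lemmatize differently as noun vs verb.
LEXICON = {
    "noun": {"cats": "cat", "mice": "mouse", "studies": "study"},
    "verb": {"running": "run", "are": "be", "studies": "study"},
    "adj":  {"better": "good", "bigger": "big"},
}

def lemmatize(word: str, pos: str) -> str:
    # Fall back to the word itself when it is not in the lexicon
    return LEXICON.get(pos, {}).get(word.lower(), word.lower())

print(lemmatize("better", "adj"))    # good
print(lemmatize("running", "verb"))  # run
print(lemmatize("mice", "noun"))     # mouse
```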

Lemmatization vs Stemming

| Feature  | Stemming                                      | Lemmatization                                  |
|----------|-----------------------------------------------|------------------------------------------------|
| Method   | Removes suffixes/prefixes mechanically        | Uses grammar + dictionary                      |
| Output   | May produce non-words (e.g., studies → studi) | Always valid words (e.g., studies → study)     |
| Accuracy | Less accurate                                 | More accurate                                  |
| Speed    | Faster                                        | Slower (needs dictionary)                      |
| Example  | "running" → run, "studies" → studi            | "running" → run, "studies" → study             |

In simple words:
Lemmatization = smart stemming.
It reduces words to their base form using grammar rules + dictionary knowledge, ensuring
the result is always a meaningful word.​

What is Edit Distance?

Edit distance between two strings is the minimum number of edits needed to turn one string
into the other.​
An edit is typically:
●​ Insertion (insert a character)
●​ Deletion (remove a character)
●​ Substitution (replace one character with another)

Why it matters (common uses)


●​ Spell-checking / autocorrect (closest dictionary word)
●​ Fuzzy search & de-duplication (“Jon” vs “John”)
● Plagiarism detection & log comparison
●​ Machine Translation evaluation (TER)
●​ ASR/OCR post-processing
●​ Bioinformatics (sequence alignment; often with custom costs)
Algorithm (DP Approach)
1. Define the Problem
Let:
●​ s1 = first string of length m
●​ s2 = second string of length n
We create a DP table dp[m+1][n+1] where:
●​ dp[i][j] = minimum edit distance between first i characters of s1 and first j characters of
s2
2. Base Cases
●​ If s1 is empty (i = 0), we need j insertions:​
dp[0][j] = j
●​ If s2 is empty (j = 0), we need i deletions:​
dp[i][0] = i
3. Recursive Relation
For each i and j:
●​ If characters are equal (s1[i-1] == s2[j-1]):​
dp[i][j] = dp[i-1][j-1]
● Else, take the minimum of three possibilities:
1. Insert (add character from s2[j-1]):​
dp[i][j-1] + 1
2. Delete (remove s1[i-1]):​
dp[i-1][j] + 1
3. Replace (change s1[i-1] to s2[j-1]):​
dp[i-1][j-1] + 1
So:​
dp[i][j] = min( dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + 1 )
(With all three edits costing 1 this is the standard Levenshtein distance; some textbooks instead charge 2 for a substitution, treating it as a delete plus an insert.)
4. Final Answer
The result is stored in dp[m][n].
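The full procedure, with unit cost for all three edits (the standard Levenshtein convention), fits in a few lines of Python:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    (each costing 1) to turn s1 into s2."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i            # empty s2: delete all i characters of s1
    for j in range(n + 1):
        dp[0][j] = j            # empty s1: insert all j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # match: no edit needed
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # delete from s1
                                   dp[i][j - 1],      # insert from s2
                                   dp[i - 1][j - 1])  # substitute
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```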

What is a Collocation?
A collocation is a sequence of words that occur together more often than would be expected
by chance.​
In simple terms: Certain words naturally "go together".
For example:
●​ "strong tea" ✅ (natural collocation)
●​ "powerful tea" ❌ (odd, even though strong and powerful are synonyms)
So, collocations represent habitual word pairings in a language.

Types of Collocations
Collocations can be of different types depending on grammatical relation:
1.​ Adjective + Noun
○​ strong tea, heavy rain, fast food
2.​ Noun + Noun
○​ data science, climate change, traffic jam
3.​ Verb + Noun
○​ make a decision, take a risk, catch a cold
4.​ Verb + Preposition
○​ depend on, rely upon, approve of
5.​ Adverb + Adjective
○​ deeply concerned, highly effective, completely wrong
6.​ Verb + Adverb
○​ run quickly, argue strongly, work hard

Importance of Collocations in NLP

Collocations are very important in Natural Language Processing (NLP) because they:

●​ Improve machine translation (Google Translate should not translate “strong tea”
word-by-word).
●​ Help in information retrieval (search engines need to recognize "artificial intelligence"
as one concept).
●​ Enhance speech recognition & text-to-speech (natural sounding output).
●​ Improve sentiment analysis (e.g., “hot topic” ≠ “high temperature”).
●​ Used in word embeddings and language models (to capture contextual meaning).
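One common way to detect collocations like "strong tea" is pointwise mutual information (PMI) over bigram counts: how much more often do two words co-occur than chance would predict? The sketch below uses a toy corpus; real collocation extraction needs millions of tokens:

```python
import math
from collections import Counter

# Toy corpus (illustrative); real work would use a large corpus.
tokens = ("strong tea is good . i drink strong tea daily . "
          "tea is hot . strong coffee too").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1: str, w2: str) -> float:
    """PMI = log2( P(w1 w2) / (P(w1) * P(w2)) ).
    Positive and large when the pair co-occurs more than chance;
    only meaningful when the bigram was actually observed."""
    p_pair = bigrams[(w1, w2)] / (N - 1)
    p1, p2 = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p_pair / (p1 * p2))

print(pmi("strong", "tea"))  # clearly positive: a collocation candidate
```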


What is Morphological Analysis?
Morphology studies how words are built from smaller meaning units called morphemes
(roots/stems, prefixes, suffixes, infixes, clitics).​
Morphological analysis is the NLP task of:
1.​ Segmenting a surface word into morphemes,​

2.​ Recovering the lemma (dictionary base form), and


3.​ Assigning grammatical features (tense, number, case, person, gender, mood, aspect,
etc.).​
The inverse task is morphological generation: produce the correct surface form from a
lemma + feature bundle.​

Why it matters
●​ Normalization & search: “runs, ran, running” → lemma run for better recall.
●​ Parsing & translation: correct features (case, gender, tense) guide syntax and choice in
target language.
●​ Speech & TTS: pronunciation and stress often depend on morphology.
●​ Low-resource & morphologically rich languages: huge gains when tokens pack lots of
grammar (Turkish, Finnish, Arabic, Hindi).

Morphemes
●​ Definition: The smallest meaningful unit of language.
●​ They cannot be divided further without losing or changing their meaning.
●​ Example:
○​ "unhappiness" → un- + happy + -ness
○​ Here:
■​ un- = prefix (meaning "not")
■​ happy = stem/root (main meaning)
■​ -ness = suffix (turns adjective into noun)

Stems
●​ The core meaning-bearing unit of a word.
●​ It carries the main semantic meaning, and affixes attach to it.
●​ Example:
○​ In "played", stem = play
○​ In "unfriendly", stem = friend
Affixes
●​ Definition: Small units (morphemes) that attach to stems to modify meaning or
grammatical function.
●​ Types of affixes:
1.​ Prefix – added before the stem.
■​ Example: un- in unhappy
2.​ Suffix – added after the stem.
■​ Example: -ed in played
3.​ Infix – inserted inside the stem (rare in English, common in other languages).
■​ Example (Tagalog): sulat (write) → sumulat (wrote)
4.​ Circumfix – added around the stem (prefix + suffix together).
■​ Example (German): ge-lieb-t ("loved")
Why Are These Important?
●​ They form the basis of Morphological Analysis (studying word structure).
●​ Used in:
○​ Stemming (cutting to stem)
○​ Lemmatization (finding dictionary form)
○​ POS tagging (deciding if a word is noun/verb etc., often depends on suffix)
○​ Machine Translation & Spell Checking (handling word variations)
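A naive morphological analyzer can be sketched as affix stripping against small, purely illustrative prefix/suffix lists. Note that it segments spelling variants as-is (happi, not happy): undoing spelling changes is the lemmatizer's job, not the segmenter's:

```python
# Illustrative affix lists, not exhaustive.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "ly", "er"]

def segment(word: str) -> list[str]:
    """Strip at most one known prefix and one known suffix,
    keeping a minimum stem length so 'red' is not split as re- + d."""
    parts, stem = [], word
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) > len(p) + 2:
            parts.append(p + "-")
            stem = stem[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) > len(s) + 2:
            suffix = "-" + s
            stem = stem[:-len(s)]
            break
    parts.append(stem)
    if suffix:
        parts.append(suffix)
    return parts

print(segment("unhappiness"))  # ['un-', 'happi', '-ness']
print(segment("played"))       # ['play', '-ed']
```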

Derivational Morphology
Focus: Creates new words by adding affixes (prefixes/suffixes) to a root/stem.​
Changes in meaning and often the part of speech.
Examples:
●​ "happy" → "unhappy" (prefix un- changes meaning)
●​ "teach" → "teacher" (suffix -er changes verb → noun)
●​ "nation" → "national" → "nationalize" → "nationalization"
Each step derives a new word with a distinct meaning.

Inflectional Morphology
Focus: Changes the form of a word to express grammatical categories.​
Does NOT create a new word; only modifies tense, number, gender, case, etc.​
The word stays in the same part of speech.
Examples:
●​ Nouns:
○​ "cat" → "cats" (plural)
○​ "child" → "children" (plural)
●​ Verbs:
○​ "play" → "played" (past tense)
○​ "go" → "going" (progressive form)
●​ Adjectives:
○​ "big" → "bigger" → "biggest"​

Key Differences
| Feature          | Derivational Morphology | Inflectional Morphology   |
|------------------|-------------------------|---------------------------|
| Purpose          | Create new words        | Show grammatical features |
| Changes POS?     | Often yes               | No (POS stays the same)   |
| Meaning          | New meaning             | Same meaning, new form    |
| Examples         | happy → happiness       | walk → walked             |
| Count in English | Many (open set)         | Limited (only 8 inflections in English: plural -s, possessive 's, 3rd-person -s, past -ed, progressive -ing, past participle -en/-ed, comparative -er, superlative -est) |
