
NLP MODULE 2

SAMI THAKUR

What is a Corpus

In Natural Language Processing (NLP), a corpus (plural: corpora) is a large and structured
collection of texts that are stored in electronic form.​
It is used as a dataset for training, testing, and evaluating NLP models.

Think of it as the "text database" that computers use to learn language.​


For example, if you want a computer to understand English grammar, you give it a huge
collection of English sentences (corpus).

Features of a Corpus:

1.​ Large Size – Usually contains millions of words/sentences.


2.​ Structured – Text is often annotated (tagged with parts of speech, meaning, etc.).
3.​ Domain-specific or General – Some corpora are built for a specific domain (like medical
text), while others are general.

Examples of Famous Corpora in NLP:

1.​ Brown Corpus – One of the first English corpora, with over 1 million words.
2.​ British National Corpus (BNC) – A 100-million-word collection of samples of written
and spoken English.
3.​ COCA (Corpus of Contemporary American English) – A large, balanced corpus with 1
billion words from spoken, fiction, magazines, newspapers, and academic texts.
4.​ Penn Treebank – Includes syntactically parsed sentences (used for training parsers).
5.​ Google N-gram Corpus – Contains phrases and word sequences from billions of words
collected from the web.
6.​ Wikipedia Dumps – Used as an open-source corpus for general NLP tasks.
7.​ Domain-Specific Corpora –
○​ Medical: MIMIC-III (clinical text dataset).
○​ Legal: EUR-Lex corpus (legal documents).
○​ Social Media: Twitter Corpus.​
Regular Expressions (Regex)

A Regular Expression (Regex) is a sequence of characters that defines a search pattern. It is


mainly used for string matching and text processing, such as searching, extracting, or
replacing text.

Types of Regular Expressions

Regular expressions can be classified into different types based on what they match.

1. Literal Characters

●​ Match exact characters as they are.


●​ Example:
○​ Regex: cat
○​ Matches: "cat" in "The cat is sleeping."
○​ Does not match "Cat" (case-sensitive).

2. Metacharacters​
Special characters that have specific meanings in regex, for example:
● . (any single character), ^ (start of string), $ (end of string)
● * + ? (repetition), | (alternation), \ (escape)
● ( ) (grouping), [ ] (character class)

3. Character Classes​
Used to match a set or range of characters.​
Example: [abc] matches a, b, or c; [a-z] matches any lowercase letter; [^0-9] matches any non-digit.

4. Predefined Character Classes​
Shortcut notations for common sets:
● \d → digit [0-9], \D → non-digit
● \w → word character [a-zA-Z0-9_], \W → non-word character
● \s → whitespace, \S → non-whitespace

5. Quantifiers​
Used to specify how many times a pattern should repeat:
● * → zero or more, + → one or more, ? → zero or one
● {n} → exactly n times, {n,m} → between n and m times
6. Grouping & Alternation

●​ Grouping ( ): To group patterns together.​


Example: (ab)+ → matches ab, abab, ababab.
●​ Alternation |: OR condition.​
Example: cat|dog → matches cat or dog.

7. Anchors & Boundaries

●​ \b → Matches word boundary (beginning or end of a word).


○​ Example: \bcat\b matches "cat" but not "category".
●​ \B → Matches non-word boundary.
○​ Example: \Bcat matches "educate".
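These boundary patterns are easy to try with Python's `re` module (a quick sketch):

```python
import re

text = "The cat sat near the category of cats."

# \bcat\b matches "cat" only as a whole word, so "category"
# and "cats" are skipped
whole = re.findall(r"\bcat\b", text)

# \Bcat requires a word character immediately before "cat",
# so it matches inside "educate" but not at a word start
embedded = re.findall(r"\Bcat", "educate")

print(whole)     # ['cat']
print(embedded)  # ['cat']
```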

Practical Example: Email Validation (Simplified)

Let's build a (basic) regex to match simple email addresses like

name@domain.com.

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

1.​ Anchor: ^ - Start of the string.


2.​ Character Class: [a-zA-Z0-9._%+-] - Local part can contain letters, numbers, and these
symbols.
3.​ Quantifier: + - The local part must have one or more of the allowed characters.
4.​ Literal: @ - The literal "at" symbol.
5.​ Character Class: [a-zA-Z0-9.-] - Domain name (letters, numbers, dots, hyphens).
6.​ Quantifier: + - Domain name must have one or more allowed characters.
7.​ Literal: \. - A literal dot (escaped because . is a metacharacter). This is the "dot" in
".com".
8.​ Character Class: [a-zA-Z] - Top-Level Domain (TLD) must be letters.
9.​ Quantifier: {2,} - TLD must be at least 2 characters long (e.g., .com, .io, .org).
10.​Anchor: $ - End of the string.
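The pattern above can be exercised directly with Python's `re` module. This is a sketch only (the `is_valid_email` helper name is illustrative, and real-world email validation needs a far fuller grammar than this simplified regex):

```python
import re

# The simplified pattern from the walkthrough above
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def is_valid_email(s: str) -> bool:
    """True if s matches the simplified email pattern."""
    return EMAIL_RE.match(s) is not None

print(is_valid_email("name@domain.com"))  # True
print(is_valid_email("no-tld@domain"))    # False (no ".xx" TLD part)
print(is_valid_email("bad@@domain.com"))  # False (@ not allowed in domain)
```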

What is Finite Automata?

Finite Automata (FA) is a mathematical model of computation used in computer science,


linguistics, and NLP.​
It is mainly used to represent and recognize patterns in text (like regular expressions).​
Think of FA as a machine with limited memory that reads input one symbol at a time and
decides whether to accept or reject it.

Components of Finite Automata​


A finite automaton consists of:

1.​ States (Q) → A finite set of conditions the machine can be in.
2.​ Alphabet (Σ) → A finite set of input symbols (like {a, b}).
3.​ Transition function (δ) → Rules for moving from one state to another based on input.
4.​ Start state (q₀) → The state where the machine begins.
5.​ Final/accept states (F) → The states that represent successful recognition.

Types of Finite Automata

1.​ Deterministic Finite Automaton (DFA)


○​ For each state and input symbol, there is exactly one transition.
○​ Example:
■​ Pattern: strings ending with ab
■​ DFA ensures only one possible path for each input.
2.​ Non-Deterministic Finite Automaton (NFA)
○​ For each state and input symbol, there can be multiple transitions (including
ε-moves = empty moves).
○​ Easier to design, but equivalent to DFA in power (anything an NFA can do, a
DFA can also do).
3.​ ε-NFA (Epsilon NFA)
○​ NFA with ε-transitions, meaning it can move from one state to another without
consuming any input.
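The "strings ending with ab" DFA mentioned above can be sketched as a transition table in Python (state names q0–q2 are illustrative; the table only covers the alphabet {a, b}):

```python
# DFA for strings over {a, b} that end in "ab".
# q0 = start, q1 = just saw 'a', q2 = just saw "ab" (accepting).
TRANSITIONS = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}

def accepts(s: str) -> bool:
    state = "q0"
    for ch in s:
        # Deterministic: exactly one move per (state, symbol) pair
        state = TRANSITIONS[(state, ch)]
    return state == "q2"

print(accepts("aab"))  # True
print(accepts("aba"))  # False
```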


What is a Finite State Transducer?

A Finite State Transducer (FST) is an extension of a Finite State Automaton (FSA).

●​ An FSA only accepts or rejects strings (it tells if a string is valid or not).
●​ An FST not only accepts strings but also produces an output for each input.

How It Works

●​ Like FSA, an FST has:


○​ States (circles)
○​ Transitions (arrows between states)
○​ Start state and Final states
●​ Difference:​
In FST, each transition has input/output labels.​
Example: a:x means if the machine sees input a, it outputs x.

Example 1: Simple Translation

Imagine we want to convert lowercase letters to uppercase.

●​ Input: cat
●​ Output: CAT

Transitions might look like:

●​ c:C, a:A, t:T

So when the FST reads cat, it produces CAT.
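This letter-by-letter mapping can be sketched as a one-state transducer in Python (the `transduce` helper is illustrative and only covers the input letters c, a, t):

```python
# Each transition reads one input symbol and emits one output symbol,
# mirroring the input:output labels c:C, a:A, t:T above.
TRANSDUCTIONS = {"c": "C", "a": "A", "t": "T"}

def transduce(s: str) -> str:
    out = []
    for ch in s:
        out.append(TRANSDUCTIONS[ch])  # follow the transition, emit its output
    return "".join(out)

print(transduce("cat"))  # CAT
```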

Why Are FSTs Important in NLP?

1.​ Morphology:
○​ Converting word roots to inflected forms.
○​ Example: run → running, played → play + ed.
2.​ Phonology:
○​ Mapping between spelling and pronunciation.
○​ Example: cat → k æ t.
3.​ Machine Translation:
○​ Mapping words/phrases from one language to another.
4.​ Text-to-Speech (TTS):
○​ Input: read
○​ Output: pronunciation rules applied (reed or red depending on tense).

Word Tokenization

Word tokenization is the process of splitting a text into smaller pieces called tokens, where
tokens are usually words.​
It’s one of the first steps in Natural Language Processing (NLP) because most NLP tasks work
with words rather than raw text.

Why is it Important?

●​ Computers cannot understand raw text directly. Tokenization converts text into a
structured form.
●​ Helps in text analysis like:
○​ Counting words (word frequency)
○​ Finding keywords
○​ Performing sentiment analysis
○​ Feeding words into machine learning models
●​ It is the foundation for more advanced NLP tasks like POS tagging, parsing, and
named entity recognition.

Issues in tokenization

Punctuation Problems​
Punctuation can stick to words or be considered separate tokens depending on the context.
"Hello, world!"
Should we tokenize as:​
['Hello', ',', 'world', '!']​
or​
['Hello', 'world']?
Mismanaging punctuation can affect NLP tasks like sentiment analysis or named entity
recognition.
Contractions​
English contractions like "don't", "isn't", "we'll" can be tokenized in multiple ways:
Keep as single token: ["don't"]
Split into multiple: ["do", "n't"]
This affects language models, search, and text normalization.

Hyphenated Words & Compound Words​


Words with hyphens or compounds may be split incorrectly:
"state-of-the-art technology"
Options: ['state', 'of', 'the', 'art', 'technology'] or
['state-of-the-art', 'technology']
Tokenization must respect meaning.

Numbers, Dates, & Special Formats

Tokens like "12:30", "2025-08-20", "3.14" may be split incorrectly by simple tokenizers.​
Mis-tokenizing can break downstream tasks like information extraction.

Non-English & Space-less Languages​


Languages like Chinese, Japanese, or Thai don’t use spaces between words.
Requires specialized tokenizers (word dictionaries or statistical models).

Emojis, Hashtags, Mentions


Social media text introduces new challenges.​
Tokenizer must decide whether to split emojis, hashtags, or mentions as separate tokens.

Ambiguity​
Some strings can be interpreted in multiple ways:
"U.S.A. is amazing"
Tokenizer might treat "U.S.A." as one token or split into "U", ".", "S", ".", "A", "."
Compound Words
●​ Certain expressions carry meaning together: "New York", "machine learning"
●​ Tokenizing them as separate words may lose context in NLP tasks.

Tokenization is not just splitting text by spaces.​


Challenges include punctuation, contractions, numbers, languages, emojis, abbreviations,
multi-word phrases, and noisy text. Choosing the right tokenizer is crucial for accurate NLP
results.
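A single regex can address several of these issues at once. The pattern below is a sketch, not a production tokenizer: it keeps times, decimals, contractions, and hyphenated compounds whole, and splits other punctuation into separate tokens:

```python
import re

# Alternatives are tried left to right, so numbers/times are
# checked before plain words.
TOKEN_RE = re.compile(r"""
    \d+(?:[.:]\d+)?      # numbers, decimals, times: 3.14, 12:30
  | \w+(?:['-]\w+)*      # words, contractions (don't), hyphens (state-of-the-art)
  | [^\w\s]              # any other single punctuation mark as its own token
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
print(tokenize("state-of-the-art tech, don't split"))
# ['state-of-the-art', 'tech', ',', "don't", 'split']
```

Note what it still gets wrong: "U.S.A." is split into letters and dots, and emojis or curly apostrophes would need extra alternatives; choosing these trade-offs is exactly the tokenizer-design problem described above.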

Stemming (Simple Explanation)


Stemming is the process of reducing a word to its root/base form by chopping off prefixes or
suffixes.​
Example:
●​ playing, played, plays → play
●​ fishing, fished, fisher → fish
It’s like saying: "No matter how the word changes with tense, plural, or form, just keep the
main part." ​
Stemming is a text normalization technique used in Natural Language Processing (NLP) and
Information Retrieval (IR).​
It helps computers understand that different word forms have the same meaning.
Types of Stemming Algorithms
1.​ Rule-Based Stemming
○​ Applies predefined rules for suffix/prefix removal.
○​ Example: If a word ends in "ing" → remove "ing".
2.​ Porter Stemmer (Most popular)
○​ Uses step-by-step rules (suffix stripping).
○​ Example: caresses → caress, ponies → poni.
Advantages
●​ Reduces vocabulary size.
●​ Improves efficiency in search engines (finds related words).
●​ Helps in information retrieval and sentiment analysis.
Limitations
●​ Can produce non-words (e.g., studies → studi).
●​ Sometimes too aggressive (removes too much) or too light (removes too little).
●​ Doesn’t always respect grammar or meaning.​
Porter Stemming Algorithm (Detailed Steps)

It works by removing common morphological and inflectional endings (like -ing, -ed, -ly, -es)
from words to get the root form.

Step 1: Define Measure (m)


The algorithm is based on a measure m, which counts the number of VC (vowel-consonant)
sequences.
● Example:
○ ca → m = 0 (C V)
○ trouble → m = 1 (C VC V)
○ hopping → m = 2 (C VC VC)
This measure is used to decide when to remove suffixes.
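The measure can be computed by collapsing a word into its consonant/vowel pattern. The sketch below treats y as a vowel when it follows a consonant, a simplification of Porter's actual rule:

```python
def measure(word: str) -> int:
    """Porter's measure m: the number of VC sequences in the
    [C](VC)^m[V] form of a word."""
    w = word.lower()
    vowels = "aeiou"
    pattern = []
    for i, ch in enumerate(w):
        # 'y' acts as a vowel after a consonant (happy -> ...v)
        is_vowel = ch in vowels or (ch == "y" and i > 0 and w[i - 1] not in vowels)
        pattern.append("v" if is_vowel else "c")
    # Collapse runs ("ccvvccv" -> "cvcv"); each "vc" pair is one VC sequence
    collapsed = "".join(p for i, p in enumerate(pattern)
                        if i == 0 or p != pattern[i - 1])
    return collapsed.count("vc")

print(measure("ca"))       # 0
print(measure("trouble"))  # 1
print(measure("hopping"))  # 2
```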

Step 2: Apply Rules in Phases


The algorithm goes through five main phases (steps), each with its own set of rules.

Step 2a: Remove plurals and past tense suffixes


Rules:
●​ sses → ss (e.g., caresses → caress)
●​ ies → i (e.g., ponies → poni)
●​ ss → ss (unchanged)
●​ s → (remove) (cats → cat)
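These four rules can be written out directly, tried in order with the longest suffix first (the `step_1a` name is illustrative):

```python
def step_1a(word: str) -> str:
    """Porter step 1a: strip plural endings. Rules are tried in
    order, so the longer suffixes win over the bare -s rule."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss  (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # ies -> i    (ponies -> poni)
    if word.endswith("ss"):
        return word        # ss -> ss    (unchanged)
    if word.endswith("s"):
        return word[:-1]   # s -> ""     (cats -> cat)
    return word

print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("cats"))      # cat
```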

Step 2b: Remove -ed and -ing


●​ If the word ends with -ed or -ing, remove it if the remaining part has a vowel.
●​ After removal:
○ If it ends with at, bl, or iz → add e (conflat(ed) → conflate)
○​ If it ends with a double consonant (e.g., -tt, -ss) → remove one letter (hopping →
hop)
○ If it has the form CVC (consonant-vowel-consonant) and the last consonant is
not w, x, y → add e (fil(ing) → fil → file)

Step 2c: Replace -y with -i if there’s a vowel in the stem


● Example: happy → happi, sky → sky (unchanged, because the stem "sk" contains no vowel)

Step 2d: Handle longer suffixes


If word has m > 0, replace:
●​ ational → ate (relational → relate)
●​ tional → tion (conditional → condition)
●​ enci → ence (valenci → valence)
●​ anci → ance (hesitanci → hesitance)
●​ izer → ize (analyzer → analyze)
●​ bli → ble (abli → able)
... and so on for a large set of suffixes.

Step 2e: Remove further common endings


If word has m > 1:
●​ Remove al, ance, ence, er, ic, able, ible, ant, ement, ment, ent, ism, ate, iti, ous, ive, ize
Examples:
●​ rational → ration
●​ comfortable → comfort

Step 3: Final clean-up


Step 3a: Remove a final -e if m > 1 (or if m = 1 and the stem does not end in CVC)
● probate → probat
● rate → rate (kept, since removing -e would leave a short CVC stem)​

Step 3b: If word ends in double -ll and m > 1, remove one l
● controll → control
● roll → roll (unchanged, m = 1)​

Lemmatization in NLP
Lemmatization is the process of reducing a word to its base or dictionary form (called a
lemma).​
Unlike stemming (which just chops off suffixes blindly), lemmatization uses linguistic
knowledge, grammar, and dictionaries to return a valid word.
Example:
●​ "better" → good (lemma, based on grammar/dictionary)
●​ "running" → run
●​ "studies" → study
●​ "mice" → mouse
Why Lemmatization is Important?
●​ In natural language, many words appear in different forms (plural, tense, comparison,
etc.).
●​ To analyze text properly (search engines, chatbots, sentiment analysis), we need to map
all these forms to one base form.

How Lemmatization Works (Detailed but Simple)


1.​ Tokenization​
Break text into words (tokens).​
Example: "The cats are running fast" → ["The", "cats", "are", "running", "fast"]
2.​ Part-of-Speech (POS) Tagging​
Identify the role of each word (noun, verb, adjective, etc.).​
Example:
○​ cats → noun (NNS: plural noun)
○​ running → verb (VBG: present participle)
3.​ Morphological Analysis
○​ Checks word endings (suffixes, inflections).
○​ Example: cats → ends with "s" → possible plural form.
○​ running → ends with "ing" → verb in continuous tense.
4.​ Dictionary Lookup (Lexicon)​
Uses a dictionary (WordNet or custom lexicon) to find the lemma.
○​ cats → cat
○​ running → run
○​ better → good (special case: irregular forms)
5.​ Return Lemma​
Replace the word with its lemma for processing.
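The dictionary-lookup steps above can be sketched with a tiny hand-made lexicon standing in for WordNet (all entries and the POS tag names are illustrative; a real lemmatizer would also need the morphological-analysis fallback for unseen words):

```python
# A toy lexicon: the POS tag selects which table to consult,
# which is how "studies" can lemmatize differently as noun vs verb.
LEXICON = {
    "noun": {"cats": "cat", "mice": "mouse", "studies": "study"},
    "verb": {"running": "run", "are": "be", "studies": "study"},
    "adj":  {"better": "good", "bigger": "big"},
}

def lemmatize(word: str, pos: str) -> str:
    # Fall back to the word itself when it is not in the lexicon
    return LEXICON.get(pos, {}).get(word.lower(), word.lower())

print(lemmatize("better", "adj"))    # good
print(lemmatize("running", "verb"))  # run
print(lemmatize("mice", "noun"))     # mouse
```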

Lemmatization vs Stemming

| Feature  | Stemming                                      | Lemmatization                                  |
|----------|-----------------------------------------------|------------------------------------------------|
| Method   | Removes suffixes/prefixes mechanically        | Uses grammar + dictionary                      |
| Output   | May produce non-words (e.g., studies → studi) | Always valid words (e.g., studies → study)     |
| Accuracy | Less accurate                                 | More accurate                                  |
| Speed    | Faster                                        | Slower (needs dictionary)                      |
| Example  | "running" → run, "studies" → studi            | "running" → run, "studies" → study             |

In simple words:
Lemmatization = smart stemming.
It reduces words to their base form using grammar rules + dictionary knowledge, ensuring
the result is always a meaningful word.​

What is Edit Distance?

Edit distance between two strings is the minimum number of edits needed to turn one string
into the other.​
An edit is typically:
●​ Insertion (insert a character)
●​ Deletion (remove a character)
●​ Substitution (replace one character with another)

Why it matters (common uses)


●​ Spell-checking / autocorrect (closest dictionary word)
●​ Fuzzy search & de-duplication (“Jon” vs “John”)
● Plagiarism detection & log comparison
●​ Machine Translation evaluation (TER)
●​ ASR/OCR post-processing
●​ Bioinformatics (sequence alignment; often with custom costs)
Algorithm (DP Approach)
1. Define the Problem
Let:
●​ s1 = first string of length m
●​ s2 = second string of length n
We create a DP table dp[m+1][n+1] where:
●​ dp[i][j] = minimum edit distance between first i characters of s1 and first j characters of
s2
2. Base Cases
●​ If s1 is empty (i = 0), we need j insertions:​
dp[0][j] = j
●​ If s2 is empty (j = 0), we need i deletions:​
dp[i][0] = i
3. Recursive Relation
For each i and j:
●​ If characters are equal (s1[i-1] == s2[j-1]):​
dp[i][j] = dp[i-1][j-1]
● Else, take the minimum of three possibilities:
1. Insert (add character from s2[j-1]):​
dp[i][j-1] + 1
2. Delete (remove s1[i-1]):​
dp[i-1][j] + 1
3. Replace (change s1[i-1] to s2[j-1]):​
dp[i-1][j-1] + 1
So:​
dp[i][j] = min( dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + 1 )
(With all three edits costing 1 this is the standard Levenshtein distance; some textbooks instead charge 2 for a substitution, treating it as a delete plus an insert.)
4. Final Answer
The result is stored in dp[m][n].
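The full procedure, with unit cost for all three edits (the standard Levenshtein convention), fits in a few lines of Python:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    (each costing 1) to turn s1 into s2."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i            # empty s2: delete all i characters of s1
    for j in range(n + 1):
        dp[0][j] = j            # empty s1: insert all j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # match: no edit needed
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # delete from s1
                                   dp[i][j - 1],      # insert from s2
                                   dp[i - 1][j - 1])  # substitute
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```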

What is a Collocation?
A collocation is a sequence of words that occur together more often than would be expected
by chance.​
In simple terms: Certain words naturally "go together".
For example:
●​ "strong tea" ✅ (natural collocation)
●​ "powerful tea" ❌ (odd, even though strong and powerful are synonyms)
So, collocations represent habitual word pairings in a language.

Types of Collocations
Collocations can be of different types depending on grammatical relation:
1.​ Adjective + Noun
○​ strong tea, heavy rain, fast food
2.​ Noun + Noun
○​ data science, climate change, traffic jam
3.​ Verb + Noun
○​ make a decision, take a risk, catch a cold
4.​ Verb + Preposition
○​ depend on, rely upon, approve of
5.​ Adverb + Adjective
○​ deeply concerned, highly effective, completely wrong
6.​ Verb + Adverb
○​ run quickly, argue strongly, work hard

Importance of Collocations in NLP

Collocations are very important in Natural Language Processing (NLP) because they:

●​ Improve machine translation (Google Translate should not translate “strong tea”
word-by-word).
●​ Help in information retrieval (search engines need to recognize "artificial intelligence"
as one concept).
●​ Enhance speech recognition & text-to-speech (natural sounding output).
●​ Improve sentiment analysis (e.g., “hot topic” ≠ “high temperature”).
●​ Used in word embeddings and language models (to capture contextual meaning).
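One common way to detect collocations like "strong tea" is pointwise mutual information (PMI) over bigram counts: how much more often do two words co-occur than chance would predict? The sketch below uses a toy corpus; real collocation extraction needs millions of tokens:

```python
import math
from collections import Counter

# Toy corpus (illustrative); real work would use a large corpus.
tokens = ("strong tea is good . i drink strong tea daily . "
          "tea is hot . strong coffee too").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1: str, w2: str) -> float:
    """PMI = log2( P(w1 w2) / (P(w1) * P(w2)) ).
    Positive and large when the pair co-occurs more than chance;
    only meaningful when the bigram was actually observed."""
    p_pair = bigrams[(w1, w2)] / (N - 1)
    p1, p2 = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p_pair / (p1 * p2))

print(pmi("strong", "tea"))  # clearly positive: a collocation candidate
```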


What is Morphological Analysis?
Morphology studies how words are built from smaller meaning units called morphemes
(roots/stems, prefixes, suffixes, infixes, clitics).​
Morphological analysis is the NLP task of:
1.​ Segmenting a surface word into morphemes,​

2.​ Recovering the lemma (dictionary base form), and


3.​ Assigning grammatical features (tense, number, case, person, gender, mood, aspect,
etc.).​
The inverse task is morphological generation: produce the correct surface form from a
lemma + feature bundle.​

Why it matters
●​ Normalization & search: “runs, ran, running” → lemma run for better recall.
●​ Parsing & translation: correct features (case, gender, tense) guide syntax and choice in
target language.
●​ Speech & TTS: pronunciation and stress often depend on morphology.
●​ Low-resource & morphologically rich languages: huge gains when tokens pack lots of
grammar (Turkish, Finnish, Arabic, Hindi).

Morphemes
●​ Definition: The smallest meaningful unit of language.
●​ They cannot be divided further without losing or changing their meaning.
●​ Example:
○​ "unhappiness" → un- + happy + -ness
○​ Here:
■​ un- = prefix (meaning "not")
■​ happy = stem/root (main meaning)
■​ -ness = suffix (turns adjective into noun)

Stems
●​ The core meaning-bearing unit of a word.
●​ It carries the main semantic meaning, and affixes attach to it.
●​ Example:
○​ In "played", stem = play
○​ In "unfriendly", stem = friend
Affixes
●​ Definition: Small units (morphemes) that attach to stems to modify meaning or
grammatical function.
●​ Types of affixes:
1.​ Prefix – added before the stem.
■​ Example: un- in unhappy
2.​ Suffix – added after the stem.
■​ Example: -ed in played
3.​ Infix – inserted inside the stem (rare in English, common in other languages).
■​ Example (Tagalog): sulat (write) → sumulat (wrote)
4.​ Circumfix – added around the stem (prefix + suffix together).
■​ Example (German): ge-lieb-t ("loved")
Why Are These Important?
●​ They form the basis of Morphological Analysis (studying word structure).
●​ Used in:
○​ Stemming (cutting to stem)
○​ Lemmatization (finding dictionary form)
○​ POS tagging (deciding if a word is noun/verb etc., often depends on suffix)
○​ Machine Translation & Spell Checking (handling word variations)
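A naive morphological analyzer can be sketched as affix stripping against small, purely illustrative prefix/suffix lists. Note that it segments spelling variants as-is (happi, not happy): undoing spelling changes is the lemmatizer's job, not the segmenter's:

```python
# Illustrative affix lists, not exhaustive.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "ly", "er"]

def segment(word: str) -> list[str]:
    """Strip at most one known prefix and one known suffix,
    keeping a minimum stem length so 'red' is not split as re- + d."""
    parts, stem = [], word
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) > len(p) + 2:
            parts.append(p + "-")
            stem = stem[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) > len(s) + 2:
            suffix = "-" + s
            stem = stem[:-len(s)]
            break
    parts.append(stem)
    if suffix:
        parts.append(suffix)
    return parts

print(segment("unhappiness"))  # ['un-', 'happi', '-ness']
print(segment("played"))       # ['play', '-ed']
```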

Derivational Morphology
Focus: Creates new words by adding affixes (prefixes/suffixes) to a root/stem.​
Changes in meaning and often the part of speech.
Examples:
●​ "happy" → "unhappy" (prefix un- changes meaning)
●​ "teach" → "teacher" (suffix -er changes verb → noun)
●​ "nation" → "national" → "nationalize" → "nationalization"
Each step derives a new word with a distinct meaning.

Inflectional Morphology
Focus: Changes the form of a word to express grammatical categories.​
Does NOT create a new word; only modifies tense, number, gender, case, etc.​
The word stays in the same part of speech.
Examples:
●​ Nouns:
○​ "cat" → "cats" (plural)
○​ "child" → "children" (plural)
●​ Verbs:
○​ "play" → "played" (past tense)
○​ "go" → "going" (progressive form)
●​ Adjectives:
○​ "big" → "bigger" → "biggest"​

Key Differences
| Feature          | Derivational Morphology | Inflectional Morphology   |
|------------------|-------------------------|---------------------------|
| Purpose          | Create new words        | Show grammatical features |
| Changes POS?     | Often yes               | No (POS stays the same)   |
| Meaning          | New meaning             | Same meaning, new form    |
| Examples         | happy → happiness       | walk → walked             |
| Count in English | Many (open set)         | Limited (only 8 inflections in English: plural -s, possessive 's, 3rd-person -s, past -ed, progressive -ing, past participle -en/-ed, comparative -er, superlative -est) |
