
Natural Language Processing (NLP) - Comprehensive Notes
Table of Contents
1. Introduction to NLP
2. Components of NLP
3. Steps in NLP Processing
4. Applications of NLP
5. Word Structure and Morphology
6. NLTK and Text Processing
7. Syntax and Parsing
8. Ambiguity in NLP
9. Algorithms and Models
10. Knowledge Bottlenecks
11. Advanced Topics

Introduction to NLP
Natural Language Processing (NLP) is the study and engineering of
computational systems that can analyze, understand, generate, and
interact using human language. It bridges unstructured linguistic signals
—text and speech—and structured machine representations by
combining computational linguistics with statistical and neural machine
learning.

Key Objectives
 Make computers learn, understand, analyze, manipulate, and interpret natural (human) languages
 Enable human-computer interaction through natural language
 Sits at the intersection of Computer Science, Linguistics, and Artificial Intelligence

NLP Pipeline Architecture


Complete Pipeline Flow:

text
Raw Text → Normalization → Tokenization → Linguistic Analysis →
Representations → Model → Output

Detailed Examples for Each Stage:

 Normalization: Lowercasing, Unicode NFKC


 Tokenization: Whitespace splitting, WordPiece
 Linguistic Analysis: POS tagging, dependency parsing
 Representations: TF-IDF, BERT embeddings
 Model: CRF, Transformer
 Output: Labels, summary, translation
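
The first stages can be sketched in a few lines with NLTK (an illustration of the flow above, not a prescribed implementation; the later representation and model stages vary by task):

python
# Minimal pipeline sketch: normalization → tokenization → linguistic analysis
# → a simple bag-of-words representation. Requires the punkt and perceptron
# tagger resources via nltk.download().
from collections import Counter
import nltk

raw = "NLP bridges text and machines. NLP is everywhere!"
normalized = raw.lower()                  # Normalization (lowercasing)
tokens = nltk.word_tokenize(normalized)   # Tokenization
tagged = nltk.pos_tag(tokens)             # Linguistic Analysis (POS tagging)
bow = Counter(tokens)                     # Representation (bag of words)
print(tagged[:3], bow.most_common(2))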

Historical Evolution
1. Rule-based Systems: Grammars, lexicons
2. Statistical NLP: N-grams, HMMs/CRFs, PCFGs
3. Neural NLP: Attention-based Transformers, large-scale pretraining

Components of NLP
1. Natural Language Understanding (NLU)
 Function: Transforms human language into machine-readable
format
 Tasks: Extract keywords, emotions, relations, semantics
 Complexity: Harder than NLG

2. Natural Language Generation (NLG)


 Function: Converts computerized data into natural language
 Components:
 Text planning
 Sentence planning
 Text realization

Steps in NLP Processing


1. Lexical Analysis
 Primary Phase: Scans text as character stream
 Functions:
 Converts characters into meaningful lexemes
 Divides text into paragraphs, sentences, words
 Key Concepts:
 Lexeme: Basic unit of meaning (individual or multiword)
 Examples:
 Individual: "talk" → "talks", "talked", "talking"
 Multiword: "speak up", "pull through"

2. Syntactic Analysis (Parsing)


 Purpose: Check grammar and word arrangements
 Functions:
 Show relationships among words
 Reject grammatically incorrect sentences
 Example: "The school goes to boy" → Rejected by English
syntactic analyzer

3. Semantic Analysis
 Focus: Literal meaning of words, phrases, sentences
 Functions:
 Meaning representation
 Reject semantically incorrect phrases
 Examples:
 Rejects: "hot ice-cream"
 Passes syntax but fails semantics: "Manhattan calls out to
Dave"

4. Discourse Integration
 Context Dependency: Considers preceding and following
sentences
 Example:
 "Manhattan speaks to all its people"
 "It calls out to Dave"
 → "It" refers to Manhattan

5. Pragmatic Analysis
 Purpose: Re-interpret what was said vs. what was meant
 Requirements: Real-world knowledge
 Example: "Manhattan speaks to all its people" → Metaphor for
emotional connection
Applications of NLP
1. Sentiment Analysis (Opinion Mining)
 Purpose: Identify emotional tone (positive, negative, neutral)
 Applications: Customer sentiment, brand reputation
 Data Sources: Emails, reviews, social media, surveys
 Technology: Machine learning text analysis models

2. Machine Translation (MT)


 Challenges:
 No equivalent words across languages
 Multiple word meanings
 Idiom translation
 Solutions: Corpus statistical and neural techniques
 Example: Handling linguistic typology differences

3. Text Extraction
 Information Types: Entity names, locations, quantities
 Industries: Healthcare, finance, e-commerce
 Benefits: Process unstructured data efficiently

4. Text Classification
 Also Known As: Text tagging, text categorization
 Process: Categorize text into organized groups
 Business Value: Automated insights, process automation

5. Speech Recognition (ASR)


 Alternative Names: Computer speech recognition, Speech-to-
Text (STT)
 Applications by Industry:
 Automotive: Voice-activated navigation
 Technology: Virtual assistants (Siri, Alexa, Google Assistant)
 Healthcare: Medical dictation
 Sales: Call center transcription, AI chatbots

6. Chatbots
 Function: Automated conversation systems
 Benefits: 24/7 customer service, quick assistance
 Types: Pre-scripted or AI-generated responses

7. Email Filtering
 Evolution: From simple spam filters to intelligent categorization
 Example: Gmail's categorization (main, social, promotional)

8. Search Autocorrect and Autocomplete


 Functions:
 Suggest probable search keywords
 Correct typing errors
 Improve search accuracy

Word Structure and Morphology


Tokens and Tokenization
 Tokens: Syntactic words with independent roles
 Examples:
 "newspaper" → compound word with derivational structure
 "won't" → "will" + "not" (two syntactic words)

Lexemes and Lemmas


 Lexeme: Set of alternative forms expressing same concept
 Lemma: Citation form of lexeme
 Operations:
 Inflection: Convert word to other forms (mouse → mice)
 Derivation: Transform to morphologically related lexeme
(receive → receiver, reception)

Morphemes
 Definition: Minimal meaningful elements of words
 Types:
 Stems: Core meaning (play, cat, friend)
 Affixes: Modify meaning
 Prefixes: Precede stem (un-)
 Suffixes: Follow stem (-ed, -s, -ly)

Morphological Processes
Inflectional Morphology
 Purpose: Different forms of same word
 Examples:
 cat → cats
 mouse → mice

Derivational Morphology
 Purpose: Create new words from roots
 Examples:
 inter + national = international
 international + ize = internationalize
 internationalize + ation = internationalization

Morphophonemic Changes
 Allomorphs: Alternative forms of morphemes
 Examples: Plural morpheme
 -s in "cats", "dogs"
 -es in "dishes"
 -en in "oxen"

Morphological Typology
1. Isolating/Analytic Languages
 Characteristics: Few morphemes per word
 Examples: Chinese, Vietnamese, Thai, English

2. Synthetic Languages
 Agglutinative: One function per morpheme
 Examples: Korean, Japanese, Finnish, Tamil
 Fusional: Multiple functions per morpheme
 Examples: Arabic, Czech, Latin, German

3. Word Formation Processes


 Concatenative: Morphemes linked sequentially
 Nonlinear: Structural components merge non-sequentially
NLTK and Text Processing
Introduction to NLTK
Natural Language Toolkit - Pedagogical and prototyping Python
library

Key Features:

 Tokenizers, stemmers, lemmatizers


 POS taggers, chunkers, parsers
 Corpus readers, evaluation metrics
 Access to corpora (Gutenberg, Brown, movie_reviews)

Basic NLTK Workflow


python
# Installation and setup (pip install nltk)
import nltk
nltk.download()          # opens the interactive resource downloader
nltk.download("punkt")   # or fetch a specific resource by id

1. Tokenization
Sentence Tokenization
python
from nltk.tokenize import sent_tokenize, word_tokenize

example_string = """
Muad'Dib learned rapidly because his first training was in how to learn.

And the first lesson of all was the basic trust that he could learn.

It's shocking to find how many people do not believe they can learn.
"""

sentences = sent_tokenize(example_string)

Output:

text
["Muad'Dib learned rapidly because his first training was in how to learn.",
'And the first lesson of all was the basic trust that he could learn.',

"It's shocking to find how many people do not believe they can learn."]

Word Tokenization
python
words = word_tokenize(example_string)

Output:

text
["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training',
'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson', ...]

2. Stop Words Filtering


python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires: nltk.download("stopwords")
worf_quote = "Sir, I protest. I am not a merry man!"
words_in_quote = word_tokenize(worf_quote)

stop_words = set(stopwords.words("english"))
filtered_list = []

for word in words_in_quote:
    if word.casefold() not in stop_words:
        filtered_list.append(word)

Results:

 Original: ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man',
'!']
 Filtered: ['Sir', ',', 'protest', '.', 'merry', 'man', '!']

Content vs. Context Words


 Content Words: Information about topics and sentiment
 Context Words: Information about writing style

3. Stemming
python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
string_for_stemming = ("The crew of the USS Discovery discovered many discoveries. "
                       "Discovering is what explorers do.")

words = word_tokenize(string_for_stemming)
stemmed_words = [stemmer.stem(word) for word in words]

Stemming Results:

Original Word    Stemmed Version
'Discovery'      'discoveri'
'discovered'     'discov'
'discoveries'    'discoveri'
'Discovering'    'discov'

4. Part-of-Speech (POS) Tagging


Parts of Speech Categories

Part of Speech   Role                                                      Examples
Noun             Person, place, or thing                                   mountain, bagel, Poland
Pronoun          Replaces a noun                                           you, she, we
Adjective        Describes what a noun is like                             efficient, windy, colorful
Verb             Action or state of being                                  learn, is, go
Adverb           Modifies a verb, adjective, or adverb                     efficiently, always, very
Preposition      Shows the relationship of a noun/pronoun to another word  from, about, at
Conjunction      Connects words or phrases                                 so, because, and
Interjection     Exclamation                                               yay, ow, wow

python
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize

sagan_quote = "If you wish to make an apple pie from scratch, you must first invent
the universe."
words_in_sagan_quote = word_tokenize(sagan_quote)
pos_tags = nltk.pos_tag(words_in_sagan_quote)

Output:

text
[('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'),
('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'), ('from', 'IN'), ('scratch', 'NN'),
(',', ','), ('you', 'PRP'), ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'),
('the', 'DT'), ('universe', 'NN'), ('.', '.')]

5. Lemmatization
python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
string_for_lemmatizing = "The friends of DeSoto love scarves."
words = word_tokenize(string_for_lemmatizing)

lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

Advanced Lemmatization:

python
lemmatizer.lemmatize("worst") # Output: 'worst'
lemmatizer.lemmatize("worst", pos="a") # Output: 'bad'
6. Chunking
Purpose: Identify phrases (groups of words functioning as single units)

Noun Phrase Examples


 "A planet"
 "A tilting planet"
 "A swiftly tilting planet"

python
import nltk
from nltk.tokenize import word_tokenize

quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
tags = nltk.pos_tag(words_quote)

# Define chunk grammar: optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)

Output Tree Structure:

text
(S
  It/PRP
  's/VBZ
  (NP a/DT dangerous/JJ business/NN)
  ,/,
  Frodo/NNP
  ,/,
  going/VBG
  out/RP
  your/PRP$
  (NP door/NN)
  ./.)

7. Chinking
Purpose: Exclude patterns from chunks (opposite of chunking)
python
grammar = """
Chunk: {<.*>+}
}<JJ>{"""

chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)

8. Named Entity Recognition (NER)


python
nltk.download("maxent_ne_chunker")
nltk.download("words")

tree = nltk.ne_chunk(tags)

Output:

text
(S
  It/PRP
  's/VBZ
  a/DT
  dangerous/JJ
  business/NN
  ,/,
  (PERSON Frodo/NNP)
  ,/,
  going/VBG
  out/RP
  your/PRP$
  door/NN
  ./.)

Binary NER (without entity type specification):

python
tree = nltk.ne_chunk(tags, binary=True)
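
To pull the recognized entities out of the tree, a small helper (a sketch of our own, not part of the original notes) can visit the labeled subtrees:

python
# Sketch: collect (entity, label) pairs from an nltk.ne_chunk result.
def extract_entities(tree):
    entities = []
    for subtree in tree:
        # Labeled chunks are nltk.Tree objects; plain tokens are (word, tag) tuples
        if hasattr(subtree, "label"):
            entity = " ".join(word for word, tag in subtree.leaves())
            entities.append((entity, subtree.label()))
    return entities

print(extract_entities(nltk.ne_chunk(tags)))  # [('Frodo', 'PERSON')]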
Syntax and Parsing
Grammar Types in NLP
1. Context-Free Grammar (CFG)
 Structure: Rules for forming well-structured sentences
 Language Patterns: SVO (Subject-Verb-Object), SOV, OSV

2. Constituency Grammar (Phrase Structure


Grammar)
 Focus: Phrase and clause structure
 Examples: NP (Noun Phrase), PP (Prepositional Phrase), VP (Verb
Phrase)

3. Dependency Grammar
 Focus: Grammatical relations between individual words
 Structure: Network of relations rather than recursive structure
 Features: Labeled relations between words

Parsing Process
Definition: Determining syntactic structure of text by analyzing
constituent words based on underlying grammar.

Example Grammar Rules


text
sentence → noun_phrase verb_phrase
noun_phrase → determiner noun
verb_phrase → verb noun_phrase
determiner → 'the', 'a', 'an'
noun → 'Tom', 'apple'
verb → 'ate'
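
These rules can be exercised directly with NLTK's CFG tools (a sketch; encoding the toy grammar in nltk.CFG notation is our own choice):

python
import nltk

# The toy grammar above in nltk.CFG notation
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | 'Tom'
VP -> V NP
Det -> 'the' | 'a' | 'an'
N -> 'apple'
V -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Tom ate an apple".split()):
    print(tree)  # (S (NP Tom) (VP (V ate) (NP (Det an) (N apple))))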

Parse Tree Structure


 Root: sentence
 Non-terminals: noun_phrase, verb_phrase (intermediate nodes)
 Terminals: 'Tom', 'ate', 'an', 'apple' (leaves)

Syntactic Structure Representations


1. Dependency Parsing
 Philosophy: Connect head word with dependents in phrase
 Structure: Directed (asymmetric) connections
 Components:
 Words as vertices
 Directed arcs as binary relations (head to dependent)
 Each word depends on exactly one parent

2. Phrase Structure Parsing


 Approach: Partition sentences into constituents
 Method: Recursive partitioning into phrases (NP, VP, PP)
 Traditional: Derives from sentence diagrams

Constituency Tree vs. Dependency Tree


Dependency structures explicitly
represent:
 Head-dependent relations (directed arcs)
 Functional categories (arc labels)
 Possibly some structural categories (POS)

Phrase structure explicitly represents:


 Phrases (non-terminal nodes)
 Structural categories (non-terminal labels)
 Possibly some functional categories (grammatical functions)

Treebanks: Data-Driven Approach


Definition: Linguistically annotated corpus including syntactic analysis
beyond POS tagging

Key Features:
 Collection of sentences with complete syntax analysis
 Human expert judgment for most plausible analysis
 Consistent treatment across related grammatical phenomena
 No explicit grammar rules provided

Benefits:
1. Solves Grammar Problem: Syntactic analysis directly given
2. Solves Probability Problem: Supervised learning for scoring
functions

Analysis Types:
 Dependency Analysis: Favored for free word order languages
(Czech, Turkish)
 Phrase Structure Analysis: Used for long-distance dependencies
(English, French)

Parsing Algorithms
Key Concepts:
 Derivation: Sequence of steps to derive string from grammar
 Sentential Form: Each line in derivation sequence
 Rightmost Derivation: Expand rightmost nonterminal at each
step

Algorithm Types:
 CKY Parsing: Requires a CNF grammar; fills a triangular chart bottom-up (see the sketch after this list)
 Earley Parsing: Handles arbitrary CFGs with dotted rules
 Neural Parsers: Use encoders for contextual token vectors
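
A minimal CKY recognizer over a toy CNF grammar (a sketch of our own; the rule encoding with two lookup tables is an assumption, not from the notes):

python
# CKY recognizer: binary maps (B, C) -> {A} for rules A -> B C;
# lexical maps each word to the nonterminals that can produce it.
lexical = {"Tom": {"NP"}, "ate": {"V"}, "an": {"Det"}, "apple": {"N"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

def cky_recognize(words):
    n = len(words)
    # chart[i][j] holds the nonterminals that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, set()))
    for span in range(2, n + 1):          # span width, bottom-up
        for i in range(n - span + 1):     # left edge
            j = i + span
            for k in range(i + 1, j):     # split point
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary.get((B, C), set())
    return "S" in chart[0][n]

print(cky_recognize("Tom ate an apple".split()))  # True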

Ambiguity in NLP
Core Challenge: Ambiguity drives the fundamental difficulty in
language understanding

Types of Ambiguity
1. Lexical Ambiguity
 Definition: Word has multiple senses
 Example: "bank" = financial institution vs. river edge
 Resolution: Word Sense Disambiguation using context and world knowledge (see the Lesk sketch after these types)

2. Syntactic Ambiguity
 Definition: Same sequence yields multiple parse trees
 Example: "I saw the man with a telescope"
 Did seeing happen with telescope?
 Did the man have the telescope?
 Resolution: Probabilistic or neural parsers score structures

3. Semantic Ambiguity
 Definition: Sentence-meaning level ambiguity
 Example: "Visiting relatives can be boring"
 Act of visiting them is boring
 Relatives who visit are boring
 Requirements: Selectional preferences and event structure

4. Pragmatic Ambiguity
 Definition: Depends on context, intention, social norms
 Example: "Can you pass the salt?" (request, not ability query)
 Resolution: Speech act recognition

5. Referential/Anaphoric Ambiguity
 Definition: Pronouns/descriptions have multiple candidates
 Example: "Alice told Jane that she would win" (who is "she"?)
 Resolution: Coreference resolution and entity tracking
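
For lexical ambiguity (type 1 above), NLTK ships a simplified Lesk disambiguator; a sketch (requires nltk.download("wordnet")):

python
# Sketch: pick a WordNet sense for "bank" from sentence context using Lesk.
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, "bank", pos="n")
print(sense)  # a WordNet Synset; Lesk is a dictionary-overlap heuristic,
              # so the chosen sense can be unexpected on short contexts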

Mathematical Approaches to Ambiguity


Bayesian Scoring
For syntactic ambiguity: P(T|x) ∝ P(x|T)P(T)

Viterbi Algorithm
Used in HMMs for optimal tag sequence selection
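
A minimal Viterbi decoder (a sketch; the toy probability tables are invented for illustration):

python
# Viterbi: V[t][s] is the probability of the best tag path ending in state s
# at position t; back-pointers recover that path at the end.
def viterbi(words, states, start_p, trans_p, emit_p):
    V = [{s: start_p[s] * emit_p[s].get(words[0], 1e-8) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(words[t], 1e-8)
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.4}, "VERB": {"bark": 0.5}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # ['NOUN', 'VERB']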

Persistent Challenges
 Rare senses
 Long-distance dependencies
 Idioms and sarcasm
 Under-specified references
 Need for external knowledge and discourse modeling

Algorithms and Models


Algorithmic Families by Task and Data
1. Rule-Based Methods
 Components: Finite-state transducers, handcrafted grammars
 Advantages: Interpretability, precision in constrained domains
 Applications: Tokenization, morphology, parsing

2. Statistical Learning
 Models:
 N-gram language models (see the bigram sketch after this list)
 HMMs and CRFs (tagging, segmentation)
 PCFGs (parsing with chart algorithms)
 Decoding: Dynamic programming

3. Neural Models
 Evolution:
 RNNs and LSTMs (sequences)
 CNNs (character/subword features)
 Attention mechanisms (long-range dependencies)
 Transformer architecture (self-attention only)
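
A toy bigram language model with add-one smoothing, illustrating the n-gram family above (corpus and counts invented for the sketch):

python
# Bigram model: P(word | prev) estimated from counts with Laplace smoothing.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the log .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size for the smoothing denominator

def bigram_prob(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(bigram_prob("the", "cat"))  # seen bigram: higher probability
print(bigram_prob("the", "sat"))  # unseen bigram: smoothed, lower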

Knowledge Bottlenecks
The Knowledge Gap
Definition: Gap between what model encodes vs. what robust language
understanding requires

Problem Areas:
 Text alone lacks full commonsense and world knowledge
 Labeled data is costly and uneven across domains/languages
 Distribution shift causes failures when test differs from training

Classic Error Examples:


 Pronoun Resolution: "The trophy doesn't fit in the suitcase
because it is too small" (it = suitcase)
 Temporal Reasoning: Understanding time relationships
 Spatial Relations: Understanding spatial concepts
 Procedural Knowledge: Understanding processes

Strategies to Address Bottlenecks:


1. Pretraining
 Massive corpora for linguistic and factual regularities

2. Retrieval Augmentation
 Ground generation in up-to-date sources

3. Knowledge Integration
 Structured knowledge graphs (Wikidata)
 Differentiable memory systems

4. Weak Supervision
 Data programming to expand coverage

5. Instruction Tuning
 Preference optimization for better task intent following

Advanced Topics
Word-Level Analysis
Text Normalization (Task-Dependent
Decisions):
 Lowercasing: Affects proper nouns
 Unicode Normalization: NFKC harmonizes similar characters (sketched below)
 Punctuation Handling: Differs between IR and sentiment tasks
 Number Handling: Map digits to placeholders
 Contractions: Expansion improves parsing
 Stopword Removal: Reduces noise for bag-of-words, may harm
generation
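
Two of these decisions sketched in code (illustrative defaults; whether to apply each is task-dependent):

python
# Sketch: Unicode NFKC plus case folding as a normalization step.
import unicodedata

def normalize(text, lowercase=True):
    text = unicodedata.normalize("NFKC", text)  # harmonize compatibility chars
    return text.casefold() if lowercase else text

print(normalize("ﬁne Café"))  # 'fine café' (the 'ﬁ' ligature becomes 'fi')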

Stemming vs. Lemmatization:


 Stemming: Aggressive suffix chopping (may over-stem)
 Lemmatization: Uses vocabulary and POS for accurate lemmas

Tokenization Approaches:
 Whitespace → Rule-based → Subword (BPE/WordPiece)
 Trade-offs: OOV handling vs. morphological fidelity

Edit Distance
Levenshtein Distance:
 Counts insertions, deletions, substitutions (unit costs)
 Example: "kitten" → "sitting" = 3 edits
1. k → s (substitute)
2. e → i (substitute)
3. Insert g at end
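
The classic dynamic-programming implementation (a sketch with unit costs):

python
# dp[i][j] = minimum edits to turn a[:i] into b[:j]
def levenshtein(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(b) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # 3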

Damerau-Levenshtein Distance:
 Adds transposition of adjacent characters

Applications:
 Spell correction
 DNA sequence comparison
 String similarity tasks

Spelling Correction
Two-Phase Process:
1. Detection: Is token erroneous?
2. Correction: Which candidate is intended?
Error Types:
 Nonword Errors: Easy detection with lexicon
 Real-word Errors: Require contextual modeling ("peace" vs
"piece")

Candidate Generation Methods:


 Edit neighborhoods (distance 1-2)
 Keyboard adjacency graphs
 Phonetic hashing (Soundex, Metaphone)
 Morphological variants
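
The edit-neighborhood idea in code (a sketch after Norvig's well-known spelling corrector; candidates would then be filtered against a lexicon and ranked by a language model):

python
# Generate every string within edit distance 1 of a word.
import string

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print(len(edits1("peace")))  # a few hundred candidate strings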

POS Tagging Details


Word Classes:
 Open Classes: Nouns, verbs, adjectives, adverbs
 Closed Classes: Prepositions, determiners, conjunctions,
pronouns, particles

Tagset Examples:
 Penn Treebank: NN, NNS, NNP, VB, VBD, VBG, JJ, RB, IN, DT, PRP
 Universal Dependencies: NOUN, VERB, ADJ, ADV, ADP, PRON,
DET
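
NLTK can emit either tagset (a sketch; mapping to the universal tags requires nltk.download("universal_tagset")):

python
import nltk

tokens = nltk.word_tokenize("The quick fox jumps")
print(nltk.pos_tag(tokens))                      # Penn tags: DT, JJ, NN, VBZ
print(nltk.pos_tag(tokens, tagset="universal"))  # DET, ADJ, NOUN, VERB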

Applications:
 Supporting parsing
 Lemmatization (verb vs noun lemmas)
 Downstream tasks like NER

Modern System Patterns


1. Retrieval-Augmented Generation (RAG)
 Dense retrievers fetch knowledge
 Generate grounded responses

2. Tool-Use Agents
 Combine language models with deterministic tools
 Ensure reliable actions
Evaluation Considerations
Task-Specific Metrics:
 Classification/Extraction: Accuracy, F1-score
 Generation: BLEU, ROUGE, BERTScore
 Production: Human preference, task success metrics, latency,
cost

Deployment Considerations:
 Multilingual coverage
 Fairness and bias audits
 Privacy for sensitive data
 Robustness to adversarial prompts/noisy inputs

Summary
This comprehensive guide covers all essential aspects of Natural
Language Processing, from basic concepts and preprocessing techniques
to advanced parsing algorithms and modern neural approaches. The field
continues to evolve rapidly, with current research focusing on large
language models, multimodal systems, and addressing the persistent
challenges of ambiguity, knowledge integration, and robust
understanding across diverse domains and languages.

The integration of statistical methods with neural architectures, combined with massive pretraining and retrieval-augmented approaches, represents the current state of the art, while traditional rule-based and statistical methods remain important for specific applications and for understanding fundamental principles.
