INTRODUCTION
Natural Language Processing (NLP) is a subfield of Artificial Intelligence that focuses on
enabling machines to understand, interpret, generate, and respond to human languages. It
bridges the gap between human communication and computer understanding.
Origins and Challenges of NLP
The origins of NLP can be traced to the 1950s, with early efforts in machine translation
and rule-based parsing. Over the decades, it evolved through symbolic methods, statistical
models, and now deep learning. Challenges in NLP include ambiguity, contextual
understanding, world knowledge, sarcasm, multilingualism, and noisy data.
Language Modeling
Language models predict the next word in a sequence or assign probabilities to
sentences.
1. Grammar-Based Language Modeling:
   - Relies on predefined syntactic rules (e.g., Context-Free Grammars).
   - Ensures grammatical correctness but lacks flexibility for real-world data.
2. Statistical Language Modeling:
   - Based on probabilities derived from large corpora.
   - Examples: Unigram, Bigram, Trigram models (see the bigram sketch below).
   - Applications: Speech recognition, text prediction.
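For illustration, a minimal maximum-likelihood bigram model in Python. The toy corpus and function name are hypothetical placeholders; a real model needs a large corpus and smoothing for unseen bigrams.

    from collections import Counter

    def train_bigram_model(sentences):
        """Estimate P(w2 | w1) by maximum likelihood from tokenized sentences."""
        unigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]  # sentence boundary markers
            unigrams.update(padded)
            bigrams.update(zip(padded, padded[1:]))
        # P(w2 | w1) = count(w1, w2) / count(w1)
        return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

    corpus = [["i", "like", "nlp"], ["i", "like", "ai"]]  # toy corpus
    model = train_bigram_model(corpus)
    print(model[("i", "like")])  # 1.0: "like" always follows "i" here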
Regular Expressions
Regular Expressions (regex) are patterns used to match character combinations in text.
They are widely used in text preprocessing, pattern recognition, and lexical analysis.
Examples:
- \d: Matches any digit.
- [a-z]: Matches any lowercase letter.
- ^start: Matches a string starting with "start".
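These patterns can be tried directly with Python's standard re module; the sample text below is arbitrary.

    import re

    text = "Order 66 starts at 9am. start now!"

    print(re.findall(r"\d", text))          # ['6', '6', '9'] - every digit
    print(re.findall(r"[a-z]+", text))      # runs of lowercase letters
    print(bool(re.match(r"^start", text)))  # False - text begins with "Order"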
Finite-State Automata (FSA)
FSA is a computational model used to recognize regular languages. It is composed of
states, transitions, a start state, and accepting states. Two variants exist:
- Deterministic FSA (DFA)
- Non-deterministic FSA (NFA)
FSAs are used for token recognition, lexical analysis, and morphological processing.
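A minimal DFA sketch in Python, recognizing the regular language of binary strings with an even number of 1s; the transition-table encoding is one common choice, not the only one.

    def dfa_accepts(string):
        """DFA over {0, 1} accepting strings with an even number of 1s."""
        transitions = {
            ("even", "0"): "even", ("even", "1"): "odd",
            ("odd",  "0"): "odd",  ("odd",  "1"): "even",
        }
        state = "even"                        # start state
        for symbol in string:                 # symbols outside {0,1} would raise KeyError
            state = transitions[(state, symbol)]
        return state == "even"                # "even" is the only accepting state

    print(dfa_accepts("1011"))  # False: three 1s
    print(dfa_accepts("1001"))  # True: two 1s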
English Morphology
Morphology is the study of the structure of words. In English:
- Inflectional Morphology: modifies tense, number, etc. (e.g., cat → cats)
- Derivational Morphology: creates new words (e.g., teach → teacher)
Morphological analysis is crucial for POS tagging and lemmatization.
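To make the inflectional case concrete, here is a deliberately simplified pluralization sketch; real English morphology includes many irregular forms these rules do not cover.

    def pluralize(noun):
        """Toy inflectional rules for English noun plurals (regular cases only)."""
        if noun.endswith(("s", "x", "z", "ch", "sh")):
            return noun + "es"            # fox -> foxes, church -> churches
        if noun.endswith("y") and noun[-2] not in "aeiou":
            return noun[:-1] + "ies"      # city -> cities
        return noun + "s"                 # cat -> cats

    print([pluralize(n) for n in ["cat", "fox", "city", "boy"]])
    # ['cats', 'foxes', 'cities', 'boys']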
Transducers for Lexicon and Rules
Finite-State Transducers (FSTs) extend FSAs by associating output with transitions. They
map between lexical forms and surface forms. Applications:
- Morphological analysis
- Phonological rules
- Spelling correction
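A hand-rolled toy transducer sketch, assuming a one-word lexicon; production systems typically use FST toolkits such as OpenFST or HFST.

    # Each transition maps (state, input symbol) -> (next state, output string),
    # rewriting the lexical form "cat +N +PL" to the surface form "cats".
    TRANSITIONS = {
        ("start", "cat"): ("stem", "cat"),
        ("stem",  "+N"):  ("noun", ""),    # part-of-speech tag emits nothing
        ("noun",  "+PL"): ("end",  "s"),   # plural feature surfaces as "s"
        ("noun",  "+SG"): ("end",  ""),    # singular feature surfaces as nothing
    }
    ACCEPTING = {"end"}

    def transduce(symbols):
        state, output = "start", []
        for sym in symbols:
            state, out = TRANSITIONS[(state, sym)]
            output.append(out)
        assert state in ACCEPTING, "input rejected"
        return "".join(output)

    print(transduce(["cat", "+N", "+PL"]))  # "cats"
    print(transduce(["cat", "+N", "+SG"]))  # "cat"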
Tokenization
Tokenization is the process of splitting text into tokens (words, punctuation marks, etc.).
Types:
- Word Tokenization: "NLP is fun" → ["NLP", "is", "fun"]
- Sentence Tokenization: Splitting paragraphs into sentences.
Challenges:
- Handling abbreviations, hyphenation, and contractions.
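A regex-based word tokenizer sketch; the pattern is a simplification that keeps contractions together and splits off punctuation, assuming English text.

    import re

    def tokenize(text):
        """Split text into word, number, and punctuation tokens."""
        # letters with an optional contraction suffix | digits | single punctuation
        return re.findall(r"[A-Za-z]+(?:'[a-z]+)?|\d+|[^\w\s]", text)

    print(tokenize("NLP is fun"))             # ['NLP', 'is', 'fun']
    print(tokenize("Don't stop; it's 2024."))
    # ["Don't", 'stop', ';', "it's", '2024', '.']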
Detecting and Correcting Spelling Errors
Spelling errors can be detected and corrected using dictionary lookup, language models,
and context.
- Non-word errors: "speling" → "spelling"
- Real-word errors: "form" vs. "from"
Approaches:
- Edit distance
- Noisy Channel Model
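A compact sketch of dictionary-based non-word correction in the spirit of the noisy channel model. The tiny WORD_FREQS table is a hypothetical stand-in for the language model P(w), and only candidates one edit away are generated.

    import string

    # Hypothetical frequency table standing in for the language model P(w).
    WORD_FREQS = {"spelling": 120, "spewing": 3, "form": 90, "from": 500}

    def edits1(word):
        """All strings one edit (insert, delete, substitute) away from word."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {a + b[1:] for a, b in splits if b}
        subs = {a + c + b[1:] for a, b in splits if b for c in letters}
        inserts = {a + c + b for a, b in splits for c in letters}
        return deletes | subs | inserts

    def correct(word):
        """Pick the in-dictionary candidate with the highest frequency."""
        if word in WORD_FREQS:
            return word
        candidates = edits1(word) & WORD_FREQS.keys()
        return max(candidates, key=WORD_FREQS.get, default=word)

    print(correct("speling"))  # 'spelling': beats 'spewing' on frequency

Note that both "spelling" and "spewing" are one edit from "speling"; the frequency table is what breaks the tie, which is the channel-plus-language-model intuition in miniature.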
Minimum Edit Distance
Minimum Edit Distance (MED) quantifies how dissimilar two strings are by counting the
minimum number of operations (insertions, deletions, substitutions) required to transform
one string into the other. Applications:
- Spell checking
- DNA sequence alignment
- Plagiarism detection
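A standard dynamic-programming implementation with unit costs for all three operations (Levenshtein distance); some textbooks instead charge 2 for substitutions.

    def min_edit_distance(source, target):
        """Levenshtein distance via dynamic programming (unit costs)."""
        m, n = len(source), len(target)
        # dist[i][j] = cost of turning source[:i] into target[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i                  # i deletions
        for j in range(n + 1):
            dist[0][j] = j                  # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + sub)  # substitution / match
        return dist[m][n]

    print(min_edit_distance("intention", "execution"))  # 5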