NLP Introduction Notes Anna University

Natural Language Processing (NLP) is a branch of Artificial Intelligence focused on enabling machines to understand human languages, evolving from early machine translation to modern deep learning. Key challenges in NLP include ambiguity and multilingualism, while techniques such as language modeling, regular expressions, and tokenization play crucial roles in processing text. Additionally, methods like Minimum Edit Distance are used for tasks such as spell checking and plagiarism detection.

INTRODUCTION

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that focuses on
enabling machines to understand, interpret, generate, and respond to human languages. It
bridges the gap between human communication and computer understanding.

Origins and Challenges of NLP


The origins of NLP can be traced to the 1950s, with early efforts in machine translation
and rule-based parsing. Over the decades, the field moved from symbolic methods to
statistical models and, most recently, to deep learning. Challenges in NLP include
ambiguity, contextual understanding, world knowledge, sarcasm, multilingualism, and noisy
data.

Language Modeling
Language models predict the next word in a sequence or assign probabilities to
sentences.
1. Grammar-Based Language Modeling:
   - Relies on predefined syntactic rules (e.g., Context-Free Grammars).
   - Ensures grammatical correctness but lacks flexibility for real-world data.
2. Statistical Language Modeling:
   - Based on probabilities derived from large corpora.
   - Examples: Unigram, Bigram, Trigram models.
   - Applications: Speech recognition, text prediction.
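
The statistical approach can be illustrated with a minimal bigram model. This is only a
sketch: the toy corpus, the `<s>`/`</s>` boundary markers, and the function names
(`train_bigram`, `bigram_prob`, `predict_next`) are assumptions made for illustration, and
no smoothing is applied, so unseen bigrams get probability 0.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count bigrams over whitespace-tokenized, lowercased sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> are assumed sentence-boundary markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts


def bigram_prob(counts, prev, curr):
    """Maximum-likelihood estimate of P(curr | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][curr] / total if total else 0.0


def predict_next(counts, prev):
    """Return the most frequent word following prev, or None if prev was never seen."""
    if not counts[prev]:
        return None
    return counts[prev].most_common(1)[0][0]


# Toy corpus, invented for this example.
corpus = ["nlp is fun", "nlp is hard", "nlp is fun to learn"]
model = train_bigram(corpus)
print(bigram_prob(model, "is", "fun"))  # 2 of the 3 occurrences of "is" precede "fun"
print(predict_next(model, "is"))        # fun
```

A unigram model would ignore `prev` entirely, and a trigram model would condition on the
two preceding words instead of one.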

Regular Expressions
Regular Expressions (regex) are patterns used to match character combinations in text.
They are widely used in text preprocessing, pattern recognition, and lexical analysis.
Examples:
- \d: Matches any digit.
- [a-z]: Matches any lowercase letter.
- ^start: Matches a string that starts with "start".
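
The three patterns above can be tried with Python's standard `re` module. The sample
strings are made up for this sketch.

```python
import re

text = "start: room 42, floor 7"

# \d matches any single digit.
digits = re.findall(r"\d", text)

# [a-z] matches any single lowercase ASCII letter.
lowercase = re.findall(r"[a-z]", "NLP is fun")

# ^start matches only when the string begins with "start".
starts_here = bool(re.search(r"^start", text))
starts_elsewhere = bool(re.search(r"^start", "restart"))

print(digits)            # ['4', '2', '7']
print(lowercase)         # ['i', 's', 'f', 'u', 'n']
print(starts_here)       # True
print(starts_elsewhere)  # False
```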

Finite-State Automata (FSA)


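A Finite-State Automaton (FSA) is a computational model used to recognize regular
languages. It is composed of states, transitions, a start state, and accepting states.
There are two kinds:
- Deterministic FSA (DFA): each state has at most one transition per input symbol.
- Non-deterministic FSA (NFA): a state may have several transitions for the same symbol.
FSAs are used for token recognition, lexical analysis, and morphological processing.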
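
As a small illustration of token recognition, the sketch below simulates a DFA that
accepts unsigned integers (one or more ASCII digits). The state names `q0` and `q1`, the
symbol classes, and the function names are assumptions chosen for this example.

```python
# Transition table: (state, symbol class) -> next state.
# q0 is the start state; q1 is the only accepting state.
TRANSITIONS = {
    ("q0", "digit"): "q1",
    ("q1", "digit"): "q1",
}
START_STATE = "q0"
ACCEPTING_STATES = {"q1"}


def classify(ch):
    """Map a character to the symbol class used in the transition table."""
    return "digit" if ch in "0123456789" else "other"


def accepts(s):
    """Run the DFA over s; reject as soon as no transition is defined."""
    state = START_STATE
    for ch in s:
        key = (state, classify(ch))
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING_STATES


print(accepts("2024"))  # True
print(accepts("20a4"))  # False
print(accepts(""))      # False (the start state is not accepting)
```

The same language is described by the regular expression `[0-9]+`, which reflects the
equivalence between regular expressions and finite-state automata.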

English Morphology
Morphology is the study of the structure of words. In English:
- Inflectional Morphology: modifies tense, number, etc. (e.g., cat → cats)
- Derivational Morphology: creates new words (e.g., teach → teacher)
Morphological analysis is crucial for POS tagging and lemmatization.

Transducers for Lexicon and Rules


Finite-State Transducers (FSTs) extend FSAs by associating output with transitions. They
map between lexical forms and surface forms. Applications:
- Morphological analysis
- Phonological rules
- Spelling correction
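
The mapping from lexical to surface forms can be sketched with a toy transducer for
English plurals. Everything here is an assumption made for illustration: the `+PL` tag
notation, the state names, and the simplified rule that inserts "e" only after s, x, or z.
Real lexicons also handle cases such as "ch"/"sh" endings and irregular plurals.

```python
SIBILANTS = set("sxz")  # simplified trigger set for e-insertion


def symbols(lexical):
    """Split a lexical form such as 'fox+PL' into ['f', 'o', 'x', '+PL']."""
    stem, _, tag = lexical.partition("+")
    return list(stem) + (["+" + tag] if tag else [])


def transduce(lexical):
    """Map a lexical form to a surface form, emitting output on every transition."""
    state = "plain"  # states: plain, sibilant, done
    output = []
    for sym in symbols(lexical):
        if state == "done":
            raise ValueError("input continues after the tag")
        if sym == "+PL":
            # The output depends on the current state: 'es' after a sibilant, else 's'.
            output.append("es" if state == "sibilant" else "s")
            state = "done"
        elif sym.startswith("+"):
            raise ValueError(f"unknown tag: {sym}")
        else:
            # Stem letters are copied to the output unchanged.
            output.append(sym)
            state = "sibilant" if sym in SIBILANTS else "plain"
    return "".join(output)


print(transduce("cat+PL"))  # cats
print(transduce("fox+PL"))  # foxes
print(transduce("cat"))     # cat
```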

Tokenization
Tokenization is the process of splitting text into tokens (words, punctuation marks, etc.).
Types:
- Word Tokenization: "NLP is fun" → ["NLP", "is", "fun"]
- Sentence Tokenization: Splitting paragraphs into sentences.
Challenges:
- Handling abbreviations, hyphenations, contractions.
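
Both kinds of tokenization can be approximated with regular expressions. The patterns and
function names below are assumptions chosen for this sketch, not a standard tokenizer. The
word pattern keeps hyphenated words and contractions together, and the sentence splitter
deliberately shows the abbreviation problem.

```python
import re

# A word with optional internal hyphens/apostrophes, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def word_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def sentence_tokenize(text):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(word_tokenize("NLP is fun"))
# ['NLP', 'is', 'fun']
print(word_tokenize("Don't over-think it!"))
# ["Don't", 'over-think', 'it', '!']
print(sentence_tokenize("NLP is fun. Is it hard? Yes!"))
# ['NLP is fun.', 'Is it hard?', 'Yes!']
print(sentence_tokenize("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']  (the abbreviation is wrongly treated as a sentence end)
```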

Detecting and Correcting Spelling Errors


Spelling errors can be detected and corrected using dictionary lookup, language models,
and context.
- Non-word errors: the result is not a valid word (e.g., "speling" → "spelling").
- Real-word errors: the result is a valid but unintended word (e.g., "form" typed for "from").
Approaches:
- Edit distance
- Noisy Channel Model
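
Dictionary lookup combined with edit distance can be sketched as follows. The tiny
vocabulary and the function names are hypothetical, and candidates are limited to words
one insertion, deletion, or substitution away. A noisy channel model would additionally
rank candidates by probability, which this sketch does not do.

```python
import string

# Hypothetical vocabulary; a real system would load a full dictionary.
VOCAB = {"spelling", "from", "form", "fun"}


def edits1(word):
    """All strings one deletion, substitution, or insertion away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutes = {left + c + right[1:] for left, right in splits if right for c in letters}
    inserts = {left + c + right for left, right in splits for c in letters}
    return deletes | substitutes | inserts


def is_nonword(word, vocab=VOCAB):
    """Detect a non-word error by dictionary lookup."""
    return word not in vocab


def suggest(word, vocab=VOCAB):
    """Return the word itself if known, else known words one edit away."""
    if word in vocab:
        return [word]
    return sorted(edits1(word) & vocab)


print(is_nonword("speling"))  # True
print(suggest("speling"))     # ['spelling']
print(suggest("form"))        # ['form']
```

The last call shows why real-word errors need context: "form" is in the dictionary, so
lookup alone cannot tell that "from" was intended.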

Minimum Edit Distance


Minimum Edit Distance (MED) quantifies how dissimilar two strings are by counting the
minimum number of operations (insertions, deletions, substitutions) required to transform
one string into the other. Applications:
- Spell checking
- DNA sequence alignment
- Plagiarism detection
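
MED is commonly computed with dynamic programming. The sketch below assumes a cost of 1
for insertion and deletion and a configurable substitution cost. The function name and the
`sub_cost` parameter are choices made for this example.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum cost to transform source into target with insert/delete/substitute."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of transforming source[:i] into target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[-1][-1]


print(min_edit_distance("speling", "spelling"))                 # 1
print(min_edit_distance("kitten", "sitting"))                   # 3
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```

With `sub_cost=1` this is the Levenshtein distance. Setting `sub_cost=2` treats a
substitution as a deletion followed by an insertion.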
