Natural Language Processin1

The document provides an in-depth overview of Natural Language Processing (NLP), detailing its definition, techniques, and applications across various units. It covers stages in the NLP process such as lexical analysis, syntactic analysis, and semantic analysis, along with examples and discussions on key concepts like Named Entity Recognition and Information Retrieval. Additionally, it highlights the future scope of NLP, including advancements in chatbots and smarter search functionalities.

Uploaded by

shruti Jadhav

Natural Language Processing

Subject Code & Name: BTAIIC701- Natural Language Processing

Unit:1

1 Explain what is Natural Language Processing? Discuss various stages involved in the
NLP Process with suitable examples.

2 Write a short note on

1. HMM.

2. Speech Recognition.

3 Explain ArgMax() Based Computation in NLP.

Illustrate Using (i) Single Class (ii) Multiclass on Classification

Unit:2

1 Show the working of Named Entity Recognition (NER) with an appropriate example.

2 Explain the Finite State Automaton with a suitable example. Differentiate between
DFA and NDFA.

3 What do you mean by Morphology? Explain Derivational and Inflectional
Morphology in detail with examples.

Unit:3

1 What is language model? Write a detailed note on the N-Gram Language Model
and its significance.

2 What are the different parsing techniques? Explain Dependency parsing and
Constituency parsing.

3 What is Parsing? Explain the Top-down and Bottom-up approaches to parsing with a
suitable example.

4 Explain POS tagging with example.

Unit:4
1 What do you mean by word sense disambiguation (WSD)? Discuss dictionary based
approach for WSD.

2 What is Semantic Analysis? Why is semantic analysis difficult? Explain various
approaches to perform semantic analysis.

3 Explain with suitable examples following relationship between word meanings

1. Homonyms and Hyponyms

2. Polysemy

3. Synonyms

4. Antonyms

5. Meronymy

4 Explain Selectional Restrictions. Analyze with examples how Selectional
Restrictions can be used to solve the following NLP problems:

i. Semantic Role Assignments

ii. Syntactic Ambiguity

iii. Word Sense Disambiguation

Unit:5

1 What is Information Retrieval in NLP? Explain Cross Lingual Information Retrieval
(CLIR) in brief.

2 Explain the Architecture of an Information Retrieval System with a neat Diagram.

3 Explain the concept of Textual Entailment in NLP. Apply it to a question answering
application and simplify its working with suitable examples.

4 Write a note on:

1. Sentiment Analysis

2. Machine Translation
[Unit 1] Introduction:

Biology of Speech Processing; Place and Manner of Articulation; Word Boundary Detection;
Argmax based computations; HMM and Speech Recognition.

Natural Language Processing (NLP) –

Natural Language Processing (NLP) is a rapidly evolving field at the intersection of
computer science, artificial intelligence, and linguistics. NLP focuses on the interaction
between computers and human language, enabling machines to understand, interpret, and
generate human language in a way that is both meaningful and useful. With the increasing
volume of text data generated every day, from social media posts to research articles,
NLP has become an essential tool for extracting valuable insights and automating various
tasks.

What is Natural Language Processing?

Natural language processing (NLP) is a field of computer science and a subfield of artificial
intelligence that aims to make computers understand human language. NLP uses
computational linguistics, which is the study of how language works, and various models
based on statistics, machine learning, and deep learning. These technologies allow
computers to analyze and process text or voice data, and to grasp their full meaning,
including the speaker’s or writer’s intentions and emotions.

NLP powers many applications that use language, such as text translation, voice recognition,
text summarization, and chatbots. You may have used some of these applications yourself,
such as voice-operated GPS systems, digital assistants, speech-to-text software, and
customer service bots. NLP also helps businesses improve their efficiency, productivity, and
performance by simplifying complex tasks that involve language.

NLP Techniques

NLP encompasses a wide array of techniques aimed at enabling computers to process
and understand human language. These techniques can be categorized into several broad
areas, each addressing a different aspect of language processing. Here are some of the key
NLP techniques:

1. Text Processing and Preprocessing In NLP

• Tokenization: Dividing text into smaller units, such as words or sentences.

• Stemming and Lemmatization: Reducing words to their base or root forms.

• Stopword Removal: Removing common words (like “and”, “the”, “is”) that may not
carry significant meaning.
• Text Normalization: Standardizing text, including case normalization, removing
punctuation, and correcting spelling errors.

2. Syntax and Parsing In NLP

• Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence
(e.g., noun, verb, adjective).

• Dependency Parsing: Analyzing the grammatical structure of a sentence to identify
relationships between words.

• Constituency Parsing: Breaking down a sentence into its constituent parts or phrases
(e.g., noun phrases, verb phrases).

3. Semantic Analysis

• Named Entity Recognition (NER): Identifying and classifying entities in text, such as
names of people, organizations, locations, dates, etc.

• Word Sense Disambiguation (WSD): Determining which meaning of a word is used in
a given context.

• Coreference Resolution: Identifying when different words refer to the same entity in
a text (e.g., “he” refers to “John”).

4. Information Extraction

• Entity Extraction: Identifying specific entities and their relationships within the text.

• Relation Extraction: Identifying and categorizing the relationships between entities
in a text.

5. Text Classification in NLP

• Sentiment Analysis: Determining the sentiment or emotional tone expressed in a
text (e.g., positive, negative, neutral).

• Topic Modeling: Identifying topics or themes within a large collection of documents.

• Spam Detection: Classifying text as spam or not spam.

6. Language Generation

• Machine Translation: Translating text from one language to another.

• Text Summarization: Producing a concise summary of a larger text.

• Text Generation: Automatically generating coherent and contextually relevant text.

7. Speech Processing

• Speech Recognition: Converting spoken language into text.


• Text-to-Speech (TTS) Synthesis: Converting written text into spoken language.

8. Question Answering

• Retrieval-Based QA: Finding and returning the most relevant text passage in
response to a query.

• Generative QA: Generating an answer based on the information available in a text
corpus.

9. Dialogue Systems

• Chatbots and Virtual Assistants: Enabling systems to engage in conversations with
users, providing responses and performing tasks based on user input.

10. Sentiment and Emotion Analysis in NLP

• Emotion Detection: Identifying and categorizing emotions expressed in text.

• Opinion Mining: Analyzing opinions or reviews to understand public sentiment
toward products, services, or topics.

Working of Natural Language Processing (NLP)

NLP typically works by applying computational techniques to analyze and understand
human language. This covers tasks such as language understanding, language generation,
and language interaction. A typical pipeline has the following steps.

1. Text Input and Data Collection

• Data Collection: Gathering text data from various sources such as websites, books,
social media, or proprietary databases.

• Data Storage: Storing the collected text data in a structured format, such as a
database or a collection of documents.

2. Text Preprocessing

Preprocessing is crucial to clean and prepare the raw text data for analysis. Common
preprocessing steps include:

• Tokenization: Splitting text into smaller units like words or sentences.

• Lowercasing: Converting all text to lowercase to ensure uniformity.

• Stopword Removal: Removing common words that do not contribute significant
meaning, such as “and,” “the,” “is.”

• Punctuation Removal: Removing punctuation marks.

• Stemming and Lemmatization: Reducing words to their base or root forms.
Stemming cuts off suffixes by rule, while lemmatization uses vocabulary and context to
convert words to their dictionary base form (lemma).

• Text Normalization: Standardizing text format, including correcting spelling errors,
expanding contractions, and handling special characters.
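The preprocessing steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the stopword list is a small hypothetical one, the function name preprocess is an assumption, and stemming, lemmatization, and spelling correction are left out.

```python
import re

# Hypothetical, deliberately tiny stopword list; real systems use larger lists.
STOPWORDS = {"and", "the", "is", "are", "a", "an", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()                    # lowercasing
    text = re.sub(r"[^\w\s]", "", text)    # punctuation removal
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The cats are running, and the dog is sleeping."))
# prints ['cats', 'running', 'dog', 'sleeping']
```

The order of the steps matters: lowercasing comes first so that "The" matches the lowercase entry "the" in the stopword list.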

3. Text Representation

• Bag of Words (BoW): Representing text as a collection of words, ignoring grammar
and word order but keeping track of word frequency.

• Term Frequency-Inverse Document Frequency (TF-IDF): A statistic that reflects the
importance of a word in a document relative to a collection of documents.

• Word Embeddings: Using dense vector representations of words where semantically
similar words are closer together in the vector space (e.g., Word2Vec, GloVe).
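As a rough illustration of the first two representations, the sketch below builds Bag of Words counts and a TF-IDF score with the standard library only. The three documents are invented for the example, and the formula used (relative term frequency times the natural log of N divided by document frequency) is one common variant among several, so treat it as an assumption rather than the definitive definition.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the notes.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]

# Bag of Words: one word-count mapping per document (word order is discarded).
bow = [Counter(tokens) for tokens in tokenized]

def tf_idf(term, doc_index):
    """tf = relative frequency in the document; idf = log(N / document frequency)."""
    tokens = tokenized[doc_index]
    tf = bow[doc_index][term] / len(tokens)
    df = sum(1 for b in bow if term in b)
    if df == 0:
        return 0.0
    return tf * math.log(len(tokenized) / df)

print(bow[0]["the"])      # count of "the" in the first document
print(tf_idf("cat", 0))   # appears in one document only
print(tf_idf("the", 0))   # appears in two documents
```

Although "the" occurs twice in the first document and "cat" only once, "cat" receives the higher TF-IDF score because it appears in fewer documents.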

4. Feature Extraction

Extracting meaningful features from the text data that can be used for various NLP tasks.

• N-grams: Capturing sequences of N words to preserve some context and word order.

• Syntactic Features: Using parts of speech tags, syntactic dependencies, and parse
trees.

• Semantic Features: Leveraging word embeddings and other representations to
capture word meaning and context.
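Of the features listed above, N-grams are the simplest to show in code. The helper below is a minimal sketch (the function name ngrams is an assumption) that slides a window of N tokens over a tokenized sentence.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cats are running".split()
print(ngrams(tokens, 2))
# prints [('the', 'cats'), ('cats', 'are'), ('are', 'running')]
```

With n = 1 this reduces to single words, which is why a Bag of Words model is sometimes described as a unigram model.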

5. Model Selection and Training

Selecting and training a machine learning or deep learning model to perform specific NLP
tasks.

• Supervised Learning: Using labeled data to train models like Support Vector
Machines (SVM), Random Forests, or deep learning models like Convolutional Neural
Networks (CNNs) and Recurrent Neural Networks (RNNs).

• Unsupervised Learning: Applying techniques like clustering or topic modeling (e.g.,
Latent Dirichlet Allocation) on unlabeled data.

• Pre-trained Models: Utilizing pre-trained transformer-based language models such as
BERT or GPT that have been trained on large corpora.

6. Model Deployment and Inference

Deploying the trained model and using it to make predictions or extract insights from new
text data.
• Text Classification: Categorizing text into predefined classes (e.g., spam detection,
sentiment analysis).

• Named Entity Recognition (NER): Identifying and classifying entities in the text.

• Machine Translation: Translating text from one language to another.

• Question Answering: Providing answers to questions based on the context provided
by text data.

7. Evaluation and Optimization

Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision,
recall, F1-score, and others.

• Hyperparameter Tuning: Adjusting hyperparameters (settings chosen before training,
such as the learning rate) to improve performance.

• Error Analysis: Analyzing errors to understand model weaknesses and improve
robustness.
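The metrics named above can be computed directly from counts of true positives, false positives, and false negatives. The sketch below does this for a binary spam classifier; the label names and the two label lists are hypothetical examples.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold labels and model predictions.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(precision_recall_f1(y_true, y_pred))
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.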

8. Iteration and Improvement

Continuously improving the algorithm by incorporating new data, refining preprocessing
techniques, experimenting with different models, and optimizing features.

Technologies related to Natural Language Processing

There are a variety of technologies related to natural language processing (NLP) that are
used to analyze and understand human language. Some of the most common include:

1. Machine learning: NLP relies heavily on machine learning techniques such as
supervised and unsupervised learning, deep learning, and reinforcement learning to
train models to understand and generate human language.

2. Natural Language Toolkit (NLTK) and other libraries: NLTK is a popular open-source
library in Python that provides tools for NLP tasks such as tokenization, stemming,
and part-of-speech tagging. Other popular libraries include spaCy, OpenNLP, and
CoreNLP.

3. Parsers: Parsers are used to analyze the syntactic structure of sentences, such as
dependency parsing and constituency parsing.

4. Text-to-Speech (TTS) and Speech-to-Text (STT) systems: TTS systems convert written
text into spoken words, while STT systems convert spoken words into written text.

5. Named Entity Recognition (NER) systems: NER systems identify and extract named
entities such as people, places, and organizations from the text.

6. Sentiment Analysis: A technique for understanding the emotions or opinions expressed
in a piece of text, using lexicon-based, machine learning-based, or deep
learning-based methods.

7. Machine Translation: NLP is used for language translation from one language to
another through a computer.

8. Chatbots: NLP is used for chatbots that communicate with other chatbots or humans
through auditory or textual methods.

9. AI Software: NLP is used in question-answering software for knowledge
representation, analytical reasoning as well as information retrieval.

Applications of Natural Language Processing (NLP)

• Spam Filters: One of the most irritating things about email is spam. Gmail uses
natural language processing (NLP) to discern which emails are legitimate and which
are spam. These spam filters look at the text in all the emails you receive and try to
figure out what it means to see if it’s spam or not.

• Algorithmic Trading: Algorithmic trading systems attempt to predict stock market
conditions. Using NLP, they examine news headlines about companies and stocks and
try to interpret their meaning in order to decide whether to buy, sell, or hold
certain stocks.

• Question Answering: NLP can be seen in action in Google Search or Siri. A major use
of NLP is to make search engines understand the meaning of what we are asking and
generate natural language in return to give us the answers.

• Summarizing Information: The internet holds a great deal of information, much of it
in the form of long documents or articles. NLP is used to interpret the meaning of
this text and produce shorter summaries so that humans can comprehend it more
quickly.

Future Scope

• Bots: Chatbots help clients get to the point quickly by answering inquiries and
referring them to relevant resources and products at any time of day or night. To be
effective, chatbots must be fast, smart, and easy to use. To accomplish this, chatbots
employ NLP to understand language, usually over text or voice-recognition
interactions.

• Supporting Invisible UI: Almost every connection we have with machines involves
human communication, both spoken and written. Amazon’s Echo is only one
illustration of the trend toward putting humans in closer contact with technology in
the future. The concept of an invisible or zero user interface will rely on direct
communication between the user and the machine, whether by voice, text, or a
combination of the two. NLP helps to make this concept a real-world thing.

• Smarter Search: NLP’s future also includes improved search. Smarter search lets a
chatbot understand a customer’s request and enables “search like you talk”
functionality (much like querying Siri) rather than focusing on keywords or topics.
Google has added NLP capabilities to Google Drive, allowing users to search for
documents and content using natural language.

1. Explain what Natural Language Processing is. Discuss the various stages involved in the
NLP process with suitable examples.

Natural language processing (NLP) is a field of computer science and a subfield of artificial
intelligence that aims to make computers understand human language. NLP uses
computational linguistics, which is the study of how language works, and various models
based on statistics, machine learning, and deep learning. These technologies allow
computers to analyze and process text or voice data, and to grasp their full meaning,
including the speaker’s or writer’s intentions and emotions.

NLP powers many applications that use language, such as text translation, voice recognition,
text summarization, and chatbots. You may have used some of these applications yourself,
such as voice-operated GPS systems, digital assistants, speech-to-text software, and
customer service bots. NLP also helps businesses improve their efficiency, productivity, and
performance by simplifying complex tasks that involve language.

The NLP process is commonly described in five stages:

1. Lexical Analysis

Lexical analysis involves breaking down the text into its basic components, typically words or
tokens, and analyzing their structure. This stage includes processes like tokenization
(splitting the text into words), and morphological analysis (identifying root words and their
grammatical forms).

• Example:

o Input Sentence: "The cats are running."

o Tokens: ["The", "cats", "are", "running", "."]

o Morphological Analysis:

▪ "cats" → root word "cat" + "s" (plural marker)

▪ "running" → root word "run" + "ing" (present participle; together with
"are" it forms the present continuous tense)
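The lexical analysis example above can be reproduced with a short sketch. The tokenizer is a simple regular expression, and the MORPHOLOGY table is a hypothetical two-entry lookup standing in for a real morphological analyzer, which would use rules or a lexicon rather than a hand-written dictionary.

```python
import re

def tokenize(sentence):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical lookup table standing in for a real morphological analyzer.
MORPHOLOGY = {
    "cats": ("cat", "-s"),
    "running": ("run", "-ing"),
}

def analyze(token):
    """Return (root, suffix); unknown tokens come back unchanged with no suffix."""
    return MORPHOLOGY.get(token.lower(), (token, None))

tokens = tokenize("The cats are running.")
print(tokens)
# prints ['The', 'cats', 'are', 'running', '.']
print([analyze(t) for t in tokens])
```

Note that "running" cannot be handled by simply chopping off "ing" (that would leave "runn"), which is one reason real analyzers need spelling rules or a dictionary.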

2. Syntactic Analysis (Parsing)

Syntactic analysis, or parsing, examines the grammatical structure of a sentence. It
determines how words are related to each other based on the rules of grammar. This stage
involves building a parse tree or syntax tree that represents the structure of the sentence,
showing how words and phrases are organized.

• Example:

o Input Sentence: "The cats are running."

o Parse Tree: A hierarchical tree structure that connects:

▪ "The" + "cats" as a noun phrase (NP)

▪ "are running" as a verb phrase (VP)

▪ The entire sentence (S) as NP + VP, a subject followed by its verb (there is no object)

3. Semantic Analysis

Semantic analysis focuses on understanding the meaning of the words and sentences. It
resolves ambiguities (e.g., word sense disambiguation) and ensures that the meaning
derived from the sentence is coherent. This stage includes processes like identifying the
correct sense of a word based on context and establishing relationships between entities.

• Example:

o Input Sentence: "The bank will close soon."

o Semantic Analysis:

▪ Resolves the word "bank" to mean a financial institution rather than
the side of a river, based on the context of the sentence: the verb
"close" fits a business with opening hours.

4. Discourse Analysis

Discourse analysis looks at the larger context beyond individual sentences, considering how
sentences interact with each other to create meaning in a conversation or text. It involves
understanding pronoun references, sentence cohesion, and the overall flow of information
across sentences or paragraphs.

• Example:

o Input Text: "John went to the bank. He withdrew some money."

o Discourse Analysis:

▪ Identifies "He" as referring to "John."

▪ Connects the two sentences logically, recognizing that "withdrew
some money" is an action performed by John at the bank.

5. Pragmatic Analysis

Pragmatic analysis interprets the intended meaning behind the text by considering the
context in which it was said, including the speaker's intentions, the relationship between the
speakers, and external factors. This stage is crucial for understanding implied meanings,
politeness, and indirect communication.

• Example:

o Input Sentence: "Can you pass the salt?"

o Pragmatic Analysis:

▪ Interprets the sentence as a polite request for the listener to pass the
salt, rather than a literal question about the listener's ability to pass it.
This understanding comes from the context of a typical dining
scenario.

Explain Argmax() based computation in NLP. Illustrate using 1) Single class 2) Multiclass
classification.

What is Argmax?

Argmax is a function that returns the index of the maximum value in a list or array. In the
context of classification tasks in NLP (Natural Language Processing), argmax helps determine
which class or category has the highest score or probability, thereby making the final
decision about the class of an input.

Argmax in Single-Class Classification

In single-class classification (more commonly called binary classification), there are two
possible outcomes: one class of interest and its complement. The argmax function decides
between these two classes based on the scores or probabilities provided by a model.

Example: Sentiment Analysis

• Task: Classify a sentence as either "Positive" or "Negative."

• Model Output: The model outputs a score or probability for each class.

Scenario:

• Input Sentence: "I really enjoyed the movie."


• Model Scores: [0.3 (Negative), 0.7 (Positive)]

Here’s the detailed process:

1. Model Scores: The model generates scores or probabilities for each class. In this case,
the scores are [0.3, 0.7], where 0.3 is the score for "Negative" and 0.7 is the score for
"Positive."

2. Apply Argmax:

o The argmax function looks at the list of scores and finds the index of the
highest score. In this case, it compares 0.3 and 0.7.

o The highest score is 0.7, which is at index 1 (with 0-based indexing, index 0 is
"Negative" and index 1 is "Positive").

3. Class Decision:

o Argmax returns index 1, which corresponds to the "Positive" class.

o Therefore, the sentence "I really enjoyed the movie" is classified as "Positive."

Why This Works: The argmax function is used because it effectively selects the most
probable class based on the model's output. For binary classification, this means choosing
between the two available classes.
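The two-class decision above can be written as a short sketch. The argmax helper is a plain-Python assumption (libraries such as NumPy provide their own), and the scores are the ones from the scenario.

```python
labels = ["Negative", "Positive"]
scores = [0.3, 0.7]  # model scores from the scenario above

def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

predicted = labels[argmax(scores)]

# When the two scores are probabilities that sum to 1, argmax makes the same
# decision as checking whether the positive-class probability exceeds 0.5.
thresholded = "Positive" if scores[1] > 0.5 else "Negative"

print(predicted, thresholded)
# prints Positive Positive
```

This equivalence is why binary classifiers are often described in terms of a 0.5 threshold instead of an explicit argmax.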

Argmax in Multi-Class Classification

In multi-class classification, we have more than two classes. The argmax function is used to
choose the class with the highest score from among all possible classes.

Example: Topic Classification

• Task: Classify a news article into one of several categories: Sports, Politics, or
Technology.

• Model Output: The model provides a probability or score for each class.

Scenario:

• Input Article: "The latest smartphone model has advanced features."

• Model Scores: [0.2 (Sports), 0.1 (Politics), 0.7 (Technology)]

Here’s the detailed process:

1. Model Scores: The model outputs scores for each category. Here, the scores are [0.2,
0.1, 0.7], where 0.2 is the score for "Sports," 0.1 for "Politics," and 0.7 for
"Technology."

2. Apply Argmax:

o The argmax function compares all the scores: 0.2, 0.1, and 0.7.

o It identifies that the highest score is 0.7, which is at index 2 (with 0-based
indexing, index 0 is "Sports," index 1 is "Politics," and index 2 is "Technology").

3. Class Decision:

o Argmax returns index 2, which corresponds to the "Technology" class.

o Thus, the article is classified as "Technology."

Why This Works: In multi-class classification, argmax is used to determine which class has
the highest probability or score. It simplifies decision-making by picking the most likely
category based on the model's output.
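In formula form the decision is: predicted class = argmax over classes c of P(c | x). The sketch below applies this to the three-class example. Many models output raw scores (logits) rather than probabilities, so the sketch also shows a softmax step; the logit values are hypothetical, and the helper names are assumptions.

```python
import math

labels = ["Sports", "Politics", "Technology"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.0, 0.3, 2.2]                   # hypothetical raw scores, not from the text
probs = softmax(logits)

# Softmax preserves the ordering of the scores, so the argmax is unchanged.
assert argmax(logits) == argmax(probs)
print(labels[argmax(probs)])

# The probabilities quoted in the text lead to the same decision.
print(labels[argmax([0.2, 0.1, 0.7])])
```

Both print statements output Technology. Because softmax does not change which score is largest, the class can be picked from the raw scores directly when only the label is needed.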

Summary

• Single-Class Classification: Argmax is used to pick the class with the highest score
between two classes. For instance, deciding if a sentence is "Positive" or "Negative"
based on the sentiment scores.

• Multi-Class Classification: Argmax selects the class with the highest score among
many possible classes. For example, categorizing a news article into "Sports,"
"Politics," or "Technology" based on the highest probability score.

In both cases, argmax helps in making a final decision by selecting the most probable class
based on the model's output. It ensures that the class with the highest likelihood or score is
chosen, thereby enabling effective classification in NLP tasks.

3. Write a short note on

1. HMM.

2. Speech Recognition.

• HMM: One of the most important tools in NLP is the Hidden Markov Model (HMM).

• It is a statistical model used to analyze sequential data, such as language.

• It is particularly useful for tasks like

➢ speech recognition,

➢ machine translation, and

➢ text analysis.

A Hidden Markov Model describes the probabilistic relationship between a sequence of
observations and a sequence of hidden states.

It is used when the underlying process that generates the observations cannot be
observed directly, hence the name “Hidden Markov Model.”

Markov Chain :

• A Markov chain is a mathematical model that represents a process where the system
transitions from one state to another.

• The transition assumes that the probability of moving to the next state is solely
dependent on the current state.

[Figure: a Markov chain whose states are ‘a’, ‘p’, ‘i’, ‘t’, ‘e’, and ‘h’, with transition
probabilities on the edges]

• In the figure, ‘a’, ‘p’, ‘i’, ‘t’, ‘e’, and ‘h’ represent the states, while the numbers
on the edges indicate the probability of transitioning from one state to another.

• For example, the probabilities of transitioning from state ‘t’ to states ‘i’, ‘a’, and
‘h’ are 0.3, 0.3, and 0.4, respectively.

• The start state is a special state that represents the initial state of the process, such
as the start of a sentence.

• Markov processes are commonly used to model sequential data, like text and
speech.

• For instance, if you want to build an application that predicts the next word in a
sentence, you can represent each word in a sentence as a state.

• The transition probabilities can be learned from a corpus and represent the
probability of moving from the current word to the next word.

• For example, the transition probability from the state ‘San’ to ‘Francisco’ will be
higher than the probability of transitioning from the state ‘San’ to ‘Delhi’.

• Example: If it rains today, then there is a 40% chance it will rain tomorrow and a
60% chance of no rain.

• If it doesn’t rain today, then there is a 20% chance it will rain tomorrow and an 80%
chance of no rain.
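The rain example can be written as a two-state Markov chain. The sketch below stores the transition probabilities from the text in a nested dictionary and advances a probability distribution one day at a time; the state names and the step function are illustrative assumptions.

```python
# Transition probabilities taken from the rain example in the text.
P = {
    "rain":    {"rain": 0.4, "no_rain": 0.6},
    "no_rain": {"rain": 0.2, "no_rain": 0.8},
}

def step(dist):
    """Advance a probability distribution over states by one day."""
    return {
        nxt: sum(dist[cur] * P[cur][nxt] for cur in P)
        for nxt in P
    }

today = {"rain": 1.0, "no_rain": 0.0}   # it is raining today
tomorrow = step(today)
day_after = step(tomorrow)
print(tomorrow)
print(day_after)
```

Given rain today, the chance of rain the day after tomorrow is 0.4 × 0.4 + 0.6 × 0.2 = 0.28. Each step uses only the current distribution, which is exactly the Markov assumption stated above.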

Advantages of the Hidden Markov Model

• One of the advantages of HMM is its ability to learn from data.

• HMM can be trained on large datasets to learn the probabilities of certain events
occurring in certain states.

• For example, an HMM can be trained on a corpus of sentences to learn the probability
of a verb following a noun or an adjective.
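One simple way to learn such probabilities, when the training sentences are already tagged, is to count how often each tag follows another. The sketch below does this for a tiny hand-made corpus; the sentences, the tag names, and the function name are all hypothetical, and a full HMM would also estimate emission probabilities (word given tag) and usually apply smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences as (word, tag) pairs; not from the source text.
tagged_sentences = [
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("cats", "NOUN"), ("sleep", "VERB")],
    [("happy", "ADJ"), ("dogs", "NOUN"), ("run", "VERB")],
    [("birds", "NOUN"), ("and", "CONJ"), ("cats", "NOUN"), ("play", "VERB")],
]

# Count how often each tag is followed by each other tag.
transition_counts = defaultdict(Counter)
for sentence in tagged_sentences:
    tags = [tag for _, tag in sentence]
    for prev, nxt in zip(tags, tags[1:]):
        transition_counts[prev][nxt] += 1

def transition_prob(prev, nxt):
    """Relative-frequency estimate of P(next tag | previous tag)."""
    total = sum(transition_counts[prev].values())
    return transition_counts[prev][nxt] / total if total else 0.0

print(transition_prob("NOUN", "VERB"))
print(transition_prob("ADJ", "NOUN"))
```

In this toy corpus a noun is followed by a verb in 4 of its 5 transitions, so the estimate for a verb following a noun is 0.8.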

Speech Recognition: Speech recognition, also known as automatic speech recognition (ASR),
computer speech recognition, or speech-to-text, enables computers to understand and
interpret human speech. It involves converting spoken language into text or executing
commands based on the recognized words. This technology relies on sophisticated
algorithms and machine learning models to process human speech in real time, despite
variations in accent, pitch, speed, and slang.

Key Features of Speech Recognition

• Accuracy and Speed: Modern systems can process speech in real time or near real time,
providing quick responses to user inputs.

• Natural Language Understanding (NLU): NLU enables systems to handle complex
commands and queries, making technology more intuitive and user-friendly.

• Multi-Language Support: Support for multiple languages and dialects allows
users from different linguistic backgrounds to interact with technology in their native
language.

• Background Noise Handling: Filtering out background noise is crucial for
voice-activated systems used in public or outdoor settings.

Speech Recognition Algorithms


Speech recognition technology relies on complex algorithms to translate spoken language
into text or commands that computers can understand and act upon. Here are the
algorithms and approaches used in speech recognition:

1. Hidden Markov Models (HMM)

Hidden Markov Models have been the backbone of speech recognition for many years. They
model speech as a sequence of states, with each state representing a phoneme (basic unit of
sound) or group of phonemes. HMMs are used to estimate the probability of a given
sequence of sounds, making it possible to determine the most likely words spoken.

Usage: Although newer methods have surpassed HMMs in performance, the HMM remains a
fundamental concept in speech recognition and is often used in combination with other
techniques.
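Finding the most likely sequence of hidden states for a sequence of observations is usually done with the Viterbi algorithm. The sketch below is a toy version: instead of phonemes and audio features it reuses the rain and no-rain transition probabilities from the Markov chain example in these notes, while the start probabilities, the umbrella observations, and the emission probabilities are assumptions made up for illustration.

```python
states = ["rain", "no_rain"]
start = {"rain": 0.5, "no_rain": 0.5}               # assumed start probabilities
trans = {"rain":    {"rain": 0.4, "no_rain": 0.6},  # from the rain example
         "no_rain": {"rain": 0.2, "no_rain": 0.8}}
emit = {"rain":    {"umbrella": 0.9, "none": 0.1},  # assumed emission probabilities
        "no_rain": {"umbrella": 0.2, "none": 0.8}}

def viterbi(observations):
    """Return the most likely hidden-state path and its probability."""
    # best[s] = probability of the best path that ends in state s
    best = {s: start[s] * emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in states:
            # choose the predecessor that maximizes the path probability into s
            prev = max(states, key=lambda p: prev_best[p] * trans[p][s])
            best[s] = prev_best[prev] * trans[prev][s] * emit[s][obs]
            pointers[s] = prev
        backpointers.append(pointers)
    # trace back from the best final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

path, prob = viterbi(["umbrella", "umbrella", "none"])
print(path)
# prints ['rain', 'rain', 'no_rain']
```

Note that the max inside the loop is itself an argmax over predecessor states, so Viterbi is an argmax-based computation applied to whole sequences. Practical implementations work with log probabilities to avoid numerical underflow on long sequences.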

2. Natural language processing (NLP)

NLP is the area of artificial intelligence that focuses on the interaction between humans
and machines through language, both speech and text. Many mobile devices incorporate
speech recognition into their systems to conduct voice search (for example, Siri) or to
make texting more accessible.

Challenges in Speech Recognition:

• Accents and Dialects: Variability in pronunciation can affect recognition accuracy.
Models need to be trained on diverse datasets to handle different accents and
dialects effectively.

• Background Noise: High levels of ambient noise can interfere with accurate speech
recognition, requiring advanced noise-cancellation techniques.

• Contextual Understanding: Recognizing homophones and understanding context is
crucial for accurate transcription and interpretation.

Explain in brief Places of Articulation

Places of articulation are the specific areas in the mouth or throat where speech sounds
are produced.

1. Bilabial:

o Where: Both lips.

o How: Lips come together.

o Examples: /p/ (pat), /b/ (bat).

2. Labiodental:

o Where: Lower lip and upper teeth.

o How: Lower lip touches upper teeth.

o Examples: /f/ (fun), /v/ (van).

3. Dental:

o Where: Tongue and upper teeth.

o How: Tongue touches upper teeth.

o Examples: /θ/ (think), /ð/ (this).

4. Alveolar:

o Where: Tongue and the ridge just behind the upper front teeth.

o How: Tongue touches or is close to the alveolar ridge.

o Examples: /t/ (top), /d/ (dog).

5. Post-Alveolar:

o Where: Tongue and the area just behind the alveolar ridge.

o How: Tongue is placed slightly further back.

o Examples: /ʃ/ (she), /ʒ/ (measure).

6. Palatal:

o Where: Tongue and the hard part of the roof of the mouth.

o How: Tongue raises to the hard palate.

o Examples: /j/ (yes).

7. Velar:

o Where: Back of the tongue and the soft part of the roof of the mouth.
o How: Back of the tongue touches the velum.

o Examples: /k/ (cat), /g/ (go).

8. Glottal:

o Where: Vocal cords.

o How: Constriction at the vocal cords.

o Examples: /h/ (hat), the glottal stop /ʔ/ (uh-oh).

Places of articulation

Speech sounds are classified according to their place of articulation and manner of
articulation. The eight places of articulation covered here are:

• Bilabial: contact between the lips;

• Labio-dental: contact between the lower lip and the upper teeth;

• Dental: contact between the tip of the tongue and the upper teeth;

• Alveolar: contact between the tongue and the alveolar ridge (this is the ridged area
between the upper teeth and the hard palate);

• Post-alveolar: contact between the tongue and the back of the alveolar ridge;

• Palatal: contact between the tongue and the hard palate;

• Velar: contact between the tongue and the soft palate;

• Glottal: restriction of the airflow at the glottis.


[Unit 2] Word level Analysis:

Morphology fundamentals; Morphological Diversity of Indian Languages; Morphology
Paradigms; Finite State Machine Based Morphology; Automatic Morphology Learning;
Shallow Parsing; Named Entities; Maximum Entropy Models; Random Fields.

Introduction to Morphological Analysis

Morphology is the branch of linguistics concerned with the structure and form of words in a
language. Morphological analysis, in the context of NLP, refers to the computational
processing of word structures. It aims to break down words into their constituent parts, such
as roots, prefixes, and suffixes, and understand their roles and meanings. This process is
essential for various NLP tasks, including language modeling, text analysis, and machine
translation.

Importance of Morphological Analysis

Morphological analysis is a critical step in NLP for several reasons:

1. Understanding Word Formation: It helps in identifying the basic building blocks of
words, which is crucial for language comprehension.

2. Improving Text Analysis: By breaking down words into their roots and affixes, it
enhances the accuracy of text analysis tasks like sentiment analysis and topic
modeling.

3. Enhancing Language Models: Morphological analysis provides detailed insights into
word formation, improving the performance of language models used in tasks like
speech recognition and text generation.

4. Facilitating Multilingual Processing: It aids in handling the morphological diversity of
different languages, making NLP systems more robust and versatile.

Key Techniques used in Morphological Analysis for NLP Tasks

Morphological analysis involves breaking down words into their constituent morphemes (the
smallest units of meaning) and understanding their structure and formation. Various
techniques can be employed to perform morphological analysis, each with its own strengths
and applications.

Here are some of the key techniques used in morphological analysis:

1. Stemming

Stemming reduces words to their base or root form, usually by removing suffixes. The
resulting stems are not necessarily valid words but are useful for text normalization.

Common stemming algorithms available in Python (for example, in NLTK):

• Porter Stemmer: One of the most popular stemming algorithms, known for its
simplicity and efficiency.

• Snowball Stemmer: An improvement over the Porter Stemmer, supporting multiple
languages.

• Lancaster Stemmer: A more aggressive stemming algorithm, often resulting in
shorter stems.
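To make the idea of suffix removal concrete, here is a minimal sketch of a toy stemmer. It is not the Porter, Snowball, or Lancaster algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are assumptions chosen only for illustration.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The suffix list and the minimum stem length are assumed values.
SUFFIXES = ("ing", "ies", "ed", "es", "s")  # checked in this order


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if at least min_stem_len letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["running", "studies", "played", "cats", "bus"]:
    print(w, "->", toy_stem(w))
```

Note that "running" becomes "runn" and "studies" becomes "stud": the stems are useful for grouping related words, but they are not necessarily valid words.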

2. Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma). It considers the
context and part of speech, producing valid words. In Python, NLTK's WordNetLemmatizer
is commonly used; it leverages the WordNet lexical database to find the base form of
words.

3. Morphological Parsing

Morphological parsing involves analyzing the structure of words to identify their morphemes
(roots, prefixes, suffixes). It requires knowledge of morphological rules and patterns.
Finite-State Transducers (FSTs) are commonly used as a tool for morphological parsing.

Finite-State Transducers (FSTs)

FSTs are computational models used to represent and analyze the morphological structure of
words. They consist of states and transitions, capturing the rules of word formation.

Applications:

• Morphological Analysis: Parsing words into their morphemes.

• Morphological Generation: Generating word forms from morphemes.

4. Neural Network Models

Neural network models, especially deep learning models, can be trained to perform
morphological analysis by learning patterns from large datasets.

Types of Neural Network

• Recurrent Neural Networks (RNNs): Useful for sequential data like text.

• Convolutional Neural Networks (CNNs): Can capture local patterns in the text.

• Transformers: Advanced models like BERT and GPT that understand context and
semantics.

5. Rule-Based Methods

Rule-based methods rely on manually defined linguistic rules for morphological analysis.
These rules can handle specific language patterns and exceptions.
Applications:

• Affix Stripping: Removing known prefixes and suffixes to find the root form.

• Inflectional Analysis: Identifying grammatical variations like tense, number, and case.

6. Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs) are probabilistic models that can be used to analyze
sequences of data, such as morphemes in words. HMMs consist of a set of hidden states,
each representing a possible state of the system, and observable outputs generated from
these states. In the context of morphological analysis, HMMs can be used to model the
probabilistic relationships between sequences of morphemes, helping to predict the most
likely sequence of morphemes for a given word.

Components of Hidden Markov Models (HMMs):

• States: Represent different parts of words (e.g., prefixes, roots, suffixes).

• Observations: The actual characters or morphemes in the words.

• Transition Probabilities: Probabilities of moving from one state to another.

• Emission Probabilities: Probabilities of an observable output being generated from a
state.

Applications:

• Morphological Segmentation: Breaking words into morphemes.

• Part-of-Speech Tagging: Assigning parts of speech to each word in a sentence.

• Sequence Prediction: Predicting the most likely sequence of morphemes for a given
word.
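The components listed above can be sketched as a tiny HMM decoded with the Viterbi algorithm. This is only an illustration: the three states, the morpheme inventory, and every probability below are hand-picked assumptions, not values learned from data.

```python
# Minimal Viterbi decoder over a toy HMM for morpheme labeling.
# All states and probabilities are assumed values for illustration.
states = ["PREFIX", "ROOT", "SUFFIX"]
start_p = {"PREFIX": 0.4, "ROOT": 0.6, "SUFFIX": 0.0}
trans_p = {
    "PREFIX": {"PREFIX": 0.1, "ROOT": 0.9, "SUFFIX": 0.0},
    "ROOT": {"PREFIX": 0.0, "ROOT": 0.2, "SUFFIX": 0.8},
    "SUFFIX": {"PREFIX": 0.0, "ROOT": 0.0, "SUFFIX": 1.0},
}
emit_p = {
    "PREFIX": {"un": 0.6, "re": 0.4},
    "ROOT": {"happi": 0.5, "play": 0.4, "un": 0.1},
    "SUFFIX": {"ness": 0.5, "ed": 0.5},
}


def viterbi(obs):
    """Return the most likely state sequence for a list of observed morphemes."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), None) for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            column[s] = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(obs[t], 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path))


print(viterbi(["un", "happi", "ness"]))
print(viterbi(["play", "ed"]))
```

In this sketch the observations are already split into morphemes; a real segmenter would typically work on characters and learn the probabilities from annotated data.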

Morphology refers to the study of the structure and formation of words in a language. It
deals with how words are formed from smaller units called morphemes, which are the
smallest meaningful units of language. Morphology looks at how these morphemes combine
to form words and the processes that govern word construction.

There are two main types of morphology:

1. Derivational Morphology

2. Inflectional Morphology

1. Derivational Morphology

Derivational morphology involves the process of creating new words by adding prefixes,
suffixes, or infixes to existing words (roots or stems). These changes often result in a new
word with a different meaning or grammatical category. Derivational morphemes change the
meaning or function of the base word.

Examples:

• Prefixation: Adding a prefix to a base word.

o Unhappy → un- (prefix) + happy (root) = not happy.

o Preheat → pre- (prefix) + heat (root) = to heat something in advance.

• Suffixation: Adding a suffix to a base word.

o Happy → happy (root) + -ness (suffix) = happiness (a noun from an adjective).

o Teach → teach (root) + -er (suffix) = teacher (a noun from a verb).

• Changing part of speech: Sometimes, derivational affixes change the grammatical
category of the word.

o Beauty (noun) → beautiful (adjective) (via the suffix -ful).

o Friend (noun) → befriend (verb) (via the prefix be-).

In derivational morphology, the affixes can significantly change the meaning of the root
word and may also change its grammatical class (from noun to verb, adjective to noun, etc.).

2. Inflectional Morphology

Inflectional morphology deals with the modification of a word to express different
grammatical features such as tense, number, gender, case, person, or comparison.
Inflectional affixes do not change the core meaning of the word but modify it according to its
grammatical context. The word remains within the same grammatical category.

Examples:

• Tense: Changes the verb form to indicate time.

o Walk → walked (past tense).

o Eat → eating (present participle, used in continuous tenses).

• Number: Modifies the noun to indicate singular or plural.

o Cat → cats (plural form).

o Child → children (irregular plural form).

• Person: Changes the verb form according to the subject.

o I walk → He walks (3rd person singular).

o You walk → They walk (3rd person plural; the verb form is unchanged).


• Comparison: Modifies adjectives or adverbs to show different degrees of
comparison.

o Big → bigger (comparative).

o Fast → fastest (superlative).

Unlike derivational morphemes, inflectional morphemes do not change the word’s category
or core meaning but provide additional grammatical information.

Summary of Differences:

• Derivational Morphemes:

o Create new words with different meanings.

o Often change the grammatical category (e.g., verb to noun, adjective to
adverb).

o Example: happy → happiness (noun).

• Inflectional Morphemes:

o Modify a word to express different grammatical features.

o Do not change the core meaning or grammatical category.

o Example: walk → walks (3rd person singular).

1) Morphology Paradigms

Morphology paradigms refer to the set of forms a word can take based on its grammatical
features, such as tense, case, gender, number, etc. A paradigm is essentially a model or a
pattern that represents all the possible forms of a particular word, given various grammatical
rules.

In natural languages, words change their form depending on the context. These changes
follow certain predictable patterns, or paradigms. For example, in English, a verb like "to
run" follows a specific paradigm when it changes for different tenses, persons, or numbers:

• Base Form (Infinitive): run (verb)

• Third Person Singular Present: runs (e.g., She runs every day.)

• Past Tense: ran (e.g., He ran yesterday.)

• Past Participle: run (e.g., I have run five miles.)

• Present Participle: running (e.g., She is running now.)

Each of these forms is part of the paradigm of the verb "to run." Similarly, nouns follow
paradigms for plural forms:
• Singular: dog

• Plural: dogs

In NLP, recognizing the paradigms of words is crucial because it helps computers understand
how to analyze and generate words correctly based on their grammatical context. For
instance, when processing a text, a machine needs to identify the various word forms a
single root word might take, and morphology paradigms help in identifying these patterns.

For languages like Turkish, Finnish, or Arabic, the morphological paradigms are even more
complex. These languages often have extensive inflectional systems where a single root
word can have hundreds of different forms, all of which must be understood to properly
analyze and generate text.
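A paradigm can be represented in code as a mapping from grammatical features to word forms. The sketch below is a simplification under stated assumptions: the feature names (`3sg`, `past`, and so on) and the function `verb_paradigm` are hypothetical, irregular verbs are stored explicitly, and the regular rule ignores spelling changes such as consonant doubling.

```python
# Irregular paradigms are listed explicitly; regular ones are generated by rule.
# Feature names and the function name are assumptions for illustration.
IRREGULAR = {
    "run": {
        "base": "run",
        "3sg": "runs",
        "past": "ran",
        "past_participle": "run",
        "present_participle": "running",
    },
}


def verb_paradigm(lemma: str) -> dict:
    """Return the paradigm (feature -> form) of an English verb."""
    if lemma in IRREGULAR:
        return IRREGULAR[lemma]
    # Simplified regular pattern: no consonant doubling or e-deletion.
    return {
        "base": lemma,
        "3sg": lemma + "s",
        "past": lemma + "ed",
        "past_participle": lemma + "ed",
        "present_participle": lemma + "ing",
    }


print(verb_paradigm("run")["past"])
print(verb_paradigm("walk"))
```

For morphologically rich languages the same idea applies, but the table of features and forms becomes much larger, which is why paradigms are usually generated by rules or learned models rather than listed by hand.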

2) Finite State Morphology

Finite State Morphology refers to a computational approach for modeling and processing
the structure of words using finite state machines (FSMs). FSMs are mathematical models
used to represent a system that can exist in one of a finite number of states. In the context
of morphology, FSMs help in analyzing and generating word forms based on predefined
rules.

A finite state machine is a state machine that transitions between states based on input
symbols (in this case, morphemes). Each state represents a part of a word, such as a root,
prefix, or suffix. FSMs can be used to identify morphemes and understand how words are
constructed.

For example, consider the word "dogs":

1. The FSM starts in an initial state where it recognizes the root dog.

2. It transitions to a state where it recognizes the -s suffix, marking the word as plural.

3. The FSM finishes in a final state, which is the complete word "dogs."

FSMs are beneficial because they are efficient and can handle regular morphology well,
meaning words that follow consistent and predictable patterns. This is important for NLP
tasks such as:

• Morphological Analysis: Breaking down complex words into their components (e.g.,
recognizing "dogs" as "dog" + plural "-s").

• Morphological Generation: Creating valid word forms based on a given root (e.g.,
generating "played" from the root "play").

• Part-of-Speech Tagging: Identifying a word's grammatical category (e.g., noun, verb,
adjective) based on its form.

For languages with regular inflectional rules (e.g., English), finite state morphology is an
efficient and straightforward way to model and process word forms.
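The "dogs" walk-through can be sketched as a small transition table. This is a minimal illustration, assuming a two-word lexicon and a single plural suffix; the state names, `ROOTS`, and `analyze` are hypothetical and not taken from any particular toolkit.

```python
# Toy finite state machine for noun number: START -> STEM -> PLURAL.
# The lexicon and state names are assumptions for illustration.
ROOTS = {"dog", "cat"}

# (current state, input symbol) -> next state
TRANSITIONS = {
    ("START", "ROOT"): "STEM",
    ("STEM", "+s"): "PLURAL",
}
FINAL_STATES = {"STEM", "PLURAL"}


def analyze(word):
    """Return (root, number) if the FSM accepts the word, otherwise None."""
    # Try every split of the word into a candidate root and a remaining suffix.
    for i in range(1, len(word) + 1):
        root, rest = word[:i], word[i:]
        if root not in ROOTS:
            continue
        state = TRANSITIONS[("START", "ROOT")]
        if rest:
            state = TRANSITIONS.get((state, "+" + rest))
        if state in FINAL_STATES:
            return (root, "plural" if state == "PLURAL" else "singular")
    return None


print(analyze("dogs"))
print(analyze("dog"))
print(analyze("dogz"))
```

Because the machine only follows transitions listed in the table, an unknown ending such as "z" leaves it without a valid next state and the word is rejected.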

3) State Machine-Based Morphology

State machine-based morphology is a broader concept that uses state machines (including
finite state machines) to model the morphological structure of words. State machines, in
this context, represent the process of analyzing or generating word forms through a series of
transitions between states.

In state machine-based morphology, a word is processed as it transitions through various
states, each of which corresponds to a specific part of the word (root, prefix, suffix). This
approach is useful because it can handle both morphological analysis (e.g., breaking down a
word into its components) and morphological generation (e.g., creating different forms of a
word).

For example, let’s analyze the word "happiness" using a state machine:

1. The machine starts with the root of the word: happy.

2. It transitions to a state where it adds the -ness suffix, turning the adjective into a
noun.

3. The machine finishes in a final state where the word is now "happiness."

The system can also handle more complex words like "unhappiness":

1. The root is happy.

2. The prefix un- is added.

3. The suffix -ness is added.

State machines can handle both inflectional and derivational morphology. Inflectional
morphology refers to creating different grammatical forms of a word, like turning verbs into
past tense or pluralizing nouns. Derivational morphology involves creating entirely new
words with different meanings, like turning an adjective into a noun or verb.
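The "happiness" and "unhappiness" examples can be sketched as a segmenter that follows the path prefix → root → suffix, where the prefix and suffix steps are optional. This is a simplification under stated assumptions: the tiny affix lists, the lexicon, and the single spelling rule (final "i" restored to "y" before a suffix) are chosen only for this illustration.

```python
# Toy derivational segmenter following the path: optional prefix -> root -> optional suffix.
# Affix lists, lexicon and the y/i spelling rule are assumptions for illustration.
PREFIXES = ["un"]
SUFFIXES = ["ness"]
ROOTS = {"happy", "kind"}


def segment(word):
    """Return the list of morphemes for a word, or None if no path accepts it."""
    for prefix in PREFIXES + [""]:
        if not word.startswith(prefix):
            continue
        for suffix in SUFFIXES + [""]:
            if not word.endswith(suffix):
                continue
            middle = word[len(prefix): len(word) - len(suffix)]
            # Undo the y -> i spelling change that happens before a suffix.
            if suffix and middle.endswith("i"):
                middle = middle[:-1] + "y"
            if middle in ROOTS:
                parts = [prefix + "-"] if prefix else []
                parts.append(middle)
                if suffix:
                    parts.append("-" + suffix)
                return parts
    return None


print(segment("happiness"))
print(segment("unhappiness"))
print(segment("unkind"))
```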

State machine-based morphology is especially useful in languages with complex word
structures and where rules need to be followed rigorously.

4) Automatic Morphology Learning; Shallow Parsing

Automatic Morphology Learning refers to the process by which a computational system
learns the morphological rules of a language automatically, without human intervention.
This is especially useful in NLP when dealing with languages that have complex or irregular
morphological systems.

In automatic morphology learning, the system is typically trained on large datasets, often
corpora (large collections of text). Through statistical analysis and pattern recognition, the
system identifies and learns the morphemes that make up words and the patterns in which
they appear. For example, it can learn that adding the suffix -ed to verbs like "play" creates
the past tense "played".

This process involves:

1. Identifying the morphemes (the smallest meaningful units of language) in a large
dataset.

2. Analyzing how these morphemes combine to form valid word structures.

3. Using this information to generate word forms and analyze unknown words in the
future.
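The three steps above can be sketched with a very simple unsupervised heuristic: a word ending is counted as a candidate suffix whenever removing it leaves another word that also occurs in the vocabulary. This is only one possible approach, shown under stated assumptions; the word list, the length limits, and the name `learn_suffixes` are hypothetical.

```python
from collections import Counter


def learn_suffixes(vocabulary, max_suffix_len=3, min_stem_len=3):
    """Count endings whose removal leaves another word of the vocabulary."""
    vocab = set(vocabulary)
    counts = Counter()
    for word in vocab:
        for k in range(1, max_suffix_len + 1):
            stem, suffix = word[:-k], word[-k:]
            if len(stem) >= min_stem_len and stem in vocab:
                counts[suffix] += 1
    return counts


# A tiny assumed "corpus" vocabulary for illustration.
words = [
    "play", "played", "playing", "plays",
    "walk", "walked", "walking", "walks",
    "jump", "jumped", "jumps",
]
suffix_counts = learn_suffixes(words)
print(suffix_counts.most_common())
```

On this small word list the heuristic recovers -ed, -s, and -ing without being told about them; on real corpora such counts are noisy and are usually combined with further statistical filtering.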

Shallow Parsing (also called chunking) is a technique used to break down sentences into
smaller, more manageable parts or chunks. Unlike deep parsing, which aims to identify the
entire syntactic structure of a sentence, shallow parsing only identifies key chunks like noun
phrases, verb phrases, and prepositional phrases. It does not attempt to fully parse the
syntax or deep relationships between words.

For example, consider the sentence:

• The dog quickly ran to the park.

A shallow parser might divide this sentence into chunks:

• [The dog] (Noun Phrase)

• [quickly ran] (Verb Phrase)

• [to the park] (Prepositional Phrase)

Shallow parsing is useful because it provides a first-level understanding of a sentence's
structure, which is sufficient for many NLP tasks like information extraction, named entity
recognition, and machine translation.
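The chunking example above can be sketched as a small rule-based chunker that works on words already labeled with part-of-speech tags. The tags are supplied by hand here (an assumption; a real system would obtain them from a tagger), and the three chunk rules are deliberately minimal.

```python
# Minimal rule-based chunker over hand-tagged tokens (Penn Treebank style tags).
# Rules (assumed for illustration): NP = DT? JJ* NN+ ; PP = IN NP ; VP = RB* VB+
tagged = [
    ("The", "DT"), ("dog", "NN"), ("quickly", "RB"), ("ran", "VBD"),
    ("to", "IN"), ("the", "DT"), ("park", "NN"),
]


def chunk(tagged):
    """Group tagged tokens into (label, text) chunks; 'O' marks unchunked tokens."""
    n = len(tagged)

    def np_end(j):
        """Index just past a noun phrase starting at j, or j if there is none."""
        k = j
        if k < n and tagged[k][1] == "DT":
            k += 1
        while k < n and tagged[k][1].startswith("JJ"):
            k += 1
        first_noun = k
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        return k if k > first_noun else j

    chunks, i = [], 0
    while i < n:
        tag = tagged[i][1]
        if tag == "IN" and np_end(i + 1) > i + 1:
            end, label = np_end(i + 1), "PP"
        elif np_end(i) > i:
            end, label = np_end(i), "NP"
        elif tag.startswith("RB") or tag.startswith("VB"):
            end = i
            while end < n and tagged[end][1].startswith("RB"):
                end += 1
            first_verb = end
            while end < n and tagged[end][1].startswith("VB"):
                end += 1
            if end == first_verb:  # adverbs not followed by a verb
                end, label = i + 1, "O"
            else:
                label = "VP"
        else:
            end, label = i + 1, "O"
        chunks.append((label, " ".join(word for word, _ in tagged[i:end])))
        i = end
    return chunks


print(chunk(tagged))
```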

In the context of automatic morphology learning, shallow parsing can help identify which
parts of a sentence are likely to contain morphemes that need to be analyzed or generated.
For instance, it might help the system focus on specific noun phrases or verb phrases where
morphological changes are likely to occur (e.g., pluralizing a noun or conjugating a verb).

Conclusion

To conclude:

1. Morphology Paradigms help in identifying and organizing the various forms a word
can take based on grammatical features.

2. Finite State Morphology uses finite state machines to model and process word forms
efficiently, often applied to languages with regular morphology.

3. State Machine-Based Morphology involves using state machines to model both
inflectional and derivational morphology, providing a way to break down or generate
complex words.

4. Automatic Morphology Learning uses data-driven approaches to learn the rules of
word formation, while Shallow Parsing is a technique for breaking sentences into
chunks to simplify further analysis.

These concepts are central to improving how computers handle and process natural
language, making them essential in many NLP applications such as machine translation,
information retrieval, and speech recognition.

Named Entity Recognition (NER) and Named Entity Relations in NLP

Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that
focuses on identifying and classifying named entities (NEs) in text. Named entities are words
or phrases that represent specific objects, such as persons, organizations, locations, dates,
monetary values, etc. NER helps machines understand and extract these important entities
from unstructured text.

Named Entity Relations refers to identifying and understanding the relationships between
these named entities in a given text. This is an extension of NER that goes beyond
recognizing individual entities to recognizing how these entities are related to one another
in the context of the sentence or document.

1. What is Named Entity Recognition (NER)?

Named Entity Recognition (NER) is the task of classifying words or phrases in a text that
refer to specific entities (names, places, dates, etc.). The goal of NER is to categorize these
entities into predefined categories. For example, in a sentence like:

• "Albert Einstein was born in Ulm, Germany, on March 14, 1879."

NER would identify:

• Albert Einstein as a Person.

• Ulm as a Location.

• Germany as a Location.

• March 14, 1879 as a Date.


NER systems typically use a combination of techniques, including:

• Rule-based systems (manual rules for pattern matching).

• Machine learning models (training on labeled datasets to recognize entities).

• Deep learning models (using neural networks, especially for large-scale text
corpora).

NER can recognize a variety of entity types, such as:

• Persons (PER): Individuals or groups of people.

• Organizations (ORG): Corporations, institutions, or agencies.

• Locations (LOC): Geographical entities like cities, countries, or landmarks.

• Dates (DATE): Specific dates or time periods.

• Monetary values (MONEY): Amounts of money.

• Percentages (PERCENT): Percent values.

• Time (TIME): Specific points in time.

Example of Named Entity Recognition

Sentence: "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."

• Apple Inc. → Organization

• Steve Jobs → Person

• Cupertino → Location

• California → Location

• April 1, 1976 → Date
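The example above can be reproduced with a toy rule-based recognizer that combines a name list (gazetteer) with a regular expression for dates. This is a sketch under stated assumptions: the gazetteer entries, the date pattern, and the function `toy_ner` are made up for this one sentence, and real NER systems rely on trained models rather than fixed lists.

```python
import re

# Hypothetical gazetteer: known names mapped to entity types.
GAZETTEER = {
    "Apple Inc.": "ORG",
    "Steve Jobs": "PER",
    "Cupertino": "LOC",
    "California": "LOC",
}

# Dates of the form "April 1, 1976".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def toy_ner(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for match in DATE_RE.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    return [(entity, label) for _, entity, label in sorted(found)]


sentence = "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."
for entity, label in toy_ner(sentence):
    print(entity, "->", label)
```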

2. Named Entity Relations (NER + Relationships)

Once the named entities are recognized, the next step is to identify how they relate to each
other. Named Entity Relation Extraction focuses on uncovering relationships between
entities that may not be explicitly stated but are implied by their co-occurrence and context
in the text.

In the same example:

• "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."

We can derive relationships:

• Apple Inc. founded by Steve Jobs (relationship: Founder)


• Apple Inc. located in Cupertino (relationship: Location)

• Cupertino located in California (relationship: Location)

• Apple Inc. founded on April 1, 1976 (relationship: Founding Date)

In this case, NER helps identify the entities (Apple Inc., Steve Jobs, Cupertino, etc.), and the
relationship extraction process identifies the relations (e.g., "founded by," "located in,"
"founded on") that describe how these entities are connected.

Types of Relations in Named Entity Recognition

In Named Entity Relation Extraction, there are several types of relations that can be
identified:

1. Founders and Organizations:

o "Bill Gates founded Microsoft."

o Relation: Bill Gates → founded → Microsoft.

2. Locations of Organizations:

o "Google is headquartered in Mountain View."

o Relation: Google → headquartered in → Mountain View.

3. People and Locations:

o "Mark Zuckerberg lives in Palo Alto."

o Relation: Mark Zuckerberg → lives in → Palo Alto.

4. Date of Events:

o "The Declaration of Independence was signed on July 4, 1776."

o Relation: Declaration of Independence → signed on → July 4, 1776.

5. Person’s Role in Organization:

o "Elon Musk is the CEO of Tesla."

o Relation: Elon Musk → CEO of → Tesla.

3. Techniques for Named Entity Relation Extraction

To extract relationships between named entities, there are several methods employed in
NLP:

1. Rule-based Methods:

o These use predefined linguistic patterns or regular expressions to find
relations. For example, a pattern might look for the structure "X is the Y of Z"
to recognize the relationship between a person and their job title in an
organization.

o Example pattern: "[Person] is the [Job Title] of [Company]"

2. Supervised Machine Learning:

o In this approach, a model is trained on a labeled dataset where entities and
their relationships are manually annotated. Common algorithms include
Support Vector Machines (SVM), Conditional Random Fields (CRF), or Deep
Neural Networks (DNNs).

o A training dataset might consist of pairs of entities (e.g., person-organization)
along with the relationship labels (e.g., founder, CEO).

3. Unsupervised or Semi-supervised Methods:

o These methods do not require labeled training data. Instead, they use
unsupervised clustering algorithms or semi-supervised approaches to identify
patterns and relationships in the text.

o Techniques such as word embeddings (e.g., Word2Vec, GloVe) can help
model semantic similarities between entities, which can then be used to infer
relationships.

4. Deep Learning (Neural Networks):

o Neural networks, particularly Recurrent Neural Networks (RNNs) or
Transformers (like BERT), have been highly effective in recognizing relations.
These models can capture contextual information and relationships by
processing entire sequences of text, rather than relying on pre-determined
rules or patterns.

o Example: Using BERT, a model might be able to infer relationships such as
"person X works at organization Y" from a sentence.

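The rule-based approach described above can be sketched with a single regular expression for the "X is the Y of Z" structure. The pattern and the names `ROLE_PATTERN` and `extract_roles` are assumptions for illustration; the pattern only handles simple capitalized names and would miss most real-world phrasing.

```python
import re

# Pattern for "<Person> is the <role> of <Organization>" (assumed, simplified).
ROLE_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
    r" is the (?P<role>[A-Za-z ]+?)"
    r" of (?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract_roles(text):
    """Return (person, role, organization) triples found in the text."""
    return [
        (m.group("person"), m.group("role"), m.group("org"))
        for m in ROLE_PATTERN.finditer(text)
    ]


print(extract_roles("Elon Musk is the CEO of Tesla."))
print(extract_roles("Bill Gates founded Microsoft."))
```

The second sentence yields no triple because it does not follow the pattern, which shows the main weakness of hand-written rules: every new phrasing needs a new rule.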
4. Example of Named Entity Relations in NLP

Let’s take a complex example:

Sentence: "Bill Gates co-founded Microsoft in Albuquerque in 1975 with Paul Allen."

NER Output:

• Bill Gates → Person

• Microsoft → Organization
• Albuquerque → Location

• 1975 → Date

• Paul Allen → Person

Named Entity Relations:

• Bill Gates co-founded Microsoft (Founder relation).

• Microsoft located in Albuquerque (Location relation).

• Microsoft founded in 1975 (Founding Date relation).

• Bill Gates co-founder with Paul Allen (Co-founder relation).

5. Applications of Named Entity Relations in NLP

1. Information Extraction:

o NER and relation extraction allow systems to automatically extract useful
information from large text corpora. This is useful in applications such as
news aggregation, research, and financial analysis, where understanding the
connections between entities can reveal insights.

o Example: Extracting all mentions of companies and their founders from news
articles.

2. Question Answering:

o NER and relation extraction are used to answer questions by recognizing
entities and understanding their relationships. For example, "Who is the CEO
of Tesla?" can be answered by extracting the relationship between Elon Musk
and Tesla.

o Example: Given the sentence "Elon Musk is the CEO of Tesla," the system
recognizes Elon Musk and Tesla as entities and understands the relationship
CEO of.

3. Knowledge Graphs:

o NER and relation extraction are essential for building knowledge graphs,
where entities are represented as nodes, and relationships between them are
represented as edges. This helps in connecting and organizing information,
enabling advanced queries and insights.

o Example: A knowledge graph might connect Steve Jobs to Apple Inc. with a
relationship labeled Founder.

4. Text Summarization:
o Identifying key entities and their relations helps in generating concise and
informative summaries of long documents. By focusing on the most relevant
entities and their relationships, a summary can convey the essential
information.
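The knowledge graph and question answering applications above can be sketched together: extracted relations are stored as (subject, relation, object) triples, and a question such as "Who is the CEO of Tesla?" becomes a lookup. The triples, the relation names, and the helper `who` are hypothetical and stand in for the output of a real extraction system.

```python
from collections import defaultdict

# Hypothetical triples, as a relation extraction step might produce them.
triples = [
    ("Steve Jobs", "founder_of", "Apple Inc."),
    ("Elon Musk", "ceo_of", "Tesla"),
    ("Apple Inc.", "located_in", "Cupertino"),
]

# Index the edges by (relation, object) so that "who is the X of Y?" is a lookup.
by_relation_object = defaultdict(list)
for subject, relation, obj in triples:
    by_relation_object[(relation, obj)].append(subject)


def who(relation, obj):
    """Return all subjects connected to obj by the given relation."""
    return by_relation_object.get((relation, obj), [])


print(who("ceo_of", "Tesla"))
print(who("founder_of", "Apple Inc."))
```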

Conclusion

Named Entity Recognition (NER) identifies specific entities such as people, organizations,
and locations in a text. Named Entity Relation Extraction goes beyond simply identifying
these entities and focuses on extracting the relationships between them, such as ownership,
affiliation, or location. This is an essential task in NLP that enables machines to understand
and organize information, making it applicable in a wide range of applications such as
information extraction, question answering, and knowledge graph creation.

1. Maximum Entropy Models in NLP

Maximum Entropy Models (also known as MaxEnt models) are statistical models used for
classification tasks in Natural Language Processing (NLP). They are based on the principle of
maximum entropy: among all probability distributions that are consistent with the
information we have, choose the most uncertain (most uniform) one, because it makes the
least biased predictions.

In simple terms, a Maximum Entropy Model is used to predict the probability of an event
(like classifying a word or sentence), but it tries to maximize uncertainty about unknown
information, only using the known data to make its predictions.

How Does Maximum Entropy Work?

MaxEnt models rely on the idea of entropy (a measure of uncertainty or unpredictability).
The basic idea is that the model should make as few assumptions as possible, given the
available data. So, if there are multiple ways to explain the data, the model chooses the one
that maximizes the uncertainty about the unobserved information.

For example:

• In a text classification task, MaxEnt uses available features (like the presence of
specific words or word combinations) to predict the category of the text.

• The model adjusts the probabilities of different possible categories in such a way that
it does not assume anything extra that is not justified by the data.

Mathematical Formulation:

The maximum entropy principle says that, given some known features $f_1, f_2, \dots, f_n$
about an event, the model should assign the maximum entropy distribution, i.e., the least
biased distribution, subject to the constraints that the expected values of the features
match their observed values in the training data.

The probability distribution over a set of categories $C$ (e.g., “spam” or “not spam”) is
given by:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \lambda_i f_i(x, y) \right)$$

Where:

• $f_i(x, y)$ is a feature function.

• $\lambda_i$ is the weight for the feature.

• $Z(x)$ is a normalizing factor to ensure the sum of probabilities is 1.
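The formula above can be evaluated directly in a few lines of code. This is a minimal sketch, assuming two hypothetical binary feature functions for a spam example and hand-picked weights; in practice the weights are learned from training data.

```python
import math


def maxent_probs(x, labels, feature_fns, weights):
    """Compute P(y | x) = exp(sum_i w_i * f_i(x, y)) / Z(x) for every label y."""
    scores = {
        y: sum(w * f(x, y) for f, w in zip(feature_fns, weights))
        for y in labels
    }
    z = sum(math.exp(s) for s in scores.values())  # normalizing factor Z(x)
    return {y: math.exp(s) / z for y, s in scores.items()}


LABELS = ["spam", "not spam"]

# Hypothetical binary features: each fires for one (evidence, label) combination.
FEATURES = [
    lambda x, y: 1.0 if "free" in x and y == "spam" else 0.0,
    lambda x, y: 1.0 if "meeting" in x and y == "not spam" else 0.0,
]
WEIGHTS = [1.5, 2.0]  # hand-picked for illustration, not learned

probs = maxent_probs({"free", "offer"}, LABELS, FEATURES, WEIGHTS)
print(probs)

# With no active features the model falls back to the uniform distribution.
print(maxent_probs(set(), LABELS, FEATURES, WEIGHTS))
```

The second call illustrates the maximum entropy idea: when no feature provides evidence, every category receives the same probability.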

Example:

Let’s say you have a sentence and you want to predict the part of speech (POS) tag for each
word (e.g., noun, verb, adjective, etc.).

• Features might be things like the word itself, the previous word, or if the word starts
with a capital letter.

• Categories are the possible POS tags (e.g., noun, verb, etc.).

• The MaxEnt model would use the features to calculate the probabilities for each
category (POS tag) for each word in the sentence.

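During training, the feature weights are adjusted so that the model's expected feature
values match those observed in the training data; among all distributions that satisfy these
constraints, the resulting model is the one with maximum entropy.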

Applications:

• Text Classification: Classifying documents into categories.

• Named Entity Recognition (NER): Identifying entities like names of people,
organizations, or places.

• Part-of-Speech Tagging: Labeling each word in a sentence with its corresponding POS
tag.

2. Random Fields (Conditional Random Fields - CRFs)

Random Fields are models used to predict sequences of labels, such as the tags of the words
in a sentence, or the labels of actions in temporal data. The most commonly used random
field model in NLP is the Conditional Random Field (CRF), which is often applied to
sequence labeling tasks.

A Conditional Random Field (CRF) is a probabilistic graphical model that is used for
structured prediction. Unlike traditional classification methods where each instance is
classified independently, CRFs predict sequences or structures of labels based on the
context provided by the whole input sequence.

How Does CRF Work?

In NLP, CRFs are used to model the conditional probability of a label sequence
$y = (y_1, y_2, \dots, y_n)$ given an observation sequence $x = (x_1, x_2, \dots, x_n)$,
where each $x_i$ might represent a word or feature in a sequence, and each $y_i$
represents a label (like POS tags).

The probability of a label sequence $y$ is modeled as:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(y_{i-1}, y_i, x_i) \right)$$

Where:

• $f_j(y_{i-1}, y_i, x_i)$ are feature functions that describe how the label $y_i$ depends
on the previous label $y_{i-1}$ and the current observation $x_i$.

• $\lambda_j$ are parameters learned during training.

• $Z(x)$ is a normalizing factor to ensure the probabilities sum to 1.

Key Ideas:

• CRFs are Conditional: Unlike other models (like HMMs), which model the joint
probability of the observations and labels, CRFs model the conditional probability of
the labels given the observations. This means they avoid issues with the
independence assumptions that other models make.

• Feature Functions: CRFs use features that describe how likely certain labels are,
based on the data. Features might include things like:

o The word itself.

o The previous word.

o The current tag or label.

o Other word features (e.g., if the word is capitalized, its part of speech, etc.).

Example:

Consider the task of POS tagging in a sentence:

Sentence: "I love programming."

The goal is to predict the part of speech for each word:


• "I" → Pronoun

• "love" → Verb

• "programming" → Noun

Using a CRF model, we can take into account not just the word itself (e.g., "love") but also
the context (e.g., "I" is usually followed by a verb, "love" is typically a verb, and
"programming" is a noun).

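The POS tagging example can be sketched as a tiny linear-chain CRF scored by brute force. Everything numeric here is an assumption: each table entry plays the role of a weight for one indicator feature, the values are hand-picked rather than learned, and the normalizing factor is computed by enumerating every label sequence, which is only feasible for toy inputs (real implementations use dynamic programming).

```python
import math
from itertools import product

LABELS = ["PRON", "VERB", "NOUN"]

# Hand-picked weights (assumed): one entry per indicator feature.
EMISSION = {  # weight for "current word has this tag"
    ("I", "PRON"): 2.0,
    ("love", "VERB"): 1.0,
    ("love", "NOUN"): 1.0,
    ("programming", "NOUN"): 1.5,
    ("programming", "VERB"): 1.0,
}
TRANSITION = {  # weight for "previous tag is followed by this tag"
    ("START", "PRON"): 1.0,
    ("PRON", "VERB"): 1.5,
    ("VERB", "NOUN"): 1.0,
    ("NOUN", "NOUN"): 0.5,
}


def score(words, tags):
    """Sum of the weights of all features that fire along the sequence."""
    total, prev = 0.0, "START"
    for word, tag in zip(words, tags):
        total += EMISSION.get((word, tag), 0.0)
        total += TRANSITION.get((prev, tag), 0.0)
        prev = tag
    return total


def crf_distribution(words):
    """P(tags | words) for every possible tag sequence, by brute-force enumeration."""
    unnormalized = {
        tags: math.exp(score(words, tags))
        for tags in product(LABELS, repeat=len(words))
    }
    z = sum(unnormalized.values())  # normalizing factor Z(x)
    return {tags: value / z for tags, value in unnormalized.items()}


dist = crf_distribution(["I", "love", "programming"])
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

Although "love" has the same emission weight as a verb and as a noun in this toy model, the transition from a pronoun to a verb tips the decision, which is the kind of label dependency a CRF is meant to capture.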
Applications:

• Named Entity Recognition (NER): Classifying parts of a sentence as people,
organizations, or places.

• Part-of-Speech Tagging: Assigning grammatical tags to each word.

• Chunking: Identifying groups of words that form meaningful units (like noun
phrases).

• Speech Recognition: Modeling sequences of sounds or words.

Difference between CRF and MaxEnt:

• MaxEnt is generally used for independent classification tasks, where each instance is
classified independently.

• CRF is used for structured prediction where the labels depend on each other, making
it ideal for sequence labeling tasks where the relationship between labels is
important.

Summary

• Maximum Entropy Models (MaxEnt) are statistical models that predict probabilities
for classification tasks while maximizing uncertainty (entropy), using the available
data without making assumptions beyond what that data supports.

• Random Fields (specifically Conditional Random Fields or CRFs) are used for
sequence labeling tasks in NLP. CRFs predict sequences of labels given input
sequences, using contextual information and feature functions that describe how
labels depend on the observations and previous labels.

Both MaxEnt and CRF are crucial for various NLP tasks like text classification, POS tagging,
named entity recognition, and more, with MaxEnt focusing on independent classification
and CRF handling sequence-based predictions.
[Unit 3] Syntax Analysis

Theories of Parsing, Parsing Algorithms; Robust and Scalable Parsing on Noisy Text as in Web
documents; Hybrid of Rule Based and Probabilistic Parsing; Scope Ambiguity and Attachment
Ambiguity resolution

Theories of Parsing

Parsing, also known as syntactic analysis, is the process of analyzing a sequence of tokens to
determine the grammatical structure of a program. It takes the stream of tokens, which are
generated by a lexical analyzer or tokenizer, and organizes them into a parse tree or syntax
tree.

The parse tree visually represents how the tokens fit together according to the rules of the
language’s syntax. This tree structure is crucial for understanding the program’s structure
and helps in the next stages of processing, such as code generation or execution.
Additionally, parsing ensures that the sequence of tokens follows the syntactic rules of the
programming language, making the program valid and ready for further analysis or
execution.

What is the Role of Parser?

A parser performs syntactic and semantic analysis of source code, converting it into an
intermediate representation while detecting and handling errors.

1. Context-free syntax analysis: The parser checks if the structure of the code follows
the basic rules of the programming language (like grammar rules). It looks at how
words and symbols are arranged.

2. Guides context-sensitive analysis: It helps with deeper checks that depend on the
meaning of the code, like making sure variables are used correctly. For example, it
ensures that a variable used in a mathematical operation, like x + 2, is a number and
not text.

3. Constructs an intermediate representation: The parser creates a simpler version of
your code that’s easier for the computer to understand and work with.

4. Produces meaningful error messages: If there’s something wrong in your code, the
parser tries to explain the problem clearly so you can fix it.

5. Attempts error correction: Sometimes, the parser tries to fix small mistakes in your
code so it can keep working without breaking completely.

Types of Parsing

The parsing is divided into two types, which are as follows:

• Top-down Parsing

• Bottom-up Parsing

Top-Down Parsing

Top-down parsing is a method of building a parse tree from the start symbol (root) down to
the leaves (end symbols). The parser begins with the highest-level rule and works its way
down, trying to match the input string step by step.

• Process: The parser starts with the start symbol and looks for rules that can help it
rewrite this symbol. It keeps breaking down the symbols (non-terminals) into smaller
parts until it matches the input string.

• Leftmost Derivation: In top-down parsing, the parser always chooses the leftmost
non-terminal to expand first, following what is called leftmost derivation. This means
the parser works on the left side of the string before moving to the right.

• Other Names: Top-down parsing is sometimes called recursive parsing or predictive
parsing. It is called recursive because it often uses recursive functions to process the
symbols.

Top-down parsing is useful for simple languages and is often easier to implement. However,
it can have trouble with more complex or ambiguous grammars.
Top-down parsers can be classified into two types based on whether they use backtracking
or not:

1. Top-down Parsing with Backtracking

In this approach, the parser tries different possibilities when it encounters a choice. If one
possibility doesn’t work (i.e., it doesn’t match the input string), the parser backtracks to the
previous decision point and tries another possibility.

Example: If the parser chooses a rule to expand a non-terminal, and it doesn’t work, it will
go back, undo the choice, and try a different rule.

Advantage: It can handle grammars where there are multiple possible ways to expand a
non-terminal.

Disadvantage: Backtracking can be slow and inefficient because the parser might have to try
many possibilities before finding the correct one.

2. Top-down Parsing without Backtracking

In this approach, the parser does not backtrack. It tries to find a match with the input using
only the first choice it makes. If it doesn’t match the input, it fails immediately instead of
going back to try another option.

Example: The parser will always stick with its first decision and will not reconsider other
rules once it starts parsing.

Advantage: It is faster because it doesn’t waste time going back to previous steps.

Disadvantage: It can only handle simpler grammars that don’t require trying multiple
choices.
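A minimal sketch of top-down parsing with backtracking is given below. The toy grammar and the function names are assumptions made for illustration. Each non-terminal is expanded from the start symbol `S`, and Python generators are used so that when one alternative fails, the next one is tried automatically, which is the backtracking behaviour described above. Note that this simple approach would loop forever on a left-recursive rule such as `E -> E + T`.

```python
# Toy grammar (hypothetical): each non-terminal maps to a list of alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}


def expand(symbol, tokens, pos):
    """Yield every position reachable after deriving `symbol` from tokens[pos:]."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must match the next input token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        # Try each alternative in turn; a failed alternative yields nothing.
        yield from match_sequence(production, tokens, pos)


def match_sequence(symbols, tokens, pos):
    """Match a sequence of grammar symbols starting at `pos`."""
    if not symbols:
        yield pos
        return
    for mid in expand(symbols[0], tokens, pos):
        yield from match_sequence(symbols[1:], tokens, mid)


def recognize(sentence):
    """Return True if the whole sentence can be derived from S."""
    tokens = sentence.split()
    return any(end == len(tokens) for end in expand("S", tokens, 0))
```

For example, `recognize("the dog sees the cat")` succeeds, while `recognize("the sees dog")` fails because no alternative for `NP` matches the input.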

Bottom-Up Parsing

Bottom-up parsing is a method of building a parse tree starting from the leaf nodes (the
input symbols) and working towards the root node (the start symbol). The goal is to reduce
the input string step by step until we reach the start symbol, which represents the entire
language.

• Process: The parser begins with the input symbols and looks for patterns that can be
reduced to non-terminals based on the grammar rules. It keeps reducing parts of the
string until it forms the start symbol.

• Rightmost Derivation in Reverse: In bottom-up parsing, the parser traces the
rightmost derivation of the string but works backwards, starting from the input string
and moving towards the start symbol.

• Shift-Reduce Parsing: Bottom-up parsers are often called shift-reduce parsers
because they shift (push the next input symbol onto a stack) and reduce (replace
symbols on top of the stack with the left-hand side of a matching rule) to
build the parse tree.

Bottom-up parsing is efficient for handling more complex grammars and is commonly used
in compilers. However, it can be more challenging to implement compared to top-down
parsing.
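The shift and reduce steps can be illustrated with a small sketch. It is a simplified, greedy recognizer under stated assumptions: the grammar is a toy one, rules are tried in the listed order (longest right-hand side first), and there is no parse table, so it is not an LR parser and will not work for every grammar.

```python
# Toy grammar (hypothetical). Order matters: longer right-hand sides come first.
RULES = [
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["id"]),
]


def shift_reduce(tokens):
    """Greedy shift-reduce recognizer; returns (accepted, list of actions)."""
    stack, buffer, trace = [], list(tokens), []
    while True:
        for lhs, rhs in RULES:
            if stack[-len(rhs):] == rhs:
                # Reduce: replace the matched symbols on top of the stack.
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            # No rule applies: shift the next input symbol, or stop.
            if not buffer:
                break
            stack.append(buffer.pop(0))
            trace.append(f"shift {stack[-1]}")
    return stack == ["E"], trace


accepted, actions = shift_reduce(["id", "+", "id"])
```

Reading the reduce actions in `actions` from last to first gives a rightmost derivation of `id + id`, which is the "rightmost derivation in reverse" mentioned above.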

Generally, bottom-up parsing is categorized into the following types:

1. LR parsing/Shift-Reduce Parsing: Shift-reduce parsing is the process of reducing an input
string to the start symbol of the grammar. Common LR variants are:

• LR(0)

• SLR(1)

• LALR

• CLR

2. Operator Precedence Parsing: Parsing a grammar that is defined as an operator grammar is
known as operator precedence parsing. An operator grammar must have no null
productions, and no production may have two adjacent non-terminals on its right-hand side.

Difference Between Bottom-Up and Top-Down Parser

Feature | Top-down Parsing | Bottom-up Parsing
---------------------------------------------
Direction | Builds tree from root to leaves. | Builds tree from leaves to root.
Derivation | Uses leftmost derivation. | Uses rightmost derivation in reverse.
Efficiency | Can be slower, especially with backtracking. | More efficient for complex grammars.
Example Parsers | Recursive descent, LL parser. | Shift-reduce, LR parser.

Robust and Scalable Parsing on Noisy Text as in Web Documents


Parsing noisy text, especially from web documents, is a challenge because such text often
contains irregularities, typos, incomplete sentences, or mixed structures. The goal is to make
parsing robust (able to handle noisy input without failure) and scalable (able to process
large volumes of data efficiently). In Natural Language Processing (NLP), parsing web data is
crucial for tasks like information extraction, sentiment analysis, and search engine
optimization.

Challenges in Parsing Noisy Web Text

1. Noisy Text: Web text often contains informal language, abbreviations, slang, typos,
inconsistent punctuation, and mixed languages. For example:

o "I dont knw, where it's at??"

o "This's awesome...LUV it!"

2. Irregular Sentence Structures: Web documents can have fragmented sentences,
missing subjects or verbs, and inconsistent grammatical structure, making them
harder to parse.

o "Great article! I love it, the best thing."

o "When you go to the store, make sure to pick up some bread."

3. Code and Non-Text Content: Many web documents include HTML, JavaScript, or
other non-natural language elements mixed with text. This can confuse parsers,
which are usually designed for clean, standard text.

4. Ambiguity: Words or phrases in noisy text can be ambiguous, and the lack of context
or non-standard syntax makes disambiguation harder.

Approaches to Robust and Scalable Parsing

1. Preprocessing Noisy Text

o Text Normalization: This involves converting informal language or
abbreviations into standard forms to help parsers handle the text better.

▪ Example: Convert "dont" → "don't", "luv" → "love", "u" → "you".

▪ Tools: Spelling correction algorithms, regular expressions, or
pre-trained models for slang.

o Sentence Splitting: Text on the web can have improper sentence
segmentation. Algorithms can be used to identify sentence boundaries, even
when punctuation is missing or misused.

▪ Example: "I went to the park but I forgot my shoes" can be split
into two clauses by recognizing the conjunction 'but' as a clause boundary.

2. Handling Incomplete Sentences or Fragments

o Many web texts include sentence fragments. Dependency parsing or
Constituency parsing models can be designed to handle incomplete
sentences by using language models trained on noisy corpora.

o Machine Learning Models: Using models like BERT, GPT, or transformers can
help in handling fragments by providing better context understanding.

▪ Example: For a fragment like "The dog running fast," these models can
still derive relationships (dog → running).

3. Handling Ambiguity in Noisy Text

o Ambiguity is a common issue, especially in noisy text. Disambiguation models
that rely on contextual information can help resolve ambiguities.

o Contextualized Word Embeddings: Pre-trained models such as BERT or ELMo
that generate embeddings based on context can resolve ambiguities (e.g.,
recognizing that "bank" refers to a financial institution or a riverbank
depending on the context).

o Example: In a sentence, "I went to the bank," using contextual embeddings, a
parser can understand whether "bank" is a financial institution or the side of
a river.

4. Scalable Parsing on Web Documents

o Efficient Algorithms: For scalability, shift-reduce parsing and chart parsing
are popular choices. Shift-reduce parsers make a single left-to-right pass over
the input, while chart parsers avoid redundant computation by storing
intermediate results.

o Distributed Processing: To scale, parsing can be distributed across multiple
machines or processes, especially when dealing with vast amounts of web
data. Using Apache Spark or Hadoop clusters can help in distributing parsing
tasks to increase efficiency.

o Incremental Parsing: Earley Parsing is one example that can be used for
incremental parsing, where text is parsed as it comes in, without needing to
wait for the entire document to be processed. This is particularly useful when
parsing live web data or stream processing.

5. Handling Mixed Content (HTML, CSS, JavaScript)


o Web documents often contain a mix of text and non-text elements like HTML,
CSS, and JavaScript. Content extraction methods (like HTML parsers) help
remove unnecessary tags and focus only on the meaningful text content.

o Hybrid Approaches: One approach is to use a combination of semantic
parsing and syntactic parsing to first clean up the document (removing HTML
or JavaScript code) and then parse the text. This helps isolate only the textual
content for parsing.

o Tools like BeautifulSoup or lxml can help clean HTML and prepare text for
parsing.

6. Use of Hybrid Parsing Models

o Statistical and Neural Models: A hybrid approach, combining traditional
statistical parsing methods with deep learning models like transformers or
RNNs, can help improve robustness in noisy conditions.

o Rule-based and Data-driven Approaches: Combining rule-based systems that
follow linguistic rules and data-driven models like deep learning helps handle
various types of noise and adapt to new patterns.

o Example: A rule-based parser can be used to handle simple grammatical
structures, while a machine learning model can take care of more complex or
ambiguous structures.
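As an illustration of the content-extraction step for mixed HTML, CSS, and JavaScript content, the sketch below keeps only visible text and skips `<script>` and `<style>` blocks. It uses Python's standard `html.parser` module instead of BeautifulSoup or lxml so that it runs without extra packages; the class name `TextExtractor` and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


page = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>Great article!</p><script>var x = 1;</script><p>I love it.</p>"
    "</body></html>"
)
clean = visible_text(page)
```

The cleaned string `clean` contains only the two paragraph texts and can then be passed to sentence splitting and parsing.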

Example

Let’s take a look at an example from a noisy web text and how robust and scalable parsing
could help:

Noisy Text:
"yesterday i seen a beautiful dog walking down the street!!!"

1. Text Normalization:

o Correcting "i" → "I", "seen" → "saw", and "!!!" → just a period.

o Result: "Yesterday I saw a beautiful dog walking down the street."

2. Dependency Parsing:

o Parsing the corrected text would identify the grammatical structure:

▪ "Yesterday" → temporal modifier

▪ "I" → subject
▪ "saw" → verb

▪ "dog" → object of "saw"

▪ "walking" → modifier of "dog"

▪ "down the street" → prepositional phrase

3. Disambiguation and Sentence Structure Handling:

o The parser would understand that "walking down the street" modifies "dog"
and not the verb "saw."

4. Scalability:

o If this parsing needed to be done for millions of web pages, using a
distributed parsing model across multiple servers would make the process
scalable. Also, if it’s happening in real-time, incremental parsing would allow
the system to process text as it is retrieved, making it faster.
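The text normalization step of this example can be sketched with regular expressions and a lookup table. The `CORRECTIONS` table is a hand-made assumption for illustration; a word-by-word lookup like this ignores context (for example, "seen" is correct in "I have seen"), so a real system would use a spelling or grammar correction model instead.

```python
import re

# Hypothetical lookup table; a real system would use a trained corrector.
CORRECTIONS = {"i": "I", "seen": "saw", "dont": "don't", "luv": "love", "u": "you"}


def normalize(text):
    # Collapse runs of two or more sentence-ending marks into a single period.
    text = re.sub(r"[!?.]{2,}", ".", text)
    # Replace each word that has an entry in the lookup table.
    text = re.sub(
        r"[A-Za-z']+",
        lambda m: CORRECTIONS.get(m.group(0).lower(), m.group(0)),
        text,
    )
    # Capitalize the first character of the sentence.
    return text[:1].upper() + text[1:]


noisy = "yesterday i seen a beautiful dog walking down the street!!!"
cleaned = normalize(noisy)
```

With this table, `cleaned` is the normalized sentence from step 1, ready for dependency parsing.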

Conclusion

Parsing noisy web documents requires techniques that make parsing models robust (able to
handle text with errors, ambiguity, and informal language) and scalable (able to efficiently
handle large amounts of data). By using techniques such as preprocessing, contextual
embeddings, efficient algorithms, and distributed computing, we can build parsers that are
both accurate and capable of handling the vast, unstructured data found on the web.

Hybrid of Rule-Based and Probabilistic Parsing in NLP

Parsing is the process of analyzing the structure of a sentence to understand its meaning.
There are two main approaches to parsing: rule-based parsing and probabilistic parsing. A
hybrid approach combines both to take advantage of their strengths while minimizing their
weaknesses.

Rule-Based Parsing

Rule-based parsing follows a set of predefined grammatical rules to understand sentences.


These rules define sentence structures using formal grammar, such as Context-Free
Grammar (CFG).

Example of Rule-Based Parsing

Consider the sentence: The cat sits on the mat.

Using a simple grammar:


• A sentence consists of a noun phrase and a verb phrase.

• A noun phrase consists of a determiner and a noun.

• A verb phrase consists of a verb and a prepositional phrase.

• A prepositional phrase consists of a preposition and a noun phrase.

By following these rules, the parser can structure the sentence correctly.
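The four rules above can be written down as a small context-free grammar and checked mechanically. The sketch below uses the CYK chart algorithm, which is one standard choice and is an assumption here, since the notes do not name an algorithm. The dictionaries `BINARY` and `LEXICON` are illustrative encodings of the rules and of the words in the example sentence.

```python
from itertools import product

# Grammar rules from the text, written as (left child, right child) -> parent.
BINARY = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "PP"): "VP",
    ("P", "NP"): "PP",
}
# Word categories for the example sentence (hypothetical mini-lexicon).
LEXICON = {"the": "Det", "cat": "N", "mat": "N", "sits": "V", "on": "P"}


def cyk(words):
    """Return True if the words form a sentence (S) under the grammar."""
    n = len(words)
    # table[i][j] holds the categories that can span words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        if word in LEXICON:
            table[i][i + 1].add(LEXICON[word])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left, right in product(table[i][k], table[k][j]):
                    parent = BINARY.get((left, right))
                    if parent:
                        table[i][j].add(parent)
    return "S" in table[0][n]


is_sentence = cyk("the cat sits on the mat".split())
```

The sentence is accepted because "the cat" forms a noun phrase, "on the mat" a prepositional phrase, and "sits on the mat" a verb phrase; a scrambled input such as "the sits cat" is rejected.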

Advantages of Rule-Based Parsing

• It provides clear and understandable sentence structures.

• It works well for formal and structured text.

Limitations of Rule-Based Parsing

• It cannot handle variations in sentence structure.

• It requires a lot of effort to create and maintain rules.

• It struggles with informal or incorrect grammar.

Probabilistic Parsing

Probabilistic parsing uses statistical methods to choose the most likely structure of a
sentence based on previous examples. It assigns probabilities to different sentence
structures and selects the one with the highest probability.

Example of Probabilistic Parsing

For the sentence The cat sits on the mat, the parser will look at previous data and
determine the most common way this structure appears. If it finds that "The cat sits" is more
common than "The sits cat," it will choose the correct arrangement.
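One common way to assign probabilities to structures is a probabilistic context-free grammar, where the probability of a parse tree is the product of the probabilities of the rules it uses. The sketch below is an illustration under stated assumptions: the rule probabilities are invented, word probabilities are ignored, and an ambiguous sentence ("the girl saw the man with the telescope") is used instead of the example above so that there are two trees to compare.

```python
from math import prod

# Hypothetical rule probabilities, chosen for illustration only.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.7,
    ("NP", ("NP", "PP")): 0.3,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("VP", "PP")): 0.4,
    ("PP", ("P", "NP")): 1.0,
}


def tree_probability(tree):
    """Multiply the probabilities of all rules used in the tree."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return 1.0  # word level: lexical probabilities are ignored in this sketch
    rhs = tuple(child[0] for child in children)
    return RULE_PROB[(label, rhs)] * prod(tree_probability(c) for c in children)


def noun_phrase(det, noun):
    return ("NP", ("Det", det), ("N", noun))


subject = noun_phrase("the", "girl")
pp = ("PP", ("P", "with"), noun_phrase("the", "telescope"))

# Reading 1: "with the telescope" attaches to the verb phrase.
verb_attach = ("S", subject, ("VP", ("VP", ("V", "saw"), noun_phrase("the", "man")), pp))
# Reading 2: "with the telescope" attaches to the noun phrase "the man".
noun_attach = ("S", subject, ("VP", ("V", "saw"), ("NP", noun_phrase("the", "man"), pp)))

best = max([verb_attach, noun_attach], key=tree_probability)
```

With these invented numbers the verb-attachment tree scores higher, so the parser would select it; different probabilities estimated from real data could reverse the choice.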

Advantages of Probabilistic Parsing

• It can handle ambiguous sentences.

• It improves as more data is available.

• It works well with real-world language, including informal text.

Limitations of Probabilistic Parsing

• It requires a large amount of training data.

• It may not always follow grammatical rules strictly.

• It is less explainable compared to rule-based parsing.


Hybrid Parsing: Combining Rule-Based and Probabilistic Approaches

A hybrid parser integrates both rule-based and probabilistic methods. The rule-based
component ensures grammatical correctness, while the probabilistic component resolves
ambiguity and improves accuracy.

How Hybrid Parsing Works

1. The rule-based part applies grammar rules to identify the basic structure of a
sentence.

2. The probabilistic part analyzes multiple possible structures and selects the most
likely one based on real-world usage.

Example of Hybrid Parsing

Consider the sentence The old man the boat.


This sentence can be confusing because "man" can be a noun or a verb.

• The rule-based parser recognizes "man" as both a noun and a verb.

• The probabilistic model checks real-world usage and determines that "man" is more
likely to be a verb in this context.

Advantages of Hybrid Parsing

• It improves accuracy by combining structure and flexibility.

• It can handle both well-structured and informal language.

• It is more robust when dealing with noisy or incomplete text.

• It scales better than pure rule-based systems.

Applications of Hybrid Parsing

• Machine Translation – Helps generate grammatically correct translations.

• Speech Recognition – Ensures meaningful sentences from spoken words.

• Chatbots and Virtual Assistants – Helps understand human input better.

• Question-Answering Systems – Improves responses by analyzing sentence meaning.

Summary
Rule-based parsing is strict and follows predefined grammar rules, while probabilistic parsing
learns from data and handles ambiguity. A hybrid approach combines both methods to
ensure grammatical correctness while improving flexibility and accuracy. This makes hybrid
parsing suitable for real-world applications like speech recognition, chatbots, and machine
translation.

Scope Ambiguity and Attachment Ambiguity Resolution in NLP

Natural Language Processing (NLP) deals with various types of ambiguities that arise due to
different interpretations of sentences. Scope ambiguity and attachment ambiguity are two
common issues that can make sentence meaning unclear. Resolving these ambiguities is
crucial for accurate language understanding in NLP applications.

1. Scope Ambiguity

Scope ambiguity occurs when there is uncertainty about which part of a sentence a word or
phrase applies to. This usually happens with quantifiers, negations, and modals.

Example of Scope Ambiguity

• "Every student did not pass the exam."

This sentence has two possible interpretations:

1. Narrow scope of negation ("every" applies over "not"): None of the students passed the exam.

2. Wide scope of negation ("not" applies over "every"): Not every student passed; some may have passed, but not all.

Resolution of Scope Ambiguity

NLP systems resolve scope ambiguity using:

• Syntactic analysis (parsing) to understand sentence structure.

• Semantic interpretation to determine logical meaning.

• Contextual information to find the most likely interpretation.

• Probabilistic models trained on real-world data to select the common meaning.

Example of Resolution

In a chatbot, if a user says "Every student did not pass," the system can check:

• Exam results data: If some students passed, it selects the "not all" meaning.

• Common usage patterns: If "every student" with negation often means "none," it
selects that meaning.
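The exam-results check described above can be sketched as follows. The two readings are written as logical tests over a table of results; the function names and the sample data are assumptions made for illustration.

```python
def readings(results):
    """results maps a student name to True if that student passed."""
    return {
        # For every student x: x did not pass.
        "none passed": all(not passed for passed in results.values()),
        # It is not the case that every student passed.
        "not all passed": not all(results.values()),
    }


def choose_reading(results):
    """Pick the reading that is supported by the exam data."""
    truth = readings(results)
    if truth["none passed"]:
        return "none passed"  # both readings hold; report the stronger one
    if truth["not all passed"]:
        return "not all passed"
    return "neither reading holds"


selected = choose_reading({"Asha": True, "Ravi": False, "Meera": True})
```

Because some students in the sample data passed, only the "not all passed" reading is consistent with the data, so that reading is selected.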
2. Attachment Ambiguity

Attachment ambiguity happens when it is unclear which word or phrase a modifying
element (like a prepositional phrase or relative clause) is connected to in a sentence.

Example of Attachment Ambiguity

• "She saw the man with a telescope."

There are two possible interpretations:

1. Low attachment (the phrase modifies the noun "man"): She saw a man who was holding a telescope.

2. High attachment (the phrase modifies the verb "saw"): She used a telescope to see the man.

Resolution of Attachment Ambiguity

To resolve attachment ambiguity, NLP systems use:

• Syntactic parsing to check grammatical rules.

• Semantic understanding to see which meaning is more logical.

• Statistical models to predict common usage patterns.

• World knowledge to infer real-world likelihood.

Example of Resolution

If a chatbot processes "She saw the man with a telescope," it can use:

• Context clues: If the topic is astronomy, "using a telescope" is more likely.

• Common word relationships: "Saw" often implies using an instrument.

• Dependency parsing: It checks if "with a telescope" modifies "saw" or "man."
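A statistical model for this decision can be as simple as comparing how often the prepositional phrase occurs with the verb versus with the noun. The sketch below uses invented counts in `COUNTS` (an assumption for illustration; real counts would come from a parsed corpus) and attaches the phrase to whichever head it has been seen with more often.

```python
# Hypothetical co-occurrence counts: (head word, preposition, object) -> count.
COUNTS = {
    ("saw", "with", "telescope"): 40,
    ("man", "with", "telescope"): 10,
    ("ate", "with", "anchovies"): 2,
    ("pizza", "with", "anchovies"): 30,
}


def attach(verb, noun, prep, obj):
    """Return 'verb' or 'noun' depending on which attachment is more frequent."""
    verb_score = COUNTS.get((verb, prep, obj), 0)
    noun_score = COUNTS.get((noun, prep, obj), 0)
    return "verb" if verb_score >= noun_score else "noun"


telescope_choice = attach("saw", "man", "with", "telescope")
anchovies_choice = attach("ate", "pizza", "with", "anchovies")
```

With these counts, "with a telescope" is attached to the verb "saw", while "with anchovies" in "ate the pizza with anchovies" is attached to the noun "pizza".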

Comparison of Scope and Attachment Ambiguity

Type of Ambiguity | Definition | Example | Possible Meanings | Resolution Methods
--------------------------------------------------------------------------------
Scope Ambiguity | Uncertainty about how far a quantifier or negation applies. | "Every student did not pass." | 1. None of the students passed. 2. Some students passed, but not all. | Syntactic parsing, semantic analysis, contextual clues, probabilistic models.
Attachment Ambiguity | Uncertainty about which part of a sentence a modifier belongs to. | "She saw the man with a telescope." | 1. The man has a telescope. 2. She used a telescope to see the man. | Dependency parsing, semantic analysis, statistical models, world knowledge.

Conclusion

Scope and attachment ambiguities create confusion in understanding sentences. NLP
systems use parsing, semantic analysis, statistical models, and contextual knowledge to
resolve these ambiguities. This is essential for applications like machine translation, speech
recognition, and chatbots, where correct interpretation of sentences is crucial.
[Unit 4] Semantic Analysis:
Lexical Knowledge Networks, Wordnet Theory; Indian Language Word nets and Multilingual
Dictionaries; Semantic Roles; Word Sense Disambiguation; WSD and Multilinguality;
Metaphors; Coreferences.

Introduction to Semantic Analysis

Semantic Analysis is a subfield of Natural Language Processing (NLP) that attempts to
understand the meaning of Natural Language. Understanding Natural Language might seem
a straightforward process to us as humans. However, due to the vast complexity and
subjectivity involved in human language, interpreting it is quite a complicated task for
machines. Semantic Analysis of Natural Language captures the meaning of the given text
while taking into account context, logical structuring of sentences and grammar roles.

Parts of Semantic Analysis

Semantic Analysis of Natural Language can be classified into two broad parts:

1. Lexical Semantic Analysis: Lexical Semantic Analysis involves understanding the meaning
of each word of the text individually. It basically refers to fetching the dictionary meaning
that a word in the text is deputed to carry.

2. Compositional Semantics Analysis: Although knowing the meaning of each word of the
text is essential, it is not sufficient to completely understand the meaning of the text.

For example, consider the following two sentences:

• Sentence 1: Students love GeeksforGeeks.

• Sentence 2: GeeksforGeeks loves Students.

Although both these sentences 1 and 2 use the same set of root words {student, love,
geeksforgeeks}, they convey entirely different meanings.

Hence, under Compositional Semantics Analysis, we try to understand how combinations of
individual words form the meaning of the text.

Tasks involved in Semantic Analysis

In order to understand the meaning of a sentence, the following are the major processes
involved in Semantic Analysis:

1. Word Sense Disambiguation

2. Relationship Extraction

Word Sense Disambiguation:


In Natural Language, the meaning of a word may vary as per its usage in sentences and the
context of the text. Word Sense Disambiguation involves interpreting the meaning of a word
based upon the context of its occurrence in a text.

For example, the word ‘Bark’ may mean ‘the sound made by a dog’ or ‘the outermost layer
of a tree.’

Likewise, the word ‘rock’ may mean ‘a stone‘ or ‘a genre of music‘ – hence, the accurate
meaning of the word is highly dependent upon its context and usage in the text.

Thus, the ability of a machine to overcome the ambiguity involved in identifying the
meaning of a word based on its usage and context is called Word Sense Disambiguation.
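One classic way to perform Word Sense Disambiguation is the simplified Lesk algorithm: choose the sense whose dictionary definition shares the most words with the sentence. The sketch below is a toy version under stated assumptions: the `SENSES` table contains only the two meanings of "bark" given above, the sense labels and the stop-word list are invented, and real systems would take definitions from a resource such as WordNet.

```python
# Toy sense inventory using the two definitions of "bark" from the text.
SENSES = {
    "bark": {
        "dog_sound": "the sound made by a dog",
        "tree_layer": "the outermost layer of a tree",
    },
}
STOPWORDS = {"the", "a", "of", "by", "at", "on", "is", "was", "to"}


def content_words(text):
    """Lowercase the words, strip punctuation, and drop stop words."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS


def lesk(word, sentence):
    """Return the sense whose definition overlaps most with the sentence."""
    context = content_words(sentence)

    def overlap(sense):
        return len(context & content_words(SENSES[word][sense]))

    return max(SENSES[word], key=overlap)


sense_1 = lesk("bark", "The dog began to bark at the stranger")
sense_2 = lesk("bark", "The bark of the old tree was rough")
```

In the first sentence the word "dog" overlaps with the first definition, and in the second sentence "tree" overlaps with the second definition, so the two occurrences of "bark" receive different senses.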

Relationship Extraction:

Another important task involved in Semantic Analysis is Relationship Extraction. It involves
first identifying the various entities present in the sentence and then extracting the
relationships between those entities.

For example, consider the following sentence:

Semantic Analysis is a topic of NLP which is explained on the GeeksforGeeks blog.

The entities involved in this text, along with their relationships, are shown below.

[Figure: Entities and Relationships]

Elements of Semantic Analysis

Some of the critical elements of Semantic Analysis that must be scrutinized and taken into
account while processing Natural Language are:

• Hyponymy: Hyponymy refers to a relationship in which a term is an instance of a more
generic term. It can be understood by taking class-object as an analogy. For example: ‘Color’ is a
hypernym while ‘grey’, ‘blue’, ‘red’, etc., are its hyponyms.

• Homonymy: Homonymy refers to two or more lexical terms with the same spelling
but completely distinct meanings. For example: ‘Rose’ might mean ‘the past form of
rise’ or ‘a flower’. The spelling is the same but the meanings differ; hence, ‘rose’ is a
homonym.

• Synonymy: When two or more lexical terms that might be spelt differently have the
same or similar meaning, they are called synonyms. For example: (Job, Occupation),
(Large, Big), (Stop, Halt).

• Antonymy: Antonymy refers to a pair of lexical terms that have contrasting meanings;
they are symmetric about a semantic axis. For example: (Day, Night), (Hot, Cold),
(Large, Small).

• Polysemy: Polysemy refers to lexical terms that have the same spelling but multiple
closely related meanings. It differs from homonymy because the meanings of the
terms need not be closely related in the case of homonymy. For example: ‘man’ may
mean ‘the human species’ or ‘a male human’ or ‘an adult male human’. Since all
these different meanings bear a close association, the lexical term ‘man’ is
polysemous.

• Meronymy: Meronymy refers to a relationship wherein one lexical term is
a constituent of some larger entity. For example: ‘Wheel’ is a meronym of
‘Automobile’.

Meaning Representation

While, as humans, it is pretty simple for us to understand the meaning of textual
information, it is not so in the case of machines. Thus, machines tend to represent the text
in specific formats in order to interpret its meaning. This formal structure that is used to
understand the meaning of a text is called meaning representation.

Basic Units of Semantic System:

In order to accomplish Meaning Representation in Semantic Analysis, it is vital to understand
the building units of such representations. The basic units of semantic systems are explained
below:

1. Entity: An entity refers to a specific unit or individual, such as a person or
a location. For example: GeeksforGeeks, Delhi, etc.

2. Concept: A Concept may be understood as a generalization of entities. It refers to a
broad class of individual units. For example: Learning Portals, City, Students.

3. Relations: Relations help establish relationships between various entities and
concepts. For example: ‘GeeksforGeeks is a Learning Portal’, ‘Delhi is a City’, etc.

4. Predicate: Predicates represent the verb structures of the sentences.

In Meaning Representation, we employ these basic units to represent textual information.


Approaches to Meaning Representations:

Now that we are familiar with the basic understanding of Meaning Representations, here are
some of the most popular approaches to meaning representation:

1. First-order predicate logic (FOPL)

2. Semantic Nets

3. Frames

4. Conceptual dependency (CD)

5. Rule-based architecture

6. Case Grammar

7. Conceptual Graphs

Semantic Analysis Techniques

Based upon the end goal one is trying to accomplish, Semantic Analysis can be used in
various ways. Two of the most common Semantic Analysis techniques are:

Text Classification

In Text Classification, our aim is to label the text according to the insights we intend to gain
from the textual data.

For example:

• In Sentiment Analysis, we try to label the text with the prominent emotion they
convey. It is highly beneficial when analyzing customer reviews for improvement.

• In Topic Classification, we try to categorize our text into some predefined categories.
For example: Identifying whether a research paper is of Physics, Chemistry or Maths.

• In Intent Classification, we try to determine the intent behind a text message. For
example: Identifying whether an e-mail received at customer care service is a query,
complaint or request.

Text Extraction

In Text Extraction, we aim at obtaining specific information from our text.

For Example,

• In Keyword Extraction, we try to obtain the essential words that define the entire
document.

• In Entity Extraction, we try to obtain all the entities involved in a document.


Significance of Semantic Analysis

Semantic Analysis is a crucial part of Natural Language Processing (NLP). In the
ever-expanding era of textual information, it is important for organizations to draw insights from
such data to fuel businesses. Semantic Analysis helps machines interpret the meaning of
texts and extract useful information, thus providing invaluable data while reducing manual
effort.

Besides, Semantic Analysis is also widely employed to facilitate the processes of automated
answering systems such as chatbots, which answer user queries without any human
intervention.

Lexical Analysis

Lexical Analysis is the first step in compiler design and Natural Language Processing (NLP). It
is the process of converting an input (such as source code or text) into meaningful units
called tokens. These tokens are then used for further processing, such as syntax analysis.

What is Lexical Analysis?

Lexical Analysis breaks down a sequence of characters into smaller units called tokens. A
program called a lexical analyzer (lexer) or scanner performs this task.

Example

Consider this input in a programming language:

int age = 25;

The lexer will convert it into tokens:

• "int" → Keyword

• "age" → Identifier

• "=" → Assignment Operator

• "25" → Number

• ";" → Separator

These tokens are passed to the syntax analyzer for further processing.

Steps in Lexical Analysis

1. Removing White Spaces and Comments


o The lexer removes spaces, tabs, and comments.

o Example:

int x = 10; // This is a comment

o The comment is ignored before processing.

2. Reading Characters and Forming Lexemes

o A lexeme is a sequence of characters that form a token.

o Example:

y = x + 20;

o The lexemes are y, =, x, +, 20, ;.

3. Tokenization

o Each lexeme is classified into a token.

o Example:

Lexeme: "int" → Token: <KEYWORD, int>

Lexeme: "x" → Token: <IDENTIFIER, x>

4. Symbol Table Maintenance

o The lexer stores identifiers in a symbol table.

o Example:

Identifier | Type | Memory Location

-----------------------------------

x | int | 1001
y | int | 1002

5. Error Handling

o Detects invalid characters.

o Example:

int 9x = 5; // Error: Identifiers cannot start with a number.

Lexical Analysis in NLP (Natural Language Processing)

Lexical Analysis helps in understanding text in Natural Language Processing (NLP).

Important Tasks in NLP Lexical Analysis

1. Tokenization

o Splitting a sentence into words.

o Example:

"I love coding." → ["I", "love", "coding", "."]

2. Part-of-Speech (POS) Tagging

o Assigning grammatical categories.

o Example:

"The cat sleeps." → [("The", DET), ("cat", NOUN), ("sleeps", VERB)]

3. Lemmatization and Stemming

o Reducing words to their base form.

o Example:

"running" → "run"

"better" → "good"

4. Named Entity Recognition (NER)

o Identifying names of people, places, and organizations.

o Example:

"Elon Musk is the CEO of Tesla."

→ ["Elon Musk" (Person), "Tesla" (Organization)]
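The tokenization task, the first in the list above, can be sketched with a single regular expression. The pattern is an assumption chosen for simple English text: it keeps words (including contractions such as "don't") together and treats each punctuation mark as a separate token.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:'\w+)?  matches a word, optionally followed by an apostrophe part
    # [^\w\s]       matches any single character that is not a word char or space
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


tokens = tokenize("I love coding.")
```

The other tasks in the list (POS tagging, lemmatization, and NER) usually need trained models or dictionaries and are not covered by this sketch.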

Lexical Analysis in Compiler vs NLP

Feature Compiler NLP

Input Source code Natural text

Output Tokens for syntax analysis Words, POS tags, named entities

Example int a = 5; → (int, a, =, 5, ;) "I love NLP" → (I, love, NLP)

Techniques in Lexical Analysis

1. Regular Expressions (Regex)

o Defines token patterns.

o Example:


Identifier: [a-zA-Z_][a-zA-Z0-9_]*

Number: [0-9]+

2. Finite State Automata (FSA)

o A state machine that recognizes tokens.

o Example:

If input starts with a letter → Identifier state

If input starts with a digit → Number state

3. Lexical Analyzers (Lex, Flex)

o Tools that generate tokenizers automatically.
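The regular-expression technique can be sketched as a small hand-written lexer in Python. This is a rough sketch, not how Lex or Flex work internally: the token names, the three-word keyword list and the BAD_IDENT rule are assumptions made for this example. It reuses the Identifier and Number patterns given above and skips whitespace and comments.

```python
import re

# Token patterns are tried in order. The names and the small keyword list
# are illustrative assumptions, not a complete C/C++ lexer.
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*"),                      # skipped, like whitespace
    ("KEYWORD",    r"\b(?:int|float|return)\b"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("BAD_IDENT",  r"[0-9]+[a-zA-Z_][a-zA-Z0-9_]*"),  # e.g. 9x
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"[=+\-*/]"),
    ("SEMICOLON",  r";"),
    ("SKIP",       r"\s+"),
    ("INVALID",    r"."),                             # any other character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    """Return a list of (token_type, lexeme) pairs for the given source text."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind in ("SKIP", "COMMENT"):
            continue                                  # drop spaces and comments
        if kind in ("BAD_IDENT", "INVALID"):
            raise ValueError(
                f"lexical error: {match.group()!r} at position {match.start()}"
            )
        tokens.append((kind, match.group()))
    return tokens

print(lex("int x = 10; // This is a comment"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='),
#  ('NUMBER', '10'), ('SEMICOLON', ';')]
```

Calling lex("int 9x = 5;") raises a ValueError, which corresponds to the error-handling step described earlier.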

Challenges in Lexical Analysis

1. Handling Ambiguity

o "He will book the ticket."

o "book" (noun) vs. "book" (verb).

2. Unicode and Multi-language Support

o Example: "Hola! ¿Cómo estás?" (Spanish)

o Different tokenization rules for different languages.

3. Complex Word Forms

o Words with multiple meanings and spellings.

Conclusion

Lexical Analysis is an important step in both compiler design and NLP. It converts raw input
into structured tokens that can be used for further processing. In NLP, it helps in
tokenization, POS tagging, and Named Entity Recognition (NER). In compilers, it prepares the
source code for parsing and execution.

Lexical Knowledge Networks

Introduction

Lexical Knowledge Networks (LKNs) are structured databases that store information about
words and their relationships. They help computers understand word meanings, synonyms,
antonyms, hierarchies, and semantic connections. These networks are widely used in
Natural Language Processing (NLP), Machine Translation, Information Retrieval, and AI-
driven chatbots.
What is a Lexical Knowledge Network?

A Lexical Knowledge Network is a graph-based representation of words and their relationships. It consists of:

• Nodes → Represent words or concepts.

• Edges → Represent relationships between words (synonymy, antonymy, hypernymy, etc.).

Example:

Consider the word "dog" in a Lexical Knowledge Network:

        Animal
          |
        Canine
          |
         Dog
          |
        Puppy

• Animal is a hypernym of Dog (a dog is a type of animal).

• Puppy is a hyponym of Dog (a puppy is a young dog).

• Canine is a broader category that includes dogs and wolves, so it sits between Dog and Animal as a hypernym of Dog.

Types of Relations in Lexical Knowledge Networks

1. Synonymy (Synonyms)

o Words with similar meanings.

o Example: "Happy" ↔ "Joyful"

2. Antonymy (Antonyms)

o Words with opposite meanings.

o Example: "Hot" ↔ "Cold"

3. Hypernymy (Supertype Relationship)

o A general term (parent category) for a more specific word.


o Example: "Bird" → "Parrot" (Bird is a hypernym of Parrot)

4. Hyponymy (Subtype Relationship)

o A specific term under a broader category.

o Example: "Parrot" → "Bird" (Parrot is a hyponym of Bird)

5. Meronymy (Part-Whole Relationship)

o A word that represents a part of something.

o Example: "Wheel" → "Car" (Wheel is a part of a Car)

6. Holonymy (Whole-Part Relationship)

o A word that represents the whole of something.

o Example: "Tree" → "Leaf" (Tree is a holonym of Leaf)

7. Troponymy (Manner Relationship)

o A word that describes a specific way of performing an action.

o Example: "Run" → "Sprint" (Sprint is a troponym of Run)
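The relations above can be stored as labelled edges in a small graph. The following is a minimal sketch, assuming a dictionary-of-sets representation; the class name LexicalNetwork, its methods and the relation labels are invented for this illustration and do not come from WordNet or any other library.

```python
from collections import defaultdict

class LexicalNetwork:
    """Toy lexical knowledge network: words are nodes, relations are labelled edges."""

    def __init__(self):
        # edges[relation][word] -> set of words standing in that relation to `word`
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, word, relation, other, inverse=None):
        """Record that `other` is a <relation> of `word` (and optionally the inverse)."""
        self.edges[relation][word].add(other)
        if inverse:
            self.edges[inverse][other].add(word)

    def related(self, word, relation):
        return sorted(self.edges[relation][word])

    def hypernym_chain(self, word):
        """Follow hypernym edges upward, e.g. parrot -> bird -> animal."""
        chain = []
        while self.edges["hypernym"][word]:
            word = sorted(self.edges["hypernym"][word])[0]
            if word in chain:          # guard against accidental cycles
                break
            chain.append(word)
        return chain

net = LexicalNetwork()
net.add("parrot", "hypernym", "bird", inverse="hyponym")
net.add("bird", "hypernym", "animal", inverse="hyponym")
net.add("car", "meronym", "wheel", inverse="holonym")
net.add("happy", "synonym", "joyful", inverse="synonym")
net.add("hot", "antonym", "cold", inverse="antonym")

print(net.hypernym_chain("parrot"))      # ['bird', 'animal']
print(net.related("wheel", "holonym"))   # ['car']
```

Storing each inverse edge explicitly (hypernym/hyponym, meronym/holonym) keeps lookups in both directions simple, at the cost of some duplicated data.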

Popular Lexical Knowledge Networks

1. WordNet

• One of the most widely used lexical databases.

• Groups words into sets of synonyms called synsets.

• Stores relationships like synonymy, antonymy, hypernymy, and hyponymy.

Example in WordNet:


Car (synset: automobile)

- Hypernym: Vehicle

- Hyponym: Sedan, SUV

- Meronym: Wheel, Engine

2. ConceptNet

• A large knowledge graph that connects words based on real-world knowledge.


• Helps in common-sense reasoning for AI systems.

• Example: "Ice cream" is "made of" Milk

3. BabelNet

• A multilingual knowledge network.

• Combines WordNet, Wikipedia, and other lexical resources.

• Used for machine translation and multilingual NLP.

Applications of Lexical Knowledge Networks

1. Word Sense Disambiguation (WSD)

o Helps understand the correct meaning of a word in a sentence.

o Example: "I went to the bank to withdraw money."

▪ "Bank" means financial institution, not "riverbank".

2. Text Summarization

o Helps in shortening long text while retaining key information.

o Uses hypernymy and synonymy to condense content.

3. Question Answering Systems

o AI assistants (like ChatGPT, Siri, Google Assistant) can draw on LKNs to understand questions.

o Example:

▪ Question: "What is a canine?"

▪ Answer: "A dog or wolf." (from WordNet relations).

4. Machine Translation

o Helps in better translation by mapping words to concepts.

o Example:

▪ English: "She is very happy."

▪ Spanish: "Ella está muy contenta."

o A multilingual network such as BabelNet, or WordNets aligned across languages, can help link "happy" to "contenta".

5. Sentiment Analysis
o Understanding the emotions behind text.

o Example: "I love this movie!"

▪ "Love" is linked to positive sentiment in LKN.

Conclusion

Lexical Knowledge Networks are powerful tools for understanding word relationships and
meanings. They improve NLP applications like chatbots, search engines, and machine
translation by providing a structured way to process language.

1. WordNet Theory

What is WordNet?

WordNet is a large lexical database of the English language. It groups words into sets of
synonyms called synsets and provides their meanings, relationships, and usage. It was
developed at Princeton University by George A. Miller and his team.

Key Features of WordNet:

• Words are organized based on meaning rather than alphabetically like in a dictionary.

• It includes synonyms, antonyms, hypernyms, hyponyms, meronyms, holonyms, and other relationships.

• It helps in Natural Language Processing (NLP), Word Sense Disambiguation (WSD), and Machine Translation.

Example:

For the word “car”, WordNet provides:

• Synonyms: automobile, motorcar

• Hypernym: vehicle

• Hyponyms: sedan, SUV, hatchback

• Meronyms (Parts of Car): wheel, engine, seat

Thus, WordNet helps computers understand words in a meaningful way.

2. Indian Language WordNets and Multilingual Dictionaries

Indian Language WordNets


• Indian languages like Hindi, Marathi, Tamil, Bengali, and Telugu have their own
WordNets.

• These were developed under the IndoWordNet project, coordinated by IIT Bombay.

• They follow the same structure as Princeton WordNet but for Indian languages.

• They help in Machine Translation, Speech Processing, and Information Retrieval in regional languages.

Example:
For the Hindi word "गाड़ी" (Gaadi - meaning Car), its WordNet entry includes:

• Synonyms: वाहन (Vaahan), मोटर (Motor)

• Hypernym: परिवहन साधन (Transport Vehicle)

Multilingual Dictionaries

• These are dictionaries that provide word meanings in multiple languages.

• They help in cross-lingual communication and translation.

• Example: Google Translate can serve as a multilingual dictionary when it is used to look up the meaning of a single word across different languages.

3. Semantic Roles

What are Semantic Roles?

• Semantic roles describe the part each participant plays in the action or state that a sentence expresses.

• They help in understanding who is doing what in a sentence.

Common Semantic Roles:

1. Agent: The doer of an action.

o Example: Ram ate the apple. (Ram = Agent)

2. Theme: The entity affected by an action.

o Example: Ram ate the apple. (Apple = Theme)

3. Instrument: The tool used in an action.

o Example: She wrote with a pen. (Pen = Instrument)

4. Experiencer: Someone who experiences something.

o Example: John felt happy. (John = Experiencer)


Importance of Semantic Roles in NLP:

• Helps in Machine Translation and Text Understanding.

• Used in chatbots to analyze user intent.

4. Word Sense Disambiguation (WSD)

What is WSD?

• Words can have multiple meanings, and WSD helps in finding the correct meaning
based on context.

• Example: The word "bank" has two meanings:

1. He went to the bank to deposit money. (Bank = Financial Institution)

2. He sat by the bank of the river. (Bank = Edge of a river)

Methods of WSD:

1. Dictionary-Based Approach: Uses WordNet to check all possible meanings.

2. Supervised Machine Learning: Uses labeled data to train models for context
detection.

3. Unsupervised Learning: Uses clustering techniques to identify different senses


without prior knowledge.

Applications of WSD:

• Machine Translation → Helps in translating words correctly.

• Search Engines → Improves search results by finding the correct word meaning.

5. WSD and Multilinguality

• WSD is challenging in multilingual settings because words in one language may have
multiple translations in another.

• Example: The English word "light" can mean:

o Brightness, the opposite of darkness (प्रकाश (Prakash) in Hindi)

o Opposite of heavy (हल्का (Halka) in Hindi)

• WSD helps in selecting the right meaning based on the sentence context.

Importance in Machine Translation:


• WSD helps Google Translate and other translation tools in choosing the correct
word.

• Example:

o "I saw a bat in the cave."

o Should "bat" mean a flying mammal or a cricket bat? WSD helps in selecting
the right one.

6. Metaphors in NLP

What is a Metaphor?

• A metaphor is a figure of speech where a word is applied to something it does not literally belong to.

• Example: "Time is money." (Time is not literally money, but it is valuable.)


1. WordNet Theory

What is WordNet?

WordNet is a lexical database of the English language that organizes words based on their
meanings and relationships. It was developed at Princeton University by George A. Miller
and his team. Unlike a traditional dictionary that lists words alphabetically, WordNet groups
words into sets of synonyms called synsets and arranges them according to their semantic
relationships.

Structure of WordNet:

WordNet connects words through different types of relationships:

1. Synonymy (Synonyms): Words that have similar meanings.


o Example: Happy and Joyful.

2. Antonymy (Antonyms): Words that have opposite meanings.

o Example: Hot and Cold.

3. Hypernymy (Hypernyms): A more general word.

o Example: Dog is a hypernym of Labrador.

4. Hyponymy (Hyponyms): A more specific word.

o Example: Rose is a hyponym of Flower.

5. Meronymy (Meronyms): Parts of a whole.

o Example: Wheel is a meronym of Car.

6. Holonymy (Holonyms): The whole to which a part belongs.

o Example: Car is a holonym of Wheel.

Example in WordNet:

For the word “car”, WordNet provides the following relationships:

• Synonyms: automobile, motorcar

• Hypernym (General Term): vehicle

• Hyponyms (More Specific Types): sedan, SUV, hatchback

• Meronyms (Parts of Car): wheel, engine, seat

Applications of WordNet in NLP:

• Word Sense Disambiguation (WSD): Helps in identifying the correct meaning of words based on context.

• Information Retrieval: Improves search engine results by understanding word relationships.

• Machine Translation: Helps in translating words correctly by considering their meanings.

2. Indian Language WordNets and Multilingual Dictionaries

What is an Indian Language WordNet?

WordNets are not limited to English; several Indian languages like Hindi, Marathi, Tamil,
Telugu, and Bengali have their own WordNets. The IndoWordNet project, led by IIT
Bombay, has created a structured WordNet for multiple Indian languages based on the
Princeton WordNet model.

Features of Indian Language WordNets:

• Follows synsets like WordNet but for regional languages.

• Maintains semantic relationships such as synonyms, antonyms, and hypernyms.

• Helps in Machine Translation, Speech Processing, and Sentiment Analysis.

Example:

For the Hindi word "गाड़ी" (Gaadi - meaning Car), its WordNet entry includes:

• Synonyms: वाहन (Vaahan), मोटर (Motor)

• Hypernym: परिवहन साधन (Transport Vehicle)

• Meronyms: पहिया (Wheel), इंजन (Engine)

Multilingual Dictionaries:

• A multilingual dictionary provides word meanings in multiple languages.

• Example: Google Translate acts as a multilingual dictionary by translating words between different languages.

• Useful in cross-lingual communication, text translation, and computational linguistics.

3. Semantic Roles

What are Semantic Roles?

Semantic roles describe the function of words in a sentence and explain who is performing
an action and what is being affected. These roles help in understanding the structure of
sentences and aid in NLP tasks such as sentence parsing, machine translation, and text
summarization.

Common Semantic Roles in NLP:

1. Agent: The one who performs the action.

o Example: Ram kicked the ball. (Ram = Agent)

2. Theme: The object affected by the action.

o Example: Ram kicked the ball. (Ball = Theme)

3. Instrument: The tool used for the action.


o Example: She wrote with a pen. (Pen = Instrument)

4. Experiencer: The entity experiencing a state.

o Example: John felt happy. (John = Experiencer)

5. Location: The place where an action happens.

o Example: She lives in Mumbai. (Mumbai = Location)

6. Goal: The destination of an action.

o Example: She went to the market. (Market = Goal)

Applications of Semantic Roles in NLP:

• Chatbots and Virtual Assistants: Understanding user queries.

• Machine Translation: Maintaining correct sentence structure.

• Speech Recognition: Identifying sentence meaning from spoken words.

4. Word Sense Disambiguation (WSD)

What is Word Sense Disambiguation?

• Some words have multiple meanings, and WSD helps in selecting the correct
meaning based on the sentence context.

• Example:

o He went to the bank to deposit money. (Bank = Financial Institution)

o He sat by the bank of the river. (Bank = Edge of a river)

Techniques for WSD:

1. Dictionary-Based Approach: Uses WordNet to find meanings of words and their relationships.

2. Supervised Learning: Machine learning models are trained using labeled data to
predict word meanings.

3. Unsupervised Learning: Uses clustering techniques to group the occurrences of a word by their context, so that each cluster corresponds to one sense.
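The dictionary-based approach can be illustrated with a simplified version of the Lesk idea: choose the sense whose dictionary definition shares the most words with the sentence. This is a minimal sketch under stated assumptions: the two glosses for "bank" are hand-written for this example (they are not copied from WordNet), and the stopword list and function names are also made up.

```python
STOPWORDS = {"a", "an", "the", "to", "of", "by", "he", "she", "in",
             "and", "or", "for", "is", "where"}

# Hand-written glosses for illustration only; not taken from WordNet.
SENSES = {
    "bank": {
        "financial institution": "an institution where people deposit and withdraw money",
        "edge of a river": "the sloping land beside a river or stream",
    }
}

def content_words(text):
    """Lowercase the text, strip simple punctuation and drop stopwords."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS

def simplified_lesk(word, sentence, senses=SENSES):
    """Return the sense whose gloss overlaps most with the sentence context."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "He went to the bank to deposit money."))
# financial institution
print(simplified_lesk("bank", "He sat by the bank of the river."))
# edge of a river
```

When no gloss overlaps with the sentence, this sketch simply returns the first sense listed; a practical system would need a better fallback, such as the most frequent sense.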

Applications of WSD in NLP:

• Search Engines: Helps in retrieving relevant search results.

• Machine Translation: Ensures correct word selection in translations.


• Text Summarization: Extracts key information from large texts.

5. WSD and Multilinguality

Challenges of WSD in Multiple Languages:

• Different languages may have words with multiple meanings.

• Some words may not have an exact translation in another language.

Example:

The English word "light" can mean:

• Brightness, the opposite of darkness (प्रकाश (Prakash) in Hindi)

• Opposite of heavy (हल्का (Halka) in Hindi)

How WSD Helps in Multilingual NLP:

• Ensures accurate translation of words by selecting the correct meaning.

• Used in Google Translate, AI-powered translation systems, and multilingual search engines.

6. Metaphors in NLP

What is a Metaphor?

A metaphor is a figure of speech where a word is applied to something it does not literally
belong to.

• Example: "Time is money." (Time is not money, but it is valuable.)

Challenges of Metaphors in NLP:

• Difficult for AI to understand non-literal meanings.

• Example:

o "He has a heart of stone."

o AI may interpret it as a real stone heart instead of a person being emotionally cold.

Solutions in NLP:

• Machine Learning Models can be trained to detect metaphorical meanings using context analysis.

• Deep Learning Models can analyze sentence structures to identify hidden meanings in text.

[Unit 5] Applications

Sentiment Analysis; Text Entailment; Robust and Scalable Machine Translation; Question
Answering in Multilingual Setting; Cross Lingual Information Retrieval (CLIR).

Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return answers to their questions. Instead, it reports the existence and location of documents that might contain the required information. The documents that satisfy the user’s requirement are called relevant documents. A perfect IR system would retrieve only relevant documents.

With the help of the following diagram, we can understand the process of information
retrieval (IR) −

It is clear from the above diagram that a user who needs information has to formulate a request in the form of a query in natural language. The IR system then responds by retrieving the relevant output, in the form of documents, about the required information.

Classical Problem in Information Retrieval (IR) System

The main goal of IR research is to develop a model for retrieving information from repositories of documents. Here, we discuss a classical problem related to IR systems, the ad-hoc retrieval problem.

In ad-hoc retrieval, the user enters a query in natural language that describes the required information, and the IR system returns the documents related to it. For example, when we search for something on the Internet, the results include some pages that are exactly relevant to our requirement, but there can be some non-relevant pages too. Separating the relevant pages from the non-relevant ones for an arbitrary user query is the ad-hoc retrieval problem.

CLIR and its Motivation

Cross-lingual Information Retrieval is the task of retrieving relevant information when the
document collection is written in a different language from the user query. Figure 1 below
shows a typical architecture of a CLIR system. There are many situations where CLIR
becomes essential because the information is not in the user’s native language.

Figure 1. Typical architecture of a CLIR system [1].

Translation Approaches

CLIR requires the ability to represent and match information in the same representation
space even if the query and the document collection are in different languages. The
fundamental problem in CLIR is to match terms in different languages that describe the
same or a similar meaning. The strategy of mapping between different language
representations is usually machine translation. In CLIR, this translation can be carried out in
several ways.

• Document translation [2] is to map the document representation into the query
representation space, as illustrated in Figure 2.

• Query translation [3] is to map the query representation into the document
representation space, as illustrated in Figure 3.

• Pivot language or Interlingua [4,5] is to map both document and query representations to a third space.

Textual Entailment (Natural Language Inference - NLI)

• Objective: Determine the relationship between a premise (P) and a hypothesis (H)
from three categories:

1. Entailment: P guarantees H.

2. Contradiction: P refutes H.

3. Neutral: P neither confirms nor refutes H.

• Significance: Essential for NLP tasks like question answering (validating answers),
information retrieval (ensuring document relevance), information extraction
(consistency checks), and machine translation evaluation (maintaining semantic
accuracy).

• Textual entailment, often referred to as natural language inference (NLI), is a fundamental task in natural language processing that involves determining the relationship between two pieces of text, a premise and a hypothesis. The task is to decide whether the hypothesis is entailed (can be logically inferred), contradicted, or is neutral with respect to the premise.

Definitions

• Entailment: If the truth of the premise guarantees the truth of the hypothesis.

o Premise: The cat is sleeping.

o Hypothesis: There is a cat.

o Relationship: Entailment

• Contradiction: If the truth of the premise guarantees the hypothesis is false.

o Premise: The cat is sleeping.

o Hypothesis: The cat is playing.

o Relationship: Contradiction

• Neutral: If the truth of the premise neither guarantees the truth nor the falsehood of
the hypothesis.

o Premise: The cat is sleeping.

o Hypothesis: The cat is dreaming.

o Relationship: Neutral

Importance

• Textual entailment plays a crucial role in many NLP applications, including:

1. Question Answering: To verify if the answer obtained from a source truly addresses
the posed question.

2. Information Retrieval: To ensure the retrieved documents are relevant to the search
query.

3. Information Extraction: To verify if the extracted pieces of information are consistent with the source content.

4. Machine Translation Evaluation: To determine if the translated content retains the meaning of the original.

1. Sentiment Analysis

What is Sentiment Analysis?

Sentiment Analysis is the computational process of determining the emotional tone behind
a body of text. It’s used to understand opinions, attitudes, and emotions expressed by
people in written form. The goal of sentiment analysis is to classify the sentiment of the text
as positive, negative, or neutral.

How Sentiment Analysis Works:

Sentiment analysis typically involves two main stages:

1. Text Preprocessing:

o This involves tokenizing the text and removing irrelevant parts such as stopwords and punctuation.

2. Feature Extraction:

o Important words or phrases that could indicate sentiment are identified. These signals can come from:

▪ Lexicons: Lists of words associated with positive or negative sentiment (e.g., happy = positive, sad = negative).

▪ Machine Learning Models: Algorithms are trained to recognize sentiment based on features in the text.

Methods in Sentiment Analysis:

1. Lexicon-Based Approach:

o Uses dictionaries of words with pre-assigned sentiment scores.

o Example: Words like "love", "good", "fantastic" are classified as positive; words like "hate", "bad", "horrible" are negative.

2. Machine Learning-Based Approach:

o Involves training a classifier (like a Naive Bayes classifier or an SVM) on labeled text data to classify new texts.

o Example: A model could be trained on movie reviews where each review is labeled as positive or negative, and the model will then predict the sentiment of new reviews.

3. Deep Learning-Based Approach:

o Uses neural networks to automatically learn sentiment-related features from the text. Techniques like Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks are common.
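The lexicon-based approach is the simplest of the three to sketch in code. The example below is only an illustration, assuming two tiny hand-made word lists built from the example words in these notes; real sentiment lexicons are far larger and usually carry numeric scores. The function name sentiment is hypothetical.

```python
import re

# Tiny illustrative lexicons; real systems use much larger scored word lists.
POSITIVE = {"love", "good", "fantastic", "happy"}
NEGATIVE = {"hate", "bad", "horrible", "sad"}

def sentiment(text):
    """Classify text as positive, negative or neutral by counting lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())      # simple preprocessing
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this movie!"))              # positive
print(sentiment("This was a horrible, bad film."))  # negative
```

This sketch ignores negation ("not good" still counts as positive), which is one reason machine learning and deep learning approaches usually perform better on real text.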

Applications of Sentiment Analysis:

• Social Media Monitoring: Companies monitor tweets, posts, and comments to gauge
public sentiment about their products or services.
• Product Reviews: Companies use sentiment analysis to classify customer reviews as
positive or negative.

• Political Campaigns: Analyzing public opinion about political figures or events.

2. Text Entailment

What is Text Entailment?

Text entailment is the task of determining whether a hypothesis is true based on a given
premise. In other words, it involves checking if the information in one text logically follows
from the information in another text.

Types of Text Entailment:

1. Entailment: The premise supports the hypothesis.

o Example:

▪ Premise: "John is a student at the university."

▪ Hypothesis: "John is enrolled in a university."

▪ Entailment: The hypothesis is supported by the premise.

2. Contradiction: The premise contradicts the hypothesis.

o Example:

▪ Premise: "John is a student at the university."

▪ Hypothesis: "John is a teacher at a university."

▪ Contradiction: The hypothesis contradicts the premise.

3. Neutral: The premise does not provide enough information to either support or
contradict the hypothesis.

o Example:

▪ Premise: "John is a student."

▪ Hypothesis: "John is studying engineering."

▪ Neutral: The premise doesn’t provide enough detail to decide whether the hypothesis is true or false.

Applications of Text Entailment:

• Question Answering Systems: Helps in determining whether the answer to a question is entailed by a given passage.

• Text Summarization: Ensures that the summary logically follows from the original text.

• Information Extraction: Helps in extracting relationships between entities in a document.

3. Robust and Scalable Machine Translation

What is Machine Translation (MT)?

Machine Translation is the automatic process of translating text from one language to
another using computers. The goal of MT is to produce accurate and fluent translations
across different languages.

Challenges in Machine Translation:

1. Robustness:

o A translation system should work well even in the presence of noisy data,
such as slang, grammatical errors, or incomplete sentences.

o Example: Translating informal social media posts or texts with typos can be
challenging.

2. Scalability:

o As the amount of data grows, MT systems need to handle large-scale processing efficiently without compromising on accuracy.

o Example: Google Translate supports hundreds of languages, and the system must scale effectively to handle billions of users and documents.

Approaches in Machine Translation:

1. Rule-Based Machine Translation (RBMT):

o Uses a set of linguistic rules and dictionaries to translate between languages.

o Example: English to Spanish: Translating “I am going” to “Yo estoy yendo.”

2. Statistical Machine Translation (SMT):

o Uses statistical models trained on large bilingual corpora to learn translation patterns.

o Example: “I am going” might be translated based on how frequently similar sentences appear in parallel corpora.

3. Neural Machine Translation (NMT):


o Uses neural networks (specifically sequence-to-sequence models) to model the entire translation process. NMT generally produces more accurate and fluent translations of complex sentences than earlier approaches when enough training data is available.

o Example: Google Translate now uses NMT to produce more contextually appropriate translations.

Applications of MT:

• Document Translation: Translating documents automatically without human intervention.

• Cross-lingual Search: Searching in one language and retrieving documents in another.

• Localization of Content: Translating websites and applications to different languages for global audiences.

4. Question Answering in Multilingual Setting

What is Question Answering (QA)?

QA refers to the process where a system automatically provides answers to questions posed
by humans in natural language. In a multilingual setting, the system is required to
understand and answer questions in various languages.

Challenges in Multilingual QA:

1. Language Barriers:

o QA systems need to process questions and documents in different languages and provide answers accordingly.

o Example: The same question might have different forms or nuances in Spanish and English.

2. Cross-lingual Transfer:

o The system must transfer knowledge from one language to another to answer
questions accurately.

o Example: If a document is in English, but the question is in French, the system needs to extract information from the English document and generate a French answer.

Approaches to Multilingual QA:

1. Language Identification:
o Identifying the language of the question and documents.

2. Cross-lingual Information Retrieval:

o Retrieve relevant documents in multiple languages and then translate them or use language-independent methods to answer the question.
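A very coarse first step towards language identification is to detect the writing script of the question. The sketch below assumes that the script can be read from the first word of each character's Unicode name, using Python's standard unicodedata module; the function name dominant_script is made up for this example. Script detection alone cannot separate languages that share a script (English and Spanish are both Latin), so real systems typically add statistical models over characters or words.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most common script among the letters of `text`."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "LATIN SMALL LETTER A"
            # or "DEVANAGARI LETTER GA"; unnamed characters are skipped.
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

print(dominant_script("She is very happy."))      # LATIN
print(dominant_script("Ella está muy contenta.")) # LATIN
print(dominant_script("गाड़ी"))                    # DEVANAGARI
```

In a multilingual QA pipeline, a result such as DEVANAGARI could be used to route the question to components for Hindi or Marathi, while LATIN text would still need a further language-level decision.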

Applications:

• Customer Support: Multilingual QA systems can be used to answer customer queries in different languages.

• Search Engines: Help in answering questions posed in any language, based on the
relevant web data.

• Virtual Assistants: AI assistants like Siri or Alexa that answer multilingual questions.

5. Cross Lingual Information Retrieval (CLIR)

What is Cross-Lingual Information Retrieval (CLIR)?

CLIR is the task of retrieving information written in one language using a query expressed in a different language. The main challenge is to match the query with information in the target language despite the language difference.

Challenges in CLIR:

1. Translation:

o Queries in one language need to be translated to the language of the target documents.

o Example: A query in English “Best places to visit in India” needs to be translated to Hindi to retrieve relevant documents.

2. Cross-lingual Search Strategies:

o Some CLIR systems use bilingual dictionaries or parallel corpora to map queries to documents in a different language.

o Example: A German query about “buying a phone” may need to find documents in English or French that discuss phone purchases.

Techniques in CLIR:

1. Query Translation:

o Translating the user’s query into multiple languages and searching the
documents in those languages.
2. Document Translation:

o Translating the documents into the query’s language and then performing the
search.

3. Pivot Language Approach:

o Using a third language (e.g., English) as an intermediary to translate between two languages.

Applications of CLIR:

• Global Search Engines: Search engines such as Google can retrieve documents in multiple languages based on a single query.

• Multilingual Digital Libraries: Accessing books and articles in different languages.

• Cross-language Legal or Medical Research: Searching for research articles in other languages to support cross-border studies.

1. Information Retrieval in NLP

What is Information Retrieval (IR)?

Information Retrieval (IR) refers to the process of obtaining relevant documents from a large
collection of text based on a user's query. In the context of Natural Language Processing
(NLP), IR is primarily concerned with finding documents, images, or data that satisfy the
information needs specified by users through natural language queries.

How Information Retrieval Works:

1. Text Preprocessing:

o The system first preprocesses the documents and the query to remove
stopwords, punctuation, and other irrelevant elements. It might also involve
stemming (reducing words to their root forms).

o Example: "Running" becomes "Run".

2. Indexing:
o An index is created, which is essentially a mapping between the terms in the
documents and their locations in the document corpus. This helps in fast
retrieval when a query is issued.

o Example: If you search for "data science," the system needs to find
documents containing both "data" and "science."

3. Query Processing:

o When a user submits a query, the system processes it, converts it into a form
suitable for searching, and matches it against the indexed documents.

o Example: A query like "What is Natural Language Processing?" will be processed and matched with documents containing relevant terms like "NLP," "artificial intelligence," etc.

4. Ranking:

o The system ranks the documents based on relevance to the query. This can be
done using various algorithms such as TF-IDF (Term Frequency-Inverse
Document Frequency), BM25, etc.

o Example: Documents mentioning the exact phrase "Natural Language Processing" would rank higher than documents with a related but vague description.

5. Document Retrieval:

o Finally, the most relevant documents are presented to the user based on the
ranking algorithm.
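The five steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, assuming a plain TF-IDF score (term frequency multiplied by log(N / document frequency)) rather than BM25; the three documents, the stopword list and the function names are all invented for this example.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "the", "is", "of", "and", "what", "in", "to"}

def preprocess(text):
    """Step 1: lowercase, tokenize, and remove stopwords and punctuation."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def build_index(docs):
    """Step 2: inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return index

def rank(query, index, n_docs):
    """Steps 3-5: process the query, score documents by TF-IDF, return ranked ids."""
    scores = defaultdict(float)
    for term in preprocess(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))   # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))

# Invented mini-corpus for illustration.
docs = {
    "d1": "Natural language processing is a field of artificial intelligence.",
    "d2": "Data science uses statistics and programming.",
    "d3": "Processing natural language data requires tokenization.",
}
index = build_index(docs)
print(rank("data science", index, len(docs)))   # ['d2', 'd3']
```

Document d2 ranks first because it contains both query terms, and "science" is rarer in the collection than "data". Stemming is left out of this sketch to keep it short.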

Applications of Information Retrieval in NLP:

1. Search Engines:

o Google, Bing, and other search engines use IR techniques to match web pages
with user queries.

2. Text Mining:

o Extracting useful information from large corpora, like analyzing customer feedback to identify common issues.

2. Cross-Lingual Information Retrieval (CLIR)

What is Cross-Lingual Information Retrieval?


Cross-Lingual Information Retrieval (CLIR) refers to the process of retrieving documents
written in different languages based on a query in one language. It aims to bridge the gap
between language barriers by allowing users to search in one language and retrieve
documents written in another language.

Challenges in CLIR:

1. Language Mismatch:

o The main challenge in CLIR is dealing with languages that have different
structures, syntax, and vocabulary. For example, a query in English might
need to be translated into Hindi or Spanish to retrieve relevant documents.

2. Translation Issues:

o Accurate translations of the query are crucial, and misinterpretations can lead
to irrelevant results.

3. Ambiguity in Translations:

o Certain words or phrases may have multiple meanings. For example, the word
"bank" in English can refer to a financial institution or the side of a river, and
translating it correctly is crucial.

How CLIR Works:

1. Query Translation:

o The user’s query is translated into the target language using a machine
translation (MT) system. This allows the system to retrieve documents in that
language.

o Example: If the query is in English, it is translated into French to retrieve
documents written in French.

2. Document Retrieval:

o The system searches for documents in the target language using standard IR
techniques (e.g., TF-IDF, BM25).

3. Document Ranking:

o The retrieved documents are ranked based on relevance to the translated
query.

4. Cross-Language Query Expansion (CLQE):

o This involves expanding the query by including synonyms or related terms in
the target language to improve the retrieval process.
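
The query translation, retrieval, ranking, and expansion steps above can be sketched as follows. This is a toy illustration only: the English-to-French dictionary EN_FR, the three French documents, and the function names are hypothetical, and a real system would use a machine translation model and TF-IDF or BM25 scoring rather than a lookup table and raw term overlap.

```python
import re

# Hypothetical English -> French dictionary. Listing several candidate
# translations per word doubles as a simple form of query expansion.
EN_FR = {
    "bank": ["banque", "rive"],  # financial institution vs. side of a river
    "loan": ["prêt", "emprunt"],
    "river": ["rivière", "fleuve"],
}

# Toy target-language collection, made up for illustration.
FRENCH_DOCS = {
    "f1": "La banque accorde un prêt à ses clients.",
    "f2": "Nous avons marché le long de la rive du fleuve.",
    "f3": "Le musée ouvre à neuf heures.",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens (accents preserved)."""
    return re.findall(r"\w+", text.lower())

def translate_query(query):
    """Query translation + expansion: map each word to all candidate translations."""
    terms = set()
    for word in tokenize(query):
        terms.update(EN_FR.get(word, [word]))  # keep unknown words unchanged
    return terms

def retrieve(query):
    """Retrieve and rank target-language documents by number of matching terms."""
    terms = translate_query(query)
    scored = []
    for doc_id, text in FRENCH_DOCS.items():
        overlap = len(terms & set(tokenize(text)))
        if overlap:
            scored.append((doc_id, overlap))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

print(retrieve("bank loan"))
```

For the query "bank loan", document f1 ranks first, but f2 is also returned because "bank" was expanded to "rive". This is the translation-ambiguity problem described under Challenges in CLIR.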

Applications of CLIR:

• Multilingual Search Engines:

o Google allows users to search in one language and retrieves documents in
multiple languages.

• Cross-Border Research:

o Researchers can search through documents and research papers written in
different languages.

3. Architecture of an Information Retrieval System

Overview of the IR System Architecture:

An Information Retrieval (IR) system typically has a layered architecture that allows efficient
document retrieval. The architecture includes several components that work together to
retrieve documents based on user queries.

Components of the IR System:

1. Document Collection:

o This is the corpus or the database of documents that are indexed for retrieval.
These documents can be anything from web pages to scientific papers.

2. Preprocessing Module:

o This module is responsible for cleaning and preparing the data: tokenizing
the text into smaller units (like words), removing stopwords, and stemming.

3. Indexer:

o This component creates an index for all the words in the document collection,
mapping them to their locations. The index helps in fast search retrieval.

4. Query Processor:

o This component handles the processing of user queries. It interprets the
user’s query, converts it into a suitable search format, and matches it against
the indexed documents.

5. Ranking and Retrieval:

o The ranking module assigns a score to the documents based on relevance to
the query. Common ranking algorithms include TF-IDF and BM25.

6. User Interface (UI):


o This is the front-end where users submit their queries and view the retrieved
documents. It’s the interaction point for the user.
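
As a rough sketch of how these components fit together, the class below wires a document collection, a preprocessing function, an indexer, a query processor, and a BM25 ranker into one object. The class name MiniIR, the stopword list, the sample documents, and the parameter values k1=1.5 and b=0.75 are assumptions made for illustration. Stemming is left out to keep the example short, and the IDF uses the common "+1 inside the log" variant.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Preprocessing module: tokenize, lowercase, drop stopwords (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

class MiniIR:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        # Document collection, passed through the preprocessing module.
        self.tokens = {d: preprocess(t) for d, t in docs.items()}
        self.avg_len = sum(len(t) for t in self.tokens.values()) / len(self.tokens)
        # Indexer: term -> {doc_id: term frequency}.
        self.index = {}
        for d, toks in self.tokens.items():
            for term, tf in Counter(toks).items():
                self.index.setdefault(term, {})[d] = tf

    def _idf(self, term):
        n = len(self.tokens)
        df = len(self.index.get(term, {}))
        return math.log((n - df + 0.5) / (df + 0.5) + 1)

    def search(self, query, top_k=3):
        """Query processor + ranking: score matching documents with BM25."""
        scores = Counter()
        for term in preprocess(query):
            for d, tf in self.index.get(term, {}).items():
                # Longer-than-average documents are penalized via length_norm.
                length_norm = 1 - self.b + self.b * len(self.tokens[d]) / self.avg_len
                scores[d] += self._idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return scores.most_common(top_k)

# Toy collection, made up for illustration.
DOCS = {
    "d1": "Search engines rank web pages.",
    "d2": "The indexer maps words to documents.",
    "d3": "Ranking algorithms rank documents by relevance to the query.",
}

ir = MiniIR(DOCS)
results = ir.search("rank documents")  # stands in for the user interface
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Only d3 contains both query terms, so it receives the highest score. The other two documents each match one term and are separated by BM25's length normalization.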

Diagram of an IR System:

+------------------+
|    Document      |         Query
|    Collection    |<------> User
+------------------+          |
         |                    v
+------------------+     +------------------+
|  Preprocessing   |     | Query Processor  |
| (Text Cleaning,  |     | (Parsing,        |
|  Tokenizing)     |     |  Stemming,       |
+------------------+     |  Query Expansion)|
         |               +------------------+
         v                        |
+------------------+     +------------------+
|     Indexer      |<--->|    Ranking &     |
| (Index Creation) |     |    Retrieval     |
+------------------+     +------------------+
         |                        |
         v                        v
+------------------+     +------------------+
|      User        |     |    Retrieved     |
|    Interface     |<--->|    Documents     |
+------------------+     +------------------+

4. Textual Entailment in NLP

What is Textual Entailment?

Textual Entailment (TE) is the task of determining whether a hypothesis logically follows
from a given premise. In NLP, this helps in understanding whether one piece of text implies
another. It’s crucial for applications like question answering (QA), where understanding the
relationship between sentences is key.

How Textual Entailment Works:

1. Premise:

o This is the statement that contains the information from which we infer the
hypothesis.

o Example: "John is a doctor."

2. Hypothesis:

o The hypothesis is the claim or statement we want to test: is it entailed by
the premise?

o Example: "John works in the medical field."

3. Entailment Decision:

o We check if the premise supports the hypothesis. If the premise logically
leads to the hypothesis, the result is entailment.

o In the above example, the premise “John is a doctor” entails “John works in
the medical field.”

Application in Question Answering (QA):

1. Scenario:

o A user asks, "Does John work in the medical field?" The system needs to
check whether the hypothesis "John works in the medical field" is entailed by
a provided document or text.

2. Steps:

o The system first checks if the premise, e.g., "John is a doctor," supports the
hypothesis, "John works in the medical field."

o If the premise supports the hypothesis, the system would respond
affirmatively.

3. Example:

o Premise: "Alice is a professional photographer."

o Hypothesis: "Alice takes photographs professionally."

o Entailment Decision: This is a clear example of entailment because the
premise confirms the hypothesis.
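
To make the entailment decision concrete, here is a deliberately weak lexical-overlap baseline. It is a sketch under stated assumptions, not how real entailment systems work: the function names, the stopword list, the 6-letter prefix used as a crude stand-in for stemming, and the 0.7 threshold are all choices made for this example. It predicts entailment when most of the hypothesis's content words also appear in the premise.

```python
import re

STOPWORDS = {"a", "an", "the", "is", "in", "of"}

def stems(text, prefix_len=6):
    """Crude normalization: lowercase, drop stopwords, keep each word's first letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:prefix_len] for w in words if w not in STOPWORDS}

def entails(premise, hypothesis, threshold=0.7):
    """Predict entailment if enough hypothesis words are covered by the premise."""
    hyp = stems(hypothesis)
    if not hyp:
        return False
    coverage = len(hyp & stems(premise)) / len(hyp)
    return coverage >= threshold

print(entails("Alice is a professional photographer.",
              "Alice takes photographs professionally."))
print(entails("Alice is a professional photographer.",
              "Alice is a doctor."))
print(entails("John is a doctor.",
              "John works in the medical field."))
```

The baseline accepts the Alice example and rejects "Alice is a doctor," but it also rejects the John example, because linking "doctor" to "the medical field" requires world knowledge that word overlap cannot supply. That gap is why practical systems rely on trained models rather than surface matching.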

5. Sentiment Analysis

What is Sentiment Analysis?

Sentiment Analysis is the process of detecting and analyzing the sentiment expressed in text,
particularly whether it is positive, negative, or neutral. It is widely used in social media,
product reviews, and market analysis.

Working of Sentiment Analysis:

1. Data Collection:

o Gather text data such as social media posts, product reviews, etc.

2. Text Preprocessing:

o Clean the text data by removing stopwords, special characters, and irrelevant
words.

3. Sentiment Detection:

o Classify the sentiment using methods like Lexicon-based (using pre-defined
sentiment dictionaries) or Machine Learning-based (training models to
classify sentiment).
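
The lexicon-based approach above can be sketched as below. The word lists POSITIVE, NEGATIVE, and NEGATIONS are tiny hand-made examples rather than a real sentiment dictionary, and the one-word negation rule is a simplifying assumption.

```python
import re

# Tiny illustrative lexicons; real systems use much larger dictionaries.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}
NEGATIONS = {"not", "never", "no"}

def sentiment(text):
    """Classify text as positive, negative, or neutral by summing word polarities."""
    tokens = re.findall(r"[a-z']+", text.lower())  # preprocessing
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1, or 0
        # Flip polarity when the previous word is a negation ("not good").
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in [
    "I love this phone, the camera is great!",
    "The delivery was slow and the support was terrible.",
    "The battery is not good.",
    "It arrived on Tuesday.",
]:
    print(review, "->", sentiment(review))
```

A machine learning-based classifier would replace the hand-written lexicons with weights learned from labeled reviews, at the cost of needing training data.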

Applications of Sentiment Analysis:

• Social Media Monitoring: Monitoring brand sentiment across Twitter or Facebook.

• Customer Feedback Analysis: Analyzing reviews to gauge customer satisfaction.

• Market Research: Understanding consumer opinions on various products or services.

6. Machine Translation

What is Machine Translation?

Machine Translation (MT) is the task of automatically converting text from one language to
another using computational methods. It has revolutionized cross-lingual communication.

Types of Machine Translation:


1. Rule-Based Translation:

o Relies on linguistic rules for grammar and syntax translation.

2. Statistical Machine Translation:

o Uses statistical models to identify likely translations based on large corpora of
text.

3. Neural Machine Translation (NMT):

o Uses deep learning techniques to translate text in a more fluid and
context-aware manner.
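
For the statistical approach, one widely taught formulation is the noisy-channel model, shown here as an illustration rather than as the only formulation. The symbols are assumptions introduced for this example: f is the source sentence, e is a candidate target sentence, P(f | e) is the translation model, and P(e) is the language model estimated from target-language text.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The second equality follows from Bayes' rule, because P(f) is the same for every candidate e and can be dropped from the maximization.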

Applications of Machine Translation:

• Google Translate: Translating documents or web pages.

• Cross-border Communication: Helping people understand content in foreign
languages.
