NLP Unit 2 Imp
1. What do you mean by Morphology?
Morphology is the branch of linguistics that deals with the internal structure and formation of words. It studies how words
are built from smaller meaningful units called morphemes. A morpheme is the smallest meaning-bearing unit in a language. Morphemes can be either free (able to stand alone as words) or bound (they must be
attached to other morphemes to convey meaning). Morphology explains how words change form to indicate things like
tense, number, gender, case, and so on. For example, in English, the word "unhappiness" is made up of the root
"happy", the prefix "un-" meaning "not", and the suffix "-ness" which turns the word into a noun. Thus, morphology helps
us understand how new words are formed and how existing words are modified according to grammatical rules.
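To make this concrete, here is a minimal Python sketch of affix stripping; the small prefix/suffix lists, the helper name split_morphemes, and the crude spelling-repair rule are illustrative assumptions, not a real morphological analyzer.

```python
# Toy affix stripping: peel off one known prefix and one known suffix.
# The affix lists and the "i" -> "y" repair rule are illustrative only.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ment", "ly"]

def split_morphemes(word):
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    stem = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if stem.endswith(s)), "")
    if suffix:
        stem = stem[:-len(suffix)]
    if stem.endswith("i"):          # crude spelling repair: "happi" -> "happy"
        stem = stem[:-1] + "y"
    return prefix, stem, suffix

print(split_morphemes("unhappiness"))   # ('un', 'happy', 'ness')
```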
2. What are issues related to Morphology of Indian Languages?
Indian languages are morphologically rich and diverse, which poses several challenges in natural language processing.
One of the major issues is the high degree of inflection. Words in Indian languages undergo numerous changes to
express tense, gender, number, mood, etc., leading to a large number of word forms for a single root word. This makes
morphological analysis difficult. Another issue is the agglutinative nature of some Indian languages like Tamil and
Telugu, where multiple suffixes are added to a root word to express complex meanings, forming very long words.
Furthermore, Indian languages often have dialectal variations and spelling inconsistencies, which affect the accuracy of
morphological tools. Additionally, there is a lack of standard digital linguistic resources and annotated corpora for many
Indian languages, making it harder to build effective computational models for these languages.
3. What is the relationship between Morphology and Finite State Automata?
A Finite State Automaton (FSA) is a mathematical model used to recognize patterns in input data, and it plays an important role in computational morphology. Morphology is concerned with the structure and formation of words, and an FSA can be used to model these structures. A morphological analyzer can use an FSA to decide whether a word form is valid
by moving through different states based on the morphemes it processes. For example, a finite state machine can
recognize word forms by defining transitions between states for roots and affixes. This approach is efficient in handling
regular morphological rules and is widely used in language processing tasks such as spell checking, stemming, and morphological analysis. Therefore, FSAs provide a systematic way to model the morphological rules of a language
computationally.
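As a small illustration, the sketch below encodes a toy automaton over morphemes; the lexicon, state names, and transitions are made-up assumptions meant only to show how moving between states recognizes valid word forms.

```python
# Toy FSA over morphemes: an optional "un", then a root, then an optional "ness".
TRANSITIONS = {
    ("start", "un"): "prefixed",
    ("start", "happy"): "root",
    ("start", "kind"): "root",
    ("prefixed", "happy"): "root",
    ("prefixed", "kind"): "root",
    ("root", "ness"): "noun",
}
ACCEPTING = {"root", "noun"}

def accepts(morphemes):
    state = "start"
    for m in morphemes:
        state = TRANSITIONS.get((state, m))   # follow the transition, if any
        if state is None:
            return False                      # no valid transition: reject
    return state in ACCEPTING

print(accepts(["un", "happy", "ness"]))   # True  -> "unhappiness" is well formed
print(accepts(["ness", "happy"]))         # False -> not a valid word form
```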
4. Discuss the difference between Word Segmentation and Tokenization.
Word segmentation and tokenization are two fundamental processes in natural language processing, but they serve
slightly different purposes. Word segmentation refers to the process of identifying word boundaries in a continuous string
of text. This is particularly important in languages like Chinese, Japanese, and Thai, where spaces are not used to
separate words. In such languages, word segmentation involves determining where one word ends and another begins.
On the other hand, tokenization is a broader process that involves dividing text into smaller units called tokens. These
tokens can be words, punctuation marks, or other meaningful elements. While tokenization is straightforward in
languages like English, where words are typically separated by spaces, it can be more complex when dealing with
compound words or punctuation. The main difference is that word segmentation focuses on identifying words in
languages with no spaces, whereas tokenization applies to many languages and involves breaking text into analyzable
units.
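The sketch below contrasts the two processes: a simple regex tokenizer for space-delimited English, and greedy longest-match (maximum matching) segmentation for unspaced text; the tiny dictionary and the unspaced example string are illustrative assumptions.

```python
import re

# Tokenization: split space-delimited text into word and punctuation tokens.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I can't wait!"))   # ['I', 'can', "'", 't', 'wait', '!']

# Word segmentation: greedy longest match against a (tiny, illustrative)
# dictionary, as needed for text written without spaces between words.
DICT = {"i", "like", "ilike", "new", "york", "newyork"}

def segment(text, max_len=7):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in DICT:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])   # unknown character: emit it as-is
            i += 1
    return words

print(segment("ilikenewyork"))   # ['ilike', 'newyork'] (greedy longest match)
```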
5. Describe Part-of-Speech Tagging with suitable example.
Part-of-Speech (POS) tagging is the process of assigning a grammatical category, such as noun, verb, adjective, etc., to
each word in a sentence. This is a crucial step in many natural language processing applications, as it helps the
computer understand the structure and meaning of sentences. POS tagging is not always straightforward because the
same word can have different tags depending on its context. For example, the word "book" can be a noun in the
sentence "I read a book" and a verb in "I will book a ticket." POS tagging systems use linguistic rules or statistical
models to determine the most likely tag for a word based on its context in the sentence. This tagging allows further
processing such as parsing, machine translation, and question-answering systems.
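As a quick illustration, the snippet below runs NLTK's off-the-shelf tagger on the two "book" sentences; it assumes the required tokenizer and tagger resources have already been fetched with nltk.download().

```python
import nltk
# Assumes the tokenizer ('punkt') and tagger ('averaged_perceptron_tagger')
# resources have been downloaded beforehand via nltk.download().

for sentence in ["I read a book", "I will book a ticket"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# "book" should come out as a noun (NN) in the first sentence
# and as a verb (VB) in the second, based on its context.
```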
6. Discuss Maximum Entropy Model of Part-of-Speech Tagging.
The Maximum Entropy (MaxEnt) model is a statistical method used for part-of-speech tagging and other natural
language processing tasks. It is based on the principle of maximum entropy, which states that among all possible
probability distributions, the one with the highest entropy (i.e., the most uniform) should be chosen, provided it satisfies
the known constraints. In the context of POS tagging, the MaxEnt model considers multiple features of the context in
which a word appears, such as the words before and after it, word prefixes and suffixes, and other lexical clues. It then
uses these features to predict the most probable tag for a given word. The advantage of the MaxEnt model is its ability
to combine many types of information in a flexible and powerful way. It does not assume independence between
features, making it suitable for complex natural language tasks.
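A minimal sketch of this idea is shown below, using scikit-learn's logistic regression (multinomial logistic regression is equivalent to a MaxEnt classifier) over simple contextual features; the toy training sentences, feature set, and tag names are illustrative assumptions, not a trained tagger.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(words, i):
    # Contextual features for the word at position i: the word itself,
    # the previous word, and a two-character suffix.
    return {
        "word": words[i].lower(),
        "prev": words[i - 1].lower() if i > 0 else "<s>",
        "suffix2": words[i][-2:].lower(),
    }

# Tiny illustrative training set (word sequences with their tags).
train = [
    (["I", "read", "a", "book"],           ["PRON", "VERB", "DET", "NOUN"]),
    (["I", "will", "book", "a", "ticket"], ["PRON", "AUX", "VERB", "DET", "NOUN"]),
]
X = [features(words, i) for words, tags in train for i in range(len(words))]
y = [tag for _, tags in train for tag in tags]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = ["I", "will", "book", "it"]
print(clf.predict(vec.transform([features(test, 2)])))   # likely ['VERB']
```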
7. What is smoothing in the context of N-gram models and why is it necessary? Provide example to show how
smoothing works.
In the context of N-gram language models, smoothing is a technique used to handle the problem of zero probability for
unseen word sequences. N-gram models predict the likelihood of a word based on the previous words, but if a particular
sequence of words has not been seen in the training data, it is assigned a zero probability. This can severely affect the
performance of the model, especially in applications like speech recognition or machine translation. Smoothing solves
this problem by assigning a small non-zero probability to unseen word combinations. One simple method is add-one (Laplace) smoothing, where one is added to the count of every possible N-gram and the vocabulary size is added to the denominator so that the probabilities still sum to one. For example, if the bigram "love pizza" did not occur in the training data, its probability would be zero; after applying add-one smoothing, it receives a small positive probability, allowing the model to handle new combinations more effectively. Smoothing makes the model more
robust and realistic in practical applications.
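The small numeric sketch below applies the add-one (Laplace) formula P(w | prev) = (count(prev, w) + 1) / (count(prev) + V), where V is the vocabulary size; the toy corpus is an illustrative assumption.

```python
from collections import Counter

corpus = "i love cake . i love tea . we hate pizza .".split()

unigrams = Counter(corpus)                   # counts of single words
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
V = len(unigrams)                            # vocabulary size (8 here)

def p_add_one(prev, word):
    # Add-one smoothed bigram probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_add_one("love", "cake"))    # (1 + 1) / (2 + 8) = 0.2  (seen bigram)
print(p_add_one("love", "pizza"))   # (0 + 1) / (2 + 8) = 0.1  (unseen bigram)
```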
8. Give an introductory note on Corpora. Write significance of Corpora analysis.
A corpus (plural: corpora) is a large and structured collection of texts that are used for linguistic research and
computational language processing. Corpora can include books, newspapers, websites, social media content, and
spoken dialogues, and they are often annotated with additional linguistic information such as part-of-speech tags or
syntactic structures. Corpus analysis is the study of language using these collections, allowing researchers to
investigate language patterns, frequency of word usage, grammar rules, and much more. The significance of corpora in natural language processing lies in their role as training data for machine learning models. By analyzing large corpora,
computers can learn how language is actually used in real life, rather than relying only on fixed rules. This helps in
building applications like machine translation, voice assistants, grammar checkers, and search engines. Corpora also
support language learning, dictionary creation, and linguistic theory development.
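As a very small illustration of corpus analysis, the sketch below counts word frequencies over three made-up "documents"; a real study would use a large corpus such as news text or web pages.

```python
from collections import Counter

# Three toy documents standing in for a real corpus.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

tokens = [word for doc in corpus for word in doc.lower().split()]
freq = Counter(tokens)

print(freq.most_common(2))   # [('the', 4), ('cat', 2)]
```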