
NLP Unit 2 Imp

1. What do you mean by Morphology?

Morphology is the branch of linguistics that deals with the internal structure and formation of words. It studies how words are built from smaller meaningful units called morphemes. A morpheme is the smallest grammatical unit in a language that carries meaning. Morphemes can either be free (they can stand alone as words) or bound (they must be attached to other morphemes to convey meaning). Morphology explains how words change form to indicate tense, number, gender, case, and so on. For example, in English, the word "unhappiness" is made up of the root "happy", the prefix "un-" meaning "not", and the suffix "-ness", which turns the word into a noun. Thus, morphology helps us understand how new words are formed and how existing words are modified according to grammatical rules.
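As a rough illustration, this kind of decomposition can be sketched in a few lines of Python. The affix lists and the spelling-repair rule below are simplified assumptions chosen for this one example, not a real morphological analyzer:

```python
# Minimal sketch of morpheme decomposition by affix stripping.
# The prefix/suffix lists and the spelling repair are illustrative
# assumptions, not a complete analyzer.

PREFIXES = {"un", "re", "dis"}
SUFFIXES = {"ness", "ment", "ful"}

def decompose(word):
    """Split a word into (prefix, root, suffix) where possible."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    root = rest[: len(rest) - len(suffix)] if suffix else rest
    # Crude spelling repair: "happi" + "-ness" -> root "happy".
    if suffix and root.endswith("i"):
        root = root[:-1] + "y"
    return prefix, root, suffix

print(decompose("unhappiness"))  # ('un', 'happy', 'ness')
```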

2. What are issues related to Morphology of Indian Languages?

Indian languages are morphologically rich and diverse, which poses several challenges in natural language processing. One of the major issues is the high degree of inflection. Words in Indian languages undergo numerous changes to express tense, gender, number, mood, etc., leading to a large number of word forms for a single root word. This makes morphological analysis difficult. Another issue is the agglutinative nature of some Indian languages like Tamil and Telugu, where multiple suffixes are added to a root word to express complex meanings, forming very long words. Furthermore, Indian languages often have dialectal variations and spelling inconsistencies, which affect the accuracy of morphological tools. Additionally, there is a lack of standard digital linguistic resources and annotated corpora for many Indian languages, making it harder to build effective computational models for these languages.

3. What is the relationship between Morphology and Finite State Automata?

A Finite State Automaton (FSA) is a mathematical model used to recognize patterns within input data, and it plays an important role in computational morphology. Morphology is concerned with the structure and formation of words, and FSAs can be used to model these structures. A morphological analyzer can use an FSA to check whether a word form is valid, moving through different states based on the morphemes it processes. For example, a finite state machine can recognize word forms by defining transitions between states for roots and affixes. This approach is efficient for handling regular morphological rules and is widely used in language processing tasks such as spelling checkers, stemmers, and lemmatizers. Therefore, FSAs provide a systematic way to model the morphological rules of a language computationally.
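To make this concrete, here is a minimal sketch of such a recognizer in Python, simulating the states START → ROOT → ACCEPT over a toy fragment of English verb morphology; the roots and suffixes are invented for illustration:

```python
# Toy finite state recognizer: root + optional suffix ("-s", "-ed", "-ing").
# The lexicon and transitions are illustrative assumptions.

ROOTS = {"walk", "talk", "jump"}
SUFFIXES = {"", "s", "ed", "ing"}  # "" = bare form

def accepts(word):
    """Simulate states: START -(root)-> ROOT -(suffix)-> ACCEPT."""
    for root in ROOTS:                    # transition out of START
        if word.startswith(root):
            remainder = word[len(root):]  # now in the ROOT state
            if remainder in SUFFIXES:     # transition to ACCEPT
                return True
    return False

for w in ["walked", "talking", "jumps", "walkeding"]:
    print(w, accepts(w))
# walked True, talking True, jumps True, walkeding False
```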

4. Discuss the difference between Word Segmentation and Tokenization.

Word segmentation and tokenization are two fundamental processes in natural language processing, but they serve slightly different purposes. Word segmentation refers to the process of identifying word boundaries in a continuous string of text. This is particularly important in languages like Chinese, Japanese, and Thai, where spaces are not used to separate words. In such languages, word segmentation involves determining where one word ends and another begins. On the other hand, tokenization is a broader process that involves dividing text into smaller units called tokens. These tokens can be words, punctuation marks, or other meaningful elements. While tokenization is straightforward in languages like English, where words are typically separated by spaces, it can be more complex when dealing with compound words or punctuation. The main difference is that word segmentation focuses on identifying words in languages with no spaces, whereas tokenization applies to many languages and involves breaking text into analyzable units.
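The difference is easy to see in code. Below is a small sketch: a regex-based tokenizer for space-delimited English, and a greedy longest-match segmenter for text without spaces. Both the regex pattern and the toy lexicon are illustrative assumptions:

```python
import re

# Tokenization of space-delimited text: keep words and punctuation
# as separate tokens (an illustrative regex pattern).
text = "I love pizza, don't you?"
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(tokens)  # ['I', 'love', 'pizza', ',', "don't", 'you', '?']

# Word segmentation for text without spaces: greedy longest match
# against a dictionary (a toy lexicon assumed here).
LEXICON = {"the", "there", "is", "a", "cat"}

def segment(s):
    words, i = [], 0
    while i < len(s):
        # Try the longest dictionary word starting at position i.
        for j in range(len(s), i, -1):
            if s[i:j] in LEXICON:
                words.append(s[i:j])
                i = j
                break
        else:                   # no dictionary word found
            words.append(s[i])  # fall back to a single character
            i += 1
    return words

print(segment("thereisacat"))  # ['there', 'is', 'a', 'cat']
```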

5. Describe Part-of-Speech Tagging with suitable example.

Part-of-Speech (POS) tagging is the process of assigning a grammatical category, such as noun, verb, adjective, etc., to each word in a sentence. This is a crucial step in many natural language processing applications, as it helps the computer understand the structure and meaning of sentences. POS tagging is not always straightforward because the same word can have different tags depending on its context. For example, the word "book" can be a noun in the sentence "I read a book" and a verb in "I will book a ticket." POS tagging systems use linguistic rules or statistical models to determine the most likely tag for a word based on its context in the sentence. This tagging enables further processing such as parsing, machine translation, and question-answering systems.
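For instance, an off-the-shelf tagger such as NLTK's can show this context sensitivity in practice. This sketch assumes the nltk package is installed and its tokenizer and tagger resources have been downloaded (e.g. via nltk.download()):

```python
import nltk

# Tag the same word "book" in two different contexts.
for sentence in ["I read a book", "I will book a ticket"]:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    print(tagged)

# Expected: "book" is tagged as a noun (NN) in the first sentence
# and as a verb (VB) in the second, reflecting its context.
```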


6. Discuss Maximum Entropy Model of Part-of-Speech Tagging.

The Maximum Entropy (MaxEnt) model is a statistical method used for part-of-speech tagging and other natural language processing tasks. It is based on the principle of maximum entropy, which states that among all possible probability distributions, the one with the highest entropy (i.e., the most uniform) should be chosen, provided it satisfies the known constraints. In the context of POS tagging, the MaxEnt model considers multiple features of the context in which a word appears, such as the words before and after it, word prefixes and suffixes, and other lexical clues. It then uses these features to predict the most probable tag for a given word. The advantage of the MaxEnt model is its ability to combine many types of information in a flexible and powerful way. It does not assume independence between features, making it suitable for complex natural language tasks.
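A sketch of how a trained MaxEnt model turns feature weights into tag probabilities is shown below. The features and weights are invented for illustration; real models learn thousands of weights from annotated data:

```python
import math

# Binary features fire on (context, tag) pairs; each has a learned
# weight. These example weights are assumptions, not trained values.
weights = {
    ("prev_word=will", "VB"): 1.8,
    ("prev_word=will", "NN"): -0.6,
    ("suffix=ook",     "NN"): 0.4,
    ("suffix=ook",     "VB"): 0.1,
}

def tag_probabilities(active_features, tags=("NN", "VB")):
    """P(tag | context) = exp(sum of active weights) / Z (a softmax)."""
    scores = {t: math.exp(sum(weights.get((f, t), 0.0)
                              for f in active_features))
              for t in tags}
    z = sum(scores.values())  # normalization constant Z
    return {t: s / z for t, s in scores.items()}

# Context: the word "book" in "I will book a ticket".
print(tag_probabilities(["prev_word=will", "suffix=ook"]))
# VB receives most of the probability mass because of prev_word=will.
```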

7. What is smoothing in the context of N-gram models and why is it necessary? Provide an example to show how smoothing works.

In the context of N-gram language models, smoothing is a technique used to handle the problem of zero probability for unseen word sequences. N-gram models predict the likelihood of a word based on the previous words, but if a particular sequence of words has not been seen in the training data, it is assigned a zero probability. This can severely affect the performance of the model, especially in applications like speech recognition or machine translation. Smoothing solves this problem by assigning a small non-zero probability to unseen word combinations. One simple method is add-one (Laplace) smoothing, where one is added to the count of every possible N-gram and the denominator is increased by the vocabulary size so that the probabilities still sum to one. For example, if the bigram "love pizza" did not occur in the training data, its probability would be zero; after applying add-one smoothing, it gets a small positive probability, allowing the model to handle new combinations more effectively. Smoothing makes the model more robust and realistic in practical applications.
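Here is a minimal worked example in Python, using a toy corpus chosen so that the bigram "love pizza" is unseen in training:

```python
# Add-one (Laplace) smoothing for a bigram model:
#   P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
# where V is the vocabulary size. The tiny corpus is illustrative.
from collections import Counter

corpus = "i love cats i love dogs we eat pizza".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size: 7 distinct words

def p_mle(w1, w2):
    """Unsmoothed maximum-likelihood estimate."""
    return bigrams[(w1, w2)] / unigrams[w1]

def p_laplace(w1, w2):
    """Add-one smoothed estimate: never zero for known words."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_mle("love", "pizza"))      # 0.0 -- "love pizza" never seen
print(p_laplace("love", "pizza"))  # (0 + 1) / (2 + 7) = 0.111...
print(p_laplace("love", "cats"))   # (1 + 1) / (2 + 7) = 0.222...
```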

8. Give an introductory note on Corpora. Write significance of Corpora analysis.

A corpus (plural: corpora) is a large and structured collection of texts used for linguistic research and computational language processing. Corpora can include books, newspapers, websites, social media content, and spoken dialogues, and they are often annotated with additional linguistic information such as part-of-speech tags or syntactic structures. Corpus analysis is the study of language using these collections, allowing researchers to investigate language patterns, frequency of word usage, grammar rules, and much more. The significance of corpora in natural language processing lies in their role as training data for machine learning models. By analyzing large corpora, computers can learn how language is actually used in real life, rather than relying only on fixed rules. This helps in building applications like machine translation, voice assistants, grammar checkers, and search engines. Corpora also support language learning, dictionary creation, and linguistic theory development.
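As a small taste of corpus analysis, the sketch below counts word frequencies over a toy two-sentence "corpus" (the texts are invented; real corpora run to millions of words):

```python
# Word-frequency analysis over a tiny illustrative corpus.
from collections import Counter
import re

corpus = [
    "Language models learn from corpora.",
    "Corpora are large collections of real text.",
]

words = []
for document in corpus:
    # Lowercase and extract alphabetic word tokens.
    words.extend(re.findall(r"[a-z]+", document.lower()))

freq = Counter(words)
print(freq.most_common(3))
# e.g. [('corpora', 2), ('language', 1), ('models', 1)]
```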
