
Regulation – 2022 BAI601-Natural Language Processing

Module 1

Syllabus:
Introduction: What is Natural Language Processing? Origins of NLP, Language and Knowledge,
the Challenges of NLP, Language and Grammar, Processing Indian Languages, NLP
Applications. Language Modeling: Statistical Language Model - N-gram model (unigram,
bigram), Paninian Framework, Karaka theory.

INTRODUCTION
1.1 WHAT IS NATURAL LANGUAGE PROCESSING (NLP)?
Language is the primary means of communication used by humans. It is the tool we use
to express the greater part of our ideas and emotions. It shapes thought, has a structure, and
carries meaning. Learning new concepts and expressing ideas through them is so natural that
we hardly realize how we process natural language. But there must be some kind of representation of the content of language in our mind. When we want to express a thought, this content helps us represent language in real time. As children, we never learn a
computational model of language, yet this is the first step in the automatic processing of
languages. Natural language processing (NLP) is concerned with the development of
computational models of aspects of human language processing.
There are two main reasons for such development:
1. To develop automated tools for language processing
2. To gain a better understanding of human communication
Building computational models with human language-processing abilities requires a
knowledge of how humans acquire, store, and process language. It also requires knowledge of
the world and of language.
Historically, there have been two major approaches to NLP: the rationalist approach
and the empiricist approach. Early NLP research took a rationalist approach, which assumes
the existence of some language faculty in the human brain. Supporters of this approach argue
that it is not possible for children to learn a complex thing like natural language from limited
sensory inputs. Empiricists do not believe in the existence of a language faculty. Instead, they believe in the existence of some general organizational principles such as pattern recognition,
generalization, and association. Learning of detailed structures can, therefore, take place
through the application of these principles on sensory inputs available to the child.


1.2 ORIGINS OF NLP


Natural language processing, sometimes mistakenly termed natural language understanding, originated from machine translation research. While natural language understanding involves only the interpretation of language, natural language processing includes both understanding (interpretation) and generation (production). NLP also includes speech processing. However, in this book, we are concerned with text processing only, covering work in the area of computational linguistics and the tasks in which NLP has found useful application.
Computational linguistics is similar to theoretical linguistics and psycholinguistics, but uses
different tools. Theoretical linguists mainly provide structural description of natural language
and its semantics. They are not concerned with the actual processing of sentences or
generation of sentences from structural description. They are in a quest for principles that
remain common across languages and identify rules that capture linguistic generalization. For
example, most languages have constructs like noun and verb phrases. Theoretical linguists
identify rules that describe and restrict the structure of languages (grammar). Psycholinguists
explain how humans produce and comprehend natural language. Unlike theoretical linguists,
they are interested in the representation of linguistic structures as well as in the process by
which these structures are produced. They rely primarily on empirical investigations to back
up their theories.
Computational linguistics is concerned with the study of language using computational models of linguistic phenomena. It deals with the application of linguistic theories and computational techniques to NLP. In computational linguistics, representing a language is a hard problem: most knowledge representations tackle only a small part of knowledge, and representing the whole body of knowledge is almost impossible. The words knowledge and language should not be confused.
Computational models may be broadly classified under knowledge-driven and data-
driven categories. Knowledge-driven systems rely on explicitly coded linguistic knowledge,
often expressed as a set of handcrafted grammar rules. Acquiring and encoding such
knowledge is difficult and is the main bottleneck in the development of such systems. They
are, therefore, often constrained by the lack of sufficient coverage of domain knowledge. Data-
driven approaches presume the existence of a large amount of data and usually employ some
machine learning technique to learn syntactic patterns.


The amount of human effort required is less, and the performance of these systems depends on the quantity of the data. These systems are usually adaptive to noisy data. As mentioned
earlier, this book is mainly concerned with computational linguistics approaches. We try to
achieve a balance between semantic (knowledge-driven) and data-driven approaches on one
hand, and between theory and practice on the other. It is at this point that the book differs
significantly from other textbooks in this area. The tools and techniques have been covered to
the extent that is needed to build sufficient understanding of the domain and to provide a base
for application.
NLP is no longer confined to classroom teaching and a few traditional applications.
With the unprecedented amount of information now available on the web, NLP has become
one of the leading techniques for processing and retrieving information. In order to cope with
these developments, this book brings together information retrieval with NLP. The term
information retrieval is used here in a broad manner to include a number of information
processing applications such as information extraction, text summarization, question
answering, and so forth. The distinction between these applications is made in terms of the
level of detail or amount of information retrieved. We consider retrieval of information as part
of processing. The word 'information' itself has a much broader sense. It includes multiple
modes of information, including speech, images, and text. However, it is not possible to cover
all these modes due to space constraints. Hence, this book focuses on textual information only.

1.3 LANGUAGE AND KNOWLEDGE


Language is the medium of expression in which knowledge is deciphered. We are not competent enough to define language and knowledge and their implications. Here, we consider the text form of the language, and its content as knowledge.
Language, being a medium of expression, is the outer form of the content it expresses.
The same content can be expressed in different languages. But can language be separated
from its content? If so, how can the content itself be represented? Generally, the meaning of
one language is written in the same language (but with a different set of words). It may also be
written in some other, formal, language. Hence, to process a language means to process the
content of it. As computers are not able to understand natural language, methods are
developed to map its content into a formal language. Sometimes, formal language content may
have to be expressed in a natural language as well. Thus, in this book, language is taken up as
a knowledge representation tool that has historically represented the whole body of


knowledge, and that has been modified, maybe through the generation of new words, to include
new ideas and situations. The language and speech community, on the other hand, considers a
language as a set of sounds that, through combinations, conveys meaning to a listener.
However, we are concerned with representing and processing text only. Language (text)
processing has different levels, each involving different types of knowledge. We now discuss
various levels of processing and the types of knowledge each level involves.
The simplest level of analysis is lexical analysis, which involves analysis of words.
Words are the most fundamental unit (syntactic as well as semantic) of any natural language
text. Word-level processing requires morphological knowledge, i.e., knowledge about the
structure and formation of words from basic units (morphemes). The rules for forming words
from morphemes are language specific.
The next level of analysis is syntactic analysis, which considers a sequence of words as
a unit, usually a sentence, and finds its structure. Syntactic analysis decomposes a sentence
into its constituents (or words) and identifies how they relate to each other. It captures
grammaticality or non-grammaticality of sentences by looking at constraints like word order,
number, and case agreement. This level of processing requires syntactic knowledge, i.e.,
knowledge about how words are combined to form larger units such as phrases and
sentences, and what constraints are imposed on them. Not every sequence of words results in a sentence. For example, 'I went to the market' is a valid sentence whereas 'went the I market to' is not. Similarly, 'She is going to the market' is valid, but 'she are going the market' is not.
Thus, this level of analysis requires detailed knowledge about rules of grammar.

Yet another level of analysis is semantic analysis. Semantics is associated with the
meaning of the language. Semantic analysis is concerned with creating meaningful
representation of linguistic inputs. The general idea of semantic interpretation is to take
natural language sentences or utterances and map them onto some representation of
meaning. Defining meaning components is difficult as grammatically valid sentences can be
meaningless. One of the famous examples is 'Colorless green ideas sleep furiously' (Chomsky 1957). The sentence is well formed, i.e., syntactically correct, but semantically anomalous. However, this does not mean that syntax has no role to play in meaning. Bach (2002) considers '... semantics to be a projection of its syntax. That is, semantic structure is interpreted syntactic structure.' But definitely, syntax is not the only component contributing to meaning. Our conception of meaning is quite broad. We feel that humans apply all sorts of


knowledge (i.e., lexical, syntactic, semantic, discourse, pragmatic, and world knowledge) to
arrive at the meaning of a sentence. The starting point in semantic analysis, however, has
been lexical semantics (meaning of words). A word can have a number of possible meanings
associated with it. But in a given context, only one of these meanings participates. Finding out the correct meaning of a particular use of a word is necessary for finding the meaning of larger units. However, the meaning of a sentence cannot be composed solely on the basis of the meaning of
its words. Consider the following sentences:
Kabir and Ayan are married. (1.1a)
Kabir and Suha are married. (1.1b)
Both sentences have identical structures, and the meanings of individual words are
clear. But most of us end up with two different interpretations. We may interpret the second
sentence to mean that Kabir and Suha are married to each other, but this interpretation does
not occur for the first sentence. Syntactic structure and compositional semantics fail to
explain these interpretations. We make use of pragmatic information. This means that
semantic analysis requires pragmatic knowledge besides semantic and syntactic knowledge. A
still higher level of analysis is discourse analysis. Discourse-level processing attempts to
interpret the structure and meaning of even larger units, e.g., at the paragraph and document
level, in terms of words, phrases, clusters, and sentences. It requires the resolution of
anaphoric references and identification of discourse structure. It also requires discourse
knowledge, that is, knowledge of how the meaning of a sentence is determined by preceding
sentences.
1.4 The Challenges of NLP
There are a number of factors that make NLP difficult. These relate to the problems of representation and interpretation. Natural languages are highly ambiguous and vague, so achieving a precise representation of their content can be difficult. The greatest source of difficulty in natural language is identifying its semantics. Words alone do not make a sentence. Instead, it is the words together with their syntactic and semantic relations that give meaning to a sentence. Idioms, metaphors, and ellipses add further complexity to the task of identifying the meaning of written text.
Example: "The old man finally kicked the bucket"
The meaning of this sentence has nothing to do with the words 'kick' and 'bucket' appearing in it. The ambiguity of natural languages is another difficulty. Ambiguities go unnoticed most of the time, yet are correctly interpreted. This is possible because we use explicit as well as implicit sources of knowledge.


Example:
"I saw the man with the telescope."
 Does this mean I used a telescope to see the man?
 Or does it mean I saw a man who was holding a telescope?
The first level of ambiguity arises at the word level. Without much effort, we can identify words that have multiple meanings associated with them.
Example:
"Manya is looking for a match"
Here, the word 'match' may mean either that Manya is looking for a partner or that Manya is looking for a (cricket or other) match.
Solution: part-of-speech tagging and word sense disambiguation, as sketched below.
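As a rough illustration of these two solutions, the sketch below uses the NLTK library (an assumption; any tagger and WordNet interface would do) to tag parts of speech and to apply the classic Lesk algorithm for word sense disambiguation to the ambiguous word 'match'.

import nltk
from nltk.wsd import lesk

# Data packages needed by the tokenizer, tagger, and WordNet;
# exact package names can vary across NLTK versions.
for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

tokens = nltk.word_tokenize("Manya is looking for a match")

# Part-of-speech tagging: 'match' is tagged as a noun (NN) here.
print(nltk.pos_tag(tokens))

# Lesk picks the WordNet sense of 'match' whose dictionary gloss best
# overlaps with the context words; richer context gives better results.
sense = lesk(tokens, "match", pos="n")
print(sense, "->", sense.definition() if sense else "no sense found")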
1.5 Language and Grammar
Automatic processing of language requires the rules and exceptions of a language to be
explained to the computer. Grammar consists of a set of rules that allows us to parse and
generate sentences in a language. Rules relate information to coding devices at the language level, not at the world-knowledge level. However, world knowledge affects both the coding (i.e., words) and the coding convention (structure), and this blurs the boundary between syntax and semantics. Nevertheless, such a separation is made for ease of processing and grammar writing.
Types of grammars:
1. Transformational grammar (Chomsky 1957)
2. Lexical functional grammar (Kaplan and Bresnan 1982)
3. Government and binding (Chomsky 1981)
4. Generalized phrase structure grammar
5. Dependency grammar
6. Paninian grammar
7. Tree-adjoining grammar (Joshi 1985)
Some of these grammars focus on derivation, while others focus on relationships. The greatest contribution to grammar comes from Noam Chomsky, who proposed a hierarchy of formal grammars. In transformational grammar, each sentence in a language has two levels of representation, namely a deep structure and a surface structure; the mapping from deep structure to surface structure is carried out by transformations.


Transformational grammar has three components:


1. Phrase structure grammar
2. Transformational rules
3. Morphophonemic rules (these rules match each sentence representation to a string of phonemes)
Phrase structure grammar:
Each of these components consists of a set of rules. Phrase structure grammar consists of rules that generate natural language sentences and assign a structural description to them. As an example, consider the following set of rules, where S denotes a sentence, NP a noun phrase, VP a verb phrase, and Det a determiner (demonstrated in the parsing sketch after the rules):
S → NP VP
NP → Det Noun
VP → Aux Verb NP
Det → the
Noun → police | snatcher
Aux → will
Verb → catch
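To make these rules concrete, here is a hedged sketch using NLTK's CFG tools; the toy grammar is an assumption written to cover the example sentence analyzed later in this section, not the textbook's full grammar.

import nltk

# A toy phrase structure grammar mirroring the rules above.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> Aux V NP
Det -> 'the'
N -> 'police' | 'snatcher'
Aux -> 'will'
V -> 'catch'
""")

parser = nltk.ChartParser(grammar)
tokens = "the police will catch the snatcher".split()

# The parser assigns a structural description (a parse tree) to the sentence.
for tree in parser.parse(tokens):
    print(tree)
# (S (NP (Det the) (N police))
#    (VP (Aux will) (V catch) (NP (Det the) (N snatcher))))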

Transformational grammar:
The second component of transformational grammar is a set of transformational rules, which transform one phrase-marker (underlying) into another phrase-marker (derived). These rules are applied to the terminal string generated by the phrase structure rules. Unlike phrase structure rules, transformational rules are heterogeneous and may have more than one symbol on their left-hand side. These rules are used to transform one surface representation into another, e.g., an active sentence into a passive one. The rule relating active and passive sentences (as given by Chomsky) is:


NP1 – Aux – V – NP2 → NP2 – Aux + be + en – V – by + NP1

This rule says that an underlying input having the structure NP – Aux – V – NP can be transformed into NP – Aux + be + en – V – by + NP. This transformation involves the addition of the strings 'be' and 'en' and certain rearrangements of the constituents of a sentence. Transformational rules can be obligatory or optional. An obligatory transformation is one that ensures agreement in number between subject and verb, etc., whereas an optional transformation is one that modifies the structure of a sentence while preserving its meaning. Morphophonemic rules match each sentence representation to a string of phonemes.
Consider the active sentence:
The police will catch the snatcher.
The application of the phrase structure rules assigns a phrase structure (parse tree) to this sentence. The passive transformation rule will then convert the sentence into:
The + snatcher + will + be + en + catch + by + the + police

Morphophonemic rules:
Another transformational rule will then reorder 'en + catch' to 'catch + en', and subsequently one of the morphophonemic rules will convert 'catch + en' to 'caught'. In general, the noun phrase is not always as simple as in this sentence. The sketch below traces this whole pipeline in code.
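The following is a minimal sketch, under the toy assumptions above, of the three-stage pipeline: the optional passive transformation, the 'en'-reordering transformation, and a single morphophonemic lookup. The one-entry morphophonemic dictionary is an illustrative assumption covering only this verb.

def passive_transform(np1, aux, v, np2):
    # Transformational rule: NP1 - Aux - V - NP2 -> NP2 - Aux + be + en - V - by + NP1
    return [np2, aux, "be", "en", v, "by", np1]

def reorder_en(constituents):
    # A later transformation reorders 'en' + V to V + 'en'.
    out, i = [], 0
    while i < len(constituents):
        if constituents[i] == "en" and i + 1 < len(constituents):
            out.extend([constituents[i + 1], "en"])
            i += 2
        else:
            out.append(constituents[i])
            i += 1
    return out

# Toy morphophonemic rule: catch + en -> caught.
MORPHOPHONEMIC = {("catch", "en"): "caught"}

def apply_morphophonemics(constituents):
    out, i = [], 0
    while i < len(constituents):
        pair = tuple(constituents[i:i + 2])
        if pair in MORPHOPHONEMIC:
            out.append(MORPHOPHONEMIC[pair])
            i += 2
        else:
            out.append(constituents[i])
            i += 1
    return out

stages = passive_transform("the police", "will", "catch", "the snatcher")
stages = reorder_en(stages)              # ... will be catch en by ...
print(" ".join(apply_morphophonemics(stages)))
# -> the snatcher will be caught by the police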


1.6 Processing Indian Languages


There are a number of differences between Indian languages and English, and these introduce differences in their processing. Some of the differences are listed here.
 Unlike English, Indic scripts have a non-linear structure.
 Unlike English, Indian languages have SOV (Subject-Object-Verb) as the default sentence structure.
 Indian languages have a free word order, i.e., words can be moved freely within a sentence without changing the meaning of the sentence.
 Spelling standardization is more subtle in Hindi than in English. For example, पुस्तक (pustak) is the standard word for "book", while किताब (kitaab) is another commonly used word for it.
 Indian languages have a relatively rich set of morphological variants, e.g., "to read", "I am reading", "I read".
 Indian languages make extensive and productive use of complex predicates (CPs).
 Indian languages use post-position case markers (karakas) instead of prepositions.
 Indian languages use verb complexes consisting of sequences of verbs, e.g., गा रहा है (ga raha hai, "is singing") and खेल रही है (khel rahi hai, "is playing"). The auxiliary verbs in such a sequence provide information about tense, aspect, modality, etc.

Except for the direction in which its script is written, Urdu is closely related to Hindi. Both
share similar phonology, morphology, and syntax. Both are free-word-order languages and
use post-positions. They also share a large amount of their vocabulary. Differences in the
vocabulary arise mainly because a significant portion of Urdu vocabulary comes from Persian
and Arabic, while Hindi borrows much of its vocabulary from Sanskrit.
Paninian grammar provides a framework for modelling Indian languages. These models can be used for the computational processing of Indian languages. The grammar focuses on the extraction of karaka relations from a sentence.


1.7 NLP Applications


The applications utilizing NLP include the following.
1. Machine Translation: This refers to the automatic translation of text from one human language to another; it requires an understanding of words and phrases and of the grammars of the two languages.
2. Speech Recognition: This is the process of mapping acoustic speech signals to a set of words (e.g., Siri, Google Assistant, Alexa).
3. Speech Synthesis: This is the process of converting written text into spoken words using computer-generated voices; that is, the automatic production of speech.
Example:
• User: "What is the weather like today?"
• Speech Synthesis Response: "The weather today is sunny with a high of 75°F."
4. Natural Language Interface to Database: This allows querying a structured database using natural language sentences.
5. Information Retrieval: This refers to the process of searching for and obtaining information from a large collection of data or documents.
6. Information Extraction: This is the process of automatically extracting structured information from unstructured text.
7. Question Answering: This attempts to find the precise answer, or at least the precise portion of text in which the answer appears.
8. Text Summarization: This deals with the creation of summaries of documents and involves syntactic, semantic, and discourse-level processing of text.

2.0 LANGUAGE MODELLING


Introduction:
Language modelling is the process of building a statistical or machine-learning model
that can predict the next word or sequence of words in a sentence. It helps computers
understand and generate human language.
A model is a description of some complex entity or process. A language model is thus a
description of language. Indeed, natural language is a complex entity and in order to process it
through a computer-based program, we need to build a representation (model) of it. This is
known as language modelling. Language modelling can be viewed either as a problem of
grammar inference or a problem of probability estimation. A grammar-based language model
attempts to distinguish a grammatical sentence from a non-grammatical (ill-formed) one,


whereas a probabilistic language model attempts to identify a sentence based on a probability measure, usually a maximum likelihood estimate. These two viewpoints have led to the following categorization of language modelling approaches:
1. Statistical language modelling
2. Grammar-based language modelling
2.1 Statistical Language Modelling
The statistical approach creates a language model by training it from a corpus. In order to
capture regularities of a language, the training corpus needs to be sufficiently large. Rosenfeld
(1994) pointed out:
Statistical language modelling (SLM) is the attempt to capture regularities of natural
language for the purpose of improving the performance of various natural language
applications.
Statistical language modelling is one of the fundamental tasks in many NLP applications,
including speech recognition, spelling correction, handwriting recognition, and machine
translation. It has now found applications in information retrieval, text summarization, and question answering as well. A number of statistical language models have been proposed in the literature; the most popular of these are the n-gram models.
A statistical language model is a probability distribution P(s) over all possible word sequences (or any other linguistic unit, like sentences, paragraphs, documents, or spoken utterances). The dominant approach in statistical language modelling is the n-gram model.
2.1.1 n-gram Model
As discussed earlier, the goal of a statistical language model is to estimate the
probability (likelihood) of a sentence. This is achieved by decomposing sentence probability
into a product of conditional probabilities using the chain rule as follows:
P(s) = P(w1, w2, w3, ..., wn)
     = P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) ... P(wn|w1w2...wn-1)
     = ∏_{i=1}^{n} P(wi | hi)

where hi, the history of word wi, is the sequence w1 w2 ... wi-1.
So, in order to calculate the sentence probability, we need to calculate the probability of a word given the sequence of words preceding it. This is not a simple task. An n-gram model simplifies the task by approximating the probability of a word given all the previous words by its conditional probability given only the previous n-1 words:

P(wi | hi) ≈ P(wi | wi-n+1 ... wi-1)

For a bigram model (n = 2), this reduces to P(wi | wi-1), so that P(s) ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1).


We now discuss two types of n-gram models: the unigram and the bigram model.
Unigram model:
A unigram model estimates word probabilities by counting the frequency of each word in the
corpus.
Example
Consider a small corpus of sentences:
• "I like dogs"
• "I like cats"
• "I love dogs"
Step 1: Tokenize the corpus
["I", "like", "dogs"]
["I", "like", "cats"]
["I", "love", "dogs"]
Step 2: Calculate word frequencies
Now, we count how often each word appears in the corpus:
• "I" appears 3 times.
• "like" appears 2 times.
• "dogs" appears 2 times.
• "cats" appears 1 time.
• "love" appears 1 time.
Step 3: Compute the total number of words
3+3+3=9
Step 4: Calculate the probabilities
The probability of a word P(w) in a unigram model is computed as:
P(w)=count of word w/ total number of words
For each word in our example:
• P(I)=3/9=0.33
• P(like)=2/9=0.22
• P(dogs)=2/9=0.22
• P(cats)=1/9=0.11
• P(love)=1/9=0.11
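The counting steps above translate directly into a few lines of Python. The following is a minimal sketch (an illustration, not part of the source text) of a unigram model trained on the same three-sentence toy corpus.

from collections import Counter

corpus = ["I like dogs", "I like cats", "I love dogs"]
tokens = [w for sentence in corpus for w in sentence.split()]

counts = Counter(tokens)   # word frequencies (Step 2)
total = len(tokens)        # 9 words in all (Step 3)

# Step 4: P(w) = count of word w / total number of words
unigram_p = {w: c / total for w, c in counts.items()}
for w, p in unigram_p.items():
    print(f"P({w}) = {counts[w]}/{total} = {p:.2f}")

# A unigram model scores a sentence as the product of its word
# probabilities, ignoring word order entirely.
p = 1.0
for w in "I like dogs".split():
    p *= unigram_p[w]
print(f"P('I like dogs') = {p:.4f}")   # (3/9)*(2/9)*(2/9) ≈ 0.0165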


Bigram model:
In a bigram model, the probability of a word depends on the previous word. The model uses bigrams (pairs of consecutive words) to estimate probabilities. We calculate the probability of each bigram using the formula:
P(w2|w1) = count of bigram (w1, w2) / count of word w1
Example
Consider a small corpus of sentences:
• "I like dogs"
• "I like cats"
• "I love dogs"
Step 1: Tokenize the corpus
["I", "like", "dogs"]
["I", "like", "cats"]
["I", "love", "dogs"]
Step 2: Generate bigrams
("I", "like")
("like", "dogs")
("I", "like")
("like", "cats")
("I", "love")
("love", "dogs")
Step 3: Count bigram frequencies
We now count how often each bigram appears in the corpus:
• ("I", "like") appears 2 times.
• ("like", "dogs") appears 1 time.
• ("like", "cats") appears 1 time.
• ("I", "love") appears 1 time.
• ("love", "dogs") appears 1 time.

Step 4: Calculate bigram probabilities


The probability of a bigram is P(w2|w1) = count of bigram (w1, w2) / count of word w1.
For each bigram in the example:


• P(like∣I)=2/3 =0.67(since "I" appears 3 times in total, and "I like" appears 2 times)
• P(dogs∣like)=1/2 =0.5 (since "like" appears 2 times in total, and "like dogs" appears 1
time)
• P(cats∣like)=1/2 =0.5 (since "like" appears 2 times in total, and "like cats" appears 1
time)
• P(love∣I)=1/3 = 0.33(since "I" appears 3 times in total, and "I love" appears 1 time)
• P(dogs∣love)=1/1=1 (since "love" appears 1 time in total, and "love dogs" appears 1
time)
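The same toy corpus yields these estimates in code. This minimal sketch also shows how a bigram model scores a whole sentence by chaining conditional probabilities; for simplicity it omits the sentence-boundary markers that a fuller treatment would add.

from collections import Counter

corpus = ["I like dogs", "I like cats", "I love dogs"]
sentences = [s.split() for s in corpus]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences
                        for i in range(len(s) - 1))

# P(w2 | w1) = count of bigram (w1, w2) / count of word w1
def bigram_p(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(f"P(like | I) = {bigram_p('I', 'like'):.2f}")        # 2/3 ≈ 0.67
print(f"P(dogs | like) = {bigram_p('like', 'dogs'):.2f}")  # 1/2 = 0.50

# Sentence probability under the bigram approximation:
# P(s) ≈ P(w1) * P(w2|w1) * P(w3|w2) * ...
words = "I like dogs".split()
p = unigram_counts[words[0]] / sum(unigram_counts.values())  # P(I) = 3/9
for w1, w2 in zip(words, words[1:]):
    p *= bigram_p(w1, w2)
print(f"P('I like dogs') ≈ {p:.3f}")   # (3/9)*(2/3)*(1/2) ≈ 0.111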

2.2 Grammar-based language model


A grammar-based approach uses the grammar of a language to create its model. It
attempts to represent the syntactic structure of language. Grammar consists of hand-coded
rules defining the structure and ordering of various constituents appearing in a linguistic unit
(phrase, sentence, etc.). For example, a sentence usually consists of a noun phrase and a verb phrase. The grammar-based approach attempts to utilize this structure as well as the relationships between these structures.
2.2.1 Paninian Framework
Paninian grammar is the most important grammar-based model for Indian languages. It was written by Panini around 500 BC in Sanskrit (the original text being titled Ashtadhyayi). The framework can be used for other Indian languages, and possibly for some other Asian languages as well. Unlike English, these languages are SOV (Subject-Object-Verb) ordered and inflectionally rich.
The inflections provide syntactic and semantic cues for language analysis and understanding. Indian languages have traditionally used oral communication for knowledge propagation. The purpose of these languages is to communicate ideas from the speaker's mind to the listener's mind. Such oral traditions have given rise to morphologically rich languages. Some languages, like Sanskrit, have the flexibility to allow word groups representing subject, object, and verb to occur in any order. In others, like Hindi, we can change the positions of the subject and the object.

The auxiliary verbs follow the main verb. In Hindi, they remain as separate words, whereas in
South Indian (Dravidian) languages, they combine with the main verb.

Layered Representation in Paninian Grammar


GB (Government and Binding) theory represents three syntactic levels: deep structure, surface structure, and logical form (LF), where the LF is nearer to semantics. This theory tries to resolve all language issues at the syntactic levels only. Unlike GB, the Paninian grammar framework is said to be syntactic-semantic, that is, one can go from the surface layer to deep semantics by passing through intermediate layers. Although all these layers are not named, as per Bharati et al. (1995) the language can be represented at the following levels:

Semantic level
Karaka level
Vibhakti level
Surface level

The surface and the semantic levels are obvious. The other two levels should not be confused with the levels of GB. Vibhakti literally means inflection, but here it refers to word (noun, verb, or other) groups formed based either on case endings, or post-positions, or compound verbs, or main and auxiliary verbs, etc. Instead of talking about NP, VP, AP, PP, etc., word groups are formed based on various kinds of markers (including the absence of a marker, Θ). These markers are language-specific, but all Indian languages (and possibly Asian languages as well) can be represented at the Vibhakti level.


2.2.2 Karaka Theory
Karaka theory is the central theme of the Paninian grammar framework. Karaka relations are assigned based on the roles played by the various participants in the main activity. These roles are reflected in the case markers and post-position markers (parsargs). These relations are similar to case relations in English, but the types of relations are defined in a different manner, and the richness of the case endings found in Indian languages has been used to advantage.
We now discuss the various karakas: Karta (subject), Karma (object), Karana (instrument), Sampradana (beneficiary), Apadana (separation), and Adhikarana (locus).

Karaka | Meaning | Relation to verb | Example in Sanskrit
------ | ------- | ---------------- | -------------------
Kartā (कर्ता) | Subject | Who performs the action | रामः पठति (Rama reads)
Karma (कर्म) | Object | What is affected by the action | रामः पुस्तकं पठति (Rama reads a book)
Karana (करण) | Instrument | What is used to perform the action | रामः लेखनीना लिखति (Rama writes with a pen)
Sampradana (सम्प्रदान) | Beneficiary | To whom the action is directed | रामः सीतायै पुष्पं ददाति (Rama gives a flower to Sita)
Apadana (अपादान) | Separation | Where the action moves from | वृक्षात् पत्रं पतति (A leaf falls from the tree)
Adhikarana (अधिकरण) | Location | Where or when the action happens | गृहे पठति (Reads in the house)
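As a rough illustration of how such relations can be used computationally, the sketch below maps Hindi post-position (vibhakti) markers to candidate karaka labels. The marker table is a simplified assumption; real mappings depend on the verb and context, which is exactly the disambiguation problem discussed under the issues below.

# A simplified sketch of vibhakti-to-karaka candidate mapping for Hindi.
# The marker table is an illustrative assumption; actual mappings are
# context-dependent, and several markers are genuinely ambiguous.

VIBHAKTI_TO_KARAKA = {
    "ne": ["karta"],                # marker on the agent
    "ko": ["karma", "sampradana"],  # object or beneficiary (ambiguous)
    "se": ["karana", "apadana"],    # instrument or separation (ambiguous)
    "me": ["adhikarana"],           # locative "in"
    "par": ["adhikarana"],          # locative "on"
}

def candidate_karakas(word_groups):
    """List candidate karaka relations for each (word group, marker) pair."""
    result = {}
    for group, marker in word_groups:
        # Absence of a marker (None) typically leaves karta or karma as
        # candidates; disambiguation needs verb agreement and semantics.
        result[group] = VIBHAKTI_TO_KARAKA.get(marker, ["karta", "karma"])
    return result

# "Ram ne Sita ko phool diya" (Ram gave a flower to Sita)
groups = [("Ram", "ne"), ("Sita", "ko"), ("phool", None)]
print(candidate_karakas(groups))
# {'Ram': ['karta'], 'Sita': ['karma', 'sampradana'], 'phool': ['karta', 'karma']}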

Issues in Paninian Grammar


The two problems challenging linguists are:
i. Computational implementation of PG, and
ii. Adaptation of PG to Indian and other similar languages.
An approach to implementing PG has been discussed in Bharati et al. (1995). This is a multilayered implementation. The approach is named 'Utsarga-Apvada' (default-exception),
where rules are arranged in multiple layers in such a way that each layer consists of rules
which are exceptions to the rules in the layer above. Thus, as we go down the layers, more particular information is derived. Rules may be represented in the form of charts (such as the Karaka chart and the Lakshan chart).
However, many issues remain unresolved, especially in cases of shared karaka relations. Another difficulty arises when the mapping between the vibhakti (case markers and post-positions) and the semantic relation (with respect to the verb) is not one to one. Two different vibhaktis can represent the same relation, or the same vibhakti can represent different relations in different contexts. Strategies to disambiguate the various senses of words, or word groupings, remain challenging issues.
As the system of rules differs across languages, the framework requires adaptation to tackle various applications in various languages.

Prepared by: Prof. M.KALIDASS Department of AIML, SSCE, Anekal, Bengaluru.
