GATES INSTITUTE OF TECHNOLOGY, GOOTY NLP|20A05702c

UNIT – V
Machine Translation and Multilingual Information

Introduction: Machine Translation Survey


Machine translation is the process of using artificial intelligence to automatically translate text from one
language to another without human involvement. Modern machine translation goes beyond simple
word-to-word translation to communicate the full meaning of the original language text in the target
language. It analyzes all text elements and recognizes how the words influence one another.

Benefits of machine translation:


Human translators use machine translation services to translate faster and more efficiently. We give some
benefits of machine translation below:

Automated translation assistance


Machine translation provides a good starting point for professional human translators. Many
translation management systems integrate one or more machine translation models into their workflow.
They have settings to run translations automatically, then send them to human translators for post-editing.
Speed and volume
Machine translation works very fast, translating millions of words almost instantaneously. It
can translate large amounts of data, such as real-time chat or large-scale legal cases. It can also process
documents in a foreign language, search for relevant terms, and remember those terms for future
applications.
Large language selection
Many major machine translation providers offer support for 50-100+ languages. Translations
also happen simultaneously for multiple languages, which is useful for global product rollouts and
documentation updates.
Cost-effective translation
Machine translation increases productivity and the ability to deliver translations faster,
reducing the time to market. There is less human involvement in the process as machine translation
provides basic but valuable translations, reducing both the cost and time of delivery. For example, in
high-volume projects, you can integrate machine translation with your content management systems to
automatically tag and organize the content before translating it to different languages.

Problems of Machine Translation:


Machines can understand and translate human speech better than ever before. However, bots from Google to Amazon still struggle to understand your words, handle humor, or do much beyond giving you an approximation of what was originally said.

When it comes to translating files, machines can do a lot in place of human intervention, but they are not the ideal solution for translation.

Machine translation can be quicker and more budget-friendly than using a certified human translation service, but the trade-off is an increase in errors and other problems. Here are nine problems of machine translation that will have you thinking twice before employing it.
1. The accuracy suffers
A machine translator doesn’t go back to check its work. There’s no pause-and-repeat function to
allow a machine to go over a phrase more than once, then accurately transcribe it.


Machine translations are most effective when used to get a general idea of a passage or piece of
content. However, when it comes to word-for-word accuracy from one language to another, a
machine won’t deliver a high percentage of accurate translation. You need a human touch to review
and edit a translation for the best possible level of accuracy.

2. You lose context

Machines are very literal. They can’t understand how a mistranslated word or phrase could change
the meaning of a passage in different contexts.

A human eye and ear on your translation can save you from an embarrassing error. A machine will
miss nuances or contexts that make a passage accurate and relevant.

From puns to sarcasm, machines miss those nuances while a human translator can listen to a phrase
and understand how to translate it in a way that makes sense culturally in the target language.

3. You lose time

You might find a quicker (almost instant) turnaround time on a machine translation, but you’ll likely
have to spend valuable time reviewing and correcting errors.

Consider a quick, yet inaccurate translation plus the additional time you’ll spend fixing it. Now
compare that to the time for a certified human translator to translate your piece, check for errors, and
guarantee their final work.

It might take slightly longer to get your final translation, but the accuracy is worth it. If you allow
enough time to meet your deadline, human translation always wins.

4. They limit file formats

Machine translators are finicky. They might have limits to the types of file formats they can read,
which limits your options when choosing an engine to use. There may also be significant limits on
file size, narrowing the field even further.

If you recorded audio or typed your document in an unaccepted format, you’re stuck without a
translator. Choosing a human translator often gives you more options to use your file, no matter the
format or size.

While human translators may have preferred file formats, outright file rejection is less likely, and
providing what they need means that your project will get done faster.

5. They can’t think

Machines can’t think, and they don’t ask questions. They use programming to interpret words and
find the closest equivalent in another language for the translation.

Some languages don’t have an exact equivalent in a different language. Machines either translate to a
similar (but inaccurate) word, or you’ll find blatant gaps in the translations.


When you work with a professional human translator, they understand language barriers. If a word or
phrase has no exact translated meaning, they’ll recommend a suitable equivalent translated phrase to
maintain the integrity of your document across languages.

6. You lose money

“But some machine translations are free!”

You get what you pay for. Trying to save a buck on a “free” translation service can cost you more
money to edit the translation. Whether you do it yourself or pay someone to edit, you’re not getting
anything for free.

Time is valuable. Your staff has better things to do than edit a free machine translation full of errors.
It’s a better investment of time and budget dollars to work with a certified human translator from the
start.

7. It’s not an expert

A machine is not an expert in any language or specific industry jargon. Again, a machine translation
service is a program. It's only as accurate as the person who developed the software and the material
that it's fed.

Certified translators are experts in their languages, and often, complex industries. Find a translator
that specializes in the parent language and the end-result language. You’ll save yourself from
potentially offensive or dangerous errors in the final results.

8. It can get wordy

Why keep it in five words when you can say it in fifteen words in a different language? If you notice
your document is much longer than the original, your machine translator probably added more words
than necessary to find a “close enough” translation.

When paying by the word for a machine translation service, beware of paying for words you don’t
need and a confusing document. Concise is always better than convoluted.

9. It’s not creative

When translating marketing materials or creatively-driven pieces, a machine translation program regularly gets lost in translation.

Creative content often uses words in unique contexts. You’ll also find made-up words for creative,
dramatic, or humorous effects. Machines won’t catch those subtleties in language or context.

Using a human translator will help you market to multiple audiences with varying languages while
taking creative flair into account.

Is Machine Translation Possible:

Despite these problems, machine translation is possible and is already widely used in practice. Some common use cases are:


● Internal communication

For a company operating in different countries across the world, communication can be difficult to
manage. Language skills can vary from employee to employee, and some may not understand the
company’s official language well enough. Machine translation helps to lower or eliminate the language
barrier in communication. Individuals quickly obtain a translation of the text and understand the
content's core message. You can use it to translate presentations, company bulletins, and other common
communication.

● External communication

Companies use machine translation to communicate more efficiently with external stakeholders and
customers. For instance, you can translate important documents into different languages for global
partners and customers. If an online store operates in many different countries, machine translation
can translate product reviews so customers can read them in their own language.

● Data analysis

Some types of machine translation can process millions of user-generated comments and deliver
highly accurate results in a short timeframe. Companies translate the large amount of content posted
on social media and websites every day and use it for analytics. For example, they can
automatically analyze customer opinions written in various languages.

● Online customer service

With machine translation, brands can interact with customers all over the world, no matter what
language they speak. For example, they can use machine translation to:

● Accurately translate requests from customers all over the world


● Increase the scale of live chat and automate customer service emails
● Improve the customer experience without hiring more employees
● Legal research

Legal departments use machine translation for preparing legal documents in different countries.
With machine translation, a large amount of content becomes available for analysis that would
otherwise have been difficult to process because it is in different languages.

Brief History:
The idea of using computers to translate human languages automatically first emerged in the early
1950s. However, at the time, the complexity of translation was far higher than early estimates by computer
scientists. It required enormous data processing power and storage, which was beyond the capabilities of
early machines.

In the early 2000s, computer software, data, and hardware became capable of doing basic machine
translation. Early developers used statistical databases of languages to train computers to translate text.
This involved a lot of manual labor and time. Each added language required them to start over with the
development for that language. Since then, machine translation has developed in speed and accuracy, and
several different machine translation strategies have emerged.


Possible Approaches:
In machine translation, the original text or language is called the source language, and the language you
want to translate it to is called the target language. Machine translation works by following a basic
two-step process:

1. Decode the source language meaning of the original text


2. Encode the meaning into the target language

Some common approaches by which language translation technology implements this machine
translation process are described below.

Rule-based machine translation

Language experts develop built-in linguistic rules and bilingual dictionaries for specific industries or
topics. Rule-based machine translation uses these dictionaries to translate specific content accurately.
The steps in the process are:

1. The machine translation software parses the input text and creates a transitional
representation
2. It converts the representation into the target language, using the grammar rules and dictionaries as
a reference
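As an illustration of these two steps, here is a minimal, hypothetical sketch of the dictionary-plus-rules idea in Python. The glossary entries, the function name translate_rule_based, and the single reordering rule are invented for demonstration and are far simpler than any real rule-based system.

```python
# Minimal sketch of dictionary + rule translation (hypothetical toy data).
# Step 1: parse the input into a transitional representation (here, just tokens).
# Step 2: convert it into the target language using a bilingual dictionary
#         and one simple word-order rule (English SVO -> Hindi-like SOV).

bilingual_dict = {            # toy English -> Hindi glossary (illustrative only)
    "rats": "cUhe",
    "killed": "mArA",
    "cats": "billI",
}

def translate_rule_based(sentence: str) -> str:
    tokens = sentence.lower().split()                     # transitional representation
    glossed = [bilingual_dict.get(t, t) for t in tokens]  # dictionary lookup
    if len(glossed) == 3:                                  # naive SVO -> SOV reordering rule
        subj, verb, obj = glossed
        glossed = [subj, obj, verb]
    return " ".join(glossed)

print(translate_rule_based("rats killed cats"))   # cUhe billI mArA
```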
Pros and cons

Rule-based machine translation can be customized to a specific industry or topic. It is predictable and
provides quality translation. However, it produces poor results if the source text has errors or uses
words not present in the built-in dictionaries. The only way to improve it is by manually updating
dictionaries regularly.

Statistical machine translation

Instead of relying on linguistic rules, statistical machine translation uses machine learning to translate
text. The machine learning algorithms analyze large amounts of existing human translations and look
for statistical patterns. The software then makes an intelligent guess when asked to translate a new
source text, making predictions based on the statistical likelihood that a specific source word or phrase
corresponds to a particular word or phrase in the target language.
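A toy sketch of the statistical idea: given phrase-translation probabilities that would normally be estimated from a large parallel corpus, pick the target phrase with the highest likelihood. The phrase table and the function name best_translation are invented for illustration.

```python
# Toy statistical phrase selection (illustrative probabilities, not real data).
# P(target_phrase | source_phrase) would normally be estimated from a large
# aligned parallel corpus; here it is hard-coded.

phrase_table = {
    "thank you": {"dhanyavaad": 0.7, "shukriya": 0.3},
    "good morning": {"suprabhaat": 0.8, "namaste": 0.2},
}

def best_translation(source_phrase: str) -> str:
    candidates = phrase_table.get(source_phrase, {})
    if not candidates:
        return source_phrase              # fall back to copying the source
    # choose the statistically most likely target phrase
    return max(candidates, key=candidates.get)

print(best_translation("thank you"))      # dhanyavaad
```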

Syntax-based machine translation

Syntax-based machine translation is a sub-category of statistical machine translation. It uses grammatical rules to translate syntactic units. It analyzes sentences to incorporate syntax rules into statistical translation models.

Pros and cons


Statistical methods require training on millions of words for every language pair. However, with
sufficient data the machine translations are accurate.

Neural machine translation


Neural machine translation uses artificial intelligence to learn languages, and to continuously
improve that knowledge using a specific machine learning method called neural networks. It often
works in combination with statistical translation methods.

Neural network

A neural network is an interconnected set of nodes inspired by the human brain. It is an information
system where input data passes through several interconnected nodes to generate an output. Neural
machine translation software uses neural networks to work with enormous datasets. Each node makes
one attributed change of source text to target text until the output node gives the final result.
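For intuition only, the sketch below shows the skeleton of a tiny encoder-decoder (sequence-to-sequence) network in PyTorch. All sizes, class names, and the fake input are invented, and real systems add attention, training loops, and much larger vocabularies.

```python
# Minimal encoder-decoder skeleton (PyTorch), for intuition only.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128   # toy sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src_ids):                 # src_ids: (batch, src_len)
        _, hidden = self.rnn(self.emb(src_ids))
        return hidden                           # summary of the whole source sentence

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, tgt_ids, hidden):         # predict next-token scores
        output, hidden = self.rnn(self.emb(tgt_ids), hidden)
        return self.out(output), hidden

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (1, 5))       # a fake 5-token source sentence
logits, _ = decoder(torch.randint(0, TGT_VOCAB, (1, 4)), encoder(src))
print(logits.shape)                             # (1, 4, TGT_VOCAB)
```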

Neural machine translation vs other translation methods

Neural networks consider the whole input sentence at each step when producing the output sentence,
whereas other machine translation models break an input sentence into sets of words and phrases, mapping
them to a word or sentence in the target language. Neural machine translation systems can address
many limitations of other methods and often produce better-quality translations.

Hybrid machine translation

Hybrid machine translation tools use two or more machine translation models on one piece of
software. You can use the hybrid approach to improve the effectiveness of a single translation model.
This machine translation process commonly uses rule-based and statistical machine translation
subsystems. The final translation output is the combination of the output of all subsystems.

Pros and cons

Hybrid machine translation models successfully improve translation quality by overcoming the issues
linked with single translation methods.

Current Status:
Neural machine translation is widely regarded as the most accurate, versatile, and fluent machine
translation approach. Since its emergence in the mid-2010s, neural machine translation has become the
most advanced machine translation technology. It is more accurate than statistical machine
translation in everything from fluency to generalization, and it is now considered the standard in machine
translation development.

The performance of a machine translator depends on several factors, including the:

● Machine translation engine or technology

● Language pair

● Available training data

● Text types for translation
As the software performs more translations for a specific language or domain, it will produce higher
quality output. Once trained, neural machine translation becomes more accurate and faster, and new
languages become easier to add.


Anusaraka or Language Accessor


Anusaaraka is a computer software which renders text from one Indian language into another, a sort
of machine translation. It produces output which is comprehensible to the reader, although at times it
might not be grammatical.
Anusaaraka is a Language Accessor cum Machine Translation system based on the fundamental
premise of sharing the load between the machine and the reader, producing good-enough results according
to the needs of the reader. The system promises a faithful representation of the translated text, no loss of
information while translating, and graceful degradation (robustness) in case of failure. The layered output
provides access to all the stages of translation, making the whole process transparent.
Thus, Anusaaraka differs from conventional Machine Translation systems in two respects:
(1) its commitment to faithfulness, thereby providing a layer of 100% faithful output so that a
user with some training can "access the source text" faithfully; and
(2) the system is so designed that a user can contribute to it and participate in improving its quality.
Further, Anusaaraka provides an eclectic combination of the Apertium architecture with a forward-chaining
expert system, allowing use of both deep parser and shallow parser outputs to analyze the SL text.
Existing language resources (parsers, taggers, chunkers) available under the GPL are used instead of
being rewritten. Language data and linguistic rules are independent of the core programme, making it
easy for linguists to modify and experiment with different language phenomena to improve the system.
Users can become contributors by adding new word sense disambiguation (WSD) rules for ambiguous
words through a web interface available over the internet.
The system uses forward chaining in the expert system to infer new language facts from the existing
language data. This helps to solve the complex behavior of language translation by applying specific
knowledge rather than a specific technique, creating a vast language knowledge base in electronic form.
In other words, the expert system facilitates the transformation of subject matter experts' (SME)
knowledge available with humans into a computer-processable knowledge base.

The Problem
• A large parallel corpus in machine-readable form is required for training.
• A fall-back mechanism in case of error correction is difficult.
• After a particular threshold, improving the quality of the system is very difficult.
• 100% correct translation is not possible.
Machine translation has the following fundamental problems. Language is a device for information
exchange. Languages code information at various levels such as morphological, syntactic, pragmatic,
language convention, etc. Hence, extracting the correct information needs extra-linguistic information
such as world knowledge, context, cultural knowledge and the language conventions of the receiving
person. But the information available in the language string is always partial. Further, there is an
inherent tension between brevity and precision in language, where brevity always wins, leading to
inherent ambiguity. Though machines are good at storing language data, it is extremely difficult
for machines to have the world knowledge of an average human being. Hence, determining the correct
sense of an ambiguous word is a major bottleneck in machine translation technology.
ANUSAARAKA system has the following unique features:
• Faithful representation: The system tries to give a faithful rendering of the source text rather than a
freely interpreted, natural-sounding translation. The user can infer the correct meaning from the various
layers of translation output.
• No loss of information: All the information available on the source language side is made available
explicitly in the successive layers of translation.
• Graceful degradation (robust fall-back mechanism): The system ensures a safety net by providing
a "padasutra layer", which is a word-to-word translation represented in a special formulaic form,
representing the various senses of the source language word.
The major goals of the Anusaaraka system are to:
• Reduce the language barrier by facilitating access from one language to another.
• Demonstrate the practical usability of the Indian traditional grammatical system in the modern context.
• Enable users to become contributors in the development of the system.
• Provide a free and open source machine translation platform for Indian languages.

Structure of Anusaraka System:


The Anusaaraka system has two major components.
◆ Core engine
◆ User–cum-developer interface


‘Core’ engine is the main engine of anusaaraka. This engine produces the output in different layers
making the process of Machine Translation transparent to the user. The architecture of “core”
anusaaraka is shown in Figure.
Core Anusaaraka engine
The core anusaaraka engine has four major modules:
I. Word Level Substitution
II. Word Sense Disambiguation
III. Preposition Placement
IV. Hindi Word Order Generation
Each of the above four modules is described in detail below, along with a justification of how these
design choices answer the questions raised earlier.
Word Level Substitution:
At this level the ‘gloss’ of each source language word into the target language is provided. However,
the Polysemous words (words having more than one related meaning) create problems. When there is
no one-one mapping, it is not practical to list all the meanings. On the other hand, anusaaraka claims
‘faithfulness’ to the original text. Then how is the faithfulness guaranteed at word level substitution?
Concept of Padasutra:
To arrive at the solution, the user must understand why a native speaker does not find it odd to have
so many ‘seemingly’ different meanings of a word. By looking at the various usages of any
Polysemous word, users may observe that these Polysemous words have a “core meaning” and other
meanings are natural extensions of this meaning. In anusaaraka an attempt is made to relate all these
meanings and show their relationship by means of a formula. This formula is termed Padasutra[2].
(The concept of Padasutra is based on the concept of ‘pravrutti-nimitta’ from traditional grammar)
The word padasutra itself has two meanings:
◆ a thread connecting different senses
◆ a formula for pada
An example of Padasutra:


The English word 'leave' as a noun means 'Cutti' in Hindi, and as a verb its Hindi meaning is
'CodanA'; 'CodanA' is derived from 'Cutti'. Hence, the Padasutra for 'leave' is:
leave: Cutti[>CodanA]
Here 'a>b' stands for 'b gets derived from a' and 'a[b]' roughly stands for 'a or b'. Thus, by division of
workload and adoption of the concept of 'Padasutra' (word formula), the research guarantees that the
first-level output is 'faithful' to the original and also acts as a 'safety net' where other modules fail. At
this level some English words, such as function words and articles, are not substituted: they are either
highly ambiguous, there is a lexical/conceptual gap in Hindi corresponding to them (e.g. articles), or
substituting them may lead to catastrophe. Thus, for the input sentence 'rats killed cats' the output after
word level substitution is:
cUhA{s} mArA{ed/en} billI{s}
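The Python sketch below mimics this word-level substitution layer on the 'rats killed cats' example. The glosses and markers are copied from the example above, while the dictionary, the morphological heuristics, and the function name are invented; this is only an illustration, not the actual Anusaaraka implementation.

```python
# Illustrative word-level substitution producing padasutra-style output.
# The glosses come from the example in the text; everything else is a toy simplification.

padasutra = {
    "rat":   "cUhA",
    "kill":  "mArA",
    "cat":   "billI",
    "leave": "Cutti[>CodanA]",   # padasutra: core sense with its derived sense
}

def word_level_substitution(tokens):
    out = []
    for token in tokens:
        if token.endswith("s") and token[:-1] in padasutra:      # plural noun marker
            out.append(padasutra[token[:-1]] + "{s}")
        elif token.endswith("ed") and token[:-2] in padasutra:   # past-tense verb marker
            out.append(padasutra[token[:-2]] + "{ed/en}")
        else:
            out.append(padasutra.get(token, token))              # otherwise leave unchanged
    return " ".join(out)

print(word_level_substitution(["rats", "killed", "cats"]))
# cUhA{s} mArA{ed/en} billI{s}
```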
Training Component
To understand the output produced in this manner, a human being needs some training. The training
presents English grammar through the Paaninian view[3]. Thus, if a user is willing to put in some
effort, he/she has complete access to the original text.
The effort required here is that of making correct choices based on common sense, world knowledge,
etc. The training component ensures that this layer produces an output which is a "rough" translation
that differs from natural Hindi in systematic ways. Since the output is generated following certain
principles, the chances of being misled are low. Theoretically, the output at this layer is reversible.
Word Sense Disambiguation (WSD)
English has a very rich source of systematic ambiguity. A majority of nouns in English can potentially
be used as verbs. Therefore, the WSD task in the case of English can be split into two classes:
● (i ) WSD across POS
● (ii) WSD within POS
The POS taggers can help in WSD when the ambiguity is across POSs. For example, consider the
two sentences 'He chairs the session' and 'The chairs in this room are comfortable'. The POS taggers
mark the words with appropriate POS tags. These taggers use certain heuristic rules, and hence may
sometimes go wrong. The reported performances of these POS taggers vary between 95% and 97%.
However, they are still useful, since they reduce the search space for meanings substantially.
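To see how a POS tagger separates the two uses of 'chairs', here is a small NLTK snippet; depending on your NLTK version you may first need to download the tokenizer and tagger resources named below.

```python
# POS tagging to disambiguate 'chairs' across parts of speech (NLTK).
import nltk
nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger model

for sent in ["He chairs the session", "The chairs in this room are comfortable"]:
    tokens = nltk.word_tokenize(sent)
    print(nltk.pos_tag(tokens))
    # 'chairs' should come out as a verb (VBZ) in the first sentence
    # and as a plural noun (NNS) in the second.
```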
However, disambiguation in the case of Polysemous words requires disambiguation rules. It is not an
easy task to frame such rules. It is the context, which plays a crucial role in disambiguation. The
context may be
◆ the words in proximity, or
◆ other words in the sentence that are related to the word to be disambiguated.
The question is: how can such rules be made efficiently? To frame disambiguation rules manually
would require thousands of man-years. Is it possible to use machines to automate this process? The
wasp workbench [8] is the best example of how, with the help of a small amount of seed data, machines can
learn from the corpus and produce disambiguation rules. Anusaaraka uses the wasp workbench to
semi-automatically generate these disambiguation rules. The output produced at this stage is
irreversible, since the machine makes choices based on heuristics.


Preposition Placement:
English has prepositions whereas Hindi has postpositions. Hence, it is necessary to move the
prepositions to their proper positions in Hindi before substituting their meanings. While moving the
prepositions from their English positions to the proper Hindi positions, a record of their movements
must be stored so that, if the need arises, they can be reverted to their original positions. Therefore, the
transformations performed by this module are also reversible.
Hindi Word Order Generation
Hindi is a free word order language. Therefore, even the anusaaraka output in the previous layer
makes sense to the Hindi reader. However, this output, not being natural Hindi, may not be enjoyed
as much as output with natural Hindi word order; additionally, it would not be treated as a translation.
Therefore, in this module the attempt is to generate the correct Hindi word order.
Interface for different linguistic tools
The second major contribution of this architecture is the concept of 'interfaces'. Machine translation
requires language resources such as POS taggers, morphological analyzers, and parsers. More than
one kind of each of these tools exists, so it is wise to reuse them. However, there are problems.
Consider parsers, for example:
I. These parsers do not have satisfactory performance. Only 40% of the time is the first parse the
correct parse. (A parse of a sentence tells how the words are related to each other; 90% of such
relations in any parse are typically correct.)
II. Each of these parsers is based on a different grammatical formalism. Hence, the output they
produce is also influenced by the theoretical considerations of that grammar formalism.
III. Since the output format for different parsers is different, it is not possible to remove one parser
and plug in another.
IV. One needs trained manpower to interpret the output produced by these parsers and to improve
their performance.
As a machine translation system developer interested in a "usable" product, one would like to plug in
different parsers and watch the performance, perhaps use combinations of them, or vote among
different parsers and choose the best parse.
The question then is how to achieve this. It is not enough to have modular programs; the parser itself
is an independent module. What is required is a plug-in facility for different parsers. This is possible
provided all the parsers produce output in some common format. Hence, interfaces are necessary to
map the output of each parser to an intermediate form, as illustrated in the figure.


Anusaaraka output and the user interface


A Java-based user interface has been developed to display the outputs produced by the different layers
of the anusaaraka engine. The user interface provides flexibility to control the display. A snapshot of a
sample English-Hindi anusaaraka output, with a brief explanation of each of the layers, is provided below:
Layer1
◆ Row 1: Original English sentence
◆ Row 2: Word level substitution
❑ Least fragile layer.
❑ Contains Hindi Padasutra (word formula) for each English word.
For example: small -> CotA^alpa, rats -> cUhA{s}
◆ Row 3: Word Grouping
❑ The group of words which as a group give some new meaning (e.g. compounding) are
grouped together. In the above sentence, are + ing = 0_rahA_hE

◆ Row 4: Word Sense Disambiguation


❑ Attempts to select the appropriate sense according to the context. For example, the big cats
-> vyaaghraadii
◆ Row 5: Preposition Movement
❑ The prepositions are moved to their correct Hindi positions. E.g. ‘->meM — jangala’ is
changed to ‘— — jangala+meM’.
Layer 2: Hindi anuvAda
Proper Hindi sentence is generated


Anusaaraka: A better approach for Machine Translation because it is


Robust
It always produces an output. If the machine fails at the higher levels, which are in principle fragile, the
lower level outputs are still available to the user. However, to understand the lower level outputs, some
training is required.
Transparent
The output at different levels makes the whole process of machine Translation transparent to the user. This
opens up a new opportunity for people with an aptitude for language analysis to contribute to
Machine Translation efforts even without any formal training in computational linguistics or NLP.

Giving up Agreement in Anusaraka Output:


Anusaraka is an MT system that utilizes a pivot language to enable translation between language pairs
for which direct translation data may be limited or unavailable. In certain scenarios, when using a pivot
language, the translation system may have to give up agreement in the output. Let's understand what
"giving up agreement" means in this context.
Agreement, in the context of language and syntax, refers to the grammatical relationship between
different words in a sentence. It ensures that different elements in a sentence, such as nouns, pronouns,
and verbs, are in harmony with one another regarding features like number, gender, and person. For
example, in English, there is a subject-verb agreement, where a singular subject requires a singular
verb, and a plural subject requires a plural verb.
When using a pivot language in Anusaraka or other similar pivot-based translation methods, the
system may not always capture the full agreement between words in the source and target languages.
This can happen for several reasons:
Indirect Translation: When translating from the source language to the pivot language and then to the
target language, the agreement information may get lost or altered in the process.
Divergent Syntax: Different languages have different sentence structures and rules for agreement. When
using a pivot language, the translation system may not be able to fully align the agreement features
between the source and target languages.


Limited Training Data: If the pivot language has limited training data for certain agreement patterns, the
translation system may not accurately handle those patterns during translation.
As a result, the translated output in the target language may lack full agreement with the original
source language. This can lead to grammatical errors and less fluent or natural-sounding translations.
However, it's worth noting that researchers are continually working to improve pivot-based translation
systems, and many methods aim to preserve agreement as much as possible. Techniques like
multilingual training, cross-lingual pre-training, and transfer learning help in capturing syntactic and
grammatical relationships between languages, which can lead to better agreement in the translated
output.
Nevertheless, when using pivot-based approaches, there may still be cases where complete agreement
between source and target languages is not achievable, and the translation system may need to "give
up" certain aspects of agreement to produce a feasible translation. This trade-off is one of the
challenges of pivot-based machine translation, especially for complex linguistic phenomena like
agreement.

Language Bridges
In the context of machine translation, language bridges refer to the techniques and approaches that
enable communication and translation between different language pairs, even if direct parallel training
data between those pairs is limited or unavailable. These bridges allow machine translation models to
leverage knowledge from one or more intermediary languages to improve translation quality for
language pairs that lack direct translation data.
Language bridges in machine translation are particularly useful for low-resource languages, where
obtaining sufficient parallel data for direct translation between two specific languages is challenging.
By introducing an intermediate language(s) for which sufficient parallel data is available, the model
can effectively learn to bridge the gap and perform translations across multiple languages.
Here are some common approaches used to build language bridges in machine translation:
Pivot-based Translation: In this approach, the model translates the source language to an intermediate
language (pivot) for which there is ample parallel data available, and then translates from the pivot
language to the target language. The overall translation is achieved by combining these two translation
steps. This technique allows the model to handle language pairs without direct parallel data, as long as
there is a path through a pivot language.
Multilingual Translation Models: Instead of training separate models for each language pair, multilingual
translation models are trained to handle multiple languages simultaneously. These models can share
information between languages during training, effectively creating a language bridge.
Zero-Shot Translation: By training a model on multiple languages, it can be capable of translating
between language pairs it has never seen during training. This is achieved by leveraging the shared
representations learned during multilingual training.
Cross-Lingual Pre-training: Models are pre-trained on a large corpus containing text from multiple
languages, learning to encode multilingual representations. These pre-trained models can then be
fine-tuned for specific translation tasks.


Transfer Learning: Knowledge gained from translating between high-resource language pairs can be
transferred to improve translation performance for low-resource language pairs.
Bilingual Lexicons and Dictionaries: Bilingual word dictionaries or lexicons can be utilized to map
words between languages, providing a way to perform translation using explicit word alignments.
Language bridges in machine translation are valuable because they help expand the reach of translation
systems to more languages, including those with limited linguistic resources. However, it's important
to consider that using a pivot language or indirect training can introduce errors or inefficiencies in
translation, and the quality of the translation may vary depending on the quality and suitability of the
intermediary language. Researchers are continually exploring ways to improve language bridges and
multilingual translation techniques to enhance translation performance across various language pairs.
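As a rough sketch of the pivot-based approach described above, the snippet below composes two translation steps through English as the pivot. The translate function is a stand-in for a real MT engine and is purely hypothetical; only the two-step composition is the point.

```python
# Pivot-based translation: source -> pivot (English) -> target.
# `translate` is a placeholder for a real MT engine (hypothetical API).

def translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for an actual machine translation call."""
    return f"[{src}->{tgt}] {text}"              # pretend translation

def pivot_translate(text: str, src: str, tgt: str, pivot: str = "en") -> str:
    # Step 1: translate the source-language text into the pivot language.
    pivot_text = translate(text, src, pivot)
    # Step 2: translate the pivot-language text into the target language.
    return translate(pivot_text, pivot, tgt)

print(pivot_translate("bonjour le monde", src="fr", tgt="hi"))
```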

Multilingual Information Retrieval


Multilingual Information Retrieval (MLIR) refers to the ability to process a query for information in
any language, search a collection of objects, including text, images, sound files, etc., and return the
most relevant objects, translated if necessary into the user's language. The explosion in recent years of
freely-distributed unstructured information in all media, most notably on the World Wide Web, has
opened the traditional field of Information Retrieval (IR) up to include image, video, speech, and other
media, and has extended out to include access across multiple languages.
Key aspects and challenges in Multilingual Information Retrieval include:
Cross-Language Retrieval: Multilingual IR involves retrieving documents in a target language based
on a query expressed in a source language. This requires techniques to bridge the language gap
between the query and the documents.
Machine Translation: One common approach in multilingual IR is to use machine translation to
translate the query from the source language to the target language before performing retrieval. This
requires effective translation models that can accurately convey the user's information needs across
languages.
Cross-Lingual Information Retrieval (CLIR): CLIR focuses on retrieving documents in a different
language than the user's query without necessarily translating the query. CLIR systems often rely on
multilingual techniques like parallel corpora, cross-lingual word embeddings, and statistical language
models.
Language Identification: Multilingual IR systems need to identify the language of a given query or
document accurately. Language identification is essential to route the query to the appropriate retrieval
system or to conduct cross-lingual searches.
Multilingual Corpora and Resources: Building multilingual IR systems requires access to
large-scale multilingual corpora, parallel texts, and bilingual dictionaries. These resources are used to
train machine translation models and build cross-lingual representations.
Multilingual Evaluation: Developing effective evaluation methodologies for multilingual IR systems
is challenging. Metrics need to account for differences in languages, relevance judgments, and user
satisfaction across diverse linguistic contexts.


Handling Low-Resource Languages: Low-resource languages present additional challenges in
multilingual IR due to limited training data and resources. Techniques like transfer learning and
cross-lingual transfer are often used to handle such languages.
Applications of Multilingual Information Retrieval include:
Cross-Lingual Web Search: Allowing users to retrieve web pages and information in languages they
do not understand.
Cross-Lingual Document Retrieval: Facilitating access to relevant documents in a target language
for users searching in a different language.
Multilingual Question Answering: Enabling users to ask questions in one language and receive
answers from documents in multiple languages.
Cross-Lingual Plagiarism Detection: Detecting instances of plagiarism or duplicate content across
documents in different languages.
Multilingual Information Retrieval is an active area of research, and various techniques, ranging from
machine translation to cross-lingual embeddings and multilingual learning, continue to advance the
capabilities of cross-lingual search and information retrieval systems.

Document Pre-processing
Document pre-processing is a critical step in information retrieval (IR) that involves transforming raw
text documents into a format suitable for efficient indexing, storage, and retrieval. The main goal of
document pre-processing is to prepare the text data so that it can be effectively searched and
matched against user queries in an IR system. Several important tasks are typically performed during
document pre-processing:
Tokenization: The first step is to break down the raw text into smaller units called tokens, which are
usually words or subwords. Tokenization makes it possible to process individual words and enables
further analysis and indexing.
Lowercasing: Converting all tokens to lowercase is a common pre-processing step to ensure
case-insensitive search. This helps in retrieving relevant documents regardless of the case used in the
user query.
Stopword Removal: Stopwords are common words like "the," "is," "and," which appear frequently in
a language but do not carry significant meaning for information retrieval. Removing stopwords helps
reduce noise and saves space during indexing.
Stemming and Lemmatization: Stemming and lemmatization are techniques used to reduce words to
their root or base form. This helps in grouping different inflected forms of a word together, enabling
broader search coverage and reducing indexing overhead.
Normalization: Normalizing textual data involves converting abbreviations, acronyms, and numerical
expressions to their standard or expanded forms. This ensures consistency and improves search
precision.


Character Encoding and Unicode Handling: Ensuring proper character encoding and handling
Unicode text is essential for processing text data from various languages and scripts.
Special Character and Punctuation Handling: Depending on the application, special characters and
punctuation may be removed or retained to improve search accuracy.
Named Entity Recognition: Identifying and annotating named entities (e.g., person names, locations)
can be useful for improving search relevance and enabling entity-based retrieval.
Feature Extraction: Depending on the IR system, additional features like n-grams, part-of-speech tags,
and sentiment scores may be extracted to enrich the representation of the documents.
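A compact sketch of the first few pre-processing steps (tokenization, lowercasing, stopword removal, stemming) using NLTK. The resource names assume the standard NLTK downloads, and the Porter stemmer is just one possible choice.

```python
# Document pre-processing sketch: tokenize, lowercase, remove stopwords, stem.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())                       # tokenize + lowercase
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stop]   # drop stopwords/punctuation
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]                        # reduce words to stems

print(preprocess("The cats are chasing the mice in the garden."))
# e.g. ['cat', 'chase', 'mice', 'garden']
```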
The pre-processed documents are then indexed to create an inverted index, which maps each term in
the document collection to the documents that contain it. This index facilitates fast and efficient
retrieval of relevant documents when users submit queries.
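A minimal illustration of an inverted index built over a few toy documents; this is a bare dictionary-based sketch rather than a production index, and the documents are invented.

```python
# Build a toy inverted index: term -> set of document ids containing it.
from collections import defaultdict

docs = {
    0: "machine translation translates text automatically",
    1: "information retrieval finds relevant documents",
    2: "machine learning improves information retrieval",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():         # pre-processing kept trivial here
        inverted_index[term].add(doc_id)

# Documents containing both query terms (simple boolean AND retrieval).
query = ["information", "retrieval"]
result = set.intersection(*(inverted_index[t] for t in query))
print(sorted(result))                          # [1, 2]
```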
Document pre-processing is a crucial step that impacts the overall performance and efficiency of an
information retrieval system. It is essential to strike the right balance between text normalization and
feature extraction to achieve accurate and relevant search results for different types of user queries.

Monolingual Information Retrieval:


Monolingual Information Retrieval (IR) is a subfield of information retrieval that deals with the
retrieval of relevant information from a single language document collection in response to user
queries written in the same language. In monolingual IR, the focus is on processing and retrieving
information within a specific language, without considering information from other languages.
Key components and processes involved in Monolingual Information Retrieval include:
Document Indexing: The first step in monolingual IR is to index the documents in the collection. This
involves analyzing and tokenizing each document, and then creating an inverted index that maps terms
(words or phrases) to the documents that contain them.
Query Processing: When a user submits a query, it goes through a similar pre-processing stage as the
documents. The query is tokenized, and additional pre-processing steps such as lowercasing and
stopword removal may be applied.
Term Weighting: To rank documents by their relevance to the query, monolingual IR systems employ
term weighting schemes. Commonly used methods include TF-IDF (Term Frequency-Inverse
Document Frequency), which assigns higher weights to terms that are more frequent in the document
but less common across the collection.
Ranking: Based on the term weights, the system ranks the documents in the collection to determine
their relevance to the query. The top-ranked documents are then presented to the user as search results.
Retrieval Models: Different retrieval models, such as the Vector Space Model and the Probabilistic
Model (e.g., BM25), are used to compute the relevance scores and rank the documents.
Query Expansion: In some cases, monolingual IR systems may employ query expansion techniques
to improve retrieval performance. Query expansion involves expanding the original query with
additional terms, synonyms, or related concepts to retrieve more relevant documents.


Evaluation: The effectiveness of a monolingual IR system is assessed using evaluation metrics such
as precision, recall, F1 score, and Mean Average Precision (MAP) to measure the quality of the
retrieved results.
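Tying the components above together, here is a short scikit-learn sketch that applies TF-IDF term weighting and ranks three toy documents against a query by cosine similarity. The documents and the query are invented for illustration.

```python
# TF-IDF weighting and ranking with scikit-learn (toy documents).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "machine translation converts text between languages",
    "information retrieval ranks documents for a query",
    "neural networks learn representations of text",
]
query = "ranking documents for a search query"

vectorizer = TfidfVectorizer()                        # TF-IDF term weighting
doc_matrix = vectorizer.fit_transform(documents)      # index the collection
query_vec = vectorizer.transform([query])             # process the query the same way

scores = cosine_similarity(query_vec, doc_matrix)[0]  # relevance scores
ranking = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
for doc_id, score in ranking:
    print(f"doc {doc_id}: {score:.3f}")
```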
Applications of Monolingual Information Retrieval include:
Web Search: Retrieving relevant web pages based on user queries.
Document Search: Searching for relevant documents within a specific corpus or database.
Information Retrieval in Social Media: Finding relevant posts, tweets, or comments on social media
platforms.
Enterprise Search: Enabling users to search for information within an organization's internal
documents and databases.
Monolingual Information Retrieval forms the foundation of many search engines and information
retrieval systems, allowing users to efficiently access relevant information from large document
collections in their native language. Improving the accuracy and efficiency of monolingual IR is an
ongoing area of research in the field of information retrieval and natural language processing.

CLIR
CLIR stands for Cross-Lingual Information Retrieval. It is a subfield of information retrieval (IR) that
focuses on retrieving relevant information from a document collection in one language (the source
language) in response to user queries expressed in another language (the target language). In other
words, CLIR enables users to search for information in a language they may not understand by
matching their queries against documents written in a different language.
CLIR is particularly useful in multilingual environments and scenarios where users need to access
information from documents in languages they are not familiar with. It plays a crucial role in enabling
cross-lingual access to information and breaking down language barriers in information retrieval.
Key aspects of Cross-Lingual Information Retrieval include:
Machine Translation: The central challenge in CLIR is to bridge the language gap between the source
and target languages. Machine translation is often used to automatically translate the user queries from
the target language to the source language or vice versa, so they can be matched against the
documents.
Language Identification: CLIR systems need to identify the language of the user query to route it to
the appropriate language-specific retrieval system or to conduct cross-lingual searches effectively.
Cross-Lingual Indexing: In CLIR, the documents in the source language may need to be indexed and
represented in a way that facilitates cross-lingual matching with queries in the target language.
Cross-Lingual Information Retrieval Models: CLIR systems use retrieval models that can
effectively rank documents in the source language based on their relevance to queries in the target
language. Various models, such as multilingual extensions of the Vector Space Model or BM25, are
used for this purpose.


Evaluation: The performance of CLIR systems is evaluated using metrics like Mean Average
Precision (MAP) or cross-lingual variants of precision and recall to measure the quality of the
retrieved results.
Applications of Cross-Lingual Information Retrieval include:
Multilingual Web Search: Allowing users to search for information on the web in a language they
don't understand, with results retrieved from multiple languages.
Multilingual Document Search: Retrieving relevant documents in different languages based on user
queries in one specific language.
Multilingual Question Answering: Providing answers to user questions in one language by retrieving
information from documents in multiple languages.
CLIR is a challenging task, as it requires effective machine translation systems and robust
cross-lingual retrieval models. Ongoing research in CLIR aims to improve the accuracy and efficiency
of cross-lingual information retrieval to facilitate access to multilingual information in diverse
linguistic contexts.

Evaluation in Information Retrieval


Evaluation in Information Retrieval (IR) is the process of measuring and assessing the effectiveness
and performance of an IR system or search engine. The goal of evaluation is to determine how well the
system retrieves relevant information in response to user queries and to identify areas for
improvement.
Key components and concepts in IR evaluation include:
Relevance Judgment: Relevance judgments are annotations provided by human assessors that
indicate whether a document is relevant or non-relevant to a given query. These judgments form the
basis for evaluating the system's retrieval performance.
Precision and Recall: Precision is the proportion of retrieved documents that are relevant to the query.
Recall is the proportion of relevant documents that are retrieved. These metrics are fundamental for
assessing the retrieval quality.
F-measure (F1 Score): The F-measure is the harmonic mean of precision and recall and provides a
single metric that balances both aspects. It is commonly used when both precision and recall are
essential evaluation criteria.
Mean Average Precision (MAP): MAP is a popular metric used in IR evaluation. It calculates the
average precision across multiple queries, providing a single number to assess the overall performance
of the system.
NDCG (Normalized Discounted Cumulative Gain): NDCG is used to evaluate the effectiveness of
ranked retrieval results. It considers the relevance level of retrieved documents and assigns higher
weights to more relevant documents.
Mean Reciprocal Rank (MRR): MRR measures the average rank of the first relevant document
across all queries. It is commonly used in evaluating systems that return ranked lists of results.


Precision-Recall Curve: The precision-recall curve plots precision against recall for various retrieval
settings, providing insights into the trade-off between precision and recall.
Interpolation of Recall and Precision: This method involves computing precision values at specific
recall levels and then interpolating to estimate precision at other recall levels.
User Studies: In addition to quantitative evaluation, qualitative user studies can provide valuable
feedback on the user experience and relevance of the retrieved results.
Test Collections: Evaluation is often conducted on test collections, which are predefined datasets
containing queries, relevant documents, and relevance judgments. Standard test collections enable fair
comparison between different IR systems.
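A small sketch computing precision, recall, F1, and average precision for a single query from a ranked result list; averaging the last value over many queries would give MAP. The document ids and relevance judgments below are invented.

```python
# Precision, recall, F1 and average precision for one query (toy data).
retrieved = ["d3", "d1", "d7", "d2", "d9"]       # ranked system output
relevant = {"d1", "d2", "d5"}                    # human relevance judgments

hits = [d for d in retrieved if d in relevant]
precision = len(hits) / len(retrieved)            # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)                # fraction of relevant docs that are retrieved
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Average precision: mean of precision values at the ranks of relevant documents.
precisions_at_hits = []
num_hits = 0
for rank, doc in enumerate(retrieved, start=1):
    if doc in relevant:
        num_hits += 1
        precisions_at_hits.append(num_hits / rank)
average_precision = sum(precisions_at_hits) / len(relevant)

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} AP={average_precision:.2f}")
```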
IR evaluation is an ongoing research area, and researchers continuously explore new metrics and
evaluation methodologies to capture the effectiveness of modern search engines and retrieval systems
accurately.
When reporting evaluation results, it is essential to specify the evaluation metric used, the test
collection, and any additional parameters or conditions under which the evaluation was conducted to
ensure transparency and reproducibility of the evaluation process.

Tools:
In the context of Information Retrieval and Natural Language Processing, several tools and libraries
are available to assist in various tasks. These tools range from text processing and indexing to machine
learning and evaluation. Here are some popular tools commonly used in the field:
NLTK (Natural Language Toolkit): NLTK is a Python library that provides a comprehensive set of
tools for working with human language data. It offers functionalities for text processing, tokenization,
stemming, lemmatization, part-of-speech tagging, and more.
spaCy: Another popular Python NLP library, spaCy is designed for efficient and fast natural language
processing. It offers various pre-trained models for named entity recognition, dependency parsing, and
part-of-speech tagging.
Gensim: Gensim is a Python library for topic modeling and document similarity analysis. It allows
building topic models like Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
Scikit-learn: Scikit-learn is a widely used machine learning library in Python. It offers various
algorithms and tools for classification, regression, clustering, and other machine learning tasks, which
can be applied to text data as well.
Elasticsearch: Elasticsearch is a powerful search and analytics engine used for full-text search and
information retrieval. It provides an efficient way to index and search large volumes of textual data.
Lucene: Lucene is a high-performance search library written in Java. Elasticsearch, as mentioned
above, is built on top of Lucene.
TensorFlow and PyTorch: These deep learning frameworks are popular for building and training
neural networks for NLP tasks, such as sentiment analysis, text classification, and machine translation.


trec_eval: trec_eval is a widely used tool for evaluating IR systems. It provides various evaluation
metrics, such as Precision, Recall, MAP, and NDCG, and is often used with test collections for IR
evaluation.
OpenNLP: Apache OpenNLP is a Java library for natural language processing tasks, including
tokenization, sentence detection, named entity recognition, and more.
Stanford NLP: The Stanford NLP toolkit offers a suite of NLP tools, including part-of-speech
tagging, named entity recognition, dependency parsing, and sentiment analysis.
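As a quick taste of one of these libraries, the snippet below runs spaCy for POS tags and named entities; it assumes the small English model en_core_web_sm has already been installed, and the example sentence is invented.

```python
# Quick spaCy example: tokenization, POS tagging and named entity recognition.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google released a new translation model in Hyderabad last year.")

for token in doc:
    print(token.text, token.pos_)        # each word and its part-of-speech tag

for ent in doc.ents:
    print(ent.text, ent.label_)          # named entities, e.g. Google -> ORG
```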

Multilingual Automatic Summarization


Multilingual Automatic Summarization is the task of generating concise and coherent summaries of
text documents in multiple languages. The goal of multilingual summarization is to produce
high-quality summaries that capture the main points and key information from the source document,
irrespective of the language in which the document is written.
Challenges in Multilingual Automatic Summarization include:
Language-specific Features: Different languages have distinct linguistic features and structures,
making it challenging to design summarization models that work effectively across multiple languages.
Limited Training Data: Summarization models often require large amounts of training data for each
language. However, some languages may have limited resources available, which can affect the quality
of summaries in low-resource languages.
Cross-Lingual Generalization: Effective summarization models need to generalize well across various
languages, even if they were trained primarily on data from a specific language.

Approaches to Multilingual Automatic Summarization:


Multilingual Pre-training: Some approaches involve pre-training summarization models on large
multilingual datasets. This helps the model learn shared representations and improves performance in
multiple languages.
Transfer Learning: Techniques such as cross-lingual transfer learning can be used to leverage
knowledge from high-resource languages to improve summarization in low-resource languages.
Machine Translation: One common method is to translate documents from various languages into a pivot language (e.g., English), perform summarization in the pivot language, and then translate the resulting summary back into the original language (a minimal sketch of this pivot pipeline appears after this list).
Multilingual Embeddings: Using cross-lingual word embeddings can aid in capturing semantic
similarities between words in different languages, benefiting the summarization process.
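The machine translation (pivot) approach above can be sketched as follows. Here translate and summarize_english are hypothetical placeholder functions, not calls to any real library; in practice they would be backed by an MT service and a monolingual summarizer such as the extractive one shown later in this unit.

def translate(text, source, target):
    # Hypothetical machine translation call; replace with a real MT system or API.
    raise NotImplementedError

def summarize_english(text, max_sentences=3):
    # Hypothetical English summarizer; replace with any summarization model.
    raise NotImplementedError

def pivot_summarize(document, language, pivot="en"):
    # 1. Translate the source document into the pivot language (e.g., English).
    pivoted = translate(document, source=language, target=pivot)
    # 2. Summarize in the pivot language, where training resources are plentiful.
    pivot_summary = summarize_english(pivoted)
    # 3. Translate the summary back into the original language.
    return translate(pivot_summary, source=pivot, target=language)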
Applications of Multilingual Automatic Summarization:
Cross-Lingual News Summarization: Summarizing news articles written in different languages to
provide a multilingual summary for users.
Multilingual Document Summarization: Generating concise summaries of research papers, reports,
or documents in multiple languages.


Multilingual Social Media Summarization: Creating summaries of multilingual content from social
media platforms.
Multilingual Automatic Summarization is an active area of research, with ongoing efforts to develop
effective techniques that can generate high-quality summaries in diverse languages. As advances in
NLP and machine learning continue, the quality and applicability of multilingual summarization
models are likely to improve, enabling more efficient access to information in multiple languages.

Approaches to Summarization
Automatic summarization is the task of generating a concise and coherent summary of a longer text,
such as a document or an article. There are two primary approaches to automatic summarization:
extractive summarization and abstractive summarization. Each approach has its advantages and
challenges.
Extractive Summarization:
In extractive summarization, the summary is generated by selecting and extracting sentences or
phrases directly from the original text; these selected sentences form the summary. The advantage of this approach is that the summary is built from sentences already present in the source document, so the information stays accurate and grammatically well-formed, although the extracted sentences may not always flow together smoothly. Some common techniques used in extractive summarization include:
Sentence Scoring: Sentences are assigned scores based on various features, such as the frequency of
important words or phrases, sentence position, and grammatical structure. The top-scoring sentences
are included in the summary.
Graph-based Methods: Sentences are represented as nodes in a graph, and edges between sentences represent their similarity. Graph algorithms like PageRank are used to identify the most important sentences for the summary (a short sketch of this idea appears after this list).
Machine Learning: Supervised or unsupervised machine learning models can be trained to classify
sentences as relevant or not relevant to the summary.
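As referenced above, a minimal TextRank-style sketch of the graph-based method might look like this. It assumes scikit-learn and networkx are installed; the function name and parameters are illustrative only.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, num_sentences=3):
    # Build a sentence-to-sentence similarity matrix from TF-IDF vectors.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    # Treat the matrix as a weighted graph and score sentences with PageRank.
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph)
    # Keep the top-scoring sentences, re-ordered as they appear in the source.
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))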
Abstractive Summarization:
Abstractive summarization, on the other hand, involves generating a summary by rephrasing and
paraphrasing the content from the source document in a more concise manner. The generated summary
may contain words and phrases that are not present in the original text. Abstractive summarization is
more challenging because it requires natural language generation and a deeper understanding of the
text. Some common techniques used in abstractive summarization include:
Sequence-to-Sequence Models: Recurrent Neural Networks (RNNs) or Transformer-based models, such as the Encoder-Decoder architecture, are used to generate summaries by encoding the input text and decoding the summary (a short example with a pre-trained model follows this list).
Attention Mechanism: Attention mechanisms help the model focus on relevant parts of the source
text during the decoding process, enabling more contextually appropriate summaries.
Copy Mechanism: Copy mechanisms allow the model to copy words or phrases directly from the
source text into the summary, helping to preserve the originality and accuracy of the content.
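A minimal abstractive sketch using a pre-trained encoder-decoder model is shown below. It assumes the Hugging Face transformers library and the checkpoint sshleifer/distilbart-cnn-12-6, neither of which is prescribed by these notes; any sequence-to-sequence summarization model could be substituted.

from transformers import pipeline

# Load a pre-trained encoder-decoder summarization model (assumed checkpoint).
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = ("Machine translation systems have improved rapidly with neural "
           "sequence-to-sequence models, enabling fluent translations across "
           "dozens of languages and supporting downstream tasks such as "
           "cross-lingual summarization.")

# The model rephrases the input rather than copying whole sentences.
result = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])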


Hybrid approaches that combine elements of both extractive and abstractive summarization are also
being explored to leverage the strengths of each approach.
Summarization techniques continue to be an active area of research in Natural Language Processing
(NLP) and Artificial Intelligence (AI). Advancements in deep learning and large-scale language
modeling have significantly improved the quality and effectiveness of automatic summarization
systems. However, generating human-like summaries that capture the nuances and subtleties of the
source text remains a challenging and open research problem.

Evaluation:
Evaluating Multilingual Automatic Summarization is a challenging task due to the inherent
complexities of dealing with multiple languages and the scarcity of high-quality evaluation datasets in
multiple languages. Nonetheless, there are several evaluation strategies and metrics that can be used to
assess the performance of multilingual summarization systems:
Multilingual Test Collections: Building multilingual test collections is one approach to evaluate
summarization systems across multiple languages. These collections consist of documents in different
languages along with corresponding human-generated summaries. The performance of the system is
measured based on how well the generated summaries match the human-authored summaries in terms
of content, coherence, and conciseness.
Bilingual Evaluation Understudy (BLEU): BLEU is a popular metric for evaluating machine translation quality; it measures the n-gram overlap (precision) between system output and one or more reference texts. It can be adapted for evaluating multilingual summarization systems by comparing the generated summaries against human-authored summaries in each language.
Multilingual ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is the standard summarization metric; variants such as ROUGE-1, ROUGE-2, and ROUGE-L measure recall-oriented n-gram and longest-common-subsequence overlap between system-generated summaries and human summaries, and it can likewise be computed per language for multilingual systems (a small scoring sketch follows this list).
Cross-Lingual Comparisons: When a multilingual summarization system is capable of summarizing
content from one language to another, cross-lingual comparisons can be made to evaluate the quality of
summaries generated in a target language compared to the source language.
Transfer Learning: Evaluating how well a summarization model trained on data from one language
generalizes to other languages is an important aspect of multilingual summarization evaluation.
Cross-lingual transfer learning techniques can be used to leverage knowledge from high-resource
languages to improve performance in low-resource languages.
User Studies: Conducting user studies with participants proficient in different languages can provide
insights into the quality and usefulness of multilingual summaries from the perspective of end-users.
Manual Assessment: Manual assessment by human evaluators fluent in different languages can be
employed to rate the quality of summaries based on criteria such as informativeness, coherence, and
language fluency.

How to Build a Summarizer:


Building a summarizer involves designing and implementing a system that can analyze a document
and generate a concise and coherent summary. Here's a step-by-step guide on how to build a basic extractive summarizer using Python, followed by a minimal sketch that implements the steps:
Install Required Libraries: First, ensure you have the necessary Python libraries installed, such as
NLTK, spaCy, or Gensim, depending on your choice of text processing and NLP tools.
Preprocess the Text: Clean and preprocess the input text by removing any irrelevant content, special
characters, and unnecessary whitespace. Tokenize the text into sentences or words, depending on the
level of summarization required.
Calculate Sentence Scores: Assign a score to each sentence in the document based on its relevance to
the overall content. Common approaches include TF-IDF, TextRank, or other sentence scoring
methods.
Rank Sentences: Sort the sentences based on their scores in descending order to identify the most
important sentences in the document.
Select Top Sentences: Decide on the number of sentences you want in the summary (e.g., 3 to 5
sentences). Select the top-scoring sentences to include in the summary.
Combine Selected Sentences: Concatenate the selected sentences to create the summary.
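Putting the steps together, here is a minimal frequency-based sketch using NLTK. It assumes the punkt and stopwords data have been downloaded; a TF-IDF or TextRank score could be dropped in place of the simple word-frequency score used here.

from collections import Counter
import nltk
from nltk.corpus import stopwords

def summarize(text, num_sentences=3):
    # Steps 1-2: preprocess and tokenize into sentences and content words.
    sentences = nltk.sent_tokenize(text)
    stop_words = set(stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalnum() and w.lower() not in stop_words]
    freqs = Counter(words)

    # Step 3: score each sentence by the frequency of its content words.
    def score(sentence):
        return sum(freqs.get(w.lower(), 0) for w in nltk.word_tokenize(sentence))

    # Steps 4-6: rank sentences, keep the top ones, and join them back in
    # their original order so the summary stays readable.
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))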

Competitions and Datasets:


Competitions and datasets play a crucial role in advancing the state-of-the-art in information retrieval
and natural language processing. They provide researchers, developers, and practitioners with
standardized evaluation benchmarks and access to large-scale data for training and testing models.
Here are some popular competitions and datasets in the field:

Competitions:
Text Retrieval Conference (TREC): TREC is an annual competition organized by the National Institute
of Standards and Technology (NIST) that focuses on various IR tasks, including ad-hoc search,
question answering, and entity linking.
Conference on Machine Translation (WMT): WMT holds annual shared tasks on machine translation,
where participants are challenged to develop high-quality translation systems for different language
pairs.
Document Understanding Conference (DUC): DUC focuses on document summarization tasks, where
participants are required to generate concise summaries for a given set of documents.
Text Analysis Conference (TAC): TAC organizes multiple tracks, including entity linking, knowledge
base population, and sentiment analysis, aiming to advance various NLP tasks.
SemEval (Semantic Evaluation): SemEval is a series of evaluations focused on various NLP tasks,
such as sentiment analysis, relation extraction, and textual entailment.
Machine Reading for Question Answering (MRQA): MRQA is a shared task that challenges participants to build models capable of answering questions based on given passages of text, with an emphasis on generalization to unseen domains.


Datasets:
Common Crawl: Common Crawl provides a vast collection of publicly available web pages, useful for
training and testing web search and information retrieval systems.
Wikipedia Dump: Wikipedia provides large text corpora in multiple languages, which can be used for
training language models and building multilingual NLP systems.
Gigaword Corpus: Gigaword is a large-scale corpus containing news articles and is often used for text
summarization tasks.
CoNLL: The CoNLL series includes datasets for tasks like named entity recognition, part-of-speech
tagging, and dependency parsing.
GLUE (General Language Understanding Evaluation): GLUE offers a collection of datasets for
evaluating the performance of language models on various NLP tasks.
SQuAD (Stanford Question Answering Dataset): SQuAD provides a large dataset of question-answer
pairs, enabling the evaluation of reading comprehension and question answering systems.
Multi30K: Multi30K is a multilingual dataset that pairs images with descriptions in multiple languages (originally English and German), and is widely used for multimodal machine translation and multilingual image captioning.
