UNIT – V
Machine Translation and Multilingual Information
Machine translations are most effective when used to get a general idea of a passage or piece of
content. However, when it comes to word-for-word accuracy from one language to another, a
machine won't deliver consistently high accuracy. You need a human touch to review and edit a
translation for the best possible level of accuracy.
Machines are very literal. They can’t understand how a mistranslated word or phrase could change
the meaning of a passage in different contexts.
A human eye and ear on your translation can save you from an embarrassing error. A machine will
miss nuances or contexts that make a passage accurate and relevant.
From puns to sarcasm, machines miss those nuances while a human translator can listen to a phrase
and understand how to translate it in a way that makes sense culturally in the target language.
You might find a quicker (almost instant) turnaround time on a machine translation, but you’ll likely
have to spend valuable time reviewing and correcting errors.
Consider a quick, yet inaccurate translation plus the additional time you’ll spend fixing it. Now
compare that to the time for a certified human translator to translate your piece, check for errors, and
guarantee their final work.
It might take slightly longer to get your final translation, but the accuracy is worth it. If you allow
enough time to meet your deadline, human translation always wins.
Machine translators are finicky. They might have limits to the types of file formats they can read,
which limits your options when choosing an engine to use. There may also be significant limits on
file size, narrowing the field even further.
If you recorded audio or typed your document in an unaccepted format, you’re stuck without a
translator. Choosing a human translator often gives you more options to use your file, no matter the
format or size.
While human translators may have preferred file formats, outright file rejection is less likely, and
providing what they need means that your project will get done faster.
Machines can’t think, and they don’t ask questions. They use programming to interpret words and
find the closest equivalent in another language for the translation.
Some languages don’t have an exact equivalent in a different language. Machines either translate to a
similar (but inaccurate) word, or you’ll find blatant gaps in the translations.
When you work with a professional human translator, they understand language barriers. If a word or
phrase has no exact translated meaning, they’ll recommend a suitable equivalent translated phrase to
maintain the integrity of your document across languages.
You get what you pay for. Trying to save a buck on a “free” translation service can cost you more
money to edit the translation. Whether you do it yourself or pay someone to edit, you’re not getting
anything for free.
Time is valuable. Your staff has better things to do than edit a free machine translation full of errors.
It’s a better investment of time and budget dollars to work with a certified human translator from the
start.
A machine is not an expert in any language or specific industry jargon. Again, a machine translation
service is a program. It's only as accurate as the people who developed the software and the material
that it's fed.
Certified translators are experts in their languages, and often, complex industries. Find a translator
that specializes in the parent language and the end-result language. You’ll save yourself from
potentially offensive or dangerous errors in the final results.
Why keep it in five words when you can say it in fifteen words in a different language? If you notice
your document is much longer than the original, your machine translator probably added more words
than necessary to find a “close enough” translation.
When paying by the word for a machine translation service, beware of paying for words you don’t
need and a confusing document. Concise is always better than convoluted.
Creative content often uses words in unique contexts. You’ll also find made-up words for creative,
dramatic, or humorous effects. Machines won’t catch those subtleties in language or context.
Using a human translator will help you market to multiple audiences with varying languages while
taking creative flair into account.
● Internal communication
For a company operating in different countries across the world, communication can be difficult to
manage. Language skills can vary from employee to employee, and some may not understand the
company’s official language well enough. Machine translation helps to lower or eliminate the language
barrier in communication. Individuals quickly obtain a translation of the text and understand the
content's core message. You can use it to translate presentations, company bulletins, and other common
communication.
● External communication
Companies use machine translation to communicate more efficiently with external stakeholders and
customers. For instance, you can translate important documents into different languages for global
partners and customers. If an online store operates in many different countries, machine translation
can translate product reviews so customers can read them in their own language.
● Data analysis
Some types of machine translation can process millions of user-generated comments and deliver
highly accurate results in a short timeframe. Companies can translate the large amount of content
posted on social media and websites every day and feed it into analytics. For example, they can
automatically analyze customer opinions written in various languages.
With machine translation, brands can interact with customers all over the world, no matter what
language they speak. For example, they can use machine translation to:
● Prepare legal documents for operations in different countries.
● Make large amounts of content available for analysis that would otherwise have been difficult to
process in different languages.
Brief History:
The idea of using computers to translate human languages automatically first emerged in the early
1950s. However, at the time, the complexity of translation was far higher than early estimates by computer
scientists. It required enormous data processing power and storage, which was beyond the capabilities of
early machines.
In the early 2000s, computer software, data, and hardware became capable of doing basic machine
translation. Early developers used statistical databases of languages to train computers to translate text.
This involved a lot of manual labor and time. Each added language required them to start over with the
development for that language. Since then, machine translation has developed in speed and accuracy, and
several different machine translation strategies have emerged.
Possible Approaches:
In machine translation, the original text or language is called the source language, and the language you
want to translate it into is called the target language. Machine translation works by following a basic
two-step process: decoding the meaning of the source text, then re-encoding that meaning in the target
language. The following are some common approaches that language translation technology uses to
implement this process.
Rule-based machine translation
Language experts develop built-in linguistic rules and bilingual dictionaries for specific industries or
topics. Rule-based machine translation uses these dictionaries to translate specific content accurately.
The steps in the process are:
1. The machine translation software parses the input text and creates a transitional representation.
2. It converts the representation into the target language using the grammar rules and dictionaries as
a reference.
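The following is a minimal, illustrative sketch of the rule-based idea in Python. The tiny bilingual dictionary, the transliterated Hindi glosses, and the single reordering rule are hypothetical stand-ins; a real system would rely on morphological analysis, large dictionaries, and many grammar rules.

# Toy rule-based translation sketch (illustrative only).
# The dictionary entries and the single reordering rule below are hypothetical.

BILINGUAL_DICT = {            # word-level bilingual dictionary
    "rats": "cUhe",
    "killed": "mArA",
    "cats": "billiyoM ko",
}

def rule_based_translate(sentence: str) -> str:
    tokens = sentence.lower().split()
    # Step 1: parse the input into a transitional representation (here, just tokens).
    # Step 2: apply a grammar rule (English SVO -> Hindi SOV) and look up each word.
    if len(tokens) == 3:                      # naive subject-verb-object rule
        tokens = [tokens[0], tokens[2], tokens[1]]
    return " ".join(BILINGUAL_DICT.get(t, t) for t in tokens)

print(rule_based_translate("Rats killed cats"))   # cUhe billiyoM ko mArA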
Pros and cons
Rule-based machine translation can be customized to a specific industry or topic. It is predictable and
provides quality translation. However, it produces poor results if the source text has errors or uses
words not present in the built-in dictionaries. The only way to improve it is by manually updating
dictionaries regularly.
Statistical machine translation
Instead of relying on linguistic rules, statistical machine translation uses machine learning to translate
text. The machine learning algorithms analyze large amounts of existing human translations and look
for statistical patterns. The software then makes an intelligent guess when asked to translate a new
source text. It makes predictions based on the statistical likelihood that a specific source word or
phrase corresponds to a particular word or phrase in the target language.
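As a rough illustration of this idea, the sketch below picks, for each source phrase, the target phrase with the highest translation probability from a small hypothetical phrase table. Real statistical systems combine such translation probabilities with language models and reordering models learned from large parallel corpora.

# Toy statistical-translation sketch: choose the most probable target phrase
# for each source phrase. The phrase table and probabilities are hypothetical.

PHRASE_TABLE = {
    "good morning": {"guten Morgen": 0.8, "gut Morgen": 0.2},
    "thank you":    {"danke": 0.7, "danke dir": 0.3},
}

def translate_phrase(phrase: str) -> str:
    candidates = PHRASE_TABLE.get(phrase)
    if not candidates:
        return phrase                     # unknown phrase: pass it through
    # Pick the target phrase with the highest estimated probability.
    return max(candidates, key=candidates.get)

print(translate_phrase("good morning"))   # guten Morgen
print(translate_phrase("thank you"))      # danke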
Neural machine translation
Neural machine translation uses artificial intelligence to learn languages and to continuously
improve that knowledge using a specific machine learning method called neural networks. It often
works in combination with statistical translation methods.
Neural network
A neural network is an interconnected set of nodes inspired by the human brain. It is an information
system where input data passes through several interconnected nodes to generate an output. Neural
machine translation software uses neural networks to work with enormous datasets. Each node makes
one attributed change of source text to target text until the output node gives the final result.
Neural networks consider the whole input sentence at each step when producing the output sentence,
whereas other machine translation models break an input sentence into sets of words and phrases, mapping
them to a word or sentence in the target language. Neural machine translation systems can address
many limitations of other methods and often produce better quality translations.
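The sketch below shows, in PyTorch (assumed to be available), the bare skeleton of an encoder-decoder neural translation model: the encoder reads the whole source sentence into a hidden state, and the decoder produces target-language token scores from it. Real neural machine translation systems add attention, much larger networks, and training on millions of sentence pairs.

import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: token IDs in, target-token logits out."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sentence into a single hidden state.
        _, h = self.encoder(self.src_emb(src_ids))
        # Decode conditioned on that state (teacher forcing with tgt_ids).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)              # (batch, tgt_len, tgt_vocab)

model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))          # two source sentences of length 7
tgt = torch.randint(0, 1200, (2, 9))          # shifted target tokens
logits = model(src, tgt)
print(logits.shape)                           # torch.Size([2, 9, 1200])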
Hybrid machine translation
Hybrid machine translation tools use two or more machine translation models within one piece of
software. You can use the hybrid approach to improve the effectiveness of a single translation model.
This machine translation process commonly uses rule-based and statistical machine translation
subsystems. The final translation output is the combination of the output of all subsystems.
Hybrid machine translation models successfully improve translation quality by overcoming the issues
linked with single translation methods.
Current Status:
Neural machine translation is widely regarded as the most accurate, versatile, and fluent machine
translation approach. Since its emergence in the mid-2010s, neural machine translation has become the
most advanced machine translation technology. It is more accurate than statistical machine
translation in everything from fluency to generalization, and it is now considered the standard in machine
translation development. The quality of its output depends on factors such as the:
● Language pair being translated.
● Text types being translated. As the software performs more translations for a specific language
or domain, it produces higher quality output. Once trained, neural machine translation
becomes more accurate and faster, and it is easier to add new languages.
The Problem
• A large parallel corpus in machine-readable form is required for training.
• A robust fall-back mechanism in case of errors is difficult to provide.
• Beyond a particular threshold, improving the quality of the system is very difficult.
• 100% correct translation is not possible.
Machine translation has the following fundamental problems. Language is a device for information
exchange. Languages encode information at various levels: morphological, syntactic, pragmatic,
language conventions, and so on. Hence extracting the correct information needs extra-linguistic
information such as world knowledge, context, cultural knowledge and the language conventions of the receiving
person. But the information available in the language string is always partial. Further, there is an
inherent tension between brevity and precision in language, where brevity always wins, leading to
inherent ambiguity in language. Though machines are good at storing language data, it is
extremely difficult for machines to have the world knowledge of an average human being. Hence
determining the correct sense of an ambiguous word is a major bottleneck in machine translation
technology.
The ANUSAARAKA system has the following unique features:
• Faithful representation: The system aims to give a faithful (correct) rendering of the source rather
than merely a natural-sounding translation. The user can infer the correct meaning from the various
layers of translation output.
• No loss of information: All the information available on the source language side is made available
explicitly in the successive layers of translation.
• Graceful degradation (robust fall-back mechanism): The system ensures a safety net by providing
a “padasutra layer”, which is a word-to-word translation represented in a special formulaic form,
representing the various senses of each source language word.
The major goals of the Anusaaraka system are to:
• Reduce the language barrier by facilitating access from one language to another.
• Demonstrate the practical usability of the Indian traditional grammatical system in the modern
context.
• Enable users to become contributors in the development of the system.
• Provide a free and open-source machine translation platform for Indian languages.
The ‘core’ engine is the main engine of anusaaraka. This engine produces the output in different layers,
making the process of machine translation transparent to the user. The architecture of the core
anusaaraka engine is shown in the figure.
Core Anusaaraka engine
The core anusaaraka engine has four major modules:
I. Word Level Substitution
II. Word Sense Disambiguation
III. Preposition Placement
IV. Hindi Word Order Generation
Each of the above four modules is described in detail below, along with a justification of how these
changes answer the questions raised earlier.
Word Level Substitution:
At this level the ‘gloss’ of each source language word in the target language is provided. However,
polysemous words (words having more than one related meaning) create problems. When there is
no one-to-one mapping, it is not practical to list all the meanings. On the other hand, anusaaraka claims
‘faithfulness’ to the original text. How, then, is faithfulness guaranteed at the word-level substitution stage?
Concept of Padasutra:
To arrive at the solution, the user must understand why a native speaker does not find it odd that a
word has so many ‘seemingly’ different meanings. By looking at the various usages of any
polysemous word, users may observe that these polysemous words have a “core meaning” and that the other
meanings are natural extensions of this core meaning. In anusaaraka an attempt is made to relate all these
meanings and show their relationship by means of a formula. This formula is termed Padasutra[2].
(The concept of Padasutra is based on the concept of ‘pravrutti-nimitta’ from traditional grammar.)
The word padasutra itself has two meanings:
◆ a thread connecting different senses
◆ a formula for pada
An example of Padasutra:
The English word ‘leave’ as a noun means ‘Cutti’ in Hindi, and as a verb its Hindi meaning is
‘CodanA’; it is evident that ‘CodanA’ is derived from ‘Cutti’. Hence, the Padasutra for ‘leave’ is:
leave: Cutti[>CodanA]
Here ‘a>b’ stands for ‘b gets derived from a’ and ‘a[b]’ roughly stands for ‘a or b’. Thus, by division
of workload and adoption of the concept of ‘Padasutra’ (word formula), the first level output is
guaranteed to be ‘faithful’ to the original and also acts as a ‘safety net’ where other modules fail. At
this level some English words, such as function words and articles, are not substituted, because they
are either highly ambiguous, or there is a lexical/conceptual gap in Hindi corresponding to the English
words (e.g. articles), or substituting them may lead to catastrophic errors. Thus, for the input sentence
‘rats killed cats’ the output after word-level substitution is:
cUhA{s} mArA{ed/en} billI{s}
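A minimal sketch of this word-level substitution layer is shown below. The gloss table (with one padasutra-style entry) and the lemma-suffix analysis are simplified, hypothetical stand-ins for Anusaaraka's actual lexical and morphological resources.

# Word-level substitution sketch: each English content word is replaced by a
# padasutra-style gloss, and inflection information is kept in braces.
# The gloss table and analysis below are simplified, hypothetical examples.

GLOSS = {
    "rat":   "cUhA",
    "kill":  "mArA",
    "cat":   "billI",
    "leave": "Cutti[>CodanA]",   # padasutra: core sense plus derived sense
}

# (lemma, suffix) pairs as a stand-in for a morphological analyzer's output
# for the sentence 'rats killed cats'.
ANALYSIS = [("rat", "s"), ("kill", "ed/en"), ("cat", "s")]

layer1 = " ".join(f"{GLOSS.get(lemma, lemma)}{{{suffix}}}"
                  for lemma, suffix in ANALYSIS)
print(layer1)    # cUhA{s} mArA{ed/en} billI{s}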
Training Component
To understand the output produced in this manner, a human being needs some training. The training
presents English grammar through the Paaninian view[3]. Thus, if a user is willing to put in some
effort, he/she has complete access to the original text.
The effort required here is that of making correct choices based on common sense, world
knowledge, etc. This layer produces an output that is a "rough" translation differing from natural
Hindi in systematic ways. Since the output is generated following certain principles, the chances of
being misled are low. Theoretically, the output at this layer is reversible.
Word Sense Disambiguation (WSD)
English has a very rich source of systematic ambiguity. A majority of nouns in English can potentially
be used as verbs. Therefore, the WSD task in the case of English can be split into two classes:
● (i) WSD across POS
● (ii) WSD within POS
POS taggers can help in WSD when the ambiguity is across POS categories. For example, consider the
two sentences 'He chairs the session' and 'The chairs in this room are comfortable'. A POS tagger
marks the words with appropriate POS tags. These taggers use certain heuristic rules, and hence may
sometimes go wrong. The reported performance of these POS taggers varies between 95% and 97%.
However, they are still useful, since they reduce the search space for meanings substantially.
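For instance, an off-the-shelf tagger such as NLTK's can separate the verb and noun uses of 'chairs' in the two sentences above; a minimal sketch (assuming NLTK and its tokenizer and tagger models are installed) follows.

import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

for sent in ["He chairs the session", "The chairs in this room are comfortable"]:
    tags = nltk.pos_tag(nltk.word_tokenize(sent))
    print(tags)

# Expected: 'chairs' is tagged as a verb (VBZ) in the first sentence
# and as a plural noun (NNS) in the second.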
However, disambiguation in the case of polysemous words requires disambiguation rules. It is not an
easy task to frame such rules. It is the context that plays a crucial role in disambiguation. The
context may be
◆ the words in proximity, or
◆ other words in the sentence that are related to the word to be disambiguated.
The question is: how can such rules be made efficiently? To frame disambiguation rules manually
would require thousands of man-years. Is it possible to use machines to automate this process? The
wasp workbench [8] is the best example of how, with the help of a small amount of seed data, machines can
learn from a corpus and produce disambiguation rules. Anusaaraka uses the wasp workbench to
semi-automatically generate these disambiguation rules. The output produced at this stage is
irreversible, since the machine makes choices based on heuristics.
Preposition Placement:
English has prepositions whereas Hindi has postpositions.
◆ Hence, it is necessary to move the prepositions to their proper positions in Hindi before substituting
their meanings. While moving the prepositions from their English positions to the proper Hindi
positions, a record of their movements must be stored, so that in case a need arises, they can be
reverted to their original positions. Therefore, the transformations performed by this module are
also reversible.
Hindi Word Order Generation
Hindi is a free word order language. Therefore, even the anusaaraka output of the previous layer
makes sense to a Hindi reader. However, since this output is not in natural Hindi order, it may not read
as well as output with the natural Hindi word order, and it would not be treated as a translation.
Therefore, in this module the attempt is to generate the correct Hindi word order.
Interface for different linguistic tools
The second major contribution of this architecture is the concept of ‘interfaces’. Machine translation
requires language resources such as POS taggers, morphological analyzers, and parsers. More than
one kind of each of these tools exists. Hence, it is wise to reuse these existing tools. However, there are
problems.
For example, consider parsers:
I. These parsers do not have satisfactory performance. Only about 40% of the time is the first parse the
correct parse. (The parse of a sentence tells how its words are related to each other; about 90% of such
relations in any parse are typically correct.)
II. Each of these parsers is based on a different grammatical formalism. Hence, the output it produces
is also influenced by the theoretical considerations of that grammar formalism.
III. Since the output formats of different parsers differ, it is not possible to remove one parser and
plug in another.
IV. One needs trained manpower to interpret the output produced by these parsers and to improve
their performance.
As a machine translation system developer interested in a “usable” product, one would like to plug in
different parsers and compare their performance. One may even like to use combinations of them, or
vote among different parsers and choose the best parse.
The question then is how to achieve this. It is not enough to have modular programs; the parser itself
is an independent module. What is required is a plug-in facility for different parsers. This is possible
provided all the parsers produce output in some common format. Hence, interfaces are necessary to
map the output of each parser to an intermediate form, as illustrated in the figure.
Limited Training Data: If the pivot language has limited training data for certain agreement patterns, the
translation system may not accurately handle those patterns during translation.
As a result, the translated output in the target language may lack full agreement with the original
source language. This can lead to grammatical errors and less fluent or natural-sounding translations.
However, it's worth noting that researchers are continually working to improve pivot-based translation
systems, and many methods aim to preserve agreement as much as possible. Techniques like
multilingual training, cross-lingual pre-training, and transfer learning help in capturing syntactic and
grammatical relationships between languages, which can lead to better agreement in the translated
output.
Nevertheless, when using pivot-based approaches, there may still be cases where complete agreement
between source and target languages is not achievable, and the translation system may need to "give
up" certain aspects of agreement to produce a feasible translation. This trade-off is one of the
challenges of pivot-based machine translation, especially for complex linguistic phenomena like
agreement.
Language Bridges
In the context of machine translation, language bridges refer to the techniques and approaches that
enable communication and translation between different language pairs, even if direct parallel training
data between those pairs is limited or unavailable. These bridges allow machine translation models to
leverage knowledge from one or more intermediary languages to improve translation quality for
language pairs that lack direct translation data.
Language bridges in machine translation are particularly useful for low-resource languages, where
obtaining sufficient parallel data for direct translation between two specific languages is challenging.
By introducing an intermediate language(s) for which sufficient parallel data is available, the model
can effectively learn to bridge the gap and perform translations across multiple languages.
Here are some common approaches used to build language bridges in machine translation:
Pivot-based Translation: In this approach, the model translates the source language to an intermediate
language (pivot) for which there is ample parallel data available, and then translates from the pivot
language to the target language. The overall translation is achieved by combining these two translation
steps. This technique allows the model to handle language pairs without direct parallel data, as long as
there is a path through a pivot language.
Multilingual Translation Models: Instead of training separate models for each language pair, multilingual
translation models are trained to handle multiple languages simultaneously. These models can share
information between languages during training, effectively creating a language bridge.
Zero-Shot Translation: By training a model on multiple languages, it can be capable of translating
between language pairs it has never seen during training. This is achieved by leveraging the shared
representations learned during multilingual training.
Cross-Lingual Pre-training: Models are pre-trained on a large corpus containing text from multiple
languages, learning to encode multilingual representations. These pre-trained models can then be
fine-tuned for specific translation tasks.
Transfer Learning: Knowledge gained from translating between high-resource language pairs can be
transferred to improve translation performance for low-resource language pairs.
Bilingual Lexicons and Dictionaries: Bilingual word dictionaries or lexicons can be utilized to map
words between languages, providing a way to perform translation using explicit word alignments.
Language bridges in machine translation are valuable because they help expand the reach of translation
systems to more languages, including those with limited linguistic resources. However, it's important
to consider that using a pivot language or indirect training can introduce errors or inefficiencies in
translation, and the quality of the translation may vary depending on the quality and suitability of the
intermediary language. Researchers are continually exploring ways to improve language bridges and
multilingual translation techniques to enhance translation performance across various language pairs.
Document Pre-processing
Document pre-processing is a critical step in information retrieval (IR) that involves transforming raw
text documents into a format suitable for efficient indexing, storage, and retrieval. The main goal of
document pre-processing is to prepare the text data so that it can be effectively searched and
matched against user queries in an IR system. Several important tasks are typically performed during
document pre-processing:
Tokenization: The first step is to break down the raw text into smaller units called tokens, which are
usually words or subwords. Tokenization makes it possible to process individual words and enables
further analysis and indexing.
Lowercasing: Converting all tokens to lowercase is a common pre-processing step to ensure
case-insensitive search. This helps in retrieving relevant documents regardless of the case used in the
user query.
Stopword Removal: Stopwords are common words like "the," "is," "and," which appear frequently in
a language but do not carry significant meaning for information retrieval. Removing stopwords helps
reduce noise and saves space during indexing.
Stemming and Lemmatization: Stemming and lemmatization are techniques used to reduce words to
their root or base form. This helps in grouping different inflected forms of a word together, enabling
broader search coverage and reducing indexing overhead.
Normalization: Normalizing textual data involves converting abbreviations, acronyms, and numerical
expressions to their standard or expanded forms. This ensures consistency and improves search
precision.
Character Encoding and Unicode Handling: Ensuring proper character encoding and handling
Unicode text is essential for processing text data from various languages and scripts.
Special Character and Punctuation Handling: Depending on the application, special characters and
punctuation may be removed or retained to improve search accuracy.
Named Entity Recognition: Identifying and annotating named entities (e.g., person names, locations)
can be useful for improving search relevance and enabling entity-based retrieval.
Feature Extraction: Depending on the IR system, additional features like n-grams, part-of-speech tags,
and sentiment scores may be extracted to enrich the representation of the documents.
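As a rough sketch of how several of the steps above fit together, the following example (assuming NLTK and its standard resources are installed) tokenizes, lowercases, removes stopwords, and stems a short document.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads of tokenizer models and the stopword list.
nltk.download("punkt")
nltk.download("stopwords")

def preprocess(text):
    tokens = nltk.word_tokenize(text)                     # tokenization
    tokens = [t.lower() for t in tokens if t.isalpha()]   # lowercasing, drop punctuation
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]         # stopword removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]              # stemming

print(preprocess("The cats are chasing the mice in the garden."))
# ['cat', 'chase', 'mice', 'garden']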
The pre-processed documents are then indexed to create an inverted index, which maps each term in
the document collection to the documents that contain it. This index facilitates fast and efficient
retrieval of relevant documents when users submit queries.
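A minimal illustration of such an inverted index, built from already preprocessed token lists, might look like the following toy sketch (not a production index):

from collections import defaultdict

# Toy documents, assumed to be already preprocessed into tokens.
docs = {
    1: ["machine", "translation", "quality"],
    2: ["information", "retrieval", "machine"],
    3: ["neural", "translation", "model"],
}

# Inverted index: term -> set of document IDs containing that term.
index = defaultdict(set)
for doc_id, tokens in docs.items():
    for term in tokens:
        index[term].add(doc_id)

print(sorted(index["machine"]))      # [1, 2]
print(sorted(index["translation"]))  # [1, 3]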
Document pre-processing is a crucial step that impacts the overall performance and efficiency of an
information retrieval system. It is essential to strike the right balance between text normalization and
feature extraction to achieve accurate and relevant search results for different types of user queries.
Evaluation: The effectiveness of a monolingual IR system is assessed using evaluation metrics such
as precision, recall, F1 score, and Mean Average Precision (MAP) to measure the quality of the
retrieved results.
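The sketch below computes precision, recall, F1, and average precision for a single toy query, given a hypothetical ranked result list and a set of relevant documents; MAP is simply the mean of average precision over all queries.

# Toy evaluation of one ranked result list against known relevant documents.
ranked = ["d3", "d1", "d7", "d2", "d9"]      # system output, best first
relevant = {"d1", "d2", "d5"}                # ground-truth relevance judgments

retrieved_relevant = [d for d in ranked if d in relevant]
precision = len(retrieved_relevant) / len(ranked)
recall = len(retrieved_relevant) / len(relevant)
f1 = 2 * precision * recall / (precision + recall)

# Average precision: mean of the precision values at each rank where a
# relevant document is retrieved, divided by the number of relevant documents.
hits, precisions_at_hits = 0, []
for rank, doc in enumerate(ranked, start=1):
    if doc in relevant:
        hits += 1
        precisions_at_hits.append(hits / rank)
avg_precision = sum(precisions_at_hits) / len(relevant)

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} AP={avg_precision:.2f}")
# P=0.40 R=0.67 F1=0.50 AP=0.33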
Applications of Monolingual Information Retrieval include:
Web Search: Retrieving relevant web pages based on user queries.
Document Search: Searching for relevant documents within a specific corpus or database.
Information Retrieval in Social Media: Finding relevant posts, tweets, or comments on social media
platforms.
Enterprise Search: Enabling users to search for information within an organization's internal
documents and databases.
Monolingual Information Retrieval forms the foundation of many search engines and information
retrieval systems, allowing users to efficiently access relevant information from large document
collections in their native language. Improving the accuracy and efficiency of monolingual IR is an
ongoing area of research in the field of information retrieval and natural language processing.
CLIR
CLIR stands for Cross-Lingual Information Retrieval. It is a subfield of information retrieval (IR) that
focuses on retrieving relevant information from a document collection in one language (the source
language) in response to user queries expressed in another language (the target language). In other
words, CLIR enables users to search for information in a language they may not understand by
matching their queries against documents written in a different language.
CLIR is particularly useful in multilingual environments and scenarios where users need to access
information from documents in languages they are not familiar with. It plays a crucial role in enabling
cross-lingual access to information and breaking down language barriers in information retrieval.
Key aspects of Cross-Lingual Information Retrieval include:
Machine Translation: The central challenge in CLIR is to bridge the language gap between the source
and target languages. Machine translation is often used to automatically translate the user queries from
the target language to the source language or vice versa, so they can be matched against the
documents.
Language Identification: CLIR systems need to identify the language of the user query to route it to
the appropriate language-specific retrieval system or to conduct cross-lingual searches effectively.
Cross-Lingual Indexing: In CLIR, the documents in the source language may need to be indexed and
represented in a way that facilitates cross-lingual matching with queries in the target language.
Cross-Lingual Information Retrieval Models: CLIR systems use retrieval models that can
effectively rank documents in the source language based on their relevance to queries in the target
language. Various models, such as multilingual extensions of the Vector Space Model or BM25, are
used for this purpose.
Evaluation: The performance of CLIR systems is evaluated using metrics like Mean Average
Precision (MAP) or cross-lingual variants of precision and recall to measure the quality of the
retrieved results.
Applications of Cross-Lingual Information Retrieval include:
Multilingual Web Search: Allowing users to search for information on the web in a language they
don't understand, with results retrieved from multiple languages.
Multilingual Document Search: Retrieving relevant documents in different languages based on user
queries in one specific language.
Multilingual Question Answering: Providing answers to user questions in one language by retrieving
information from documents in multiple languages.
CLIR is a challenging task, as it requires effective machine translation systems and robust
cross-lingual retrieval models. Ongoing research in CLIR aims to improve the accuracy and efficiency
of cross-lingual information retrieval to facilitate access to multilingual information in diverse
linguistic contexts.
Precision-Recall Curve: The precision-recall curve plots precision against recall for various retrieval
settings, providing insights into the trade-off between precision and recall.
Interpolation of Recall and Precision: This method involves computing precision values at specific
recall levels and then interpolating to estimate precision at other recall levels.
User Studies: In addition to quantitative evaluation, qualitative user studies can provide valuable
feedback on the user experience and relevance of the retrieved results.
Test Collections: Evaluation is often conducted on test collections, which are predefined datasets
containing queries, relevant documents, and relevance judgments. Standard test collections enable fair
comparison between different IR systems.
IR evaluation is an ongoing research area, and researchers continuously explore new metrics and
evaluation methodologies to capture the effectiveness of modern search engines and retrieval systems
accurately.
When reporting evaluation results, it is essential to specify the evaluation metric used, the test
collection, and any additional parameters or conditions under which the evaluation was conducted to
ensure transparency and reproducibility of the evaluation process.
Tools:
In the context of Information Retrieval and Natural Language Processing, several tools and libraries
are available to assist in various tasks. These tools range from text processing and indexing to machine
learning and evaluation. Here are some popular tools commonly used in the field:
NLTK (Natural Language Toolkit): NLTK is a Python library that provides a comprehensive set of
tools for working with human language data. It offers functionalities for text processing, tokenization,
stemming, lemmatization, part-of-speech tagging, and more.
spaCy: Another popular Python NLP library, spaCy is designed for efficient and fast natural language
processing. It offers various pre-trained models for named entity recognition, dependency parsing, and
part-of-speech tagging.
Gensim: Gensim is a Python library for topic modeling and document similarity analysis. It allows
building topic models like Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
Scikit-learn: Scikit-learn is a widely used machine learning library in Python. It offers various
algorithms and tools for classification, regression, clustering, and other machine learning tasks, which
can be applied to text data as well.
Elasticsearch: Elasticsearch is a powerful search and analytics engine used for full-text search and
information retrieval. It provides an efficient way to index and search large volumes of textual data.
Lucene: Lucene is a high-performance search library written in Java. Elasticsearch, as mentioned
above, is built on top of Lucene.
TensorFlow and PyTorch: These deep learning frameworks are popular for building and training
neural networks for NLP tasks, such as sentiment analysis, text classification, and machine translation.
trec_eval: trec_eval is a widely used tool for evaluating IR systems. It provides various evaluation
metrics, such as Precision, Recall, MAP, and NDCG, and is often used with test collections for IR
evaluation.
OpenNLP: Apache OpenNLP is a Java library for natural language processing tasks, including
tokenization, sentence detection, named entity recognition, and more.
Stanford NLP: The Stanford NLP toolkit offers a suite of NLP tools, including part-of-speech
tagging, named entity recognition, dependency parsing, and sentiment analysis.
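As a small example of how libraries such as those above fit together for retrieval, the sketch below uses scikit-learn's TfidfVectorizer and cosine similarity to rank three toy documents against a query (assuming scikit-learn is installed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection and query.
docs = [
    "Neural machine translation uses neural networks.",
    "Information retrieval finds relevant documents for a query.",
    "Statistical machine translation learns from parallel corpora.",
]
query = "machine translation with neural networks"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)            # index the documents
query_vector = vectorizer.transform([query])            # represent the query

scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = sorted(zip(scores, docs), reverse=True)       # best match first
for score, doc in ranking:
    print(f"{score:.2f}  {doc}")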
Multilingual Social Media Summarization: Creating summaries of multilingual content from social
media platforms.
Multilingual Automatic Summarization is an active area of research, with ongoing efforts to develop
effective techniques that can generate high-quality summaries in diverse languages. As advances in
NLP and machine learning continue, the quality and applicability of multilingual summarization
models are likely to improve, enabling more efficient access to information in multiple languages.
Approaches to Summarization
Automatic summarization is the task of generating a concise and coherent summary of a longer text,
such as a document or an article. There are two primary approaches to automatic summarization:
extractive summarization and abstractive summarization. Each approach has its advantages and
challenges.
Extractive Summarization:
In extractive summarization, the summary is generated by selecting and extracting sentences or
phrases directly from the original text. These selected sentences form the summary. The advantage of
this approach is that the summary consists of sentences that are already present in the source
document, ensuring that the information is accurate and coherent. Some common techniques used in
extractive summarization include:
Sentence Scoring: Sentences are assigned scores based on various features, such as the frequency of
important words or phrases, sentence position, and grammatical structure. The top-scoring sentences
are included in the summary.
Graph-based Methods: Sentences are represented as nodes in a graph, and edges between sentences
represent their similarity. Graph algorithms like PageRank are used to identify the most important
sentences for the summary.
Machine Learning: Supervised or unsupervised machine learning models can be trained to classify
sentences as relevant or not relevant to the summary.
Abstractive Summarization:
Abstractive summarization, on the other hand, involves generating a summary by rephrasing and
paraphrasing the content from the source document in a more concise manner. The generated summary
may contain words and phrases that are not present in the original text. Abstractive summarization is
more challenging because it requires natural language generation and a deeper understanding of the
text. Some common techniques used in abstractive summarization include:
Sequence-to-Sequence Models: Recurrent Neural Networks (RNNs) or Transformer-based models,
such as the Encoder-Decoder architecture, are used to generate summaries by encoding the input text
and decoding the summary.
Attention Mechanism: Attention mechanisms help the model focus on relevant parts of the source
text during the decoding process, enabling more contextually appropriate summaries.
Copy Mechanism: Copy mechanisms allow the model to copy words or phrases directly from the
source text into the summary, helping to preserve the originality and accuracy of the content.
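As one illustration of the sequence-to-sequence approach described above, a pre-trained encoder-decoder model can be used for abstractive summarization through the Hugging Face transformers library (an assumption here, since that library is not mentioned above); the pipeline downloads a default summarization model on first use.

from transformers import pipeline

# Load a pre-trained encoder-decoder summarization model (downloaded on first use).
summarizer = pipeline("summarization")

text = (
    "Machine translation systems have evolved from rule-based and statistical "
    "approaches to neural models. Neural machine translation encodes the whole "
    "source sentence and decodes it into the target language, and it is now "
    "considered the standard approach in machine translation development."
)

result = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])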
Hybrid approaches that combine elements of both extractive and abstractive summarization are also
being explored to leverage the strengths of each approach.
Summarization techniques continue to be an active area of research in Natural Language Processing
(NLP) and Artificial Intelligence (AI). Advancements in deep learning and large-scale language
modeling have significantly improved the quality and effectiveness of automatic summarization
systems. However, generating human-like summaries that capture the nuances and subtleties of the
source text remains a challenging and open research problem.
Evaluation:
Evaluating Multilingual Automatic Summarization is a challenging task due to the inherent
complexities of dealing with multiple languages and the scarcity of high-quality evaluation datasets in
multiple languages. Nonetheless, there are several evaluation strategies and metrics that can be used to
assess the performance of multilingual summarization systems:
Multilingual Test Collections: Building multilingual test collections is one approach to evaluate
summarization systems across multiple languages. These collections consist of documents in different
languages along with corresponding human-generated summaries. The performance of the system is
measured based on how well the generated summaries match the human-authored summaries in terms
of content, coherence, and conciseness.
Bilingual Evaluation Understudy (BLEU): BLEU is a popular metric used to evaluate the quality of
machine translation systems. It can also be adapted for evaluating multilingual summarization systems
by comparing the generated summaries with human-authored summaries in parallel across multiple
languages (a small usage sketch with NLTK appears after this list).
Multilingual ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Similar to BLEU,
ROUGE can be extended to evaluate multilingual summarization systems. It involves comparing the
system-generated summaries with human summaries in multiple languages.
Cross-Lingual Comparisons: When a multilingual summarization system is capable of summarizing
content from one language to another, cross-lingual comparisons can be made to evaluate the quality of
summaries generated in a target language compared to the source language.
Transfer Learning: Evaluating how well a summarization model trained on data from one language
generalizes to other languages is an important aspect of multilingual summarization evaluation.
Cross-lingual transfer learning techniques can be used to leverage knowledge from high-resource
languages to improve performance in low-resource languages.
User Studies: Conducting user studies with participants proficient in different languages can provide
insights into the quality and usefulness of multilingual summaries from the perspective of end-users.
Manual Assessment: Manual assessment by human evaluators fluent in different languages can be
employed to rate the quality of summaries based on criteria such as informativeness, coherence, and
language fluency.
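As referenced in the BLEU item above, a sentence-level BLEU score can be computed with NLTK as follows; the reference and candidate texts here are toy examples, and smoothing is used because short sentences often have zero higher-order n-gram matches.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example: one human reference summary and one system-generated candidate.
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

smoothing = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smoothing)
print(f"BLEU: {score:.2f}")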
Building a summarizer involves designing and implementing a system that can analyze a document
and generate a concise and coherent summary. Here's a step-by-step guide on how to build a basic
extractive summarizer using Python:
Install Required Libraries: First, ensure you have the necessary Python libraries installed, such as
NLTK, spaCy, or Gensim, depending on your choice of text processing and NLP tools.
Preprocess the Text: Clean and preprocess the input text by removing any irrelevant content, special
characters, and unnecessary whitespace. Tokenize the text into sentences or words, depending on the
level of summarization required.
Calculate Sentence Scores: Assign a score to each sentence in the document based on its relevance to
the overall content. Common approaches include TF-IDF, TextRank, or other sentence scoring
methods.
Rank Sentences: Sort the sentences based on their scores in descending order to identify the most
important sentences in the document.
Select Top Sentences: Decide on the number of sentences you want in the summary (e.g., 3 to 5
sentences). Select the top-scoring sentences to include in the summary.
Combine Selected Sentences: Concatenate the selected sentences to create the summary.
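A minimal end-to-end version of these steps, using NLTK and a simple word-frequency scoring scheme (one of the "other sentence scoring methods" mentioned above), might look like the following sketch:

import heapq
import nltk
from nltk.corpus import stopwords

# One-time downloads of tokenizer models and the stopword list.
nltk.download("punkt")
nltk.download("stopwords")

def summarize(text, num_sentences=3):
    sentences = nltk.sent_tokenize(text)                  # split into sentences
    stop = set(stopwords.words("english"))

    # Score words by frequency, ignoring stopwords and punctuation.
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    freq = nltk.FreqDist(w for w in words if w not in stop)

    # Score each sentence as the sum of its word frequencies.
    scores = {}
    for sent in sentences:
        for w in nltk.word_tokenize(sent.lower()):
            if w in freq:
                scores[sent] = scores.get(sent, 0) + freq[w]

    # Select the top-scoring sentences and keep them in original order.
    top = heapq.nlargest(num_sentences, scores, key=scores.get)
    return " ".join(s for s in sentences if s in top)

document = (
    "Machine translation converts text from one language to another. "
    "Rule-based systems rely on dictionaries and grammar rules. "
    "Statistical systems learn patterns from existing human translations. "
    "Neural systems use neural networks and are now the standard approach. "
    "Human review is still useful for nuance and cultural context."
)
print(summarize(document, num_sentences=2))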
Competitions:
Text Retrieval Conference (TREC): TREC is an annual competition organized by the National Institute
of Standards and Technology (NIST) that focuses on various IR tasks, including ad-hoc search,
question answering, and entity linking.
Conference on Machine Translation (WMT): WMT holds annual shared tasks on machine translation,
where participants are challenged to develop high-quality translation systems for different language
pairs.
Document Understanding Conference (DUC): DUC focuses on document summarization tasks, where
participants are required to generate concise summaries for a given set of documents.
Text Analysis Conference (TAC): TAC organizes multiple tracks, including entity linking, knowledge
base population, and sentiment analysis, aiming to advance various NLP tasks.
SemEval (Semantic Evaluation): SemEval is a series of evaluations focused on various NLP tasks,
such as sentiment analysis, relation extraction, and textual entailment.
Machine Reading for Question Answering (MRQA): MRQA is an annual shared task that challenges
participants to build models capable of answering questions based on given passages of text.
Datasets:
Common Crawl: Common Crawl provides a vast collection of publicly available web pages, useful for
training and testing web search and information retrieval systems.
Wikipedia Dump: Wikipedia provides large text corpora in multiple languages, which can be used for
training language models and building multilingual NLP systems.
Gigaword Corpus: Gigaword is a large-scale corpus containing news articles and is often used for text
summarization tasks.
CoNLL: The CoNLL series includes datasets for tasks like named entity recognition, part-of-speech
tagging, and dependency parsing.
GLUE (General Language Understanding Evaluation): GLUE offers a collection of datasets for
evaluating the performance of language models on various NLP tasks.
SQuAD (Stanford Question Answering Dataset): SQuAD provides a large dataset of question-answer
pairs, enabling the evaluation of reading comprehension and question answering systems.
Multi30K: Multi30K is a multilingual dataset for image captioning, where image descriptions are
available in multiple languages.