AI-Enhanced Text Embeddings
Liang Wang∗, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei
Microsoft Corporation
https://aka.ms/GeneralAI
∗Correspondence to {wangliang,nanya,fuwei}@microsoft.com
Abstract
In this paper, we introduce a novel and simple method for obtaining high-quality
text embeddings using only synthetic data and less than 1k training steps. Unlike
existing methods that often depend on multi-stage intermediate pre-training with
billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled
datasets, our method does not require building complex training pipelines or relying
on manually collected datasets that are often constrained by task diversity and
language coverage. We leverage proprietary LLMs to generate diverse synthetic
data for hundreds of thousands of text embedding tasks across nearly 100 languages.
We then fine-tune open-source decoder-only LLMs on the synthetic data using
standard contrastive loss. Experiments demonstrate that our method achieves
strong performance on highly competitive text embedding benchmarks without
using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic
and labeled data, our model sets new state-of-the-art results on the BEIR and
MTEB benchmarks.
1 Introduction
Text embeddings are vector representations of natural language that encode its semantic information.
They are widely used in various natural language processing (NLP) tasks, such as information
retrieval (IR), question answering, semantic textual similarity, bitext mining, item recommendation,
etc. In the field of IR, the first-stage retrieval often relies on text embeddings to efficiently recall
a small set of candidate documents from a large-scale corpus using approximate nearest neighbor
search techniques. Embedding-based retrieval is also a crucial component of retrieval-augmented
generation (RAG) [21], which is an emerging paradigm that enables large language models (LLMs)
to access dynamic external knowledge without modifying the model parameters. Source attribution
of generated text is another important application of text embeddings [14] that can improve the
interpretability and trustworthiness of LLMs.
Previous studies have demonstrated that a weighted average of pre-trained word embeddings [35, 1]
is a strong baseline for measuring semantic similarity. However, these methods fail to capture the
rich contextual information of natural language. With the advent of pre-trained language models
[11], Sentence-BERT [37] and SimCSE [13] have been proposed to learn text embeddings by fine-
tuning BERT on natural language inference (NLI) datasets. To further enhance the performance and
robustness of text embeddings, state-of-the-art methods like E5 [46] and BGE [48] employ a more
complex multi-stage training paradigm that first pre-trains on billions of weakly-supervised text pairs,
and then fine-tunes on several labeled datasets.
Existing multi-stage approaches suffer from several drawbacks. Firstly, they entail a complex
multi-stage training pipeline that demands substantial engineering efforts to curate large amounts
of relevance pairs. Secondly, they rely on manually collected datasets that are often constrained by
the diversity of tasks and the coverage of languages. For instance, Instructor [40] is only trained on
instructions from 330 English datasets, whereas BGE [48] only focuses on high-resource languages
such as English and Chinese. Moreover, most existing methods employ BERT-style encoders as the
backbone, neglecting the recent advances of training better LLMs and related techniques such as
context length extension [38].
In this paper, we propose a novel method for text embeddings that leverages LLMs to overcome the
limitations of existing approaches. We use proprietary LLMs to generate synthetic data covering
hundreds of thousands of diverse text embedding tasks across 93 languages.
Specifically, we use a two-step prompting strategy that first prompts the LLMs to brainstorm a pool
of candidate tasks, and then prompts the LLMs to generate data conditioned on a given task from the
pool. To cover various application scenarios, we design multiple prompt templates for each task type
and combine the generated data from different templates to boost diversity. For the text embedding
models, we opt for fine-tuning powerful open-source LLMs rather than small BERT-style models.
Since LLMs such as Mistral [19] have been extensively pre-trained on web-scale data, contrastive
pre-training offers little additional benefit.
We demonstrate that Mistral-7B, when fine-tuned solely on synthetic data, attains competitive
performance on the BEIR [42] and MTEB [28] benchmarks. This is particularly intriguing considering
that this setting does not involve any labeled data. When fine-tuned on a mixture of synthetic and
labeled data, our model achieves new state-of-the-art results, surpassing previous methods by a
significant margin (+2%). The entire training process requires less than 1k steps.
Moreover, we empirically validate that our model can effectively perform personalized passkey
retrieval for inputs up to 32k tokens by altering the rotation base of the position embeddings,
extending the context length beyond the conventional 512 token limit. Regarding its multilinguality,
our model excels on high-resource languages. However, for low-resource languages, there is still
room for improvement as current open-source LLMs are not adequately pre-trained on them.
2 Related Work
Text Embeddings are continuous low-dimensional representations of text and have been extensively
applied to various downstream tasks such as information retrieval, question answering, and retrieval-
augmented generation (RAG). Early work on text embeddings includes latent semantic indexing [10]
and weighted average of word embeddings [25]. More recent methods exploit supervision from
natural language inference [3] and labeled query-document pairs, such as the MS-MARCO passage
ranking dataset [5], to train text embeddings [37, 6, 13]. However, labeled data are often limited in
terms of task diversity and language coverage. To address this challenge, methods like Contriever [18],
OpenAI Embeddings [30], E5 [46], and BGE [48] adopt a multi-stage training paradigm. They first
pre-train on large-scale weakly-supervised text pairs using contrastive loss and then fine-tune on
small-scale but high-quality datasets. In this paper, we demonstrate that it is possible to obtain
state-of-the-art text embeddings with single-stage training.
Synthetic Data Synthetic data generation is a widely studied topic in information retrieval research,
with various methods proposed to enhance retrieval systems with artificially created data. For instance,
Doc2query [33], InPars [2], and Promptagator [8] generate synthetic queries for unlabeled documents,
which are then leveraged for document expansion or model training. GPL [45] employs a cross-
encoder to produce pseudo-labels for query-document pairs. Similarly, Query2doc [47] generates
pseudo-documents for query expansion by few-shot prompting LLMs. Unlike these methods, our
approach does not rely on any unlabeled documents or queries and thus can generate more diverse
synthetic data.
Another related line of work focuses on knowledge distillation from black-box LLMs by training on
synthetic data generated from them. DINO [39] generates synthetic text pairs for semantic textual
similarity. Unnatural Instructions [16] is a synthetic instruction following dataset by prompting
existing LLMs. Orca [29] and Phi [15] propose to train better small language models by using
high-quality synthetic data from GPT-3.5/4 [34].
Large Language Models With the popularization of ChatGPT, large language models (LLMs) have
demonstrated remarkable capabilities in instruction following and few-shot in-context learning [4].
However, the most advanced LLMs such as GPT-4 [34] are proprietary, with few technical details
disclosed. To bridge the gap between proprietary and open-source LLMs, several notable
efforts have been made, such as LLaMA-2 [44] and Mistral [19] models. A major limitation of LLMs
is that they lack awareness of recent events and private knowledge. This issue can be partly mitigated
by augmenting LLMs with information retrieved from external sources, a technique known as
retrieval-augmented generation (RAG). On the other hand, LLMs can also serve as foundation models
to enhance text embeddings. RepLLaMA [24] proposes to fine-tune LLaMA-2 with bi-encoder
architecture for ad-hoc retrieval. SGPT [27], GTR [32], and Udever [51] demonstrate the scaling law
of text embeddings empirically, but their performance still falls behind small bidirectional encoders
such as E5 [46] and BGE [48]. In this paper, we present a novel approach to train state-of-the-art text
embeddings by exploiting the latest advances of LLMs and synthetic data.
Figure 1: An example two-step prompt template for generating synthetic data with GPT-4. We first
prompt GPT-4 to brainstorm a list of potential retrieval tasks, and then generate (query, positive, hard
negative) triplets for each task. “{...}” denotes a placeholder that will be replaced by sampling from a
predefined set of values. Full prompts are available in Appendix C.
3 Method
3.1 Synthetic Data Generation
Utilizing synthetic data generated by advanced LLMs such as GPT-4 presents a compelling oppor-
tunity, especially in terms of enhancing diversity across a multitude of tasks and languages. Such
diversity is essential for developing robust text embeddings that can perform well across different
tasks, be it semantic retrieval, textual similarity, or clustering.
To generate diverse synthetic data, we propose a simple taxonomy that categorizes embedding tasks
into several groups, and then apply different prompt templates to each group.
Asymmetric Tasks This category comprises tasks where the query and document are semantically
related but are not paraphrases of each other. Depending on the length of the query and document, we
further divide asymmetric tasks into four subgroups: short-long match, long-short match, short-short
match, and long-long match. For instance, short-long match tasks involve a short query and a long
document, which is a typical scenario in commercial search engines. For each subgroup, we design a
two-step prompt template that first prompts LLMs to brainstorm a list of tasks, and then prompts them
to generate a concrete example conditioned on the task definition. In Figure 1, we show an example prompt for
the short-long match subgroup. The outputs from GPT-4 are mostly coherent and of high quality. In
our preliminary experiments, we also attempted to generate the task definition and query-document
pairs using a single prompt, but the data diversity was not as satisfactory as the proposed two-step
approach.
Symmetric Tasks Symmetric tasks involve queries and documents that have similar semantic
meanings but different surface forms. We examine two application scenarios: monolingual semantic
textual similarity (STS) and bitext retrieval. We design two distinct prompt templates for each
scenario, tailored to their specific objectives. Since the task definition is straightforward, we omit the
brainstorming step for symmetric tasks.
To further boost the diversity of the prompts and thus the synthetic data, we incorporate several
placeholders in each prompt template, whose values are randomly sampled at runtime. For example,
in Figure 1, the value of “{query_length}” is sampled from the set “{less than 5 words, 5-10 words,
at least 10 words}”.
To generate multilingual data, we sample the value of “{language}” from the language list of XLM-
R [7], giving more weight to high-resource languages. Any generated data that does not conform to
the predefined JSON format is discarded during the parsing process. We also remove duplicates
based on exact string matching.
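To make the pipeline concrete, here is a minimal Python sketch of the placeholder-sampling and post-processing steps described above; the placeholder pools, the prompt wording, and the key names are illustrative stand-ins rather than the exact templates from Appendix C.

```python
import json
import random

# Hypothetical placeholder pools; the real templates use the sets listed in Appendix C.
PLACEHOLDERS = {
    "query_length": ["less than 5 words", "5-10 words", "at least 10 words"],
    "clarity": ["clear", "understandable with some effort", "ambiguous"],
    "language": ["English", "French", "Chinese"],  # sampled from the XLM-R list in practice
}

PROMPT_TEMPLATE = (
    "You have been assigned a retrieval task: {task}\n"
    "Write one example with a {query_length}, {clarity} query in {language} as a JSON object "
    "with keys 'user_query', 'positive_document', 'hard_negative_document'."
)

def build_prompt(task: str) -> str:
    """Fill each placeholder with a randomly sampled value to diversify prompts."""
    values = {k: random.choice(v) for k, v in PLACEHOLDERS.items()}
    return PROMPT_TEMPLATE.format(task=task, **values)

def parse_and_filter(raw_outputs: list[str]) -> list[dict]:
    """Discard outputs that are not valid JSON and drop exact-string duplicates."""
    seen, examples = set(), []
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output is discarded
        if not {"user_query", "positive_document", "hard_negative_document"} <= obj.keys():
            continue
        key = obj["user_query"] + obj["positive_document"]
        if key in seen:
            continue  # exact-match deduplication
        seen.add(key)
        examples.append(obj)
    return examples
```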
3.2 Training
Given a relevant query-document pair $(q^+, d^+)$, we first apply the following instruction template to the original query $q^+$ to generate a new one $q^+_{\text{inst}}$:

$$q^+_{\text{inst}} = \text{Instruct: } \{\text{task\_definition}\} \;\backslash\text{n}\; \text{Query: } \{q^+\} \qquad (1)$$
where “{task_definition}” is a placeholder for a one-sentence description of the embedding task. For
generated synthetic data, we use the outputs from the brainstorming step. For other datasets, such as
MS-MARCO, we manually craft the task definitions and apply them to all the queries in the dataset.
We do not modify the document side with any instruction prefix. In this way, the document index can
be prebuilt, and we can customize the task to perform by changing only the query side.
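A minimal sketch of this query-side templating (Equation 1); the example task definition is taken from Table 14, while the query string is made up:

```python
def build_instructed_query(task_definition: str, query: str) -> str:
    # Only the query side carries the instruction; documents are embedded as-is,
    # so the document index can be prebuilt and reused across tasks.
    return f"Instruct: {task_definition}\nQuery: {query}"

q_inst = build_instructed_query(
    "Given a web search query, retrieve relevant passages that answer the query",
    "how long does it take to boil an egg",  # made-up query for illustration
)
```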
Given a pretrained LLM, we append an [EOS] token to the end of the query and document, and then feed them into the LLM to obtain the query and document embeddings $(h_{q^+_{\text{inst}}}, h_{d^+})$ by taking the last-layer [EOS] vector. To train the embedding model, we adopt the standard InfoNCE loss $\mathcal{L}$ over the in-batch negatives and hard negatives:

$$\min\; \mathcal{L} = -\log \frac{\phi(q^+_{\text{inst}}, d^+)}{\phi(q^+_{\text{inst}}, d^+) + \sum_{n_i \in \mathbb{N}} \phi(q^+_{\text{inst}}, n_i)} \qquad (2)$$

where $\mathbb{N}$ denotes the set of all negatives, and $\phi(q, d)$ is a function that computes the matching score between query $q$ and document $d$. In this paper, we adopt the temperature-scaled cosine similarity
function as follows:

$$\phi(q, d) = \exp\!\left(\frac{1}{\tau}\cos(h_q, h_d)\right) \qquad (3)$$

where $\tau$ is a temperature hyper-parameter, fixed to 0.02 in our experiments.
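The loss in Equations (2) and (3), together with last-token pooling, can be sketched in PyTorch as follows. This is a simplified illustration that uses only in-batch negatives and assumes right-padded inputs; it is not the exact training code.

```python
import torch
import torch.nn.functional as F

def last_token_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Take the hidden state of the final non-padding ([EOS]) token as the embedding."""
    seq_lengths = attention_mask.sum(dim=1) - 1                      # index of the last real token
    batch_idx = torch.arange(last_hidden_state.size(0), device=last_hidden_state.device)
    return last_hidden_state[batch_idx, seq_lengths]                 # (batch, hidden)

def info_nce_loss(q_emb: torch.Tensor, d_emb: torch.Tensor, tau: float = 0.02) -> torch.Tensor:
    """InfoNCE with temperature-scaled cosine similarity (Equations 2 and 3)."""
    q_emb = F.normalize(q_emb, dim=-1)
    d_emb = F.normalize(d_emb, dim=-1)
    scores = q_emb @ d_emb.T / tau                                   # cos(h_q, h_d) / tau
    labels = torch.arange(scores.size(0), device=scores.device)      # the i-th document is the positive
    return F.cross_entropy(scores, labels)                           # -log softmax over positive + negatives
```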
4 Experiments
4.1 Statistics of the Synthetic Data
[Two pie charts: the left shows the distribution of the synthetic data over task types (short-long, long-short, short-short, long-long, STS, bitext); the right shows the language distribution, led by English at 43.1%, with roughly twenty other languages at around 3% each and the remainder grouped as “Others”.]
Figure 2: Task type and language statistics of the generated synthetic data (see Section 3.1 for task
type definitions). The “Others” category contains the remaining languages from the XLM-R language
list.
Figure 2 presents the statistics of our generated synthetic data. We manage to generate 500k examples
with 150k unique instructions using the Azure OpenAI Service (https://oai.azure.com/), among which 25% are generated by
GPT-35-Turbo and others are generated by GPT-4. The total token consumption is about 180M. The
predominant language is English, with coverage extending to a total of 93 languages. For the bottom
75 low-resource languages, there are about 1k examples per language on average.
In terms of data quality, we find that a portion of GPT-35-Turbo outputs do not strictly follow the
guidelines specified in the prompt templates. Nevertheless, the overall quality remains acceptable,
and preliminary experiments have demonstrated the benefits of incorporating this data subset.
4.2 Model Fine-tuning and Evaluation
The pretrained Mistral-7b [19] checkpoint is fine-tuned for 1 epoch using the loss in Equation 2. We
follow the training recipe from RankLLaMA [24] and utilize LoRA [17] with rank 16. To further
reduce GPU memory requirement, techniques including gradient checkpointing, mixed precision
training, and DeepSpeed ZeRO-3 are applied.
For the training data, we utilize both the generated synthetic data and a collection of 13 public datasets,
yielding approximately 1.8M examples after sampling. More details are available in Appendix A. To
provide a fair comparison with some previous work, we also report results when the only labeled
supervision is the MS-MARCO passage ranking [5] dataset.
We evaluate the trained model on the MTEB benchmark [28]. Note that the retrieval category in
MTEB corresponds to the 15 publicly available datasets in the BEIR benchmark [42]. Evaluation
of one model takes about 3 days on 8 V100 GPUs due to the need to encode a large number of
documents. Although our model can accommodate sequence lengths beyond 512 tokens, we only evaluate on
the first 512 tokens for efficiency. Official metrics are reported for each category. For more details
about the evaluation protocol, please refer to the original papers [28, 42].
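For reference, a typical MTEB evaluation loop with the open-source mteb package looks roughly like the sketch below; the task subset and the stand-in encoder are illustrative, and the exact API may differ across mteb versions.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any object exposing an `encode(sentences, **kwargs)` method that returns a 2-D array can be
# evaluated; a small public SentenceTransformer is used here purely as a stand-in model.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

evaluation = MTEB(tasks=["Banking77Classification", "SciFact"])  # small illustrative subset
evaluation.run(model, output_folder="results")
```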
4.3 Main Results
Table 1: Results on the MTEB benchmark [28] (56 datasets in the English subset). The numbers are
averaged for each category. Please refer to Table 15 for the scores per dataset.
Class. Clust. PairClass. Rerank Retr. STS Summ. Avg
# of datasets → 12 11 3 4 15 10 1 56
Unsupervised Models
Glove [35] 57.3 27.7 70.9 43.3 21.6 61.9 28.9 42.0
SimCSEbert-unsup [13] 62.5 29.0 70.3 46.5 20.3 74.3 31.2 45.5
Supervised Models
SimCSEbert-sup [13] 67.3 33.4 73.7 47.5 21.8 79.1 23.3 48.7
Contriever [18] 66.7 41.1 82.5 53.1 41.9 76.5 30.4 56.0
GTRxxl [32] 67.4 42.4 86.1 56.7 48.5 78.4 30.6 59.0
Sentence-T5xxl [31] 73.4 43.7 85.1 56.4 42.2 82.6 30.1 59.5
E5large-v2 [46] 75.2 44.5 86.0 56.6 50.6 82.1 30.2 62.3
GTElarge [23] 73.3 46.8 85.0 59.1 52.2 83.4 31.7 63.1
BGElarge-en-v1.5 [48] 76.0 46.1 87.1 60.0 54.3 83.1 31.6 64.2
Ours
E5mistral-7b + full data 78.5 50.3 88.3 60.2 56.9 84.6 31.4 66.6
w/ synthetic data only 78.2 50.5 86.0 59.0 46.9 81.2 31.9 63.1
w/ synthetic + msmarco 78.3 49.9 87.1 59.5 52.2 81.2 32.7 64.5
In Table 1, our model “E5mistral-7b + full data” attains the highest average score on the MTEB
benchmark, outperforming the previous state-of-the-art model by 2.4 points. In the “w/ synthetic data
only” setting, no labeled data is used for training, and yet the performance remains quite competitive.
We posit that generative language modeling and text embeddings are two sides of the same coin,
with both tasks requiring the model to have a deep understanding of natural language. Given an
embedding task definition, a truly robust LLM should be able to generate training data on its own and
then be transformed into an embedding model through light-weight fine-tuning. Our experiments
shed light on the potential of this direction, and more research is needed to fully explore it.
Table 2: Comparison with commercial models and the model that tops the MTEB leaderboard (as
of 2023-12-22). For the commercial models listed here, few details are available about their model
architectures and training data.
Model BEIR Retrieval (15 datasets) MTEB Average (56 datasets)
OpenAI Ada-002 49.3 61.0
Cohere-embed-english-v3.0 55.0 64.5
voyage-lite-01-instruct 55.6 64.5
UAE-Large-V1 [22] 54.7 64.6
E5mistral-7b + full data 56.9 66.6
In Table 2, we also present a comparison with several commercial text embedding models. However,
due to the lack of transparency and documentation about these models, a fair comparison is not
feasible. We focus especially on the retrieval performance on the BEIR benchmark, since RAG is
an emerging technique to enhance LLMs with external knowledge and proprietary data. As Table 2
shows, our model outperforms the current commercial models by a significant margin.
To assess the multilingual capabilities of our model, we conduct an evaluation on the MIRACL
dataset [53], which comprises human-annotated queries and relevance judgments across 18 languages.
As shown in Table 3, our model surpasses mE5large on high-resource languages, notably on English.
Nevertheless, for low-resource languages, our model remains suboptimal compared to mE5base .
We attribute this to the fact that Mistral-7B is predominantly pre-trained on English data, and we
anticipate that future multilingual LLMs will leverage our method to bridge this gap.
Table 3: nDCG@10 on the dev set of the MIRACL dataset for both high-resource and low-resource
languages. We select the 4 high-resource languages and the 4 low-resource languages according to
the number of candidate documents. The numbers for BM25 and mDPR come from Zhang et al.
[53]. For the complete results on all 18 languages, please see Table 5.
High-resource Languages Low-resource Languages
en fr es ru te hi bn sw
BM25 [53] 35.1 18.3 31.9 33.4 49.4 45.8 50.8 38.3
mDPR [53] 39.4 43.5 47.8 40.7 35.6 38.3 44.3 29.9
mE5base [46] 51.2 49.7 51.5 61.5 75.2 58.4 70.2 71.1
mE5large [46] 52.9 54.5 52.9 67.4 84.6 62.0 75.9 74.9
E5mistral-7b + full data 57.3 55.2 52.2 67.7 73.9 52.1 70.3 68.4
5 Analysis
5.1 Is Contrastive Pre-training Necessary?
Figure 3: Effects of contrastive pre-training. [Bar charts comparing the original models with their contrastively pre-trained counterparts, both fine-tuned on the full data. Contrastive pre-training improves XLM-R-large by +8.2 (Retrieval), +4.3 (Classification), and +5.7 (MTEB All), but changes E5-mistral-7b by only +0.0, +0.2, and +0.1 respectively; see Table 6 for the underlying numbers.]
Weakly-supervised contrastive pre-training is one of the key factors behind the success of existing
text embedding models. For instance, Contriever [18] treats random cropped spans as positive pairs
for pre-training, while E5 [46] and BGE [48] collect and filter text pairs from various sources.
This section re-evaluates the necessity of contrastive pre-training for LLMs, particularly those that
have been pre-trained on trillions of tokens. Figure 3 shows that contrastive pre-training benefits
XLM-Rlarge , enhancing its retrieval performance by 8.2 points when fine-tuned on the same data,
which aligns with prior findings. However, for Mistral-7B based models, contrastive pre-training
has negligible impact on the model quality. This implies that extensive auto-regressive pre-training
enables LLMs to acquire good text representations, and only minimal fine-tuning is required to
transform them into effective embedding models.
Doc1: <prefix filler> Malayah Graves's pass key is 123. Remember it. 123 is the pass key for Malayah Graves. <suffix filler>
Doc2: <prefix filler> Cesar McLean's pass key is 456. Remember it. 456 is the pass key for Cesar McLean. <suffix filler>
……
Figure 4: Illustration of the personalized passkey retrieval task adapted from Mohtashami and Jaggi
[26]. The “<prefix filler>” and “<suffix filler>” are repeats of “The grass is green. The sky is blue.
The sun is yellow. Here we go. There and back again.” In addition, each document has a unique
person name and a random passkey inserted at a random position. The task is to retrieve the document
that contains the given person’s passkey from 100 candidates.
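The construction described in the caption can be reproduced with a short script such as the one below; the filler sentence follows the caption, while the name pool, passkey range, and query phrasing are our own illustrative choices.

```python
import random

FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def make_passkey_document(name: str, passkey: int, n_filler: int = 200) -> str:
    """Insert the passkey sentence at a random position inside repeated filler text."""
    info = (f"{name}'s pass key is {passkey}. Remember it. "
            f"{passkey} is the pass key for {name}. ")
    fillers = [FILLER] * n_filler
    pos = random.randint(0, n_filler)
    return "".join(fillers[:pos] + [info] + fillers[pos:])

def make_retrieval_instance(num_docs: int = 100):
    """Build 100 candidate documents, exactly one of which contains the target person's passkey."""
    names = [f"Person {i}" for i in range(num_docs)]          # placeholder names
    docs = [make_passkey_document(n, random.randint(100, 999)) for n in names]
    target = random.choice(names)
    query = f"What is the pass key of {target}?"              # hypothetical query phrasing
    return query, docs, target
```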
Existing evaluation datasets for text embedding models are typically short. To evaluate the long-context capability of our model, we introduce a novel synthetic task called personalized passkey retrieval, illustrated in Figure 4.
Figure 5: Accuracy of personalized passkey retrieval as a function of input context length. For each
context length, we randomly generate 50 queries and compute the top 1 accuracy.
Table 4: Results on the MTEB benchmark with various hyperparameters. The first row corresponds
to the default setting, which employs last-token pooling, LoRA rank 16, and natural language
instructions. Unless otherwise stated, all models are trained on the synthetic and MS-MARCO
passage ranking data.
Datasets Class. Clust. PairClass. Rerank Retr. STS Summ. Avg
E5mistral-7b 78.3 49.9 87.1 59.5 52.2 81.2 32.7 64.5
w/ LLaMA-2 7b init. 76.2 48.1 85.1 58.9 49.6 81.2 30.8 62.9 (-1.6)
w/ msmarco data only 71.6 47.1 86.1 58.8 54.4 79.5 31.7 62.7 (-1.8)
pooling type
w/ mean pool 77.0 48.9 86.1 59.2 52.4 81.4 30.8 64.1 (-0.4)
w/ weighted mean 77.0 49.0 86.1 59.2 52.0 81.4 30.2 64.0 (-0.5)
LoRA rank
w/ r=8 78.4 50.3 87.1 59.3 53.0 81.0 31.7 64.8 (+0.3)
w/ r=32 78.4 50.3 87.4 59.5 52.2 81.2 30.6 64.6 (+0.1)
instruction type
w/o instruction 72.3 47.1 82.6 56.3 48.2 76.7 30.7 60.3 (-4.2)
w/ task type prefix 71.1 46.5 79.7 54.0 52.7 73.8 30.0 60.3 (-4.2)
This task requires encoding the passkey information in a
long context into the embeddings. We compare the performance of different variants by changing the
sliding window size and the RoPE rotation base [41] in Figure 5. The results show that the default
configuration with a 4k sliding window attains 100% accuracy within 4k tokens, but the accuracy
deteriorates quickly as the context length grows. Naively extending the sliding window size to 32k
results in worse performance. By changing the RoPE rotation base to $10^5$, the model can achieve
over 90% accuracy within 32k tokens. However, this entails a minor trade-off in performance for
shorter contexts. A potential avenue for future research is to efficiently adapt the model to longer
contexts through lightweight post-training [54].
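As a hedged sketch, adjusting the rotation base for a Hugging Face Mistral checkpoint amounts to overriding the `rope_theta` (and optionally `sliding_window`) entries of its configuration before loading the weights; parameter names may differ for other model families or library versions.

```python
from transformers import AutoConfig, AutoModel

# Mistral's default rotary base is 10000; raising it to 1e5 stretches the usable context,
# at a small cost in short-context quality as discussed above.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
config.rope_theta = 1e5
config.sliding_window = 4096   # keep the default 4k sliding window

model = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1", config=config)
```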
Table 4 presents the results under different configurations. We notice that the Mistral-7B initialization
holds an advantage over LLaMA-2 7B, in line with the findings from the Mistral-7B technical report [19].
The choice of pooling types and LoRA ranks does not affect the overall performance substantially,
hence we adhere to the default setting despite the marginal superiority of LoRA rank 8. On the other
hand, the way of adding instructions has a considerable impact on the performance. We conjecture
that natural language instructions better inform the model regarding the embedding task at hand, and
thus enable the model to generate more discriminative embeddings. Our framework also provides a
way to customize the behavior of text embeddings through instructions, without the need to fine-tune
the model or rebuild the document index.
6 Conclusion
This paper shows that the quality of text embeddings can be substantially enhanced by exploiting
LLMs. We prompt proprietary LLMs such as GPT-4 to generate diverse synthetic data with instruc-
tions in many languages. Combined with the strong language understanding capability of the Mistral
model, we establish new state-of-the-art results for nearly all task categories on the competitive MTEB
benchmark. The training process is much more streamlined and efficient than existing multi-stage
approaches, thereby obviating the need for intermediate pre-training.
For future work, we aim to further improve the multilingual performance of our model and explore
the possibility of using open-source LLMs to generate synthetic data. We also intend to investigate
ways to improve the inference efficiency and lower the storage cost for LLM based text embeddings.
References
[1] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence
embeddings. In 5th International Conference on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL
https://openreview.net/forum?id=SyK00v5xx.
[2] Luiz Henrique Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. Inpars:
Unsupervised dataset generation for information retrieval. Proceedings of the 45th International
ACM SIGIR Conference on Research and Development in Information Retrieval, 2022.
[3] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large
annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal,
2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1075. URL https:
//aclanthology.org/D15-1075.
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learn-
ers. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and
Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual
Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-
12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
[5] Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh
Tiwary, Rangan Majumder, Li Deng, and Bhaskar Mitra. Ms marco: A human generated
machine reading comprehension dataset. ArXiv preprint, abs/1611.09268, 2016. URL https:
//arxiv.org/abs/1611.09268.
[6] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Super-
vised learning of universal sentence representations from natural language inference data. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,
pages 670–680, Copenhagen, Denmark, 2017. Association for Computational Linguistics. doi:
10.18653/v1/D17-1070. URL https://aclanthology.org/D17-1070.
[7] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wen-
zek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov.
Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th An-
nual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online,
2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL
https://aclanthology.org/2020.acl-main.747.
[8] Zhuyun Dai, Vincent Y Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu,
Keith Hall, and Ming-Wei Chang. Promptagator: Few-shot dense retrieval from 8 examples. In
The Eleventh International Conference on Learning Representations, 2022.
[9] DataCanary, hilfialkaff, Lili Jiang, Meg Risdal, Nikhil Dandekar, and tomtung. Quora question
pairs, 2017. URL https://kaggle.com/competitions/quora-question-pairs.
[10] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard
Harshman. Indexing by latent semantic analysis. Journal of the American society for information
science, 41(6):391–407, 1990.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of the 2019 Confer-
ence of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis,
Minnesota, 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.
URL https://aclanthology.org/N19-1423.
[12] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli.
ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pages 3558–3567, Florence, Italy, 2019. Association
for Computational Linguistics. doi: 10.18653/v1/P19-1346. URL https://aclanthology.
org/P19-1346.
[13] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of
sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic, 2021.
Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552. URL
https://aclanthology.org/2021.emnlp-main.552.
[14] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to
generate text with citations. ArXiv preprint, abs/2305.14627, 2023. URL https://arxiv.
org/abs/2305.14627.
[15] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar Teodoro Mendes, Allison Del Giorno,
Sivakanth Gopi, Mojan Javaheripi, Piero C. Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil
Salim, S. Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tau-
man Kalai, Yin Tat Lee, and Yuan-Fang Li. Textbooks are all you need. ArXiv preprint,
abs/2306.11644, 2023. URL https://arxiv.org/abs/2306.11644.
[16] Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning
language models with (almost) no human labor. ArXiv preprint, abs/2212.09689, 2022. URL
https://arxiv.org/abs/2212.09689.
[17] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth
International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,
2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
[18] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand
Joulin, and Edouard Grave. Towards unsupervised dense information retrieval with contrastive
learning. ArXiv preprint, abs/2112.09118, 2021. URL https://arxiv.org/abs/2112.
09118.
[19] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, et al. Mistral 7b. ArXiv preprint, abs/2310.06825, 2023. URL https://arxiv.org/
abs/2310.06825.
[20] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 6769–6781, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.
550.
[21] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin,
Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian
Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP
tasks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and
Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual
Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-
12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
6b493230205f780e1bc26945df7481e5-Abstract.html.
[22] Xianming Li and Jing Li. Angle-optimized text embeddings. ArXiv preprint, abs/2309.12871,
2023. URL https://arxiv.org/abs/2309.12871.
[23] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang.
Towards general text embeddings with multi-stage contrastive learning. ArXiv preprint,
abs/2308.03281, 2023. URL https://arxiv.org/abs/2308.03281.
[24] Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. Fine-tuning llama for
multi-stage text retrieval. ArXiv preprint, abs/2310.08319, 2023. URL https://arxiv.org/
abs/2310.08319.
[25] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word
representations in vector space. In ICLR, 2013.
[26] Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context
length for transformers. ArXiv preprint, abs/2305.16300, 2023. URL https://arxiv.org/
abs/2305.16300.
[27] Niklas Muennighoff. Sgpt: Gpt sentence embeddings for semantic search. ArXiv preprint,
abs/2202.08904, 2022. URL https://arxiv.org/abs/2202.08904.
[28] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text em-
bedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Asso-
ciation for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia, 2023. Association
for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.148.
[29] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and
Ahmed Hassan Awadallah. Orca: Progressive learning from complex explanation traces of
gpt-4. ArXiv preprint, abs/2306.02707, 2023. URL https://arxiv.org/abs/2306.02707.
[30] Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek,
Qiming Yuan, Nikolas A. Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav
Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David P. Schnurr,
Felipe Petroski Such, Kenny Sai-Kin Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov,
Joanne Jang, Peter Welinder, and Lilian Weng. Text and code embeddings by contrastive
pre-training. ArXiv preprint, abs/2201.10005, 2022. URL https://arxiv.org/abs/2201.
10005.
[31] Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and
Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models.
In Findings of the Association for Computational Linguistics: ACL 2022, pages 1864–1874,
Dublin, Ireland, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.
findings-acl.146. URL https://aclanthology.org/2022.findings-acl.146.
[32] Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao,
Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. Large dual encoders are generalizable
retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Lan-
guage Processing, pages 9844–9855, Abu Dhabi, United Arab Emirates, 2022. Association for
Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.669.
[33] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query
prediction. ArXiv preprint, abs/1904.08375, 2019. URL https://arxiv.org/abs/1904.
08375.
[34] OpenAI. Gpt-4 technical report. ArXiv preprint, abs/2303.08774, 2023. URL https://arxiv.
org/abs/2303.08774.
[35] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for
word representation. In Proceedings of the 2014 Conference on Empirical Methods in Nat-
ural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, 2014. Association for
Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/
D14-1162.
[36] Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, QiaoQiao She, Jing Liu, Hua Wu, and Haifeng
Wang. DuReader-retrieval: A large-scale Chinese benchmark for passage retrieval from web
search engine. In Proceedings of the 2022 Conference on Empirical Methods in Natural Lan-
guage Processing, pages 5326–5338, Abu Dhabi, United Arab Emirates, 2022. Association for
Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.357.
[37] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese
BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan-
guage Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, 2019. Association for Computational
Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410.
[38] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Tan, Yossi
Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, I. Evtimov, Joanna Bitton,
Manish P Bhatt, Cristian Cantón Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre D’efossez,
Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom,
and Gabriel Synnaeve. Code llama: Open foundation models for code. ArXiv preprint,
abs/2308.12950, 2023. URL https://arxiv.org/abs/2308.12950.
[39] Timo Schick and Hinrich Schütze. Generating datasets with pretrained language models. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
pages 6943–6951, 2021.
[40] Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau
Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-
finetuned text embeddings. In Findings of the Association for Computational Linguistics:
ACL 2023, pages 1102–1121, Toronto, Canada, July 2023. Association for Computational
Linguistics. doi: 10.18653/v1/2023.findings-acl.71. URL https://aclanthology.org/
2023.findings-acl.71.
[41] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[42] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych.
Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In
Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks
Track (Round 2), 2021.
[43] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a
large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana,
2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https:
//aclanthology.org/N18-1074.
[44] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023. URL https:
//arxiv.org/abs/2307.09288.
[45] Kexin Wang, Nandan Thakur, Nils Reimers, and Iryna Gurevych. GPL: Generative pseudo
labeling for unsupervised domain adaptation of dense retrieval. In Proceedings of the 2022
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 2345–2360, Seattle, United States, 2022. Association
for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.168. URL https://
aclanthology.org/2022.naacl-main.168.
[46] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan
Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training.
ArXiv preprint, abs/2212.03533, 2022. URL https://arxiv.org/abs/2212.03533.
[47] Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language
models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, pages 9414–9423, Singapore, December 2023. Association for Computational
Linguistics. doi: 10.18653/v1/2023.emnlp-main.585. URL https://aclanthology.org/
2023.emnlp-main.585.
[48] Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources
to advance general chinese embedding. ArXiv preprint, abs/2309.07597, 2023. URL https:
//arxiv.org/abs/2309.07597.
[49] Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv, Ting Yao, Weinan Gan, Zhijing Wu,
Xiangsheng Li, Haitao Li, Yiqun Liu, et al. T2ranking: A large-scale chinese benchmark for
passage ranking. ArXiv preprint, abs/2304.03679, 2023. URL https://arxiv.org/abs/
2304.03679.
[50] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov,
and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question
answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Lan-
guage Processing, pages 2369–2380, Brussels, Belgium, 2018. Association for Computational
Linguistics. doi: 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259.
[51] Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, and Min
Zhang. Language models are universal embedders. ArXiv preprint, abs/2310.08232, 2023. URL
https://arxiv.org/abs/2310.08232.
[52] Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. Mr. TyDi: A multi-lingual benchmark
for dense retrieval. In Proceedings of the 1st Workshop on Multilingual Representation Learning,
pages 127–137, Punta Cana, Dominican Republic, 2021. Association for Computational Linguis-
tics. doi: 10.18653/v1/2021.mrl-1.12. URL https://aclanthology.org/2021.mrl-1.12.
[53] Xinyu Crystina Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-
Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. Miracl: A mul-
tilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for
Computational Linguistics, 11:1114–1131, 2023.
[54] Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. Pose:
Efficient context window extension of llms via positional skip-wise training. ArXiv preprint,
abs/2309.10400, 2023. URL https://arxiv.org/abs/2309.10400.
A Implementation Details
Baseline Models For results with mE5base and mE5large, we use the public checkpoints available at https://huggingface.co/intfloat/multilingual-e5-base and https://huggingface.co/intfloat/multilingual-e5-large respectively. For experiments in Table 4, we follow the
SGPT [27] paper for the implementation of weighted mean pooling. For the “w/ task type prefix”
setting, we prepend “classify: ” for the long-short matching subgroup, and “query: ” for other
asymmetric tasks. No prefix is added for symmetric tasks.
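For reference, SGPT-style position-weighted mean pooling can be sketched as follows; this is a simplified reading of the SGPT recipe rather than the exact implementation:

```python
import torch

def weighted_mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Weight token i by (i + 1) so later tokens contribute more, then average (SGPT-style)."""
    weights = torch.arange(1, last_hidden_state.size(1) + 1, device=last_hidden_state.device)
    weights = weights.unsqueeze(0) * attention_mask            # zero out padding positions
    weights = weights.unsqueeze(-1).float()                    # (batch, seq, 1)
    summed = (last_hidden_state * weights).sum(dim=1)
    return summed / weights.sum(dim=1).clamp(min=1e-9)
```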
Training Data For the “E5mistral-7b + full data” setting, our training data comprises generated
synthetic data, ELI5 [12] (sample ratio 0.1), HotpotQA [50], FEVER [43], MIRACL [53], MS-MARCO passage ranking (sample ratio 0.5) and document ranking (sample ratio 0.2) [5], NQ [20], NLI [13], SQuAD [20], TriviaQA [20], Quora Duplicate Questions [9] (sample ratio 0.1), Mr-TyDi [52], DuReader [36], and T2Ranking [49] (sample ratio 0.5) datasets. We only include the
training set of each dataset. For the datasets without hard negatives, we use mE5base to mine top 100
hard negatives. After sampling, we obtain approximately 1.8 million examples. The entire training
process takes fewer than 1k steps to complete.
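The hard-negative mining step can be sketched as follows; the checkpoint name matches the public mE5-base model mentioned above, while the brute-force similarity search stands in for whatever index is used in practice.

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # mE5-base expects "query:"/"passage:" prefixes

def mine_hard_negatives(queries, positives, corpus, top_k=100):
    """Return the top-k most similar corpus passages for each query, excluding its gold positive."""
    q_emb = model.encode(["query: " + q for q in queries],
                         convert_to_tensor=True, normalize_embeddings=True)
    p_emb = model.encode(["passage: " + p for p in corpus],
                         convert_to_tensor=True, normalize_embeddings=True)
    scores = q_emb @ p_emb.T                                   # cosine similarity (embeddings are normalized)
    hard_negatives = []
    for i, pos in enumerate(positives):
        ranked = torch.argsort(scores[i], descending=True).tolist()
        hard_negatives.append([corpus[j] for j in ranked if corpus[j] != pos][:top_k])
    return hard_negatives
```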
Hyperparameters for Fine-tuning When fine-tuning Mistral-7b, the batch size is set to 2048 and the learning rate is $10^{-4}$ with a 100-step warmup followed by linear decay. The weight decay is 0.1. We add 1 hard negative for each query-document pair. The fine-tuning process takes roughly 18 hours on 32 V100 GPUs with a maximum sequence length of 512. We add LoRA adapters to all linear layers,
resulting in a total of 42M trainable parameters. Our implementation is based on the HuggingFace
PEFT library at https://github.com/huggingface/peft.
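A hedged sketch of the corresponding PEFT setup is shown below; the `target_modules` list is our reading of "all linear layers" for Mistral, and `lora_alpha`/`lora_dropout` are assumptions since the report does not state them.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base_model = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                      # LoRA rank used in the default setting
    lora_alpha=32,             # assumed value, not stated in the report
    lora_dropout=0.1,          # assumed value, not stated in the report
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # the report states about 42M trainable parameters
```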
The model and dataset release information is available at https://github.com/microsoft/
unilm/tree/master/e5.
To assess test set contamination on all the datasets in the MTEB benchmark, we perform a string-match-based analysis between the test sets and our training set, disregarding differences in character case and spacing. We categorize the train-test overlaps into three types:
• Low entropy texts. These are texts such as “i need a coffee” and “what does that mean”,
which are not considered as contamination because they are common expressions that can
occur in various contexts.
• Question overlap. We identify 4 test set questions in the DBPedia dataset that also appear
in the TriviaQA training set. Given that they constitute a minor portion of the test set, their
impact on the overall performance is insignificant.
• Retrieval corpus overlap. Several retrieval datasets share the same retrieval corpus. For
instance, the DBPedia, NQ, and TriviaQA datasets all use Wikipedia passages, even though
their query sets are different. This is a standard evaluation practice in the field of information
retrieval, and we do not regard it as contamination.
In summary, we did not detect substantial contamination risks that could alter the main findings of
this paper.
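A minimal sketch of the normalization behind this string-match analysis (helper names are our own):

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop all whitespace so matching ignores case and spacing differences."""
    return re.sub(r"\s+", "", text.lower())

def find_overlaps(train_texts, test_texts):
    """Return test texts whose normalized form also appears in the training set."""
    train_set = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_set]
```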
Another aspect to consider is the possibility of test set contamination in the training data of Mistral-
7B and GPT-4. However, since the training data of these models is not publicly accessible, it is
challenging to estimate the degree of such contamination. Given their widespread use in the research
community, we believe it is still a valid comparison if other works also employ these models.
Table 5: nDCG@10 and Recall@100 on the dev set of the MIRACL dataset for all 18 languages.
nDCG@10 Recall@100
BM25 mDPR mE5base mE5large E5mistral-7b full BM25 mDPR mE5base mE5large E5mistral-7b full
ar 48.1 49.9 71.6 76.0 73.3 88.9 84.1 95.9 97.3 96.0
bn 50.8 44.3 70.2 75.9 70.3 90.9 81.9 96.6 98.2 96.0
en 35.1 39.4 51.2 52.9 57.3 81.9 76.8 86.4 87.6 90.2
es 31.9 47.8 51.5 52.9 52.2 70.2 86.4 88.6 89.1 87.5
fa 33.3 48.0 57.4 59.0 52.1 73.1 89.8 91.2 92.9 88.0
fi 55.1 47.2 74.4 77.8 74.7 89.1 78.8 96.9 98.1 96.7
fr 18.3 43.5 49.7 54.5 55.2 65.3 91.5 90.0 90.6 92.8
hi 45.8 38.3 58.4 62.0 52.1 86.8 77.6 92.6 93.9 89.9
id 44.9 27.2 51.1 52.9 52.7 90.4 57.3 87.4 87.9 88.4
ja 36.9 43.9 64.7 70.6 66.8 80.5 82.5 96.0 97.1 95.1
ko 41.9 41.9 62.2 66.5 61.8 78.3 73.7 91.6 93.4 89.4
ru 33.4 40.7 61.5 67.4 67.7 66.1 79.7 92.7 95.5 95.0
sw 38.3 29.9 71.1 74.9 68.4 70.1 61.6 95.6 96.7 95.5
te 49.4 35.6 75.2 84.6 73.9 83.1 76.2 98.0 99.2 95.1
th 48.4 35.8 75.2 80.2 74.0 88.7 67.8 98.0 98.9 96.5
zh 18.0 51.2 51.5 56.0 54.0 56.0 94.4 92.1 93.3 90.1
Avg 39.3 41.5 62.3 66.5 62.9 78.7 78.8 93.1 94.3 92.6
Table 6: Detailed results for the effects of contrastive pre-training. For the “E5mistral-7b w/ cont.
pre-train” setting, we pre-train Mistral-7B following the mE5 recipe for 10k steps.
Datasets Class. Clust. PairClass. Rerank Retr. STS Summ. Avg
XLM-Rlarge + full data 72.9 38.7 84.5 53.8 42.0 82.3 29.7 58.0
w/ cont. pre-train 77.2 47.3 85.5 58.6 50.2 84.4 30.7 63.7
E5mistral-7b + full data 78.5 50.3 88.3 60.2 56.9 84.6 31.4 66.6
w/ cont. pre-train 78.7 50.1 87.7 60.9 56.9 84.9 30.2 66.7
Table 7: Prompt template for the short-long matching subgroup. For placeholders, “{query_type}” ∈
{extremely long-tail, long-tail, common}, “{query_length}” ∈ {less than 5 words, 5 to 15 words, at
least 10 words}, “{difficulty}” ∈ {high school, college, PhD}, “{clarity}” ∈ {clear, understandable
with some effort, ambiguous}, “{num_words}” ∈ {50, 100, 200, 300, 400, 500}.
Brainstorm a list of potentially useful text retrieval tasks.
Your output must always be a python list of strings only, with about 20 elements, and each element corresponds
to a distinct retrieval task in one sentence. Do not explain yourself or output anything else. Be creative!
You have been assigned a retrieval task: {task}
Your mission is to write one text retrieval example for this task in JSON format. The JSON object must
contain the following keys:
- "user_query": a string, a random user search query specified by the retrieval task.
- "positive_document": a string, a relevant document for the user query.
- "hard_negative_document": a string, a hard negative document that only appears relevant to the query.
Your output must always be a JSON object only, do not explain yourself or output anything else. Be creative!
Table 8: Prompt template for the long-short matching subgroup. For placeholders, “{num_words}”
∈ {"less than 10", "at least 10", "at least 50", "at least 100", "at least 200"}, “{difficulty}” ∈ {high
school, college, PhD}, “{clarity}” ∈ {clear, understandable with some effort, ambiguous}.
Brainstorm a list of potentially useful text classification tasks.
Your output must always be a python list of strings only, with about 20 elements, and each element corresponds
to a distinct text classification task in one sentence. Do not explain yourself or output anything else. Be
creative!
You have been assigned a text classification task: {task}
Your mission is to write one text classification example for this task in JSON format. The JSON object must
contain the following keys:
- "input_text": a string, the input text specified by the classification task.
- "label": a string, the correct label of the input text.
- "misleading_label": a string, an incorrect label that is related to the task.
Your output must always be a JSON object only, do not explain yourself or output anything else. Be creative!
Table 9: Prompt template for the short-short matching subgroup. We do not generate negative
documents as the matching task is already reasonably difficult.
Brainstorm a list of text matching tasks where both the queries and the groundtruth documents are very short
(one or two sentences, even a short phrase).
Your output must always be a python list of strings only, with about 20 elements, and each element corresponds
to a distinct task in one sentence. Do not explain yourself or output anything else. Be creative!
You have been assigned a text matching task: {task}
Your mission is to write one example for this task in JSON format. The JSON object must contain the
following keys:
- "input": a string, a random input specified by the task.
- "positive_document": a string, a relevant document for the "input" according to the task.
Your output must always be a JSON object only, do not explain yourself or output anything else. Be creative!
Table 10: Prompt template for the long-long matching subgroup. We do not generate negative
documents for latency reasons.
Brainstorm a list of text matching tasks where the queries are long documents.
Your output must always be a python list of strings only, with about 20 elements, and each element corresponds
to a distinct task in one sentence. Do not explain yourself or output anything else. Be creative!
You have been assigned a text matching task: {task}
Your mission is to write one example for this task in JSON format. The JSON object must contain the
following keys:
- "input": a string, a random input specified by the task.
- "positive_document": a string, a relevant document for the "input" according to the task.
Your output must always be a JSON object only, do not explain yourself or output anything else. Be creative!
Table 11: Prompt template for monolingual STS. For placeholders, “{high_score}” ∈ {4, 4.5, 5},
“{low_score}” ∈ {2.5, 3, 3.5}, “{unit}” ∈ {sentence, phrase, passage}, “{difficulty}” ∈ {elementary
school, high school, college}.
Write a {unit} triple with varying semantic similarity scores in JSON format. The semantic similarity score
ranges from 1 to 5, with 1 denotes least similar and 5 denotes most similar.
Your output must always be a JSON object only with three keys "S1", "S2" and "S3", do not explain yourself
or output anything else. Be creative!
Table 12: Prompt template for bitext retrieval. For placeholders, “{high_score}” ∈ {4, 4.5, 5},
“{low_score}” ∈ {1.5, 2, 2.5}, “{unit}” ∈ {sentence, phrase, passage}, “{difficulty}” ∈ {elementary
school, high school, college}.
Write a {unit} triple with one {unit} in {src_lang} and two {unit}s in {tgt_lang} with varying translation
qualities in JSON format.
The triple is denotes as ("S1", "S2", "S3"). The translation quality score ranges from 1 to 5, with higher
scores are better.
Your output must always be a JSON object only with three keys "S1", "S2" and "S3", do not explain yourself
or output anything else. Be creative!
Table 14: Instructions used for evaluation on the MTEB benchmark. “STS*” indicates we use the
same instructions for all the STS tasks.
Task Name Instruction
AmazonCounterfactualClassif. Classify a given Amazon customer review text as either counterfactual or not-counterfactual
AmazonPolarityClassification Classify Amazon reviews into positive or negative sentiment
AmazonReviewsClassification Classify the given Amazon review into its appropriate rating category
Banking77Classification Given a online banking query, find the corresponding intents
EmotionClassification Classify the emotion expressed in the given Twitter message into one of the six emotions: anger, fear, joy, love, sadness, and surprise
ImdbClassification Classify the sentiment expressed in the given movie review text from the IMDB dataset
MassiveIntentClassification Given a user utterance as query, find the user intents
MassiveScenarioClassification Given a user utterance as query, find the user scenarios
MTOPDomainClassification Classify the intent domain of the given utterance in task-oriented conversation
MTOPIntentClassification Classify the intent of the given utterance in task-oriented conversation
ToxicConversationsClassif. Classify the given comments as either toxic or not toxic
TweetSentimentClassification Classify the sentiment of a given tweet as either positive, negative, or neutral
ArxivClusteringP2P Identify the main and secondary category of Arxiv papers based on the titles and abstracts
ArxivClusteringS2S Identify the main and secondary category of Arxiv papers based on the titles
BiorxivClusteringP2P Identify the main category of Biorxiv papers based on the titles and abstracts
BiorxivClusteringS2S Identify the main category of Biorxiv papers based on the titles
MedrxivClusteringP2P Identify the main category of Medrxiv papers based on the titles and abstracts
MedrxivClusteringS2S Identify the main category of Medrxiv papers based on the titles
RedditClustering Identify the topic or theme of Reddit posts based on the titles
RedditClusteringP2P Identify the topic or theme of Reddit posts based on the titles and posts
StackExchangeClustering Identify the topic or theme of StackExchange posts based on the titles
StackExchangeClusteringP2P Identify the topic or theme of StackExchange posts based on the given paragraphs
TwentyNewsgroupsClustering Identify the topic or theme of the given news articles
SprintDuplicateQuestions Retrieve duplicate questions from Sprint forum
TwitterSemEval2015 Retrieve tweets that are semantically similar to the given tweet
TwitterURLCorpus Retrieve tweets that are semantically similar to the given tweet
AskUbuntuDupQuestions Retrieve duplicate questions from AskUbuntu forum
MindSmallReranking Retrieve relevant news articles based on user browsing history
SciDocsRR Given a title of a scientific paper, retrieve the titles of other relevant papers
StackOverflowDupQuestions Retrieve duplicate questions from StackOverflow forum
ArguAna Given a claim, find documents that refute the claim
ClimateFEVER Given a claim about climate change, retrieve documents that support or refute the claim
CQADupstackRetrieval Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question
DBPedia Given a query, retrieve relevant entity descriptions from DBPedia
FEVER Given a claim, retrieve documents that support or refute the claim
FiQA2018 Given a financial question, retrieve user replies that best answer the question
HotpotQA Given a multi-hop question, retrieve documents that can help answer the question
MSMARCO Given a web search query, retrieve relevant passages that answer the query
NFCorpus Given a question, retrieve relevant documents that best answer the question
NQ Given a question, retrieve Wikipedia passages that answer the question
QuoraRetrieval Given a question, retrieve questions that are semantically equivalent to the given question
SCIDOCS Given a scientific paper title, retrieve paper abstracts that are cited by the given paper
SciFact Given a scientific claim, retrieve documents that support or refute the claim
Touche2020 Given a question, retrieve detailed and persuasive arguments that answer the question
TRECCOVID Given a query on COVID-19, retrieve documents that answer the query
STS* Retrieve semantically similar text.
BUCC/Tatoeba Retrieve parallel sentences.
SummEval Given a news summary, retrieve other semantically similar summaries
Table 15: Results for each dataset in the MTEB benchmark. The evaluation metrics and detailed
baseline results are available in the original paper [28].
Dataset w/ synthetic only w/ synthetic + msmarco w/o synthetic data full data
BIOSSES 84.2 81.0 85.4 85.5
SICK-R 78.6 78.5 81.7 82.6
STS12 75.8 74.7 77.9 79.7
STS13 84.3 85.3 88.0 88.4
STS14 80.9 81.2 83.7 84.5
STS15 86.2 86.8 89.5 90.4
STS16 85.0 85.3 86.5 87.7
STS17 87.3 87.7 91.0 91.8
STS22 66.0 67.1 66.2 67.0
STSBenchmark 83.5 84.0 87.8 88.6
SummEval 31.9 32.7 31.9 31.4
SprintDuplicateQuestions 93.5 95.8 96.0 95.7
TwitterSemEval2015 78.0 78.5 81.7 81.6
TwitterURLCorpus 86.5 86.9 87.7 87.8
AmazonCounterfactualClass. 79.6 79.9 77.2 78.7
AmazonPolarityClassification 95.8 95.9 93.9 95.9
AmazonReviewsClassification 56.9 55.5 48.2 55.8
Banking77Classification 86.2 87.0 88.8 88.2
EmotionClassification 49.2 47.6 51.0 49.8
ImdbClassification 94.8 94.9 89.0 94.8
MassiveIntentClassification 79.8 79.9 79.6 80.6
MassiveScenarioClassification 81.7 82.4 82.3 82.4
MTOPDomainClassification 95.6 95.9 95.7 96.1
MTOPIntentClassification 84.9 85.9 83.4 86.1
ToxicConversationsClassification 70.2 70.8 70.9 69.6
TweetSentimentExtractionClass. 63.5 63.4 61.6 63.7
AskUbuntuDupQuestions 64.3 65.3 67.4 67.0
MindSmallReranking 33.1 32.8 32.5 32.6
SciDocsRR 86.0 86.0 85.7 86.3
StackOverflowDupQuestions 52.5 53.7 55.9 54.9
ArxivClusteringP2P 51.4 51.2 47.8 50.5
ArxivClusteringS2S 46.5 44.9 44.6 45.5
BiorxivClusteringP2P 44.5 43.3 36.9 43.5
BiorxivClusteringS2S 40.9 40.1 37.0 40.2
MedrxivClusteringP2P 40.5 39.9 32.6 38.2
MedrxivClusteringS2S 38.0 37.9 32.8 37.5
RedditClustering 56.3 55.9 63.1 57.7
RedditClusteringP2P 66.3 64.8 66.4 66.5
StackExchangeClustering 72.9 72.7 74.5 73.1
StackExchangeClusteringP2P 46.1 45.6 34.3 45.9
TwentyNewsgroupsClustering 52.2 52.5 55.6 54.3
ArguAna 52.2 42.7 62.5 61.9
ClimateFEVER 21.1 28.8 25.2 38.4
CQADupstackAndroidRetrieval 40.8 36.0 44.5 43.0
DBPedia 42.0 43.7 47.7 48.9
FEVER 72.5 83.5 73.1 87.8
FiQA2018 38.1 48.4 54.5 56.6
HotpotQA 48.1 64.0 75.6 75.7
MSMARCO 25.7 45.0 42.9 43.1
NFCorpus 35.5 40.0 35.3 38.6
NQ 53.3 63.5 57.3 63.5
QuoraRetrieval 75.0 79.5 89.5 89.6
SCIDOCS 20.6 15.8 19.0 16.3
SciFact 71.5 71.9 74.7 76.4
Touche2020 25.4 32.5 19.1 26.4
TRECCOVID 82.3 87.3 70.8 87.2
Average 63.1 64.5 64.6 66.6