RAG Beyond Text: Enhancing Image Retrieval in RAG Systems
Abstract—This paper presents a novel methodology for the extraction and retrieval of images in RAG (Retrieval Augmented Generation) powered Question Answering conversational systems that circumvents the limitations of traditional Optical Character Recognition and Large Language Model (OCR-LLM) powered image retrieval approaches. We leverage the positional information of images in a wide array of multi-modal (text/image) documents to ingest image information alongside text, followed by advanced retrieval and prompt engineering techniques, to develop a RAG system that maintains the integrity of the correlation between textual and visual data in responses to queries pertaining to both text and images in QnA solutions, and that is adept at retrieving both OCR-compatible and OCR-incompatible images. We have successfully applied this approach to a variety of multimodal documents ranging from research papers, application documentation, and surveys to guides and manuals containing text, images, and even tables with images, and achieved SoTA (State of The Art) performance on simple to complex queries asked over these documents.

Furthermore, our approach performed markedly better in cases where vision models like GPT-4 Vision fail to accurately retrieve images that are OCR-incompatible and pertain to highly customized scientific devices or diagrams, and in cases where an image's visual representation is not semantically aligned with the textual information but is important to retrieve for completeness of the response.

Index Terms—Document Question Answering, GPT-4, Large Language Models, Langchain, MuRAG (Multi-modal Retrieval Augmented Generation), RAG (Retrieval Augmented Generation)

I. INTRODUCTION

where the accurate retrieval and contextualization of images are of utmost importance.

In this paper, we introduce an innovative solution that transcends the traditional OCR-LLM framework, offering an efficient and accurate method for retrieving multiple images aligned with textual responses. Furthermore, we address the critical need to maintain the continuity between textual and visual elements within documents, allowing the language model to provide the user with step-by-step assistance whose textual information is aligned with the respective relevant image(s). Traditional methods often disrupt this continuity, leading to a disjointed representation of information. Our methodology ensures that the spatial arrangement of images in relation to the text is preserved, thereby upholding the document's structural integrity and enabling highly accurate QnA.

This paper presents two approaches to image retrieval in Retrieval Augmented Generation systems that enhance the accuracy and efficiency of Question Answering:

A. Traditional OCR-LLM Based Image Retrieval

Image content extraction with OCR + captioning with an LLM: this initial method combines Optical Character Recognition, used to extract the textual content of images, with an LLM that creates a meaningful caption (typically 2-6 words, though not limited to that) from the raw OCR-extracted text representing the respective image. This information, along with the image's metadata, is paired and loaded into a vector store as embeddings. While this traditional approach provides a foundational solution, it is limited by the OCR compatibility of the images, meaning that it can only retrieve relevant images if those images contain textual information that is semantically aligned with the text.
II. LITERATURE REVIEW

Previous works in RAG-based QA systems have focused on semantic congruence between text and images. Lewis et al. [1] introduced the concept of RAG (Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks) to enhance text generation with retrieved documents. However, the scope for non-textual content was limited. Subsequent research expanded on this by integrating images, but often faced challenges in retrieving contextually apt visuals when semantic alignment was absent.

Zhou et al. [2] proposed an image captioning approach that leverages news articles as additional context to describe the image. This method focuses on news image captioning, which narrows its scope and applicability compared to our more versatile approach. Zhou and Long [3], in their style-aware image captioning method, proposed captioning content based on the relevant style. This method neglects OCR compatibility and broader applicability across various multi-modal documents. Yang et al. [4] similarly proposed a retrieval-augmented visual language model that builds on the Flamingo model and utilizes an external database for relevant knowledge retrieval for few-shot image caption generation. As the model is trained on generic image data, it has no OCR capabilities and would fail for domain-specific use-cases. Other works [5-8] have also shown excellent capabilities of retrieval-augmented approaches for image captioning. However, they are suited to general use-cases; when applied to domain-specific use-cases such as research papers, user manuals, and guides, we observed that they fail to generate suitable captions.

Traditional systems often struggled with semantic limitations, relying solely on textual cues for image retrieval. Beyond that, past works exhibit a significant bottleneck in the relevancy of the retrieved context: the inability to fetch images along with text while maintaining the text-image continuity dictated by the original knowledge base, which is crucial to the context even when a direct semantic connection is lacking. Moreover, recent as well as earlier research on language models shows that, despite the advancements in RAG systems, an over-reliance on language model or Large Language Model (LLM) reasoning has been identified as a bottleneck, leading to sub-optimal results in the acquisition of the necessary textual/visual content. One work supporting this statement is Khatun and Brown [9], who find that subtle changes in prompt wording change a model's response.

III. METHODOLOGY

A. Traditional OCR-LLM Based Image Retrieval

We begin by systematically extracting images from a set of documents. For each image, we record its index within the document, the page number on which it appears, and the name of the document itself. These images are then stored in a blob storage system, ensuring that they are catalogued and retrievable for further processing. Each stored image undergoes Optical Character Recognition (OCR) to extract the embedded textual content. The OCR process is pivotal, as it converts visual information into machine-encoded text, which serves as the basis for further interpretation and analysis by the LLM. The raw text obtained from the OCR is input into a Large Language Model (LLM) for caption generation. In our implementation, we utilize the GPT-3.5 Turbo 4k model, a state-of-the-art language model known for its ability to produce concise and coherent text outputs. The model processes the OCR-extracted text to generate captions that succinctly represent the content of the images.

The LLM-generated captions are then paired with their respective image filenames, forming key-value pairs. This step facilitates the organization of image data and its corresponding textual description, which is crucial for efficient retrieval. These pairs are stored in a structured text file format and subsequently ingested into a vector store. The vector store houses embeddings of the captions, which we refer to as our 'image vector store.' Fig. 1 shows the workflow pipeline of document ingestion, which includes the image OCR and caption generation steps and the generation of the text and image vector stores.

Fig. 1. Ingestion pipeline using the OCR-LLM approach.
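The ingestion flow described above can be summarized in a short sketch. The snippet below is a minimal illustration rather than our production code: `ocr_fn` stands in for whatever OCR engine is used, and the FAISS vector store, OpenAI embeddings, and the exact LangChain package layout are assumptions that may differ from our deployment.

```python
# Minimal sketch of the OCR-LLM ingestion step (illustrative, not the authors'
# production code). `ocr_fn` is a placeholder for the OCR engine; the FAISS and
# OpenAI-embedding choices, and the package paths, are assumptions.
from pathlib import Path
from openai import OpenAI
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

client = OpenAI()

def caption_from_ocr(ocr_text: str) -> str:
    """Condense the raw OCR text into a short caption with GPT-3.5 Turbo."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Write a 2-6 word caption for an image whose OCR text is:\n"
                       + ocr_text,
        }],
    )
    return resp.choices[0].message.content.strip()

def build_image_vector_store(image_paths: list[Path], ocr_fn) -> FAISS:
    """Pair each LLM caption with its image filename and embed the captions."""
    captions, metadatas = [], []
    for path in image_paths:
        captions.append(caption_from_ocr(ocr_fn(path)))
        metadatas.append({"image_file": path.name})  # caption -> filename key-value pair
    # Embeddings of the captions form the 'image vector store'.
    return FAISS.from_texts(captions, OpenAIEmbeddings(), metadatas=metadatas)
```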
Upon receiving a user query, the system initiates a retrieval process within the text vector store. The objective is to fetch the text chunks that are most relevant to the user's question. This retrieval is guided by the contextual information encapsulated in the user's query, ensuring that the search within the vector store is focused and precise. While the textual retrieval is underway, the system concurrently conducts a similarity search within the image vector store. The aim is to identify the image caption that best aligns with the contextual cues obtained from the text vector store. The cosine similarity search computes the degree of relevance between the LLM-generated captions and the user query, surfacing the most pertinent image-caption pair. Fig. 2 shows the QnA workflow pipeline using the OCR-LLM approach, and Fig. 4 shows a more detailed workflow of the same pipeline.

Fig. 2. QnA pipeline using the OCR-LLM approach.
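A minimal sketch of this query-time step follows. It assumes `text_store` and `image_store` are LangChain-style vector stores built during ingestion; the two searches are shown sequentially for brevity (the actual system runs them concurrently), and the meaning of the returned score depends on the underlying store.

```python
# Sketch of the OCR-LLM QnA retrieval step (illustrative). `text_store` and
# `image_store` are assumed LangChain-style vector stores from ingestion.
def retrieve_context(query: str, text_store, image_store, k_text: int = 4):
    # 1) Fetch the text chunks most relevant to the user's question.
    text_chunks = text_store.similarity_search(query, k=k_text)

    # 2) Search the image vector store (run concurrently in the actual system)
    #    for the caption that best aligns with the query; the static k = 1 is
    #    the multi-image limitation discussed in the bottlenecks below.
    best_caption, score = image_store.similarity_search_with_score(query, k=1)[0]

    return {
        "text_chunks": [doc.page_content for doc in text_chunks],
        "image_file": best_caption.metadata["image_file"],
        "caption_score": score,  # e.g. the 0.56 seen for the AWS role-switch query
    }
```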
1) Solution Bottlenecks:
• Dependence on OCR and Captioning Quality: The system's performance relies heavily on OCR quality and captioning. Non-text images, like scientific figures or flowcharts, pose a challenge, as OCR's inability to find text leads to irrelevant captions. This is exemplified by Fig. 3, where a scientific image lacks text, resulting in an unrelated caption and unsuccessful image retrieval.
• Semantic Similarity and Retrieval Issues: Retrieval depends on semantic similarity between user queries and captions, which can cause inaccuracies. For example, a low similarity score for the query "How to switch role on AWS?" may prevent the retrieval of important images, as seen when the score is only 0.56.
• Limitations in Retrieving Multiple Images: The system's top-k selection for similarity search struggles to retrieve several relevant images for one query. A high k may include irrelevant images, while a low k might miss or omit necessary ones. The default setting of k = 1 focuses on accuracy for individual images but fails to support multi-image queries.
• Separate Image and Text Responses: Images and textual responses are provided separately, disrupting the spatial alignment and the flow of information, which could hinder the user's understanding of the content.
• Cost Inefficiency and API Demands: The solution's high cost stems from the numerous LLM calls required for captioning. For a corpus with 10,000 documents and 100 images each, about 1 million OCR and LLM calls are needed, with additional API calls for Q&A increasing costs substantially.

Fig. 3. Example response using the OCR-LLM approach.

Fig. 4. OCR-LLM QnA Pipeline Workflow Demo.
B. Image Localization Tag (ILT) for Image Retrieval

Our study introduces a technique for document image retrieval that combines the spatial and semantic alignment of images with their associated text. This approach overcomes the limitations of Optical Character Recognition (OCR) and Large Language Model (LLM) captioning in analyzing multi-modal documents. We employ bounding box computations for each image to pinpoint its location and size within the document. These bounding boxes create a vital connection between each visual component and its textual context. The PyMuPDF library is utilized to extract images and their bounding box information, which is then used to insert an Image Localization Tag (ILT) at the image's position in a modified version of the document.

1) ILT Generation: An ILT comprises a unique image identifier, the image's SHA-1 hash ID, and the image file extension. Note that for our experiments we used a truncated version of the SHA-1 hash ID. The motivation was that, for images present within tables, the ILTs tend to overlap with the table contents due to the length of full SHA-1 hash IDs. To tackle this issue, we decided to shorten the hash ID; in our case we shrank it to 8 digits using H mod 10^n, where H is the decimal (base-10) representation of the image's SHA-1 hash ID and n is the number of digits the hash ID is to be shrunk down to. Although for our experimentation we combined an image object identifier with the truncated image SHA-1 hash ID and file extension to create our ILTs, it must be noted that ILTs are highly flexible to one's use-case and needs, and are not limited to the metadata we incorporated (for example, a different value of n could be used for truncating the image hash IDs). Images are stored externally by their hash ID and extension for efficient retrieval in response to LLM queries, with an example ILT looking like <image: filename(23523473.png)>. The ILT hence serves as a contextual placeholder, or visual context marker, for the image within the document, encapsulating both its spatial coordinates and its semantic essence. This ensures that each image is not only anchored in its original location but is also inherently connected to the relevant textual information.
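The tag construction itself is a small amount of standard-library code; the sketch below reflects the H mod 10^n truncation and the <image: filename(...)> pattern described above, while the helper name and the choice of keeping the file extension in the filename are our own conventions rather than a fixed requirement.

```python
# Sketch of ILT generation (illustrative). H is the decimal (base-10) value of
# the image's SHA-1 digest; H mod 10**n keeps the identifier short enough to
# avoid overlapping with table contents.
import hashlib

def make_ilt(image_bytes: bytes, ext: str, n: int = 8) -> tuple[str, str]:
    sha1_hex = hashlib.sha1(image_bytes).hexdigest()
    h = int(sha1_hex, 16)            # decimal representation of the SHA-1 hash
    short_id = h % 10**n             # truncate to at most n digits
    filename = f"{short_id}.{ext}"   # the image is stored externally under this name
    return f"<image: filename({filename})>", filename
```

For an image whose truncated hash is 23523473 and whose extension is png, this yields the example tag <image: filename(23523473.png)> shown above.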
2) ILT Integration: In the subsequent step, we embed the ILT within the document at the precise location specified by the bounding box information obtained with the help of the PyMuPDF library. This embedding is performed with great attention to the original layout, preserving the exact region specified by the bounding box to avoid any misalignment issues. The modified document now contains a rich interplay of text and ILTs, mirroring the original structure of the document while enhancing it for advanced text retrieval capabilities. It is worth noting that the original document is preserved in our storage bucket and can be displayed as the source of a response for reference.

The document, enhanced with text and ILTs, is incorporated into our vector store, maintaining its layout and meaning while facilitating efficient multi-modal retrieval. Fig. 5 shows the overall workflow of processing the document and integrating it into the vector store. To extract segments pertinent to user queries, we employ the MMR retriever, which selects text chunks based on their cosine similarity to the query while minimizing redundancy with previously chosen chunks.
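A condensed sketch of this integration step is shown below. It uses PyMuPDF (fitz) calls for image and bounding-box extraction; replacing the image region via redaction is only one possible way to place the tag inside the image's bounding box, and `make_ilt` is the hypothetical helper from the previous sketch. Exact call names can vary across PyMuPDF versions.

```python
# Sketch of ILT integration with PyMuPDF (illustrative). Each image is replaced
# in a copy of the document by its ILT, placed inside the original bounding box
# so text-image ordering is preserved; the original PDF is kept intact.
import fitz  # PyMuPDF

def embed_ilts(src_path: str, out_path: str, image_store_dir: str) -> None:
    doc = fitz.open(src_path)
    for page in doc:
        placements = []
        for img in page.get_images(full=True):
            xref = img[0]
            info = doc.extract_image(xref)                  # raw bytes + extension
            ilt, filename = make_ilt(info["image"], info["ext"])
            with open(f"{image_store_dir}/{filename}", "wb") as fh:
                fh.write(info["image"])                     # store the image externally
            for rect in page.get_image_rects(xref):         # image bounding box(es)
                page.add_redact_annot(rect)                 # mark image region for removal
                placements.append((rect, ilt))
        if placements:
            page.apply_redactions()                         # strip the image pixels
            for rect, ilt in placements:
                page.insert_textbox(rect, ilt, fontsize=8)  # write the ILT in its place
    doc.save(out_path)                                      # modified document for ingestion
```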
3) LLM Prompting with Chain of Thought: Our Chain of Thought (CoT) prompt tuning technique refines the document retrieval process by creating targeted prompts that guide the LLM to consider Image Localization Tags (ILTs) during its response generation. This ensures that the LLM's output maintains fidelity to the document's layout and the images' contextual relevance. When the LLM retrieves contexts containing ILTs, these prompts are crucial for preserving the original structure and meaning of the document.
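The template below is an illustrative sketch of such a CoT-style instruction, not the verbatim prompt used in our system; it simply shows the kind of step-by-step guidance that asks the model to carry any ILTs through to its answer unchanged.

```python
# Illustrative CoT-style prompt template (not the verbatim production prompt).
ILT_COT_PROMPT = """You are answering a question from the document context below.
The context may contain Image Localization Tags of the form <image: filename(...)>.

Think step by step:
1. Identify the passages that answer the question.
2. Note every ILT that appears inside or directly around those passages.
3. Write the answer, keeping each relevant ILT in exactly the position where
   its image appeared in the original text. Do not invent or alter ILTs.

Context:
{context}

Question:
{question}
"""
```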
Following the LLM's output, which includes the pertinent ILTs, we engage in a post-processing step. This involves identifying the ILTs in the LLM response, extracting the associated image data, and then substituting the ILTs with the actual images stored externally. The result is a comprehensive response that accurately reflects the placement and relevance of images as per the original document structure. Fig. 6 shows the overall workflow of the QnA pipeline, from taking in the user's query to producing the final response with the relevant image references from the LLM response. Fig. 8 shows a detailed workflow of the QnA pipeline using the Image Localization Tag approach.
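This post-processing step can be sketched with a simple pattern match; the regex below assumes the <image: filename(...)> tag format shown earlier, and `load_image_from_store` is a hypothetical stand-in for the blob-storage accessor.

```python
# Sketch of ILT post-processing (illustrative): find ILTs in the LLM answer and
# swap each one for the externally stored image it points to.
import re

ILT_PATTERN = re.compile(r"<image:\s*filename\(([^)]+)\)>")

def resolve_ilts(llm_answer: str, load_image_from_store) -> list:
    """Return the answer as an ordered list of text segments and image payloads."""
    parts, last = [], 0
    for match in ILT_PATTERN.finditer(llm_answer):
        parts.append(("text", llm_answer[last:match.start()]))
        # Fetch the stored image by its hash-ID filename, e.g. 23523473.png.
        parts.append(("image", load_image_from_store(match.group(1))))
        last = match.end()
    parts.append(("text", llm_answer[last:]))
    return parts
```

Returning an ordered list of text and image segments keeps the images interleaved at exactly the positions the LLM preserved, so the rendered answer mirrors the original document layout.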
Our methodology, which emphasizes the computation and integration of bounding box regions, not only optimizes image retrieval within documents but also guarantees contextually rich and precise responses from the LLM. Despite occasional shortcomings when images are located far from their relevant context, our approach excels at accurately retrieving images that are accompanied by textual figure descriptions. As a result, this method significantly enhances the system's performance and lays the groundwork for RAG-powered question-answering systems to generate more coherent and contextually aligned multi-modal responses.
4) Overcoming the Bottlenecks of the Traditional OCR-LLM Approach:
• Cost Efficient – saves the cost of up to 1M standalone captioning calls to the LLM.
• Highly Accurate Image Retrieval – can retrieve both OCR-compatible and OCR-incompatible images, including those of highly customized scientific devices and diagrams.
Fig. 7 shows an example of a case where the retrieved images are OCR-incompatible.

Fig. 8. Image Localization Tag QnA Pipeline Workflow Demo.
• Research Papers: 10

While training developers is one example of how this dataset might be used, its design is flexible enough to train any employee within an organization. By including an organization's private SOPs, our dataset can help streamline the training and on-boarding process, making it easier for new hires to get up to speed quickly. Our dataset has been put together for research purposes and is available to interested parties upon request. We have taken care to ensure that it adheres to all legal and ethical guidelines around the sharing and use of copyrighted content.

B. Experiments

In our preliminary experiments, we employed the OCR-LLM methodology on documents containing OCR-compatible images, primarily consisting of user manuals and programming guides. To assess the results, we utilized a collection of 20-30 queries. The validation of the textual responses was facilitated by RAGAS, a library specifically designed for evaluating RAG responses. However, since there is no existing library for evaluating image responses within RAG pipelines, we conducted manual evaluations of the image responses with the assistance of Subject Matter Experts (SMEs). The textual response outcomes displayed consistency across all queries. Given that the OCR-LLM approach is constrained to providing only a single image response per query, the majority of the image responses were accurate. Nevertheless, for queries lacking a relevant image response, the pipeline still tended to produce at least one image, even if it was unrelated to the query. This occurrence can be attributed to the static k value employed in the Langchain similarity search.
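For reference, evaluating the textual answers with RAGAS looks roughly like the sketch below; the metric names and dataset columns follow the ragas library's documented usage at the time of our experiments and may differ in newer releases, and the sample answer and context strings are purely illustrative.

```python
# Rough sketch of RAGAS-based evaluation of the textual responses
# (metric names / dataset columns follow older ragas releases and may vary).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_ds = Dataset.from_dict({
    "question": ["How to switch role on AWS?"],              # sample query from our set
    "answer":   ["<generated textual answer for the query>"],  # illustrative placeholder
    "contexts": [["<retrieved text chunks for the query>"]],   # illustrative placeholder
})

scores = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric aggregate scores for the textual responses
```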
Subsequently, we replaced the OCR-LLM component of our pipeline with GPT-4 Vision to generate image captions. As GPT-4 Vision is proficient in interpreting various types of images and has demonstrated excellent performance in explaining images containing text, we opted to use it for caption generation of the images within the documents. GPT-4 Vision performed exceptionally well for images with text, including graphs, diagrams, screenshots, and other OCR-compatible images. However, it occasionally struggled to produce satisfactory captions for domain-specific images, such as scientific instruments and intricate tools. Table I shows a detailed comparison between captions generated by GPT-4 Vision and the captions taken from the document. For generic image contents, GPT-4 Vision was able to generate accurate captions, but it failed when the image contents were domain- or product-specific, which is our focus of improvement.
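Swapping the OCR step for a vision model amounts to sending the image itself to the captioning call. The sketch below uses an OpenAI-style chat completion with an inline image; it is illustrative only — our experiments used an Azure OpenAI deployment, whose client setup and deployment/model names differ, and the model name shown here is an assumption.

```python
# Illustrative GPT-4 Vision captioning call (OpenAI-style API; an Azure OpenAI
# deployment would use the same message format with its own deployment name).
import base64
from openai import OpenAI

client = OpenAI()

def caption_with_vision(image_path: str) -> str:
    with open(image_path, "rb") as fh:
        b64 = base64.b64encode(fh.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",   # model name is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Give a short 2-6 word caption for this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=30,
    )
    return resp.choices[0].message.content.strip()
```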
TABLE I. Comparison of image captions generated by GPT-4 Vision with the original captions from the document.

Image   | Caption from Document                                   | GPT-4 Vision Generated Caption
(image) | NI USB Data Acquisition System                          | National Instruments data acquisition device
(image) | Model 9700 Temperature Controller rear panel connections | Schematic of electronic component testing setup
(image) | SCM10 rear panel connector pins                         | Industrial device with various connectivity ports
(image) | Diagram of child graph                                  | Abstract diagram of interconnected nodes and cycles
(image) | Complex protein structure                               | Colorful abstract 3D knot structure illustration

For our final approach, we maintained the same set of queries and continued to use RAGAS for evaluating textual responses, while manual evaluation was conducted for the image responses. This method resulted in improved accuracy of the image responses. With the help of Image Localization Tags and Chain of Thought prompting, we were able to achieve accurate image responses, surpassing the performance of GPT-4 Vision. Although our approach performs well in retrieving complex non-OCR images, it must be noted that it is dependent on the position of an image with respect to its relevant textual information. For images that are further away from their textual context, our approach works well if the images have a description (figure information) that is semantically aligned with the parent textual context.
V. RESULTS

Based on the comparison between the OCR-LLM and Image Localization Tag approaches across various document types, it is evident from Table II that the Image Localization Tag approach consistently outperforms the OCR-LLM approach. Across research papers, manuals, programming documentation, and guides/surveys, the Image Localization Tag approach consistently achieves higher accuracy. Specifically, it scores 91% for research papers, 94% for programming guides, and a commendable 95% for manuals and guides/surveys, whereas the OCR-LLM approach's scores range from 60% to 70%. This suggests that the Image Localization Tag approach offers superior performance in accurately localizing and extracting information from documents across various domains, making it a more effective choice than the OCR-LLM approach.

TABLE II. Performance comparison between the OCR-LLM and Image Localization Tag approaches.

Document Type      | SME Evaluation (OCR-LLM) | SME Evaluation (ILT)
Research Papers    | 65%                      | 91%
Manuals            | 60%                      | 95%
Programming Guides | 70%                      | 94%
Guides / Surveys   | 65%                      | 95%

VI. SYSTEM PERFORMANCE AND USABILITY ANALYSIS

The research leverages Langchain primarily for text generation within the system. While experiments have been conducted using a proprietary deployment of Azure OpenAI, the architecture is not limited to this; it is compatible with various open-source large language models (LLMs) as long as Langchain is employed. The system's complexity is concentrated in the document ingestion phase, wherein PDFs undergo processing. Here, a modified version of each document is created, and images are stored in an external repository. The computational load is directly proportional to the document's length and image content, with more extensive documents increasing system latency. Nonetheless, since the ingestion and querying components operate independently, user interactions, which are limited to the querying interface, remain largely unaffected by the ingestion process's computational demands.

From the perspective of user experience, the system facilitates user interaction exclusively through the querying interface, without involving them in document uploading. Unlike existing Retrieval-Augmented Generation (RAG) systems that provide only textual responses, our image-based RAG system incorporates visual elements into its responses. This is particularly beneficial for documents where images are integral to comprehension, such as manuals, guides, and tutorials. For example, in a medical instrument manual where each step is accompanied by critical images, traditional RAG systems would only generate text-based instructions, which can be less user-friendly, as users may need to revisit the document. Our system enhances user comprehension by sequentially presenting the relevant images alongside each step, thereby reducing or eliminating the need to consult the original document.

these multifaceted diagrams, an area we plan to enhance in subsequent phases of our research. Our current solution, as discussed, fails to fetch images that are farther from their relevant context if they have no relevant description alongside them. We aim to improve this in further advancements of our research.

VIII. CONCLUSION

Our research introduces a new image retrieval method for RAG systems that overcomes the challenges of traditional OCR-LLM approaches, improving multi-modal document comprehension and accurately retrieving images, which is essential for technical documents and research papers. Through thorough testing, we have shown that our ILT-based approach enhances text-image correlation and retains document structure, producing contextually relevant RAG system responses. Our method outshines current techniques, especially with intricate scientific imagery and non-text-aligned visuals. This work advances image retrieval for multi-modal documents, offering a more effective and economical solution, and paves the way for future enhancements in RAG systems, including better handling of complex images and expanded ILT metadata use.

ACKNOWLEDGMENT

This project was supported by Genpact India Pvt. Ltd.

REFERENCES

[1] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459-9474, 2020.
[2] M. Zhou, G. Luo, A. Rohrbach, and Z. Yu. Focus! Relevant and sufficient context selection for news image captioning. arXiv preprint arXiv:2212.00843, 2022.
[3] Y. Zhou and G. Long. Style-aware contrastive learning for multi-style image captioning. arXiv preprint arXiv:2301.11367, 2023.
[4] Z. Yang, W. Ping, Z. Liu, V. Korthikanti, W. Nie, D.-A. Huang, L. Fan, Z. Yu, S. Lan, B. Li, et al. Re-ViLM: Retrieval-augmented visual language model for zero and few-shot image captioning. arXiv preprint arXiv:2302.04858, 2023.
[5] Z. Shi, H. Liu, M. R. Min, C. Malon, L. E. Li, and X. Zhu. Retrieval, analogy, and composition: A framework for compositional generalization in image captioning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1990-2000, 2021.
[6] S. Sarto, M. Cornia, L. Baraldi, and R. Cucchiara. Retrieval-augmented transformer for image captioning. In Proceedings of the 19th International Conference on Content-Based Multimedia Indexing, pages 1-7, 2022.
[7] R. Ramos, D. Elliott, and B. Martins. Retrieval-augmented image captioning. arXiv preprint arXiv:2302.08268, 2023.
[8] W. Chen, H. Hu, X. Chen, P. Verga, and W. W. Cohen. MuRAG: Multimodal retrieval-augmented generator for open question answering over images and text. arXiv preprint arXiv:2210.02928, 2022.
[9] A. Khatun and D. G. Brown. Reliability check: An analysis of GPT-3's response to sensitive topics and prompt wording. arXiv preprint arXiv:2306.06199, 2023.
[10] https://www.kaggle.com/datasets/paultimothymooney/cvpr-2019-papers