RAG Cheat Sheet
A comprehensive visual guide to Retrieval-Augmented Generation architectures, implementation,
and best practices
What is RAG?
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances Large Language Models (LLMs) by combining them with external knowledge retrieval systems. Rather than relying solely on the model's internal parameters, RAG allows LLMs to access, retrieve, and use up-to-date information from external databases before generating responses.

When to Use RAG
✓ When you need factual accuracy beyond the LLM's training data
✓ When working with domain-specific or proprietary information
✓ When information needs to be up-to-date and verifiable
✓ When transparency and citation of sources matter
✓ When you need to reduce hallucinations in LLM outputs
Core Components:
📄 Document Processing: Converting documents into embeddings
🗄️ Vector Database: Storing embedded documents
🔍 Retriever: Finding relevant documents
🤖 Generator: Creating accurate responses

Basic RAG Flow:
1. Document Collection → Document Chunking → Embedding Generation → Vector Database Storage
2. User Query → Query Embedding → Similarity Search → Relevant Document Retrieval
3. Retrieved Documents + Original Query → LLM Generation → Response
10 RAG Architectures Compared
1. Standard RAG (Beginner)
Flow: User Query → Query Processing → Retrieval → Document Selection → Context Integration → LLM Response
When to Use: For basic question-answering systems needing external knowledge
Real-World Example: Customer support chatbots that access product documentation
Implementation Tip: Start with smaller chunk sizes (512-1024 tokens) and adjust based on performance

2. Corrective RAG (Intermediate)
Flow: User Query → Initial Response → Error Detection → Retrieval → Response Correction → Final Response
When to Use: High-precision use cases where accuracy is critical
Real-World Example: Medical information systems, legal documentation assistance
Implementation Tip: Implement a feedback loop with multiple verification passes, as in the sketch below
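The detect-and-correct loop can be written in a few lines. A minimal sketch using the same LangChain-era API as the implementation guide further down; `retriever` is assumed to be built as in that guide, and both prompts are illustrative, not a prescribed recipe:

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)

def corrective_answer(query, retriever, max_passes=2):
    # Retrieve supporting context once up front
    docs = retriever.get_relevant_documents(query)
    context = "\n".join(d.page_content for d in docs)
    answer = llm.predict(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    # Verification passes: detect unsupported claims, then correct them
    for _ in range(max_passes):
        verdict = llm.predict(
            "Check whether the answer is fully supported by the context. "
            "Reply exactly OK if it is; otherwise list the errors.\n"
            f"Context:\n{context}\n\nAnswer:\n{answer}"
        )
        if verdict.strip().startswith("OK"):
            break
        answer = llm.predict(
            f"Rewrite the answer to fix these errors:\n{verdict}\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        )
    return answer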
3. Speculative RAG (Intermediate)
Flow: User Query → Small Model Draft → Parallel Retrievals → Large Model Verification → Response
When to Use: When balancing speed and accuracy is important
Real-World Example: Real-time customer service where response time impacts satisfaction
Implementation Tip: Use a specialized domain-specific small model for draft generation

4. Fusion RAG (Intermediate)
Flow: User Query → Multiple Retrieval Methods → Results Fusion → Aggregated Context → LLM Response
When to Use: When dealing with multiple data sources of varying formats
Real-World Example: Research assistants accessing articles, patents, and databases
Implementation Tip: Weight different sources based on their reliability and relevance
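One concrete way to implement the fusion step is reciprocal rank fusion (RRF). A minimal sketch, assuming each retrieval method returns a ranked list of document IDs; the source-weighting tip above can be added by scaling each list's contribution:

def reciprocal_rank_fusion(result_lists, k=60):
    # Each list is a ranking of document IDs from one retrieval method
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Standard RRF score: 1 / (k + rank), with 1-based ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector-search ranking with a keyword (BM25) ranking
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],
    ["doc1", "doc9", "doc3"],
])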
5. Agentic RAG (Advanced)
Flow: User Query → Intent Analysis → Agent Selection → Parallel Retrievals → Strategy Coordination → Response
When to Use: Complex queries requiring multiple types of information
Real-World Example: Financial analysis tools accessing market data, company reports, and news
Implementation Tip: Design specialized agents for different query types and data sources

6. Self RAG (Intermediate)
Flow: User Query → Initial Generation → Self-Critique → Additional Retrieval → Refined Response
When to Use: For conversational systems requiring consistency
Real-World Example: Educational tutoring systems that build on previous explanations
Implementation Tip: Store conversation history as retrievable context, as sketched below
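A minimal sketch of that tip: write each finished turn back into the vector store so later queries retrieve it like any other document. `vectorstore` is assumed to be built as in the implementation guide below; the metadata tag is an illustrative assumption:

def remember_turn(vectorstore, query, answer):
    # Store the completed turn as a retrievable document
    vectorstore.add_texts(
        [f"Q: {query}\nA: {answer}"],
        metadatas=[{"type": "conversation"}],
    )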
7. Hierarchical RAG (Advanced)
Flow: User Query → Top-Level Retrieval → Sub-document Identification → Focused Retrieval → Response
When to Use: With large, structured documents or knowledge bases
Real-World Example: Enterprise search across documentation hierarchies
Implementation Tip: Create multi-level embedding indexes for efficient navigation (see the two-stage sketch after this pair)

8. Multi-modal RAG (Advanced)
Flow: User Query → Cross-modal Understanding → Multi-format Retrieval → Format Integration → Response
When to Use: When information spans text, images, audio, or video
Real-World Example: E-commerce search using both product descriptions and images
Implementation Tip: Use specialized embeddings for each modality and create bridging mechanisms
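The multi-level index tip for Hierarchical RAG can be sketched as a two-stage search. This assumes two Chroma collections built elsewhere: a `summary_store` holding one summary per document and a `chunk_store` holding chunks tagged with a `doc_id` metadata field (both names are illustrative):

def hierarchical_retrieve(query, summary_store, chunk_store):
    # Stage 1: top-level retrieval over document summaries
    top_docs = summary_store.similarity_search(query, k=2)
    doc_ids = [d.metadata["doc_id"] for d in top_docs]
    # Stage 2: focused retrieval restricted to the matched documents
    results = []
    for doc_id in doc_ids:
        results += chunk_store.similarity_search(
            query, k=2, filter={"doc_id": doc_id}
        )
    return results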
9. Adaptive RAG (Advanced)
Flow: User Query → Query Analysis → Retrieval Strategy Selection → Dynamic Parameter Adjustment → Response
When to Use: For systems facing diverse query types and user needs
Real-World Example: Academic research assistants handling various disciplines
Implementation Tip: Implement real-time feedback to tune retrieval parameters per query

10. Fine-tuned RAG (Intermediate)
Flow: User Query → Domain-Specific Processing → Specialized Retrieval → Context-Aware LLM → Response
When to Use: For specialized domains requiring expert-level responses
Real-World Example: Technical support systems for complex products
Implementation Tip: Fine-tune both embeddings and LLM on domain-specific data
Best Practices for RAG Implementation
Document Processing
1. Chunking Strategy: Balance between semantic coherence and retrieval granularity
2. Chunk Size: 256-1024 tokens depending on content complexity
3. Overlap: 10-20% chunk overlap to maintain context across chunks

Embedding Selection
1. General Purpose: OpenAI ada-002, BERT-based models
2. Specialized Domains: Consider domain-specific embedding models
3. Dimensions: Higher dimensions (768+) for complex information

Retrieval
1. Top-k Selection: Usually 3-5 chunks for typical queries
2. Re-ranking: Consider adding a re-ranking step after initial retrieval
3. Hybrid Search: Combine semantic and keyword search for better results

Prompt Engineering
1. Template: "Based on the following information: {context}, please answer: {query}"
2. Instruction: "Use only the provided information. If you don't know, say so."
3. Source Attribution: "For each point in your answer, indicate which source it came from."
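These three prompt patterns combine naturally into one reusable template. A sketch using LangChain's PromptTemplate; the "stuff" chain in the implementation guide below expects the variables {context} and {question}, so {question} stands in for {query} here, and this object doubles as the CUSTOM_PROMPT_TEMPLATE referenced in that guide:

from langchain.prompts import PromptTemplate

CUSTOM_PROMPT_TEMPLATE = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Based on the following information:\n{context}\n\n"
        "Please answer: {question}\n"
        "Use only the provided information. If you don't know, say so.\n"
        "For each point in your answer, indicate which source it came from."
    ),
)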
Common RAG Challenges & Solutions
Challenge | Solution
Hallucination | Implement fact-checking and source validation
Retrieval Latency | Use approximate nearest neighbor algorithms
Context Length Limits | Implement recursive summarization of retrieved chunks
Irrelevant Retrieval | Add filtering and pre-processing of documents
Response Consistency | Include conversation history as part of the context
Evaluating RAG Systems
Metric | Description | Target
Answer Relevance | How well the response answers the query | >85%
Factual Accuracy | Correctness of facts in the response | >95%
Retrieval Precision | Relevance of retrieved documents | >80%
Response Time | Time from query to response | <2s
Source Coverage | Using multiple relevant sources | ≥2 sources
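Retrieval precision is the most mechanical of these to spot-check. A minimal sketch, assuming a small hand-labelled set: `labelled` maps each query to the chunk IDs a human judged relevant, and each chunk carries an `id` metadata field (both are illustrative assumptions, not part of any library):

def retrieval_precision(retriever, labelled, k=4):
    # labelled: {query: set of relevant chunk IDs}
    hits = total = 0
    for query, relevant_ids in labelled.items():
        docs = retriever.get_relevant_documents(query)[:k]
        hits += sum(d.metadata.get("id") in relevant_ids for d in docs)
        total += len(docs)
    # Fraction of retrieved chunks that were judged relevant
    return hits / total if total else 0.0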
Real-World Use Cases
🏢 Enterprise Knowledge Base: Connect employees to internal documentation
🛎️ Customer Support: Provide accurate product information and troubleshooting
🔬 Research Assistant: Aid in literature review and information synthesis
📋 Compliance Monitoring: Ensure responses adhere to regulatory guidelines
🎓 Educational Tutoring: Provide accurate explanations with cited sources
Advanced RAG Optimizations
Query Reformulation
Rewrite user queries for better retrieval performance.
// Example Query Reformulation
User: "Tell me about rockets"
Reformulated: "What are rockets, their history, types, and applications in space exploration?"

Retrieval Augmentation
Enhance retrieved context with related information.
// Retrieved Context Augmentation
1. Original: "SpaceX Falcon 9 specifications"
2. Augmented: + "Rocket propulsion systems"
3. Augmented: + "Comparison with other launch vehicles"
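The reformulation step above can be delegated to the LLM itself. A short sketch reusing the LangChain-era ChatOpenAI model from the implementation guide; the rewrite prompt is an illustrative assumption:

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)

def reformulate(query: str) -> str:
    # Ask the model to expand vague queries before retrieval
    return llm.predict(
        "Rewrite this search query to be specific and self-contained, "
        f"preserving the user's intent: {query}"
    )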
Contextual Compression
Condense retrieved information to focus on relevance.
// Context Compression Pipeline
1. Retrieve k=10 documents
2. Generate summary for each document
3. Rank summaries by relevance
4. Select top n=3 summaries

Adaptive RAG
Dynamically adjust retrieval parameters based on query type.
// Adaptive Parameter Selection
if query_is_factual():
    k = 3                        # fewer, precise documents
    similarity_threshold = 0.8
elif query_is_exploratory():
    k = 7                        # more, diverse documents
    similarity_threshold = 0.6
Continuous Learning
Update vector stores and embeddings as new information arrives.
// Incremental Updating
1. Monitor for new documents
2. Process and embed new content
3. Merge into existing vector store
4. Periodically re-index for optimization
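A hedged sketch of this loop, reusing the `vectorstore` and `text_splitter` built in the implementation guide below; the ./inbox/ location is an illustrative assumption, and step 4 (re-indexing) is left to the vector store's own maintenance:

from langchain.document_loaders import DirectoryLoader

def ingest_new_documents(vectorstore, text_splitter, path="./inbox/"):
    # Steps 1-2: load and chunk newly arrived documents
    new_chunks = text_splitter.split_documents(
        DirectoryLoader(path, glob="**/*.pdf").load()
    )
    # Step 3: merge the new embeddings into the existing store
    vectorstore.add_documents(new_chunks)
    vectorstore.persist()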
Practical RAG Implementation Guide
1. Setting Up Your Document Processing Pipeline

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader('./documents/', glob="**/*.pdf")
documents = loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = text_splitter.split_documents(documents)
2. Creating and Storing Embeddings

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Persist to disk
vectorstore.persist()
3. Building the Retrieval System

from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Create base retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Add compression for better context
llm = ChatOpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)
4. Implementing the RAG Chain

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Initialize LLM
llm = ChatOpenAI(temperature=0, model="gpt-4")

# Create RAG chain (CUSTOM_PROMPT_TEMPLATE is the PromptTemplate
# defined in the Best Practices section above)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": CUSTOM_PROMPT_TEMPLATE
    }
)

# Query the system
result = qa_chain({"query": "How do rockets work?"})
print(result["result"])
Popular Vector Database Comparison
🔍 Pinecone: Fully managed service · Low latency queries · Scales to billions of vectors · High availability
🗄️ Weaviate: Open-source · GraphQL API · Classification support · Multi-tenancy support
🔮 Chroma: Open-source · Easy Python integration · Simple deployment · Good for prototyping
🪢 Milvus: Open-source · Distributed architecture · Hybrid search · High scalability
🌐 Qdrant: Open-source · Filtering capabilities · On-prem deployment · REST and gRPC APIs
Comparing RAG Performance Metrics
Architecture | Response Time | Accuracy | Memory Usage | Implementation Complexity
Standard RAG | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐
Corrective RAG | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐
Speculative RAG | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐
Fusion RAG | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐
Agentic RAG | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐⭐
Standard RAG Implementation Flowchart
Document Collection → Document Chunking → Embedding Generation → Vector Database
User Query → Query Embedding → Vector Similarity Search → Retrieve Documents
Format Context → Generate Prompt → LLM Processing → Final Response
Popular RAG Tools and Frameworks
LangChain (⭐⭐⭐⭐⭐): Framework for connecting LLMs with external data sources
LlamaIndex (⭐⭐⭐⭐): Data framework for augmenting LLMs with private data
Haystack (⭐⭐⭐⭐): End-to-end framework for building NLP pipelines
Semantic Kernel (⭐⭐⭐): Microsoft's SDK for integrating LLMs with code
txtai (⭐⭐⭐): All-in-one embeddings database with search capabilities
The Ultimate RAG Visual Cheat Sheet | Updated April 2025