
NVIDIA RAG Blueprint

Retrieval-Augmented Generation (RAG) combines the reasoning power of large language models (LLMs) with real-time retrieval from trusted data sources. It grounds AI responses in enterprise knowledge, reducing hallucinations and ensuring accuracy, compliance, and freshness.

Overview

The NVIDIA RAG Blueprint is a reference solution and foundational starting point for building Retrieval-Augmented Generation (RAG) pipelines with NVIDIA NIM microservices. It enables enterprises to deliver natural language question answering grounded in their own data while meeting governance, latency, and scalability requirements. Designed to be decomposable and configurable, the blueprint integrates GPU-accelerated components with NeMo Retriever models, multimodal and vision language models, and guardrailing services to provide an enterprise-ready framework. With a pre-built reference UI, open-source code, and multiple deployment options, including local Docker (with or without NVIDIA-hosted endpoints) and Kubernetes, it serves as a flexible starting point that developers can adapt and extend to their specific needs.

Key Features

Data Ingestion
Search and Retrieval
  • Multi-collection searchability
  • Hybrid search combining dense and sparse retrieval
  • Reranking to further improve accuracy
  • GPU-accelerated index creation and search
  • Pluggable vector database
Query Processing
  • Query decomposition
  • Dynamic filter expression creation
Generation and Enrichment
  • Opt-in multimodal and vision language model support in the answer generation pipeline
  • Document summarization
  • Optional reflection to improve accuracy
  • Optional programmable guardrails for content safety
Evaluation
  • Evaluation scripts (RAGAS framework)
User Experience
  • Sample user interface
  • Multi-turn conversations
  • Multi-session support
Deployment and Operations
  • Telemetry and observability
  • Decomposable and customizable
  • NIM Operator support
  • Python library mode support
  • OpenAI-compatible APIs
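Because the blueprint exposes OpenAI-compatible APIs, the RAG server can be driven with a standard chat-completions payload. The sketch below only builds such a payload; the base schema is the OpenAI one, the model name is taken from this document's license section, and the `collection_names` field is a hypothetical blueprint-specific extension used here for illustration. Check the blueprint's API reference for the actual request shape.

```python
# Minimal sketch of an OpenAI-style chat-completions request body for the
# RAG server. The "collection_names" field is an ASSUMED extension for
# selecting a document collection, not part of the standard OpenAI schema.
import json

def build_rag_request(question: str, collection: str) -> dict:
    """Build a chat-completions payload targeting one document collection."""
    return {
        "model": "llama-3.3-nemotron-super-49b-v1.5",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,
        # Hypothetical blueprint-specific field (illustrative only):
        "collection_names": [collection],
    }

payload = build_rag_request("What is our refund policy?", "hr_docs")
print(json.dumps(payload, indent=2))
```

In a real deployment this dict would be POSTed to the RAG server's chat-completions endpoint; any OpenAI-compatible client library can be pointed at that endpoint instead of hand-building JSON.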

Software Components

The RAG blueprint is built from the following complementary categories of software:

  • NVIDIA NIM microservices – Deliver the core AI functionality: large-scale inference (for example, Nemotron LLM models for response generation), retrieval and reranking models, and specialized extractors for text, tables, charts, and graphics. Optional NIMs extend these capabilities with OCR, content safety, topic control, and multimodal embeddings.

  • The integration and orchestration layer – Acts as the glue that binds the system into a complete solution.

This modular design ensures efficient query processing, accurate retrieval of information, and easy customization.

NVIDIA NIM Microservices

Integration and orchestration layer

  • RAG Orchestrator Server – A LangChain-based service that coordinates interactions between the user, retrievers, vector database, and inference models, ensuring multi-turn and context-aware query handling.

  • Vector Database (accelerated with NVIDIA cuVS) – Stores and searches embeddings at scale with GPU-accelerated indexing and retrieval for low-latency performance. You can use Milvus Vector Database or Elasticsearch.

  • NeMo Retriever Extraction – A high-performance ingestion microservice for parsing multimodal content. For more information about the ingestion pipeline, see the NeMo Retriever Extraction Overview.

  • RAG User Interface (rag-frontend) – A lightweight user interface that demonstrates end-to-end query, retrieval, and response workflows for developers and end users. For more information, refer to RAG UI.
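The hybrid search that the vector database performs at scale can be illustrated with a toy in-memory version: a dense similarity score fused with a sparse keyword score. Everything below is a deliberate stand-in, assuming nothing about the real services: the bag-of-letters "embedding" replaces the NeMo Retriever embedding model, the keyword overlap replaces BM25, and Milvus or Elasticsearch would do the actual indexing and fusion.

```python
# Toy hybrid retrieval: dense cosine similarity fused with a sparse keyword
# score. The embed() function is a bag-of-letters stand-in, NOT a real
# embedding model; sparse_score() is a crude BM25 stand-in.
import math
from collections import Counter

def embed(text):
    counts = Counter(text.lower())
    return [counts.get(c, 0) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def sparse_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def hybrid_search(query, docs, alpha=0.5):
    """Rank docs by a weighted blend of dense and sparse scores."""
    qv = embed(query)
    scored = [(alpha * cosine(qv, embed(d)) + (1 - alpha) * sparse_score(query, d), d)
              for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

docs = ["GPU accelerated vector search", "Quarterly revenue report",
        "Vector database index tuning"]
results = hybrid_search("vector search", docs)
print(results[0])
```

The `alpha` blend is the simplest possible fusion rule; production systems typically use reciprocal rank fusion or a learned reranker instead.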

Technical Diagram

The following image represents the architecture and workflow.

Workflow

The following is a step-by-step explanation of the workflow from the end-user perspective:

  1. Data Ingestion & Extraction Pipeline – Multimodal enterprise documents (text, images, tables, charts, infographics, and audio) are ingested, and their content is extracted into a searchable form.

  2. User Query – The user interacts with the system through the UI or APIs, submitting a question. An optional NeMo Guardrails module can filter or reshape the query for safety and compliance before it enters the retrieval pipeline.

  3. Query Processing – The query is processed by the Query Processing service, which may also leverage reflection (an optional LLM step) to improve query understanding or reformulation for better retrieval results.

  4. Retrieval from Enterprise Data – The processed query is converted into embeddings using NeMo Retriever Embedding and matched against enterprise data stored in a cuVS-accelerated vector database and an associated object store (MinIO). Relevant results are identified based on similarity.

  5. Reranking for Precision – An optional NeMo Retriever Reranker reorders the retrieved passages, ensuring the most relevant chunks are selected to ground the response.

  6. Response Generation – The selected context is passed into the LLM inference service (e.g., Llama Nemotron models). An optional reflection step can further validate or refine the answer against the retrieved context. Guardrails may also be applied to enforce safety before delivery.

  7. User Response – The generated, grounded response is sent back to the user interface, often with citations to retrieved documents for transparency.
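The steps above can be sketched as plain control flow, with every model call replaced by a trivial stand-in. None of these functions reflect the real service APIs; a real deployment would call NeMo Guardrails, NeMo Retriever embedding, a reranker NIM, and an LLM NIM at the corresponding points.

```python
# Runnable sketch of the workflow: guardrail -> retrieve -> rerank -> generate.
# Every stage is a toy stand-in so the control flow can execute end to end.

BLOCKED_TOPICS = {"password"}  # stand-in guardrail policy

def guardrail(query):                      # step 2: optional safety filter
    return not any(t in query.lower() for t in BLOCKED_TOPICS)

def retrieve(query, corpus, k=2):          # step 4: similarity stand-in
    overlap = lambda d: len(set(query.split()) & set(d.split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def rerank(query, passages):               # step 5: reranker stand-in
    first_term = query.split()[0]
    return sorted(passages, key=lambda p: p.count(first_term), reverse=True)

def generate(query, context):              # step 6: LLM stand-in
    return f"Answer to '{query}' grounded in: {context[0]}"

def rag_pipeline(query, corpus):
    if not guardrail(query):               # steps 2-3: filter, then process
        return "Query blocked by guardrails."
    passages = rerank(query, retrieve(query, corpus))
    return generate(query, passages)       # step 7: grounded response

corpus = ["RAG grounds answers in retrieved documents",
          "GPUs accelerate index creation",
          "Llamas are domesticated camelids"]
print(rag_pipeline("RAG grounding", corpus))
```

The optional reflection step (steps 3 and 6) would wrap `rag_pipeline` in a second LLM pass that critiques and retries the retrieval or the answer; it is omitted here to keep the skeleton minimal.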

Get Started With NVIDIA RAG Blueprint

The recommended way to get started is to deploy the NVIDIA RAG Blueprint with Docker Compose for a single-node deployment, using self-hosted, on-premises models. For details, refer to Get Started.

Refer to the full documentation to learn about the following:

  • Minimum Requirements
  • Deployment Options
  • Configuration Settings
  • Common Customizations
  • Available Notebooks
  • Troubleshooting
  • Additional Resources

Inviting the Community to Contribute

We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback. We invite contributions! To open a GitHub issue or pull request, see the contributing guidelines.

License

This NVIDIA AI BLUEPRINT is licensed under the Apache License, Version 2.0. This project will download and install additional third-party open source software projects and containers. Review the license terms of these open source projects before use.

Use of the models in this blueprint is governed by the NVIDIA AI Foundation Models Community License.

Terms of Use

This blueprint is governed by the NVIDIA Agreements | Enterprise Software | NVIDIA Software License Agreement and the NVIDIA Agreements | Enterprise Software | Product Specific Terms for AI Product. The models are governed by the NVIDIA Agreements | Enterprise Software | NVIDIA Community Model License, and the NVIDIA RAG dataset is governed by the NVIDIA Asset License Agreement. The following models, which are built with Llama, are governed by the Llama 3.2 Community License Agreement: nvidia/llama-3.2-nv-embedqa-1b-v2, nvidia/llama-3.2-nv-rerankqa-1b-v2, and llama-3.2-nemoretriever-1b-vlm-embed-v1.

Additional Information

The Llama 3.1 Community License Agreement applies to the llama-3.1-nemotron-nano-vl-8b-v1, llama-3.1-nemoguard-8b-content-safety, and llama-3.1-nemoguard-8b-topic-control models. The Llama 3.2 Community License Agreement applies to the nvidia/llama-3.2-nv-embedqa-1b-v2, nvidia/llama-3.2-nv-rerankqa-1b-v2, and llama-3.2-nemoretriever-1b-vlm-embed-v1 models. The Llama 3.3 Community License Agreement applies to the llama-3.3-nemotron-super-49b-v1.5 model. Built with Llama. Apache 2.0 applies to NVIDIA Ingest and to the nemoretriever-page-elements-v2, nemoretriever-table-structure-v1, nemoretriever-graphic-elements-v1, paddleocr, and nemoretriever-ocr-v1 models.
