Vector Databases Cookbook
Pavan Kumar M K
First Edition
Foreword
"Explore the ins and outs of Vector Databases in this insightful book. Unlike others, it goes
beyond product talk, offering a deep dive into the fundamentals. Discover the unique
contribution of Chroma DB, with practical use cases woven seamlessly into the narrative.
It's a natural, hands-on approach to understanding the core of Vector DBs and their role in
the ever-evolving data landscape."
Sashank Pappu
CEO, Antz.ai
Preface
In the ever-evolving landscape of data management and information retrieval, the
advent of vector databases has ushered in a paradigm shift, offering unparalleled
efficiency in handling and querying high-dimensional data. This book, "The Essentials of
Vector Databases," focuses on the core principles and finer details surrounding
the concept of vector databases, providing a comprehensive journey from foundational
concepts to advanced applications.
Designed for readers with a foundational understanding of machine learning and
deep learning, this book embarks on a systematic exploration of the principles that
underpin vector databases. From the fundamental building blocks to sophisticated
techniques, each chapter aims to demystify the complexities associated with vector
databases, empowering readers to harness the full potential of these cutting-edge
technologies.
As we navigate through the pages, we will unravel the essentials of creating,
managing, and querying vector databases. The content is crafted to cater to a spectrum of
readers, from those seeking a solid grounding in the basics to those eager to explore the
frontiers of advanced concepts in the realm of vector databases. Practical examples and
real-world applications serve as guiding beacons, illustrating how these concepts manifest
in diverse scenarios.
It is presumed that readers come armed with a foundational knowledge of machine
learning and deep learning, enhancing their ability to appreciate and absorb the nuances
presented throughout the book. By assuming this foundational knowledge, we aim to
provide a more immersive and insightful learning experience, allowing readers to bridge
the gap between theoretical understanding and practical implementation.
Embark on this enriching journey, where the convergence of machine learning and vector
databases unfolds before your eyes. May this book be your guide, equipping you with the
knowledge and skills to navigate the intricate landscapes of vector databases, from the
rudiments to the cutting-edge, in pursuit of optimal data management and retrieval
solutions.
Dedicated to my dad and to my best friend, Arun sir
I would like to express my heartfelt gratitude to deeplearning.ai and the numerous insightful
blog articles that have been a guiding light throughout the creation of this book. The
concepts presented are deeply rooted in the knowledge and inspiration drawn from these
invaluable resources. My sincere thanks to the brilliant minds behind deeplearning.ai and
the authors of the insightful blog articles, whose contributions have played a pivotal role in
shaping the content and direction of this book. This journey of writing would not have been
possible without their unwavering commitment to advancing the field of deep learning. I
extend my appreciation to all who have shared their expertise, making this endeavour a
meaningful and successful one.
Table Of Contents
Chapter 1: Introduction
Chapter 2: Real Time Use Cases of Vector Databases
Chapter 3: How do we get embeddings?
Chapter 4: Measuring Distance between vector Embeddings
1. Euclidean Distance (L2):
2. Manhattan Distance (L1):
3. Dot Product:
4. Cosine Distance:
Chapter 5: Brute force Distance Measure using KNeighbours algorithm
Output showing the spread of vectors in two dimensional space
Chapter 6: What Are Vector Stores?
1. Installation of Chroma DB:
Chapter 7: Implementing our first Vector Search
1. More on Querying with and without filters
Vector Databases support CRUD operations
Chapter 8: Going From CRUD to Semantic Search
The Future and Beyond
Final Chapter: Conclusion
Chapter 1: Introduction
In the vast landscape of data, a new type of database has emerged, surrounded by
intrigue. These databases, called vector databases, promise quick data retrieval and clever
similarity detection. However, for those unfamiliar, exploring this realm might seem like
navigating a complex maze blindfolded.
Traditional databases provide a sense of familiarity with their organized tables and rows. Yet,
when dealing with complex data like images, text, and user preferences, these structures fall
short. Here enters the vector database, specifically designed for the intricate nature of such
high-dimensional data.
Picture each data point as a constellation, its essence captured in the angles and distances
between various attributes. Vector databases grasp this celestial language, storing data
points as vectors: mathematical entities encoding the essence of each "star."
The true marvel lies not just in storage but in retrieval. Unlike traditional databases
struggling with similarity nuances, vector databases possess a nearly magical ability to
recognize patterns and connections. They unveil hidden relationships between seemingly
unrelated data points, revealing insights that might elude the keenest human eye.
Imagine having a million unique photographs. A traditional database might let you search by
tags, but finding all images of, for instance, a sunrise over a calm ocean could be challenging.
A vector database, on the other hand, effortlessly pinpoints these hidden gems, guided by
the subtle dance of vectors.
This newfound power unlocks numerous possibilities, from personalized recommendations
to effective fraud detection and groundbreaking scientific discoveries. However, before we
delve into this exciting world, an essential step awaits. We must remove the blindfold and
gain a clear understanding of these intriguing entities, unveiling core concepts, intended
purpose, and the essence of what makes a vector database work.
So, dear reader, get ready for an enlightening journey. Let's embark on this quest together,
and by the end, the once-mysterious world of vector databases will be an open book, ready
to be explored and harnessed for the greater good.
Chapter 2: Real Time Use Cases of Vector Databases
Automotive Industry:
• Multi-modal search can aid in identifying automotive parts. Users can capture images
of components, and the system can retrieve relevant information and documentation
from a vector database, facilitating repairs and maintenance.
Fashion and Design:
• In the fashion industry, users can take pictures of clothing or design elements, and
multi-modal search combined with vector databases can assist in finding similar
styles, patterns, or fabrics, enhancing the creative and shopping processes.
Personal Finance Management:
• Users can capture images of receipts, invoices, or financial documents using
multi-modal search. Vector databases can store and organize this data, allowing individuals
to track expenses, manage budgets, and retrieve financial insights efficiently.
Drug Discovery:
• Researchers can employ multi-modal search to analyze chemical structures and
biological images related to drug discovery. Vector databases can store information
about compounds, their properties, and potential applications in medicine.
Medical Research Literature:
• Researchers can use multi-modal search to explore medical research literature.
Images of research papers, figures, or charts can be submitted, and vector databases
can store and organize scientific knowledge for comprehensive literature reviews.
Chapter 3: How do we get embeddings?
Autoencoders (AEs) and Variational Autoencoders (VAEs) are types of neural networks used
in unsupervised learning for dimensionality reduction and generative tasks. Autoencoders
consist of an encoder and a decoder. The encoder compresses the input data into a
lower-dimensional representation, known as the latent space, while the decoder reconstructs the
input from this compressed representation. The network is trained to minimize the
difference between the input and the reconstructed output, forcing the encoder to learn a
meaningful representation of the data.
Variational Autoencoders introduce a probabilistic element to the latent space. Instead of
mapping inputs to a fixed point in the latent space, VAEs map inputs to probability
distributions over the latent space. This probabilistic approach allows VAEs to generate
diverse and meaningful outputs during the decoding process. The significance of the latent
space lies in its ability to capture essential features and patterns of the input data in a
compact and continuous manner. It serves as a compressed, continuous representation that
can be manipulated for tasks such as data generation, interpolation, and exploration. The
latent space in both Autoencoders and Variational Autoencoders plays a crucial role in
disentangling and capturing the underlying structure of the input data, enabling more
effective and versatile representation learning.
The latent space in autoencoders, including variational autoencoders (VAEs), is indeed a
vector representation of the input data. This representation is a compressed and abstract
encoding of the essential features present in the input. Each point in the latent space
corresponds to a different encoding of the input data.
Let us understand more about these embeddings by looking at a practical example.
Code Figure 1
1. Import Libraries: The code shows the modules to be imported and uses
SentenceTransformer for working with pre-trained models and List for type hinting.
2. Class Definition: Define a class TextualEmbeddings that initializes an instance of the
SentenceTransformer model. The model is specified by the model_name_or_path
parameter, and the default is set to 'paraphrase-MiniLM-L6-v2'.
3. Encode Method: Define a method encode that takes a list of sentences (data) as
input and returns the corresponding embeddings using the encode method of the
pre-trained model.
4. Main Block: Specify a list of sentences that you want to encode.
5. Instantiation of the Class: Create an instance of the TextualEmbeddings class and
use it to encode the specified sentences.
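A minimal sketch consistent with the five steps above, assuming the sentence-transformers package is installed. The optional `model` argument and the `demo` function are hypothetical additions for illustration: they let the class be exercised with any object that exposes an `encode` method, without downloading a pre-trained model.

```python
from typing import Any, List, Optional


class TextualEmbeddings:
    """Thin wrapper around a pre-trained sentence-embedding model."""

    def __init__(
        self,
        model_name_or_path: str = "paraphrase-MiniLM-L6-v2",
        model: Optional[Any] = None,  # hypothetical hook, not in the description above
    ) -> None:
        if model is None:
            # Imported lazily so the class can be defined without the package installed.
            from sentence_transformers import SentenceTransformer

            model = SentenceTransformer(model_name_or_path)
        self.model = model

    def encode(self, data: List[str]):
        """Return one embedding per input sentence."""
        return self.model.encode(data)


def demo() -> None:
    """Encode two sentences with the default model (downloaded on first use)."""
    sentences = [
        "Vector databases store embeddings.",
        "Embeddings capture the meaning of text.",
    ]
    embeddings = TextualEmbeddings().encode(sentences)
    print(len(embeddings), "embeddings of dimension", len(embeddings[0]))
```

Calling `demo()` needs network access the first time, because the pre-trained weights are fetched before encoding.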
Chapter 4: Measuring Distance between vector Embeddings
1. Euclidean Distance (L2):
Euclidean distance is a fundamental metric used to measure the straight-line
distance between two points in a space. In the context of vector databases, it serves as a
distance metric between vectors. The Euclidean distance (L2 norm) between two vectors, A
and B, is calculated as the square root of the sum of squared differences between their
corresponding elements. Mathematically, this is expressed as
$$d_{L2}(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$
Euclidean distance is sensitive to magnitude and direction, making it suitable for scenarios
where both magnitude and orientation matter.
2. Manhattan Distance (L1):
Manhattan distance, also known as L1 norm or taxicab distance, measures the sum
of absolute differences between corresponding elements of two vectors. In the context of
vector databases, Manhattan distance is calculated as the sum of the absolute differences
between the coordinates of two vectors. Mathematically, it is expressed as
$$d_{L1}(A, B) = \sum_{i=1}^{n} |A_i - B_i|$$
Unlike Euclidean distance, Manhattan distance is less influenced by outliers and is often
preferred when the impact of extreme values should be minimized.
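The two definitions above can be sketched in plain Python. This is an illustrative implementation using only the standard library; the function names and sample vectors are assumptions made for the example.

```python
import math
from typing import Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of elements")


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L2: square root of the sum of squared element-wise differences."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1: sum of absolute element-wise differences."""
    _check_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0
```

The element-wise differences are 3, 4, and 0. Squaring them before summing (Euclidean) gives large differences more weight than adding them directly (Manhattan).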
3. Dot Product:
The dot product is a mathematical operation that quantifies the similarity between
two vectors. It equals the product of the two vectors' magnitudes and the cosine of the
angle between them, so it measures that cosine directly only when both vectors are
normalized to unit length. If the vectors are orthogonal, the dot product is zero; if they
point in the same direction, the dot product is positive, and if they point in opposite
directions, the dot product is negative. Mathematically, the dot product of vectors A and B is
given by
$$A \cdot B = \sum_{i=1}^{n} A_i B_i = \|A\| \, \|B\| \cos\theta$$
The dot product is valuable for measuring the alignment of vectors and is often used in tasks
such as similarity and relevance scoring.
4. Cosine Distance:
Cosine distance is derived from cosine similarity, a measure of similarity between two
vectors based on the cosine of the angle between them. In the context of vector databases,
it is often used to assess the similarity of vectors regardless of their magnitude. It is
particularly useful in scenarios where the magnitude of vectors is not a significant factor,
such as text data. Cosine similarity is calculated as the cosine of the angle between two
vectors A and B, represented as
$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Cosine similarity produces a value between -1 and 1, where 1 indicates complete
similarity, 0 indicates orthogonality, and -1 indicates complete dissimilarity. Cosine distance
is defined as 1 minus the cosine similarity, so it ranges from 0 (same direction) to 2
(opposite direction). Cosine distance is widely employed in information retrieval and
recommendation systems for assessing document or item similarity.
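A short standard-library sketch of the dot product, cosine similarity, and cosine distance. The function names and sample vectors are illustrative assumptions; the example shows that the dot product grows with magnitude while cosine similarity does not.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 minus cosine similarity: 0 for same direction, 2 for opposite."""
    return 1.0 - cosine_similarity(a, b)


short = [3.0, 4.0]
longer = [6.0, 8.0]      # same direction as `short`, twice the magnitude
sideways = [-4.0, 3.0]   # orthogonal to `short`

print(dot_product(short, longer))        # 50.0
print(cosine_similarity(short, longer))  # 1.0
print(cosine_distance(short, sideways))  # 1.0
```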
Let us see the above concepts with some examples. Here we combine the earlier
code, which produces the embeddings for sentences, with a utility class that we are going
to design now, so that we get the distance between the embeddings.
The code below has two methods, numbered 1 and 2 respectively. Method 1 is the
constructor (__init__), which initializes the class instance.
• self.vector1 and self.vector2: Randomly generated dense vectors of size 30.
• self.distance_metric: An enumeration (Enum) representing distance metrics,
including Manhattan, Euclidean, and Cosine distances.
Method 2 is the business logic, which actually computes the distance measure based
on the given criteria.
• Checks the value of distance_metric_name against the enumeration values to
determine the desired distance metric.
• If distance_metric_name is 'manhattan_distance', it calculates and returns the
Manhattan distance.
• If distance_metric_name is 'euclidean_distance', it calculates and returns the
Euclidean distance.
• If distance_metric_name is 'cosine_distance' (or any other value), it calculates and
returns the Cosine distance.
Code Figure 2
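A minimal sketch of a utility class matching the description above, written with the standard library only. The class name `VectorDistance`, the method name `compute`, and the `size` and `seed` parameters are hypothetical; only the attribute names and the metric strings come from the text.

```python
import math
import random
from enum import Enum
from typing import Optional


class DistanceMetric(Enum):
    MANHATTAN = "manhattan_distance"
    EUCLIDEAN = "euclidean_distance"
    COSINE = "cosine_distance"


class VectorDistance:  # hypothetical class name
    def __init__(self, size: int = 30, seed: Optional[int] = None) -> None:
        # Method 1: two randomly generated dense vectors and the metric enumeration.
        rng = random.Random(seed)
        self.vector1 = [rng.gauss(0.0, 1.0) for _ in range(size)]
        self.vector2 = [rng.gauss(0.0, 1.0) for _ in range(size)]
        self.distance_metric = DistanceMetric

    def compute(self, distance_metric_name: str) -> float:
        # Method 2: pick the metric by name, falling back to cosine distance.
        pairs = list(zip(self.vector1, self.vector2))
        if distance_metric_name == self.distance_metric.MANHATTAN.value:
            return sum(abs(x - y) for x, y in pairs)
        if distance_metric_name == self.distance_metric.EUCLIDEAN.value:
            return math.sqrt(sum((x - y) ** 2 for x, y in pairs))
        dot = sum(x * y for x, y in pairs)
        norm1 = math.sqrt(sum(x * x for x in self.vector1))
        norm2 = math.sqrt(sum(y * y for y in self.vector2))
        return 1.0 - dot / (norm1 * norm2)


calculator = VectorDistance(seed=42)
for metric in DistanceMetric:
    print(metric.value, calculator.compute(metric.value))
```

To measure the distance between sentence embeddings instead of random vectors, `vector1` and `vector2` can be overwritten with two embeddings of equal length.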
If we modify Code Figure 1 along the lines shown below, the logic computes the
distance between the vectors and finally displays the similarity between the vectors
passed in.
Code Figure 3
Chapter 5: Brute force Distance Measure using KNeighbours algorithm
In this instructional segment, we delve into the fundamental concepts surrounding
vector or semantic search, employing the brute-force k-nearest-neighbors algorithm to build
an intuitive understanding. The tutorial progresses by guiding you through the
implementation of brute-force KNN, elucidating its application in accurately retrieving
nearest vectors in the embedding space relative to a query vector. As we navigate this
journey, we confront the challenges associated with the runtime complexity of brute-force
KNN algorithms, paving the way for the exploration of approximate nearest-neighbors
algorithms, a core component of vector database technology.
Vectors, serving as conduits for the intrinsic meaning embedded within our data,
become instrumental in seeking data points that resonate in meaning with our queries. This
process, known as semantic or vector search, hinges on the identification and retrieval of the
closest objects within vector space. The tutorial elaborates on the semantic search,
emphasizing its reliance on the meaning encapsulated in words or images. A detailed
walkthrough of the brute-force approach is provided, involving sequential steps: calculating
distances between all vectors and the query vector, sorting these distances, and finally
returning the top K best-matching objects based on the smallest distances, a paradigm
recognized in classical machine learning as the K nearest neighbor algorithm.
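The three steps just described (compute every distance, sort, return the top K) can be sketched with the standard library. The function name and the sample data are assumptions made for illustration, and Euclidean distance is used as the metric.

```python
import math
from typing import List, Sequence, Tuple


def brute_force_knn(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to the query."""
    # Step 1: distance from the query to every stored vector.
    distances = [(index, math.dist(vector, query)) for index, vector in enumerate(vectors)]
    # Step 2: sort by distance, smallest first.
    distances.sort(key=lambda pair: pair[1])
    # Step 3: keep the k best matches.
    return distances[:k]


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
print(brute_force_knn(vectors, query=[0.0, 0.0], k=2))  # [(0, 0.0), (3, 0.5)]
```

Every query touches every stored vector, which is why the cost grows with the size of the collection.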
However, the tutorial underscores a critical consideration: the substantial
computational cost associated with brute-force searches. As the quantity of data points
escalates, the overall query time experiences proportional growth. A demonstration of this
algorithm in code, encompassing the scaling up of both data points and dimensions,
reinforces the impact of increased computational demands.
To illustrate this point, a speed test function assesses the time complexity of the
brute-force algorithm across various scales, ranging from 20 objects to millions. The
observed results demonstrate a noticeable increase in query time as the number of objects
expands, revealing the inherent limitations of the brute-force approach, particularly with
substantial datasets.
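A rough sketch of such a speed test, using only the standard library. The function names, the default sizes, and the fixed seed are illustrative assumptions, and the sizes are kept small so the sketch runs quickly; absolute timings depend entirely on the machine.

```python
import math
import random
import time
from typing import Dict, Iterable, List


def random_vectors(count: int, dims: int, rng: random.Random) -> List[List[float]]:
    """Generate `count` random vectors with `dims` components each."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(count)]


def time_brute_force_query(count: int, dims: int, k: int = 3, seed: int = 0) -> float:
    """Time one brute-force nearest-neighbour query over `count` vectors."""
    rng = random.Random(seed)
    vectors = random_vectors(count, dims, rng)
    query = random_vectors(1, dims, rng)[0]
    start = time.perf_counter()
    ranked = sorted((math.dist(vector, query), index) for index, vector in enumerate(vectors))
    _nearest = ranked[:k]
    return time.perf_counter() - start


def speed_test(sizes: Iterable[int] = (20, 2_000, 20_000), dims: int = 2) -> Dict[int, float]:
    """Map each collection size to the seconds taken by a single query."""
    return {count: time_brute_force_query(count, dims) for count in sizes}


for count, seconds in speed_test().items():
    print(f"{count:>7} vectors: {seconds:.6f} s")
```

Passing a larger `dims` value, for example 768, shows how higher dimensionality adds to the per-query cost.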
Furthermore, the tutorial addresses the impact of dimensional augmentation on
vector embeddings, exploring scenarios where the dimensionality is increased to 768
dimensions. Performance tests underscore the computational challenges, showcasing how
query times escalate, especially with larger datasets. Real-world scenarios, where collections
may encompass hundreds of millions of vectors, pose significant challenges for brute-force
methodology.
In conclusion, the tutorial accentuates the direct relationship between the number
of vectors and query time. Because brute-force search compares the query against every
stored vector, query duration grows in proportion to the size of the collection, which, in
scenarios mirroring real-world complexities, necessitates the exploration of alternative
methodologies to ensure timely and efficient results. The subsequent lesson promises an
exploration of diverse methods to navigate these challenges and facilitate effective queries
across numerous vectors.
The KNNDistanceMeasure class serves as a utility for exploring k-Nearest Neighbors
distance measurements. It offers the capability to generate random vectors, plot
embeddings and query vectors, and perform k-Nearest Neighbors search based on specified
parameters. This class can be instrumental for understanding and experimenting with the
KNN algorithm in a controlled environment, allowing users to visualize embeddings, queries,
and their nearest neighbors.
Code Figure 4
random_vector Method:
The random_vector method is responsible for generating random 2-dimensional vectors. It
utilizes NumPy's randn function to create 50 data points along each of the two
dimensions.
plot_data Method:
The plot_data method facilitates the visualization of embeddings and query vectors. It takes
two arguments: data_vector representing the embeddings and query_vector representing
the vector used as a query. The method generates a scatter plot, marking the embeddings
and the query vector. Each point on the plot corresponds to an embedding, and the query
vector is highlighted in blue. Text annotations on the plot correspond to the indices of the
embeddings.
nearest_neighbours Method:
The nearest_neighbours method performs k-Nearest Neighbors search. It takes several
parameters:
• k: The number of neighbors to retrieve (default is 3).
• algorithm: The algorithm used for nearest neighbors search (default is 'brute').
• metric: The distance metric used for calculating distances (default is 'euclidean').
• data_vector: The embeddings dataset.
• query_vector: The vector for which nearest neighbors are to be found.
If we plot the randomly generated 2-dimensional vectors using a driver program, the
result looks as follows.
Output showing the spread of vectors in two dimensional space
Chapter 6: What Are Vector Stores?
Vector databases are purpose-built databases tailored for the efficient storage and
retrieval of vector embeddings. They play a crucial role in addressing the limitations of
conventional databases, such as SQL databases, when it comes to handling extensive vector
data. The necessity for specialized stores arises from the inadequacies of traditional
databases in efficiently managing the storage and retrieval of large-scale vector information.
Embeddings, as seen in the previous chapters, serve as numerical representations of
data, particularly unstructured data like text, situated within a high-dimensional space. The
inherent nature of these embeddings makes them less compatible with conventional
relational databases, which struggle with the storage and retrieval of such intricate vector
representations.
Vector databases, on the other hand, are adept at indexing and swiftly searching for
similar vectors through the application of advanced similarity algorithms. This capability
empowers applications to identify and retrieve vectors that bear resemblance to a specified
target vector query. In essence, vector stores provide an optimized environment for
managing and querying vector data, enabling efficient exploration and retrieval of related
vectors in response to specific queries.
In this book we will consider Chroma DB as our vector store and explain all the
concepts related to vector stores. Chroma DB, an open-source vector store, is designed for
the storage and retrieval of vector embeddings, primarily serving the purpose of preserving
embeddings and associated metadata. This stored information proves valuable for
subsequent utilization by large language models. Notably, Chroma DB finds application
in semantic search engines dealing with textual data. Key features of Chroma DB are as
follows.
1. Diverse Storage Options:
• Chroma DB supports various underlying storage alternatives, including
DuckDB for standalone setups and ClickHouse for enhanced scalability.
2. Software Development Kits (SDKs):
• It furnishes Software Development Kits (SDKs) for Python and
JavaScript/TypeScript, facilitating seamless integration into projects
developed in these programming languages.
3. Emphasis on Simplicity and Speed:
• Chroma DB prioritizes simplicity, speed, and analytical capabilities, aligning its
design with the objectives of straightforward usage, rapid performance, and
data analysis.
4. Self-Hosted Server Option:
• An additional feature of Chroma DB is the availability of a self-hosted server
option, providing users with the flexibility to host and manage the vector
store infrastructure according to their specific requirements.
1. Installation of Chroma DB:
You can run a Chroma server in a Docker container or use it as a hosted service.
You can get the Chroma Docker image from Docker Hub, or from the Chroma GitHub
Container Registry:
docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma
You can also build the Docker image yourself from the Dockerfile in the Chroma GitHub
repository
git clone git@github.com:chroma-core/chroma.git
cd chroma
docker-compose up -d --build
The Chroma client can then be configured to connect to the server running in the Docker
container.
import chromadb
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
Chapter 7: Implementing our first Vector Search
The encapsulated functionalities within the ChromadbHelper class delineate a utility
designed for seamless interaction with Chroma DB, an adept database tailored for the
efficient management of vector embeddings. The instantiation process initializes a
connection to the Chroma DB server, configured with the host as 'localhost' and the port as
'8000'. The class offers a repertoire of methods to navigate key operations: fetching
collections, creating and deleting collections, saving data along with metadata, and querying
the database. Notably, the method for saving data internally employs Chroma DB, relying on
the all-MiniLM-L6-v2 model for embedding handling. This cohesive design allows for an
intuitive and structured approach to interact with Chroma DB, abstracting complexities
associated with HTTP client interactions and database operations. In the academic spirit, the
class serves as a pedagogical tool, facilitating a lucid understanding of how to navigate and
utilize Chroma DB effectively for vector embedding applications.
Code Figure 5
1. More on Querying with and without filters
Exploring Chroma collections involves diverse querying techniques facilitated by the .query
method. One approach entails querying with a set of query_embeddings, where each query
embedding is a numerical representation of a search query. By invoking the .query method
with parameters such as query_embeddings, n_results, where, and where_document,
users can retrieve the top matching results for each query embedding, allowing for
metadata-based and content-based filtering.
Code Figure 6
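A minimal sketch of such a query, assuming `collection` is a Chroma collection object. The wrapper function, the collection name `demo_collection`, the `source` metadata key, and the sample embeddings are hypothetical; the `demo` function additionally assumes the chromadb package is installed and a Chroma server is listening on localhost:8000.

```python
from typing import Any, Dict, List, Optional


def query_with_filters(
    collection: Any,
    query_embeddings: List[List[float]],
    n_results: int = 2,
    where: Optional[Dict[str, Any]] = None,
    where_document: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the top matches per query embedding, optionally filtered."""
    return collection.query(
        query_embeddings=query_embeddings,
        n_results=n_results,
        where=where,                    # metadata-based filter
        where_document=where_document,  # content-based filter
    )


def demo() -> None:
    """Populate a small collection and query it (needs a running Chroma server)."""
    import chromadb

    client = chromadb.HttpClient(host="localhost", port=8000)
    collection = client.get_or_create_collection("demo_collection")
    collection.add(
        ids=["id1", "id2"],
        embeddings=[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]],
        documents=["notes about vector search", "a post about cooking"],
        metadatas=[{"source": "notes"}, {"source": "blog"}],
    )
    result = query_with_filters(
        collection,
        query_embeddings=[[0.1, 0.2, 0.25]],
        n_results=1,
        where={"source": "notes"},
        where_document={"$contains": "vector"},
    )
    print(result["ids"], result["distances"])
```

The query embeddings must have the same number of dimensions as the embeddings already stored in the collection (three in this example).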
If the dimensions of the supplied query embeddings do not align with the collection, an
exception will be raised. Alternatively, users can opt to query by a set of query_texts.
Chroma first embeds each query text using the collection's embedding function and
subsequently performs the query with the generated embeddings. The .query method, in
this scenario, also supports parameters like n_results, where, and where_document.
Retrieving items from a collection by their unique identifiers (ids) is achievable through the
.get method, where users can specify the desired ids and apply optional where and
where_document filters.
Code Figure 7
The .get method, if invoked without specific ids, returns all items in the collection that
match the specified filters. Notably, when using .get or .query, the include parameter allows
users to selectively retrieve data fields such as embeddings, documents, metadatas, and
distances. By default, Chroma returns documents, metadatas, and distances for query
results, while excluding embeddings for performance reasons. Users can customize the
returned data fields by providing an array of included field names to the include parameter
of the .query or .get method, tailoring the output to their specific requirements.
Code Figure 8
Metadata filtering in Chroma supports a range of operators, providing users with versatile
options for refining queries based on metadata attributes. The $eq operator enables filtering
for equality, accommodating strings, integers, and floats. Conversely, the $ne operator
filters for inequality, excluding items whose value equals the specified value, and likewise
supports string, integer, and float comparisons. For numeric comparisons, Chroma offers the
$gt operator to filter for values greater than the specified threshold, and the $gte operator
for values greater than or equal to the given threshold. On the other hand, the $lt operator
facilitates filtering for values less than the specified threshold, while the $lte operator
includes values less than or equal to the specified threshold. This array of operators provides
users with a comprehensive toolkit to precisely tailor their metadata-based filters, promoting
flexibility and precision in querying Chroma collections.
Code Figure 9
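To make the operator semantics concrete, the sketch below re-implements them locally in plain Python. It is an illustration only, not Chroma's own code: the `matches` helper and the `gpa` and `year` metadata fields are hypothetical, and treating a missing field as a non-match is a simplifying assumption.

```python
import operator
from typing import Any, Dict

# Filter operator names mapped to the comparison each one performs.
_OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Dict[str, Any]]) -> bool:
    """Return True when every condition in `where` holds for `metadata`."""
    for field, condition in where.items():
        if field not in metadata:
            return False  # simplification: a missing field never matches
        for op_name, target in condition.items():
            if not _OPERATORS[op_name](metadata[field], target):
                return False
    return True


records = [
    {"name": "A", "gpa": 3.8, "year": 2},
    {"name": "B", "gpa": 3.2, "year": 3},
    {"name": "C", "gpa": 3.5, "year": 2},
]

high_gpa = [r["name"] for r in records if matches(r, {"gpa": {"$gte": 3.5}})]
not_year_two = [r["name"] for r in records if matches(r, {"year": {"$ne": 2}})]
print(high_gpa)      # ['A', 'C']
print(not_year_two)  # ['B']
```

A dictionary of the same shape, such as `{"gpa": {"$gte": 3.5}}`, is what gets passed as the `where` argument of `.query` or `.get`.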
Vector Databases support CRUD operations
The StudentDB class encapsulates a simple yet illustrative Python application, showcasing
the fundamental CRUD operations within the domain of Chroma DB. Initiated with an
instantiation of Chroma DB and an OpenAI embedding function, the class seamlessly
integrates the capabilities of both technologies. Leveraging this integration, the class defines
methods for creating, reading, updating, and deleting student records within a collection
named "students." The create_student method orchestrates the addition of new student
information, generating a unique identifier for each student. Subsequently, the
read_student method retrieves and displays the information of a specified student,
demonstrating the read operation. The update_student method allows for the modification
of an existing student's information, exemplifying the update operation. Finally, the
delete_student method facilitates the removal of a student record based on a provided
identifier, illustrating the delete operation. This concise yet comprehensive demonstration
underscores the seamless integration of Chroma DB and OpenAI embeddings, offering a
tangible illustration of CRUD operations within the context of a vector database.
Code Figure 10
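A minimal sketch of the four operations described above. It assumes the class receives an already-created Chroma collection, which is a design choice made here for illustration; the use of `uuid` for identifiers, the `build_collection` helper, and the choice of 'text-embedding-ada-002' as the model are also assumptions.

```python
import uuid
from typing import Any, Optional


class StudentDB:
    """CRUD operations on student records stored as documents in a collection."""

    def __init__(self, collection: Any) -> None:
        self.collection = collection

    def create_student(self, info: str) -> str:
        student_id = str(uuid.uuid4())  # unique identifier per student
        self.collection.add(ids=[student_id], documents=[info])
        return student_id

    def read_student(self, student_id: str) -> Optional[str]:
        result = self.collection.get(ids=[student_id])
        documents = result["documents"]
        return documents[0] if documents else None

    def update_student(self, student_id: str, info: str) -> None:
        self.collection.update(ids=[student_id], documents=[info])

    def delete_student(self, student_id: str) -> None:
        self.collection.delete(ids=[student_id])


def build_collection(api_key: str) -> Any:
    """Create the "students" collection (needs chromadb, a server, and an API key)."""
    import chromadb
    from chromadb.utils import embedding_functions

    embedding_function = embedding_functions.OpenAIEmbeddingFunction(
        api_key=api_key, model_name="text-embedding-ada-002"
    )
    client = chromadb.HttpClient(host="localhost", port=8000)
    return client.get_or_create_collection(
        name="students", embedding_function=embedding_function
    )
```

With a real collection, `StudentDB(build_collection(api_key))` embeds each record through the OpenAI embedding function whenever a document is added or updated.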
Chapter 8: Going From CRUD to Semantic Search
In the development of our forthcoming application, we embark on the task of
creating a straightforward yet efficient system. This system involves the storage of two
distinct documents, namely "student_info" and "university_info," within the vector database
known as Chroma DB. Leveraging the OpenAI embeddings, we employ a custom embedding
function intrinsic to Chroma DB to facilitate the incorporation of these documents. The
embedding function plays a pivotal role in encapsulating the semantic nuances and
contextual information of the documents. As the documents find their residence within
Chroma DB, we subsequently proceed to pose queries to this database. Chroma DB, armed
with its inherent capability to compute minimum distances based on context, eventually
responds by returning the document that exhibits the closest contextual match to the posed
questions. This application thus underscores the seamless synergy between document
storage, embedding functions, and query responses within the realm of vector databases,
exemplifying the practicality and efficacy of such systems in real-world applications. The
code below exemplifies the above scenario.
Code Figure 11
In the above code encapsulated within the ChromadbOpenAI class, a sophisticated
integration of Chroma DB and OpenAI embeddings is undertaken to facilitate seamless
storage and retrieval of information. The instantiation process initializes a Chroma DB client
and employs an OpenAI embedding function, specifically 'text-embedding-ada-002,'
enriching the system's capacity to comprehend and represent textual data. Subsequently, a
collection named "students_and_university" is instantiated within Chroma DB, utilizing the
OpenAI embedding function for embedding documents. The save_data method
orchestrates the incorporation of two distinct documents, "student_info" and
"university_info," into the aforementioned collection.
These documents are accompanied by metadata denoting their respective sources,
and unique identifiers ("id1" and "id2") are assigned for efficient referencing. The query
method exemplifies the application's query functionality, where an inquiry regarding the
GPA of Pavan is posed. The system leverages the collection's embedding function and
responds by retrieving the most contextually relevant result, showcasing the intricate
interplay between document storage, metadata annotation, and query resolution within the
domain of vector databases and advanced embedding functions. This integration
underscores the academic exploration of harnessing cutting-edge technologies for efficient
and contextually aware information retrieval.
The Future and Beyond
In the context of RAG (Retrieval-Augmented Generation) applications, documents
are first segmented into chunks of suitable length. The right length depends on the chosen
embedding model and on the downstream Large Language Model (LLM) application that
uses these chunks as context, so this step determines how compatible and effective the
documents are within the chosen framework. Once the documents are segmented, the next
phase is indexing: embeddings are generated for the chunks, and a Vector Search index is
populated with them. The system can then search and retrieve efficiently, and the
document embeddings are readily available to the RAG application. Preparing documents
in this way keeps them matched to both the embedding model and the downstream LLM
application.
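The segmentation step can be sketched as follows. This is a minimal illustration under
stated assumptions: split_into_chunks is a hypothetical helper, it counts whitespace-separated
words rather than model tokens, and the chunk_size and overlap values are arbitrary. A
production pipeline would more likely measure length with the embedding model's own
tokenizer.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that context spanning a
    chunk boundary is not lost entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Small synthetic document so the windowing is easy to inspect.
document = " ".join(f"word{i}" for i in range(10))
chunks = split_into_chunks(document, chunk_size=4, overlap=1)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

Each resulting chunk would then be embedded and written to the vector index. The overlap
is a common design choice because a sentence cut at a boundary still appears whole in at
least one chunk, at the cost of storing some text twice.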
The ChromaDBVectorizer class orchestrates the vectorization of documents and their
subsequent storage and retrieval within Chroma DB, while following good practices for
error handling and resource initialization. In the constructor method (__init__), the class
first establishes a connection to Chroma DB using a ChromaDBHelper instance and creates
or fetches a collection named <your_collection_name>. This step is wrapped in a try-except
block so that potential exceptions are handled: in the event of an error, the collection is
created, and a corresponding message is printed. Following this, an OpenAI embedding
model (embed_model) is instantiated using an OpenAI API key, and a set of documents is
loaded from a specified directory using a SimpleDirectoryReader. The class then initializes
a ChromaVectorStore with the fetched or created collection, setting up the storage context
and service context for subsequent operations.
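The fetch-or-create step with its try-except block can be sketched in isolation. The sketch
below assumes a hypothetical InMemoryClient and CollectionNotFoundError so that it runs
without a database; it is not the ChromaDBHelper used in the class, and the collection name
is a placeholder.

```python
class CollectionNotFoundError(Exception):
    """Raised by the hypothetical client when a collection does not exist."""


class InMemoryClient:
    """Hypothetical stand-in for a vector database client."""

    def __init__(self):
        self._collections = {}

    def get_collection(self, name):
        if name not in self._collections:
            raise CollectionNotFoundError(name)
        return self._collections[name]

    def create_collection(self, name):
        self._collections[name] = {"name": name, "documents": []}
        return self._collections[name]


def fetch_or_create(client, name):
    """Fetch the collection if it exists; otherwise create it and report that."""
    try:
        return client.get_collection(name)
    except CollectionNotFoundError:
        # Catch only the specific "missing" error so unrelated failures still surface.
        print(f"Collection '{name}' not found; creating it.")
        return client.create_collection(name)


client = InMemoryClient()
first = fetch_or_create(client, "my_collection")   # created on the first call
second = fetch_or_create(client, "my_collection")  # fetched on the second call
```

Catching a narrow exception type rather than every exception keeps genuine connection or
permission errors visible. The Chroma client also exposes a get_or_create_collection
method, which may make the explicit try-except unnecessary in practice.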
Moving forward, the save_to_database method stores the vectorized documents in
Chroma DB. A VectorStoreIndex is created from the loaded documents, using the
established storage and service contexts. This index, representing the vectorized data, is
returned by the method.
Subsequently, the query method performs a query on the vectorized data stored in
Chroma DB. It uses the VectorStoreIndex obtained from the save_to_database method to
instantiate a query engine (query_engine). The method then executes a sample query,
asking about the segment profit of aerospace, and prints the response from the query
engine for examination.
Throughout the class, proper indentation and modularization are employed to
enhance readability and maintainability. Additionally, the class exhibits a systematic
approach to error handling, ensuring the resilience of the application under various
circumstances. This comprehensive explanation provides a clear understanding of the class's
purpose and functionality within the context of document vectorization and Chroma DB
integration.
Code Figure 11
Final Chapter: Conclusion
In the realm of large language model systems, vector stores such as Chroma DB
have become indispensable components. Their specialized storage capabilities and efficient
retrieval of vector embeddings play a pivotal role in facilitating swift access to pertinent
semantic information, thereby empowering the functionality of Large Language Models
(LLMs).
This tutorial on Chroma DB delves into the fundamental aspects of its utilization.
Topics covered encompass the foundational steps of creating a collection, incorporating
documents, transforming text into embeddings, executing queries for semantic similarity,
and proficiently managing the collections. This comprehensive tutorial serves as a valuable
resource for individuals seeking to grasp the essentials of employing Chroma DB in their
language model endeavours.
As part of the continuous learning journey, the next phase is integrating vector
databases into generative AI applications. The LlamaIndex framework is an invaluable tool
for users who want to ingest, manage, and retrieve private or domain-specific data for
their AI applications within Large Language Model (LLM)-based systems. Furthermore,
enthusiasts can explore the intricacies of LLMOps and its areas of application. This
progression allows practitioners to deepen their understanding and application of vector
databases, fostering a more nuanced approach to harnessing their capabilities for advanced
language model development.