Distributed Vector Databases - What, Why, and How
©2023 VMware, Inc.
A Speedy Introduction To Vector Databases
Steve Pousty
@thesteve0
VMware Principal Dev Advocate
Agenda
1. Introduction to Vector Databases
2. What is different from an RDBMS
3. Where to use them and what that means for you
4. Make you the life of the party
Let’s talk about “vectors”, aka embeddings
What is a vector database
Easy answer - a data store that works with vectors
Turning Things into Numbers
Start with unstructured data - challenging for computers
Neural Networks to the rescue
Brief Discussion on Tokens - NLP
API Costs and Context Length
Embeddings
There are more and more embedding models available to use.
The ones we care about today are neural networks that have been
pre-trained on large datasets.
There are several things to consider:
1. Appropriateness for task
2. Size of input
3. Length of output vector
4. Accuracy
5. Speed of computation
https://huggingface.co/models
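To make "size of input" and "length of output vector" concrete, here is a minimal sketch. The function `toy_embed` is a hypothetical stand-in, not a pre-trained neural network: it hashes words into a fixed number of buckets, so it captures no real meaning. It only shows that text of any length comes out as a vector of one fixed length.

```python
import hashlib
import math

def toy_embed(text, dim=8):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words.

    Any input text becomes a list of `dim` floats, scaled to unit length.
    A real pre-trained model would encode meaning; this toy does not.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # hashlib gives the same bucket on every run (built-in hash() does not).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

short = toy_embed("cat")
longer = toy_embed("a cat sat on the mat and watched the birds outside")
print(len(short), len(longer))  # both have the same length, whatever the input size
```

In practice `toy_embed` would be swapped for a model chosen against the five considerations above.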
Now into Vector Space
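Once items are vectors, "similar" becomes "close together". The sketch below shows two common ways to measure that, Euclidean distance and cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
truck = [0.0, 0.1, 0.9]

print("cat vs kitten:", cosine_similarity(cat, kitten), euclidean_distance(cat, kitten))
print("cat vs truck: ", cosine_similarity(cat, truck), euclidean_distance(cat, truck))
```

Which measure to use usually depends on how the embedding model was trained, so check the model's documentation.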
How to query
“What picture is similar to this picture?”
Step 1: Cat to Vector
Step 2: Query the database for “nearby” vectors
Step 3: Return results ranked from nearest to farthest
Rank Image reference
1 reference to
2 reference to
3 reference to
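The three query steps can be sketched as an exact, brute-force search. This is an illustration under stated assumptions, not how a vector database works internally: the file names and vectors are invented, and the query vector is assumed to have already been produced by an embedding model (Step 1).

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, stored, k=3):
    """Return the k (distance, reference) pairs closest to the query, nearest first."""
    scored = [(euclidean_distance(query, vec), ref) for ref, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]

# Hypothetical image references and their (made-up) vectors.
stored = {
    "cat_01.jpg": [0.9, 0.1, 0.0],
    "cat_02.jpg": [0.8, 0.2, 0.1],
    "dog_01.jpg": [0.5, 0.5, 0.1],
    "truck_01.jpg": [0.0, 0.1, 0.9],
}

# Step 1 is assumed done: this is the vector for the query picture.
query_vector = [0.88, 0.12, 0.0]

# Steps 2 and 3: find nearby vectors and return them as a ranked list.
results = nearest(query_vector, stored, k=3)
for rank, (dist, ref) in enumerate(results, start=1):
    print(rank, ref, round(dist, 3))
```

Brute force compares the query with every stored vector, which is why large collections need an index.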
Brief Discussion on HNSW
One of the most common Approximate Nearest Neighbor (ANN) indexing models
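HNSW stands for Hierarchical Navigable Small World. The core idea is to link each vector to some of its neighbours and answer a query by walking that graph toward the query instead of scanning everything. The sketch below is a deliberately simplified, single-layer version of that idea and not HNSW itself: it builds the graph by brute force, uses random 2-D points, and keeps only one candidate during the walk. HNSW stacks several such graphs in layers so the walk starts with long hops and finishes with short ones.

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_knn_graph(points, k):
    """Link every point to its k nearest neighbours (brute force, illustration only)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: distance(p, points[j]),
        )
        graph[i] = others[:k]
    return graph

def greedy_search(points, graph, query, entry=0):
    """Walk the graph, always moving to the neighbour closest to the query."""
    current = entry
    current_dist = distance(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbour in graph[current]:
            d = distance(points[neighbour], query)
            if d < best_dist:
                best, best_dist = neighbour, d
        if best == current:
            # No neighbour is closer: a local minimum, possibly not the true nearest.
            return current, current_dist
        current, current_dist = best, best_dist

rng = random.Random(42)  # fixed seed so the run is repeatable
points = [[rng.random(), rng.random()] for _ in range(200)]
graph = build_knn_graph(points, k=8)
query = [0.5, 0.5]

approx_idx, approx_dist = greedy_search(points, graph, query)
exact_dist = min(distance(p, query) for p in points)
print("approximate:", approx_dist, "exact:", exact_dist)
```

The walk only inspects a small part of the data, and the answer it returns can be farther away than the true nearest neighbour. That trade of accuracy for speed is what "approximate" means here.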
What are they good for
Questions related to similarity
1. Not appropriate when exact search is the dominant use case
2. Specialized for a particular use case - they supplement your data infrastructure
3. Providing “memory” for your AI models
4. Reduce cost for running an AI infrastructure
5. Interface between Data Science and Application Development
Example use cases
1. Search (where results are ranked by relevance to a query vector)
2. Clustering (where items are grouped by similarity)
3. Recommendations (where related items are recommended)
4. Anomaly detection (where distant vectors with little relatedness are identified)
5. Diversity measurement (where similarity distributions are analyzed)
6. Classification (where items are classified by their most similar label)
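As one illustration of these use cases, classification by most similar label can be sketched as a nearest-neighbour lookup: find the stored vector most similar to the new one and reuse its label. The labels and vectors below are invented for the example.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(vector, labelled):
    """Return the label of the stored example most similar to `vector`."""
    best_label, _ = max(
        ((label, cosine_similarity(vector, example)) for label, example in labelled),
        key=lambda pair: pair[1],
    )
    return best_label

# Hypothetical labelled vectors.
labelled = [
    ("animal", [0.9, 0.1, 0.0]),
    ("animal", [0.8, 0.2, 0.1]),
    ("vehicle", [0.0, 0.1, 0.9]),
    ("vehicle", [0.1, 0.0, 0.8]),
]

print(classify([0.7, 0.3, 0.1], labelled))
print(classify([0.1, 0.1, 0.7], labelled))
```

The other use cases are built from the same similarity scores. Anomaly detection, for example, flags vectors whose best similarity to everything else is low.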
A Popular Example
Retrieval Augmented Generation (RAG)
Background Assumptions
1. You have some sort of generative text model to answer users’ questions.
2. OpenAI has trained their generative model on a broad corpus of texts.
3. You have vectors for your documentation in a vector DB.
The New Flow
4. User query -> embedding
5. Search your documentation with this embedding
6. Get back the n closest documents
7. Add those documents as context (augmentation) to the original query
8. Send all the new text to OpenAI for prediction
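The retrieval and augmentation steps of this flow can be sketched as below. Everything named here is an assumption for illustration: `embed` is a toy word-count embedding standing in for a real model, the documents and prompt wording are invented, and the final call to the generative model is left out because it depends on the provider's API.

```python
import math

# Hypothetical documentation snippets.
docs = [
    "reset password from the settings page",
    "invoices are emailed monthly",
    "the api rate limit is 100 requests per minute",
]

# Toy vocabulary built from the documents; a real system would use a trained model.
vocab = sorted({word for doc in docs for word in doc.split()})

def embed(text):
    """Toy embedding: count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(question, documents, n=2):
    """Embed the question, search the documents, return the n closest."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine_similarity(q, embed(d)), reverse=True)
    return ranked[:n]

def build_prompt(question, context_docs):
    """Add the retrieved documents as context to the original question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "how do i reset my password"
prompt = build_prompt(question, retrieve(question, docs, n=1))
print(prompt)  # this text is what would be sent to the generative model
```

In a real deployment the document vectors are computed once and stored in the vector DB, so only the question is embedded at query time.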
Two types of Architecture
1. Add-ons to existing databases - a new data type with new indices and functions.
2. Single purpose - not transactional like an RDBMS. BASE rather than ACID.
Add-ons tend towards the same scaling properties as the base system.
Single-purpose databases tend to be new and built with horizontal scaling in mind.
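Horizontal scaling for similarity search is often done with a scatter-gather pattern: each shard finds its own top k, and a coordinator merges those partial lists into the global top k. The sketch below assumes that pattern with in-memory dictionaries standing in for shards; the references and vectors are invented, and no specific product is claimed to work exactly this way.

```python
import heapq
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_shard(shard, query, k):
    """Each shard returns its own k nearest (distance, reference) pairs."""
    return heapq.nsmallest(k, ((distance(vec, query), ref) for ref, vec in shard.items()))

def search_all(shards, query, k):
    """Scatter the query to every shard, then merge the partial results."""
    partials = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return heapq.nsmallest(k, partials)

# Three hypothetical shards, each holding part of the data.
shards = [
    {"a": [0.0, 0.0], "b": [1.0, 1.0]},
    {"c": [0.1, 0.0], "d": [5.0, 5.0]},
    {"e": [0.0, 0.2], "f": [9.0, 9.0]},
]

top = search_all(shards, [0.0, 0.0], k=3)
print([ref for _, ref in top])
```

Every query touches every shard, so the slowest shard sets the response time. That is one reason to plan shard count and placement up front.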
What this means for you
1. They tend to be horizontally sharded/distributed, so plan accordingly
2. A LOT of random reads, so IOPS really matter
3. HNSW indices are big and should be in RAM
4. The streaming/ingestion pipeline is going to handle the embeddings
5. Reduce overall data stored in the DB - it’s a “compression” technique
6. Given the newer, bigger AI/ML push, they are definitely going to be part of your data infrastructure
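A back-of-envelope calculation shows why HNSW indices being big matters for RAM sizing. The sketch assumes a commonly quoted rule of thumb: per vector, 4 bytes per float dimension plus roughly `M * 2` graph links of 4 bytes each. The collection size, dimension, and `M` value are assumptions picked for illustration, and real engines add overhead that is not counted here.

```python
def hnsw_memory_bytes(num_vectors, dim, m=16, bytes_per_float=4, bytes_per_link=4):
    """Rough estimate: raw float32 vectors plus about 2*M graph links per vector."""
    per_vector = dim * bytes_per_float + m * 2 * bytes_per_link
    return num_vectors * per_vector

# Assumed workload: 10 million vectors of 768 dimensions, M = 16.
estimate = hnsw_memory_bytes(10_000_000, 768)
print(f"{estimate / 1024**3:.1f} GiB before replication or metadata")
```

At this scale the raw vectors dominate the total, so the length of the output vector chosen at embedding time drives the memory bill.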
Sum it up
1. In ML/AI, vector refers to the generated numerical representation of unstructured data
2. The vector encodes “meaning” into a multidimensional space
3. Vector Databases allow you to store and query vectors
4. They handle questions related to similarity
5. They are usually distributed
6. Hang on, it should be an interesting ride
Thanks and Enjoy
the Vectors!
Steve Pousty
@Thesteve0
https://bit.ly/dokvector
