
Introduction To LlamaIndex With Python 2024

The video introduces LlamaIndex, a framework for creating language model applications, emphasizing its capabilities and features. It explains how LlamaIndex enables users to enrich language models with personal or company data through data connectors, documents, nodes, and an indexing system. The tutorial series aims to provide a comprehensive understanding of LlamaIndex, including its components and implementation in code.

Uploaded by

camila.pinto

00:00:01 Good morning everyone, how's it going today? Welcome back to the channel,
coming to you from France today, sunny France. We're at 40° today, we're melting. That
is in Celsius; I'm not sure how much that is in Fahrenheit, but it's very, very high.
Today we're going to be covering LlamaIndex, and the reason for that is that I
was recently researching this framework and I figured that most of the
content online about LlamaIndex is kind of out of date. So right now it's probably a
very good time to create a longer series

00:00:35 about all of LlamaIndex's capabilities and features that you can
use in your projects and in your company, because now the company is at a much more
stable and mature phase, so I figured that this series of tutorials is going to
remain evergreen for a longer time. So yeah, in this first video of the series we're
going to be covering pretty much what LlamaIndex is and its particular take on context
augmentation and RAG. We're going to be covering that with a very cute diagram that
we have right here, and

00:01:09 then we're going to be going to the code so that you can just get your
feet wet and start using this framework, which is
pretty, pretty useful. So there you go, that's what we're going to do right
now. So let's actually start talking about what LlamaIndex is. [Music] All
right, so first of all let's talk a little bit about what LlamaIndex is. Now, if
you're here I suppose that you already have a bit of an intuition or an idea of
what LlamaIndex is, but in case

00:01:50 you don't, and for the newcomers as well, I'm going to explain that
really quick for you. A good way to think about this, if you already have a little
bit of background in this, is that there is a lot of overlap between LlamaIndex
and LangChain: they both allow you to create LLM applications. However, there are
some differences that we will be covering during this series as well.
LlamaIndex essentially is a framework or an ecosystem that allows you to create LLM
applications such as chatbots, AI assistants, translation

00:02:24 machines, or any other kind of application that would require a
language model in its services. And something that is very useful about it is
that they also provide a lot of data loaders that allow you to enrich the knowledge
of your language model with your own data. For example, your language model might not
know anything about the contents of your email inbox or your company's database or
your notes in Notion. Well, the idea is that a framework like LlamaIndex can allow you to
enrich the knowledge of your

00:03:02 language model with that information so that it can answer questions
about your personal data or your company's data. All right, and they also allow
you to create much more sophisticated programs such as AI agents, multi-agents, etc.
Okay, now other than that they also have some enterprise solutions such as
LlamaCloud and LlamaParse, which are essentially the same thing that you would be able to
code yourself but taken care of by their own teams in San Francisco. In other
words, what LlamaCloud

00:03:45 can allow you to do is to help you ingest all of your
personal data or all of the data of your company, put it in an index, and use it
with the language models that they have available so that you
can create these kinds of applications already within your company. So it's kind
of a hosted version of the thing that you would be able to code yourself if you had
a team of machine learning scientists and software engineers. And other than that
they also have this service called LlamaParse, which allows

00:04:17 you to parse pretty much any kind of document and output a structured
document from it. So for example, if you send a PowerPoint presentation
to their API at LlamaParse, they will return to you a structured document with all
of the contents of your PowerPoint presentation, with a very easy-to-use API. So
that is very, very useful. I also have a video about this; if you're interested you
can check that out. And there you go, so that is essentially the introduction to
LlamaIndex: what their open-source libraries do for

00:04:58 you and what their enterprise solutions can do for you as well. So let's
actually start by looking at what they do and how they do what I explained to
you real quick just now, and we're going to do that with a very
nice diagram. So here we have the diagram, and the idea right here is that I want to
introduce you to what LlamaIndex does and how they deal with context augmentation. Now,
if you have any experience with RAG, or retrieval-augmented generation, you may
already understand a lot just by

00:05:35 looking at this diagram, but the idea behind this introduction to
LlamaIndex is that I wanted to go through the entire diagram again, highlighting the
differences in how LlamaIndex deals with this and also highlighting the main
components of the LlamaIndex ecosystem. Okay, that's why some of these items are
in different colors. Okay, so let's actually start by taking a look at the
high-level components of the LlamaIndex ecosystem. All right, so
let's start off by talking about the first component of

00:06:11 the LlamaIndex ecosystem, and that is data connectors. Okay, now remember
that I told you that LlamaIndex is an ecosystem or a library that allows you to
ingest your data in order to use it in your application. Now, that data is usually
stored in unstructured or structured sources. What does this mean? That you may
have your data in PDF files, in HTML files if they are websites, in CSV files, in
Excel files, in Word documents, etc. And if you want to use them within your large
language model application, you will need to take all of

00:06:53 that data from all of these different formats and put them together in
a uniform way, and that is essentially what the data connectors do. Now, data
connectors are probably one of the most popular parts of the LlamaIndex ecosystem,
because lately they have been putting a lot of effort into creating very
sophisticated data connectors that allow you to ingest pretty much any kind of
document you want in a much smarter way. That is actually what they do with
LlamaParse: LlamaParse is essentially

00:07:30 an API with a data connector that takes pretty much any
kind of format and then gives out structured documents so that you can
organize your data and use it with your language model applications. Okay,
and they also have LlamaHub, which is an open-source community where you can also
contribute if you want. Essentially, this community allows you to create any kind
of component in the LlamaIndex ecosystem and make it available to the
community, but I feel that the most popular use of LlamaHub

00:08:07 is also their data connectors. So someone may develop an independent
data connector for, say, Notion, so now you can use the Notion data
connector to ingest your data from your Notion databases and use it in your
LlamaIndex application. So that is data connectors: essentially they just allow
you to take all of your unstructured or structured documents and put them together
in a more organized way that you can use in your LLM application. Now, the next
thing to consider is the output of your data

00:08:46 connector, and that is the documents. So what is the document that we see
right here? Now, as I told you before, you have these data connectors that will take
all of your structured and unstructured files and will output documents from this
data that you have right here. Now, you may be tempted to think that these documents
are PDF documents or Word documents or things like that, but actually not at all:
these documents are just programming objects with different properties.
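
Editor's sketch: to make "a programming object with properties" concrete, here is a minimal illustration in plain Python. It is not LlamaIndex's own document class; the name `ToyDocument`, its two fields, and the sample values are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToyDocument:
    """Hypothetical stand-in for the object a data connector returns."""

    # Raw text extracted from the source file.
    text: str
    # Information about where that text came from.
    metadata: Dict[str, Any] = field(default_factory=dict)


# Illustrative values only; a real connector would fill these in for you.
doc = ToyDocument(
    text="We the People of the United States ...",
    metadata={"file_name": "constitution.pdf", "page_label": "1"},
)

print(type(doc).__name__)
print(doc.metadata["file_name"])
```

The point of the sketch is only that a document is an ordinary object: one field holding the extracted text and one holding facts about its source.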
Okay, now this document for example may come

00:09:27 directly from your PDF. So your PDF, after being passed through your data
connector, will output this document right here, and this document is essentially
just a programming object with one property titled text (or it may be titled content,
depending on the framework), and that property will contain all of the text inside of
this file. So it's essentially just a distillation of the contents of this
data source. And then in another property under the same object you will find the
metadata, and this metadata will include

00:10:05 the name of this source file, so that when you use this document in your
application you will know exactly which file it came from. It may also contain
the page range where the information comes from in your original file, it may
also contain the date at which your file was ingested in the data connector, etc. So
essentially that is what a document is: it's just a structured representation of
your data source, and this makes it much easier to handle your data in this
pipeline. Okay, now the next thing that we

00:10:45 have to consider is how these documents are then turned into nodes. So
now let's talk about this concept right here of the node, and this is a very
particular concept and a very special one, because it is very ingrained in
how LlamaIndex works, and this does not happen by default when you try to do the same
thing with LangChain, for example. So as you can see, I mean, remember that the whole
idea of this diagram right here is that
we're explaining how we would create a data

00:11:24 processing pipeline that will take our unstructured and structured
data sources and turn them into an index that you will then be able to query using
a language model. So this index will contain your personal information, and then
you will be able to use a language model to ask questions about
your personal information that your model was not initially trained on. So what we
have done so far is we have taken these unstructured data sources,
passed them through a data

00:11:57 connector, and put them into documents. What is happening after that is
that the documents are split into different nodes. Now, what is one of these nodes? A
node is essentially, at the highest level, just a chunk of your document. Okay.
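
Editor's sketch: the idea of a node as a chunk that keeps its parent document's metadata and is linked to its neighbours can be illustrated in plain Python. This is a toy, not LlamaIndex's node class or its splitting logic: `ToyNode`, `split_into_nodes`, and the fixed words-per-chunk rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyNode:
    """Hypothetical chunk of a document."""

    node_id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    prev_id: Optional[str] = None  # link to the previous chunk, if any
    next_id: Optional[str] = None  # link to the next chunk, if any


def split_into_nodes(
    text: str, metadata: Dict[str, Any], words_per_node: int = 5
) -> List[ToyNode]:
    """Cut `text` into fixed-size word chunks that each copy the document metadata."""
    words = text.split()
    nodes: List[ToyNode] = []
    for start in range(0, len(words), words_per_node):
        chunk = " ".join(words[start:start + words_per_node])
        nodes.append(
            ToyNode(node_id=f"node-{len(nodes)}", text=chunk, metadata=dict(metadata))
        )
    # Connect each node to its neighbour so the chunks form a chain.
    for left, right in zip(nodes, nodes[1:]):
        left.next_id = right.node_id
        right.prev_id = left.node_id
    return nodes


nodes = split_into_nodes(
    "All legislative Powers herein granted shall be vested "
    "in a Congress of the United States",
    {"file_name": "constitution.pdf"},
)
for node in nodes:
    print(node.node_id, repr(node.text), node.prev_id, node.next_id)
```

A real splitter would typically work on sentences or tokens rather than a fixed word count; the chain of `prev_id`/`next_id` links is the simplest possible version of the "network" of nodes.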
So depending on the length of your document, it may be split into one or
several nodes, and the idea is that the node retains the metadata from the document.
So if this document contains the metadata that it came from this PDF, this node will
also contain that

00:12:36 metadata, but it is kind of a more granular piece of information,
whereas the document contains a larger amount of information from the PDF. Now, there
is one very important distinction between the node and the document, and it is that
not only is the node more granular, it is also interconnected with other nodes,
creating sort of a network of knowledge. Okay, so the idea right here is that your
document may be split into several nodes, each node containing more granular
information, and these nodes will be interconnected among themselves,

00:13:11 and the same thing happens between all of your documents from all of your
data sources that were ingested. So that is essentially this step
right here of creating the nodes, and that is a different step from the
simpler implementation of this that we could have seen, for example, in the
LangChain video that we did before. LlamaIndex does this by default, so by default
LlamaIndex is going to be working with nodes, and it makes it more explicit that
you're going to be manipulating them during your data

00:13:47 ingestion process. Okay, so once you have this network of knowledge
in the nodes, we're going to go to the embeddings. Finally, in this data
processing pipeline you have the embeddings of each one of your nodes, so each part
of your network right here is going to be converted, using an embeddings model,
into a numerical representation of itself. Okay, so here there are several embeddings
models to choose from: I mean, you can use OpenAI's, you can use many open-source
models as well, and the idea is that you're

00:14:22 going to be embedding the contents of your node, creating a
numerical representation of it. This numerical representation will also, in
some way, capture the meaning behind the contents of the information inside of
the node, and then all of these numerical representations of all of the material
that you had in your PDFs, HTML and all of your files will be added together
inside the index. And that is how we get to the index right here, which is another
core component of LlamaIndex, and the idea is

00:15:03 that the index is essentially just your vector database, which
is just a database that is going to contain all of the numerical
representations of all of your nodes, and thus of all of your
data. This index is what you're going to be able to query in order to get the
information relevant to whatever you're looking for, and in this case
it is going to be queried via your router and your retrievers. Let's take
a look at what these are right now, but

00:15:35 for now just keep in mind that the index is a concept that you will see
referred to several times during this miniseries and also inside of the LlamaIndex
documentation, so that is what they mean when they say index. Now let's
take a look at the retrieval process. So now we get to the part of the retrieval,
and the retrieval is essentially when the client sends a query to your API or to
your application and you want to find the documents that are relevant to
this query. Now, the idea is that you have this

00:16:15 thing that they call the router right here, which is another core
component of LlamaIndex, and the idea of this router is that it is going to decide
which retriever it is going to use to get information from your index. Okay, now
each one of these retrievers is of course another component of LlamaIndex, and
the idea is that each one of these retrievers may use a different strategy
to query your index. Okay, so the router will analyze your query and will then route it
to the most appropriate retriever, which will find the

00:16:55 most relevant information, and then these retrievers are going to get
back the relevant information from your index, which is the information from your
custom data. That is kind of the job of the router and the
retrievers. And then we have our response synthesizer right here, which is another
component of LlamaIndex, which will essentially just put together the documents
that you got from your retrieval, combine them with another prompt,
send them to your language model, and then get a response

00:17:31 with your enriched information. Okay, so that is essentially how
LlamaIndex deals with a very simple and very regular RAG pipeline, and as you can see it
is very similar to what you would do in, for example, LangChain, but I wanted to go
over this all over again while highlighting the specifics of how LlamaIndex does
it, so now you know what they are actually talking about when you see
response synthesizer or router or index or data connector in the LlamaIndex
documentation or even in some other videos in this miniseries.
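
Editor's sketch: the embed, index, retrieve, and synthesize steps described above can be imitated with the standard library alone. Everything here is a toy stand-in, not LlamaIndex code: the "embedding" is a bag-of-words count rather than a learned model, `ToyIndex` is a plain list rather than a vector database, and the router and the actual language model call are left out. All names are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (real embedding models return dense float vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Stores each node's text next to its vector, like a tiny vector store."""

    def __init__(self, node_texts: List[str]) -> None:
        self.entries = [(text, embed(text)) for text in node_texts]

    def retrieve(self, query: str, top_k: int = 1) -> List[str]:
        """Retriever step: return the `top_k` node texts most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(
            self.entries, key=lambda entry: cosine(query_vector, entry[1]), reverse=True
        )
        return [text for text, _ in ranked[:top_k]]


def build_prompt(query: str, context: List[str]) -> str:
    """Response-synthesis step, minus the LLM call: merge context and question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


index = ToyIndex([
    "legislative powers are vested in a congress",
    "executive power is vested in a president",
    "judicial power is vested in one supreme court",
])
query = "who holds executive power"
context = index.retrieve(query, top_k=1)
prompt = build_prompt(query, context)
print(prompt)
```

In a real pipeline the prompt would be sent to a language model, and a router could choose between several retrievers before the `retrieve` step.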

00:18:15 Okay, so now that we have successfully understood what the
components of LlamaIndex are and what they actually do, we're going to
start implementing them using a little bit of code right here, and you will see that
it is actually very, very simple. So the first thing that you want to do is of course
to install LlamaIndex, and to do that you have to run pip install llama-index -Uq;
-Uq means upgrade and quiet, so as you can see I don't have any output right here.
And after that we're going to initialize my API key from

00:18:49 OpenAI. Now, LlamaIndex actually uses by default the language models from
OpenAI, and to be more specific it uses GPT-3.5 Turbo, which is a very inexpensive
model, so you should be okay. But in case you don't want to use OpenAI, in one
of the future videos in this miniseries I'm going to be showing how to use this with
local models too. So there you go, now we have our OpenAI API key set up correctly.
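
Editor's sketch: the video does not show how the key is stored, so the following is one possible way to do it, assuming the key is read from the `OPENAI_API_KEY` environment variable. The helper name `ensure_api_key` and its parameters are hypothetical.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_api_key(
    name: str = "OPENAI_API_KEY",
    env: MutableMapping[str, str] = os.environ,
    ask: Callable[[str], str] = getpass,
) -> str:
    """Return the key stored under `name`, prompting for it only if it is missing."""
    if not env.get(name):
        # getpass hides the typed value, so the key is not echoed into notebook output.
        env[name] = ask(f"Enter {name}: ")
    return env[name]


# In a notebook, after installing the package (pip install llama-index -Uq):
# ensure_api_key()
```

Prompting instead of hard-coding keeps the key out of the notebook file itself; the `env` and `ask` parameters exist only so the helper can be exercised without touching the real environment.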
Now it is time to actually start implementing this pipeline that we saw right here,
and LlamaIndex actually

00:19:28 has this very, very famous five-liner, one of which they're very proud,
which in five lines of code essentially just implements the entire
pipeline that you have right here, so let's take a look at that. So the first
thing that we're going to want to do is create a new folder called
data where we're going to store all of the documents that we're going to be
ingesting, these documents that were right here. And in order to do that, actually I'm
going to be importing the Constitution of the USA, and that's

00:20:02 all that I'm going to be adding right here, so I have the
constitution.pdf, just a quick reference to an old video that I did about a year ago
about RAG as well. So there you go. I mean, I am only going to be adding one
single document right here, but just to be clear, you can add as many as you want
inside of the data folder right here and they're all going to be ingested as part
of your pipeline. So there you go, so now this very famous five-liner. So from
llama_index.core import VectorStoreIndex

00:20:41 and SimpleDirectoryReader. So these are the two classes that we're
going to be using. VectorStoreIndex is essentially the index that you saw right
here, so in other words your vector database, and we have SimpleDirectoryReader,
which is the ingestion class that
is going to allow you to ingest all of the data inside of your data folder right
here. So there we go, the first line right here is going to be initializing all of
our documents. So with this very simple line, what

00:21:25 we're doing is we're generating the documents from our PDFs, and we're
using this SimpleDirectoryReader as data connector. Now, the
SimpleDirectoryReader takes as parameter the folder where all of
your documents are. And in case you're using Colab, if you
want to see where you are and what documents you have access to right now,
you can run ls like this, and then you can see that you're already in the working
directory where your data folder is. So you can do

00:22:08 cd data and then see what's inside of data, and then you can see that,
well, apparently that's not how you concatenate commands here, but that's okay. cd
data and ls, let's see, does that work better? Yeah, so there you have it. So we're
inside /content/data, and then we have that constitution.pdf is inside of here. So
just by passing data like this (or you could also do it like this), you're going to be
passing it the data folder, and then we just run the load_data method
right there. Then the second line of this five-liner is creating

00:22:57 your index, so you're going to be generating your index from documents.
So you're going to be creating this VectorStoreIndex from the documents; this part
is actually very similar to what you would find in another library like LangChain.
And then you will be able to query this index by converting it into a query engine,
and this thing right here is essentially just initializing the retrievers. In
this case we're only going to have one retriever, so we're not going to be using a
router as we

00:23:30 specified right here, but essentially that is what we're doing right now.
And this query engine is already one that you can ask questions to, so let's query
it using this line right here. So the response is going to be equal to
query_engine.query, and then you pass in your query, and then we can just print our response.
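
Editor's sketch: the five-liner narrated above, written out. The class and method names are the ones the speaker reads aloud; wrapping them in a function called `ask_folder` and importing inside it are choices made here for illustration, not part of the video. Running it requires the `llama-index` package, network access, and a valid `OPENAI_API_KEY`.

```python
def ask_folder(question: str, data_dir: str = "data") -> str:
    """Index every file under `data_dir` and answer `question` from it."""
    # Imported inside the function so this file loads even before
    # llama-index is installed; top-of-cell imports work just as well.
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    # 1. Data connector: turn the files in the folder into documents.
    documents = SimpleDirectoryReader(data_dir).load_data()
    # 2. Build the index (nodes and embeddings are created along the way).
    index = VectorStoreIndex.from_documents(documents)
    # 3. Wrap the index in a query engine (retriever plus response synthesis).
    query_engine = index.as_query_engine()
    # 4. Retrieve relevant chunks and ask the language model.
    response = query_engine.query(question)
    # 5. The response object prints as the answer text.
    return str(response)


# Example call:
# print(ask_folder("What is the first article about?"))
```

Note that `data_dir` is resolved relative to the current working directory, so the call fails if the notebook has changed directory since the folder was created.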
And let's see what that is. So we're going to be asking about the first article
of the Constitution of the USA. So we're importing all of this from LlamaIndex, and
apparently there was a

00:24:09 problem locating our data directory: data does not exist. Apparently
it does not exist? What do you mean? Or maybe we're still in data. So yeah, there we go,
so we're back. Okay, so this actually persists in the terminal, so when I moved
myself inside of data I was running all of this inside of data. So now if we do this
again it should be okay. It ingested all the documents inside of data, and there
we go: the first article is about the establishment of the legislative powers, etc. So
there you go, I mean, that's the famous

00:24:49 five-liner from LlamaIndex, and now you have this very quick introduction
to how to do this. Just a quick note that inside of data you can add pretty much
any kind of file. Let me grab you a list
right here. So here is the list of all of the files that you can add inside of your
data folder, so it can be a CSV, a Word file, an EPUB, etc. I mean, it is able to ingest
all of these formats. As you may notice, LlamaIndex is becoming very, very good at
unstructured data ingestion,

00:25:30 so I mean, they even created this fabulous LlamaParse API for that.
So yeah, I mean, they're very strong at this, and this is all done automatically: you
don't have to load CSV loaders by hand, a Jupyter notebook loader by hand,
and so on; everything right here is going to be dealt with automatically. And you can
of course get more detailed information about this if you go to the LlamaIndex
documentation. Let me just do this like this, going to paste this, yeah: the LlamaIndex
documentation, SimpleDirectoryReader, and here you

00:26:14 have all of the supported file types, and they also have a specific JSON
loader, and of course they come from LlamaHub, as we mentioned right
here. Okay, so that was the famous five-liner. Let me just show you how to make the
data persistent as well. So if you want to make your data persistent: this index that
you see right here is the one that you're going to be querying every single time
that you're going to be asking your questions to your language model, and if we run
this again we're going to be redefining our

00:26:48 index, so every single time we're going to be ingesting all of the
documents inside of the data directory. And that's okay if you only
have one document like this, but if you have several documents you probably don't
want to ingest them and do all of this every single time you run your
pipeline. So the idea right here is just, instead of storing this in memory,
you're going to make it persistent, and in order to do that they have this very
quick script. Let me show you this very quick script to show you how

00:27:27 they do it; I mean, this comes of course directly from their
documentation. So as you can see, they're importing os.path, essentially just to
deal with the paths in your file system. Then from llama_index.core they're importing
again VectorStoreIndex and SimpleDirectoryReader, but this time they're going to be
importing StorageContext as well, and load_index_from_storage as well, in
case you have already created your storage context. And essentially this just makes
your vector database persistent in your local

00:28:02 machine. Here they're initializing the directory where the local
storage is going to be stored. So here, as you might remember, we have the data
directory right here, so this is going to be a directory added right underneath
right here. And so here we essentially just test if the path exists for this one, and
if it doesn't exist we're going to be creating that persistent storage. So if
storage does not exist, it essentially means that we have not yet created our index, so
we do that: we do documents, we ingest the

00:28:40 documents, we create our index with exactly the same line that we
created it with up here, and then we just store it for later in the persistent directory,
and this is the function that you use in order to do that. So if you were doing this
in, for example, your own application or your server, this is essentially the function
that you would use to store this locally, I mean, in your server. Okay, so index
(which is this one), storage_context, persist, and then you pass in the persist
directory, which in this case is

00:29:15 going to be storage. And then if the path already exists,
this one right here is going to be triggered, the else statement. And if
the path already exists, it means that we have already ingested the documents and
they have already been stored in our local storage, so what we do is essentially we
just load the index from storage. So we say StorageContext.from_defaults, we pass in the
persistent directory, then we load the index, and then either way we can now query the index
just like we did before. Okay, so that's a very, very

00:29:53 quick explanation of what's actually going on and how to store
this. Let's ask about the third article right now, real quick. Here you can see, I mean,
it saw that we do not yet have a storage directory, and then the third article is about
the terms of the President, Vice President, etc. And if we go right here you can see
that the storage directory has been created, and inside of it is of course all of
the necessary information for LlamaIndex to retrieve information from our
personal data store, which is right

00:30:35 here. Before we leave, actually, I wanted to show you a little
bit about the documents that we ingested from here. The idea is that we
talked a lot about what the documents are in the pipeline right here, and I told you
that they are programming objects, etc., but I didn't actually show you what they look
like. So here we're going to be tapping into the documents that we ingested right
here from the SimpleDirectoryReader, which essentially converted our PDF into
several documents, and we're going to tap into it

00:31:09 first. But first of all let's take a look at its length: the length of
the documents is 19, so we got 19 documents for our PDF file. And let's take a look
at one of them to see how it is composed, so let's look at the fourth one. As
you can see, it's a Document object: it has an ID property, an embedding property,
and then it has the metadata right here, and this property is very
important because it tells you the original file where it came from. So you
have the page label where

00:31:48 the contents of this document come from, you have the file name to
say that it actually came from the constitution.pdf file, and you even have the
complete file path to the file in your server or wherever you're persisting
these files. Then you have the file type, the file size, the creation date of your file,
the last modified date of your file. So yeah, I mean, of course you
can see why this would be very important if you were implementing this
functionality in your application, and

00:32:19 then right here you have the property of text, which essentially
contains all of the raw text of your file. Okay, now in this case it's not super
pretty, it has a lot of line breaks like this, but it is good enough for a language
model to interpret it. And there you go, so that is what the document is and what it
looks like. So let me show you what the text itself looks like: text, just going to
print it, and there you go. So here's all of the text, so as you can see we have
the title, Section 7, and here you

00:33:00 have all of the text from Section 7. Okay, all right now.
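
Editor's sketch: the inspection steps just described (count the documents, pick the fourth one, look at its metadata and text) can be collected into a small helper. The helper name `describe` and the stand-in objects are assumptions made so the example runs without any files; with real data you would pass the list returned by `SimpleDirectoryReader(...).load_data()`, whose items expose `.metadata` and `.text` as shown in the video.

```python
from types import SimpleNamespace
from typing import Any, List


def describe(documents: List[Any], position: int = 3, preview_chars: int = 80) -> str:
    """Summarise one loaded document: total count, its metadata, and a text preview."""
    doc = documents[position]  # position 3 is the fourth document
    lines = [f"{len(documents)} documents loaded"]
    lines += [f"{key}: {value}" for key, value in doc.metadata.items()]
    # Flatten line breaks so the preview fits on a single line.
    lines.append("text: " + doc.text[:preview_chars].replace("\n", " "))
    return "\n".join(lines)


# Stand-in objects with made-up values, shaped like the documents in the video.
fake_documents = [
    SimpleNamespace(
        text=f"Page {n} text\nwith line breaks",
        metadata={"page_label": str(n), "file_name": "constitution.pdf"},
    )
    for n in range(1, 6)
]
print(describe(fake_documents))
```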
Something else that I wanted to show you is how to do pretty much the same thing
but using their API on LlamaParse. Okay, so essentially remember that right here we
ingested the constitution.pdf file and we converted it essentially to raw text.
Okay, that was pretty convenient, but the constitution.pdf file that I am using
right here is actually a very simple file, it contains only text. In case you have
a more complicated file, let's say that you have tables and images, etc.,

00:33:36 this regular SimpleDirectoryReader is not going to do a great job
with that. So what you can do instead of using this SimpleDirectoryReader is use
the LlamaParse service that LlamaIndex introduced a few months ago. The idea
is that you go to the LlamaIndex website, you click here on sign up so that you get access to
their LlamaCloud platform, and you're going to be prompted to create an account if
you don't already have one. And once you're in here, what you're going to want to do
is first create an API key, of

00:34:10 course, and once you have that you're going to be able to ingest
up to a thousand pages per day of complex documents. So the
idea is that instead of using this SimpleDirectoryReader you're going to be
sending your documents to their API, and they're going to be returning you the
organized or even structured text that came from your unstructured file. So let
me show you real quick how this works. I actually made a video about LlamaParse in
the past, but it works pretty

00:34:43 well for ingesting complicated formats. So I'm just going to copy the
API key and implement this right here with essentially the same five-liner
that I had right here. So what we're going to do is replace this line that takes all
the documents, and we're going to replace it with this other line right here, which
essentially just does the same thing, but instead of using the SimpleDirectoryReader
we're going to be using LlamaParse. And in order to do that you're going to
have to do from llama_parse

00:35:26 import LlamaParse, and there you go. But actually, in
order to use this you're going to have to add the API key that I showed you how to
get just a moment ago, and in order to set the API key you can
just set your LLAMA_CLOUD_API_KEY environment variable. Actually, here is just a
quick mistake, just going to get the password right here; I'm going to copy and paste
it from what I created up here at LlamaCloud. Now I have my API key set to the
LLAMA_CLOUD_API_KEY variable, and now

00:36:06 actually something else that I should probably do, now that I am working
in a Jupyter notebook: since LlamaCloud, and LlamaParse in particular, is an async-first
API, I'm going to have to run this nest_asyncio, just so the async
API can run inside the notebook, and there we go. So now that we have set our API key and our nest_asyncio, we
should be able to call LlamaParse, and as you can see we just specify the type of
data that we want in return (in this case we just want the text from our file), and
then we load the data, and in order to load it we're

00:36:49 going to pass in the path to our file, so in this case it was stored inside
of data, and if I remember correctly it's called constitution.pdf, so
constitution.pdf, there you go, and now this should be able to work. And as you can
see, it is ingesting the documents, indexing, and then again we have the response
to what the first article is about: the first article discusses the principles, etc.
And as you may be able to see right here, my thousand-pages-per-day usage should be
updated. Let's see... well, it's not

00:37:32 updated automatically apparently, but yeah, I mean, I just used, I don't
know, probably 19 pages off of my thousand pages per day. And there you go. Let me
show you what the documents from this API look like. As you can
see, it's pretty much the same thing, but also, as you can see, the text is a
little bit cleaner. So let's see the text; let me just use print to make it look
better. There we go, so yeah, here we have it: I mean, the fourth document
in this case starts at this

00:38:18 paragraph right here with Section 7, and as you can see it's a
little bit cleaner. I mean, it doesn't really make a difference for this file in
particular, because this is a very easy file, the constitution.pdf, but yeah, I mean,
you can use this for more complicated files as well. So there you go, that is just,
I felt like I needed to include a quick introduction to LlamaParse in this
LlamaIndex introduction video. So that's essentially how LlamaIndex works. I mean,
this is a much simpler

00:38:54 implementation of what you would probably do in production, but of
course this is an introductory video and I just wanted to introduce you to a very
quick RAG example in LlamaIndex. Okay, in the next videos in this series we're
going to be covering much more detailed and advanced topics of LlamaIndex, but
please let me know if you have any questions, and if you're excited about more
content about LlamaIndex I'll be happy to be making more of it. So yeah, thank you
very much for watching. If you

00:39:31 have any questions, suggestions, or even corrections, please be sure to
leave them in the comments below, and I will see you next time. [Music]
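
Editor's sketch: for reference, the two later steps of the walkthrough (persisting the index, and swapping the reader for LlamaParse) can be written roughly as below. The library names are the ones read aloud in the video; the function names `load_or_build_index` and `parse_with_llamaparse`, the default paths, and the function-local imports are choices made here for illustration. Running either function requires the `llama-index`, `llama-parse`, and `nest_asyncio` packages plus valid `OPENAI_API_KEY` and `LLAMA_CLOUD_API_KEY` values.

```python
import os


def load_or_build_index(data_dir: str = "data", persist_dir: str = "./storage"):
    """Build the index once, save it to disk, and reload it on later runs."""
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if not os.path.exists(persist_dir):
        # First run: ingest the documents, build the index, store it for later.
        documents = SimpleDirectoryReader(data_dir).load_data()
        index = VectorStoreIndex.from_documents(documents)
        index.storage_context.persist(persist_dir=persist_dir)
    else:
        # Later runs: skip ingestion and load the stored index instead.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
    return index


def parse_with_llamaparse(file_path: str):
    """Send one file to LlamaParse and return the parsed documents as plain text."""
    import nest_asyncio
    from llama_parse import LlamaParse

    # Lets the async client run inside a notebook's already-running event loop.
    nest_asyncio.apply()
    # The API key is read from the LLAMA_CLOUD_API_KEY environment variable.
    parser = LlamaParse(result_type="text")
    return parser.load_data(file_path)


# Example calls:
# index = load_or_build_index()
# print(index.as_query_engine().query("What is the third article about?"))
# documents = parse_with_llamaparse("./data/constitution.pdf")
```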
