Chunking in RAG

- What is Chunking in RAG?
- Why do we need Chunking in RAG?
- Types of Chunking in RAG
- Strategies for Chunking in RAG
- Key Considerations for Implementing Chunking in RAG
- Best Practices for Chunking in RAG
- Learn to Build RAG systems with ProjectPro!
- FAQs

What is Chunking in RAG?


Chunking in RAG refers to dividing large sets of information into smaller, more manageable
pieces or "chunks." It is a fundamental process that enhances the model’s ability to
understand, process, and generate responses by breaking down complex information into
digestible parts. Here's how it works:

A. Retrieval Phase
In the initial phase, the system retrieves relevant documents, data points, or pieces of
information from a vast corpus. This retrieval is based on the input prompt and aims to
gather the most appropriate chunks of information that can help answer the query.
B. Chunking Phase
Once the relevant information is retrieved, it is divided into smaller, coherent chunks. This
segmentation is essential because it allows the system to handle and process the data in
parts rather than as a whole, which would be computationally intensive and less efficient.
C. Generation Phase
The generative model then uses these chunks to produce a response. Integrating these
smaller pieces of information allows the model to generate a more accurate and
contextually relevant answer. The chunking method ensures that each piece of information
is given appropriate attention, maintaining context and coherence in the final response.
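To make these three phases concrete, here is a minimal, self-contained sketch of the flow described above. It is a toy illustration, not a production pipeline: the embed function below is a stand-in for a real embedding model, and the generation step is simulated by assembling the retrieved chunks into a prompt.

# A toy sketch of the retrieve -> chunk -> generate flow described above.
# The embedding and generation steps are stand-ins for real models.
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Overlap-based similarity between two bag-of-words vectors."""
    return sum((a & b).values())

corpus = [
    "Chunking splits large documents into smaller pieces.",
    "RAG systems retrieve relevant chunks before generating an answer.",
    "Overlap between chunks helps preserve context at boundaries.",
]

query = "Why do RAG systems use chunks?"
query_vec = embed(query)

# Retrieval phase: rank chunks by similarity to the query.
ranked = sorted(corpus, key=lambda c: similarity(query_vec, embed(c)), reverse=True)
top_chunks = ranked[:2]

# Generation phase: a real system would pass this prompt to an LLM.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
print(prompt)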
With this picture of how chunking fits into the pipeline, it's time to learn why the
technique is necessary for enhancing the performance and accuracy of these systems.
Why do we need Chunking in RAG?
Chunking is essential in RAG systems as it significantly enhances their ability to process,
retrieve, and generate relevant information efficiently and accurately. It plays a crucial role
in retrieval-augmented generation (RAG) for several reasons:
1) Efficient Information Processing
RAG systems often retrieve large amounts of data or documents from external sources to
generate responses. Chunking breaks down this retrieved information into smaller,
manageable segments. This segmentation allows the system to process and analyze each
chunk independently, improving computational efficiency and reducing the complexity of
handling large datasets.
2) Contextual Relevance
By dividing the retrieved information into chunks, RAG systems can maintain context and
relevance throughout the generation process. Each chunk represents a coherent unit of
information that can be integrated into the response generation. This ensures that the
generated responses are accurate and contextually appropriate, enhancing the overall
quality of the system's output.
3) Integration of Multiple Sources
Chunking facilitates the integration of information from multiple sources or documents
retrieved during the retrieval phase. The system can effectively combine insights from
different chunks to provide a comprehensive and well-rounded response. This capability is
particularly beneficial in knowledge-intensive tasks where diverse sources of information
are required to address complex queries or generate informative content.
4) Scalability and Performance
Handling large volumes of data efficiently is crucial for the scalability and performance of
RAG systems. Chunking enables the system to scale effectively by processing information in
smaller increments, optimizing memory usage and computational resources. This scalability
ensures the system can handle increasingly complex queries and promptly generate
responses.
Chunking in RAG thus enhances the system's ability to manage and utilize retrieved
information effectively, leading to improved accuracy, contextual understanding, and
performance in natural language processing projects. Let us now look at the different types
of chunking in RAG.
Types of Chunking in RAG
Understanding the different types of chunking in RAG is crucial for optimizing the retrieval
and generation processes and ensuring the system can handle various input data effectively.
Let's examine the various chunking strategies and their advantages to help determine when
to use them.
1) Fixed Size Chunking
Fixed Size Chunking divides text into chunks of a fixed number of tokens. It can include an
optional overlap between chunks to maintain context. This method is computationally
efficient and easy to implement, making it suitable for most NLP applications with relatively
uniform text where context preservation across boundaries is not critical.
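As a rough sketch, fixed-size chunking with overlap can be expressed in a few lines of plain Python (here splitting on characters; a token-based variant works the same way):

def fixed_size_chunks(text, chunk_size=100, overlap=20):
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap  # each new chunk starts this many characters later
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

chunks = fixed_size_chunks("The quick brown fox jumps over the lazy dog.",
                           chunk_size=10, overlap=5)
print(chunks)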
2) Recursive Chunking
Recursive Chunking splits the text into smaller chunks iteratively, using a hierarchical
approach with different separators or criteria. Initial splits are made using larger chunks,
which are then further divided if necessary, aiming to keep chunks similar in size. This
method maintains better context within chunks and is helpful for complex texts where
contextual integrity is essential.
3) Document Specific Chunking
Document-specific chunking creates chunks by considering the document's inherent
structure, such as paragraphs or sections. This method preserves coherence and the
document's original organization by aligning chunks with logical sections. It is ideal for
structured documents with clear sections, such as technical documents, articles, or reports,
and can handle formats like Markdown and HTML.
4) Semantic Chunking
Semantic Chunking divides the text into meaningful, semantically complete chunks based on
the relationships within the text. Each chunk represents a complete idea or topic,
maintaining the integrity of information for more accurate retrieval and generation. This
method is slower and more computationally intensive but is best for NLP
applications requiring high semantic accuracy, such as summarization or detailed question
answering.
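One common way to implement semantic chunking (one approach among several; this sketch assumes the sentence-transformers library is installed and can download the named model) is to embed each sentence and start a new chunk wherever the similarity between adjacent sentences drops below a threshold:

# Sketch of similarity-based semantic chunking.
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunks(sentences, threshold=0.5):
    """Group consecutive sentences; break where adjacent similarity drops."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity of adjacent sentences (embeddings are normalized).
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks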
5) Agentic Chunking
Agentic Chunking is an experimental approach that processes documents in a human-like
manner. Chunks are created based on logical, human-like decisions about content
organization, starting from the beginning and proceeding sequentially, deciding chunk
boundaries dynamically. This method is still being tested and not widely implemented due
to the need for multiple LLM calls and higher processing costs. It is potentially useful for
highly dynamic and complex documents where human-like understanding is beneficial.
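Because agentic chunking is still experimental, there is no standard implementation. The sketch below only illustrates the control flow; ask_llm is a hypothetical callable standing in for whatever LLM client you use, not a real API:

def agentic_chunks(paragraphs, ask_llm):
    """Sequentially ask an LLM whether each paragraph starts a new chunk.
    `ask_llm` is a hypothetical callable: prompt string in, "yes"/"no" out."""
    chunks, current = [], []
    for para in paragraphs:
        if current:
            prompt = (
                "Current chunk:\n" + "\n".join(current) +
                "\n\nNext paragraph:\n" + para +
                "\n\nDoes the next paragraph start a new topic? Answer yes or no."
            )
            if ask_llm(prompt).strip().lower().startswith("yes"):
                chunks.append("\n".join(current))
                current = []
        current.append(para)
    if current:
        chunks.append("\n".join(current))
    return chunks

Note the cost implication mentioned above: every chunk boundary decision is a separate LLM call.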
LangChain, a framework widely used for building LLM applications, implements the
different types of chunking in RAG we have discussed so far through its various text
splitters. In the next section, we will explore these methods in detail and provide
examples to help you understand their functionality.
Strategies for Chunking in RAG
Various chunking strategies exist in RAG, and they are widely favored among AI engineers
for their robustness and effectiveness in building efficient RAG-based systems. This section
will discuss a few splitter methods offered by LangChain.
1) CharacterTextSplitter
The CharacterTextSplitter is a straightforward method where text is divided into chunks
based on a fixed number of characters. Overlapping characters can be used to maintain
context between chunks. Before we proceed with a detailed example, we need to
understand two terms: chunk size and chunk overlap.
- Chunk size refers to the number of characters in each chunk. For instance, if the chunk size is set to 10 characters, each chunk will contain exactly 10 characters from the text.
- Chunk overlap is the number of characters that overlap between consecutive chunks. This ensures that the context from the end of one chunk carries over to the beginning of the next chunk, which helps preserve the flow of information.
Example
Let's use a simple text to illustrate how chunk size and chunk overlap work:
Text = "The quick brown fox jumps over the lazy dog."
Chunk Size = 10 characters
Chunk Overlap = 5 characters
Here's how the text would be split. With a chunk size of 10 and an overlap of 5, each new chunk starts 5 characters (chunk size minus overlap) after the start of the previous one, so the last 5 characters of every chunk reappear at the start of the next:

Chunk 1: "The quick " (characters 1-10)
Chunk 2: "uick brown" (characters 6-15)
Chunk 3: "brown fox " (characters 11-20)
Chunk 4: " fox jumps" (characters 16-25)
Chunk 5: "jumps over" (characters 21-30)
Chunk 6: " over the " (characters 26-35)
Chunk 7: "the lazy d" (characters 31-40)
Chunk 8: "lazy dog." (characters 36-44; only 9 characters, since the text ends here)
Pros:
- Simple and straightforward to implement.
- Allows fine-grained control over chunk size.

Cons:
- May split text in ways that disrupt semantic meaning.
- Risk of cutting off important information at chunk boundaries.
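In LangChain, this strategy corresponds to the CharacterTextSplitter class. Below is a minimal sketch (assuming the langchain-text-splitters package is installed). Note that the real splitter first splits on a separator and then merges the pieces back together, so chunk_size acts as an upper bound rather than the strict sliding window of the idealized walkthrough above:

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator=" ",   # split on spaces, then merge pieces up to chunk_size
    chunk_size=10,
    chunk_overlap=5,
)
chunks = splitter.split_text("The quick brown fox jumps over the lazy dog.")
print(chunks)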
2) RecursiveCharacterTextSplitter
The RecursiveCharacterTextSplitter is a chunking strategy that recursively divides
text into smaller chunks based on natural language boundaries such as sentences or
paragraphs. This approach aims to maintain semantic integrity and coherence within each
chunk. Here's a detailed explanation to help you understand it better:
- The process begins by defining an initial chunk size based on a specified number of characters or other text units (like sentences or paragraphs).
- This initial chunk size serves as a starting point for further division.
- Once the initial chunk is defined, the RecursiveCharacterTextSplitter algorithm recursively examines the content within each chunk to identify natural language boundaries.
- These boundaries can include punctuation marks (like periods for sentences) or specific tags (like <p> for paragraphs in HTML).
- As the algorithm identifies these boundaries, it adjusts the chunk boundaries accordingly to ensure that each resulting chunk maintains semantic coherence.
- For instance, if the initial chunk contains multiple sentences, the algorithm will split it into smaller chunks at sentence boundaries.
Example
In this example, the RecursiveCharacterTextSplitter divides the text into chunks of
approximately 150 characters each while ensuring a 4-character overlap between
consecutive chunks to maintain context and coherence. The algorithm identifies natural
language boundaries (paragraphs and sentences) and applies recursive division to create
meaningful chunks that preserve the semantic integrity of the original text.
Sample Text:
“Natural language processing (NLP) is a field of artificial intelligence concerned with the
interaction between computers and humans using natural language.
It focuses on the understanding, interpretation, and generation of human language,
allowing computers to understand human language as it is spoken.
NLP involves several challenges such as natural language understanding, natural language
generation, and machine translation.”
Let's assume the total character count for this text is around 430 characters.

Initial chunking: start with the entire text as one initial chunk.

"Natural language processing (NLP) is a field of artificial intelligence concerned with the interaction between computers and humans using natural language. It focuses on the understanding, interpretation, and generation of human language, allowing computers to understand human language as it is spoken. NLP involves several challenges such as natural language understanding, natural language generation, and machine translation."

After applying the RecursiveCharacterTextSplitter:

- Chunk 1 ends at the sentence boundary after "language." (approximately 150 characters).
Chunk 1 = "Natural language processing (NLP) is a field of artificial intelligence concerned with the interaction between computers and humans using natural language."
- Chunk 2 covers the next sentence, ending after "spoken." (approximately 150 characters); the 4-character overlap carries the tail of Chunk 1 into its start.
Chunk 2 = "It focuses on the understanding, interpretation, and generation of human language, allowing computers to understand human language as it is spoken."
- Chunk 3 covers the final sentence, again carrying a 4-character overlap from Chunk 2.
Chunk 3 = "NLP involves several challenges such as natural language understanding, natural language generation, and machine translation."
Pros:
- Dynamically adjusts chunk boundaries based on text structure, such as sentences and paragraphs.
- Helps maintain semantic coherence within chunks.

Cons:
- More complex to implement compared to simple character-based splitting.
- Computationally more expensive due to recursive operations.
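A minimal LangChain sketch (assuming langchain-text-splitters is installed; the separators list below mirrors the default hierarchy of paragraphs, then lines, then sentences, then words):

from langchain_text_splitters import RecursiveCharacterTextSplitter

sample_text = (
    "Natural language processing (NLP) is a field of artificial intelligence "
    "concerned with the interaction between computers and humans using natural language. "
    "It focuses on the understanding, interpretation, and generation of human language."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=4,
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, largest unit first
)
chunks = splitter.split_text(sample_text)
print(chunks)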
3) MarkdownHeaderText Splitter
The MarkdownHeaderTextSplitter is designed to split Markdown documents according to
their header structure (e.g., #, ##, ###). This method keeps header metadata intact, allowing
for context-aware splitting that maintains the document's logical structure, which is useful
for tasks requiring hierarchical organization.
Example
Consider a Markdown document:
# Introduction
This is the introduction text.
## Section 1
Content for section 1.
### Subsection 1.1
Details for subsection 1.1.
## Section 2
Content for section 2.
When split by the MarkdownHeaderTextSplitter, it would produce:
Chunk 1
# Introduction
This is the introduction text.
Chunk 2
## Section 1
Content for section 1.
Chunk 3
### Subsection 1.1
Details for subsection 1.1.
Chunk 4
## Section 2
Content for section 2.
Pros:
- It maintains the logical structure of Markdown documents, including headers, which helps preserve context and meaning.
- Ensures that chunks include relevant header metadata, which can be useful for downstream tasks that require understanding the document's organization.
- This is particularly beneficial for documents with a clear hierarchical structure, such as technical documents, articles, and reports.

Cons:
- Limited to Markdown documents, making it less versatile for other text formats.
- It requires understanding the document's structure, which can be more complex than simple token- or character-based splitting.
- It can be slower than simpler text splitters due to the need to parse and understand the document's structure.
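A sketch of the corresponding LangChain usage (assuming langchain-text-splitters is installed; by default the splitter moves headers into each chunk's metadata, so strip_headers=False is passed here to keep them in the text, matching the example above):

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = """# Introduction
This is the introduction text.
## Section 1
Content for section 1.
"""

headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
for doc in splitter.split_text(markdown_text):
    # Each chunk is a Document carrying its header path as metadata.
    print(doc.metadata, "->", doc.page_content)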
4) TokenTextSplitter
The TokenTextSplitter divides text based on the number of tokens rather than the number
of characters. Tokens are the basic units of text used by language models, which may be
words, subwords, or punctuation marks. Tokens are often approximately four characters
long, so splitting based on token count can better represent how the language model will
process the text. This approach aligns with how many language models process text, as they
typically have a maximum token limit.
Pros:
- Splits text based on the token count, aligning with how many language models, which have context windows based on the token count, process text.
- It provides a more accurate representation of how the language model will process the text since tokens are often approximately four characters long.
- Can handle various text lengths and adapt to different models' token limits, ensuring efficient use of the model's context window.

Cons:
- May split words into subwords, especially with certain tokenization algorithms, potentially leading to less readable chunks.
- The effectiveness of this method depends on the tokenization method used by the model, which might not be uniform across different models.
- Requires a good understanding of how the model generates and uses tokens, adding a layer of complexity compared to simpler splitting methods.
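A minimal sketch with LangChain's TokenTextSplitter, which counts tokens with tiktoken (assuming both langchain-text-splitters and tiktoken are installed):

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # the tiktoken encoding used to count tokens
    chunk_size=10,                # measured in tokens, not characters
    chunk_overlap=2,
)
chunks = splitter.split_text("The quick brown fox jumps over the lazy dog.")
print(chunks)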
5) NLTKTextSplitter
The NLTKTextSplitter leverages the Natural Language Toolkit's robust tokenization
capabilities to split text based on linguistic structures such as sentences or words. This
method ensures accurate sentence and word boundaries using pre-trained models for
various languages. It's highly customizable, allowing new languages or domain-specific
tokenizers to be added.
Pros:
- It uses pre-trained models to detect sentence and word boundaries accurately, which is particularly useful for linguistically complex texts.
- Highly customizable with the ability to add new languages or domain-specific tokenizers.
- NLTK is a well-established library with extensive documentation and community support.

Cons:
- Splits text based on syntactic rules rather than semantic meaning, which might lead to loss of context in some cases.
- It can be slower compared to simpler tokenizers due to its comprehensive linguistic processing.
- It may not be the best choice for tasks that require preserving semantic context over syntactic accuracy.
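A minimal sketch (assuming langchain-text-splitters and nltk are installed; NLTK's punkt sentence-tokenizer data must be downloaded once, and newer NLTK releases may also ask for "punkt_tab"):

import nltk
from langchain_text_splitters import NLTKTextSplitter

nltk.download("punkt")  # one-time download of the sentence tokenizer data

splitter = NLTKTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = splitter.split_text(
    "NLP is a field of AI. It studies how computers process human language. "
    "Sentence boundaries are detected with pre-trained models."
)
print(chunks)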
6) SentenceTransformersTokenTextSplitter
The SentenceTransformersTokenTextSplitter splits text based on token counts measured
with a sentence-transformers embedding model's own tokenizer. This ensures each chunk
fits within the embedding model's context window, so the resulting embeddings represent
the full chunk rather than a truncated version. This method is particularly useful for
applications like question answering and information retrieval, where chunks are
embedded for semantic search and accuracy at the embedding step is crucial.
Pros:
- Ensures that each chunk is semantically meaningful, maintaining the context and integrity of the information.
- It can be adjusted to create chunks of varying sizes based on the embedding model's capabilities.
- Particularly useful for applications requiring high semantic accuracy, such as question answering and information retrieval.

Cons:
- More computational resources are required to generate embeddings and perform semantic analysis.
- The quality of chunking depends heavily on the pre-trained model used, which might not perform equally well across different domains or languages.
- More complex to implement and use compared to simpler, rule-based text splitters.
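A minimal sketch (assuming langchain-text-splitters and sentence-transformers are installed; the splitter counts tokens with the named model's tokenizer, and the limits shown are illustrative):

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=50,  # upper bound matching the embedding model's window
    chunk_overlap=10,
)
chunks = splitter.split_text(
    "Chunking keeps each piece within the embedding model's token limit."
)
print(chunks)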
You may have noticed that the examples for the last three methods are deliberately
minimal sketches. This is intentional, because we encourage you to implement them in
Python and explore their functionality through experimentation with various inputs.
Download the code for free and start exploring right away!
So far, we have explored various types of chunking in RAG and discussed several methods
LangChain offers to implement them. However, determining the most suitable chunking
strategy can be challenging when faced with textual data. To assist you in this process, we
have compiled a list of parameters in the next section that you can analyze to make an
informed decision about which strategy to implement.
Key Considerations for Implementing Chunking in RAG
To ensure effective chunking in RAG, you must consider a few essential factors before
selecting a suitable strategy for your dataset.

- The structure of the text, whether it's a sentence, paragraph, code snippet, table, or transcript, significantly influences how chunking should be applied. Different types of content may require different chunking strategies to ensure coherence and relevance in RAG systems.
- Effective chunking relies on the capabilities of the embedding models used in RAG systems. These models must accurately encode and represent the semantics of each chunk to facilitate meaningful retrieval and response generation.
- Managing chunk overlap is crucial to maintaining context across chunks and avoiding the loss of critical information at chunk boundaries. Overlapping chunks, defined by the number of overlapping characters or tokens, help preserve cross-chunk context and optimize information integration.
- Matching the chunk size with the capacity of the vectorization model is essential. Models optimized for shorter texts, such as BERT, may require smaller, more concise chunks to operate effectively. Adjusting chunk size based on specific RAG application requirements ensures efficient processing and integration of retrieved information.
- LLM context length refers to the amount of text, measured in tokens, that the model considers when processing input data. It affects chunking in RAG by influencing optimal chunk size, managing overlap between chunks, and ensuring computational efficiency during processing (see the sketch after this list). Tuning chunking to the available context length enhances RAG systems' coherence and computational performance.
Now that we've considered the key factors for implementing chunking in RAG, let's explore
the best practices for effectively applying these strategies.
Best Practices for Chunking in RAG
Once you carefully consider the key parameters mentioned above, the next step should be
to follow the best practices for implementing chunking efficiently in RAG systems.
- Choose the Right Chunking Strategy to meet the specific needs and goals of the RAG application. You can pick any of the strategies we have discussed so far, including fixed-size, recursive, document-specific, semantic, or agentic chunking. Adapting the strategy enhances the effectiveness of information retrieval and generation tasks.
- Experiment with Different Chunking Strategies to assess their impact on retrieval precision and contextual richness. Conducting empirical evaluations with different strategies helps identify the approach that optimally balances information completeness and relevance in generating responses.
- Balance Contextual Richness with Retrieval Precision. Contextual richness ensures coherent and relevant responses, while retrieval precision focuses on accurately integrating information from external sources. Adjust chunking parameters iteratively to achieve optimal performance in RAG systems.
These considerations and best practices help maximize the effectiveness of chunking in
retrieval-augmented generation systems, enhancing their capability to process and generate
contextually relevant responses from large datasets.
Learn to Build RAG systems with ProjectPro!
Implementing RAG for LLM systems can be challenging, but there's no need to lose hope.
With ProjectPro, you have a reliable platform that teaches you how to build these systems,
beginning with the fundamentals you may not have fully grasped. ProjectPro offers
subscriptions to solved projects in data science and big data, meticulously prepared by
industry experts. These projects serve as templates to solve practical problems encountered
in real-world scenarios, providing invaluable hands-on experience. ProjectPro also provides
tailored learning paths to enhance your skills based on your proficiency level. Take the first
step today—subscribe to ProjectPro and get started on your journey to mastering AI and
data science.
FAQs
1. What is the chunking technique in RAG?
The chunking technique in Retrieval-Augmented Generation (RAG) involves splitting large
texts into smaller, manageable pieces called chunks. This facilitates efficient information
retrieval and improves the relevance and accuracy of responses generated by language
models, enhancing their understanding and context retention.
2. What are chunks in RAG?
Chunks in RAG are smaller segments of a larger text, typically divided based on characters,
sentences, or paragraphs. These segments help efficiently process, retrieve, and generate
information, allowing language models to handle and understand the input more effectively
by focusing on smaller, contextually relevant pieces.
3. What is chunking in Generative AI?
Chunking in Generative AI refers to dividing input data into smaller, contextually meaningful
units. This technique improves the AI's ability to understand and generate coherent
responses by ensuring each chunk is processed and context preserved, leading to better
overall performance in tasks like text generation and comprehension.
