Introduction to Docs and Image-Based Voice Chatbots
The project focuses on creating a voice chatbot that can read and understand
documents, such as PDFs and images, and respond to voice queries.
It aims to enhance user interaction with technology through natural language
processing and optical character recognition, making the chatbot a smart
conversational agent for various applications.
SUBMITTED BY :
RAKESH H R 1BM21EC413
SHIVANI S NAIK 1BM20EC142
VANSH JAIN 1BM20EC183
JAIDEEP A HEGWAD 1BM20EC059
Problem Definition
Integrating voice interaction
The project confronts the significant challenge of integrating voice interaction
with the ability to process and interpret both textual and visual data from PDF
files and images.
Documents and images
Traditional voice chatbots are adept at handling spoken or written queries but
fall short when users need to extract and discuss content from documents and
images.
Accessibility Challenges
This limitation is particularly acute in sectors where information is conveyed through a
combination of text and visuals, such as academic research, technical manuals, and medical
imaging.
Proposed Solution
To address the problem of inefficient and time-consuming document and image
retrieval during voice-based interactions, we propose a comprehensive solution. This
innovative system will leverage advanced natural language processing and computer
vision techniques to seamlessly integrate textual and visual information into a voice
chatbot interface.
Voice Chatbot: Develop a sophisticated voice chatbot that can read, understand, and
interact based on the content of uploaded PDFs and other documents.
Technology Integration: Employ natural language processing (NLP) and optical
character recognition (OCR) to enable document comprehension and voice interaction.
Enhanced Interaction: The chatbot will provide a seamless, responsive conversational
experience, improving user engagement with digital content.
Broad Applicability: This solution has the potential to revolutionize industries such as
education, customer support, and accessibility by providing a more natural
communication interface.
Roadmap
1 Landing Page & Navigation Page
2 Authentication
3 Functionality
4 Payments & Launch
Functionality
PDF Text Extraction: The system reads PDF documents and extracts text using
PyPDF2.
Text Chunking: The extracted text is split into manageable chunks using
LangChain’s RecursiveCharacterTextSplitter.
Vector Store Creation: Text chunks are converted into embeddings and indexed
using FAISS for quick retrieval.
Conversational Chain: A conversational chain is established using Langchain and
Google Generative AI to generate context-aware responses.
User Interaction: Users can interact with the chatbot via a Streamlit interface,
asking questions that the chatbot answers based on the PDF content.
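The text chunking step can be illustrated with a small standard-library sketch. This is not LangChain's implementation: `split_recursive` and `DEFAULT_SEPARATORS` are hypothetical names, and the sketch omits the chunk overlap that the real splitter supports. It only shows the general idea of trying coarse separators first (paragraphs, then lines, then words) and falling back to a hard cut.

```python
from typing import List, Sequence

# Coarse-to-fine separators; "" means "cut by character count" (assumed order).
DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def split_recursive(text: str, chunk_size: int,
                    separators: Sequence[str] = DEFAULT_SEPARATORS) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return split_recursive(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
for chunk in split_recursive(sample, chunk_size=30):
    print(repr(chunk))
```

Keeping paragraph and sentence boundaries intact where possible tends to give the retrieval step more coherent passages than fixed-width cuts would.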
Flow Chart of Functionality
Project Flow
1. Upload PDF: User uploads PDF documents.
2. Extract Text: System extracts text from PDFs using PyPDF2.
3. Split Text: Text is split into chunks for processing.
4. Generate Embeddings: Convert text chunks into embeddings.
5. Create Vector Store: Embeddings are indexed in a vector store using FAISS.
6. User Query: User inputs a question to the chatbot.
7. Retrieve Documents: System retrieves relevant document sections based on the query.
8. Generate Response: Chatbot generates a response using the conversational chain.
9. Display Response: Response is displayed to the user.
10. End: End of the process.
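Steps 4 to 8 of the flow can be sketched with the standard library alone. The sketch below is a simplified stand-in, not the project's code: it replaces the Google Generative AI embeddings with normalized word counts and replaces FAISS with a brute-force cosine-similarity scan. The names `embed`, `ToyVectorStore`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: L2-normalized word counts (stand-in for a learned model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(word, 0.0) for word, value in a.items())


class ToyVectorStore:
    """Brute-force index; a real system would use FAISS for large collections."""

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = list(chunks)
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def search(self, query: str, k: int = 2) -> List[Tuple[float, str]]:
        """Return the k chunks most similar to the query, best first."""
        q = embed(query)
        scored = [(cosine(q, v), c) for v, c in zip(self.vectors, self.chunks)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_prompt(question: str, hits: List[Tuple[float, str]]) -> str:
    """Assemble the retrieved chunks and the question into one prompt string."""
    context = "\n".join(chunk for _, chunk in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


store = ToyVectorStore([
    "FAISS indexes embeddings for similarity search.",
    "Streamlit renders the web interface.",
    "PyPDF2 extracts text from PDF pages.",
])
question = "How is text extracted from a PDF?"
print(build_prompt(question, store.search(question, k=1)))
```

In the actual system the prompt would be passed to the conversational chain; here it is only printed so the retrieval result can be inspected.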
Architecture
1. User Interface (UI): This is where users interact with the chatbot through voice commands or text
input. It’s designed to be intuitive and user-friendly.
2. Voice Recognition: When a user speaks, this component converts the spoken words into text using
speech-to-text technology.
3. Text Processing: This core part uses natural language processing (NLP) to understand the user’s
intent and context from the text.
4. Document Processing:
• PDF Processing: Extracts text from PDF files using OCR technology.
• Image Processing: Analyzes images to understand content like charts or graphs.
5. Dialogue Management: Manages the conversation flow, deciding how the chatbot should respond
based on the user’s queries and the information extracted from documents and images.
6. Response Generation: Uses NLP to create a natural and relevant response, which is then converted
from text to speech if needed.
7. Learning Component: Gathers data from interactions to improve the chatbot’s performance over
time.
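One way to picture how components 2 through 6 fit together is as a chain of pluggable stages. The sketch below is illustrative only: `ChatbotPipeline`, its field names, and the stub stages in `make_demo_pipeline` are hypothetical, and the stubs stand in for real speech-to-text, retrieval, language-model, and text-to-speech services.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChatbotPipeline:
    """Wires the architecture's stages together; each stage is a plain callable."""
    speech_to_text: Callable[[bytes], str]                 # Voice Recognition
    retrieve: Callable[[str], List[str]]                   # Document Processing
    generate: Callable[[str, List[str], List[str]], str]   # Response Generation
    text_to_speech: Callable[[str], bytes]                 # spoken output
    history: List[str] = field(default_factory=list)       # Dialogue Management

    def handle_voice_query(self, audio: bytes) -> bytes:
        question = self.speech_to_text(audio)
        passages = self.retrieve(question)
        answer = self.generate(question, passages, self.history)
        # Remember the turn so later answers can take earlier ones into account.
        self.history.extend([question, answer])
        return self.text_to_speech(answer)


def make_demo_pipeline() -> ChatbotPipeline:
    """Build a pipeline from stub stages (assumptions, not real services)."""
    return ChatbotPipeline(
        speech_to_text=lambda audio: audio.decode("utf-8"),
        retrieve=lambda question: ["PyPDF2 extracts text."],
        generate=lambda q, passages, hist: f"{passages[0]} (turn {len(hist) // 2 + 1})",
        text_to_speech=lambda text: text.encode("utf-8"),
    )


pipeline = make_demo_pipeline()
print(pipeline.handle_voice_query(b"How is text extracted?"))
```

Treating each stage as a callable keeps the components independent, so a different recognizer or language model could be swapped in without touching the rest of the flow.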
Architecture of CHATBOT
Technologies Used
Streamlit: For creating the web application interface.
PyPDF2: To read PDF files and extract text.
Langchain: For text splitting and managing conversational chains.
Google Generative AI: For generating embeddings and responses.
FAISS: For efficient similarity search and indexing of text chunks.
Dotenv: For managing environment variables.
LITERATURE SURVEY
1. Author: M. A. Khadija, A. Aziz
   Title: Designing a PDF-Driven Chatbot powered by OpenAI ChatGPT
   Venue: 2023 International Conference on Computer
   Outcome: (1) Development of a PDF-Driven Chatbot using Generative AI.
   (2) Utilization of LangChain Framework, ChatGPT (GPT-3.5 Turbo), and Pinecone
   for response generation. (3) Successful demonstration of the chatbot's ability
   to provide coherent responses aligned with the content of PDF documents.
   Drawback: (1) E-books are perceived as uncomfortable for prolonged reading
   sessions. (2) Potential limitations in accessibility and readability for some
   users.
2. Author: Semmy Wellem Taju, Andria Kusuma Wahyudi, Green Ferry Mandias,
   Reymon Rotikan, Jimmy Herawan
   Title: AI-powered Chatbot for Information Service at Klabat University by
   Integrating OpenAI GPT-3 with Intent Recognition and Semantic Search
   Venue: 2023 5th International Conference on Cybernetics and Intelligent System
   Outcome: (1) Introduction of Unklabot 1.0, showcasing innovative integration of
   advanced AI technologies for information services within Klabat University.
   (2) Improved accuracy and efficiency in question answering capabilities through
   the integration of intent recognition and semantic search techniques.
   Drawback: (1) Dependency on external API (OpenAI GPT-3) might lead to potential
   limitations or disruptions in service if the API becomes unavailable or
   undergoes changes. (2) Lack of discussion on potential privacy or security
   concerns associated with using an external AI model for handling
3. Author: Max Dean, Michael F. McTear, Raymond R. Bond and Maurice D. Mulvenna
   Title: An AI Chatbot for Interacting with Academic Research
   Venue: 2023 31st Irish Conference on Artificial Intelligence
   Outcome: (1) Development of a large language model (LLM) augmentation chatbot
   tailored for computer science research queries. (2) Embedding of around 200,000
   computer science research papers from arXiv, resulting in ~11 million vectors.
   Drawback: (1) Dependency on arXiv restricts the diversity of papers and may
   limit the applicability of the chatbot to broader research domains.
   (2) Limited testing scope with only 30 sample questions may not fully capture
   the breadth of inquiries in computer science.
4. Author: T.-H. Kim, S. Cho, S. Choi, S. Park and S.-Y. Lee
   Title: Emotional Voice Conversion Using Multitask Learning with Text-To-Speech
   Venue: 2020 IEEE International Conference
   Outcome: (1) Introduction of a voice converter using multitask learning with
   text-to-speech (TTS). (2) Multitask learning aids in capturing linguistic
   information and maintaining training stability.
   Drawback: (1) Previous VC methods based on seq2seq models risk losing
   linguistic information. (2) Textual supervision attempted to address this but
   required explicit alignment, nullifying the benefits of seq2seq models.
Efficient Indexing: Pinecone uses advanced indexing techniques optimized for
high-dimensional vector embeddings, enabling fast similarity search.
Scalability: Pinecone is built to handle large-scale deployments, allowing you
to store and search billions of vectors with low latency.
API Integration: Pinecone provides easy-to-use APIs for inserting vectors,
querying for nearest neighbors, and managing indexes.
5. Author: T. N. Thi, T.-H. Do and M. Yoo
   Title: Implementation of OCR system on extracting information from Vietnamese
   book cover images
   Venue: 2023 International Conference
   Outcome: (1) Successful development of an OCR system tailored for Vietnamese
   book cover images. (2) Demonstrated effectiveness of EAST and SAST for text
   detection, and CRNN, SVTR, Transformer OCR for text recognition.
   Drawback: (1) Limited exploration of alternative OCR models beyond those
   mentioned. (2) Lack of comparative analysis between different combinations of
   text detection and recognition models.
6. Author: R. Vannala, S. B. Swathi and Y. Puranam
   Title: AI Chatbot For Answering FAQ’s
   Venue: 2022 IEEE 2nd International Conference
   Outcome: (2) Proposed system bridges the gap between traditional FAQ systems
   and image-based question answering. (3) Enhancement of user experience through
   seamless integration of AI-driven responses and human agent intervention.
   Drawback: (1) Reliance on human agents for unsatisfactory responses may hinder
   scalability. (2) Potential limitations in the chatbot's ability to accurately
   interpret complex image-based queries.
7. Author: V. Velasco, K. Dedy Setiawan, R. Robert Sanjaya, M. Susan Anggreainy
   and A. Kurniawan
   Title: AI Chatbot Technology to Predict Disease
   Venue: 2023 4th International Conference
   Outcome: (1) AI chatbots demonstrate significant potential in disease
   prediction, offering valuable support for healthcare professionals.
   (2) Utilization of machine learning algorithms enhances accuracy and speed in
   disease diagnosis.
   Drawback: (1) Limited inclusion of recent studies beyond 2020 may overlook
   emerging advancements. (2) Potential bias in the selection criteria of the
   reviewed journals may affect the comprehensiveness of the analysis.
8. Author: Hrushikesh Koundinya K.; Ajay Krishna Palakurthi; Vaishnavi Putnala
   Title: Smart College Chatbot using ML and Python
   Venue: Conference, July 2020
   Outcome: (1) Implementation of an online chatbot system for Matrusri
   Engineering College. (2) Investigation into the role of AI and ML in improving
   service delivery, particularly through chatbots.
   Drawback: (1) Dependency on a database created by human experts limits
   scalability. (2) Potential lack of adaptability to diverse user inputs beyond
   the trained responses.
Conclusion:
The voice-based chatbot project extends human-computer interaction beyond typed
queries.
By integrating voice commands with the ability to process and understand content
from both images and PDF files, this chatbot transcends traditional text-based
systems.
It offers a versatile and dynamic tool that caters to a wide range of applications,
from educational resources to technical support and beyond.
The project’s success lies in its innovative approach to combining OCR and image
recognition with NLP, providing users with an intuitive and efficient way to access
and interact with information.
As we look to the future, the potential for further development and integration into
various industries holds the promise of transforming how we engage with digital
content.