Keyword-based Search Engine for Text Documents
CS619 Project
Introduction:
This project involves developing a web-based search engine specifically for
plain text documents. The application allows users to upload, index, and
search documents by keywords or phrases, displaying the most relevant
documents first. It includes features like:
• Keyword and Phrase Search: Users can perform searches, with results
ranked by relevance.
• Spell Checker: Automatically detects misspelled words in search queries,
providing suggestions and corrections.
• User Access Control: Account registration is required for document
uploads, while search and viewing are open to all users.
• Custom Dictionaries: Users can add frequently used terms to a personal
dictionary for improved search accuracy.
The project aims to offer a streamlined, user-friendly document search
experience, leveraging Python’s Flask framework for the backend and SQLite
for data storage. It emphasizes efficient data handling and retrieval, making
it a practical and hands-on learning opportunity in information retrieval.
Why Learn Python?
Top Choice for AI, Data Science & Automation
• Python is widely used in high-demand fields like artificial intelligence,
machine learning, data science, and automation.
Simple, Yet Powerful
• Python’s clean, readable syntax makes it perfect for beginners while
offering robust libraries and frameworks for advanced projects.
High Demand in Industry
• Python skills are highly valued by employers across tech roles, from
web development to data analytics.
Hands-On Learning with a Real Project
• By building a search engine in Python, you’ll gain practical experience
that’s both valuable and relevant to modern tech careers.
This project is your chance to master Python and develop skills that
open doors to exciting fields in technology.
Roadmap to learn (part 1):
Step 1: Learn Python
• Focus on the basics: variables, data types, control structures,
functions, and basic libraries. This will build a strong programming
foundation.
• Estimated Time: 6 weeks @ 1 hour daily
• Option 1 (Shradha Khapra – YouTube - Urdu):
https://www.youtube.com/playlist?list=PLGjplNEQ1it8-0CmoljS5yeV-GlKSUEt0
• Option 2 (CodeWithHarry – YouTube - Urdu):
https://www.youtube.com/playlist?list=PLu0W_9lII9agwh1XjRt242xIpHhPT2llg
• Option 3 (Coursera – English):
https://www.coursera.org/learn/python-crash-course
Recommended Book:
Think Python, 2nd edition by Allen B. Downey
https://greenteapress.com/thinkpython2/thinkpython2.pdf
Roadmap to learn (part 2):
Step 2: Learn HTML
• Focus solely on HTML to understand structuring content on a
webpage, covering elements like headers, paragraphs, links, forms,
and lists.
• Estimated Time: 4 weeks @ 1 hour daily
• Option 1 (Complete Coding with Prashant Sir – YouTube –
Urdu):
https://www.youtube.com/watch?v=rklidcZ-aLU
• Option 2 (CodeWithHarry – YouTube – Urdu):
https://www.youtube.com/watch?v=BsDoLVMnmZs
• Additional Support:
https://www.w3schools.com/html/
Roadmap to learn (part 3):
Step 3: Learn CSS
• Focus on styling HTML content. Cover basics like colors, fonts, layout
(box model, flexbox), and responsive design.
• Estimated Time: 4 weeks @ 1 hour daily
• Option 1 (Complete Coding with Prashant Sir – YouTube –
Urdu):
https://www.youtube.com/watch?v=OpWjt_wbV4E
• Option 2 (CodeWithHarry – YouTube – Urdu):
https://www.youtube.com/watch?v=Edsxf_NBFrw&t=501s
• Additional Support:
https://www.w3schools.com/css/
Roadmap to learn (part 4):
Step 4: Learn JavaScript
• Learn JavaScript fundamentals: variables, functions, and basic DOM
manipulation to make web pages interactive.
• Estimated Time: 5 weeks @ 1 hour daily
• Option 1 (Complete Coding with Prashant Sir – YouTube –
Urdu):
https://www.youtube.com/watch?v=cpoXLj24BDY
• Option 2 (CodeWithHarry – YouTube – Urdu):
https://www.youtube.com/playlist?list=PLu0W_9lII9ahR1blWXxgSlL4y9iQBnLpR
• Additional Support:
https://www.w3schools.com/js/default.asp
Roadmap to learn (part 5):
Step 5: Learn Flask
• Start with setting up a simple Flask web server. Learn how to create routes,
templates, and basic form handling to connect Python with HTML.
• Estimated Time: 4 weeks @ 1 hour daily
• Option 1 (CodeWithHarry – YouTube – Urdu):
https://www.youtube.com/watch?v=oA8brF3w5XQ
• Option 2 (Tech With Tim – YouTube - English):
https://www.youtube.com/watch?v=GQcM8wdduLI&list=PLzMcBGfZo4-nK0Pyubp7yIG0RdXp6zklu
• Option 3 (Tech With Tim – YouTube - English):
https://www.youtube.com/watch?v=mqhxxeeTbu0&list=PLzMcBGfZo4-n4vJJybUVV3Un_NFS5EOgX&index=1
Roadmap to learn (part 6):
Step 6: Learn SQL and SQLite
• Learn SQL basics and how to perform CRUD (Create, Read, Update,
Delete) operations. Practice integrating SQLite with Flask to store
data.
• Estimated Time: 4 weeks @ 1 hour daily
SQL (Apna College – YouTube - Urdu):
https://www.youtube.com/watch?v=hlGoQC332VM&t=103s
SQLite (Kite – YouTube – English):
https://www.youtube.com/watch?v=girsuXz0yA8
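The four CRUD operations can be tried with Python's built-in sqlite3 module before touching Flask. This is a minimal sketch; the notes table and its columns are made up for illustration and are not part of the project design.

```python
import sqlite3

# In-memory database, so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

# Create: the "?" placeholder keeps user input out of the SQL text.
cur = conn.execute("INSERT INTO notes (title) VALUES (?)", ("first note",))
note_id = cur.lastrowid

# Read
row = conn.execute(
    "SELECT id, title FROM notes WHERE id = ?", (note_id,)
).fetchone()

# Update
conn.execute("UPDATE notes SET title = ? WHERE id = ?", ("renamed note", note_id))
updated = conn.execute(
    "SELECT title FROM notes WHERE id = ?", (note_id,)
).fetchone()[0]

# Delete
conn.execute("DELETE FROM notes WHERE id = ?", (note_id,))
remaining = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
conn.commit()

print(row, updated, remaining)
```

Replacing ":memory:" with a file name such as "app.db" keeps the data between runs.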
Roadmap to learn (part 7):
Step 7: Learn Whoosh
• Focus on search indexing and querying techniques. Begin with small
examples to create and query a search index, then integrate this
knowledge with Flask.
• Estimated Time: 4 weeks @ 1 hour daily
• Search on YouTube “Whoosh Python” for tutorials
Total Estimated Time: 7-8 months
Roadmap to develop (part 1):
Step 1: Review and Update Packages
• Make sure all necessary packages and libraries (Flask, Whoosh, etc.) are
installed and up to date. SQLite needs no separate install: Python ships with
the sqlite3 module.
• Organize project folders and files, setting up folders for templates, static files,
and configuration files.
• Set up a basic Flask app and run a "Hello, World!" web page to confirm the
environment is working.
• Estimated Time: 1 day
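A "Hello, World!" check for the environment might look like the sketch below. It assumes Flask is installed; the route and function name are illustrative. The test client at the end requests the page without starting a server, which is enough to confirm that Flask imports and routes correctly.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    # The text returned here becomes the body of the web page.
    return "Hello, World!"


# Request the page in-process to confirm the setup works.
with app.test_client() as client:
    response = client.get("/")
    print(response.status_code, response.get_data(as_text=True))
```

To view the page in a browser, save the file as hello.py (an assumed name) and run `flask --app hello run` with Flask 2.2 or later.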
Step 2: Build the Core Backend for Document Storage and Retrieval
• Document Upload: Set up an interface and backend code to allow users to
upload text files. Store the file metadata (like filename and upload date) in
SQLite.
• Text Indexing: Use Whoosh to index uploaded text files. This is where you
will create a search index to make content searchable by keywords.
• Estimated Time: 2 weeks
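The upload step can be sketched as one function that saves the file and records its metadata. The table layout, column names, and the store_document helper are assumptions for illustration, not a prescribed design. In a Flask app the text would come from the uploaded file object instead of a string.

```python
import sqlite3
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def init_db(conn):
    # Hypothetical metadata table: one row per uploaded file.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY,
               filename TEXT NOT NULL,
               uploaded_at TEXT NOT NULL
           )"""
    )


def store_document(conn, upload_dir, filename, text):
    """Write the text to disk and record its metadata; return the new row id."""
    # Keep only the last path component so "../x.txt" cannot leave upload_dir.
    safe_name = Path(filename).name
    (Path(upload_dir) / safe_name).write_text(text, encoding="utf-8")
    uploaded_at = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO documents (filename, uploaded_at) VALUES (?, ?)",
        (safe_name, uploaded_at),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
init_db(conn)
with tempfile.TemporaryDirectory() as upload_dir:
    doc_id = store_document(conn, upload_dir, "notes.txt", "flask makes web apps simple")
    saved_text = (Path(upload_dir) / "notes.txt").read_text(encoding="utf-8")

stored = conn.execute(
    "SELECT filename FROM documents WHERE id = ?", (doc_id,)
).fetchone()
print(doc_id, stored)
```

The same text that is written to disk is what would later be passed to Whoosh for indexing.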
Roadmap to develop (part 2):
Step 3: Develop Search Functionality (Keyword and Phrase
Search)
• Implement the keyword and phrase search functionality. Have
Whoosh handle queries and return results based on relevance.
• Order search results by relevance and prepare to display the top
matches first.
• Estimated Time: 2 weeks
Step 4: Add Spell Checker and Custom Dictionary Features
• Integrate a spell-checking tool (e.g., TextBlob) to underline misspelled
words as users type.
• Implement a right-click menu to display suggested spellings and
provide the option to add custom words to the user’s dictionary.
• Estimated Time: 2 weeks
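The backend side of spell checking with a personal dictionary can be sketched as follows. Note the swap: the project suggests TextBlob, but this sketch uses difflib from the standard library so it runs without extra installs. The word list, the cutoff value, and the check_query helper are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical tiny word list; a real checker would load a full dictionary.
BASE_WORDS = {"search", "engine", "document", "keyword", "phrase", "upload"}


def check_query(query, custom_words=frozenset()):
    """Map each unknown word in the query to a list of suggested spellings."""
    vocabulary = sorted(BASE_WORDS | set(custom_words))
    suggestions = {}
    for word in query.lower().split():
        if word not in vocabulary:
            # Up to 3 known words that look similar enough to the typed word.
            suggestions[word] = get_close_matches(word, vocabulary, n=3, cutoff=0.7)
    return suggestions


typo_result = check_query("serch enginee")
unknown_result = check_query("whoosh engine")
custom_result = check_query("whoosh engine", custom_words={"whoosh"})
print(typo_result)
print(unknown_result)
print(custom_result)
```

Once a word is in the user's custom dictionary it is no longer flagged, which is the behaviour the "add to dictionary" option needs. The underlining and right-click menu would be built separately in JavaScript.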
Roadmap to develop (part 3):
Step 5: Develop User Interface (UI)
• Design the main web interface using HTML, CSS, and JavaScript. Create
sections for:
• Document upload
• Search input and results
• Spell-check feedback and suggestions
• Ensure the UI is intuitive, with a clean layout and accessible elements.
• Estimated Time: 2 weeks
Step 6: Implement Access Control and User Management
• Create a user registration and login system (for document upload access).
• Develop an admin interface to manage users and uploaded documents.
• Ensure metadata, including usernames, is stored alongside uploaded
documents.
• Estimated Time: 2 weeks
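Registration and login should never store plain passwords. The sketch below stores a salted hash instead, using only the standard library. The users table, the iteration count, and the function names are assumptions for illustration.

```python
import hashlib
import hmac
import os
import sqlite3

ITERATIONS = 200_000  # assumed work factor; tune for your hardware


def hash_password(password, salt):
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def register(conn, username, password):
    salt = os.urandom(16)  # a fresh random salt per user
    conn.execute(
        "INSERT INTO users (username, salt, password_hash) VALUES (?, ?, ?)",
        (username, salt, hash_password(password, salt)),
    )
    conn.commit()


def login(conn, username, password):
    row = conn.execute(
        "SELECT salt, password_hash FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return False  # unknown user
    salt, stored_hash = row
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(stored_hash, hash_password(password, salt))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "username TEXT PRIMARY KEY, salt BLOB NOT NULL, password_hash BLOB NOT NULL)"
)
register(conn, "ayesha", "correct horse")
print(login(conn, "ayesha", "correct horse"), login(conn, "ayesha", "wrong"))
```

In a Flask project, generate_password_hash and check_password_hash from werkzeug.security are a ready-made alternative to writing these helpers by hand.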
Roadmap to develop (part 4):
Step 7: Add Data Persistence and Session Management
• Ensure user sessions persist across actions (e.g., staying logged in
across different pages).
• Finalize data management techniques to maintain access to uploaded
documents and user-specific data.
• Estimated Time: 1 week
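Staying logged in across pages is what Flask's session object is for. This is a minimal sketch, assuming Flask is installed; the routes are illustrative, and the login route skips the password check to keep the focus on the session.

```python
from flask import Flask, session

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder; use a long random value in practice


@app.route("/login/<username>")
def login(username):
    # A real login would verify a password first; this only shows the session.
    session["username"] = username
    return "logged in"


@app.route("/whoami")
def whoami():
    return session.get("username", "anonymous")


@app.route("/logout")
def logout():
    session.pop("username", None)
    return "logged out"


# The test client keeps cookies between requests, like a browser does.
with app.test_client() as client:
    before = client.get("/whoami").get_data(as_text=True)
    client.get("/login/ayesha")
    after = client.get("/whoami").get_data(as_text=True)
    client.get("/logout")
    final = client.get("/whoami").get_data(as_text=True)

print(before, after, final)
```

The username set on one page is still available on the next request, and it disappears after logout.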
Step 8: Testing and Debugging
• Test the app’s functionalities, focusing on search accuracy, spell-
check performance, and UI responsiveness.
• Fix any bugs and optimize the code to improve search efficiency and
UI speed.
• Estimated Time: 1 week
Roadmap to develop (part 5):
Step 9: Project Finalization and Deployment
• Add final touches, such as documentation and usage instructions.
• Estimated Time: 1 week
Total Development Time: 13 weeks (approx. 3 months)
Learn from ChatGPT
https://chatgpt.com/
A great learning tool
You can ask it questions about anything (including code)
Ask it to write, explain, or edit code
Ask it for learning material on any topic
Ask it to teach you anything…
(for example: how to connect a Python application to a database)
Ask it for guidance on anything
Things to Do:
Learn daily – Consistency is the key to success
Stay connected with your Supervisor
Check your VU email daily
Things to avoid:
Don’t hire someone to develop the project or assignments
You won’t be able to pass the viva if you don’t learn
Don’t waste time – Each wasted day is a step towards
failure
Learning takes considerable time; you can’t learn the entire
project in 2-3 days.
Structure of the Project:
Four Assignments:
SRS (15 marks)
Design Document (25 marks)
Prototype (includes viva) (10 marks)
Final Deliverable (includes viva) (50 marks)
You won’t be able to submit the final deliverable if you obtain
less than 50% marks in the first three assignments
Any Questions?
Thank You!
Happy Learning