Assignment 1: End-to-End Sentiment Analysis Pipeline
Objective
You will implement a sentiment analysis pipeline on a well-known public dataset. The process
includes:
1. Downloading the dataset (already annotated with sentiment).
2. Storing the data in a database.
3. Cleaning and exploring the data.
4. Training and evaluating a classification model.
5. Serving the trained model via a Flask API (with an endpoint that predicts sentiment for
new text).
Dataset: IMDB Movie Reviews
Why IMDB?
o It’s a standard sentiment analysis dataset with 50k labeled movie reviews
(25k train + 25k test). Labels are positive or negative.
o Easily available via:
Kaggle: IMDB Dataset of 50K Movie Reviews
Hugging Face Datasets: load_dataset("imdb") from the datasets library
Important: If you prefer another labeled sentiment dataset (e.g., Yelp Reviews Polarity), that’s
fine, but the IMDB dataset is the baseline recommendation.
Steps & Requirements
1. Data Collection
1. Obtain the data
o Download from Kaggle (CSV file) OR load via Hugging Face Datasets.
o Confirm you have ~50k labeled reviews in total (25k train + 25k test).
2. Database Setup
o Choose any relational database (e.g., MySQL, PostgreSQL, or SQLite).
o Create a table (e.g., imdb_reviews) with columns:
id (primary key)
review_text (text)
sentiment (text or integer, e.g., “positive” / “negative”)
o Insert all data into this table.
Note: If you use SQLite, you won’t need a separate DB server. This is often the simplest option.
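With SQLite, the table creation and bulk insert can be sketched as below. The file name imdb_reviews.db and column names follow the examples above; the rows here are placeholders standing in for the downloaded dataset.

```python
import sqlite3

# Create (or open) the SQLite database file -- no separate DB server needed.
conn = sqlite3.connect("imdb_reviews.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS imdb_reviews (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        review_text TEXT NOT NULL,
        sentiment TEXT NOT NULL
    )
    """
)

# Placeholder rows; in the assignment these come from the Kaggle CSV or
# the Hugging Face dataset.
rows = [
    ("A wonderful little production.", "positive"),
    ("The plot was dull and predictable.", "negative"),
]
conn.executemany(
    "INSERT INTO imdb_reviews (review_text, sentiment) VALUES (?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM imdb_reviews").fetchone()[0]
print(count)
conn.close()
```

For MySQL/PostgreSQL the same statements apply with minor dialect changes (e.g., AUTO_INCREMENT and %s placeholders in MySQL).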
2. Data Cleaning & Exploration
1. Data Cleaning
o Ensure there are no obvious errors or duplicates.
o For text cleanup, you can consider:
Lowercasing
Removing HTML tags (some IMDB reviews contain <br /> etc.)
Removing punctuation (optional)
o Keep the cleaned version stored in memory or in a new column, whichever you
prefer.
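The cleanup steps above (lowercasing, HTML-tag removal, optional punctuation removal) can be combined into one small function, for example:

```python
import re

def clean_review(text: str) -> str:
    """Lowercase, strip HTML tags like <br />, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"[^a-z0-9\s']", " ", text)  # optional: drop punctuation
    return re.sub(r"\s+", " ", text).strip()

print(clean_review("Great film!<br /><br />Loved it."))  # → great film loved it
```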
2. Exploratory Data Analysis (EDA)
o Show basic stats:
Number of reviews per sentiment (distribution)
Average review length for positive vs. negative
o (Optional) Some simple plots or word clouds can be included for illustration.
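The basic stats above need nothing beyond the standard library; a sketch with placeholder data (in practice, read the rows from your imdb_reviews table):

```python
from collections import Counter
from statistics import mean

# Placeholder (text, label) pairs standing in for rows from the database.
data = [
    ("loved every minute of it", "positive"),
    ("a masterpiece of the genre", "positive"),
    ("boring and far too long", "negative"),
]

distribution = Counter(label for _, label in data)
avg_words = {
    label: mean(len(text.split()) for text, lab in data if lab == label)
    for label in distribution
}
print(distribution)  # reviews per sentiment
print(avg_words)     # average word count per class
```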
3. Model Training
1. Model Type
o Baseline: A simple approach like Logistic Regression or Naive Bayes on
TF-IDF vectors.
o (Optional) Try a transformer-based model like DistilBERT or BERT if you’re
comfortable with Hugging Face Transformers.
2. Train/Validation Split
o If the dataset is already split into train/test, use that.
o You can create an additional validation split from the training set (e.g., 80%/20%
within the training set).
3. Training
o Fit the model on the training data.
o Monitor basic metrics (accuracy, F1, etc.) on the validation set.
4. Evaluation
o Evaluate on the test set.
o Report metrics (accuracy, precision, recall, F1-score).
Tip: Keep the model as simple or advanced as you like, but logistic regression on TF-IDF is
typically enough for a baseline.
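The baseline fits in a few lines with scikit-learn. The toy texts below stand in for the real train split; the actual script would pull rows from the database:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the training split.
texts = ["loved it", "great acting", "wonderful story",
         "terrible plot", "awful film", "boring mess"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

# TF-IDF features piped into logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

pred = model.predict(["what a wonderful film"])[0]
print(pred)
```

Persist the fitted pipeline (e.g., with pickle or joblib) so the Flask app can load it once at startup and reuse the same vectorizer at prediction time.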
4. Model Serving with Flask
1. Flask API
o Create a simple Flask app (app.py or main.py).
o Include an endpoint, e.g., POST /predict:
Input: JSON with a field review_text (the new text to classify).
Output: JSON with a field sentiment_prediction (e.g., "positive" /
"negative").
2. Model Loading
o Ensure your trained model (e.g., a pickle file or Hugging Face model weights) is
loaded once when the app starts.
o The Flask endpoint should do the following:
1. Receive text input.
2. Apply the same preprocessing steps used during training.
3. Run the text through the trained model.
4. Return the predicted sentiment.
3. Testing Locally
o Show how to send a test request (using curl, Postman, or Python’s requests) to
verify the endpoint works.
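A minimal shape for the app, with a placeholder rule-based predictor where your loaded model would go (the endpoint and field names follow the spec above); Flask's built-in test client doubles as a quick local check:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder predictor; in the real app, load your pickled model once at
# startup and apply the same preprocessing used during training.
def predict_sentiment(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    text = payload.get("review_text", "")
    return jsonify({"sentiment_prediction": predict_sentiment(text)})

# Quick local check without starting a server.
client = app.test_client()
resp = client.post("/predict", json={"review_text": "A good movie"})
print(resp.get_json())  # → {'sentiment_prediction': 'positive'}
```

With the server running (python app.py), the same request works via curl with a JSON body containing review_text, or via Python's requests.post.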
5. (Optional) Deployment
Cloud: Deploy to any free-tier service (Heroku, Render, Railway, etc.) or a small EC2
instance on AWS.
Provide instructions or documentation if you do this step.
Deliverables
1. Code Repository
o A well-structured repository (GitHub, GitLab, etc.) containing:
Data ingestion/DB setup script or notebook (e.g., data_setup.py).
Model training script or notebook (e.g., train_model.py).
Flask app (e.g., app.py).
Requirements file (requirements.txt) listing all Python dependencies.
2. Database Schema
o Simple instructions or a .sql file that shows table creation if using
MySQL/PostgreSQL.
o If using SQLite, mention your database file name (e.g., imdb_reviews.db) and
how it’s created.
3. README
o Project Setup: Steps to install dependencies (e.g., pip install -r
requirements.txt) and set up the database.
o Data Acquisition: How you downloaded or loaded the dataset.
o Run Instructions:
How to run the training script.
How to start the Flask server.
How to test the endpoint with example commands or requests.
o Model Info: A summary of the chosen model approach and key results (e.g., final
accuracy on test set).
4. (Optional) Additional Assets
o If you generate plots or a short EDA report, include them in the repo (e.g., a
.ipynb or .pdf).
Time Expectation & Scope
This assignment should be 2-3 days of work at a normal pace.
Keep the solution focused on these steps—no need to explore other major NLP tasks.
Evaluation Criteria
1. Completeness: Did you store data in a DB, train a sentiment model, and serve it via
Flask?
2. Correctness: Does the Flask endpoint work? Does the model predict sentiment
accurately on test data?
3. Code Quality & Organization: Is the code clean, documented, and logically separated
into files/modules?
4. Documentation: Is there a clear README with setup and usage instructions?
Assignment 2: RAG (Retrieval-Augmented Generation) Chatbot
Overview
You will implement a simple Retrieval-Augmented Generation (RAG) chatbot that uses a vector database
for semantic search and stores chat history in a local MySQL database. The chatbot will be served via a
Flask API.
Task Breakdown
1. Data Preparation
o Corpus: Pick or create a small text corpus (e.g., a set of documentation pages, articles,
or Wikipedia paragraphs on a particular topic). You can:
Provide your own text files.
Scrape a small set of web pages (ensure the data is not too large—just enough
to test retrieval).
o Chunk & Preprocess:
Split larger documents into smaller chunks (e.g., ~200-300 words each).
Clean and normalize text (remove extra whitespaces, etc.).
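A simple word-count chunker is enough for this assignment; a sketch with the ~250-word target from above:

```python
def chunk_text(text: str, max_words: int = 250) -> list[str]:
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = ("word " * 600).strip()
chunks = chunk_text(doc, max_words=250)
print([len(c.split()) for c in chunks])  # → [250, 250, 100]
```

Sentence-aware splitting (e.g., breaking on sentence boundaries near the limit) is a reasonable refinement but not required.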
2. Embedding & Vector Store
o Vector Database: You can choose any free and local-friendly vector DB (e.g., Faiss,
Chroma, or Milvus).
o Embeddings:
Use any sentence embedding model (e.g., sentence-transformers from Hugging
Face).
Embed each chunk of text and store the resulting vectors in the vector DB.
o Retrieval:
Implement a function to query the vector DB given a user query (in
embedding space).
Return the top-k most relevant chunks.
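Whichever vector DB you pick, retrieval reduces to nearest-neighbor search over embeddings. A self-contained stand-in using cosine similarity in NumPy (the chunk vectors here are fabricated toy values; a real solution would store sentence-transformers embeddings in Faiss or Chroma):

```python
import numpy as np

# Toy chunk embeddings (one row per chunk) -- in practice these come from
# your sentence embedding model.
chunk_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
chunks = ["chunk about topic A", "chunk about topic B", "chunk about A and B"]

def retrieve_top_k(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    sims = chunk_vectors @ query_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

top = retrieve_top_k(np.array([1.0, 0.1, 0.0]))
print(top)
```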
3. Generation (Answer Construction)
o RAG Approach:
Take the user’s query.
Retrieve relevant chunks from the vector DB.
Use the retrieved chunks as context to generate an answer.
Model: you can use any LLM.
o Implementation Details:
You can run a local model (small transformer, or even a rule-based approach) if
you do not want to rely on large model inference.
For each query, your pipeline should be:
1. Convert query to embedding.
2. Retrieve top-k relevant chunks.
3. Concatenate these chunks.
4. Pass them (with the query) into your small generation or extraction
function to form the final answer.
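The four-step pipeline above is just glue between components you supply. A sketch with stub callables to show the wiring (the lambdas are placeholders, not real embedding or generation code):

```python
def rag_answer(query, embed, retrieve_top_k, generate):
    """Glue the four RAG steps together. All callables are supplied by you."""
    query_vec = embed(query)                # 1. convert query to embedding
    top_chunks = retrieve_top_k(query_vec)  # 2. retrieve top-k chunks
    context = "\n".join(top_chunks)         # 3. concatenate chunks
    return generate(query, context)         # 4. generate/extract final answer

# Stub components just to demonstrate the flow; replace with real ones.
answer = rag_answer(
    "what is X?",
    embed=lambda q: [0.0],
    retrieve_top_k=lambda v: ["X is a placeholder concept."],
    generate=lambda q, ctx: ctx.split(".")[0] + ".",
)
print(answer)  # → X is a placeholder concept.
```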
4. Flask API
o Endpoints:
POST /chat: Accepts a JSON payload with the user’s query. Returns:
Generated answer.
Possibly the top retrieved chunks for debugging (optional).
GET /history: Returns the chat history from the MySQL DB.
o Chat History:
For each user query and system answer, store them as separate rows in MySQL
with fields such as:
id (auto-increment)
timestamp
role (user or system)
content (the text of the user query or system answer)
You can store them after each response is generated.
5. Database
o MySQL:
Set up a local MySQL database.
Create a table(s) to hold chat messages.
Optionally, you can also store user IDs if you want to support multiple users (not
mandatory).
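The chat-history table and insert helper look like this against the Python DB-API. It is demonstrated here with sqlite3 so the sketch runs anywhere; with mysql-connector-python the statements have the same shape (use AUTO_INCREMENT and %s placeholders in the MySQL dialect):

```python
import sqlite3
from datetime import datetime, timezone

# In-memory SQLite stand-in for the local MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chat_messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp TEXT NOT NULL,
        role TEXT NOT NULL,       -- 'user' or 'system'
        content TEXT NOT NULL
    )
    """
)

def save_message(role: str, content: str) -> None:
    """Store one chat turn; call once for the query and once for the answer."""
    conn.execute(
        "INSERT INTO chat_messages (timestamp, role, content) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), role, content),
    )
    conn.commit()

save_message("user", "What is RAG?")
save_message("system", "Retrieval-Augmented Generation.")
history = conn.execute(
    "SELECT role, content FROM chat_messages ORDER BY id"
).fetchall()
print(history)
```

The GET /history endpoint can simply run the same SELECT and return the rows as JSON.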
6. Testing
o Unit Tests:
Test embedding and retrieval with a few sample queries.
Test the Flask endpoints (/chat, /history) to ensure the entire pipeline works.
Deliverables
1. Code Repository:
o All source code for the chatbot pipeline, including the vector DB setup, embedding code,
retrieval code, and Flask API.
o A requirements.txt or Pipfile.
2. Database Schema:
o The SQL schema or migrations for MySQL (tables for storing chat history).
3. Demo / Documentation:
o A short README describing:
How to install and run the system locally.
How to set up MySQL and create the required tables.
How to test the /chat and /history endpoints.
Any environment variables needed (e.g., DB credentials).
4. (Optional) Extra Credit:
o Dockerize the application to ensure consistent environment setup.
o Provide instructions for a minimal cloud deployment if you want.
General Submission Instructions
1. Repo Structure:
o Assignment1
data_ingestion.py (or notebook)
model_training.py (or notebook)
app.py (Flask API)
requirements.txt
etc.
o Assignment2
data_preprocessing.py
embed_store.py
rag_chatbot.py (or app.py for Flask)
requirements.txt
etc.
o README.md
2. Instructions: Provide clear instructions on how someone else can clone your repository, install
dependencies, set up the database, and run each assignment’s solution.
3. Time Expectation:
o Each assignment is designed to be completed in about 1-3 days of focused work
(depending on your familiarity with the tools).
Evaluation Criteria
1. Correctness & Completeness: Does the code run end-to-end without major issues? Are all
parts of the assignment addressed?
2. Project Structure & Code Quality: Is the repository organized logically? Is the code
readable, well-documented, and maintainable?
3. Solution Design: Are the chosen methods (data cleaning, feature engineering, model
selection, retrieval, etc.) appropriate and justified?
4. Documentation: Is there a clear README explaining how to install, run, and test the
solution?
5. Bonus/Extras: (Optional) Dockerization, cloud deployment, or additional tests will be
considered a plus.