Assignment 1: End-to-End Sentiment Analysis Pipeline
Objective
You will implement a sentiment analysis pipeline on a well-known public dataset. The process
includes:
1. Downloading the dataset (already annotated with sentiment).
2. Storing the data in a database.
3. Cleaning and exploring the data.
4. Training and evaluating a classification model.
5. Serving the trained model via a Flask API (with an endpoint that predicts sentiment for
new text).
Dataset: IMDB Movie Reviews
Why IMDB?
o It’s a standard sentiment analysis dataset with 50k labeled movie reviews
(25k train + 25k test). Labels are positive or negative.
o Easily available via:
Kaggle: IMDB Dataset of 50K Movie Reviews
Hugging Face Datasets: load_dataset("imdb") from the datasets library
Important: If you prefer another labeled sentiment dataset (e.g., Yelp Reviews Polarity), that’s
fine, but the IMDB dataset is the baseline recommendation.
Steps & Requirements
1. Data Collection
1. Obtain the data
o Download from Kaggle (CSV file) OR load via Hugging Face Datasets.
o Confirm you have ~50k labeled reviews in total (25k train + 25k test).
2. Database Setup
o Choose any relational database (e.g., MySQL, PostgreSQL, or SQLite).
o Create a table (e.g., imdb_reviews) with columns:
id (primary key)
review_text (text)
sentiment (text or integer, e.g., “positive” / “negative”)
o Insert all data into this table.
Note: If you use SQLite, you won’t need a separate DB server. This is often the simplest option.
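With SQLite, the table creation and bulk insert can be sketched as below. The file name imdb_reviews.db and column names follow the examples above; the rows here are placeholders standing in for the downloaded dataset.

```python
import sqlite3

# Create (or open) the SQLite database file -- no separate DB server needed.
conn = sqlite3.connect("imdb_reviews.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS imdb_reviews (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        review_text TEXT NOT NULL,
        sentiment TEXT NOT NULL
    )
    """
)

# Placeholder rows; in the assignment these come from the Kaggle CSV or
# the Hugging Face dataset.
rows = [
    ("A wonderful little production.", "positive"),
    ("The plot was dull and predictable.", "negative"),
]
conn.executemany(
    "INSERT INTO imdb_reviews (review_text, sentiment) VALUES (?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM imdb_reviews").fetchone()[0]
print(count)
conn.close()
```

For MySQL/PostgreSQL the same statements apply with minor dialect changes (e.g., AUTO_INCREMENT and %s placeholders in MySQL).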
2. Data Cleaning & Exploration
1. Data Cleaning
o Ensure there are no obvious errors or duplicates.
o For text cleanup, you can consider:
Lowercasing
Removing HTML tags (some IMDB reviews contain <br /> etc.)
Removing punctuation (optional)
o Keep the cleaned version stored in memory or in a new column, whichever you
prefer.
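The cleanup steps above (lowercasing, HTML-tag removal, optional punctuation removal) can be combined into one small function, for example:

```python
import re

def clean_review(text: str) -> str:
    """Lowercase, strip HTML tags like <br />, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"[^a-z0-9\s']", " ", text)  # optional: drop punctuation
    return re.sub(r"\s+", " ", text).strip()

print(clean_review("Great film!<br /><br />Loved it."))  # → great film loved it
```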
2. Exploratory Data Analysis (EDA)
o Show basic stats:
Number of reviews per sentiment (distribution)
Average review length for positive vs. negative
o (Optional) Some simple plots or word clouds can be included for illustration.
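The basic stats above need nothing beyond the standard library; a sketch with placeholder data (in practice, read the rows from your imdb_reviews table):

```python
from collections import Counter
from statistics import mean

# Placeholder (text, label) pairs standing in for rows from the database.
data = [
    ("loved every minute of it", "positive"),
    ("a masterpiece of the genre", "positive"),
    ("boring and far too long", "negative"),
]

distribution = Counter(label for _, label in data)
avg_words = {
    label: mean(len(text.split()) for text, lab in data if lab == label)
    for label in distribution
}
print(distribution)  # reviews per sentiment
print(avg_words)     # average word count per class
```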
3. Model Training
1. Model Type
o Baseline: A simple approach like Logistic Regression or Naive Bayes on
TF-IDF vectors.
o (Optional) Try a transformer-based model like DistilBERT or BERT if you’re
comfortable with Hugging Face Transformers.
2. Train/Validation Split
o If the dataset is already split into train/test, use that.
o You can create an additional validation split from the training set (e.g., 80%/20%
within the training set).
3. Training
o Fit the model on the training data.
o Monitor basic metrics (accuracy, F1, etc.) on the validation set.
4. Evaluation
o Evaluate on the test set.
o Report metrics (accuracy, precision, recall, F1-score).
Tip: Keep the model as simple or advanced as you like, but logistic regression on TF-IDF is
typically enough for a baseline.
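The baseline fits in a few lines with scikit-learn. The toy texts below stand in for the real train split; the actual script would pull rows from the database:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the training split.
texts = ["loved it", "great acting", "wonderful story",
         "terrible plot", "awful film", "boring mess"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

# TF-IDF features piped into logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

pred = model.predict(["what a wonderful film"])[0]
print(pred)
```

Persist the fitted pipeline (e.g., with pickle or joblib) so the Flask app can load it once at startup and reuse the same vectorizer at prediction time.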
4. Model Serving with Flask
1. Flask API
o Create a simple Flask app (app.py or main.py).
o Include an endpoint, e.g., POST /predict:
Input: JSON with a field review_text (the new text to classify).
Output: JSON with a field sentiment_prediction (e.g., "positive" /
"negative").
2. Model Loading
o Ensure your trained model (e.g., a pickle file or Hugging Face model weights) is
loaded once when the app starts.
o The Flask endpoint should do the following:
1. Receive text input.
2. Apply the same preprocessing steps used during training.
3. Run the text through the trained model.
4. Return the predicted sentiment.
3. Testing Locally
o Show how to send a test request (using curl, Postman, or Python’s requests) to
verify the endpoint works.
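A minimal shape for the app, with a placeholder rule-based predictor where your loaded model would go (the endpoint and field names follow the spec above); Flask's built-in test client doubles as a quick local check:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder predictor; in the real app, load your pickled model once at
# startup and apply the same preprocessing used during training.
def predict_sentiment(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    text = payload.get("review_text", "")
    return jsonify({"sentiment_prediction": predict_sentiment(text)})

# Quick local check without starting a server.
client = app.test_client()
resp = client.post("/predict", json={"review_text": "A good movie"})
print(resp.get_json())  # → {'sentiment_prediction': 'positive'}
```

With the server running (python app.py), the same request works via curl with a JSON body containing review_text, or via Python's requests.post.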
5. (Optional) Deployment
Cloud: Deploy to any free-tier service (Heroku, Render, Railway, etc.) or a small EC2
instance on AWS.
Provide instructions or documentation if you do this step.
Deliverables
1. Code Repository
o A well-structured repository (GitHub, GitLab, etc.) containing:
Data ingestion/DB setup script or notebook (e.g., data_setup.py).
Model training script or notebook (e.g., train_model.py).
Flask app (e.g., app.py).
Requirements file (requirements.txt) listing all Python dependencies.
2. Database Schema
o Simple instructions or a .sql file that shows table creation if using
MySQL/PostgreSQL.
o If using SQLite, mention your database file name (e.g., imdb_reviews.db) and
how it’s created.
3. README
o Project Setup: Steps to install dependencies (e.g., pip install -r
requirements.txt) and set up the database.
o Data Acquisition: How you downloaded or loaded the dataset.
o Run Instructions:
How to run the training script.
How to start the Flask server.
How to test the endpoint with example commands or requests.
o Model Info: A summary of the chosen model approach and key results (e.g., final
accuracy on test set).
4. (Optional) Additional Assets
o If you generate plots or a short EDA report, include them in the repo (e.g., a
.ipynb or .pdf).
Time Expectation & Scope
This assignment should be 2-3 days of work at a normal pace.
Keep the solution focused on these steps—no need to explore other major NLP tasks.
Evaluation Criteria
1. Completeness: Did you store data in a DB, train a sentiment model, and serve it via
Flask?
2. Correctness: Does the Flask endpoint work? Does the model predict sentiment
accurately on test data?
3. Code Quality & Organization: Is the code clean, documented, and logically separated
into files/modules?
4. Documentation: Is there a clear README with setup and usage instructions?
Assignment 2: RAG (Retrieval-Augmented Generation) Chatbot
Overview
You will implement a simple Retrieval-Augmented Generation (RAG) chatbot that uses a vector database
for semantic search and stores chat history in a local MySQL database. The chatbot will be served via a
Flask API.
Task Breakdown
1. Data Preparation
o Corpus: Pick or create a small text corpus (e.g., a set of documentation pages, articles,
or Wikipedia paragraphs on a particular topic). You can:
Provide your own text files.
Scrape a small set of web pages (ensure the data is not too large—just enough
to test retrieval).
o Chunk & Preprocess:
Split larger documents into smaller chunks (e.g., ~200-300 words each).
Clean and normalize text (remove extra whitespaces, etc.).
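A simple word-count chunker is enough for this assignment; a sketch with the ~250-word target from above:

```python
def chunk_text(text: str, max_words: int = 250) -> list[str]:
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = ("word " * 600).strip()
chunks = chunk_text(doc, max_words=250)
print([len(c.split()) for c in chunks])  # → [250, 250, 100]
```

Sentence-aware splitting (e.g., breaking on sentence boundaries near the limit) is a reasonable refinement but not required.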
2. Embedding & Vector Store
o Vector Database: You can choose any free and local-friendly vector DB (e.g., Faiss,
Chroma, or Milvus).
o Embeddings:
Use any sentence embedding model (e.g., sentence-transformers from Hugging
Face).
Embed each chunk of text and store the resulting vectors in the vector DB.
o Retrieval:
Implement a function to query the vector DB given a user query (in
embedding space).
Return the top-k most relevant chunks.
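Whichever vector DB you pick, retrieval reduces to nearest-neighbor search over embeddings. A self-contained stand-in using cosine similarity in NumPy (the chunk vectors here are fabricated toy values; a real solution would store sentence-transformers embeddings in Faiss or Chroma):

```python
import numpy as np

# Toy chunk embeddings (one row per chunk) -- in practice these come from
# your sentence embedding model.
chunk_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
chunks = ["chunk about topic A", "chunk about topic B", "chunk about A and B"]

def retrieve_top_k(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    sims = chunk_vectors @ query_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

top = retrieve_top_k(np.array([1.0, 0.1, 0.0]))
print(top)
```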
3. Generation (Answer Construction)
o RAG Approach:
Take the user’s query.
Retrieve relevant chunks from the vector DB.
Use the retrieved chunks as context to generate an answer.
Model: you can use any LLM.
o Implementation Details:
You can run a local model (small transformer, or even a rule-based approach) if
you do not want to rely on large model inference.
For each query, your pipeline should be:
1. Convert query to embedding.
2. Retrieve top-k relevant chunks.
3. Concatenate these chunks.
4. Pass them (with the query) into your small generation or extraction
function to form the final answer.
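The four-step pipeline above is just glue between components you supply. A sketch with stub callables to show the wiring (the lambdas are placeholders, not real embedding or generation code):

```python
def rag_answer(query, embed, retrieve_top_k, generate):
    """Glue the four RAG steps together. All callables are supplied by you."""
    query_vec = embed(query)                # 1. convert query to embedding
    top_chunks = retrieve_top_k(query_vec)  # 2. retrieve top-k chunks
    context = "\n".join(top_chunks)         # 3. concatenate chunks
    return generate(query, context)         # 4. generate/extract final answer

# Stub components just to demonstrate the flow; replace with real ones.
answer = rag_answer(
    "what is X?",
    embed=lambda q: [0.0],
    retrieve_top_k=lambda v: ["X is a placeholder concept."],
    generate=lambda q, ctx: ctx.split(".")[0] + ".",
)
print(answer)  # → X is a placeholder concept.
```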
4. Flask API
o Endpoints:
POST /chat: Accepts a JSON payload with the user’s query. Returns:
Generated answer.
Possibly the top retrieved chunks for debugging (optional).
GET /history: Returns the chat history from the MySQL DB.
o Chat History:
For each user query and system answer, store them as separate rows in MySQL
with fields such as:
id (auto-increment)
timestamp
role (user or system)
content (the text of the user query or system answer)
You can store them after each response is generated.
5. Database
o MySQL:
Set up a local MySQL database.
Create a table(s) to hold chat messages.
Optionally, you can also store user IDs if you want to support multiple users (not
mandatory).
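The chat-history table and insert helper look like this against the Python DB-API. It is demonstrated here with sqlite3 so the sketch runs anywhere; with mysql-connector-python the statements have the same shape (use AUTO_INCREMENT and %s placeholders in the MySQL dialect):

```python
import sqlite3
from datetime import datetime, timezone

# In-memory SQLite stand-in for the local MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chat_messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp TEXT NOT NULL,
        role TEXT NOT NULL,       -- 'user' or 'system'
        content TEXT NOT NULL
    )
    """
)

def save_message(role: str, content: str) -> None:
    """Store one chat turn; call once for the query and once for the answer."""
    conn.execute(
        "INSERT INTO chat_messages (timestamp, role, content) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), role, content),
    )
    conn.commit()

save_message("user", "What is RAG?")
save_message("system", "Retrieval-Augmented Generation.")
history = conn.execute(
    "SELECT role, content FROM chat_messages ORDER BY id"
).fetchall()
print(history)
```

The GET /history endpoint can simply run the same SELECT and return the rows as JSON.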
6. Testing
o Unit Tests:
Test embedding and retrieval with a few sample queries.
Test the Flask endpoints (/chat, /history) to ensure the entire pipeline works.
Deliverables
1. Code Repository:
o All source code for the chatbot pipeline, including the vector DB setup, embedding code,
retrieval code, and Flask API.
o A requirements.txt or Pipfile.
2. Database Schema:
o The SQL schema or migrations for MySQL (tables for storing chat history).
3. Demo / Documentation:
o A short README describing:
How to install and run the system locally.
How to set up MySQL and create the required tables.
How to test the /chat and /history endpoints.
Any environment variables needed (e.g., DB credentials).
4. (Optional) Extra Credit:
o Dockerize the application to ensure consistent environment setup.
o Provide instructions for a minimal cloud deployment if you want.
General Submission Instructions
1. Repo Structure:
o Assignment1
data_ingestion.py (or notebook)
model_training.py (or notebook)
app.py (Flask API)
requirements.txt
etc.
o Assignment2
data_preprocessing.py
embed_store.py
rag_chatbot.py (or app.py for Flask)
requirements.txt
etc.
o README.md
2. Instructions: Provide clear instructions on how someone else can clone your repository, install
dependencies, set up the database, and run each assignment’s solution.
3. Time Expectation:
o Each assignment is designed to be completed in about 1-3 days of focused work
(depending on your familiarity with the tools).
Evaluation Criteria
1. Correctness & Completeness: Does the code run end-to-end without major issues? Are all
parts of the assignment addressed?
2. Project Structure & Code Quality: Is the repository organized logically? Is the code
readable, well-documented, and maintainable?
3. Solution Design: Are the chosen methods (data cleaning, feature engineering, model
selection, retrieval, etc.) appropriate and justified?
4. Documentation: Is there a clear README explaining how to install, run, and test the
solution?
5. Bonus/Extras: (Optional) Dockerization, cloud deployment, or additional tests will be
considered a plus.