19/06/2025
Natural Language Processing
Spring 2025
Prof. Dr. M. Fasih Uddin Butt
Agenda
Overview of topics to be covered:
1. The Purpose of Model Training and Fine-Tuning
2. Cleaning Input Data
3. Using Vector Databases
4. Implementing RAG
The main difference between model training and fine-tuning is that training builds
a model from scratch, while fine-tuning adjusts an existing (pre-trained) model for
specific needs.
CNN
CNN stands for Convolutional Neural Network, a class of deep learning
models used primarily for processing data that has a grid pattern, such as
images. CNNs are particularly effective for tasks like image classification,
object detection, and segmentation.
2. Introduction to Model Training and Fine-Tuning
(i) Why Train and Fine-Tune Generative AI Models?
● Explain the need for customization to suit domain-specific
needs.
● Benefits: Accuracy, relevance, and improved performance.
● Real-world examples (e.g., chatbots, content generation,
personalized recommendations).
(ii) The Need for Customization
Pre-trained models are trained on diverse and generic datasets. While this makes
them versatile, they can lack accuracy when applied to specific domains or unique
tasks. Fine-tuning adapts the model to a particular dataset, language style, or
business requirement, so that it generates more relevant, accurate, and targeted
outputs.
For example:
● A general chatbot trained on public conversations may struggle to respond
accurately to medical inquiries.
● A text-to-image model like Stable Diffusion may not create realistic industrial
equipment images without domain-specific fine-tuning.
Real-World Examples
a. Chatbots
● General-purpose chatbots (e.g., ChatGPT) can be fine-tuned for:
○ Healthcare support: Responding to patient queries with precise medical answers.
○ Banking support: Providing information on account balances, fraud detection, etc.
● Example: A chatbot for a hospital fine-tuned to handle medical appointments and FAQs.
b. Content Generation
● Generative AI models like GPT or DALL·E can be customized to produce:
○ Marketing content tailored to a brand's tone and audience.
○ E-learning materials in a specific teaching style or language (e.g., video generation with Sora).
2. Introduction to Model Training and Fine-Tuning
(iii) Overview of the Fine-Tuning Process
● Steps:
1. Selecting the model.
2. Preparing the dataset.
3. Training and validation.
4. Deployment.
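The four steps above can be sketched with a toy example. This is a minimal, pure-Python illustration rather than a real generative model: the "pre-trained" weights, the synthetic dataset, and the function names (`fine_tune`, `serve`) are all assumptions made for the sketch. The idea it shows is that fine-tuning continues training from existing weights on new domain data and checks the result on held-out validation data.

```python
import random

def predict(weights, x):
    """Tiny linear model: w0 + w1 * x."""
    return weights[0] + weights[1] * x

def mse(weights, data):
    """Mean squared error of the model on (x, y) pairs."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def fine_tune(weights, train, lr=0.1, epochs=200):
    """Continue gradient descent from existing weights on new data."""
    w0, w1 = weights
    n = len(train)
    for _ in range(epochs):
        g0 = sum(2 * (w0 + w1 * x - y) for x, y in train) / n
        g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in train) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return [w0, w1]

# 1. Select the model: start from hypothetical pre-trained weights (y = x).
pretrained = [0.0, 1.0]

# 2. Prepare the dataset: synthetic "domain" data following y = 2x + 1,
#    split into training and validation sets.
random.seed(0)
data = [(i / 10, 2 * (i / 10) + 1) for i in range(-20, 21)]
random.shuffle(data)
train, val = data[:32], data[32:]

# 3. Train and validate: compare validation error before and after.
tuned = fine_tune(pretrained, train)
val_before = mse(pretrained, val)
val_after = mse(tuned, val)

# 4. Deploy: here, simply expose the tuned model behind a function.
def serve(x):
    return predict(tuned, x)
```

In practice the model would be a large pre-trained network and the training loop would come from a deep learning framework, but the order of the steps is the same.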
3. Cleaning Input Data
(i) Why is Data Cleaning Important?
● The impact of clean data on model performance.
● Risks of poor-quality data (bias, errors).
(ii) Steps to Clean Input Data
● Removing duplicates.
● Handling missing values.
● Normalizing data (text/token standardization).
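The three cleaning steps above can be illustrated with a short sketch using only the Python standard library. The record layout (`text` and `label` fields), the `"unknown"` default label, and the helper names are assumptions chosen for the example; libraries such as pandas offer equivalent operations for larger datasets.

```python
import re
import unicodedata

def normalize_text(text):
    """Normalize Unicode, lowercase, and collapse repeated whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower().strip()
    return re.sub(r"\s+", " ", text)

def clean_records(records, default_label="unknown"):
    """Drop rows without text, fill missing labels, normalize, deduplicate."""
    seen = set()
    cleaned = []
    for rec in records:
        text = rec.get("text")
        if text is None or not text.strip():
            continue  # missing value in a required field: drop the row
        text = normalize_text(text)
        if text in seen:
            continue  # duplicate once the text has been normalized
        seen.add(text)
        label = rec.get("label") or default_label  # fill a missing label
        cleaned.append({"text": text, "label": label})
    return cleaned

# Hypothetical raw records, for illustration only.
raw = [
    {"text": "Hello   World", "label": "greeting"},
    {"text": "hello world ", "label": "greeting"},  # duplicate after normalization
    {"text": None, "label": "greeting"},            # missing text
    {"text": "Reset my password", "label": None},   # missing label
]
cleaned = clean_records(raw)
```

Normalizing before deduplicating matters here: the first two records only become identical once case and whitespace are standardized.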
(iii) Tools for Data Cleaning
● Examples: Python libraries (pandas, NLTK, spaCy).
Domains Where Data Cleaning Is Required
1. Data Science
2. Machine Learning (ML)
3. Big Data
4. Database Management Systems (DBMS)
5. Data Engineering
6. Information Retrieval (IR)
7. Natural Language Processing (NLP)
8. Health Informatics
9. Business Intelligence (BI)
4. Using Vector Databases for Efficient Training
(i) What Are Vector Databases?
A vector database is a database that stores, indexes, and manages high-dimensional
vector data.
(ii) How it works
Vector databases store data as mathematical representations called "vectors". These
vectors are indexed by similarity (for example, by grouping nearby vectors together),
which allows for low-latency similarity queries.
Examples:
Chroma, Pinecone, Weaviate, Faiss, Qdrant, Milvus, and pgvector.
Benefits
Vector databases enable machine learning models to identify similar
objects, which can be used for:
● Search
● Recommendations
● Text generation
● Building advanced AI applications on top of LLMs
Comparison to other databases
Vector databases are optimized for storing and retrieving vector data by similarity,
while SQL databases are optimized for structured data and NoSQL databases for
semi-structured or unstructured data.
Applications
Vectors Represent Semantic Information:
Generative AI transforms inputs (text, images, etc.) into vector embeddings using
models trained on large datasets.
● Example: The words "dog" and "wolf" are close in meaning, so their embeddings
(vectors) will be close to each other in vector space.
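One common way to measure how close two embeddings are is cosine similarity. The sketch below uses hand-made three-dimensional vectors as stand-ins for real embeddings; the numbers are invented for illustration and do not come from any actual model, whose embeddings would have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", chosen by hand for illustration only.
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "wolf": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_dog_wolf = cosine_similarity(embeddings["dog"], embeddings["wolf"])
sim_dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
```

With these toy values, "dog" and "wolf" score much higher than "dog" and "car", which mirrors the idea that related words sit close together in vector space.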
Similarity Search Enables RAG:
When you query with a vector, a vector database quickly retrieves similar vectors.
This enables AI to enhance generation with relevant, existing content.
● Example: Generative AI can generate answers or images based on similar past
knowledge retrieved from a vector database.
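Similarity search can be sketched as a tiny in-memory store that compares the query vector against every stored vector and returns the best matches. The class name `TinyVectorStore`, its methods, and the example vectors are hypothetical and not the API of any real product; production vector databases rely on approximate nearest-neighbor indexes rather than this brute-force scan.

```python
import math

class TinyVectorStore:
    """Brute-force in-memory vector store, for illustration only."""

    def __init__(self):
        self._items = []  # list of (item_id, vector, payload)

    def add(self, item_id, vector, payload):
        self._items.append((item_id, vector, payload))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def query(self, vector, top_k=2):
        """Return the top_k (score, item_id, payload) tuples, best first."""
        scored = [
            (self._cosine(vector, v), item_id, payload)
            for item_id, v, payload in self._items
        ]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:top_k]

# Hypothetical documents with hand-made vectors.
store = TinyVectorStore()
store.add("doc1", [0.9, 0.1, 0.0], "How to reset a password")
store.add("doc2", [0.8, 0.2, 0.1], "Account recovery steps")
store.add("doc3", [0.0, 0.1, 0.9], "Shipping and delivery times")

results = store.query([1.0, 0.0, 0.0], top_k=2)
top_ids = [item_id for _, item_id, _ in results]
```

The payloads returned by `query` are what a generative model would receive as extra context.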
Definition
Retrieval-Augmented Generation (RAG) works by retrieving relevant documents or data
from a knowledge base or external source and then using that information to generate
more accurate, contextually aware responses.
Here’s a breakdown of how RAG works:
1. Retrieval: A query or input is processed, and the model retrieves relevant
documents from a large database or knowledge base.
2. Augmentation: The retrieved documents are used to augment the model's input,
providing more context or information.
3. Generation: The augmented input is passed through a generative model (like GPT) to
produce a final, contextually enriched output.
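The three stages above can be sketched end to end in a few lines. This is a simplified illustration under stated assumptions: the knowledge base sentences are made up, retrieval uses plain word overlap instead of vector search, and `generate` is a placeholder stub that stands in for a call to a real language model.

```python
import re

# Hypothetical knowledge base, for illustration only.
KNOWLEDGE_BASE = [
    "To reset your password, open Settings and choose Password Reset.",
    "Refunds are processed within five business days.",
    "The sky appears blue because air scatters blue sunlight more than red.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Retrieval: rank documents by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def augment(query, retrieved):
    """Augmentation: put the retrieved text into the prompt as context."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Generation: stub that echoes the first context line.

    A real system would send the prompt to a language model here.
    """
    first_context_line = prompt.splitlines()[1]
    return first_context_line[2:]  # strip the leading "- "

query = "Why is the sky blue?"
retrieved = retrieve(query, KNOWLEDGE_BASE)
prompt = augment(query, retrieved)
answer = generate(prompt)
```

Swapping `retrieve` for a vector database query and `generate` for a model call turns this skeleton into a working pipeline without changing its overall shape.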
Simple Definition of RAG
RAG is a method in artificial intelligence that helps computers give better
answers by combining two steps:
1. Retrieval: First, it searches for useful information from a database or
knowledge base (like looking up facts).
2. Generation: Then, it uses this information to create a clear and
accurate response using a language model (like ChatGPT).
Simple Example
Imagine you’re asking a computer, "Why is the sky blue?"
● The retrieval step is like the computer finding a book about the sky
in a library.
● The generation step is like the computer reading that book and
writing a simple answer for you:
"The sky is blue because of how sunlight interacts with the air."
By combining these steps, the computer gives a smarter answer than
guessing on its own!
Mature Examples of RAG Applications
Customer Support:
● A chatbot retrieves FAQs, policy documents, or troubleshooting guides to
answer customer questions accurately.
○ Query: "How can I reset my account password?"
○ Response: "To reset your password, go to [Settings], click on
[Password Reset], and follow the emailed instructions."
Legal Assistance:
● Retrieving legal statutes and case law to generate summaries or draft
documents.
○ Query: "What does Section 123 of the XYZ Act say about employment
contracts?"
○ Response: "Section 123 outlines the required clauses for valid
employment contracts, including terms for termination and dispute
resolution."
Benefits of RAG:
1. Up-to-Date Information
2. Reduced Hallucinations
3. Improved Accuracy for Domain-Specific Tasks
4. Scalability and Efficiency
5. Contextual and Custom Responses
6. Enhanced Transparency and Interpretability
7. Cost-Effective Solution
8. Adaptability Across Industries
9. Combines Generative and Search Power
10. Personalization and Context Management