Lecture 02

Advanced Data Mining: Techniques & Evaluation

Data Mining
Computer Science Faculty
Khana-e-Noor University

2025
Advanced Data Mining: Techniques & Evaluation
Classification, Clustering, Association, Model Evaluation
Agenda

• Classification Techniques
• Clustering Techniques
• Association Rule Mining
• Model Evaluation Methods
• Real-World Applications
Classification – Supervised Learning

• Goal: Predict a target class label for given input features


• Example (see the sketch below):
o Use case: Email Spam Detection
o Input features: Words in the email, sender domain, time
o Output class: Spam / Not Spam
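A minimal sketch of this spam-detection use case, assuming scikit-learn is available; the toy emails, labels, and test message below are hypothetical, not taken from the lecture.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: email text and its class label.
emails = [
    "win a free prize now",          # spam
    "meeting scheduled for monday",  # not spam
    "claim your free reward",        # spam
    "project report attached",       # not spam
]
labels = ["Spam", "Not Spam", "Spam", "Not Spam"]

# Turn each email into word-count features, then fit a simple classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# Classify a new message.
print(model.predict(vectorizer.transform(["free prize inside"])))  # likely ['Spam']
```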
Common Classification Algorithms
• Decision Tree:
o Tree-like structure where nodes are tests on features
o Example: "If income > 50k → Approved, else → Denied" (see the sketch below)
• Naïve Bayes:
o Probabilistic model based on Bayes' Theorem
o Example: Spam filtering based on word frequency
• k-Nearest Neighbors (k-NN):
o Classifies based on majority label of nearest neighbors
• SVM (Support Vector Machine):
o Finds optimal hyperplane to separate classes
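As a concrete illustration of the decision-tree idea above, here is a minimal sketch assuming scikit-learn; the income figures and loan decisions are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature: annual income in thousands; label: loan decision.
incomes = [[20], [35], [48], [52], [70], [90]]
decisions = ["Denied", "Denied", "Denied", "Approved", "Approved", "Approved"]

# A depth-1 tree learns a single test, roughly "income > 50k -> Approved".
tree = DecisionTreeClassifier(max_depth=1).fit(incomes, decisions)
print(tree.predict([[60]]))  # expected: ['Approved']
```

Real trees combine many such tests on different features; limiting the depth here keeps the sketch equivalent to the one-rule example.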
Classification – Metrics (with Example)

• Confusion Matrix:
                      Predicted Positive    Predicted Negative
  Actual Positive     TP = 80               FN = 20
  Actual Negative     FP = 10               TN = 90
• Accuracy: (TP+TN) / Total = (80+90)/200 = 85%
• Precision: TP / (TP+FP) = 80 / (80+10) = 88.9%
• Recall: TP / (TP+FN) = 80 / (80+20) = 80% (recomputed in the sketch below)
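The same metrics can be recomputed with scikit-learn; this sketch rebuilds label vectors that reproduce the counts above (TP = 80, FN = 20, FP = 10, TN = 90).

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# 100 actual positives followed by 100 actual negatives.
y_true = [1] * 100 + [0] * 100
# Predictions chosen so that TP=80, FN=20, FP=10, TN=90.
y_pred = [1] * 80 + [0] * 20 + [1] * 10 + [0] * 90

print(confusion_matrix(y_true, y_pred))  # rows = actual 0, 1 -> [[90 10], [20 80]]
print(accuracy_score(y_true, y_pred))    # 0.85
print(precision_score(y_true, y_pred))   # ~0.889
print(recall_score(y_true, y_pred))      # 0.80
```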
Clustering – Unsupervised Learning

• Goal: Group similar records into clusters


• Example Use Case: Customer Segmentation
o Input: Age, income, purchase history
o Output Clusters: High-value customers, Occasional buyers, Low spenders
Common Clustering Algorithms
• k-Means:
o Partitions data into k clusters by minimizing intra-cluster distance
o Example: Cluster customers into 3 buying behavior groups (see the sketch below)
• Hierarchical Clustering:
o Builds a tree (dendrogram) of clusters
o Good for small datasets
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
o Density-based; detects noise and outliers
o Great for non-spherical clusters
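A minimal k-means sketch for the customer-segmentation example, assuming scikit-learn; the (age, income) rows are hypothetical.

```python
from sklearn.cluster import KMeans

# Hypothetical customers: [age, annual income in thousands].
customers = [
    [22, 18], [25, 22], [27, 20],     # likely low spenders
    [35, 55], [40, 60], [38, 58],     # occasional buyers
    [50, 120], [55, 130], [48, 125],  # high-value customers
]

# Partition into 3 clusters by minimizing within-cluster distance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # mean age/income of each cluster
```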
Association Rule Mining
• Goal: Discover interesting relationships among items
• Example:
o Rule: {Milk, Bread} → {Butter}
o Support: 20% (20 out of 100 transactions contain all 3)
o Confidence: 80% (20 out of 25 that had Milk and Bread also had Butter)
o Lift: >1 implies positive association (all three measures are recomputed in the sketch below)
• Algorithms:
o Apriori: Uses candidate generation
o FP-Growth: Uses tree structure, faster for large data
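To make support, confidence, and lift concrete, here is a small hand-rolled sketch in plain Python; the five toy transactions are invented, so the resulting numbers differ from the 20% / 80% example above.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread"},
    {"Milk", "Bread", "Butter"},
]

n = len(transactions)
antecedent, consequent = {"Milk", "Bread"}, {"Butter"}

both = sum(1 for t in transactions if (antecedent | consequent) <= t)
ante = sum(1 for t in transactions if antecedent <= t)
cons = sum(1 for t in transactions if consequent <= t)

support = both / n               # fraction of transactions containing all 3 items
confidence = both / ante         # of baskets with Milk & Bread, share that also had Butter
lift = confidence / (cons / n)   # > 1 implies a positive association

print(support, confidence, lift)  # 0.4, ~0.667, ~1.11 for these toy baskets
```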
Model Evaluation
• Why Important? Prevent overfitting, ensure generalization
• Methods:
o Holdout Method: Train/test split
o k-Fold Cross-Validation: Data split into k parts, rotating test sets (see the sketch below)
o Leave-One-Out CV: Special case of k-fold with k = n
• Bias-Variance Tradeoff:
o High bias → underfitting
o High variance → overfitting
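A minimal evaluation sketch assuming scikit-learn, contrasting the holdout method with 5-fold cross-validation; the built-in iris dataset stands in for any labeled table.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout method: one fixed train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: every record is used for testing exactly once.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(holdout_acc, cv_scores.mean())
```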
Real-World Applications
• Healthcare: Predicting disease based on symptoms (classification)
• Retail: Finding product bundles (association rules)
• Banking: Customer segmentation (clustering), fraud detection
• E-commerce: Recommender systems (hybrid of techniques)
Summary
• Reviewed 3 major data mining techniques:
o Classification for labeled predictions
o Clustering for grouping data
o Association for rule discovery
• Learned how to evaluate models effectively
• Discussed real-world use cases
Thanks!
Any questions?
