Advanced Data Mining:
Techniques & Evaluation
Data Mining
Computer Science Faculty
Khana-e-Noor University
2025
Classification, Clustering, Association, Model Evaluation
Agenda
• Classification Techniques
• Clustering Techniques
• Association Rule Mining
• Model Evaluation Methods
• Real-World Applications
Classification – Supervised Learning
• Goal: Predict a target class label for given input features
• Example:
o Use case: Email Spam Detection
o Input features: Words in the email, sender domain, time
o Output class: Spam / Not Spam
Common Classification Algorithms
• Decision Tree:
o Tree-like structure where nodes are tests on features
o Example: "If income > 50k → Approved, else → Denied"
• Naïve Bayes:
o Probabilistic model based on Bayes' Theorem
o Example: Spam filtering based on word frequency
• k-Nearest Neighbors (k-NN):
o Classifies based on majority label of nearest neighbors
• SVM (Support Vector Machine):
o Finds optimal hyperplane to separate classes
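To make the classification idea concrete, here is a minimal k-Nearest Neighbors classifier in plain Python; the toy points and labels are invented for the example:

```python
from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (Euclidean distance)."""
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train, labels)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Invented toy data: two features per point, two classes
train = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.2)]
labels = ["A", "A", "B", "B"]
print(knn_predict(train, labels, (1.1, 0.9)))  # → A
print(knn_predict(train, labels, (8.0, 7.9)))  # → B
```

The same interface (features in, label out) applies to decision trees, Naïve Bayes, and SVMs; only the decision rule differs.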
Classification – Metrics (with Example)
• Confusion Matrix:
                  Predicted Positive   Predicted Negative
Actual Positive   TP = 80              FN = 20
Actual Negative   FP = 10              TN = 90
• Accuracy: (TP+TN) / Total = (80+90)/200 = 85%
• Precision: TP / (TP+FP) = 80 / (80+10) = 88.9%
• Recall: TP / (TP+FN) = 80 / (80+20) = 80%
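The three metrics can be reproduced directly from the confusion-matrix counts above; a quick sketch:

```python
# Counts from the confusion matrix on the slide
TP, FN, FP, TN = 80, 20, 10, 90

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)

print(f"accuracy  = {accuracy:.1%}")   # 85.0%
print(f"precision = {precision:.1%}")  # 88.9%
print(f"recall    = {recall:.1%}")     # 80.0%
```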
Clustering – Unsupervised Learning
• Goal: Group similar records into clusters
• Example Use Case: Customer Segmentation
o Input: Age, income, purchase history
o Output Clusters: High-value customers, Occasional buyers, Low spenders
Common Clustering Algorithms
• k-Means:
o Partitions data into k clusters by minimizing intra-cluster distance
o Example: Cluster customers into 3 buying behavior groups
• Hierarchical Clustering:
o Builds a tree (dendrogram) of clusters
o Good for small datasets
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
o Density-based; detects noise and outliers
o Great for non-spherical clusters
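A minimal sketch of Lloyd's iteration, the standard procedure behind the k-Means bullet above, on invented 2-D points:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated groups of invented points
points = [(1, 2), (1, 1), (2, 1), (9, 9), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

On this data the iteration separates the two groups regardless of which points are sampled as initial centroids.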
Association Rule Mining
• Goal: Discover interesting relationships among items
• Example:
o Rule: {Milk, Bread} → {Butter}
o Support: 20% (20 out of 100 transactions contain all 3)
o Confidence: 80% (20 out of 25 that had Milk and Bread also had Butter)
o Lift: >1 implies positive association
• Algorithms:
o Apriori: Uses candidate generation
o FP-Growth: Uses tree structure, faster for large data
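Support, confidence, and lift can be computed directly from a transaction list. A small sketch on an invented five-transaction database (the resulting numbers differ from the slide's 20%/80% example, which assumed 100 transactions):

```python
# Invented mini transaction database
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {milk, bread} -> {butter}
antecedent, consequent = {"milk", "bread"}, {"butter"}
confidence = support(antecedent | consequent) / support(antecedent)
lift = confidence / support(consequent)

print(f"support    = {support(antecedent | consequent):.2f}")
print(f"confidence = {confidence:.2f}")
print(f"lift       = {lift:.2f}")  # > 1 here: a positive association
```

Apriori and FP-Growth are ways of finding all itemsets whose support clears a threshold without scoring every possible itemset.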
Model Evaluation
• Why Important? Prevent overfitting, ensure generalization
• Methods:
o Holdout Method: Train/test split
o k-Fold Cross-Validation: Data split into k parts, rotating test sets
o Leave-One-Out CV: Special case of k-fold with k = n
• Bias-Variance Tradeoff:
o High bias → underfitting
o High variance → overfitting
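The holdout and k-fold ideas above can be sketched as an index-splitting helper (plain Python; the helper name is invented for illustration):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds; each fold
    serves once as the test set, with the rest as training data."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        train = [j for f in folds if f is not test for j in f]
        yield train, test

for train, test in kfold_indices(n=10, k=5):
    print(test)  # every index appears in exactly one test fold
```

With k = n this reduces to Leave-One-Out CV: each point is held out once while the model trains on the remaining n − 1.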
Real-World Applications
• Healthcare: Predicting disease based on symptoms (classification)
• Retail: Finding product bundles (association rules)
• Banking: Customer segmentation (clustering), fraud detection
• E-commerce: Recommender systems (hybrid of techniques)
Summary
• Reviewed 3 major data mining techniques:
o Classification for labeled predictions
o Clustering for grouping data
o Association for rule discovery
• Learned how to evaluate models effectively
• Discussed real-world use cases
Thanks!
Any questions?