Unit 1 - Data Mining and Knowledge Discovery
1. Differentiate Data Mining and Knowledge Discovery
Data Mining: The process of discovering patterns in large data sets; one step of the knowledge discovery process.
Knowledge Discovery (KDD): The overall process, including data cleaning, integration, transformation, mining, and pattern evaluation.
2. Functionalities of Data Mining (2 Examples)
- Classification: Predict categories.
- Clustering: Group similar items.
3. Interesting Pattern: A pattern that is valid, novel, useful, and understandable.
4. Predictive vs Descriptive
Predictive: Future prediction (e.g., classification).
Descriptive: Pattern discovery (e.g., clustering).
5. 10 Applications: Marketing, Fraud Detection, Stock Market, Health Care, Web Mining, Telecom,
Retail, Education, Manufacturing, Banking.
6. Machine Learning: AI technique enabling systems to learn. Types: Supervised, Unsupervised,
Reinforcement.
7. Model Selection: Choosing the best model for a task based on accuracy and performance on held-out data (e.g., via cross-validation).
8. Overfitting: Model performs well on training data but poorly on unseen data. Evaluation Metrics:
Accuracy, F1-Score.
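Example (a minimal Python sketch of these metrics using scikit-learn; the labels are made up for illustration):

from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]  # hypothetical model predictions
print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall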
9. Concept Learning Goal: Learn a general concept from labeled examples. E.g., learning the concept "fruit" from examples such as apples and bananas.
Unit 2 - Data Preprocessing
1. Issues in Raw Data: Missing values, noise, outliers, inconsistencies.
2. Outlier Removal: Z-Score Method, IQR Method.
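Example (a minimal NumPy sketch of both methods; the sample is made up, and the 3 / 1.5 thresholds are the usual conventions, not fixed rules):

import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 13, 12, 10, 11, 12, 13, 95])  # 95 is the outlier

# Z-Score method: keep points within 3 standard deviations of the mean
z = (x - x.mean()) / x.std()
print(x[np.abs(z) < 3])

# IQR method: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)])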
3. Concept Hierarchy: Organizing data into levels of abstraction. E.g., Country > State > City.
4. Dimensionality Reduction: Reduce features. Important for efficiency and avoiding overfitting.
5. Feature Extraction Examples: Image Processing, Speech Recognition.
6. Variable Selection: Filter, Wrapper, Embedded Methods.
7. Variable Ranking: Ordering features based on relevance.
8. Objectives of LDA: Maximize class separation, reduce dimensions.
9. PCA: Projects data onto principal components to reduce dimensions.
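Example (a minimal scikit-learn sketch; the random data and the choice of 2 components are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # hypothetical data: 100 samples, 5 features
pca = PCA(n_components=2)                # keep the top 2 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component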
10. Factor Analysis: Identify underlying relationships among variables.
11. Cross-Validation: Evaluates a model's performance by training and testing on different splits of the data (e.g., k-fold).
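Example (a minimal 5-fold cross-validation sketch; the iris data and the classifier are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)  # 5 train/test splits
print(scores.mean())  # average accuracy over the held-out folds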
12. Resampling Methods: Estimate model accuracy and stability by repeatedly drawing samples from the data (e.g., bootstrapping).
Unit 3 - Data Mining Models
1. Regression Models Pros & Cons
Pros: Simple and interpretable; predicts continuous values. Cons: Sensitive to outliers; assumes a fixed functional form.
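Example (a minimal scikit-learn sketch on synthetic data; the outlier in the last target value illustrates the sensitivity noted above):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.0, 4.1, 5.9, 8.2, 30.0])  # last value is an outlier
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # slope pulled upward by the outlier
print(model.predict([[6]]))               # continuous-valued prediction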
2. Types of Association Rule Mining: Single-dimensional, Multi-dimensional, Quantitative.
3. Decision Tree Induction: Build tree based on attribute selection (e.g., ID3, C4.5).
4. Bayes Theorem: P(A|B) = P(B|A)*P(A)/P(B).
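Worked example (with made-up numbers for a hypothetical disease test): if P(Disease) = 0.01, P(Positive|Disease) = 0.9, and P(Positive) = 0.05, then P(Disease|Positive) = 0.9 * 0.01 / 0.05 = 0.18.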
5. Constraints in ARM: Knowledge, Data, Rule constraints.
6. Support Vector Machine: Classifier that maximizes margin.
7. Decision Tree Parameters: Entropy, Information Gain, Gini Index.
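Example (a minimal NumPy sketch for a made-up class distribution; information gain is the parent's entropy minus the weighted entropy of the children after a split):

import numpy as np

p = np.array([0.5, 0.5])            # hypothetical class proportions at a node
entropy = -np.sum(p * np.log2(p))   # H = -sum(p_i * log2 p_i) -> 1.0 bit here
gini = 1 - np.sum(p ** 2)           # Gini = 1 - sum(p_i^2)    -> 0.5 here
print(entropy, gini)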
8. Gaussian Mixture Steps: Initialization, E-Step (compute responsibilities), M-Step (re-estimate means, covariances, and weights), repeat until convergence.
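Example (a minimal scikit-learn sketch; the synthetic blobs and 2 components are assumptions):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=2, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # runs EM internally
print(gmm.means_)          # estimated component means
print(gmm.predict(X[:5]))  # hard assignments for the first few points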
9. K-NN Phases: Feature selection, Distance calculation, Voting.
10. K Value in K-NN: Controls the bias-variance trade-off: small k gives low bias but high variance; large k smooths predictions but can underfit. Usually chosen by cross-validation.
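Example (a minimal K-NN sketch; iris data and k = 5 are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)  # k = 5: distance computation + majority voting
knn.fit(X_tr, y_tr)
print(knn.score(X_te, y_te))               # in practice, compare several k values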
Unit 4 - Clustering
1. Partitioning Clustering: Divides dataset into exclusive clusters (e.g., K-Means).
2. K-Means vs K-Medoid
K-Means: Uses mean, sensitive to outliers.
K-Medoid: Uses medoid, robust.
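Example (a minimal K-Means sketch; the blobs and k = 3 are assumptions, and K-Medoid needs a separate package such as scikit-learn-extra):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # cluster means (not necessarily actual data points)
print(km.labels_[:10])      # one exclusive cluster label per sample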
3. Density-Based Clustering: Groups dense regions.
4. DBSCAN: Clusters arbitrary shapes, handles noise.
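Example (a minimal DBSCAN sketch; make_moons gives non-spherical clusters, and eps / min_samples are tuning assumptions):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)  # density parameters
print(set(db.labels_))                      # cluster ids; -1 marks noise points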
5. EM Steps: E-Step, M-Step, Repeat.
6. Hierarchical Clustering: Builds tree (e.g., agglomerative clustering).
7. Agglomerative vs Divisive
Agglomerative: Bottom-up.
Divisive: Top-down.
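Example (a minimal agglomerative sketch; scikit-learn offers only the bottom-up variant, so the divisive side is not shown):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')  # repeatedly merge closest clusters
print(agg.fit_predict(X)[:10])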
8. Fuzzy C-Means: Soft clustering: each point receives a degree of membership in every cluster rather than a single hard assignment.
9. Matching Methods
K-Means: Partitioning.
DBSCAN: Density-Based.
Hierarchical: Hierarchical.
10. Features of BIRCH, ROCK, Chameleon
BIRCH: Incremental clustering of large datasets using a compact CF-tree summary.
ROCK: Link-based clustering designed for categorical data.
Chameleon: Dynamic modeling based on cluster interconnectivity and closeness.
Unit 5 - Neural Networks
1. ANN: Computational model inspired by brain.
2. Backpropagation: Updates weights by propagating error.
3. Input Layer: Receives raw data.
4. Hyperparameters: Settings like learning rate, batch size.
5. Optimizers: SGD, Adam.
6. Learning Rate: Controls step size in gradient descent.
7. AND Gate with Perceptron: w1 = w2 = 1, bias = -1.5; output = step(x1 + x2 - 1.5), which is 1 only when both inputs are 1.
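A quick NumPy check that these values implement AND (a step activation is assumed):

import numpy as np

w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    out = int(np.dot(w, x) + b > 0)  # step activation: fire if weighted sum > 0
    print(x, '->', out)              # only (1, 1) fires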
8. Loss Functions: MSE, Cross-Entropy, Hinge.
9. Training vs Validation
Training: Model learns.
Validation: Model is evaluated.
10. Forward Propagation in MLP: Pass input through layers, apply weights, activations.
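Example (a minimal NumPy forward pass for a 2-3-1 network; the random weights and sigmoid activation are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)  # input -> hidden layer (2 -> 3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)  # hidden -> output layer (3 -> 1)

x = np.array([0.5, -0.2])    # one input sample
h = sigmoid(x @ W1 + b1)     # hidden layer: weighted sum + activation
y = sigmoid(h @ W2 + b2)     # output layer
print(y)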