Module 1: Algorithm Analysis, Array Applications and Linked Lists - Data Mining Section
KDD vs Data Mining
KDD (Knowledge Discovery in Databases) is the overall process of discovering
useful knowledge from data.
Data Mining is one of the steps in KDD.
KDD Steps:
1. Selection
2. Preprocessing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation
Difference:
- KDD = Full process; Data Mining = Specific pattern-extraction step.
- KDD includes domain knowledge and interpretation of results; Data Mining focuses only on
applying algorithms to extract patterns.
Stages of the Data Mining Process
1. Data Cleaning – Remove noise and inconsistent data.
2. Data Integration – Combine data from multiple sources.
3. Data Selection – Retrieve relevant data from the database.
4. Data Transformation – Convert data into suitable format.
5. Data Mining – Apply algorithms to extract patterns.
6. Pattern Evaluation – Identify interesting patterns based on measures.
7. Knowledge Presentation – Use visualization and representation techniques.
Task Primitives
These are the parameters a user specifies to define a data mining task:
- Task-relevant data: the portion of the database to be mined.
- Kind of knowledge to be mined: e.g., classification, clustering.
- Background knowledge: domain knowledge used.
- Interestingness measures: thresholds to find useful results.
- Presentation: how the output should be displayed.
- Data mining techniques: e.g., decision trees, neural networks.
Data Mining Techniques
- Classification: Predict categorical class labels.
- Clustering: Group similar data items.
- Association Rule Mining: Find interesting relationships (e.g., Market Basket
Analysis).
- Regression: Predict continuous values.
- Outlier Detection: Find data that deviate significantly.
- Sequential Pattern Mining: Discover regular sequences.
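As a small illustration of association rule mining, the support and confidence of a rule can be computed directly from a set of transactions. This is a minimal sketch with invented toy baskets, not a full algorithm such as Apriori:

```python
# Toy market-basket transactions (invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) = support(A ∪ C) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))          # 0.5 (2 of 4 baskets)
print(confidence({"bread"}, {"milk"}, transactions))     # ~0.667 (2 of 3 bread baskets)
```

A real miner would enumerate candidate itemsets and keep only rules above minimum support and confidence thresholds; the two measures themselves are exactly these ratios.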
Data Mining Knowledge Representation
Ways to represent the mined knowledge:
- Decision Trees: Tree-like structure representing decisions.
- Rules: IF-THEN patterns.
- Graphs and Networks: For relational or network data.
- Tables and Matrices: Common in reporting tools.
- Visualizations: Charts, graphs, dashboards.
Major Issues in Data Mining
- Scalability: Can it handle large datasets?
- High Dimensionality: With many features, distances lose meaning and models overfit.
- Data Quality: Noisy or missing data can affect results.
- Data Privacy & Security: Sensitive information must be protected.
- Real-Time Mining: Some applications need instant results.
- Integration with existing systems: Can it be used in live applications?
Measurement and Data
- Data Types:
- Nominal (e.g., gender),
- Ordinal (e.g., ranks),
- Interval (e.g., temperature),
- Ratio (e.g., age, salary).
- Data measurement affects:
- Type of algorithm used.
- Statistical tests applied.
- Interpretation of results.
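For example, the measurement scale determines which summary statistics are meaningful: only the mode for nominal data, the median for ordinal data, and the mean (and ratios) for ratio data. A small sketch with made-up values:

```python
from statistics import mode, median, mean

colors = ["red", "blue", "red", "green"]   # nominal: only the mode is meaningful
ranks  = [1, 3, 2, 2, 5]                   # ordinal: median (and mode) apply
ages   = [23, 31, 27, 45]                  # ratio: mean, ratios, etc. all apply

print(mode(colors))    # 'red'
print(median(ranks))   # 2
print(mean(ages))      # 31.5
```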
Data Preprocessing
It prepares raw data for mining. Steps include:
- Data Cleaning: Removing errors, filling missing values.
- Data Integration: Combining from multiple sources.
- Data Transformation: Normalizing, aggregating.
- Data Reduction: Reducing volume without losing info.
Preprocessing improves mining accuracy and speed.
Data Cleaning
- Deals with missing, inconsistent, duplicate, or noisy data.
- Techniques:
- Imputation: Fill missing values using mean/median/mode.
- Smoothing: Remove noise via binning, regression.
- Deduplication: Remove repeated records.
- Correction: Use external references to fix errors.
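Two of the techniques above, mean imputation and smoothing by bin means, can be sketched in a few lines of pure Python (the values below are invented; 90 plays the role of a noisy point):

```python
# Mean imputation and smoothing by bin means (toy data, minimal sketch).
values = [12.0, None, 15.0, 14.0, None, 90.0]   # None marks a missing value

# Imputation: replace missing entries with the mean of the observed values.
observed = [v for v in values if v is not None]
mean_val = sum(observed) / len(observed)
imputed = [v if v is not None else mean_val for v in values]

def smooth_by_bin_means(data, n_bins):
    """Sort, split into equal-size bins, replace each value by its bin mean."""
    data = sorted(data)
    size = len(data) // n_bins
    out = []
    for i in range(n_bins):
        bin_ = data[i * size:(i + 1) * size] if i < n_bins - 1 else data[i * size:]
        out.extend([sum(bin_) / len(bin_)] * len(bin_))
    return out

print(imputed)                          # missing slots filled with the mean
print(smooth_by_bin_means(imputed, 3))  # each value replaced by its bin mean
```

Binning dampens the influence of the noisy 90 by averaging it with its nearest neighbours; median or mode imputation works the same way with a different fill statistic.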
Data Transformation
- Convert data into format suitable for mining.
- Techniques:
- Normalization: Scale values (e.g., min-max).
- Discretization: Convert continuous data into intervals.
- Aggregation: Summarize data (e.g., total sales).
- Encoding: Convert categories into numbers.
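Min-max normalization and equal-width discretization from the list above can be sketched as follows (salary figures are invented sample values):

```python
# Min-max normalization to [0, 1] and equal-width discretization (sketch).
salaries = [30_000, 45_000, 60_000, 90_000]   # invented sample values

lo, hi = min(salaries), max(salaries)
normalized = [(x - lo) / (hi - lo) for x in salaries]
print(normalized)   # [0.0, 0.25, 0.5, 1.0]

def discretize(x, lo, hi, n_bins=3):
    """Map x to one of n_bins equal-width interval labels 0..n_bins-1."""
    width = (hi - lo) / n_bins
    return min(int((x - lo) // width), n_bins - 1)   # clamp the max into the last bin

print([discretize(x, lo, hi) for x in salaries])     # [0, 0, 1, 2]
```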
Feature Selection
- Identify the most relevant features.
- Reduces overfitting, improves accuracy and speed.
- Techniques:
- Filter methods: Use statistical scores (e.g., chi-square).
- Wrapper methods: Use learning algorithm to test subsets.
- Embedded methods: Selection built into model training (e.g., LASSO).
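A filter method can be sketched by scoring each feature with a chi-square statistic on its feature-value × class contingency table; the counts below are invented. A high score means the feature's value is strongly associated with the class:

```python
# Filter-style feature scoring with a chi-square statistic (toy sketch).
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(c) for c in zip(*observed)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / total   # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

# Feature A separates the two classes cleanly; feature B does not.
table_a = [[20, 0], [0, 20]]     # rows: feature values, columns: class counts
table_b = [[10, 10], [10, 10]]
print(chi_square(table_a))  # 40.0 -> strong association, keep feature A
print(chi_square(table_b))  # 0.0  -> feature B carries no class information
```

A filter would compute this score per feature and keep the top-k, without ever training a model, which is what distinguishes it from wrapper and embedded methods.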
Dimensionality Reduction
- Reduce number of input variables.
- Helps visualization, removes redundant info.
- Techniques:
- PCA (Principal Component Analysis): Linear combinations of features.
- t-SNE: For visualization in 2D/3D.
- Autoencoders: Neural networks for feature compression.
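PCA can be sketched directly with NumPy via SVD of the centered data matrix; the dataset here is random toy data with one deliberately redundant column:

```python
import numpy as np

# PCA via SVD of the centered data matrix (minimal sketch, random toy data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.01 * rng.normal(size=100)   # column 1 is nearly redundant

Xc = X - X.mean(axis=0)                  # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)          # variance ratio per principal component

k = 2
X_reduced = Xc @ Vt[:k].T                # project onto the top-k components
print(X_reduced.shape)                   # (100, 2)
print(explained.round(3))                # ratios sum to 1, sorted descending
```

Because column 1 is a near-copy of column 0, the first component absorbs their shared variance, which is exactly the redundancy removal described above.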