Data Mining Module 1 Theory

The document outlines the process of Knowledge Discovery in Databases (KDD) and its steps, emphasizing that Data Mining is a specific step within KDD focused on pattern extraction. It details the stages of the Data Mining process, various techniques used (such as classification and clustering), and the importance of data preprocessing, cleaning, and transformation. Additionally, it addresses major issues in data mining, including scalability and data quality, and discusses feature selection and dimensionality reduction techniques.

Module 1: Algorithm Analysis, Array Applications and Linked Lists - Data Mining Section

KDD vs Data Mining
KDD (Knowledge Discovery in Databases) is the overall process of discovering
useful knowledge from data.
Data Mining is one of the steps in KDD.

KDD Steps:
1. Selection
2. Preprocessing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation

Difference:
- KDD = Full process; Data Mining = Specific pattern-extraction step.
- KDD includes domain knowledge and interpretation; Data Mining focuses only on
applying algorithms to extract patterns.
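
As a rough illustration of how the five KDD steps chain together, the sketch below runs a toy pipeline in Python. The records, the function names, and the `min_count` threshold are hypothetical and chosen only for this example; each real step is far more involved.

```python
from collections import Counter

# Hypothetical raw data: one purchase per customer.
raw_records = [
    {"customer": "a", "age": 23, "item": "bread"},
    {"customer": "b", "age": None, "item": "milk"},
    {"customer": "c", "age": 41, "item": "bread"},
    {"customer": "d", "age": 35, "item": "bread"},
    {"customer": "e", "age": 67, "item": "milk"},
]

def select(records):
    # 1. Selection: keep only the attributes relevant to the question.
    return [{"age": r["age"], "item": r["item"]} for r in records]

def preprocess(records):
    # 2. Preprocessing: drop records with a missing age.
    return [r for r in records if r["age"] is not None]

def transform(records):
    # 3. Transformation: discretize age into two coarse groups.
    return [("under_40" if r["age"] < 40 else "40_plus", r["item"])
            for r in records]

def mine(pairs):
    # 4. Data Mining: count how often each (age group, item) pair occurs.
    return Counter(pairs)

def evaluate(patterns, min_count=2):
    # 5. Interpretation/Evaluation: keep only patterns seen often enough.
    return {p: n for p, n in patterns.items() if n >= min_count}

knowledge = evaluate(mine(transform(preprocess(select(raw_records)))))
print(knowledge)
```

Only the fourth function is "data mining" in the narrow sense; the other four are the surrounding KDD work.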

Stages of the Data Mining Process


1. Data Cleaning – Remove noise and inconsistent data.
2. Data Integration – Combine data from multiple sources.
3. Data Selection – Retrieve relevant data from the database.
4. Data Transformation – Convert data into a format suitable for mining.
5. Data Mining – Apply algorithms to extract patterns.
6. Pattern Evaluation – Identify interesting patterns based on interestingness measures.
7. Knowledge Presentation – Use visualization and representation techniques.

Task Primitives
These are the basic parameters used to define a data mining task:
- Kind of knowledge to be mined: e.g., classification, clustering.
- Background knowledge: domain knowledge used.
- Interestingness measures: thresholds to find useful results.
- Presentation: how the output should be displayed.
- Data mining techniques: e.g., decision trees, neural networks.
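
One way to picture the primitives is as the fields of a task specification. The sketch below is a hypothetical Python `dataclass`; the class name, field names, and example values are assumptions made for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass, field

@dataclass
class MiningTask:
    # Hypothetical container: one field per task primitive listed above.
    knowledge_kind: str                                        # e.g. "classification"
    background_knowledge: dict = field(default_factory=dict)   # domain knowledge
    interestingness: dict = field(default_factory=dict)        # thresholds
    presentation: str = "table"                                # output format
    technique: str = "decision_tree"                           # algorithm family

task = MiningTask(
    knowledge_kind="association",
    background_knowledge={"item_hierarchy": {"milk": "dairy"}},
    interestingness={"min_support": 0.4, "min_confidence": 0.7},
    presentation="rules",
    technique="apriori",
)
print(task)
```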

Data Mining Techniques


- Classification: Predict categorical class labels.
- Clustering: Group similar data items.
- Association Rule Mining: Find interesting relationships (e.g., Market Basket
Analysis).
- Regression: Predict continuous values.
- Outlier Detection: Find data that deviate significantly.
- Sequential Pattern Mining: Discover regular sequences.
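
To make one of these techniques concrete, here is a minimal sketch of the two standard measures behind association rules, support and confidence, on a made-up market basket. The transactions and helper names are assumptions for illustration; a full algorithm such as Apriori would also search for candidate itemsets, which this sketch does not do.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def count(itemset):
    # Number of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    # Fraction of all transactions that contain the itemset.
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction
    # that also contain the consequent.
    both = set(antecedent) | set(consequent)
    return count(both) / count(antecedent)

rule_support = support({"bread", "milk"})          # 3 of 5 baskets
rule_confidence = confidence({"bread"}, {"milk"})  # 3 of 4 bread baskets
print(rule_support, rule_confidence)
```

A rule such as "bread implies milk" is usually reported only if both values clear user-chosen thresholds.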

Data Mining Knowledge Representation


Ways to represent the mined knowledge:
- Decision Trees: Tree-like structure representing decisions.
- Rules: IF-THEN patterns.
- Graphs and Networks: For relational or network data.
- Tables and Matrices: Common in reporting tools.
- Visualizations: Charts, graphs, dashboards.
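
As a small example of the rule representation, the sketch below stores IF-THEN rules as plain Python data and applies the first rule whose conditions all match. The weather attributes, the labels, and the `classify` helper are hypothetical.

```python
# Each rule is (conditions, conclusion): IF all conditions hold THEN conclusion.
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "overcast"}, "yes"),
]

def classify(record, rules, default="unknown"):
    # Return the conclusion of the first rule whose conditions all match.
    for conditions, conclusion in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    return default

print(classify({"outlook": "sunny", "humidity": "high"}, rules))
print(classify({"outlook": "rainy", "humidity": "low"}, rules))
```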

Major Issues in Data Mining


- Scalability: Can it handle large datasets?
- High Dimensionality: Too many features make data sparse and raise the risk of overfitting.
- Data Quality: Noisy or missing data can affect results.
- Data Privacy & Security: Sensitive information must be protected.
- Real-Time Mining: Some applications need instant results.
- Integration with existing systems: Can it be used in live applications?

Measurement and Data
- Data Types:
  - Nominal (e.g., gender)
  - Ordinal (e.g., ranks)
  - Interval (e.g., temperature in Celsius)
  - Ratio (e.g., age, salary)
- Data measurement affects:
  - Type of algorithm used.
  - Statistical tests applied.
  - Interpretation of results.
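
The effect of the measurement scale on the statistics that make sense can be sketched with Python's standard `statistics` module. The sample values below are made up, and the pairing of one summary per scale is a common rule of thumb rather than a strict requirement.

```python
import statistics

gender = ["f", "m", "f", "f"]   # nominal: only equality is meaningful
rank = [1, 2, 2, 3, 5]          # ordinal: order is meaningful
temp_c = [10.0, 20.0, 30.0]     # interval: differences are meaningful
salary = [20000, 40000]         # ratio: a true zero makes ratios meaningful

most_common = statistics.mode(gender)     # mode works for nominal data
middle_rank = statistics.median(rank)     # median needs an ordering
average_temp = statistics.mean(temp_c)    # mean needs meaningful differences
salary_ratio = salary[1] / salary[0]      # "twice as much" needs a true zero

print(most_common, middle_rank, average_temp, salary_ratio)
```

Saying that 20 degrees Celsius is "twice as hot" as 10 degrees would be a misuse of an interval scale, whereas a salary of 40000 really is twice a salary of 20000.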

Data Preprocessing
It prepares raw data for mining. Steps include:
- Data Cleaning: Removing errors, filling missing values.
- Data Integration: Combining from multiple sources.
- Data Transformation: Normalizing, aggregating.
- Data Reduction: Reducing data volume while preserving as much information as possible.
Preprocessing improves mining accuracy and speed.

Data Cleaning
- Deals with missing, inconsistent, duplicate, or noisy data.
- Techniques:
  - Imputation: Fill missing values using mean/median/mode.
  - Smoothing: Remove noise via binning, regression.
  - Deduplication: Remove repeated records.
  - Correction: Use external references to fix errors.
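
Three of these techniques can be sketched in a few lines of plain Python. The data values and function names are assumptions for illustration; smoothing is shown as "smoothing by bin means" with equal-size bins, which is only one of several binning variants.

```python
import statistics

# Imputation: replace missing ages (None) with the mean of the known ages.
ages = [25, None, 31, 40, None]
mean_age = statistics.mean(a for a in ages if a is not None)
imputed = [a if a is not None else mean_age for a in ages]

# Smoothing by bin means: sort, split into equal-size bins, and replace
# every value with the mean of its bin.
def smooth_by_bin_means(values, bin_size):
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)

# Deduplication: drop exact repeats while keeping the first occurrence.
records = [("ann", 30), ("bob", 25), ("ann", 30)]
unique_records = list(dict.fromkeys(records))

print(imputed)
print(smoothed_prices)
print(unique_records)
```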

Data Transformation
- Convert data into a format suitable for mining.
- Techniques:
  - Normalization: Scale values to a common range (e.g., min-max).
  - Discretization: Convert continuous data into intervals.
  - Aggregation: Summarize data (e.g., total sales).
  - Encoding: Convert categories into numbers.
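
The four transformations can be illustrated with a short Python sketch. Min-max normalization uses the standard formula (x - min) / (max - min); the bin edges, labels, sales figures, and function names are hypothetical, and the encoding shown is simple label encoding (one-hot encoding is a common alternative).

```python
from collections import defaultdict

def min_max(values):
    # Normalization: rescale values linearly into the range [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, edges, labels):
    # Discretization: each edge is the exclusive upper bound of a label;
    # values at or above the last edge get the final label.
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

def total_by_key(rows):
    # Aggregation: sum the amounts for each key (e.g., sales per region).
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)

def label_encode(categories):
    # Encoding: map each distinct category to an integer code.
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping

scaled = min_max([10, 20, 30, 50])
age_group = discretize(35, edges=[18, 65], labels=["child", "adult", "senior"])
sales = total_by_key([("north", 100), ("south", 50), ("north", 25)])
codes, mapping = label_encode(["red", "green", "red", "blue"])
print(scaled, age_group, sales, codes)
```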

Feature Selection
- Identify the most relevant features.
- Reduces overfitting, improves accuracy and speed.
- Techniques:
  - Filter methods: Rank features by a statistical score (e.g., chi-square).
  - Wrapper methods: Use a learning algorithm to evaluate feature subsets.
  - Embedded methods: Select features as part of model training (e.g., LASSO).
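
As an example of a filter method, the sketch below computes the chi-square statistic between each categorical feature and the class label and ranks the features by that score. The tiny dataset and the feature names are made up; a real filter would also consider degrees of freedom and significance, which are omitted here.

```python
from collections import Counter

def chi_square(feature, labels):
    # Sum over all (feature value, class) cells of
    # (observed - expected)^2 / expected, where expected assumes independence.
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            expected = f_count * l_count / n
            score += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return score

labels = ["yes", "yes", "no", "no"]
features = {
    "predictive": ["a", "a", "b", "b"],   # matches the label perfectly
    "unrelated": ["a", "b", "a", "b"],    # independent of the label
}

scores = {name: chi_square(values, labels) for name, values in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

A higher score means the feature's distribution differs more between classes, so the filter would keep "predictive" and drop "unrelated".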

Dimensionality Reduction
- Reduce number of input variables.
- Helps visualization, removes redundant info.
- Techniques:
  - PCA (Principal Component Analysis): Projects data onto linear combinations of features that capture the most variance.
  - t-SNE: Nonlinear embedding, used mainly for visualization in 2D/3D.
  - Autoencoders: Neural networks for feature compression.
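
To show what PCA does in the simplest case, the sketch below reduces 2D points to 1D using only the standard library. It relies on the closed-form principal axis of a 2x2 covariance matrix, so it is an illustration for two features only; the data points and the function name `pca_2d` are assumptions, and practical work would use a linear algebra library.

```python
import math

def pca_2d(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(dx * dx for dx, _ in centered) / (n - 1)
    c = sum(dy * dy for _, dy in centered) / (n - 1)
    b = sum(dx * dy for dx, dy in centered) / (n - 1)

    # Largest eigenvalue and the angle of its eigenvector (principal axis).
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    theta = 0.5 * math.atan2(2 * b, a - c)
    direction = (math.cos(theta), math.sin(theta))

    # Project each centered point onto the principal axis: 2 features -> 1.
    projected = [dx * direction[0] + dy * direction[1] for dx, dy in centered]
    explained = largest / (a + c)  # share of total variance that is kept
    return direction, projected, explained

# Hypothetical points lying close to the line y = x.
points = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
direction, projected, explained = pca_2d(points)
print(direction, projected, explained)
```

Because the two features are almost perfectly correlated, a single component keeps nearly all of the variance, which is the sense in which PCA removes redundant information.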
