DATA MINING TECHNIQUES USING R
UNIT I: An overview of the Data Warehouse, Data mining: KDD
versus data mining, Stages of the Data Mining Process,
Task primitives, Data Mining Techniques, Data mining
knowledge representation.
Data Warehouse
• Designed for storing and analyzing large volumes of historical data
from various sources for business intelligence and reporting
• A data warehouse is a centralized and organized repository of data
EXAMPLE
Amazon Redshift, Google BigQuery
Data Mining: KDD versus Data Mining
WHAT IS DATA MINING?
Data mining is the process of searching and analyzing a large batch of
raw data in order to identify patterns and extract useful information.
WHY DATA MINING?
• Discover Hidden Patterns
• Improve Decision-Making
• Enhance Customer Understanding
• Increase Efficiency
Data Mining: KDD versus Data Mining
• While the terms are often used interchangeably, there is a subtle
difference. KDD is the entire process of uncovering useful information
from data, while data mining is just one step within that process.
• Here’s a breakdown of the KDD process:
• Data Selection: Identify the relevant data for your project.
• Data Preprocessing: Clean and prepare the data for mining.
• Data Transformation: Transform the data into a format suitable for
mining algorithms.
• Data Mining: Apply algorithms to extract patterns and trends from
the data.
• Pattern Evaluation: Evaluate the validity and usefulness of the
discovered patterns.
• Knowledge Representation: Present the discovered knowledge in a
clear and understandable way.
Stages of the Data Mining Process and Task Primitives
1. Data Cleaning
Handling missing values, removing noise, correcting inconsistencies.
Example: In a customer database, some records might have missing age
values. These can be filled using the mean age or a default value.
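The mean-imputation example above can be sketched in base R. This is a minimal illustration, assuming a hypothetical `customers` data frame with an `age` column; the names and values are invented for the example.

```r
# Hypothetical customer records; NA marks a missing age.
customers <- data.frame(
  id  = 1:5,
  age = c(25, NA, 35, 40, NA)
)

# Mean of the observed ages, ignoring missing values.
mean_age <- mean(customers$age, na.rm = TRUE)

# Replace each missing age with that mean.
customers$age[is.na(customers$age)] <- mean_age

print(customers)
```

Mean imputation is simple but shrinks the variance of the column, so it is usually only a first pass.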
2. Data Integration
Combining data from multiple sources.
Example: Integrating customer data from a CRM system with sales
data from an ERP system to create a comprehensive dataset for
analysis.
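A minimal sketch of the CRM and ERP integration example, assuming two hypothetical data frames (`crm` and `sales`) that share a `customer_id` key. The table and column names are illustrative only.

```r
# Hypothetical CRM extract: one row per customer.
crm <- data.frame(
  customer_id = c(1, 2, 3),
  name        = c("Asha", "Ben", "Chitra"),
  stringsAsFactors = FALSE
)

# Hypothetical ERP extract: one row per sale.
sales <- data.frame(
  customer_id = c(1, 1, 3, 4),
  amount      = c(120, 80, 200, 50)
)

# Inner join on the shared key keeps only customers present in both sources.
combined <- merge(crm, sales, by = "customer_id")

print(combined)
```

Passing `all.x = TRUE` to `merge()` would instead keep customers that have no sales, with `NA` in the sales columns.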
3. Data Selection
Selecting relevant data for analysis.
Example: Selecting only the transaction records of the last two years
from a retail sales database to analyze recent purchasing trends.
4. Data Transformation
Normalization, aggregation, data type conversion.
Example: Normalizing the sales amount field to a common scale or
aggregating daily sales data to monthly sales data.
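Both transformations in the example can be sketched in base R. The `daily` data frame and its values are hypothetical, and min-max scaling is just one common way to bring a field to a common scale.

```r
# Hypothetical daily sales for two months.
daily <- data.frame(
  date   = as.Date(c("2024-01-05", "2024-01-20", "2024-02-03", "2024-02-14")),
  amount = c(100, 300, 200, 400)
)

# Min-max normalization: rescale amounts to the range [0, 1].
rng <- range(daily$amount)
daily$amount_scaled <- (daily$amount - rng[1]) / (rng[2] - rng[1])

# Aggregation: sum daily amounts into monthly totals.
daily$month <- format(daily$date, "%Y-%m")
monthly <- aggregate(amount ~ month, data = daily, FUN = sum)

print(daily)
print(monthly)
```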
5. Data Mining
Applying data mining techniques to extract patterns.
Example: Using the Apriori algorithm to find frequent itemsets in a
transactional database.
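To make "frequent itemsets" concrete, the sketch below counts item pairs directly in base R. It is not the Apriori algorithm itself, which builds and prunes candidate itemsets level by level; in practice a package such as `arules` would typically be used. The baskets and the `min_support` threshold are assumptions made for illustration.

```r
# Hypothetical transactions: each element is one customer's basket.
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("butter", "milk"),
  c("bread", "butter", "jam")
)

min_support <- 0.4  # assumed threshold: pair must appear in at least 40% of baskets

# List every unordered item pair in each basket (items sorted so "a,b" == "b,a").
pairs <- unlist(lapply(transactions, function(basket) {
  items <- sort(unique(basket))
  if (length(items) < 2) return(character(0))
  combn(items, 2, FUN = paste, collapse = ",")
}))

# Support of a pair = share of baskets that contain it.
pair_support <- table(pairs) / length(transactions)

frequent_pairs <- pair_support[pair_support >= min_support]
print(frequent_pairs)
```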
6. Pattern Evaluation
Identifying truly interesting patterns that represent knowledge.
Example: Evaluating association rules generated by the Apriori
algorithm to determine which ones have the highest confidence and
support values.
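Support and confidence can be computed by hand for a single rule, which helps when judging what a mining tool reports. A minimal base R sketch follows; the baskets and the helper names `contains` and `rule_metrics` are hypothetical.

```r
# Hypothetical baskets, one per transaction.
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("butter", "milk"),
  c("bread", "butter", "jam")
)

# TRUE for each basket that contains every item in `items`.
contains <- function(items) {
  vapply(transactions, function(basket) all(items %in% basket), logical(1))
}

# Support and confidence of the rule lhs -> rhs.
rule_metrics <- function(lhs, rhs) {
  support_both <- mean(contains(c(lhs, rhs)))  # share of baskets with lhs and rhs
  support_lhs  <- mean(contains(lhs))          # share of baskets with lhs
  c(support = support_both, confidence = support_both / support_lhs)
}

print(rule_metrics("bread", "butter"))
print(rule_metrics("jam", "bread"))
```

Here support is the share of all baskets containing both sides of the rule, and confidence is the share of baskets with the left-hand side that also contain the right-hand side.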
7. Knowledge Presentation
Visualization, reporting.
Example: Creating a dashboard that shows the most frequent item
pairs purchased together and their association rules in a retail store.
Task primitives:
Data mining task primitives are the essential components that guide the
data mining process. They provide a structured approach to extracting
meaningful insights from data.
Key Primitives
1. Set of Task-Relevant Data:
Defines the specific data used for the mining process.
Involves data selection, cleaning, and preprocessing.
Data (customer churn example): customer information (demographics, subscription
details, usage patterns, billing history, etc.) and churn data (customers who
left the company).
Example: Selecting data for customers who have been with the company
for more than six months and have a monthly bill exceeding $50.
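The selection in this example can be written as a simple filter. The sketch below assumes a hypothetical `customers` data frame with `tenure_months` and `monthly_bill` columns; none of these names come from a real dataset.

```r
# Hypothetical telecom customer table.
customers <- data.frame(
  customer_id   = 1:5,
  tenure_months = c(3, 12, 8, 24, 5),
  monthly_bill  = c(70, 45, 65, 90, 30),
  churned       = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Keep customers with more than six months of tenure and a bill above $50.
relevant <- subset(customers, tenure_months > 6 & monthly_bill > 50)

print(relevant)
```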
2. Kind of Knowledge to be Mined:
• Specifies the type of patterns or information to be extracted.
• Common types include:
• Descriptive: Summarizing data characteristics (e.g., statistics, trends).
• Predictive: Building models to predict future values (e.g., classification,
regression).
• Associative: Discovering relationships between items (e.g., market basket
analysis).
• Cluster Analysis: Grouping similar data points (e.g., customer segmentation).
• Outlier Detection: Identifying unusual data points (e.g., fraud detection).
3. Background Knowledge:
• Incorporates domain expertise or prior information to guide the
mining process.
• Can improve accuracy and efficiency.
Domain expertise: Understanding of customer behavior, telecom
industry trends, competitor offerings.
Example: Incorporating information about recent network outages or
new competitor plans.
4. Interestingness Measures and Thresholds:
• Evaluates the significance of discovered patterns.
• Helps filter out uninteresting or redundant patterns.
• Example: Support and confidence measures for association rules.
5. Representation for Visualizing Discovered Patterns:
• Determines how the mined patterns are presented.
• Includes charts, graphs, tables, decision trees, and other visual
formats.
• Example: Using a decision tree to visualize a classification model.
Data Mining Techniques
Data mining techniques are methods used to discover patterns,
relationships, and insights from large sets of data.
Here are the main techniques used in data mining:
• Classification: Sorting data into categories. For example, categorizing
emails as spam or not spam.
• Clustering: Grouping similar items together. For example, grouping
customers with similar buying habits.
• Regression: Predicting a continuous value based on other variables.
For example, predicting house prices based on factors like location, size,
and number of bedrooms.
• Association: Finding rules that show relationships between items. For
example, finding that people often buy bread and butter together.
• Prediction: Using past data to predict future outcomes. For example,
predicting which products will be popular next season.
• Sequential Patterns: Identifying patterns in data that occur in a
specific order. For example, finding that customers who buy a phone
often buy a case shortly after.
• Decision Trees: Using a tree-like model to make decisions based on
data. For example, deciding whether to approve a loan based on a set
of criteria.
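Two of the techniques above, regression and clustering, can be sketched with functions that ship with base R (`lm` and `kmeans`). All data below is invented for illustration, and the choice of two clusters is an assumption rather than something derived from the data.

```r
set.seed(42)  # kmeans uses random starting centers

# Regression: hypothetical houses with size, bedrooms, and sale price.
houses <- data.frame(
  size_sqft = c(800, 1000, 1200, 1500, 1800, 2000),
  bedrooms  = c(1, 2, 2, 3, 3, 4),
  price     = c(141000, 169000, 191500, 229000, 261000, 289500)
)

# Fit a linear model and predict the price of an unseen house.
fit <- lm(price ~ size_sqft + bedrooms, data = houses)
new_house <- data.frame(size_sqft = 1400, bedrooms = 3)
predicted_price <- predict(fit, newdata = new_house)
print(predicted_price)

# Clustering: hypothetical customers described by two buying-habit measures.
customers <- data.frame(
  visits_per_month = c(1, 2, 1, 2, 10, 11, 12, 10),
  avg_spend        = c(20, 25, 22, 18, 200, 210, 190, 205)
)

# Group customers into two clusters; nstart repeats the random initialization.
clusters <- kmeans(customers, centers = 2, nstart = 10)
print(clusters$cluster)
```

Because `avg_spend` has a much larger range than `visits_per_month`, it dominates the distance calculation here; scaling the columns first (for example with `scale()`) is a common precaution.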
Data Mining Knowledge Representation
In data mining, knowledge representation is crucial for effectively
interpreting, visualizing, and utilizing the insights and patterns derived
from data.
Common visual representations include:
• Bar graph
• Histogram
• Pie chart
• Scatter plot
• Line chart
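Each of these chart types has a base R graphics function. The sketch below draws all five from a small hypothetical `sales` table and writes them to a temporary PDF so that it also runs without a display. The data and titles are illustrative only.

```r
# Hypothetical monthly sales figures.
sales <- data.frame(
  month  = c("Jan", "Feb", "Mar", "Apr"),
  amount = c(120, 150, 90, 180),
  visits = c(30, 42, 25, 50)
)

# Draw into a temporary PDF so the sketch also runs without a display.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

barplot(sales$amount, names.arg = sales$month, main = "Bar graph")
hist(sales$amount, main = "Histogram", xlab = "Amount")
pie(sales$amount, labels = sales$month, main = "Pie chart")
plot(sales$visits, sales$amount, main = "Scatter plot",
     xlab = "Visits", ylab = "Amount")
plot(seq_along(sales$amount), sales$amount, type = "l", main = "Line chart",
     xlab = "Month index", ylab = "Amount")

invisible(dev.off())  # close the device and write the file
```

In an interactive R session the `pdf()` and `dev.off()` calls can be dropped and the plots will appear in the plotting window.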
LOESS curves
• LOESS (Locally Estimated Scatterplot Smoothing) is a non-parametric
regression method used to fit a smooth curve to a scatterplot
of data points.
• LOESS curves are often used in real-world scenarios where data is
noisy and the relationship between variables is complex and
non-linear.
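R provides LOESS through the `loess()` function in the built-in `stats` package. The sketch below fits a curve to synthetic noisy data; the sine-wave signal, the noise level, and the `span` value are assumptions chosen for illustration.

```r
set.seed(1)  # reproducible noise

# Hypothetical noisy, non-linear data: a sine wave plus random noise.
x <- seq(0, 2 * pi, length.out = 200)
truth <- sin(x)
y <- truth + rnorm(length(x), sd = 0.3)

# Fit a LOESS curve; `span` controls how much of the data each local fit uses.
fit <- loess(y ~ x, span = 0.3)
smoothed <- predict(fit)

# Compare how far the raw points and the smoothed curve sit from the signal.
mse_raw      <- mean((y - truth)^2)
mse_smoothed <- mean((smoothed - truth)^2)
print(c(raw = mse_raw, smoothed = mse_smoothed))
```

A larger `span` gives a smoother curve that follows the data less closely, while a smaller `span` tracks local variation more tightly.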