Data Mining-Introduction
Data Mining
10/03/21
Chapter 1: Introduction
◻ Summary
1.1 Why Data Mining?
Evolution of Information Technology
Data Collection and Database Creation (1960s and earlier)
🞑 Primitive file processing
“How can I analyze these data?”
◻ Summary
1.2 What is Data Mining?
Knowledge Discovery from Data (KDD) Process
🞑 Data cleaning
🞑 Data integration
🞑 Data selection
🞑 Data transformation
🞑 Data mining
🞑 Pattern evaluation
🞑 Knowledge presentation
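The seven steps above can be sketched as a chain of small functions. This is only an illustration, not a prescribed implementation: the two data sources, the field names, and the "count co-occurrences" mining step are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 23, "item": "laptop"},
            {"id": 2, "age": None, "item": "phone"}]
source_b = [{"id": 3, "age": 27, "item": "laptop"},
            {"id": 4, "age": 45, "item": "camera"}]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: discretize age into coarse groups."""
    return [{"age_group": "20..29" if 20 <= r["age"] <= 29 else "other",
             "item": r["item"]} for r in records]

def mine(records):
    """Data mining: count (age_group, item) co-occurrences."""
    return Counter((r["age_group"], r["item"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{group} buys {item} (count={c})"
            for (group, item), c in sorted(patterns.items())]

data = integrate(clean(source_a), clean(source_b))
data = transform(select(data, ["age", "item"]))
report = present(evaluate(mine(data), min_count=2))
print(report)
```

In practice the steps are iterative rather than strictly sequential; the linear chain here is a simplification.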
Knowledge Discovery from Data (KDD) Process
[Figure: the KDD process. Recoverable labels: Databases, Data Cleaning, Data Integration, Task-relevant Data, Data Mining.]
Chapter 1. Introduction
◻ Summary
1.3 What Kinds of Data Can Be Mined?
◻ Data mining can be applied to any kind of data, as long as the data are meaningful for the target
application.
◻ Database Data:
🞑 DBMS – a collection of interrelated data (the database) and a set of programs to manage and
access the data
🞑 RDBMS – a collection of tables, each with a unique name
■ Table – has a set of attributes (columns/fields) and stores tuples (records/rows)
■ Unique key – identifies each tuple; the schema is often designed with an ER model
🞑 RDBMS – accessed by database queries (SQL)
■ Query – expressed with relational operations such as join, selection, and projection
🞑 Mining an RDBMS – search for trends or data patterns
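The relational operations named above (join, selection, projection) can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the two tables, their columns, and the sample rows are hypothetical and chosen only to make the query short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables, each with a unique key.
cur.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.execute("CREATE TABLE purchases (trans_id TEXT PRIMARY KEY, cust_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Ann", 24), (2, "Bob", 41)])
cur.executemany("INSERT INTO purchases VALUES (?, ?, ?)",
                [("T100", 1, 250.0), ("T200", 2, 80.0), ("T300", 1, 40.0)])

# Join (JOIN ... ON), selection (WHERE), and projection (SELECT name, amount).
rows = cur.execute(
    """SELECT c.name, p.amount
       FROM customer AS c
       JOIN purchases AS p ON p.cust_id = c.cust_id
       WHERE p.amount > 50
       ORDER BY p.amount DESC"""
).fetchall()
print(rows)
conn.close()
```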
What Kinds of Data Can Be Mined?
customer (cust_ID, name, address, age, occupation, annual_income, credit_information, category, . . . )
item (item_ID, brand, category, type, price, place_made, supplier, cost, . . . )
employee (empl_ID, name, category, group, salary, commission, . . . )
branch (branch_ID, name, address, . . . )
purchases (trans_ID, cust_ID, empl_ID, date, time, method_paid, amount)
items_sold (trans_ID, item_ID, qty)
works_at (empl_ID, branch_ID)
What Kinds of Data Can Be Mined?
◻ Data Warehouses
🞑 A data warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually residing at a single
site.
🞑 Constructed via data cleaning, data integration, data transformation, data loading, and periodic
data refreshing
[Figure: data warehouse framework. Data sources in New York and Toronto are cleaned, integrated, transformed, loaded, and refreshed into the data warehouse, which clients reach through query and analysis tools.]
What Kinds of Data Can Be Mined?
◻ Transactional Data:
🞑 Each record in a transactional database captures a transaction, such as a customer’s purchase, a flight
booking, or a user’s clicks on a web page.
🞑 A transaction typically includes a unique transaction identity number (trans ID) and a
list of the items making up the transaction, such as the items purchased in the
transaction.
🞑 A transactional database may have additional tables, which contain other
information related to the transactions, such as item description, information about
the salesperson or the branch, and so on.
◻ Example: A transactional database for AllElectronics. Transactions can be stored in a
table, with one record per transaction.
◻ Nested relational structure: list_of_item_IDs holds a set of items.
◻ Query: Which items sold well together?

trans_ID    list_of_item_IDs
T100        I1, I3, I8, I16
T200        I2, I8
...         ...
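One simple way to approach the "which items sold well together?" query is to count how often each pair of items appears in the same transaction. The sketch below uses the two transactions listed above plus one hypothetical row (T300) so that a pair actually repeats; pair counting is an illustrative choice, not the only way to answer the query.

```python
from collections import Counter
from itertools import combinations

# T100 and T200 come from the table above; T300 is a hypothetical extra row.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
    "T300": ["I3", "I8"],
}

pair_counts = Counter()
for items in transactions.values():
    # Sort so each unordered pair has one canonical representation.
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

# The most frequently co-occurring pair and its count.
print(pair_counts.most_common(1))
```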
Chapter 1. Introduction
◻ Summary
1.4 What Kinds of Patterns Can Be Mined?
◻ Mining Frequent Patterns, Associations and Correlations
◻ Cluster Analysis
◻ Outlier Analysis
1.4.1 Concept/Class Description: Characterization and Discrimination
Mining Frequent Patterns, Associations and Correlations
■ AllElectronics: Purchases
■ Example: age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support
= 2%, confidence = 60%].
◻ The rule indicates that 2% of the customers under study are 20 to 29 years old with an income of $40,000 to $49,000 and have
purchased a laptop (computer) at AllElectronics.
◻ There is a 60% probability that a customer in this age and income group will purchase a
laptop.
◻ This is an association involving more than one attribute or predicate (i.e., age, income, and
buys): a multidimensional association rule.
◻ Typically, association rules are discarded as uninteresting if they do not satisfy both a
minimum support threshold and a minimum confidence threshold.
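To make the two measures concrete, the sketch below computes support and confidence for the rule above. The ten customer records are invented for illustration, so the resulting percentages differ from the 2% and 60% quoted on the slide.

```python
# Hypothetical customer records: (age, income, bought_laptop).
customers = [
    (25, 45_000, True),
    (28, 42_000, True),
    (22, 48_000, False),
    (35, 45_000, True),
    (24, 30_000, False),
    (51, 90_000, False),
    (27, 41_000, True),
    (63, 20_000, False),
    (29, 49_000, False),
    (45, 60_000, True),
]

def antecedent(age, income):
    """age(X, "20..29") AND income(X, "40K..49K")."""
    return 20 <= age <= 29 and 40_000 <= income <= 49_000

n_total = len(customers)
n_antecedent = sum(1 for a, i, _ in customers if antecedent(a, i))
n_both = sum(1 for a, i, b in customers if antecedent(a, i) and b)

support = n_both / n_total          # share of all customers matching the whole rule
confidence = n_both / n_antecedent  # share of the age/income group that buys a laptop
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

A rule would then be kept only if both values reach the chosen minimum thresholds.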
1.4.3 Classification and Regression for Predictive Analysis
◻ Classification: The process of finding a model that describes and distinguishes the data
classes or concepts.
🞑 The derived model is based on the analysis of a set of training data (data objects whose
class label is known).
🞑 The model can be represented as classification (IF-THEN) rules, decision trees, neural
networks, etc.
IF-THEN rules:
age(X, “youth”) AND income(X, “high”) ⇒ class(X, “A”)
age(X, “youth”) AND income(X, “low”) ⇒ class(X, “B”)
age(X, “middle_aged”) ⇒ class(X, “C”)
age(X, “senior”) ⇒ class(X, “C”)
Decision tree: the root tests age?; youth leads to an income? test (high: class A, low: class B); middle_aged and senior lead to class C.
Neural network: inputs age (f1) and income (f2), hidden units f3 to f5, output units f6 (class A), f7 (class B), f8 (class C).
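The IF-THEN rules and the decision tree describe the same model, which can be written directly as nested conditions. The function name and the error handling for uncovered values are assumptions added for the sketch.

```python
def classify(age: str, income: str) -> str:
    """Apply the decision tree: test age first, then income for youth."""
    if age == "youth":
        if income == "high":
            return "A"
        if income == "low":
            return "B"
        raise ValueError(f"no rule covers income={income!r} for youth")
    if age in ("middle_aged", "senior"):
        # Income is not tested on this branch.
        return "C"
    raise ValueError(f"no rule covers age={age!r}")

for sample in [("youth", "high"), ("youth", "low"), ("senior", "high")]:
    print(sample, "->", classify(*sample))
```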
Classification and Regression for Predictive Analysis
◻ Regression: predict missing or unavailable numerical data values rather than (discrete)
class labels.
Classification and Regression for Predictive Analysis
◻ Classification Example: AllElectronics - classify a large set of items in the store, based on
three kinds of responses to a sales campaign: good response, mild response and no
response.
◻ Derive a model for each of these three classes based on the descriptive features of the
items, such as price, brand, place made, type, and category.
🞑 The resulting classification should maximally distinguish each class from
the others, presenting an organized picture of the data set.
◻ Regression Example: AllElectronics - Predict the amount of revenue that each item will
generate during an upcoming sale, based on the previous sales data.
🞑 This is an example of regression analysis because the regression model constructed
will predict a continuous function (or ordered value).
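As a small illustration of the regression example, the sketch below fits a straight line by ordinary least squares and uses it to predict a continuous value. The predictor (discount percentage), the history data, and the choice of a simple linear model are all assumptions; the slides do not specify a model.

```python
# Hypothetical history: (discount percent, revenue in $1000s) for past sales.
history = [(10, 3.0), (20, 5.0), (30, 7.0), (40, 9.0)]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(history)

# Predict revenue for an upcoming sale with a 50% discount.
predicted = slope * 50 + intercept
print(f"predicted revenue: {predicted:.1f}")
```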
1.4.4 Cluster Analysis
◻ Unlike classification and regression, which analyze class-labeled (training) data sets,
clustering analyzes data objects without consulting class labels.
◻ In many cases, class-labeled data may simply not exist at the beginning.
◻ Clustering can be used to generate class labels for a group of data.
◻ The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.
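One common way to group unlabeled objects so that members of a group are close to each other is k-means. The sketch below runs it on one-dimensional data; k-means, the customer ages, and the initial centroids are choices made for this illustration, not something the slides prescribe.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm in one dimension, starting from the given centroids."""
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

# Hypothetical customer ages with no class labels attached.
ages = [21, 23, 25, 58, 60, 65]
clusters, centroids = kmeans_1d(ages, centroids=[min(ages), max(ages)])
print(clusters, centroids)
```

The cluster index of each object can then serve as a generated class label.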
1.4.5 Outlier Analysis
◻ A data set may contain objects that do not comply with the general behavior or model of
the data. These data objects are outliers.
◻ Many data mining methods discard outliers as noise or exceptions. However, in some
applications (e.g., fraud detection) the rare events can be more interesting than the more
regularly occurring ones. The analysis of outlier data is referred to as outlier analysis
or anomaly mining.
◻ Example: detecting fraudulent credit card usage
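A minimal sketch of the credit card example: flag charges that lie far from the account's typical charge. It uses the median and the median absolute deviation (MAD), which stay stable when one value is extreme. The charge amounts and the threshold of 5 are assumptions for illustration only.

```python
from statistics import median

def find_outliers(amounts, threshold=5.0):
    """Return amounts whose distance from the median exceeds threshold * MAD."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        # All typical values are identical; anything different stands out.
        return [a for a in amounts if a != center]
    return [a for a in amounts if abs(a - center) / mad > threshold]

# Hypothetical charges on one account; one is unusually large.
charges = [20, 35, 25, 30, 40, 22, 28, 2500]
print(find_outliers(charges))
```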
1.4.6 Are All Patterns Interesting?
◻ A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
◻ Only some of these are interesting. Among other criteria, an interesting pattern is:
🞑 potentially useful
🞑 novel
Are All Patterns Interesting?
◻ Objective measures
🞑 statistics and structures of patterns, e.g., support, confidence, etc. (Rules that do
not satisfy a threshold are considered uninteresting.)
🞑 accuracy - percentage of data that are correctly classified by a rule.
🞑 coverage - similar to support, in that it tells us the percentage of data to which a
rule applies
🞑 Although objective measures help identify interesting patterns, they are often
insufficient unless combined with subjective measures that reflect a particular user’s
needs and interests.
🞑 For example, patterns describing the characteristics of customers who shop frequently
at AllElectronics should be interesting to the marketing manager, but may be of little
interest to other analysts studying the same database for patterns on employee
performance.
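The accuracy and coverage measures listed under objective measures can be computed for a single classification rule as sketched below. The labeled records are hypothetical, and the sketch assumes that accuracy is measured over the records the rule applies to.

```python
# Hypothetical labeled records: (age, income, true_class).
records = [
    ("youth", "high", "A"),
    ("youth", "high", "A"),
    ("youth", "high", "B"),
    ("youth", "low", "B"),
    ("senior", "high", "C"),
    ("middle_aged", "low", "C"),
    ("youth", "high", "A"),
    ("senior", "low", "C"),
]

def rule_applies(age, income):
    """Antecedent of the rule: age = youth AND income = high => class = A."""
    return age == "youth" and income == "high"

covered = [r for r in records if rule_applies(r[0], r[1])]
correct = [r for r in covered if r[2] == "A"]

coverage = len(covered) / len(records)  # share of data the rule applies to
accuracy = len(correct) / len(covered)  # share of covered data classified correctly
print(f"coverage={coverage:.0%}, accuracy={accuracy:.0%}")
```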
Are All Patterns Interesting?
◻ Subjective measures
🞑 Reflect the needs and interests of a particular user.
■ E.g., a marketing manager is only interested in characteristics of customers who
shop frequently.
🞑 Based on the user’s beliefs in the data.
■ E.g., patterns are interesting if they are unexpected, or can be used for
strategic planning, etc.
◻ Objective and subjective measures need to be combined.
◻ Find all the interesting patterns: Completeness
🞑 Unrealistic and inefficient
🞑 User-provided constraints and interestingness measures should be used
◻ Search for only interesting patterns: An optimization problem
🞑 Highly desirable
🞑 No need to search through the generated patterns to identify truly interesting ones.
🞑 Measures can be used to rank the discovered patterns according to their
interestingness
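User-provided constraints and ranking can be combined in a few lines: filter the discovered rules by minimum thresholds, then sort the survivors by a chosen measure. The rules, their support and confidence values, the thresholds, and the decision to rank by confidence are all hypothetical.

```python
# Hypothetical mined rules with precomputed objective measures.
rules = [
    {"rule": "age 20..29 & income 40K..49K => laptop", "support": 0.02, "confidence": 0.60},
    {"rule": "buys printer => buys paper", "support": 0.10, "confidence": 0.45},
    {"rule": "buys laptop => buys mouse", "support": 0.05, "confidence": 0.80},
    {"rule": "buys camera => buys tripod", "support": 0.005, "confidence": 0.90},
]

def interesting(rules, min_support, min_confidence):
    """Keep rules meeting both thresholds, ranked by confidence (highest first)."""
    kept = [r for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)

ranked = interesting(rules, min_support=0.01, min_confidence=0.5)
for r in ranked:
    print(r["rule"], r["support"], r["confidence"])
```

This filters after mining; pushing such constraints into the mining step itself is what the "search for only interesting patterns" bullet describes.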
Chapter 1. Introduction
◻ Summary
1.5 Which Technologies Are Used?
🞑 pattern recognition
🞑 visualization
🞑 algorithms
🞑 high-performance computing, and many application domains
[Figure: technologies contributing to data mining. Recoverable labels: Algorithm, Database Technology, High-Performance Computing.]
◻ The interdisciplinary nature of data mining research and development contributes
significantly to the success of data mining and its extensive applications.
1.5.1 Statistics
1.5.2 Machine Learning
◻ classic problems in machine learning that are highly related to data mining.
1.5.3 Database Systems and Data Warehouses
◻ Database systems
🞑 focus on the creation, maintenance, and use of databases for organizations and
end-users.
◻ Data warehouse
1.5.4 Information Retrieval
Chapter 1. Introduction
◻ Summary
1.6 Which Kinds of Applications Are Targeted?
1.6.1 Business Intelligence
1.6.2 Web Search Engines
1.7 Major Issues in Data Mining
1.7.1 Mining Methodology
1.7.2 User Interaction
◻ Interesting areas of research include how to interact with a data mining system, how to
incorporate a user’s background knowledge in mining, and how to visualize and
comprehend data mining results.
◻ Interactive mining:
🞑 The data mining process should be highly interactive. Thus, it is important to build
flexible user interfaces and an exploratory mining environment, facilitating the user’s
interaction with the system.
◻ Incorporation of background knowledge:
🞑 Background knowledge, constraints, rules, and other information regarding the
domain under study should be incorporated into the knowledge discovery process
◻ Ad hoc data mining and data mining query languages:
🞑 High-level data mining query languages or other high-level flexible user interfaces will
give users the freedom to define ad hoc data mining tasks.
◻ Presentation and visualization of data mining results:
🞑 Adopt expressive knowledge representations, user-friendly interfaces, and
visualization techniques.
1.7.3 Efficiency and Scalability
◻ Efficiency and scalability are always considered when comparing data mining
algorithms.
◻ As data amounts continue to multiply, these two factors are especially critical.
◻ Efficiency and scalability:
🞑 The running time of a data mining algorithm must be predictable, short, and
acceptable by applications.
🞑 Efficiency, scalability, performance, optimization, and the ability to execute in real
time are key criteria that drive the development of many new data mining
algorithms.
◻ Parallel, distributed, and incremental mining algorithms:
🞑 First partition the data into “pieces.” Each piece is processed, in parallel, by
searching for patterns.
🞑 The parallel processes may interact with one another. The patterns from each
partition are eventually merged.
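The partition, process, and merge idea can be sketched with Python's standard concurrent.futures module. Here the "pattern" is simply how many transactions contain each item; the transactions, the number of partitions, and the helper names are assumptions. A thread pool keeps the example short, although CPU-bound mining in CPython would more likely use separate processes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transactions, each a list of item IDs.
transactions = [["I1", "I3", "I8"], ["I2", "I8"], ["I3", "I8"],
                ["I1", "I2"], ["I8"], ["I3"]]

def partition(data, n_parts):
    """Split the data into n_parts pieces (round-robin)."""
    return [data[i::n_parts] for i in range(n_parts)]

def count_items(piece):
    """Process one piece: count the transactions containing each item."""
    counts = Counter()
    for t in piece:
        counts.update(set(t))
    return counts

def parallel_item_counts(data, n_parts=3):
    pieces = partition(data, n_parts)
    # Each piece is handled by its own worker.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, pieces))
    # Merge the per-partition results into one overall count.
    merged = Counter()
    for p in partials:
        merged.update(p)
    return merged

total = parallel_item_counts(transactions)
print(total.most_common())
```

Item counts merge by simple addition; patterns that depend on global thresholds generally need a more careful merge step.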
1.7.4 Diversity of Database Types
1.7.5 Data Mining and Society
Summary
Dr. R. Elakkiya, AP-SoC, SASTRA Deemed University 10/03/21