MEKELLE UNIVERSITY-MEKELLE INSTITUTE OF
TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY
DATA MINING AND KNOWLEDGE DISCOVERY
Halefom Tekle
Friday, February 5, 2021
Outlines
Chapter 1: Definition
Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
What is not Data Mining?
Look up a phone number in a phone directory
Query a Web search engine for information about “Amazon”
What is Data Mining?
Certain names are more prevalent in certain US locations (O’Brien,
O’Rourke, O’Reilly… in the Boston area)
Group together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest, Amazon.com)
Data mining is a technique for discovering interesting
patterns from data
Data mining is also known as knowledge discovery from data (KDD).
It is a multi-disciplinary field involving
Machine learning
Statistics
Databases
Artificial intelligence
Information retrieval, and
Visualization
1.1 Why Data Mining? Commercial view
We live in a world where vast amounts of data are
collected daily.
Lots of data is being collected and warehoused
Web data, e-commerce
purchases at department/grocery stores
Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
1.3 Motivation
There is often information “hidden” in the data that is
not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
1.4 Data Mining as the Evolution of Information
Technology
Data mining can be viewed as a result of the natural evolution of
information technology.
The stages of this evolution are
Data collection and database creation
Database management system
Advanced database system
Advanced data analysis
The early development of data collection and database creation
mechanisms served as a prerequisite for the later development of
effective mechanisms for data storage and retrieval, as well as query
and transaction processing.
Nowadays numerous database systems offer query and transaction
processing as common practice.
Advanced data analysis has naturally become the next step.
This means, the world is data rich but information poor.
So, we need tools to extract the valuable knowledge
embedded in the vast amounts of data to support decision
makers’ intuition.
Data mining
Is the process of discovering interesting patterns and
knowledge from large amounts of data.
Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or
KDD, while others view data mining as merely an
essential step in the process of knowledge discovery.
The data sources can include databases, data warehouses,
the Web, other information repositories, or data that are
streamed into the system dynamically.
The knowledge discovery process is an iterative sequence of steps:
Pre-processing:
The raw data is usually not suitable for mining, for example
because it contains noise or inconsistent values.
Data mining:
The processed data is then fed to a data mining
algorithm which will produce patterns or knowledge.
Post-processing:
In many applications, not all discovered patterns are
useful. This step identifies those useful ones for
applications. Various evaluation and visualization
techniques are used to make the decision.
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are
retrieved from the database
4. Data transformation: where data are transformed and consolidated
into forms appropriate for mining by performing summary or
aggregation operations
5. Data mining: an essential process where intelligent methods are
applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns
representing knowledge based on interestingness measures
7. Knowledge presentation: where visualization and knowledge
representation techniques are used to present mined knowledge to
users
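The seven steps above can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not a prescribed implementation: the two source lists, the field names (cust, item, amount), and the threshold min_total are hypothetical, and the "mining" step is deliberately trivial.

```python
from collections import defaultdict

# Hypothetical raw records from two sources (made up for illustration).
store_a = [
    {"cust": "c1", "item": "laptop", "amount": 900.0},
    {"cust": "c2", "item": "mouse", "amount": None},    # noise: missing amount
    {"cust": "c1", "item": "mouse", "amount": 20.0},
]
store_b = [
    {"cust": "c3", "item": "laptop", "amount": 950.0},
    {"cust": "c3", "item": "printer", "amount": -5.0},  # inconsistent: negative
]

def clean(records):
    """Step 1, data cleaning: drop records with missing or negative amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3, data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4, data transformation: aggregate the amounts per item."""
    totals = defaultdict(float)
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Step 5, data mining: a trivial 'pattern', items ranked by total sales."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_total):
    """Step 6, pattern evaluation: keep patterns that meet the threshold."""
    return [(item, total) for item, total in patterns if total >= min_total]

def present(patterns):
    """Step 7, knowledge presentation: format the patterns for the user."""
    return [f"{item}: {total:.2f}" for item, total in patterns]

def kdd_pipeline(min_total=100.0):
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, ["item", "amount"])
    return present(evaluate(mine(transform(data)), min_total))
```

Calling kdd_pipeline() runs all seven steps in order; lowering min_total lets more patterns through the evaluation step.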
1.5 What Kinds of Data Can Be Mined?
Data mining can be applied to any kind of data as long as the data
are meaningful for a target application.
The most basic forms of data for mining applications are
Database data
Data warehouse data
Transactional data
Can also be applied to other forms of data
data streams
ordered/sequence data
graph or networked data
text data
multimedia data (audio, video, image)
the World Wide Web (WWW)
1.5.1 Database data
Consider a relational database for AllElectronics.
Customer: (cust_ID, name, address, age, occupation,
annual income, credit information, category, . . .)
Item: (item_ID, brand, category, type, price, place made,
supplier, cost, . . . )
Employee: (empl_ID, name, category, group, salary,
commission, . . . )
Branch: (branch_ID, name, address, . . . )
Purchases: (trans_ID, cust_ID, empl_ID, date, time, method
paid, amount)
Items_sold: (trans_ID, item_ID, qty)
Works_at: (empl_ID, branch_ID)
Relational data can be accessed by database queries written in a
relational query language (e.g., SQL, as supported by systems such as
PostgreSQL) or with the assistance of graphical user interfaces.
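A relational query over such data might look like the sketch below. It uses Python's built-in sqlite3 module and a cut-down version of the AllElectronics schema (only Purchases and Items_sold, with fewer columns); the rows and the helper names build_demo_db, total_qty_per_item and items_bought_by are made up for illustration.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with two simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Purchases (trans_ID TEXT PRIMARY KEY, cust_ID TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE Items_sold (trans_ID TEXT, item_ID TEXT, qty INTEGER)"
    )
    # Hypothetical rows, not taken from any real data set.
    conn.executemany(
        "INSERT INTO Purchases VALUES (?, ?, ?)",
        [("T100", "C1", 1200.0), ("T200", "C2", 40.0)],
    )
    conn.executemany(
        "INSERT INTO Items_sold VALUES (?, ?, ?)",
        [("T100", "I3", 1), ("T100", "I8", 2), ("T200", "I8", 1)],
    )
    conn.commit()
    return conn

def total_qty_per_item(conn):
    """Aggregate query: total quantity sold per item, largest first."""
    cur = conn.execute(
        """
        SELECT item_ID, SUM(qty) AS total_qty
        FROM Items_sold
        GROUP BY item_ID
        ORDER BY total_qty DESC, item_ID
        """
    )
    return cur.fetchall()

def items_bought_by(conn, cust_id):
    """Join query: distinct items that a given customer has purchased."""
    cur = conn.execute(
        """
        SELECT DISTINCT s.item_ID
        FROM Purchases AS p
        JOIN Items_sold AS s ON p.trans_ID = s.trans_ID
        WHERE p.cust_ID = ?
        ORDER BY s.item_ID
        """,
        (cust_id,),
    )
    return [row[0] for row in cur.fetchall()]
```

Queries like these answer questions that are stated in advance; data mining goes further by searching for patterns that were not specified beforehand.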
Mining tasks on relational data fall into two groups
Prediction methods
Use some variables to predict unknown or future values of
other variables.
Example: predict the credit risk of new customers
Example: detect deviations, that is, items with sales that are far
from those expected in comparison with the previous year
Description methods
Find human-interpretable patterns that describe the data.
Predictive tasks
Classification
Regression
Deviation Detection
Descriptive tasks
Clustering
Association Rule Discovery
Sequential Pattern Discovery
1.5.2 Data warehouse
Is a repository of multiple heterogeneous data sources
organized under a unified schema at a single site to
facilitate management decision making.
Data warehouse technology includes data cleaning, data
integration, and online analytical processing (OLAP)
OLAP is a set of analysis techniques with functionalities such
as summarization, consolidation, and aggregation, as well
as the ability to view information from different angles.
Although OLAP tools support multidimensional analysis and
decision making, additional data analysis tools are required
for in-depth analysis—for example, data mining tools that
provide data classification, clustering, outlier/anomaly
detection, and the characterization of changes in data over
time.
A data warehouse is usually modeled by a multidimensional
data structure, called a data cube, in which each dimension
corresponds to an attribute or a set of attributes in the schema,
and each cell stores the value of some aggregate measure such
as count or sum (sales_amount).
A data cube provides a multidimensional view of data and
allows the precomputation and fast access of summarized data.
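As a rough sketch of this idea, the code below aggregates sum(sales_amount) over any chosen subset of dimensions, which is what one cuboid of a data cube stores. The dimension names (time, item, location) and the fact rows are assumptions made for illustration; a real warehouse would precompute and index these aggregates rather than recompute them on demand.

```python
from collections import defaultdict

# Assumed dimensions and hypothetical fact rows:
# (time, item, location, sales_amount)
DIMENSIONS = ("time", "item", "location")
sales = [
    ("Q1", "computer", "Vancouver", 400.0),
    ("Q1", "phone", "Vancouver", 100.0),
    ("Q1", "computer", "Toronto", 300.0),
    ("Q2", "computer", "Vancouver", 500.0),
]

def build_cuboid(rows, dims):
    """Aggregate sum(sales_amount) for each combination of values of `dims`.

    Each key of the result is one cell of the cuboid; the value is the
    aggregate measure stored in that cell.
    """
    idx = [DIMENSIONS.index(d) for d in dims]
    cells = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        cells[key] += row[-1]  # the last column is the measure
    return dict(cells)
```

For example, build_cuboid(sales, ("time",)) summarizes sales per quarter, while build_cuboid(sales, ()) collapses every dimension into a single grand total.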
Suppose that AllElectronics has a data warehouse.
1.5.3 Transactional Data
In a transactional database, each record captures a transaction, such
as a customer’s purchase, a flight booking, or a user’s clicks on a
web page.
A transaction typically includes
a unique transaction identity number (trans_ID) and
a list of the items making up the transaction, such as the items
purchased in the transaction.
A transactional database may have additional tables, which
contain other information related to the transactions
such as item description,
information about the salesperson or the branch, and so on.
1.6 What Kinds of Patterns Can Be Mined?
There are a number of data mining functionalities. These include
Characterization and discrimination
Mining of frequent patterns, associations, and correlations
Classification and regression
Clustering analysis
Outlier analysis
Data mining functionalities are used to specify the kinds of patterns to
be found in data mining tasks.
Such tasks can be classified into two categories:
Descriptive and
Predictive.
Descriptive mining tasks characterize properties of the data in a target
data set.
Predictive mining tasks perform induction on the current data in order
to make predictions.
1.6.1 Class/Concept Description: Characterization and Discrimination
Data entries can be associated with classes or concepts.
For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.
It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms.
Such descriptions of a class or a concept are called class/concept
descriptions.
These descriptions can be derived using
Data characterization, by summarizing the data of the class under study
(often called the target class) in general terms
Data discrimination, by comparison of the target class with one or a set of
comparative classes (often called the contrasting classes) or
both data characterization and discrimination.
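A very small sketch of the two kinds of description follows. It assumes that characterization can be approximated by per-attribute averages of the target class and discrimination by the difference between those averages for the target and contrasting classes; real systems use richer summaries. The customer records and attribute names are hypothetical.

```python
from statistics import mean

# Hypothetical customers, each already tagged with a concept.
customers = [
    {"concept": "bigSpenders", "age": 45, "annual_spend": 6000},
    {"concept": "bigSpenders", "age": 41, "annual_spend": 5000},
    {"concept": "budgetSpenders", "age": 23, "annual_spend": 400},
    {"concept": "budgetSpenders", "age": 27, "annual_spend": 600},
]

ATTRS = ("age", "annual_spend")

def characterize(rows, concept, attrs=ATTRS):
    """Characterization: summarize the target class by attribute averages."""
    members = [r for r in rows if r["concept"] == concept]
    return {a: mean(r[a] for r in members) for a in attrs}

def discriminate(rows, target, contrast, attrs=ATTRS):
    """Discrimination: compare the target class with a contrasting class."""
    t = characterize(rows, target, attrs)
    c = characterize(rows, contrast, attrs)
    return {a: t[a] - c[a] for a in attrs}
```

Here characterize describes one class on its own, while discriminate reports how far the target class lies from the contrasting class on each attribute.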
1.6.2 Mining Frequent Patterns, Associations, and
Correlations
Frequent patterns, as the name suggests, are patterns that
occur frequently in data.
There are many kinds of frequent patterns
Frequent itemsets
A set of items that often appear together in a transactional data set,
e.g., milk and bread
Frequent subsequences (also known as sequential patterns)
e.g., customers tend to purchase first a laptop, followed by a digital
camera, and then a memory card
Frequent substructures
Can refer to different structural forms (e.g., graphs, trees, or lattices)
that may be combined with itemsets or subsequences.
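Frequent itemsets can be found, for very small data sets, by simply counting every candidate itemset. The sketch below is a brute-force count, not an efficient algorithm such as Apriori; the function name, the max_size limit and the sample transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["bread"],
    ["milk", "eggs"],
]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to `max_size` items
    whose support (fraction of transactions containing them) is at least
    `min_support`. Itemsets are tuples of items in sorted order."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # ignore duplicate items within a transaction
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {iset: c / n for iset, c in counts.items() if c / n >= min_support}
```

With min_support=0.5, milk and bread come out as a frequent pair because they appear together in two of the four baskets, while bread and eggs do not.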
Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
Association analysis.
Suppose that, as a marketing manager at AllElectronics, you want to
know which items are frequently purchased together (i.e., within the
same transaction).
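buys(X, “computer”) => buys(X, “software”) [support = 1%,
confidence = 50%]
is a single-dimensional association rule (one predicate: buys).
Support of 1% means that 1% of all transactions contain both items;
confidence of 50% means that half of the customers who buy a computer
also buy software.
age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”)
[support = 2%, confidence = 60%]
is a multidimensional association rule (age, income, buys).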
Typically, association rules are discarded as uninteresting if they
do not satisfy both a minimum support threshold and a minimum
confidence threshold.
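Support and confidence can be computed directly from their definitions: support is the fraction of transactions that contain every item in the rule, and confidence is that support divided by the support of the left-hand side. The sketch below assumes transactions are given as sets of items; the sample data and the function names are hypothetical.

```python
# Hypothetical transactions, each a set of purchased items.
tx = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies, so treat confidence as 0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base

def is_interesting(transactions, antecedent, consequent,
                   min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)
```

For the rule computer => software on this toy data, two of the four transactions contain both items and three contain a computer, so the rule passes or fails depending on where the two thresholds are set.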
1.6.3 Classification and Regression for Predictive Analysis
Classification (e.g., naïve Bayesian, SVM, and KNN)
Is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
The model is derived based on the analysis of a set of training
data (i.e., data objects for which the class labels are known).
The model is used to predict the class label of objects for which
the class label is unknown.
It predicts categorical (discrete, unordered) labels.
Regression analysis
Is a statistical methodology that is most often used for
numeric prediction.
It predicts continuous-valued functions, that is, missing or
unavailable numeric values rather than class labels.
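The contrast between the two tasks can be sketched with two tiny models: a k-nearest-neighbour (KNN) classifier that returns a class label, and a least-squares straight line that returns a number. Both are minimal illustrations; the training data, feature meaning (age, spend) and function names are assumptions, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

# Hypothetical labeled training data: ((age, spend), class label).
train = [
    ((25, 20), "budgetSpender"),
    ((30, 25), "budgetSpender"),
    ((45, 90), "bigSpender"),
    ((50, 100), "bigSpender"),
    ((40, 80), "bigSpender"),
]

def knn_predict(training, query, k=3):
    """Classification: majority label among the k nearest training objects."""
    nearest = sorted(training, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def predict_value(slope, intercept, x):
    """Numeric prediction for a new x using the fitted line."""
    return slope * x + intercept
```

knn_predict answers "which class?" for an unlabeled object, while fit_line followed by predict_value answers "how much?" for an unseen input.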
1.6.4 Cluster Analysis
Unlike classification and regression, which analyze class-labeled
(training) data sets, clustering analyzes data objects without
consulting class labels.
In many cases, class-labeled data may simply not exist at the
beginning.
Clustering can be used to generate class labels for a group of
data.
The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the
interclass similarity.
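One common way to group objects on this principle is k-means, which is used here only as an example; the slides do not name a specific clustering algorithm. The sketch assumes numeric points, Euclidean distance, and a naive initialization that takes the first k points as starting centroids. The sample points are made up, and math.dist requires Python 3.8 or later.

```python
import math

# Hypothetical unlabeled 2-D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]

def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(data, k, iterations=10):
    """Return (centroids, labels); labels[i] is the cluster index of data[i]."""
    centroids = [data[i] for i in range(k)]  # naive init: the first k points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in data:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    labels = [nearest_centroid(p, centroids) for p in data]
    return centroids, labels
```

No class labels are supplied; the cluster indices that come out can then serve as generated class labels for the groups.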
1.6.5 Outlier Analysis
A data set may contain objects that do not comply with the
general behavior or model of the data.
These data objects are outliers.
Many data mining methods discard outliers as noise or
exceptions.
However, in some applications (e.g., fraud detection) the rare
events can be more interesting than the more regularly
occurring ones
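As one possible illustration, outliers in a single numeric attribute can be flagged with the modified z-score, which is based on the median and the median absolute deviation (MAD). This technique is chosen here for the sketch and is not prescribed by the slides; the constant 0.6745 and the cutoff 3.5 are the commonly quoted defaults, and the transaction amounts are hypothetical.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    modified z = 0.6745 * (x - median) / MAD, where MAD is the median of
    the absolute deviations from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # at least half the values equal the median; no usable scale
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical card transaction amounts; one is far from the usual behavior.
amounts = [20, 22, 19, 21, 20, 23, 500]
suspicious = find_outliers(amounts)
```

In a fraud detection setting the flagged values would be the interesting result to investigate rather than noise to discard.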
1.7 Which Technologies Are Used?
A statistical model
Is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their
associated probability distributions.
Machine Learning
Machine learning investigates how computers can learn (or improve
their performance) based on data.
A main research area is for computer programs to automatically learn
to recognize complex patterns and make intelligent decisions based on
data.
learning methods
Supervised
Unsupervised
Semi-supervised
Reinforcement
Which Kinds of Applications Are Targeted?
Business Intelligence
Helps a business understand the commercial context of its organization:
customers, the market, supply and resources, and
competitors
Provides historical, current, and predictive views of business
operations
Web Search Engines
Have to deal with
a huge and ever-growing amount of data
online data
queries that are asked only a very small number of times
Bioinformatics and health informatics
Finance, digital libraries, and digital governments.
1.8 Major Issues in Data Mining
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Data mining—an interdisciplinary effort
Boosting the power of discovery in a networked environment
User Interaction
Interactive mining
Incorporation of background knowledge
Ad hoc data mining and data mining query languages
Presentation and visualization of data mining results
Efficiency and Scalability
Efficiency, scalability, performance, optimization, ability to execute in real time
Parallel, distributed, and incremental mining algorithms
Diversity of Database Types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data Mining and Society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
Exercises
How is a data warehouse different from a database? How are
they similar?
What are the major challenges of mining a huge amount of
data (e.g., billions of tuples) in comparison with mining a
small amount of data (e.g., a data set of a few hundred tuples)?
Define each of the following data mining functionalities:
characterization, discrimination, association and correlation
analysis, classification, regression, clustering, and outlier
analysis. Give examples of each data mining functionality,
using a real-life database that you are familiar with.